ZCOMP: Reducing DNN Cross-Layer Memory Footprint Using Vector Extensions

Authors: Berkin Akin, Zeshan A. Chishti, and Alaa R. Alameldeen
Venue: MICRO, 2019

With accelerators dominating the deep learning space in architecture conferences, this paper stands out because it focuses on reducing DNN inference/training overhead on a CPU. An obvious question is: given the widespread use of GPUs/TPUs for deep learning, why should we focus on optimizing CPUs for it? In a paper published at HPCA 2019, Facebook reports that CPUs are preferred for applications that require tight integration between DNN and non-DNN tasks. Also, Intel's recent AVX-512 extensions have specialized support for DNNs in the form of new instructions called Vector Neural Network Instructions (VNNI).

Broadly, optimizing DNNs can be viewed from two different perspectives: computation and communication. This paper targets reducing communication overhead, more specifically activation (feature-map) communication overhead, by compressing activations. Compressing activations/weights is a well-studied topic, but the authors argue that existing work on DNN compression targets accelerators and specialized datapaths, so applying it directly to CPUs is not possible. This paper proposes two new instructions in the AVX-512 family, called ZCOMPS and ZCOMPL. The ZCOMPS instruction compresses the data in a 512-bit register and stores it to memory, whereas ZCOMPL loads compressed data from memory and decompresses it into a 512-bit register. The target data structure is activations, since weight compression can be done offline. The authors observe activation sparsity of 49%-62% across the evaluated networks; the main sources of this sparsity are the ReLU activation function and dropout layers.

The compression technique is relatively simple. The ZCOMPS instruction has three inputs: reg2, reg1, and #CCF. The data to be compressed is in reg1, and the compressed data and metadata are stored at the address pointed to by reg2. #CCF is a flag that defines the compression condition: it can be configured to compare values against zero, or against values less than or equal to zero (ReLU). The steps involved in compression are as follows. The target data is read from reg1, and a bit-mask is generated by performing the comparison configured by #CCF. The 1s in the bit-mask mark the values that get stored, and the 0s mark the values that need not be stored. Next, the bit-mask (header) and the compressed values are concatenated and stored at the address pointed to by reg2. Finally, the address in reg2 is incremented by the size of the compressed data (and metadata) just stored, so that the next instruction can use reg2 as the pointer to its destination address. ZCOMPL works very similarly. It has two inputs, reg1 and reg2, and doesn't need the #CCF flag. ZCOMPL loads the compressed data from the address pointed to by reg2, decompresses it based on the header, and places the decompressed data in reg1.
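
To make the format concrete, here is a minimal sketch (my own illustration, not the paper's implementation) of the same compress/decompress semantics built from existing AVX-512 intrinsics, roughly the flavor of the AVX-comp baseline the paper compares against. The function names and the exact header layout (a 16-bit bit-mask followed by the packed non-zero floats) are assumptions for illustration.

```c
// Compile with -mavx512f (GCC/Clang). Illustrative only: ZCOMPS/ZCOMPL are
// proposed instructions; this sketch emulates their semantics in software.
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Compress one 512-bit (16 x float) block from src: write a 2-byte bit-mask
// header plus the non-zero values at *dst, then advance *dst past them
// (mirroring how ZCOMPS bumps reg2 by the size of the stored data).
static void zcomp_store(const float *src, uint8_t **dst) {
    __m512 v = _mm512_loadu_ps(src);
    // Bit-mask header: 1 = value is kept (non-zero), 0 = value is dropped.
    // For the ReLU-style #CCF condition, _CMP_GT_OQ against zero would
    // additionally drop negative values.
    __mmask16 keep = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_NEQ_OQ);
    memcpy(*dst, &keep, sizeof(uint16_t));             // header
    _mm512_mask_compressstoreu_ps(*dst + 2, keep, v);  // packed non-zeros
    *dst += 2 + 4 * (unsigned)__builtin_popcount(keep);
}

// Decompress one block from *src into 16 floats at dst, re-inserting zeros
// where the header has 0 bits, and advance *src past the consumed bytes.
static void zcomp_load(const uint8_t **src, float *dst) {
    uint16_t m;
    memcpy(&m, *src, sizeof(uint16_t));
    __mmask16 keep = m;
    __m512 v = _mm512_maskz_expandloadu_ps(keep, *src + 2);
    _mm512_storeu_ps(dst, v);
    *src += 2 + 4 * (unsigned)__builtin_popcount(keep);
}
```

As a back-of-the-envelope figure on this format (my arithmetic, not a number from the paper): at 50% sparsity, a 64-byte register's worth of activations shrinks to a 2-byte bit-mask plus 8 x 4 bytes of values, i.e., 34 bytes, roughly halving the cross-layer memory traffic.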

They compare ZCOMP against AVX-512 without compression and against AVX-comp (AVX-512 with generic compression instructions, not DNN-specific like ZCOMP). On a variety of networks, AVX-comp achieves an average 4% speed-up for training but a 2% slowdown for inference. ZCOMP, on the other hand, achieves an average 11% speed-up for training and a 3% speed-up for inference.
