Prediction based Execution on Deep Neural Networks
Authors: Mingcong Song, Jiechen Zhao, Yang Hu, Jiaqi Zhang, and Tao Li
Venue: ISCA 2018
The authors of this paper present a technique to further reduce computation within deep neural networks, along with a scale-out design that achieves a 2.5X speedup over traditional accelerators and 1.9-2.0X over Cnvlutin/Stripes. The concept is based on removing the computations of ineffectual output neurons (iEONs). To identify them, they first compute with only the high-order bits of the operands to predict whether the final result will be negative (and therefore zeroed by ReLU). This is exceedingly elegant: the predictor is actually part of the computation itself, so no extra work is performed. Not only that, the technique requires no retraining and incurs no accuracy loss. The caveat is that the number of high-order bits used for prediction must be experimentally determined (still no retraining, though).
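To make the two-stage idea concrete, here is a minimal sketch of the prediction mechanism. The function name, the `pred_bits` parameter, and the choice to split the weights (rather than the inputs) into high and low bit segments are my assumptions for illustration; the paper's exact operand partitioning and hardware dataflow differ. The point it demonstrates is that the prediction stage is simply the first portion of the real multiply-accumulate, so a correct ("effectual") prediction wastes nothing:

```python
import numpy as np

def predict_and_execute(inputs, weights, pred_bits=4, total_bits=8):
    """Sketch of high-bit prediction (hypothetical names/parameters).

    Splits each fixed-point weight into its top `pred_bits` and the
    remaining low bits, computes the partial sum from the high bits
    first, and only finishes the low-bit work when the partial sum
    predicts a non-negative (effectual) output.
    """
    weights = weights.astype(np.int64)   # widen to avoid overflow in the dot products
    low_bits = total_bits - pred_bits
    # High part keeps the top bits at their original magnitude.
    w_high = (weights >> low_bits) << low_bits
    w_low = weights - w_high             # non-negative remainder

    # Stage 1: the prediction IS the first part of the real MAC.
    partial = int(np.dot(inputs, w_high))
    if partial < 0:
        # Predicted ineffectual: ReLU would zero it, so skip the rest.
        return 0
    # Stage 2: finish the computation; nothing from stage 1 is wasted.
    return max(partial + int(np.dot(inputs, w_low)), 0)
```

Since post-ReLU inputs are non-negative and the low-bit remainder of each weight is non-negative, the stage-1 partial sum is a lower bound on the full result; mispredictions are only possible in one direction, which is why the number of prediction bits can be tuned empirically without retraining.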
The challenge is that iEONs are typically dispersed randomly throughout the output feature maps (OFMs), creating bubbles in the pipeline. Bubbles save energy, but latency isn't improved. To circumvent this, the authors observe that the number of iEONs typically has low variance for Max-Pooling layers within a network, and they similarly leverage input sharing in ReLU layers. Tracking these iEONs (per OFM or input feature map (IFM), depending on layer type) allows near-full utilization of the accelerator architecture. The benefit also stacks on top of other techniques (e.g., Cnvlutin and Stripes) for a net positive gain. In summary, this work performs dynamic "pruning" based on partial computations, and significantly improves both performance and energy compared to prior art.
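To illustrate why tracking iEONs recovers latency rather than just energy, here is a minimal scheduling sketch. The function name, the `mac_lanes` parameter, and the boolean-mask representation of tracked iEONs are my assumptions, not the paper's hardware design; it only shows the compaction idea of packing surviving work densely across lanes instead of leaving bubbles:

```python
import numpy as np

def compact_dispatch(ieon_mask, mac_lanes=16):
    """Sketch of bubble-free dispatch (hypothetical scheduler).

    `ieon_mask` marks output neurons predicted ineffectual. Instead of
    letting a lane idle whenever its neuron is skipped (a pipeline
    bubble), the scheduler gathers only the effectual output indices
    and packs them densely across the MAC lanes.
    """
    effectual = np.flatnonzero(~ieon_mask)   # indices of surviving work
    # Every issued batch fills all lanes with useful work, so skipped
    # neurons now shorten the schedule instead of just gating power.
    for start in range(0, len(effectual), mac_lanes):
        yield effectual[start:start + mac_lanes]
```

With a skip rate of, say, 50%, this compaction halves the number of issued batches, which is the mechanism behind the reported speedup rather than an energy-only saving.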