Prediction based Execution on Deep Neural Networks
Authors: Mingcong Song, Jiechen Zhao, Yang Hu, Jiaqi Zhang, and Tao Li
Venue: ISCA 2018
The authors of this paper present a technique to further reduce computation within deep neural networks, along with a scale-out design that achieves a 2.5X speedup over traditional accelerators and 1.9-2.0X over Cnvlutin/Stripes. The concept is based on removing the computations of ineffectual output neurons (iEONs). To identify them, they first compute with only the high-order bits of the operands to predict whether the final result will be negative (and therefore zeroed by ReLU). This is exceedingly elegant: the predictor is actually part of the computation itself, so no extra work is performed. Not only that, the technique requires no retraining and incurs no accuracy loss. The caveat is that the number of high-order bits used for prediction must be experimentally determined (still no retraining, though).
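To make the two-stage idea concrete, here is a minimal sketch of the prediction mechanism. The function name, the `pred_bits` parameter, and the choice to split the weights (rather than the inputs) into high and low bit segments are my assumptions for illustration; the paper's exact operand partitioning and hardware dataflow differ. The point it demonstrates is that the prediction stage is simply the first portion of the real multiply-accumulate, so a correct ("effectual") prediction wastes nothing:

```python
import numpy as np

def predict_and_execute(inputs, weights, pred_bits=4, total_bits=8):
    """Sketch of high-bit prediction (hypothetical names/parameters).

    Splits each fixed-point weight into its top `pred_bits` and the
    remaining low bits, computes the partial sum from the high bits
    first, and only finishes the low-bit work when the partial sum
    predicts a non-negative (effectual) output.
    """
    weights = weights.astype(np.int64)   # widen to avoid overflow in the dot products
    low_bits = total_bits - pred_bits
    # High part keeps the top bits at their original magnitude.
    w_high = (weights >> low_bits) << low_bits
    w_low = weights - w_high             # non-negative remainder

    # Stage 1: the prediction IS the first part of the real MAC.
    partial = int(np.dot(inputs, w_high))
    if partial < 0:
        # Predicted ineffectual: ReLU would zero it, so skip the rest.
        return 0
    # Stage 2: finish the computation; nothing from stage 1 is wasted.
    return max(partial + int(np.dot(inputs, w_low)), 0)
```

Since post-ReLU inputs are non-negative and the low-bit remainder of each weight is non-negative, the stage-1 partial sum is a lower bound on the full result; mispredictions are only possible in one direction, which is why the number of prediction bits can be tuned empirically without retraining.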
The challenge is that iEONs are typically dispersed randomly throughout the output feature maps (OFMs), creating bubbles in the pipeline. Bubbles save energy, but latency isn't improved. To circumvent this, the authors observe that the number of iEONs typically has low variance for Max-Pooling layers within a network, and they similarly leverage input sharing in ReLU layers. Tracking these iEONs (per OFM or input feature map (IFM), depending on layer type) allows near-full utilization of the accelerator architecture. The benefit also stacks on top of other techniques (e.g., Cnvlutin and Stripes) for a net positive gain. In summary, this work performs dynamic "pruning" based on partial computations, and significantly improves both performance and energy compared to prior art.
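To illustrate why tracking iEONs recovers latency rather than just energy, here is a minimal scheduling sketch. The function name, the `mac_lanes` parameter, and the boolean-mask representation of tracked iEONs are my assumptions, not the paper's hardware design; it only shows the compaction idea of packing surviving work densely across lanes instead of leaving bubbles:

```python
import numpy as np

def compact_dispatch(ieon_mask, mac_lanes=16):
    """Sketch of bubble-free dispatch (hypothetical scheduler).

    `ieon_mask` marks output neurons predicted ineffectual. Instead of
    letting a lane idle whenever its neuron is skipped (a pipeline
    bubble), the scheduler gathers only the effectual output indices
    and packs them densely across the MAC lanes.
    """
    effectual = np.flatnonzero(~ieon_mask)   # indices of surviving work
    # Every issued batch fills all lanes with useful work, so skipped
    # neurons now shorten the schedule instead of just gating power.
    for start in range(0, len(effectual), mac_lanes):
        yield effectual[start:start + mac_lanes]
```

With a skip rate of, say, 50%, this compaction halves the number of issued batches, which is the mechanism behind the reported speedup rather than an energy-only saving.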