Scaling Datacenter Accelerators with Computation Reuse Architectures

Authors: Adi Fuchs and David Wentzlaff, Princeton University
Venue: ISCA 2018

Being the third paper at ISCA 2018 to exploit input redundancy in one way or another (after EVA2 and Euphrates), COREx (COmputation-REuse Accelerators) proposes an effective idea to improve the speedup and energy efficiency of datacenters. The paper is motivated by the manifestation of Zipf's law in datacenter workloads such as internet traffic and data compression. As the title suggests, COREx stores the outputs and inputs of common kernels and skips computation by sending the stored output to the host when the current input matches a stored input. They call this storing step "memoization". Trading communication for computation, this work is the exact opposite of AMNESIAC (published at ASPLOS 2017), which trades computation for communication.

They define three constraints that need to be satisfied in order for memoization to be successful. (1) Correct results: memoization must produce the same result as a non-memoized system; for example, a search query like "where am I" cannot be safely memoized on the query text alone, because the answer depends not just on the input but also on the location of the user. (2) Constrained storage: understanding the variance of inputs is important in designing the storage system required for memoization. (3) Cost-worthy reuse: for memoization to be effective, the cost of storing, looking up, and fetching results must be less than that of performing the actual computation.
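Constraint (3) can be expressed as a simple expected-cost inequality. Below is a minimal sketch of that break-even check; the function name and all parameters are hypothetical, and the units are arbitrary (e.g., cycles):

```python
def reuse_is_profitable(t_lookup, t_fetch, t_store, hit_rate, t_compute):
    """Return True if memoization beats always recomputing.

    Expected memoized cost: every request pays the lookup; hits pay a
    fetch; misses pay the full computation plus the cost of storing
    the new <input, output> entry.
    """
    t_memo = t_lookup + hit_rate * t_fetch + (1 - hit_rate) * (t_compute + t_store)
    return t_memo < t_compute
```

For a long-running kernel with a high hit rate (say `reuse_is_profitable(1, 2, 3, 0.9, 100)`), reuse wins easily; for a cheap kernel that rarely repeats, the lookup and store overheads dominate and memoization loses.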

The COREx architecture consists of three main components: (1) the Input Hashing Unit (IHU), (2) the Input Look-up Unit (ILU), and (3) the Computation History Table (CHT). The IHU constructs two hashes from the input blocks: an index hash and a pointer hash. The ILU acts as a cache for the CHT: it is indexed by the index hash, and its data array stores the pointer hash, which serves as an index into the CHT. The CHT stores <input, output> tuples. On a lookup, the stored input is compared against the current input; if it is a hit, the stored output is sent to the host, thereby avoiding the computation and saving energy and time.
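The lookup path above can be modeled in software as a two-level memoization table. The sketch below is a toy functional model, not the hardware design: the class name, hash widths, and table sizes are all hypothetical choices for illustration. Note the full input comparison before reuse, which guards correctness against hash collisions:

```python
import hashlib

class CORExSketch:
    """Toy model of the COREx lookup flow (hypothetical simplification).

    IHU: derives an index hash and a pointer hash from the input block.
    ILU: a small table indexed by the index hash, holding pointer hashes.
    CHT: maps a pointer hash to an <input, output> tuple.
    """

    def __init__(self, ilu_sets=256):
        self.ilu_sets = ilu_sets
        self.ilu = {}   # index hash -> pointer hash (cache for the CHT)
        self.cht = {}   # pointer hash -> (input, output)

    def _hashes(self, data: bytes):
        # IHU: split one digest into the two hashes (widths are arbitrary).
        digest = hashlib.sha256(data).digest()
        index_hash = int.from_bytes(digest[:2], "little") % self.ilu_sets
        pointer_hash = digest[2:10]
        return index_hash, pointer_hash

    def lookup_or_compute(self, data: bytes, kernel):
        index_hash, pointer_hash = self._hashes(data)
        if self.ilu.get(index_hash) == pointer_hash and pointer_hash in self.cht:
            stored_input, stored_output = self.cht[pointer_hash]
            if stored_input == data:        # full compare: correct results
                return stored_output        # hit: skip the computation
        output = kernel(data)               # miss: compute, then memoize
        self.ilu[index_hash] = pointer_hash
        self.cht[pointer_hash] = (data, output)
        return output
```

Running the same input through `lookup_or_compute` twice invokes the kernel only once; the second call is served entirely from the tables, which is where the time and energy savings come from.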

