Post-Silicon CPU Adaptation Made Practical Using Machine Learning

September 06, 2019

Authors: Stephen J. Tarsa, Gautham Chinya, Hong Wang, et al.
Venue: ISCA 2019

Preface: This is one of my favorite papers of 2019: it is well written, shows poise in applying machine learning techniques, and gives serious consideration to real-world applicability. I read this paper and produced my own slides for it, which can be found here.

Overview: This paper presents an adaptive architecture controlled by a machine learning solution. Adaptive architecture itself is not a novel idea; there have been several works on tile-based clock gating, heterogeneous core scheduling, pipeline gating, etc. This work chooses a simple piece of adaptive hardware: a binary decision to enable or disable a cluster. In this case, a cluster comprises an instruction cache, a decoder, a memory execution unit, a register file, a ROB, and execution units. The authors hint at this being something like a modern SMT core, which can use all its resources for a single thread, or optionally disable roughly half of them for some performance penalty. Depending on the nature of the workload, it may benefit more or less from enabling the second cluster.

The authors formulate the problem as achieving the best performance per watt (PPW) while meeting a minimum service level agreement (SLA). This agreement is defined as achieving 90% of the IPC of the fully enabled core, and they measure the number of violations of this agreement. To predict whether the cluster should be enabled or disabled, the authors must map the current system state to future behavior. A number of performance counters are selected as predictor inputs via an adapted version of the Perona-Freeman algorithm. The authors then explore a variety of machine learning models, each bounded to a certain number of operations. Because the models differ in cost, those with fewer ops are allowed to reconfigure the system at a finer granularity. All models fall within the range of one adaptation per 10-100K instructions, with the exception of one additional baseline comparison.
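To make the formulation concrete, here is a minimal sketch of how one might score a sequence of enable/disable decisions against the SLA. It is my own illustration, not the authors' code: the `Interval` fields, `oracle_gate`, and `evaluate` are hypothetical names, and I assume every interval covers the same number of instructions so that energy per interval is proportional to power / IPC.

```python
from dataclasses import dataclass

SLA_FRACTION = 0.9  # from the paper's SLA: 90% of the fully enabled core's IPC


@dataclass
class Interval:
    """Per-interval measurements in both modes (hypothetical layout)."""
    ipc_full: float     # IPC with both clusters enabled
    power_full: float   # average power with both clusters enabled
    ipc_gated: float    # IPC with the second cluster disabled
    power_gated: float  # average power with the second cluster disabled


def oracle_gate(iv: Interval) -> bool:
    """True if disabling the cluster meets the SLA and improves PPW."""
    meets_sla = iv.ipc_gated >= SLA_FRACTION * iv.ipc_full
    better_ppw = iv.ipc_gated / iv.power_gated > iv.ipc_full / iv.power_full
    return meets_sla and better_ppw


def evaluate(intervals, decisions):
    """Return (SLA violation rate, relative PPW gain over always-enabled).

    Assumes equal instruction counts per interval, so overall PPW is
    proportional to 1 / total energy.
    """
    violations = 0
    energy_policy = 0.0
    energy_full = 0.0
    for iv, gate in zip(intervals, decisions):
        if gate and iv.ipc_gated < SLA_FRACTION * iv.ipc_full:
            violations += 1
        ipc, power = (
            (iv.ipc_gated, iv.power_gated) if gate
            else (iv.ipc_full, iv.power_full)
        )
        energy_policy += power / ipc
        energy_full += iv.power_full / iv.ipc_full
    return violations / len(intervals), energy_full / energy_policy - 1.0
```

The labels from `oracle_gate` are the kind of target a predictor would be trained to forecast from performance counters; `evaluate` then reports the two numbers the paper trades off against each other.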

The models are trained on more than 2000 traces and evaluated on over 500 traces. The end result uses a random forest, which achieves a ~22% PPW improvement with just 0.3% SLA violations, significantly better than the prior work the authors compare against. The authors also note that by training directly on a specific application, they can typically gain 2-10% more PPW and reduce SLA violations to virtually zero. In other words, a cloud provider running a few select applications on its servers could collect telemetry data for those applications, retrain a model, and update the firmware specifically for them.

Discussion: Overall, this is a very well-executed work that shows a great understanding of the machine learning process in the context of computer architecture. However, there are a few potential issues worth noting. The setup used to produce results is built on top of two simulation traces, one in each mode. Because of this, transient effects of enabling/disabling clusters may not be accurately represented. While the authors do account for the latency penalty of enabling/disabling a cluster, there will likely be other transient effects that come into play. The other potential issue is the granularity. For post-silicon data collection, monitoring the system on the order of 10-100K instruction intervals can cause significant perturbations, which would make the telemetry data inaccurate. A more realistic value would be on the order of 10M instruction intervals.
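A minimal sketch of the kind of trace stitching I have in mind is shown below. This is my own reconstruction under stated assumptions, not the authors' infrastructure: `stitched_cycles` and its parameters are hypothetical, and the switch penalty is modeled as a fixed cycle cost per mode change.

```python
def stitched_cycles(cycles_full, cycles_gated, decisions,
                    switch_penalty, start_gated=False):
    """Estimate total cycles by stitching two per-interval traces.

    cycles_full / cycles_gated: cycles each interval took in the
    fully enabled and gated simulations, respectively.
    decisions: True means run that interval with the cluster disabled.
    switch_penalty: fixed cycles charged on every mode change
    (an assumed, hypothetical value).

    Only the fixed latency is charged. State that differs between the
    two simulations at a switch point (caches, predictors, queues) is
    not modeled, which is exactly the transient-effect concern.
    """
    total = 0
    switches = 0
    gated = start_gated
    for c_full, c_gated, gate in zip(cycles_full, cycles_gated, decisions):
        if gate != gated:
            total += switch_penalty
            switches += 1
            gated = gate
        total += c_gated if gate else c_full
    return total, switches
```

For example, `stitched_cycles([100, 100, 100], [110, 150, 105], [True, False, True], switch_penalty=20)` charges three switches on top of the per-interval cycles taken from whichever trace matches each decision.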

Finally, at ~10-100K instruction intervals, it is likely that consecutive intervals will be executing relatively similar code (prior work suggests workload phases typically last 10-100M instructions). This means that while the proposed predictors are effective, it is highly possible that there are few changes in state (i.e., a 5M instruction trace may allow for the same prediction 99% of the time). At the more realistic granularity of 10-100M instructions, phases would change frequently, and prediction of the dynamic environment may be more difficult.
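One way to check this concern would be to measure how often the ideal decision simply repeats from one interval to the next, which is also the accuracy of a trivial "same as last time" predictor. The helper below is a hypothetical sketch of that measurement, not something taken from the paper.

```python
def last_value_accuracy(labels):
    """Fraction of intervals whose label equals the previous interval's.

    labels: per-interval ideal decisions (e.g. True = disable cluster).
    This equals the accuracy of a predictor that always repeats the
    previous interval's label.
    """
    if len(labels) < 2:
        return 1.0
    same = sum(1 for prev, cur in zip(labels, labels[1:]) if prev == cur)
    return same / (len(labels) - 1)
```

If this number is already close to 1 at a 10-100K instruction granularity, a learned model has little room to show its value on those traces, and the comparison at coarser intervals becomes the more informative one.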
