Contention-Aware Scheduling on Multi-core Systems

Authors: Sergey Blagodurov, Sergey Zhuravlev, Alexandra Fedorova
Venue:    ACM Transactions on Computer Systems 2010

In format, this is not a traditional paper and reads more like a master's thesis. The work argues that contention-aware scheduling requires two ingredients: a classification scheme and a scheduling policy. The first section of the paper dives into classification schemes. Classification in this context measures a workload's sensitivity (how much an application suffers when it receives less cache) and intensity (how much an application harms others by consuming cache). The authors develop a new "Pain" scheme which characterizes both of these metrics and can predict contention using stack distance profiles and hardware counters (LLC_LINES_IN). While the procedures are developed specifically to characterize interactions in the LLC, the work then presents results demonstrating that LLC contention is only a small factor: according to this work, interconnect and memory controller contention are often far more of an overhead than the LLC itself.
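To make the Pain idea concrete, here is a minimal Python sketch. The SDP representation (hit counts per LRU stack position), the truncation-based sensitivity approximation, and all names are my assumptions, not the authors' code: the paper derives sensitivity from the SDP more carefully and uses the miss rate as the intensity proxy.

    # Hedged sketch of the "Pain" classification, assuming an SDP given as hit
    # counts per LRU stack position; sensitivity here is a crude stand-in for
    # the paper's SDP-based formula, and intensity is the measured miss rate.

    def sensitivity(sdp_hits, ways_kept):
        """Fraction of an app's cache hits that would become misses if it
        retained only `ways_kept` LRU stack positions under contention."""
        total = sum(sdp_hits)
        return 0.0 if total == 0 else 1.0 - sum(sdp_hits[:ways_kept]) / total

    def pain(a, b):
        """Combined Pain of co-scheduling a and b behind one shared LLC:
        each app's sensitivity scaled by its neighbor's intensity."""
        return (a["sensitivity"] * b["miss_rate"] +
                b["sensitivity"] * a["miss_rate"])

    # Illustrative numbers only: lower combined Pain -> better co-schedule.
    a = {"sensitivity": 0.6, "miss_rate": 2.0}
    b = {"sensitivity": 0.1, "miss_rate": 8.0}
    c = {"sensitivity": 0.5, "miss_rate": 7.0}
    print(pain(a, b), pain(a, c))   # 5.0 vs 5.2 -> co-schedule a with b

A scheduler would then pick the assignment that minimizes total Pain across all shared caches.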

The scheduling policy is referred to as Distributed Intensity (DI) and employs Centralized Sort, which sorts applications by their miss rate. DI is a recursive, hierarchical algorithm which greedily spreads work across scheduling domains. The scheduling hierarchy is arranged in a similar fashion to CFS's: cores, then pairs of cores sharing an LLC, then NUMA domains. However, unlike CFS, scheduling happens globally, meaning each assignment traverses the entire hierarchy rather than letting individual domains schedule within themselves. This is likely an implementation/explanation simplification--there is no algorithmic reason the same policy could not be maintained with per-domain scheduling. The authors evaluate two versions: one which characterizes applications offline, and another (DIO) which performs the characterization online. The online characterization is only marginally worse than the offline one, a positive result. The net effect of the algorithm is to spread high-miss-rate workloads evenly around the system, which yields an overall performance improvement of ~2% for the workloads tested (SPEC CPU2006 mixes). The authors also show DI can be adapted to target lower energy.
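The core of DI is easy to sketch. Below is a hedged Python rendition under my own assumptions: a flat list of single-threaded apps, a machine described as NUMA domains of shared-LLC core pairs, and a highest-with-lowest pairing as one way to realize the equalizing idea; the authors' actual implementation works recursively over the hierarchy, and the names here are illustrative.

    # Minimal sketch of Distributed Intensity (DI); not the authors' code.

    def distributed_intensity(apps, num_domains, pairs_per_domain):
        """Centralized Sort, then spread: rank apps by LLC miss rate and pair
        the most intense with the least intense so the aggregate miss rate
        behind each shared LLC comes out roughly even."""
        assert len(apps) <= 2 * num_domains * pairs_per_domain  # threads fit on cores
        ranked = sorted(apps, key=lambda a: a["miss_rate"], reverse=True)
        # Assumes an even app count for brevity.
        pairs = [(ranked[i], ranked[len(ranked) - 1 - i])
                 for i in range(len(ranked) // 2)]
        placement = {}
        for i, (hot, cold) in enumerate(pairs):
            domain = i % num_domains   # hottest apps land in distinct NUMA domains
            slot = i // num_domains    # then in distinct cache pairs within one
            placement[(domain, slot)] = [hot["name"], cold["name"]]
        return placement

    # Illustrative miss rates only.
    apps = [{"name": "mcf",    "miss_rate": 9.0},
            {"name": "lbm",    "miss_rate": 7.5},
            {"name": "gcc",    "miss_rate": 1.2},
            {"name": "povray", "miss_rate": 0.1}]
    print(distributed_intensity(apps, num_domains=2, pairs_per_domain=1))
    # {(0, 0): ['mcf', 'povray'], (1, 0): ['lbm', 'gcc']}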

Interesting Notes:

  • Despite an earlier section showing that the newly proposed "Pain" metric outperforms using miss rates, the authors note that obtaining stack distance profiles (SDPs) is undesirable. This seems intuitive, until it is pointed out that RapidMRC provides a methodology for obtaining SDPs online using performance counters (see the sketch after this list).
  • Pre-emption is used only in the DIO (DI-online) scenario, with re-evaluation every 1B cycles (roughly 2 Hz at ~2 GHz clocks).
  • Non-LLC contention is very relevant, but barely factors into the algorithm. Clearly more opportunity exists, especially if we consider other resources like I-cache thrashing, memory bandwidth, and, by extension, memory latency.
  • Assumes # threads < # cores
  • No notion of fairness
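
On the SDP point above: the reason SDPs are attractive is that the miss-rate curve for any cache size can be read straight off one, which is the quantity RapidMRC approximates online. A hedged sketch follows; the representation and names are mine, not from the paper or RapidMRC.

    # Given an SDP (hit counts per LRU stack position) plus a cold-miss count,
    # the full miss-rate curve falls out directly: accesses whose stack
    # distance exceeds the cache size miss, as do all first-touch accesses.

    def miss_rate_curve(sdp_hits, cold_misses):
        """miss_rate(c) for every cache size c, in LRU stack positions."""
        total = sum(sdp_hits) + cold_misses
        return [(sum(sdp_hits[c:]) + cold_misses) / total
                for c in range(len(sdp_hits) + 1)]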

