Posts

Multi-Resource Packing for Cluster Schedulers

Authors: Robert Grandl, Ganesh Ananthanarayanan, Srikanth Kandula, Sriram Rao, Aditya Akella
Venue: SIGCOMM 2014
Cluster-level scheduling is a complex topic in which performance, fairness, and hard constraints must all be considered. Fundamentally, a perfectly fair solution sacrifices performance. This work presents a resource-aware cluster scheduling scheme, Tetris, which maximizes performance and includes additional parameters to balance fairness requirements. For simplicity, I will divide the discussion into two sections: the central idea and additional heuristics. Tetris performs scheduling by analyzing jobs' resource requirements in terms of CPU, memory, disk I/O, and network usage. Each job, task (a subset of a job), and machine is assigned a resource vector. To determine the optimal placement of a task, a heuristic takes the dot product of the task's resource requirements against a candidate machine's available resources. The machine with the maximum dot product is selected to p...
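The packing heuristic above can be sketched in a few lines. This is my own illustration (names and shapes are mine, not from the Tetris implementation): score every machine that can fit the task by the dot product of the task's demand vector and the machine's available-resource vector, and pick the highest score.

```python
import numpy as np

def pick_machine(task_demand, machines_available):
    """Return the index of the machine with the largest demand/availability
    dot product among machines where the task fits, or None if none fit."""
    best, best_score = None, -1.0
    for i, avail in enumerate(machines_available):
        if np.any(task_demand > avail):   # task does not fit on this machine
            continue
        score = float(np.dot(task_demand, avail))
        if score > best_score:
            best, best_score = i, score
    return best

task = np.array([2.0, 4.0, 1.0, 0.5])         # CPU, memory, disk I/O, network
machines = [np.array([4.0, 8.0, 2.0, 1.0]),
            np.array([1.0, 8.0, 2.0, 1.0]),   # too little CPU: skipped
            np.array([8.0, 16.0, 4.0, 2.0])]
print(pick_machine(task, machines))           # → 2 (largest dot product)
```

The dot product rewards aligning a task's dominant demand with a machine's dominant spare capacity, which reduces resource fragmentation compared to packing along a single dimension.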

Sparrow: Distributed, Low Latency Scheduling

Authors: Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica
Venue: SOSP 2013
This work presents Sparrow, a stateless, decentralized scheduler for cluster scheduling. The scheduling component uses two key ideas: batch sampling and late binding. Batch sampling is an extension of the power of two choices [1], which shows that the load "tail" can be cut off quickly by sampling two machines and choosing the less loaded one rather than selecting a machine at random. Batch sampling generalizes this by probing d·m machines for a job of m tasks and placing the m tasks on the least-loaded of the probed machines. Late binding delays the actual task transfer until the machine is ready to process the request. This can be thought of as placing a placeholder in the worker's queue; when the worker is finally ready, the actual task is transferred from the scheduler to the worker. This avoids relying on inaccurate metrics such as queue depth. Each worker maintains its "instance" of Sparrow, which us...
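Batch sampling as described above can be sketched as follows. This is an illustrative simplification (function and variable names are mine, not Sparrow's code): probe d·m randomly chosen workers, then take the m least-loaded of the probes.

```python
import random

def batch_sample(worker_loads, m, d=2, rng=random):
    """Probe d*m random workers; return indices of the m least-loaded probes."""
    probed = rng.sample(range(len(worker_loads)), d * m)
    probed.sort(key=lambda w: worker_loads[w])
    return probed[:m]

loads = [5, 1, 9, 0, 7, 3, 6, 2]                  # current queue length per worker
chosen = batch_sample(loads, m=2, rng=random.Random(0))
print(chosen)                                     # two least-loaded probed workers
```

Note that in real Sparrow the probe responses are further refined by late binding, so placement is not decided from queue depth alone; this sketch only shows the sampling step.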

ZCOMP: Reducing DNN Cross-Layer Memory Footprint Using Vector Extensions

Authors: Berkin Akin, Zeshan A. Chishti, and Alaa R. Alameldeen
Venue: MICRO 2019
With accelerators dominating the deep learning space in architecture conferences, this paper stands out as it focuses on reducing DNN inference/training overhead while using a CPU. An obvious question is: with the widespread use of GPUs/TPUs for deep learning, why should we focus on optimizing CPUs for deep learning? In a recent paper published at HPCA 2019, Facebook claims that CPUs are preferred for applications where a tight integration is required between DNN and non-DNN tasks. Also, Intel's recent AVX-512 has specialized support for DNNs in the form of new instructions called Vector Neural Network Instructions (VNNI). Broadly, optimizing DNNs can be viewed from two different perspectives: computation and communication. This paper targets reducing communication overhead, more specifically, activation (feature-map) communication overhead, by compressing activations. Compressing activations/weights...
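To give a feel for why compressing activations helps, here is a toy sketch of the general idea only; it is not ZCOMP's vector-extension mechanism. Post-ReLU activations contain many zeros, so storing a presence bitmask plus the nonzero values shrinks the cross-layer footprint.

```python
import numpy as np

def compress(acts):
    """Split an activation array into a nonzero bitmask and packed values."""
    mask = acts != 0
    return mask, acts[mask]

def decompress(mask, values):
    """Rebuild the dense activation array from mask + packed values."""
    out = np.zeros(mask.shape, dtype=values.dtype)
    out[mask] = values
    return out

acts = np.maximum(np.array([-1.0, 2.0, 0.0, 3.0, -0.5, 0.0]), 0.0)  # ReLU
mask, vals = compress(acts)
restored = decompress(mask, vals)
print(vals)         # → [2. 3.]  (only nonzero activations stored)
```

ZCOMP performs this kind of compression/decompression inline with AVX-512-style vector instructions rather than in software, which is where the paper's savings come from.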

SOSA: Self-Optimizing Learning with Self-Adaptive Control for Hierarchical System-on-chip Management

Authors: Bryan Donyanavard, Tiago Muck, Amir M. Rahmani, Nikil Dutt, Armin Sadighi, Florian Maurer, Andreas Herkersdorf
Venue: MICRO 2019
This work presents a hybrid control-theory / reinforcement-learning approach, called SOSA, to solve online parameter tuning for SoCs. While controllers are typically known for being lightweight and RL for being expensive, the authors build the hierarchy opposite to what you might expect: the RL models, Learning Classifier Tables (LCTs), serve as the low-level controllers, while the high-level supervisor uses Supervisory Control Theory (SCT). The SCT supervisor controls a high-level abstraction of the system, which must be consistent with the low-level system "as defined in the Ramadge-Wonham control mechanism" [1]. This assumption requires further investigation. LCTs are a simpler RL algorithm compared to today's deep-neural-network approaches. They utilize rule-based learning to target an objective function, which may be multivariate. The...

Contention-Aware Scheduling on Multi-core Systems

Authors: Sergey Blagodurov, Sergey Zhuravlev, Alexandra Fedorova
Venue: ACM Transactions on Computer Systems 2010
In format, this is not a traditional paper and reads more like a master's thesis. The work argues that contention-aware scheduling requires both a classification scheme and a scheduling policy. The first section of the paper dives into classification schemes. Classification in this context measures a workload's sensitivity (how much an application suffers when it receives less cache) and intensity (how much an application will harm others by utilizing the cache). The authors develop a new "Pain" scheme which characterizes both of these metrics and is able to predict contention using stack distance profiles and hardware counters (LLC_LINES_IN). While the work develops procedures specifically focused on characterizing interactions in the LLC, it then provides results demonstrating that LLC contention is only a small factor. I...
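The sensitivity/intensity classification above can be sketched numerically. This is my own hedged simplification of the "Pain" idea (the paper's actual metrics are derived from stack distance profiles and counters): the pain a co-runner inflicts on an application combines that application's sensitivity with the co-runner's intensity, and a scheduler prefers pairings that minimize mutual pain.

```python
def pain(sensitivity_a, intensity_b):
    """Estimated pain app A suffers when co-scheduled with app B."""
    return sensitivity_a * intensity_b

def mutual_pain(app_a, app_b):
    """Symmetric pairing cost; each app is a (sensitivity, intensity) tuple."""
    return pain(app_a[0], app_b[1]) + pain(app_b[0], app_a[1])

cache_hog = (0.2, 0.9)   # insensitive to cache loss, but intense
streamer  = (0.8, 0.1)   # sensitive to cache loss, but gentle
print(mutual_pain(cache_hog, streamer))   # → 0.74
```

A contention-aware scheduler would evaluate this cost over candidate pairings of runnable threads and co-schedule the lowest-cost pairs on shared caches.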

Monotasks: Architecting for Performance Clarity in Data Analytics Frameworks

Authors: Kay Ousterhout, Christopher Canel, Sylvia Ratnasamy, Scott Shenker
Venue: SOSP 2017
This paper presents a data analytics framework which focuses on providing performance clarity first. Consider a user running their analytics framework on EC2. They need to improve performance. Do they invest in more vCPUs, more memory, more disks per node, or more network bandwidth? If they do upgrade, what will the expected performance improvement be? To solve this, Monotasks centers on a framework which decomposes all tasks into single-resource units of work: disk use, network I/O, or CPU. The framework is integrated into Spark and referred to as "MonoSpark". Because tasks are decomposed at the worker level, the existing Spark API is maintained. By dividing tasks into individual units of disk/network/CPU work, the framework can track the precise total amount of work for each resource and then determine performance changes under new resource constraints. Most queries are ...
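The "what if I upgrade" question reduces to a bottleneck calculation once work is decomposed per resource. The sketch below is my own simplification, not MonoSpark code: with total work known per resource, ideal runtime under some hardware configuration is the maximum of work divided by throughput across resources.

```python
def predicted_runtime(total_work, capacity):
    """Ideal job runtime: the slowest (bottleneck) resource dominates.
    total_work: resource -> units of work; capacity: resource -> units/sec."""
    return max(total_work[r] / capacity[r] for r in total_work)

work = {"cpu": 1200.0, "disk": 900.0, "network": 400.0}   # measured monotask totals
base = {"cpu": 4.0, "disk": 2.0, "network": 1.0}          # current hardware
print(predicted_runtime(work, base))                      # → 450.0 (disk-bound)

faster_disk = dict(base, disk=4.0)                        # hypothetical upgrade
print(predicted_runtime(work, faster_disk))               # → 400.0 (now network-bound)
```

This is why per-resource decomposition gives "performance clarity": the model immediately shows that after the disk upgrade, further disk spending is wasted because the network becomes the bottleneck.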

Continuous Control with Deep Reinforcement Learning (DDPG)

Authors: Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicholas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra
Venue: ICLR 2016
This work focuses on solving the problem of a complex environment AND complex actions. Formally, the work presents "an actor-critic, model-free algorithm based on the deterministic policy gradient (DPG) that can operate over continuous action spaces". This work builds on two prior works, namely the Deep Q-Network (DQN) and the DPG algorithm. While DQN proposed using a deep neural network to enable RL to perform well in more complex tasks, it suffers from instability in large action spaces. Orthogonally, DPG offers a solution to large action spaces but cannot support the use of a deep network. This work extends DPG to fix the instability issues by adding batch normalization and a target network. Batch normalization normalizes each dimension such that samples in a minibatch have zero mean and unit variance. The target ne...
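The target-network trick mentioned above can be shown in miniature. In DDPG the target network's weights trail the online network's via a slow exponential moving average, θ' ← τθ + (1 − τ)θ', which keeps the bootstrapped targets stable. The sketch below uses my own variable names on plain arrays rather than a real network.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.001):
    """Move each target parameter a fraction tau toward the online parameter."""
    return [(1 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

target = [np.zeros(3)]   # stale copy used to compute Q-learning targets
online = [np.ones(3)]    # weights being trained
target = soft_update(target, online, tau=0.5)
print(target[0])         # → [0.5 0.5 0.5], halfway between old target and online
```

With the paper's small τ (e.g. 0.001), the targets change slowly even while the online network updates every step, which is what damps the divergence DQN-style bootstrapping otherwise exhibits with continuous actions.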