The Linux Scheduler: A Decade of Wasted Cores
Authors: Jean-Pierre Lozi, Baptiste Lepers, Justin Funston, Fabien Gaud, Vivien Quema, Alexandra Fedorova
Venue: EuroSys 2016
Before diving into this paper, it's worth mentioning that it reads more like a technical report than a research paper. It's clearly very important work, but the authors present software "bugs" within Linux and fixes for those bugs. The paper is not trying to establish something completely new (other than a set of tools), but rather analysis and fixes.
. . .
The paper presents findings that the Linux CFS scheduler breaks a fundamental invariant: ready threads should be scheduled whenever cores are idle. Due to the complexity added to the scheduler to deal with multicore and NUMA machines, the scheduler has issues that prevent this invariant from being upheld. CFS relies on a hierarchy of "scheduling domains" (levels of the machine topology, up to the NUMA level), each divided into "scheduling groups" of cores. As an aside, it is not clear whether the word "node" is used to mean a NUMA domain or a core, and some parts are muddied because of this.
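To make the group/domain terminology concrete, here is a minimal C sketch of the hierarchy as described above. The struct names, fields, and load numbers are my own simplification for illustration, not the kernel's actual struct sched_domain / struct sched_group definitions.

```c
#include <stdio.h>

/* Illustrative only: cores are collected into scheduling groups, and groups
 * are collected into scheduling domains (the top-level domains corresponding
 * to NUMA nodes). Names and fields are simplified, not the kernel's. */
struct group {
    int cores[8];          /* core IDs in this group                   */
    int nr_cores;
    unsigned long load;    /* load metric compared during balancing    */
};

struct domain {
    struct group *groups;  /* groups at this level                     */
    int nr_groups;
    struct domain *parent; /* next level up, e.g. the NUMA level       */
};

int main(void)
{
    struct group gs[] = {
        { .cores = {0, 1}, .nr_cores = 2, .load = 100 },  /* busy pair of cores */
        { .cores = {2, 3}, .nr_cores = 2, .load = 0 },    /* idle pair of cores */
    };
    struct domain numa_node = { .groups = gs, .nr_groups = 2, .parent = NULL };

    for (int i = 0; i < numa_node.nr_groups; i++)
        printf("group %d: load=%lu\n", i, numa_node.groups[i].load);
    return 0;
}
```

Load balancing walks this hierarchy level by level: cores balance within a group, groups balance within a domain, and so on up to the NUMA level.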
- Group Imbalance Bug. Balancing between groups compares average load. Because averaging can mask the presence of idle cores within a group, the scheduler can fail to balance across cores (see the first sketch after this list). Fix: use the minimum load within a group instead of the average. Potential issue: lots of ping-ponging. Though the authors state this does not occur, it would be of interest to see the net impact across all workloads and/or the change in the number of migrations.
- Scheduling Group Construction Bug. The way scheduling groups are constructed can cause a core to be contained in multiple groups, and can group cores from separate NUMA domains together. This can lead to work not being spread across cores evenly, particularly when taskset is used to pin an application to a subset of cores. The fix changes how the groups are constructed, though it is not clear how (only one short paragraph covers this fix).
- Overload-on-Wakeup Bug. When a sleeping thread wakes up, it is placed back on the core where it slept. The idea is to preserve cache locality. However, depending on sleep time scales, this is not ideal and can cause waking threads to pile onto busy cores while idle cores remain available. A good example is when multiple threads are sleeping on a synchronization point and then all wake up together. Fix: place wakeups on the core that has been idle for the longest amount of time (see the second sketch after this list).
- Missing Scheduling Domain. This example is truly a code bug. Users have the ability to disable cores via /proc. When a core is re-enabled after this, it is not added back to the scheduler properly and thus receives no work.
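To illustrate the group-imbalance point (bug 1 above), here is a small, self-contained sketch of why comparing average loads can hide an idle core while comparing minimum loads does not. The load numbers and function names are made up for illustration; this is not the actual CFS balancing code.

```c
#include <stdio.h>

/* Hypothetical per-core loads. The idle core (load 0) sits in a group whose
 * average is inflated by one high-load thread on a sibling core. */
static const unsigned long my_group[]    = { 1000, 0, 0, 0 };      /* avg 250, min 0   */
static const unsigned long other_group[] = { 200, 200, 200, 200 }; /* avg 200, min 200 */
#define NR 4

static unsigned long avg(const unsigned long *l, int n)
{
    unsigned long s = 0;
    for (int i = 0; i < n; i++)
        s += l[i];
    return s / n;
}

static unsigned long min(const unsigned long *l, int n)
{
    unsigned long m = l[0];
    for (int i = 1; i < n; i++)
        if (l[i] < m)
            m = l[i];
    return m;
}

int main(void)
{
    /* Average-based comparison: the idle core's own group already looks
     * busier (250 > 200), so it concludes there is nothing worth stealing
     * and stays idle. */
    printf("avg: my=%lu other=%lu -> steal? %s\n",
           avg(my_group, NR), avg(other_group, NR),
           avg(my_group, NR) < avg(other_group, NR) ? "yes" : "no");

    /* Minimum-based comparison (the paper's fix): the group contains a core
     * with load 0 (< 200), so balancing correctly pulls work toward it. */
    printf("min: my=%lu other=%lu -> steal? %s\n",
           min(my_group, NR), min(other_group, NR),
           min(my_group, NR) < min(other_group, NR) ? "yes" : "no");
    return 0;
}
```

The idle core's group looks "busier on average" than the other group because of one high-load sibling, so with averages it never steals work; with minimums the idle core is visible and work gets pulled over.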
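And for the overload-on-wakeup point (bug 3 above), a toy sketch of the fix's idea: instead of always waking a thread on the core where it last ran, prefer an idle core, picking the one that has been idle the longest. The data structures and timestamps here are made up; this is not the kernel's actual wakeup path.

```c
#include <stdio.h>

#define NR_CORES 4

struct core_state {
    int busy;                    /* 1 if currently running a thread        */
    unsigned long idle_since_ns; /* timestamp at which the core went idle  */
};

/* Original behaviour (simplified): wake the thread on the core where it
 * last ran, even if that core is busy and other cores are idle. */
static int wake_on_prev(int prev_core)
{
    return prev_core;
}

/* Sketch of the fix: if any core is idle, pick the one idle the longest
 * (smallest idle_since timestamp); otherwise fall back to the previous core. */
static int wake_on_longest_idle(const struct core_state *cores, int prev_core)
{
    int target = -1;
    for (int c = 0; c < NR_CORES; c++) {
        if (cores[c].busy)
            continue;
        if (target < 0 || cores[c].idle_since_ns < cores[target].idle_since_ns)
            target = c;
    }
    return target >= 0 ? target : prev_core;
}

int main(void)
{
    struct core_state cores[NR_CORES] = {
        { .busy = 1 },                        /* core 0: busy (where we slept) */
        { .busy = 0, .idle_since_ns = 900 },  /* core 1: idle recently         */
        { .busy = 0, .idle_since_ns = 100 },  /* core 2: idle the longest      */
        { .busy = 1 },                        /* core 3: busy                  */
    };
    printf("old policy -> core %d\n", wake_on_prev(0));
    printf("fixed policy -> core %d\n", wake_on_longest_idle(cores, 0));
    return 0;
}
```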
The examples in this paper are strong and clearly illustrate performance issues. However, some do feel a bit contrived, such as bug 2, which requires taskset, or bug 4, which requires cores to be disabled and then re-enabled. Nevertheless, the corrections are important. The paper also presents a tooling methodology to monitor the scheduler and track the invariant; when a violation is detected, the tool can then perform more detailed tracing. The tool itself is the novel component, but it receives little attention in the paper. How long does the tool have to run to detect the bugs? How is performance impacted with and without the tool? Does running the tool actually make sense -- is its overhead outweighed by the gains?
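For a rough idea of what checking that invariant amounts to (the paper does not spell out the tool's implementation here, so this is purely my own sketch), one could periodically sample per-core run-queue lengths and flag any interval where some core sits idle while another core has threads waiting:

```c
#include <stdio.h>

#define NR_CORES 4

/* Sketch of the invariant check: no core should be idle while another core
 * has ready-but-not-running threads. A real tool would sample actual per-core
 * run-queue lengths (e.g. via scheduler tracing) rather than a hard-coded array. */
static int invariant_violated(const int *runqueue_len, int nr_cores)
{
    int has_idle = 0, has_waiting = 0;
    for (int c = 0; c < nr_cores; c++) {
        if (runqueue_len[c] == 0)
            has_idle = 1;      /* nothing running or queued on this core      */
        if (runqueue_len[c] > 1)
            has_waiting = 1;   /* threads waiting behind the running one      */
    }
    return has_idle && has_waiting;
}

int main(void)
{
    int sample[NR_CORES] = { 3, 0, 1, 1 };  /* core 0 overloaded, core 1 idle */
    if (invariant_violated(sample, NR_CORES))
        printf("invariant violated: idle core while threads wait -> start detailed tracing\n");
    return 0;
}
```

A real implementation would presumably sample at some regular interval and only kick off heavier tracing once a violation persists, which is exactly where the unanswered overhead questions above come in.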
Overall, a good read for getting into the scheduler, with very interesting invariant violations by the scheduler -- found almost 10 years after its deployment!