FACT: A Framework for Adaptive Contention-aware Thread Migrations

Authors: Kishore Kumar Pusukuri, David Vengerov, Alexandra Fedorova, Vana Kalogeraki
Venue: Computing Frontiers (CF) 2011

This paper presents one of the first applications of machine learning to the thread scheduling problem on multi-core systems. In 2011 (and, I believe, still today in 2019), OSes do not account for resource-sharing effects such as shared caches, prefetchers, the memory bus, and the memory controller. To schedule tasks effectively, the OS needs to understand how different workloads utilize these resources and what the overall effects of resource sharing are. This paper uses a machine learning approach to predict the effects of potential thread migrations. The work finds that fuzzy rule-based predictors work best, outperforming the default scheduler by ~11% and the prior art by ~2%. The remainder of this post discusses the algorithm and problem setup. The discussion may come across as critical, but it is meant only to be thought-provoking, and counterarguments are welcome.

The base algorithm searches through core groups and, for each thread in a core group, predicts the average IPC of all threads if that thread were migrated. While not explicitly stated, it seems this prediction step would also benefit from considering migrating that thread to any potential core group and swapping it with any candidate thread in that group. With 2 groups of 2 threads each, there are only 4 possible swaps. However, with 4 groups of 4 threads per group, I believe there are 96 possible swaps (6 group pairs, each with 4 × 4 thread choices). I think this algorithm fundamentally suffers from a scaling issue in enumerating candidate migrations, as a modern processor has at least 2 levels of sharing hierarchy (core pairs, NUMA domains), typically 16-20 cores per NUMA domain, and typically 1-4 NUMA domains. Removing the redistribution (swap) step would drastically reduce this number; this is what the Linux CFS scheduler does in practice, load balancing via one-way migrations.
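The swap counts above can be sanity-checked with a short sketch. This is my own helper, not code from the paper: it counts pairwise swaps where each swap exchanges one thread from each of two distinct core groups.

```python
from itertools import combinations

def num_pairwise_swaps(num_groups, threads_per_group):
    # Each swap picks an unordered pair of distinct groups, then one
    # thread from each group.
    group_pairs = len(list(combinations(range(num_groups), 2)))
    return group_pairs * threads_per_group ** 2

print(num_pairwise_swaps(2, 2))  # 4  (1 group pair x 2 x 2)
print(num_pairwise_swaps(4, 4))  # 96 (6 group pairs x 4 x 4)
```

The quadratic growth in both group count and threads per group is the scaling concern: each candidate swap requires its own IPC prediction.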

The paper also only considers scenarios where there are exactly as many threads as cores. In a multi-core system, under- or over-subscription is common. Simply balancing the run queues may not work, because each queue would then hold a mix of threads with low and high resource requirements, which may get arbitrarily co-scheduled since queues on separate core groups are independent. Another option would be to place "high load" threads in one group and "low load" threads in another. However, it is not clear how to maintain such a split when the high and low loads are imbalanced, or when the notion of high/low load spans multiple interacting resources.
