Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior
Authors: Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter
Venue: MICRO 2010
This paper presents Thread Cluster Memory Scheduling (TCM), a memory scheduling algorithm that targets optimizing both system throughput and fairness. To achieve this, three key ideas are employed.
First, threads are clustered as either bandwidth-intensive or non-intensive. The key insight is that low-bandwidth threads are more sensitive to latency. While the paper presents an example, an easier way to reason about this is that a low-bandwidth thread needs only a small fraction of memory service time to see a significant performance increase. In other words, prioritizing it yields a high return on investment with minimal impact on other threads. As such, non-bandwidth-intensive threads are always given the highest priority.
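The clustering step can be sketched roughly as follows. This is a simplified reading of TCM's per-quantum grouping, not the paper's exact implementation: threads are walked from lightest to heaviest memory usage, and the lightest threads whose combined consumption fits under a `cluster_thresh` fraction of total bandwidth (a tunable knob in the paper) form the latency-sensitive cluster.

```python
def cluster_threads(bandwidth_usage, cluster_thresh=0.10):
    """Partition threads into a latency-sensitive (non-intensive) cluster
    and a bandwidth-intensive cluster, in the spirit of TCM's per-quantum
    clustering. A sketch, not the paper's exact implementation.

    bandwidth_usage: dict thread_id -> bandwidth consumed last quantum.
    cluster_thresh:  fraction of total bandwidth allotted to the
                     latency-sensitive cluster (illustrative default).
    """
    total = sum(bandwidth_usage.values())
    latency_sensitive, bw_intensive = [], []
    used = 0.0
    # Walk threads from lightest to heaviest; the lightest threads whose
    # combined usage stays within cluster_thresh * total are deemed
    # latency-sensitive and always receive top priority.
    for tid in sorted(bandwidth_usage, key=bandwidth_usage.get):
        if used + bandwidth_usage[tid] <= cluster_thresh * total:
            latency_sensitive.append(tid)
            used += bandwidth_usage[tid]
        else:
            bw_intensive.append(tid)
    return latency_sensitive, bw_intensive
```

Because the threshold is a fraction of total bandwidth rather than a fixed count, the split adapts automatically to the workload mix.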
The second observation to be exploited is the disparity in behavior among high-bandwidth threads. Specifically, a metric called niceness measures a thread's susceptibility to interference and its impact on other threads. Threads with high row-buffer locality (RBL) tend to starve out other threads, so high-RBL threads are "hostile" and receive a lower niceness value. Conversely, threads with high bank-level parallelism (BLP) are more susceptible to hostile threads, and are thus given a higher niceness value.
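The niceness metric combines the two signals by ranking. As a sketch of the idea (rank conventions here are illustrative): each thread's niceness is its BLP rank minus its RBL rank, so high BLP pushes niceness up and high RBL pushes it down.

```python
def niceness(blp, rbl):
    """Compute a niceness value per bandwidth-intensive thread, in the
    spirit of TCM: niceness = BLP rank - RBL rank. A sketch; the rank
    convention (1 = lowest metric value) is an illustrative assumption.

    blp, rbl: dicts thread_id -> measured bank-level parallelism /
              row-buffer locality over the last quantum.
    """
    def ranks(metric):
        # Ascending rank: the thread with the highest value gets rank N.
        order = sorted(metric, key=metric.get)
        return {tid: r for r, tid in enumerate(order, start=1)}

    blp_rank, rbl_rank = ranks(blp), ranks(rbl)
    # High BLP (easily interfered with) raises niceness;
    # high RBL (interferes with others) lowers it.
    return {tid: blp_rank[tid] - rbl_rank[tid] for tid in blp}
```

Using ranks rather than raw values keeps the metric robust to the very different scales of BLP and RBL.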
The final piece of the puzzle is the scheduling algorithm, insertion shuffle, which employs the niceness values. Insertion shuffle minimizes the time hostile threads hold high priority and maximizes the time nice threads hold it. There is an interesting auxiliary effect, which the authors refer to as "memory service leaks": even when a nice thread holds high priority, lower-priority threads can still receive memory service, because the nice thread's requests leave banks idle. Finally, in the event that all threads have relatively similar behavior, a randomized shuffle is used instead.
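One way to picture insertion shuffle is as a cycle of priority orderings in which the least nice thread is bubbled up to the top one slot at a time and then back down. The sketch below is a reconstruction of that intuition, not the paper's exact permutation schedule: nice threads occupy the high-priority slots for most of the cycle, while every thread still periodically reaches the top.

```python
def insertion_shuffle_cycle(ranking):
    """Generate one cycle of priority orderings for the
    bandwidth-intensive cluster (an illustrative reconstruction of
    TCM's insertion shuffle, not the paper's exact schedule).

    ranking: thread ids sorted nicest-first; index 0 = highest priority.
    Returns a list of orderings, one per shuffle interval.
    """
    n = len(ranking)
    order = list(ranking)
    cycle = [list(order)]
    # Bubble the least nice thread (last slot) up to the top...
    for i in range(n - 1, 0, -1):
        order[i - 1], order[i] = order[i], order[i - 1]
        cycle.append(list(order))
    # ...then back down toward the bottom; one final swap (omitted)
    # would restore the starting order for the next cycle.
    for i in range(n - 2):
        order[i], order[i + 1] = order[i + 1], order[i]
        cycle.append(list(order))
    return cycle
```

For four threads this yields a cycle of six orderings; the nicest thread sits in one of the top two slots in all but one interval, which is exactly the asymmetry the niceness values are meant to buy.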
In all, TCM provides higher throughput and better fairness than existing state-of-the-art schedulers such as FR-FCFS, STFM, PAR-BS, and ATLAS. In addition, TCM's thresholds allow for a predictable trade-off between throughput and fairness. The separate clusters also allow for OS thread prioritization (weighting) without completely sacrificing fairness, since prioritization is applied only relative to a thread's own cluster.