
Showing posts from November, 2018

Heracles: Improving Resource Efficiency at Scale

Authors: David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, Christos Kozyrakis
Venue: ISCA 2015

This work presents a resource controller and scheduler that improves the throughput of best-effort tasks while preserving the SLO of latency-sensitive applications. It combines tuning across many fronts: core isolation (taskset), LLC isolation (CAT), power isolation (DVFS), and network traffic isolation (qdisc). The authors show that, because these resources interact in a way that forms a convex optimization space, each can be optimized individually given the system's current load and available latency slack, which the top-level controller polls every 15 seconds. Overall, Heracles raises machine utilization to 90% without violating SLO agreements, which are defined over 60-second windows. The authors evaluate three latency-critical workloads: websearch, ml_cluster, and memkeyval, each of which stresses a different combination of cache, bandwidth, pow...
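The polling structure described above can be sketched as a simple control loop. This is a minimal illustration, not the paper's implementation: the function names, slack thresholds, and the stubbed slack reading are all invented for the sketch; only the 15-second polling period comes from the paper.

```python
POLL_INTERVAL_S = 15   # top-level controller polling period (from the paper)
SLACK_DISABLE = 0.10   # illustrative thresholds, not from the paper
SLACK_GROW = 0.35

def read_latency_slack():
    """Placeholder: fraction of the SLO latency budget currently unused."""
    return 0.4  # stubbed value so the sketch is runnable

def controller_loop(iterations=3):
    """Poll slack and decide how to treat best-effort (BE) tasks."""
    decisions = []
    for _ in range(iterations):
        slack = read_latency_slack()
        if slack < SLACK_DISABLE:
            decisions.append("disable_BE")  # protect the latency-critical task
        elif slack > SLACK_GROW:
            decisions.append("grow_BE")     # ample slack: grow BE allocation
        else:
            decisions.append("hold")        # keep current core/LLC/power/net shares
        # time.sleep(POLL_INTERVAL_S)  # omitted so the sketch runs instantly
    return decisions
```

In the real system, each "grow"/"hold"/"disable" decision is delegated to per-resource subcontrollers (cores, cache, power, network), which is exactly what the convexity argument justifies.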

Prediction based Execution on Deep Neural Networks

Authors: Mingcong Song, Jiechen Zhao, Yang Hu, Jiaqi Zhang, and Tao Li
Venue: ISCA 2018

The authors present a technique to further reduce the computation in deep neural network inference, along with a scale-out design that achieves a 2.5X speedup over traditional accelerators and 1.9-2.0X over Cnvlutin/Stripes. The concept is based on skipping the computations of ineffectual output neurons (iEONs). To identify them, they first compute using only the upper bits of the operands to determine whether the result is likely to be non-negative. This is exceedingly elegant: the predictor is itself part of the computation, so no extra work is performed. Not only that, the technique requires no retraining and incurs no accuracy loss. The caveat is that the number of upper bits used for prediction must be determined experimentally (still no retraining, though). The challenge is that the iEONs are typically randomly dispersed throug...
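The high-bit prediction idea can be illustrated in a few lines. This is a hedged sketch of the general technique, not the paper's accelerator datapath: it approximates a neuron's pre-activation using only the top bits of each integer operand, and skips the full dot product when the sign is predicted negative (which ReLU would zero out anyway). The bit widths and function names are illustrative.

```python
def predict_sign_high_bits(weights, inputs, keep_bits=4, total_bits=8):
    """Approximate the pre-activation using only the top `keep_bits`
    of each `total_bits`-wide operand, and return the predicted sign."""
    shift = total_bits - keep_bits
    approx = sum((w >> shift) * (x >> shift) for w, x in zip(weights, inputs))
    return approx >= 0  # True: predicted effectual (survives ReLU)

def relu_neuron(weights, inputs, keep_bits=4):
    """Compute ReLU(w . x), skipping the full dot product for
    neurons predicted to be ineffectual (iEONs)."""
    if not predict_sign_high_bits(weights, inputs, keep_bits):
        return 0  # predicted ineffectual: skip the remaining computation
    full = sum(w * x for w, x in zip(weights, inputs))
    return max(full, 0)
```

Note how the high-bit partial product is a prefix of the full computation, which is what makes the prediction essentially free in hardware; a misprediction toward "effectual" only costs the work that would have been done anyway.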

Predicting inter-thread cache contention on a chip multi-processor architecture

Authors: Dhruba Chandra, Fei Guo, Seongbeom Kim, Yan Solihin
Venue: HPCA 2005

The authors present Prob, a model that predicts the performance implications of co-locating multiple threads on a CMP. The model takes stack distance profiles / circular sequence profiles as input. Using probability theory, Prob accurately predicts the cache miss rates of co-located programs with an average error of ~3.8%. Moreover, the model's accuracy degrades significantly only when the predicted performance impact is already very large and the real impact is even larger. This is the first work to model the effects of co-locating threads on a CMP, yet it is exceedingly accurate. However, the model only predicts the effects of co-location; it does not propose a remedy. Moreover, the study is done on a two-core system, which was state-of-the-art at the time of publication; in the many-core era the study would be interesting to re-exami...
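Prob's full probabilistic interleaving model is beyond a short sketch, but its input is easy to illustrate: for an LRU cache, a thread's solo miss rate can be read directly off its stack distance profile, since any access whose reuse distance exceeds the associativity misses. The function below is an illustrative sketch of that baseline computation, not the paper's contention model; the profile values are made up.

```python
def miss_rate_from_sdp(sdp, overflow, assoc):
    """Solo LRU miss rate from a stack distance profile.

    sdp[i]   -- number of accesses with stack distance i+1 (reuse within
                the i+1 most-recently-used lines of the set)
    overflow -- accesses with distance > len(sdp) (cold/capacity misses)
    assoc    -- cache associativity: distances <= assoc hit under LRU
    """
    total = sum(sdp) + overflow
    hits = sum(sdp[:assoc])
    return (total - hits) / total
```

Prob then perturbs these distances probabilistically, asking how likely a co-runner's interleaved accesses are to push a reuse past the associativity boundary, which is what turns a solo hit into a contention miss.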