Karl Taht's Research Paper Blog

Posts

Showing posts from October, 2018

Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems

October 30, 2018

Authors: Jiang Lin, Qingda Lu, .. P. Sadayappan et al.
Venue: HPCA 2008

The authors of this paper present an in-depth analysis and optimization of cache partitioning on a real system. They accomplish this by using OS page coloring, which induces only ~2% overhead. Since they state that the goal of their study is primarily analysis of the potential of cache partitioning, they subtract out this overhead. The authors show significant discrepancies compared to previous studies, which they attribute to simulations that are too short and use datasets that are too small. The real-system approach allows for much longer runs with larger datasets.

Benchmarks are divided into 4 categories:

Red: Highly sensitive to cache size (bzip2, mcf, omnetpp, astar, sphinx3, xalanc)
Yellow: Moderately sensitive (gcc, leslie3d, soplex, Gems, tonto, lbm, perl, cactus, h264)
Green: Marginally sensitive (bwaves, zeus, gromacs, povray, libq, wrf)

They create 27 workloads, each comprising two benchm...
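As a rough illustration of how page coloring maps physical pages onto cache regions, here is a minimal sketch. It is not the paper's implementation: the cache geometry (4 MiB, 16-way, 4 KiB pages) and the helper names `num_page_colors`, `page_color`, and `split_colors` are assumptions chosen only for this example.

```python
# Sketch of OS page coloring arithmetic (hypothetical cache geometry).

def num_page_colors(cache_bytes: int, ways: int, page_bytes: int) -> int:
    """Number of distinct page colors: bytes per cache way divided by page size."""
    way_bytes = cache_bytes // ways
    return way_bytes // page_bytes


def page_color(phys_addr: int, cache_bytes: int, ways: int, page_bytes: int) -> int:
    """Color of the physical page containing phys_addr."""
    page_number = phys_addr // page_bytes
    return page_number % num_page_colors(cache_bytes, ways, page_bytes)


def split_colors(n_colors: int, colors_for_a: int):
    """Give the first colors_for_a colors to process A and the rest to process B."""
    if not 0 <= colors_for_a <= n_colors:
        raise ValueError("colors_for_a must be between 0 and n_colors")
    colors = list(range(n_colors))
    return colors[:colors_for_a], colors[colors_for_a:]


if __name__ == "__main__":
    CACHE, WAYS, PAGE = 4 * 2**20, 16, 4096  # assumed geometry, not from the paper
    n = num_page_colors(CACHE, WAYS, PAGE)
    a, b = split_colors(n, n // 4)
    print(n, page_color(0x12345000, CACHE, WAYS, PAGE), len(a), len(b))
```

An OS that only hands a process physical pages whose color falls in its assigned set confines that process to the matching fraction of the cache, which is the general idea behind partitioning by page coloring.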

Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management

October 23, 2018

Authors: Canturk Isci, Gilberto Contreras, and Margaret Martonosi
Venue: MICRO 2006

The authors of this paper present a real-system framework which enables phase detection, phase prediction, and system reconfiguration. Phase detection is done using performance counters; more specifically, phases are classified based on their ratio of memory bus transactions to micro-ops retired. This ratio indicates how compute-bound or memory-bound an application is, and thus the DVFS setting can be adjusted accordingly. Phase prediction is done in a similar fashion to the TAGE branch predictor, using a global history table which tracks 1024 entries and a history of 8. The framework achieves an 18% EDP improvement with a 4% performance loss on average across SPEC 2000 benchmarks. Note that their phase detection framework and performance counter selection are geared specifically toward DVFS optimization, and are justified through analysis in the paper which demonstrates a specific relationship present....
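The classification and prediction steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the ratio thresholds are hypothetical, and the LRU replacement and last-value fallback on a table miss are choices made for this example, not details taken from the summary. Only the history length of 8 and the 1024-entry table come from the text.

```python
from bisect import bisect_right
from collections import OrderedDict, deque

# Hypothetical boundaries on (memory bus transactions / micro-ops retired).
THRESHOLDS = (0.005, 0.010, 0.015, 0.020, 0.030)


def classify_phase(mem_transactions: int, uops: int, thresholds=THRESHOLDS) -> int:
    """Map a counter sample to a phase id in 1..len(thresholds)+1."""
    ratio = mem_transactions / uops if uops else 0.0
    return bisect_right(thresholds, ratio) + 1


class PhaseHistoryPredictor:
    """Predict the next phase from the last `history_len` observed phases."""

    def __init__(self, history_len: int = 8, max_entries: int = 1024):
        self.history = deque(maxlen=history_len)
        self.table = OrderedDict()  # history tuple -> phase that followed it
        self.max_entries = max_entries

    def predict(self):
        if not self.history:
            return None  # nothing observed yet
        key = tuple(self.history)
        # Fall back to "same phase as last interval" on a table miss (assumption).
        return self.table.get(key, self.history[-1])

    def update(self, observed_phase: int) -> None:
        if len(self.history) == self.history.maxlen:
            key = tuple(self.history)
            self.table[key] = observed_phase
            self.table.move_to_end(key)
            if len(self.table) > self.max_entries:
                self.table.popitem(last=False)  # evict least recently updated
        self.history.append(observed_phase)
```

In use, each sampling interval would call `predict()` to choose a DVFS setting for the coming interval, then call `update()` with the phase actually observed.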

SHARP Control: Controlled Shared Cache Management in Chip Multiprocessors

October 22, 2018

Authors: Shekhar Srikantaiah, Mahmut Kandemir, Qian Wang
Venue: MICRO 2009

This paper presents a new scheme for dynamic cache partitioning of a shared LLC. SHARP control leverages control theory and separates the optimization into two layers: a local, per-core decision and a global, system-wide decision. Formal control theory provides performance guarantees, is resilient to minor inaccuracies, offers quick adaptive response, and allows high-level objectives to be easily specified. The authors even provide a sketch of a proof which includes time-varying behavior. Each per-core controller is a reinforced oscillation resistant controller, which dynamically adjusts its parameters based on the phase behavior of applications. The global decision is managed in two steps: the PAN controller allocates additional cache ways to prevent underutilization, whereas the SHARP controller decides where to remove cache ways when the system is oversubscribed. Significant experiment...

Modeling Performance Variation Due to Cache Sharing

October 16, 2018

Authors: Andreas Sandberg, Andreas Sembrant, Erik Hagersten and David Black-Schaffer
Venue: HPCA 2013

The authors of this paper present a modeling framework to predict cache contention when co-locating applications. The model is much lighter weight than previous work, and accurate to within 0.41% on average. The authors utilize a three-fold approach:

(1) A cache sharing model: predicts how much cache is used by an application
(2) A cache analysis tool (Cache Pirating): artificially reduces cache size
(3) A phase detection framework (ScarPhase): divides applications into phases

It should be noted that (1) and (2) can be done directly with what is now Intel RDT, which was not available at the time of publication. The authors show that co-location of applications exhibits extensive performance variability depending on alignment, particularly when applications exhibit extensive phase behavior. Therefore to predict performance of co-location, a u...

VM^3: Measuring, modeling and managing VM shared resources

October 15, 2018

Authors: Ravi Iyer, Ramesh Illikkal, Omesh Tickoo, Li Zhao, Padma Apparao, Don Newell
Venue: Computer Networks 2009

The authors of this paper seek to understand the importance of resource allocation in a VM/cloud environment. At the time of publishing, only time-multiplexing and core allocation isolated VMs from a performance standpoint, which they refer to as a Virtual Platform Architecture (VPA). The authors suggest that cache space, memory, bandwidth and power need to be virtualized as well. They focus on memory bandwidth and cache allocation. They motivate the problem by measuring the effects of resource contention and showing significant performance degradation. They then show that a simplistic model can make fairly accurate predictions of cache occupancy, MPI, and cache contention. Perhaps the most elegant part of the paper is the description of the cache and memory bandwidth monitoring and allocation technology, which I assume laid foundatio...

Long Term Parking (LTP): Criticality-aware Resource Allocation in OOO Processors

October 09, 2018

Authors: Andreas Sembrant, Trevor Carlson, Erik Hagersten, David Black-Schaffer, Arthur Perais, Andre Seznec, and Pierre Michaud
Venue: MICRO 2015

The authors of this paper explore the utility of large instruction queues, load-store queues, register files, and other processor structures. These resources significantly boost performance by leveraging ILP and MLP. However, allocating resources to instructions that are not yet ready to be executed wastes significant energy. The authors spend significant effort to determine that a half-size IQ (64->32), with a "Long Term Parking" structure for non-ready and non-urgent instructions, has negligible impact on performance. Furthermore, the authors find that a majority of this benefit can be obtained by parking non-urgent instructions only. The authors then propose a solution to leverage this benefit, and find a design which is 1% slower but has 40% lower E(D^2)P for MLP-sensitive applications, and 3% slower but 38% l...

A Phase Behavior Aware Dynamic Cache Partitioning Scheme for CMPs

October 09, 2018

Authors: Xiaofei Liao, Rentong Guo, Danping Yu
Venue: International Journal of Parallel Programming 2016

The authors present a novel dynamic cache partitioning mechanism based on the phase behavior of programs. They detect phases using an approach similar to that of Sembrant et al. To reduce the overhead further, they assume that the current phase will continue, and trigger a phase change only when the IPC deviates by more than a threshold. To partition the cache, they utilize their FractalMRC algorithm, which predicts the optimal cache partitioning via a miss-rate curve (MRC). If the phase has already been seen, then the MRC will be stored in the table. They show that overall their approach nets up to 21.4% performance improvement using SPEC 2006 benchmarks. The authors cite low overhead, ~1%-2% on average, in various configurations. However, it is unclear if this also factors in the overhead of the FractalMRC algorithm, which they state has an overhead of "less than 1s to c...
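A minimal sketch of the "assume the phase continues until IPC deviates" idea, combined with a table of previously computed miss-rate curves, might look like the following. The 10% threshold, the class and method names, and the `compute_mrc` callback are all hypothetical; the callback merely stands in for whatever produces an MRC and is not FractalMRC itself.

```python
class PhaseChangeDetector:
    """Flag a phase change only when IPC drifts past a relative threshold."""

    def __init__(self, threshold: float = 0.10):  # threshold value is an assumption
        self.threshold = threshold
        self.baseline_ipc = None

    def observe(self, ipc: float) -> bool:
        """Return True if this sample starts a new phase."""
        if self.baseline_ipc is None:
            self.baseline_ipc = ipc
            return True  # the first sample always opens a phase
        if self.baseline_ipc == 0:
            changed = ipc != 0
        else:
            changed = abs(ipc - self.baseline_ipc) / self.baseline_ipc > self.threshold
        if changed:
            self.baseline_ipc = ipc  # re-anchor on the new phase
        return changed


class MrcTable:
    """Cache of miss-rate curves keyed by phase id, so repeated phases skip recomputation."""

    def __init__(self, compute_mrc):
        self.compute_mrc = compute_mrc  # hypothetical callback: phase_id -> list of miss rates
        self.table = {}
        self.computations = 0

    def lookup(self, phase_id):
        if phase_id not in self.table:
            self.table[phase_id] = self.compute_mrc(phase_id)
            self.computations += 1
        return self.table[phase_id]
```

With this structure, the expensive MRC computation runs only when the detector reports a change into a phase that has not been seen before.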

Yukta: Multi-layer Resource Controllers to Maximize Efficiency

October 04, 2018

Authors: Raghavendra Pothukuchi, Sweta Pothukuchi, Petros Voulgaris, Josep Torrellas
Venue: ISCA 2018

This work targets optimization of different resources within a computer. The specific example used targets minimizing the energy-delay product via thread scheduling and DVFS on an Arm big.LITTLE board. This paper is by the same authors as "Using MIMO Formal Control to Maximize Resource Efficiency in Architectures". While the prior work synthesizes many simultaneous optimization problems into a single controller, this work separates out the controllers into coordinated multi-layer formal controllers, specifically Structured Singular Value (SSV) controllers. The SSV controllers offer the benefits of uncertainty guardbands for safety, max and min settings, and discrete value support, and allow for passing information between multiple controllers. They call their generic framework Yukta (possibly named after the 1999 Miss World winner). The key idea is that this ap...