Taming Performance Variability

Authors: Aleksander Maricq, Dmitry Duplyakin, Ivo Jimenez, Carlos Maltzahn, Ryan Stutsman, Robert Ricci
Venue: OSDI 2018

This paper performs an in-depth statistical analysis to understand the performance variability present in real systems. The goal is to quantify that variability and to understand how to tame it from both a researcher's and a cloud provider's perspective. To do so, the authors collect nearly 900,000 data points over the course of 10 months on real hardware.

A key insight is that the distribution of benchmark results is non-normal; as such, typical parametric analyses with closed-form solutions should not be applied. In fact, a typical analysis using the coefficient of variation (CoV) yields significantly different results than one that makes no assumptions about the distribution. The authors therefore use nonparametric techniques, built on order statistics around the median, to establish confidence intervals and error tolerances. From a researcher's perspective, the authors build a tool (CONFIRM) that performs this analysis on a given dataset and recommends the number of trials needed to establish a desired confidence bound. From a cloud provider's perspective, the same analysis can be used to determine how many machines (and which ones) to omit in order to achieve much tighter bounds on performance.
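To make the flavor of this analysis concrete, here is a minimal Python sketch of the two ideas: a distribution-free confidence interval for the median, and a resampling-based estimate of how many trials a given tolerance requires. This is not the authors' CONFIRM implementation; the function names, parameters, and the bootstrap-style trial estimator are my own stand-ins.

```python
import random
from scipy.stats import binom

def median_ci(samples, confidence=0.95):
    """Distribution-free confidence interval for the median.

    Uses the fact that the number of samples below the true median
    is Binomial(n, 0.5) to pick order statistics that bracket the
    median with (approximately) the desired coverage. No normality
    is assumed; exact coverage varies slightly with n because the
    binomial is discrete.
    """
    xs = sorted(samples)
    n = len(xs)
    alpha = 1.0 - confidence
    # Rank near n/2 - z * sqrt(n)/2; step one rank out to be conservative.
    k = int(binom.ppf(alpha / 2.0, n, 0.5))
    lo = xs[max(k - 1, 0)]
    hi = xs[min(n - k, n - 1)]
    return lo, hi

def trials_needed(samples, rel_tolerance=0.05, confidence=0.95,
                  max_trials=1000, resamples=200):
    """Estimate how many repetitions shrink the median's CI to within
    +/- rel_tolerance of the median, by resampling the observed data
    (a bootstrap-style stand-in for CONFIRM's recommendation).
    Assumes samples are positive benchmark measurements.
    """
    for n in range(10, max_trials + 1, 10):
        widths = []
        for _ in range(resamples):
            sub = sorted(random.choices(samples, k=n))
            lo, hi = median_ci(sub, confidence)
            med = sub[n // 2]
            widths.append((hi - lo) / med)
        if sum(widths) / len(widths) <= 2 * rel_tolerance:
            return n
    return None  # tolerance not reachable within max_trials
```

The median and order statistics are used here instead of the mean and standard deviation precisely because they remain valid for the skewed, multi-modal distributions the paper observes in real measurements.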

Full Paper

