Posts

Showing posts from January, 2019

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior

Authors: Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter Venue:   MICRO 2010 This paper presents Thread Cluster Memory Scheduling (TCM), a memory scheduling algorithm that optimizes for both system throughput and fairness. To achieve this, three key ideas are employed. First, threads are clustered as either bandwidth-intensive or non-intensive. The idea here is that low-bandwidth threads are more sensitive to latency. While an example is presented, an easier way to reason about this is that a low-bandwidth thread needs only a small fraction of memory service time to see a significant performance increase. In other words, it has a high ROI with minimal impact on other threads. As such, non-BW-intensive threads are always given the highest priority. The second observation to be exploited is the disparity in behavior among high-bandwidth threads. Specifically, a metric, niceness, measures a thread's susceptibility to interference and impact on ot...

AutoFDO: Automatic Feedback-Directed Optimization for Warehouse-Scale Applications

Authors: Dehao Chen, David Xinliang Li, Tipp Moseley Venue:    CGO 2016 This paper presents AutoFDO, a system that profiles warehouse-scale applications and feeds the results back into compilation of the next release. AutoFDO works by collecting profiles and storing them in an aggregate database, converting them into an intermediate representation, and finally providing that representation as feedback to the compiler. On average, this technique boosts performance by around 10% and works well even with stale releases. For profiles to be useful, they must be in an intermediate representation. This is built from a binary-level profile, which uses the LBR (last branch record) to derive instruction frequencies. By combining these frequencies, program counters, and the source code, a source profile can be generated. The source profile can be used to build an annotated call graph with edge frequencies, with some inaccuracy. These annotated call graphs can then be used as feedback to the compiler. Because of the size of the applications and scale of d...
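The binary-to-source conversion step can be sketched as below. This is a hypothetical simplification: the sample format and the address-to-line map are stand-ins for what debug info and LBR decoding actually provide.

```python
from collections import Counter

def build_source_profile(pc_samples, addr_to_line):
    """Aggregate raw program-counter sample counts into per-source-line
    frequencies, the shape of profile a compiler can consume.

    `pc_samples` maps instruction address -> sample count; `addr_to_line`
    maps address -> "file:line" (a toy stand-in for debug info)."""
    profile = Counter()
    for pc, count in pc_samples.items():
        line = addr_to_line.get(pc)
        if line is not None:  # addresses without debug info are dropped
            profile[line] += count
    return profile

# Usage: two hot addresses map to the same source line and are merged.
samples = {0x401000: 50, 0x401004: 30, 0x402000: 20}
addr_map = {0x401000: "foo.c:10", 0x401004: "foo.c:10", 0x402000: "bar.c:5"}
profile = build_source_profile(samples, addr_map)
```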

The SQL++ Query Language: Configurable, Unifying and Semi-Structured

Authors: Kian Win Ong, Yannis Papakonstantinou, Romain Vernoux Venue:    arXiv 2015 This work presents specifications for a new unified query language: SQL++. Modern databases have evolved beyond the structured tables of SQL into semi-structured formats comprising heterogeneous structures. These structures may include different scalar types, tuples, and ordered and unordered collections, all of which are common in JSON files. SQL++ provides backwards compatibility with SQL, but extends it with JSON features. SQL++ is flexible in that many operations, including equivalence, comparison, and null/missing references, can have explicitly configurable behaviors. Such configuration allows for portability to both SQL and NoSQL databases. The paper covers FROM, WHERE, GROUP BY, and SELECT. It is noted that the NoSQL space is relatively new and thus likely to continue to develop. Thus, the key contributions of the work are itemizing and rationalizing the options available to a language designer, and by exten...
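The "configurable semantics" idea can be illustrated with a toy model of equality under null/missing values. The config keys and the specific option values here are invented for the sketch; SQL++ itself defines its own set of configuration parameters.

```python
MISSING = object()  # sentinel for an absent attribute (JSON "missing")

def eq(a, b, config):
    """Compare two values under configurable null/missing semantics.

    Rather than hard-coding one rule, the result of comparing against
    null or missing is itself a parameter of the language."""
    if a is MISSING or b is MISSING:
        return config["missing_result"]
    if a is None or b is None:
        return config["null_result"]
    return a == b

# A SQL-style configuration: any comparison touching null yields
# "unknown" (modeled as None), and missing propagates as missing.
sql_config = {"null_result": None, "missing_result": MISSING}

# A hypothetical NoSQL-style configuration that coerces both to False,
# so predicates simply filter such rows out.
nosql_config = {"null_result": False, "missing_result": False}
```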

Dynamic Partitioning of Shared Cache Memory

Authors: G. E. Suh, L. Rudolph, S. Devadas Venue:    SuperComputing 2004 This paper was released around the era in which multi-core CPUs began to go mainstream. As a result, this is one of the first works to address resource partitioning, specifically LLC partitioning. The authors propose a framework that partitions the cache based on the marginal gain of allocating additional cache, assigning cache chunks (groups of blocks) according to this scheme. However, to minimize hardware overhead, they are only able to sample marginal gains at way granularity, which they note is one of the reasons their scheme performs sub-optimally. The results show a few outliers with significant gains (30%+), but excluding these, the results are lackluster. The significance of this work is primarily that it addresses the subject and emphasizes its importance for future systems. Full Text
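The marginal-gain idea can be sketched as a greedy allocator: repeatedly give the next cache way to whichever process would reduce its misses the most. The miss-curve numbers below are made up for illustration, and greedy allocation is only guaranteed optimal when the miss curves are convex.

```python
def partition_ways(miss_curves, total_ways):
    """Greedy way allocation by marginal gain.

    miss_curves[p][w] = miss count for process p when given w ways;
    each curve must have total_ways + 1 entries. Returns the number of
    ways allocated to each process."""
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        # Marginal gain of one more way = reduction in misses.
        gains = [curve[alloc[p]] - curve[alloc[p] + 1]
                 for p, curve in enumerate(miss_curves)]
        winner = max(range(len(gains)), key=lambda p: gains[p])
        alloc[winner] += 1
    return alloc

# Usage: process 0 saturates after two ways; process 1 keeps benefiting.
curves = [[100, 40, 30, 28, 27],
          [100, 90, 60, 30, 10]]
alloc = partition_ways(curves, total_ways=4)
```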

Attack of the Killer Microseconds

Authors: Luiz Barroso, Mike Marty, David Patterson, and Parthasarathy Ranganathan Venue:    CACM (Magazine) This work shows that while architects have tackled problems at the nanosecond scale, and software has optimized for the millisecond scale, there exists a significant void in microsecond-scale optimization. The result is that many operations which take on the order of a few microseconds degenerate into millisecond operations, or at least lose an order of magnitude or more in efficiency. One such example is an RDMA transfer, which by itself is only a ~2us operation. However, after dispatching, kernel scheduling, interrupt-based notifications, and a TCP/IP stack, the transfer balloons to ~75us. The authors imply that this area is ripe for research to optimize warehouse-scale computing. Such areas include reduced lock contention, lower-overhead interrupt handling, improved scheduling, and hardware offloading in the mi...
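The RDMA example above can be made concrete with back-of-the-envelope arithmetic. The per-layer overheads below are illustrative placeholders chosen to sum to the ~75us total, not measurements from the paper.

```python
# A ~2us raw RDMA transfer inflated by software layers into ~75us.
raw_rdma_us = 2
overheads_us = {
    "dispatch": 5,            # placeholder
    "kernel scheduling": 16,  # placeholder
    "interrupt handling": 12, # placeholder
    "TCP/IP stack": 40,       # placeholder
}
total_us = raw_rdma_us + sum(overheads_us.values())
slowdown = total_us / raw_rdma_us  # ~37x over the raw transfer
```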

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Authors: Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica Venue: NSDI 2011 Mesos is a thin management layer that allows various cluster computing frameworks to efficiently share resources. The two key principles leveraged by Mesos are its fine-grained resource sharing model at the level of tasks, and a distributed (decentralized) scheduling mechanism. The result is a framework which offers better overall system utilization, scalability to at least 50,000 nodes, and flexibility to port to many different (and future) frameworks. Fine-grained resource sharing is done via a notion of resource offers, each of which is a list of free resources on multiple slaves. The master decides how many resources to offer each framework; this distribution is handled by a pluggable allocation module. A scheduler registers with the master to be offered resources, and an executor process is launched on slave nodes to run the framework'...
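The offer mechanism can be sketched as below. The class and method names are illustrative, and the allocation policy here is a trivial round-robin rather than the fair-sharing modules Mesos actually ships.

```python
class Framework:
    """A framework's scheduler: decides which offered resources to accept."""
    def __init__(self, name, cpus_per_task):
        self.name = name
        self.cpus_per_task = cpus_per_task
        self.launched = []  # slaves where a task was launched

    def resource_offer(self, offers):
        """Callback from the master: accept offers that fit a task."""
        accepted = []
        for slave, free_cpus in offers:
            if free_cpus >= self.cpus_per_task:
                self.launched.append(slave)
                accepted.append((slave, self.cpus_per_task))
        return accepted

def master_allocate(frameworks, cluster):
    """Stand-in for the pluggable allocation module: offer all free
    resources to each framework in turn and record what it accepted."""
    for fw in frameworks:
        offers = [(s, c) for s, c in cluster.items() if c > 0]
        for slave, used in fw.resource_offer(offers):
            cluster[slave] -= used

# Usage: one framework accepts a 2-CPU task on each slave that fits.
cluster = {"slave1": 4, "slave2": 2}
fw = Framework("spark", cpus_per_task=2)
master_allocate([fw], cluster)
```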

Real Time Power Estimation and Thread Scheduling via Performance Counters

Authors: Karan Singh, Major Bhadauria, Sally A. McKee Venue:    ACM SIGARCH Computer Architecture News 2009 This paper presents a methodology for real-time power estimation via performance counters. The study is done entirely on real hardware. The work characterizes power usage into four buckets: FP units, memory, stalls, and instructions retired, based on the overall area of the chip itself. They apply Spearman's rank correlation to the data to choose the best performance counter from each bucket. Four counters (this paper seems to predate perf event multiplexing) are selected as inputs to piecewise functions that approximate power utilization, obtaining a median error of 3.9-7.2% on different benchmark suites. The work then uses the power estimates to build a proof-of-concept thread scheduler as a user-space program. Each application's current power draw can be estimated, and thus if the target is exceeded, an application can be unscheduled or replaced...
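The shape of such a counter-based model can be sketched as a piecewise-linear function over the four buckets. Every coefficient, counter name, and the single breakpoint below are made-up placeholders; the paper fits its coefficients offline against measured power.

```python
def estimate_power(counters, lo_weights, hi_weights, base_watts, breakpoint):
    """Piecewise-linear power estimate from per-cycle counter rates.

    Below `breakpoint` of total activity, the low-activity weights apply;
    above it, the high-activity weights do. All values are illustrative."""
    activity = sum(counters.values())
    weights = lo_weights if activity < breakpoint else hi_weights
    return base_watts + sum(weights[k] * v for k, v in counters.items())

# One counter per bucket, expressed as events per cycle (placeholders).
counters = {"fp_ops": 0.10, "mem_access": 0.05, "stalls": 0.30, "retired": 1.2}
lo = {"fp_ops": 4.0, "mem_access": 10.0, "stalls": 1.0, "retired": 2.0}
hi = {"fp_ops": 6.0, "mem_access": 14.0, "stalls": 1.5, "retired": 2.5}
watts = estimate_power(counters, lo, hi, base_watts=20.0, breakpoint=1.0)
```

A user-space scheduler in the paper's spirit would poll such estimates per application and deschedule one when the sum exceeds a power target.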