Project News

Many-Core Programs for Single-Brained Programmers

Created on Tuesday, 31 January 2012

The many-core era is changing the way we write programs. In fact, by increasing the number of cores rather than their clock frequency to deliver higher performance without increasing energy consumption [1], many-cores make concurrency a de facto principle that programmers must take into account.

Hitherto, concurrency was considered merely an optional source of performance improvement, appealing mainly to high-performance computing experts or for embarrassingly parallel applications that could be ported to parallel machines without much effort. Yet, manufacturers have just completed the radical shift from sequential to parallel computing by turning all mainstream computing devices into multi-core platforms. In 2011, even tablets and phones became multi-core: the Kindle Fire embeds a dual-core processor from Texas Instruments, and the iPad 2, iPhone 4S and Galaxy S II embed dual-core ARM-based processors, all running at frequencies close to 1 GHz.

Towards Contention-Friendly Programming

One of the biggest challenges software engineers face is limiting the tremendous rise of contention when turning everyday applications into concurrent ones. These applications differ from regular scientific ones, where data can easily be segregated among threads (or processes). Many existing applications thus have to be adapted to leverage many-cores; for example, recent web browsers (Firefox, Chrome) run the scripting of a web page in a separate thread for enhanced fault tolerance. This concurrency has severe drawbacks when concurrent threads have to exchange information or aggregate results, as the corresponding synchronization induces contention. Choosing the right data structure for synchronization is thus crucial to minimize the effect of contention.

The impact of contention on the performance of data structures is generally underestimated. Big-O notation has been an important metric for upper-bounding the number of computational steps needed to access a data structure pessimistically, but it does not apply to speculative executions [2]. In short, the number of steps needed to complete a speculative access depends on the unsuccessful attempts that precede it. Contention is perhaps an even more important metric in many-cores. For example, contention impacts performance so much that it can make data structures with the smallest complexity---O(1) access time---irrelevant in the presence of concurrency, achieving lower throughput than if they were accessed sequentially [3].

To leverage concurrency in applications, we first have to rethink existing data structures to tolerate the contention induced by multiple cores running irregular applications. Consider a balanced search tree: why enforce the balance invariant to guarantee O(log n) complexity when rebalancing may induce enough contention to kill performance? In many-cores, binary search trees should ideally trade this O(log n) complexity away under contention peaks and eventually self-stabilize back to logarithmic complexity under low contention. The same contention-friendly methodology should apply to all sorts of structures, so that they benefit from concurrency rather than suffer from it.

Democratizing Concurrent Programming

A second requirement is to give Joe the programmer a chance to derive his own concurrent application. The challenge here is more about the simplicity of concurrent programming than about contention-friendliness. Much effort has been spent on new programming techniques that simplify synchronization. Transactional memory is a programming paradigm already supported in most common programming languages, including C/C++ and Java [4] but also Scala, Chapel, etc. It overcomes both the difficulty of debugging deadlocks when using locks and the limitation of synchronizing with only a single-word compare-and-swap primitive.

Transactional memory has been known to be promising for cache-coherent multi-cores for a while now. Not only was it proved efficient when implemented in software [5], it has even been adopted in hardware for the new generation of IBM supercomputers, namely Blue Gene/Q, and Intel has just publicly announced a TM-oriented instruction set extension [6].

Only recently was the TM paradigm ported to many-cores [7]. This promising result indicates that TM programming does not require cache coherence but can exploit fast network-on-chip communication to scale irregular applications. In this context, TM not only hides complex synchronization primitives from programmers but also emulates an asynchronous message-passing environment behind a simple read/write interface. What this really means for Joe the programmer is that he simply has to:

  • write a sequential application,
  • delimit regions of the code that should execute atomically in "__transaction{ }" compound statements and
  • compile it with a TM-compliant compiler.

...and his application is ready to execute on a shared-memory multi-core, a non-cache-coherent many-core, a cluster of workstations or a huge data center.

What's Next?

The internals of future operating systems and of their everyday applications will be crucially different from what we are used to. This difference is necessary to sustain the performance rise when switching to more heterogeneous and more concurrent architectures. In particular, we believe that contention-friendliness and simpler synchronization paradigms are two research directions that are essential to pursue if we target both the performance and the democratization of concurrent programming. These two directions can be pursued hand-in-hand by defining a programming methodology for writing modular and efficient concurrent programs that other programmers can rely upon, as easy-to-reason-about software libraries, to write their own applications.

Vincent Gramoli

References

[1] S. Borkar. Thousand core chips: a technology perspective. DAC 2007.

[2] T. Crain, V. Gramoli, M. Raynal. A Speculation-Friendly Binary Search Tree. PPoPP 2012.

[3] N. Shavit. Data Structures in the Multi-Core Age. Commun. ACM 2011.

[4] Y. Afek et al. The Velox Transactional Memory Stack. IEEE Micro 2011.

[5] A. Dragojevic, P. Felber, V. Gramoli, R. Guerraoui. Why STM can be more than a Research Toy. Commun. ACM 2011.

[6] Transactional Synchronization in Haswell. Intel Software Blog, February 2012. http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell

[7] V. Gramoli, R. Guerraoui, V. Trigonakis. TM2C: A Software Transactional Memory for Many-Cores. EuroSys 2012.

Sponsored under FP7-ICT-2009.8.1, Grant Agreement No. 248465.
Copyright 2012