This week, my student Nadesh Ramanathan presents our FPGA 2017 paper “Hardware Synthesis of Weakly Consistent C Concurrency”, a piece of work jointly done with John Wickerson and Shane Fleming.
High-Level Synthesis, the automatic mapping of programs – typically C programs – into hardware, has had a lot of recent success. The basic idea is straightforward to understand, but difficult to do: automatically convert a vanilla C program into hardware, extracting parallelism, making memory decisions, etc., as you go. As these tools gain industry adoption, people will begin using them not only for code originally specified as sequential C, but for code specified as concurrent C.
There are a few tricky issues to deal with when mapping concurrent C programs into hardware. One approach, which seems modular and therefore scalable, has been adopted by LegUp: schedule threads independently and then build a multithreaded piece of hardware out of multiple hardware threads. This all works fine, indeed there is an existing pthreads library for LegUp. The challenge comes when there’s complex interactions between these threads. What if they talk to each other? Do we need to use locks to ensure synchronisation?
In the software world, this problem has been well studied. The approach proposed by Lamport was to provide the programmer with a view of memory known as “sequentially consistent” (SC). This is basically the intuitive way you would expect programs to execute. Consider the two threads below, one on the left and one on the right, each synthesised by an HLS tool. The shared variables
y are both initialised to zero. The assertion is not an unreasonable expectation from a programmer: if
r0 = 0, it follows that Line 2.3 has been executed (as otherwise
r0 = -1). We can therefore conclude that Line 1.2 executed before Line 2.2. It’s reasonable for the programmer to assume, therefore that Line 1.1 also executed before Line 2.3, but then x
= 1 when it is read on Line 2.3, not zero! Within a single thread, dependence analysis implemented as standard in high-level synthesis would be enough to ensure consistency with the sequential order of the original code, by enforcing appropriate dependences. But not so in the multi-threaded case! Indeed, putting this code into an HLS tool does indeed result in cases where the assertion fails.
My PhD student’s paper shows that we can fix this issue neatly and simply within the modular framework of scheduling threads independently, by judicious additional dependences before scheduling. He also shows that you can improve the performance considerably by supporting the modern (C11) standard for atomic memory operations, which come in a variety of flavours from full sequential consistency to the relaxed approach natively supported by LegUp pthreads already. In particular, he shows for the first time that on an example piece of code chaining circular buffers together that you can get essentially near-zero performance overhead by using the so-called acquire / release atomics defined in the C11 standard as part of a HLS flow, opening the door to efficient synthesis of lock-free concurrent algorithms on FPGAs.
As FPGAs come of age in computing, it’s important to be able to synthesise a broad range range of software, including those making use of standard concurrent programming idioms. We hope this paper is a step in that direction.