Arithmetic Workshop at Imperial

This week my research group had the pleasure of hosting Prof Miloš Ercegovac from UCLA, one of the giants of the field of computer arithmetic. Prof Ercegovac had come at my invitation to deliver a short three-day course on computer arithmetic to approximately 40 delegates from Imperial College and elsewhere, including a significant delegation from UK industry.


I first encountered Miloš’s work when I was a PhD student reading around my topic, but I didn’t have the opportunity to make use of his work directly until 2014, when I published a paper with my former PhD student Kan Shi (now with Intel), demonstrating that a digit-parallel form of Online Arithmetic can be used in the context of approximate computing, to provide an unexplored trade-off between clock frequency and arithmetic error. I later developed an architecture based on online arithmetic for arbitrary-precision computation with my BEng student Aaron Zhao (now at Cambridge).

The title of Miloš’s lectures was “An Enduring Pillar of Computer Arithmetic: Redundancy in Representation and its Uses in Algorithms and Implementations”.  I hope that in this brief blog post, I can provide you with an overview of the topics that were discussed at our workshop, aimed at a general technical reader. The best reference for most of the material, should you wish to dig further, is the book by Ercegovac and Lang.

Day 1 kicked off with a discussion of redundancy in representation (the idea that you can have more than one representation of a given number), its realisation in positional radix number systems, and the implications for addition. As a simple example, imagine working in decimal, but allowing your digits to range from -9 to 9 instead of 0 to 9 as usual. This is clearly redundant because, for example, 18 = 2\bar{2} where the bar over a digit indicates a negative digit. But it also has the remarkable property that addition can be performed without inducing carries, leading to the potential for very significant parallelism. I particularly enjoyed that Miloš was able to trace this idea back to the 1700s, in a paper by John Colson, published in the Philosophical Transactions of the Royal Society, entitled “A Short Account of Negativo-affirmative Arithmetick”.
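To make the carry-free property concrete, here is a small C sketch of signed-digit addition (my own illustration, not code from the lectures). The key point is that each output digit depends only on its own position and its immediate less-significant neighbour, so all digits can be produced in parallel.

#include <stdio.h>

/* Carry-free addition in radix 10 with the redundant digit set {-9,...,9}.
 * Digits are stored least-significant first; the result has n+1 digits.
 * Each z[i] depends only on positions i and i-1 of the inputs, so in
 * hardware every output digit can be computed in parallel. */
void sd_add(const int *x, const int *y, int *z, int n) {
    int t[n + 1], w[n];            /* transfers in {-1,0,1}, interim sums in [-8,8] */
    t[0] = 0;
    for (int i = 0; i < n; i++) {  /* independent for each position */
        int p = x[i] + y[i];       /* p in [-18,18] */
        if (p >= 9)       { t[i + 1] =  1; w[i] = p - 10; }
        else if (p <= -9) { t[i + 1] = -1; w[i] = p + 10; }
        else              { t[i + 1] =  0; w[i] = p;      }
    }
    for (int i = 0; i < n; i++)    /* also independent for each position */
        z[i] = w[i] + t[i];        /* always stays within [-9,9] */
    z[n] = t[n];                   /* possible extra leading digit */
}

int main(void) {
    int x[] = {8, 1}, y[] = {8, 1}, z[3];    /* 18 + 18 */
    sd_add(x, y, z, 2);
    printf("%d %d %d\n", z[0], z[1], z[2]);  /* prints 6 3 0, i.e. 36 */
    return 0;
}

The redundancy is what buys this independence between digit positions: supplying each input as 2\bar{2} instead of 18 yields the digits 4\bar{4}, a different but equally valid representation of 36.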

Day 2 covered Miloš’s Online Arithmetic, referred to earlier. As I tell my students: when adding or multiplying two numbers, if the two numbers are big, you expect a big result, right? So why not use that information – why wait until you’ve added all the partial products from the least-significant digits before producing the most-significant digit? Usually, we must wait because a carry from the least-significant digits can throw off the most-significant one. Not so with online arithmetic – taking advantage of redundancy allows one to produce multiplication and addition (and other) results most-significant-digit-first.
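To give a flavour of how this works, here is a small sketch of online addition (again my own illustration, reusing the digit set and decomposition from the example above). In this sketch, each digit of the sum is emitted after seeing only one further digit position of the inputs.

#include <stdio.h>

/* Online (most-significant-digit-first) addition in radix 10 with signed
 * digits {-9,...,9}.  Input digits x[j], y[j] arrive MSD first, with weight
 * 10^-(j+1); output digit z[j] has weight 10^-j, so z[0] is an integer digit
 * that absorbs any overflow.  Digit z[j] is emitted as soon as input
 * position j has been seen -- an online delay of one digit. */
void online_add(const int *x, const int *y, int *z, int n) {
    int w_prev = 0;                        /* interim sum held back one step */
    for (int j = 0; j < n; j++) {
        int p = x[j] + y[j];               /* p in [-18,18] */
        int t, w;                          /* p = 10*t + w, t in {-1,0,1}, w in [-8,8] */
        if (p >= 9)       { t =  1; w = p - 10; }
        else if (p <= -9) { t = -1; w = p + 10; }
        else              { t =  0; w = p;      }
        z[j] = w_prev + t;                 /* emit the digit one position up */
        w_prev = w;
    }
    z[n] = w_prev;                         /* final, least-significant digit */
}

int main(void) {
    int x[] = {9, 9}, y[] = {0, 2}, z[3];  /* 0.99 + 0.02 */
    online_add(x, y, z, 2);
    printf("%d.%d%d\n", z[0], z[1], z[2]); /* prints 1.01 -- the leading 1 appears first */
    return 0;
}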

Day 3, the final day of our workshop, ended with a discussion of high-radix division, explaining some interesting ways that extra performance can be squeezed out of division by working in higher radices.

I would recommend that anyone interested in arithmetic get hold of Ercegovac and Lang and explore these ideas in more depth. With the rise of custom and accelerator architectures, there has never been a better time to re-explore some of the assumptions underlying our arithmetic hardware. Inspired readers, feel free to contact me about PhD or postdoc opportunities in this area!

 

DATE 2017: Some Personal Highlights

This week I was away at the 2017 Design, Automation and Test in Europe Conference (DATE) in Lausanne, Switzerland, where my collaborator and former staff member Kevin Murray was presenting our joint work with Andrea Suardi and Vaughn Betz.


In this blog post I note a couple of personal highlights from the conference.

Apart from Kevin’s paper, which I describe in accessible form in a previous blog post, Jie Han and Marc Riedel’s tutorial on stochastic computation was thought-provoking. It’s an old idea, which I’ve seen a few times, but it was interesting to catch up with the recent thinking in this area: results on synthesis using Bernstein polynomials and a de-randomisation of the logic.

The theme of approximate computing pervaded the conference. Newcastle had an interesting paper on approximate multiplication in which multiple partial products are compressed by replacing (implicit) summation with Boolean OR operations. This complements more traditional approaches like those I’ve worked on, where we simply throw away partial products. It seems like there could be a few interesting generalisations of this approach.
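To illustrate the flavour of the idea, in a much cruder form than the paper’s design, here is a toy C function of my own that forms the partial products of a 4-bit multiplication but merges them with bitwise OR rather than addition:

#include <stdint.h>
#include <stdio.h>

/* Toy approximate multiplier: OR the shifted partial products together
 * instead of summing them.  Much cheaper than an adder tree, but lossy
 * whenever two partial products have overlapping set bits. */
uint16_t approx_mul4(uint8_t a, uint8_t b) {
    uint16_t acc = 0;
    for (int i = 0; i < 4; i++)
        if ((b >> i) & 1)
            acc |= (uint16_t)(a & 0xF) << i;   /* OR in place of addition */
    return acc;
}

int main(void) {
    printf("approx 3*3 = %u (exact 9)\n", approx_mul4(3, 3));   /* prints 7 */
    return 0;
}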

It was also, of course, great to see the many colleagues I often meet at these conferences, to spend some time in the laboratory of Paolo Ienne, and to meet his PhD student Lana Josipovic, who is doing some very interesting work on high-level synthesis.

EPFL also has possibly the best underpass in the world.

DATE remains one of the best conferences to catch up with the various diverse directions in which EDA researchers have gone in recent years.

Overclocking For Fun and Profit

This week at the Design, Automation and Test in Europe (DATE) conference, Kevin Murray is presenting some exciting work I’ve done in collaboration with Kevin, his supervisor Vaughn Betz at the University of Toronto, and Andrea Suardi at Imperial College.

I’ve been working for a while on the idea that one form of approximate computing derives from circuit overclocking. The idea is that if you overclock a circuit then this may induce some error. However, the error may be small or rare, despite a very significant performance enhancement. We’ve shown, for example, that such trade-offs make sense for image processing hardware and – excitingly – that the trade-offs themselves can be improved by adopting “overclocking-friendly” number representations.

In the work I’ve done on this topic to date, the intuition that a given circuit is “overclocking friendly” for a certain set of input data has been a human one. In this latest paper we move to an automated approach.

Once we accept the possibility of overclocking, our circuit timing analysis has to change completely – we can no longer be content with bounding the worst-case delay of a circuit, because we’re aiming to violate this worst case by design. What we’re really after is a histogram of timing-critical paths – allowing us to answer questions like “what’s the chance that we’ll see a critical path longer than this in any given clock period?” Different input values and different switching activities sensitise different paths, leading to different timing on each clock cycle.
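As a toy illustration of what such a histogram captures (a sketch of my own, not the method in the paper, which computes these probabilities exactly), one can estimate by random sampling how the longest carry chain excited in an 8-bit ripple-carry adder is distributed, using chain length as a crude stand-in for the delay of the sensitised path:

#include <stdio.h>
#include <stdlib.h>

/* Length of the longest run of full-adder stages traversed by a carry for a
 * given input pair -- a crude proxy for the sensitised path delay. */
static int longest_carry_chain(unsigned a, unsigned b, int bits) {
    int carry = 0, chain = 0, longest = 0;
    for (int i = 0; i < bits; i++) {
        int ai = (a >> i) & 1, bi = (b >> i) & 1;
        if (ai & bi)                 { carry = 1; chain = 1; }  /* generate  */
        else if ((ai ^ bi) && carry) { chain++; }               /* propagate */
        else                         { carry = 0; chain = 0; }  /* kill      */
        if (chain > longest) longest = chain;
    }
    return longest;
}

int main(void) {
    enum { BITS = 8, SAMPLES = 1000000 };
    int hist[BITS + 1] = {0};
    srand(0);
    for (int s = 0; s < SAMPLES; s++) {
        unsigned a = rand() & 0xFF, b = rand() & 0xFF;
        hist[longest_carry_chain(a, b, BITS)]++;       /* build the histogram */
    }
    for (int k = 0; k <= BITS; k++)
        printf("P(chain length = %d) ~= %.4f\n", k, (double)hist[k] / SAMPLES);
    return 0;
}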

This paper’s fundamental contribution is to show that the #SAT problem can be used efficiently to quantify these probabilities, giving rise to the first method for determining, at synthesis time, the affinity of a given circuit to approximation-by-overclocking.

Concurrent Programming in High-Level Synthesis

This week, my student Nadesh Ramanathan presents our FPGA 2017 paper “Hardware Synthesis of Weakly Consistent C Concurrency”, a piece of work jointly done with John Wickerson and Shane Fleming.

High-Level Synthesis, the automatic mapping of programs – typically C programs – into hardware, has had a lot of recent success. The basic idea is straightforward to understand, but difficult to do: automatically convert a vanilla C program into hardware, extracting parallelism, making memory decisions, etc., as you go. As these tools gain industry adoption, people will begin using them not only for code originally specified as sequential C, but for code specified as concurrent C.

There are a few tricky issues to deal with when mapping concurrent C programs into hardware. One approach, which seems modular and therefore scalable, has been adopted by LegUp: schedule threads independently and then build a multithreaded piece of hardware out of multiple hardware threads. This all works fine; indeed, there is an existing pthreads library for LegUp. The challenge comes when there are complex interactions between these threads. What if they talk to each other? Do we need to use locks to ensure synchronisation?

In the software world, this problem has been well studied. The approach proposed by Lamport was to provide the programmer with a view of memory known as “sequentially consistent” (SC). This is basically the intuitive way you would expect programs to execute. Consider the two threads below, each synthesised by an HLS tool. The shared variables x and y are both initialised to zero. The assertion is not an unreasonable expectation from a programmer: if r0 = 0, it follows that Line 2.3 has been executed (as otherwise r0 = -1). We can therefore conclude that Line 1.2 executed before Line 2.2. It’s reasonable for the programmer to assume, therefore, that Line 1.1 also executed before Line 2.3 – but then x would be 1 when it is read on Line 2.3, not zero! Within a single thread, the dependence analysis implemented as standard in high-level synthesis would be enough to ensure consistency with the sequential order of the original code, by enforcing appropriate dependences. But not so in the multi-threaded case! Indeed, putting this code into an HLS tool does result in cases where the assertion fails.

 

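The original post shows the code as an image; the following is my reconstruction of the two threads, consistent with the description above (it may differ in detail from the actual figure). Note that x and y are plain, non-atomic shared variables, which is precisely the problem; compiled as ordinary software, this program also contains a data race.

#include <assert.h>
#include <pthread.h>

int x = 0, y = 0;              /* shared variables, both initialised to zero */

void *thread1(void *arg) {
    x = 1;                     /* Line 1.1 */
    y = 1;                     /* Line 1.2 */
    return NULL;
}

void *thread2(void *arg) {
    int r0 = -1;               /* Line 2.1 */
    if (y == 1)                /* Line 2.2 */
        r0 = x;                /* Line 2.3 */
    assert(r0 != 0);           /* if we saw y == 1, we expect to see x == 1 too */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}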

My PhD student’s paper shows that we can fix this issue neatly and simply within the modular framework of scheduling threads independently, by adding judicious extra dependences before scheduling. He also shows that you can improve performance considerably by supporting the modern (C11) standard for atomic memory operations, which come in a variety of flavours, from full sequential consistency to the relaxed approach natively supported by the existing LegUp pthreads library. In particular, he shows for the first time that, on an example piece of code chaining circular buffers together, you can get essentially zero performance overhead by using the so-called acquire/release atomics defined in the C11 standard as part of an HLS flow, opening the door to the efficient synthesis of lock-free concurrent algorithms on FPGAs.
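For readers unfamiliar with C11 atomics, here is a minimal, software-only sketch of the acquire/release idiom (my own example, not the chained circular-buffer benchmark from the paper): the release store on the flag guarantees that the earlier write to data is visible to any thread whose acquire load observes the flag set, without paying for full sequential consistency.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

int data = 0;            /* plain shared data    */
atomic_int ready = 0;    /* synchronisation flag */

void *producer(void *arg) {
    data = 42;                                                  /* ordinary write */
    atomic_store_explicit(&ready, 1, memory_order_release);     /* publish it     */
    return NULL;
}

void *consumer(void *arg) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                /* spin until the flag is seen */
    printf("%d\n", data);                /* guaranteed to print 42      */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}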

As FPGAs come of age in computing, it’s important to be able to synthesise a broad range of software, including code making use of standard concurrent programming idioms. We hope this paper is a step in that direction.

POPL and VMCAI 2017: Some Highlights

Last week I was in Paris attending POPL 2017 and co-located events such as VMCAI. This was my first time at POPL, prompted by acceptance of work with John Wickerson. Here are some highlights and observations from my own, biased, position as an outsider to the programming languages community.

Polyhedral analysis

Due to my prior work with linear programming and polyhedral analysis, the polyhedral talks were particularly interesting to me.

Maréchal presented an approach to condense polyhedral representations through ray tracing, which seems to be of general interest.

While polyhedral methods in general represent a set as \{x | Ax \leq b\}, where both A and b are parameters, there are various restricted approaches which gain efficiency by fixing A and allowing only b as a parameter. Seladji presented a technique to try to discover good “templates” of this kind via Principal Component Analysis. I suspect that this is equivalent to an ellipsoidal approximation of the polyhedron in some sense.

Singh, Püschel, and Vechev presented an interesting method to speed up polyhedral abstract domain computations by partitioning, resulting in a much faster tool.

SMT Solvers

An interesting paper from Jovanović presented a nonlinear integer arithmetic solver which looks genuinely very useful for a broad class of applications.

 

Heroics with Theorem Provers

Lots of work was presented making use of Coq, some of which seemed fairly heroic to me. Of particular interest, given my work on non-negative polynomials for bounding roundoff error, was a Coq tactic from Martin-Dorel and Roux that allows the use of (necessarily approximate) floating-point computation in ways I’ve not seen before: essentially, writing a sum-of-squares polynomial p \approx z^T Q z exactly as p = z^T Q z + z^T E z = z^T (Q + E) z and then showing that Q + E is positive semidefinite for the resulting small E.

Weak Memory

John presented our work on automatically comparing memory models, which seemed to be well received.

img_9144

 

Numerics

Chiang et al. presented a paper on rigorous floating-point precision tuning. The idea is simple but elegant: perform error analysis with the individual precisions as symbolic variables and then use a global optimisation solver to look for a high-performance implementation. Back in 2001, I did a very similar thing, but for a very limited class of algorithm (LTI systems), in fixed-point rather than floating-point, and using noise power rather than worst-case instantaneous error as the metric of accuracy. However, the general ingredients (individual precisions, an explicit optimisation) are there, and it is great to see this kind of work now appearing in a very different context.

 

Cost Analysis

There were a couple of papers on automated cost analysis [Madhavan et al., Hoffmann et al.] that I found particularly interesting.

 

Community differences

As an outsider, I had the opportunity to ponder some interesting cultural differences from which the various research communities I’m involved with could potentially learn.

Mentoring: One of the co-located workshops was PLMW, the Programming Languages Mentoring Workshop. This had talks ranging from “Time management, family, and quality of life” to “Machine learning and programming languages”. Introductory material is mixed with explicitly nontechnical content valuable to early-career researchers. Co-locating with the major conference of the field, I would imagine, makes it relatively easy to pull in very high-quality speakers.

Reproducibility: POPL, amongst other CS conferences, encourages the submission of artifacts. I am a fan of this process. While it doesn’t quite hit the holy grail of research reproducibility, it takes a definite step in this direction.

Time is Precision

Most modern FPGA arithmetic designs use bit-parallel binary arithmetic – typically two’s complement for signed computation. This generally makes for fast arithmetic, but has the distinct disadvantage that silicon area scales with the precision of computation. Occasionally – much more often in the distant past, when FPGA area was a precious resource – people compute with bit-serial binary arithmetic. In this case, time scales with precision.

One problem with digit-serial binary arithmetic for multiplication and division is that you need to know, in advance, the location of your least significant digit: although the computation unfolds over time, the precision must be fixed a priori.

Miloš Ercegovac’s online arithmetic forms an interesting counterpoint to this: in this arithmetic, digits are produced most-significant-digit-first. This suggests that the longer we compute, the more precise our answer will be – and we can terminate whenever we’ve got a good enough answer. Or run out of time. The problem with this approach is that the hardware arithmetic units generally implemented for digit-serial multiplication or division still scale with precision requirements, placing an unwanted a priori bound on computational precision.
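A small sketch (my own) of why MSD-first digit streams make “good enough” termination easy: with radix-10 signed digits, the digits not yet consumed can change the value by at most the weight of the digit just consumed, so we can stop as soon as that bound drops below our accuracy target.

#include <stdio.h>

/* Consume a most-significant-digit-first stream of signed decimal digits
 * d[k] (weight 10^-(k+1)) and stop early once the unseen tail, which is
 * bounded by the weight of the digit just consumed, is below tol. */
double consume_msd_first(const int *d, int n, double tol) {
    double value = 0.0, weight = 0.1;
    for (int k = 0; k < n; k++, weight *= 0.1) {
        value += d[k] * weight;
        if (weight <= tol) {                  /* remaining digits cannot matter */
            printf("stopped after %d of %d digits\n", k + 1, n);
            break;
        }
    }
    return value;
}

int main(void) {
    int d[] = {3, 1, 4, 1, 5, 9, 2, 6};       /* 0.31415926, MSD first */
    printf("%f\n", consume_msd_first(d, 8, 1e-3));   /* stops after 3 digits */
    return 0;
}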

Enter my former BEng project student Aaron Zhao, who presented our paper “An Efficient Implementation of Online Arithmetic” at the IEEE International Conference on Field-Programmable Technology in Xi’an, China. Aaron’s contribution – for his BEng final-year project – was to design a library of online arithmetic units whose logic area is constant with precision (of course, they still need more RAM for more precision). This opens up a lot of practical possibilities for our research.

Precision (and energy) can scale elastically with time in FPGA-based compute.

Memory Consistency Models, and how to compare them automatically

John Wickerson explains our POPL 2017 paper on memory consistency


My POPL 2017 paper with Mark Batty, Tyler Sorensen, and George Constantinides is all about memory consistency models, and how we can use an automatic constraint solver to compare them. In this blog post, I will discuss:

  • What are memory consistency models, and why do we want to compare them?
  • Why is comparing memory consistency models difficult, and how do we get around those difficulties in our work?
  • What can we achieve, now that we can compare memory consistency models automatically?

What is a memory consistency model?

Modern computers achieve blistering performance by having their computational tasks divided between many ‘mini-computers’ that run simultaneously. These mini-computers all have access to a collection of shared memory locations, through which they can communicate with one another.

It is tempting to assume that when one of the mini-computers stores some data into one of these shared memory locations, that data will immediately become visible to any other mini-computer that subsequently loads from that location…


Rounding Error and Algebraic Geometry

We’re often faced with numerical computation expressed as arithmetic over the reals but implemented as finite precision arithmetic, for example over the floats. A natural question arises: how accurate is my calculated answer?

For many years, I’ve studied the closely related problem “how do I design a machine to implement a numerical computation most efficiently to a given level of accuracy?” for various definitions of “numerical computation”, “efficiently” and “accuracy”.

The latest incarnation of this work is particularly exciting, in my opinion. ACM Transactions on Mathematical Software has recently accepted a manuscript reporting joint work between my former post-doc Victor Magron, Alastair Donaldson, and myself.

The basic setting is borrowed from earlier work I did with my former PhD student David Boland. Imagine that you’re computing a + b in floating-point arithmetic. Under some technical assumptions, the answer you get will not actually be a + b, but rather (a + b)(1 + \delta) where |\delta| << 1 is bounded by a constant determined only by the precision of the arithmetic involved. The same goes for any other fundamental operator *, /, etc. Now imagine chaining a whole load of these operations together to get a computation. For straight-line code consisting only of addition and multiplication operators, the result will be polynomial in all your input variables as well as in these error variables \delta.
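As a tiny worked example (mine, not taken from the manuscript), take the computation (a + b) + c. In floating point we actually obtain \hat{s} = ((a + b)(1 + \delta_1) + c)(1 + \delta_2) = (a + b + c) + (a + b)\delta_1 + (a + b + c)\delta_2 + (a + b)\delta_1\delta_2, so the rounding error \hat{s} - (a + b + c) is itself a polynomial in a, b, c, \delta_1 and \delta_2, and bounding its worst case over the input ranges and over the bounded \delta_i is exactly a constrained polynomial optimisation problem.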

Once we know that, we can view the problem of determining worst-case accuracy as bounding a polynomial subject to constraints on its variables. There are various ways of bounding polynomials, but David and I were particularly keen to explore options that – while difficult to calculate – might lead to easily verifiable certificates. We ended up using Handelman representations, a particular result from real algebraic geometry.

The manuscript with Magron and Donaldson extends the approach to a more general semialgebraic setting, using recently developed ideas based on hierarchies of semidefinite programming problems that converge to equally certifiable solutions. Specialisations are introduced to deal with some of the complexity and numerical problems that arise, through a particular formulation that exploits sparsity and the numerical properties of the problem at hand. This has produced an open source tool as well as the paper.

This work excites me because it blends some amazing results in real algebraic geometry, some hard problems in program analysis, and some very practical implications for efficient hardware design – especially for FPGA datapath. Apart from the practicalities, algebraic sets are beautiful objects.

KAPow: Online Instrumentation of Power

Great news from the IEEE International Symposium on Field-Programmable Custom Computing Machines this week – our paper KAPow: A System Identification Approach to Online Per-module Power Estimation in FPGA Designs won the best paper award!

KAPow is all about trying to understand where your FPGA design is burning power, at run time, so that some higher level control entity can make intelligent decisions based on that information.

The problem is that we just don’t have the power sensors – we only know the total power consumed by the device, not the power consumed by each module in the design. So what to do?

Enter KAPow. Our tool (soon to be released publicly as an output from the PRiME project) will take RTL, instrument it, and return RTL that estimates its own power consumption.

The idea is fairly straightforward: automatically identify, at synthesis (compile) time, a subset of nets that you expect to correlate well with the power consumption of a module. Insert cheap counters to monitor their activity at run time, and then use online recursive least squares to adaptively learn a model of each module’s power consumption, based on the overall power consumption of the chip.
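For readers unfamiliar with recursive least squares, here is a minimal sketch of the kind of update involved (my own illustration, not the KAPow implementation): each observation pairs the activity counts of the monitored nets with the measured total power, and the per-module weights are refined so that their weighted sum tracks the measurement.

#include <stdio.h>

#define M 3                               /* number of monitored activity counters */

static double w[M];                       /* estimated power per unit activity */
static double P[M][M];                    /* inverse-correlation estimate      */
static const double lambda = 0.99;        /* forgetting factor                 */

void rls_init(void) {
    for (int i = 0; i < M; i++) {
        w[i] = 0.0;
        for (int j = 0; j < M; j++) P[i][j] = (i == j) ? 1e3 : 0.0;
    }
}

/* One recursive least-squares step: x = activity counts, y = measured total power. */
void rls_update(const double x[M], double y) {
    double Px[M], xPx = 0.0, err = y, k[M];
    for (int i = 0; i < M; i++) {                       /* Px = P * x */
        Px[i] = 0.0;
        for (int j = 0; j < M; j++) Px[i] += P[i][j] * x[j];
        xPx += x[i] * Px[i];
        err -= w[i] * x[i];                             /* prediction error y - w'x */
    }
    for (int i = 0; i < M; i++) k[i] = Px[i] / (lambda + xPx);   /* gain vector */
    for (int i = 0; i < M; i++) w[i] += k[i] * err;              /* update weights */
    for (int i = 0; i < M; i++)                         /* P = (P - k (Px)') / lambda */
        for (int j = 0; j < M; j++)
            P[i][j] = (P[i][j] - k[i] * Px[j]) / lambda;
}

int main(void) {
    rls_init();
    double x1[M] = {100, 20, 5}, x2[M] = {80, 40, 5};   /* made-up activity counts  */
    rls_update(x1, 1.30);                               /* made-up total power (W)  */
    rls_update(x2, 1.25);
    printf("w = %f %f %f\n", w[0], w[1], w[2]);
    return 0;
}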

Results are encouraging: power estimates are accurate down to about 5mW, and the monitoring infrastructure itself adds less than 4% to the total power consumed by the device.

Now, of course, the question is what to do with all this information that can now be collected at run-time. Watch this space!

Super Fast Loops

On Tuesday, my PhD student Junyi Liu presented our work (joint with John Wickerson) on Loop Splitting for Efficient Pipelining in High-Level Synthesis to the assembled audience at the IEEE International Symposium on Field-Programmable Custom Computing Machines in Washington DC.

A primary way in which FPGA applications tend to get their blindingly fast performance is through overlapping loop iterations in time – known variously as loop pipelining or software pipelining. These days, you can expect high-level synthesis tools to do this for you. Sometimes.

Unfortunately, there are cases where the tools can’t squeeze out this performance. This paper addresses two such cases in a unified framework. The first is the case where some pesky loop iterations get in the way. Consider this trivial example:


for( int i=0; i<N; i++ )
  A[2*i] = A[i] + 0.5f;

In this case the early loop iterations are the problematic ones: A[2] depends on A[1], A[4] on A[2], and so on, so for small i the gap between a write and the read that depends on it is only a few iterations. These tight dependences hinder pipelining, leading existing HLS tools to throw in the towel.
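To give a feel for the kind of transformation involved (a hand-written illustration only, not the actual output of our tool; the threshold SPLIT is made up for the example), one can split the loop so that the early, tightly dependent iterations are scheduled conservatively while the rest are pipelined aggressively:

/* Illustrative source-to-source split.  In the first loop the read A[i] may
 * refer to an element written only a few iterations earlier; within the
 * second loop, any read A[i] refers to an element written at least SPLIT
 * iterations earlier (or before the loop began), so it can be pipelined
 * aggressively. */
#define SPLIT 8   /* illustrative threshold only */

for( int i=0; i<SPLIT && i<N; i++ )
  A[2*i] = A[i] + 0.5f;

for( int i=SPLIT; i<N; i++ )
  A[2*i] = A[i] + 0.5f;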

The second case is where there are loop-invariant parameters that are not known until the loop executes. Consider the case:


for( int i=0; i<N; i++ )
  A[i+m] = A[i] + 0.5f;

Without knowing the value of m at compile time, the dependence structure is unknown – we might have no read-after-write dependences, tight read-after-write dependences, or the dependences might be so many iterations away that we just don’t care and can pipeline away to our heart’s content. A limited version of this latter issue was addressed in Junyi’s earlier paper, the predecessor of this work.

In the new FCCM 2016 paper, we show that both of these cases can be analysed using a parametric polyhedral framework, and that we can automatically derive source-to-source transformations to significantly accelerate the loops in these cases. The end result? A push-button approach that could gain you a factor of more than 4x in performance if your pipelining is being stymied by pesky dependences.