Arithmetic Workshop at Imperial

This week my research group had the pleasure of hosting Prof Miloš Ercegovac from UCLA, one of the giants of the field of computer arithmetic. Prof Ercegovac had come at my invitation to deliver a short three-day course on computer arithmetic to approximately 40 delegates from Imperial College and elsewhere, including a significant delegation from UK industry.


I first encountered Miloš’s work when I was a PhD student reading around my topic, but I didn’t have the opportunity to make use of his work directly until 2014, when I published a paper with my former PhD student Kan Shi (now with Intel), demonstrating that a digit-parallel form of Online Arithmetic can be used in the context of approximate computing, to provide an unexplored trade-off between clock frequency and arithmetic error. I later developed an architecture based on online arithmetic for arbitrary-precision computation with my BEng student Aaron Zhao (now at Cambridge).

The title of Miloš’s lectures was “An Enduring Pillar of Computer Arithmetic: Redundancy in Representation and its Uses in Algorithms and Implementations”.  I hope that in this brief blog post, I can provide you with an overview of the topics that were discussed at our workshop, aimed at a general technical reader. The best reference for most of the material, should you wish to dig further, is the book by Ercegovac and Lang.

Day 1 kicked off with a discussion of redundancy in representation (the idea that you can have more than one representation of a given number), its realisation in positional radix number systems, and the implications for addition. As a simple example, imagine working in decimal, but allowing your digits to range from -9 to 9 instead of 0 to 9 as usual. This is clearly redundant because, for example, 18 = 2\bar{2} where the bar over a digit indicates a negative digit. But it also has the remarkable property that addition can be performed without inducing carries, leading to the potential for very significant parallelism. I particularly enjoyed that Miloš was able to trace this idea back to the 1700s, in a paper by John Colson, published in the Philosophical Transactions of the Royal Society, entitled “A Short Account of Negativo-affirmative Arithmetick”.
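To make the carry-free property concrete, here is a small C sketch of signed-digit addition (my own illustration, not code from the lectures). The key point is that each output digit depends only on its own position and its immediate less-significant neighbour, so all digits can be produced in parallel.

#include <stdio.h>

/* Carry-free addition in radix 10 with the redundant digit set {-9,...,9}.
 * Digits are stored least-significant first; the result has n+1 digits.
 * Each z[i] depends only on positions i and i-1 of the inputs, so in
 * hardware every output digit can be computed in parallel. */
void sd_add(const int *x, const int *y, int *z, int n) {
    int t[n + 1], w[n];            /* transfers in {-1,0,1}, interim sums in [-8,8] */
    t[0] = 0;
    for (int i = 0; i < n; i++) {  /* independent for each position */
        int p = x[i] + y[i];       /* p in [-18,18] */
        if (p >= 9)       { t[i + 1] =  1; w[i] = p - 10; }
        else if (p <= -9) { t[i + 1] = -1; w[i] = p + 10; }
        else              { t[i + 1] =  0; w[i] = p;      }
    }
    for (int i = 0; i < n; i++)    /* also independent for each position */
        z[i] = w[i] + t[i];        /* always stays within [-9,9] */
    z[n] = t[n];                   /* possible extra leading digit */
}

int main(void) {
    int x[] = {8, 1}, y[] = {8, 1}, z[3];    /* 18 + 18 */
    sd_add(x, y, z, 2);
    printf("%d %d %d\n", z[0], z[1], z[2]);  /* prints 6 3 0, i.e. 36 */
    return 0;
}

The redundancy is what buys this independence between digit positions: supplying each input as 2\bar{2} instead of 18 yields the digits 4\bar{4}, a different but equally valid representation of 36.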

Day 2 covered Miloš’s Online Arithmetic, referred to earlier. As I tell my students: when adding or multiplying two numbers, if the two numbers are big, you expect a big result, right? So why not use that information – why wait until you’ve added all the partial products from the least-significant digits before producing the most-significant digit? Usually, we must wait because a carry from the least-significant digits can throw off the most-significant one. Not so with online arithmetic – taking advantage of redundancy allows one to produce multiplication and addition (and other) results most-significant-digit-first.
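To give a flavour of how this works, here is a small sketch of online addition (again my own illustration, reusing the digit set and decomposition from the example above). In this sketch, each digit of the sum is emitted after seeing only one further digit position of the inputs.

#include <stdio.h>

/* Online (most-significant-digit-first) addition in radix 10 with signed
 * digits {-9,...,9}.  Input digits x[j], y[j] arrive MSD first, with weight
 * 10^-(j+1); output digit z[j] has weight 10^-j, so z[0] is an integer digit
 * that absorbs any overflow.  Digit z[j] is emitted as soon as input
 * position j has been seen -- an online delay of one digit. */
void online_add(const int *x, const int *y, int *z, int n) {
    int w_prev = 0;                        /* interim sum held back one step */
    for (int j = 0; j < n; j++) {
        int p = x[j] + y[j];               /* p in [-18,18] */
        int t, w;                          /* p = 10*t + w, t in {-1,0,1}, w in [-8,8] */
        if (p >= 9)       { t =  1; w = p - 10; }
        else if (p <= -9) { t = -1; w = p + 10; }
        else              { t =  0; w = p;      }
        z[j] = w_prev + t;                 /* emit the digit one position up */
        w_prev = w;
    }
    z[n] = w_prev;                         /* final, least-significant digit */
}

int main(void) {
    int x[] = {9, 9}, y[] = {0, 2}, z[3];  /* 0.99 + 0.02 */
    online_add(x, y, z, 2);
    printf("%d.%d%d\n", z[0], z[1], z[2]); /* prints 1.01 -- the leading 1 appears first */
    return 0;
}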

Day 3, the final day of our workshop, ended with a discussion of high-radix division, explaining some interesting ways that extra performance can be squeezed out of division by working in higher radices.

I would recommend that anyone interested in arithmetic get hold of Ercegovac and Lang and explore these ideas in more depth. With the rise of custom and accelerator architectures, there has never been a better time to re-explore some of the assumptions underlying our arithmetic hardware. Inspired readers, feel free to contact me about PhD or postdoc opportunities in this area!

 

DATE 2017: Some Personal Highlights

This week I was away at the 2017 Design, Automation and Test in Europe Conference (DATE) in Lausanne, Switzerland, where my collaborator and former staff member Kevin Murray was presenting our joint work with Andrea Suardi and Vaughn Betz.


In this blog post I note a couple of personal highlights from the conference.

Apart from Kevin’s paper, which I describe in accessible form in a previous blog post, Jie Han and Marc Riedel’s tutorial on stochastic computation was thought-provoking. It’s an old idea, which I’ve seen a few times, but it was interesting to catch up with the recent thinking in this area: results on synthesis using Bernstein polynomials and a de-randomisation of the logic.

The theme of approximate computing pervaded the conference. Newcastle had an interesting paper on approximate multiplication in which multiple partial products are compressed by replacing (implicit) summation with Boolean OR operations. This complements more traditional approaches like those I’ve worked on, where we simply throw away partial products. It seems like there could be a few interesting generalisations of this approach.
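To illustrate the flavour of the idea, in a much cruder form than the paper’s design, here is a toy C function of my own that forms the partial products of a 4-bit multiplication but merges them with bitwise OR rather than addition:

#include <stdint.h>
#include <stdio.h>

/* Toy approximate multiplier: OR the shifted partial products together
 * instead of summing them.  Much cheaper than an adder tree, but lossy
 * whenever two partial products have overlapping set bits. */
uint16_t approx_mul4(uint8_t a, uint8_t b) {
    uint16_t acc = 0;
    for (int i = 0; i < 4; i++)
        if ((b >> i) & 1)
            acc |= (uint16_t)(a & 0xF) << i;   /* OR in place of addition */
    return acc;
}

int main(void) {
    printf("approx 3*3 = %u (exact 9)\n", approx_mul4(3, 3));   /* prints 7 */
    return 0;
}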

It was also, of course, great to see the many colleagues I often meet at these conferences, to spend some time in the laboratory of Paolo Ienne, and to meet his PhD student Lana Josipovic, who is doing some very interesting work on high-level synthesis.

EPFL also has possibly the best underpass in the world.

DATE remains one of the best conferences to catch up with the various diverse directions in which EDA researchers have gone in recent years.

Overclocking For Fun and Profit

This week at the Design, Automation and Test in Europe (DATE) conference, Kevin Murray is presenting some exciting work I’ve done in collaboration with Kevin, his supervisor Vaughn Betz at the University of Toronto, and Andrea Suardi at Imperial College.

I’ve been working for a while on the idea that one form of approximate computing derives from circuit overclocking. The idea is that if you overclock a circuit then this may induce some error. However, the error may be small or rare, despite a very significant performance enhancement. We’ve shown, for example, that such trade-offs make sense for image processing hardware and – excitingly – that the trade-offs themselves can be improved by adopting “overclocking-friendly” number representations.

In the work I’ve done on this topic to date, the intuition that a given circuit is “overclocking friendly” for a certain set of input data has been a human one. In this latest paper we move to an automated approach.

Once we accept the possibility of overclocking, our circuit timing analysis has to change completely – we can no longer be content with bounding the worst-case delay of a circuit, because we’re aiming to violate this worst case by design. What we’re really after is a histogram of timing-critical paths – allowing us to answer questions like “what’s the chance that we’ll see a critical path longer than this in any given clock period?” Different input values and different switching activities sensitise different paths, leading to different timing on each clock cycle.
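As a toy illustration of what such a histogram captures (a sketch of my own, not the method in the paper, which computes these probabilities exactly), one can estimate by random sampling how the longest carry chain excited in an 8-bit ripple-carry adder is distributed, using chain length as a crude stand-in for the delay of the sensitised path:

#include <stdio.h>
#include <stdlib.h>

/* Length of the longest run of full-adder stages traversed by a carry for a
 * given input pair -- a crude proxy for the sensitised path delay. */
static int longest_carry_chain(unsigned a, unsigned b, int bits) {
    int carry = 0, chain = 0, longest = 0;
    for (int i = 0; i < bits; i++) {
        int ai = (a >> i) & 1, bi = (b >> i) & 1;
        if (ai & bi)                 { carry = 1; chain = 1; }  /* generate  */
        else if ((ai ^ bi) && carry) { chain++; }               /* propagate */
        else                         { carry = 0; chain = 0; }  /* kill      */
        if (chain > longest) longest = chain;
    }
    return longest;
}

int main(void) {
    enum { BITS = 8, SAMPLES = 1000000 };
    int hist[BITS + 1] = {0};
    srand(0);
    for (int s = 0; s < SAMPLES; s++) {
        unsigned a = rand() & 0xFF, b = rand() & 0xFF;
        hist[longest_carry_chain(a, b, BITS)]++;       /* build the histogram */
    }
    for (int k = 0; k <= BITS; k++)
        printf("P(chain length = %d) ~= %.4f\n", k, (double)hist[k] / SAMPLES);
    return 0;
}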

This paper’s fundamental contribution is to show that the #SAT problem can be used efficiently to quantify these probabilities, giving rise to the first method for determining, at synthesis time, the affinity of a given circuit to approximation-by-overclocking.

Concurrent Programming in High-Level Synthesis

This week, my student Nadesh Ramanathan presents our FPGA 2017 paper “Hardware Synthesis of Weakly Consistent C Concurrency”, a piece of work jointly done with John Wickerson and Shane Fleming.

High-Level Synthesis, the automatic mapping of programs – typically C programs – into hardware, has had a lot of recent success. The basic idea is straightforward to understand, but difficult to do: automatically convert a vanilla C program into hardware, extracting parallelism, making memory decisions, etc., as you go. As these tools gain industry adoption, people will begin using them not only for code originally specified as sequential C, but for code specified as concurrent C.

There are a few tricky issues to deal with when mapping concurrent C programs into hardware. One approach, which seems modular and therefore scalable, has been adopted by LegUp: schedule threads independently and then build a multithreaded piece of hardware out of multiple hardware threads. This all works fine; indeed, there is an existing pthreads library for LegUp. The challenge comes when there are complex interactions between these threads. What if they talk to each other? Do we need to use locks to ensure synchronisation?

In the software world, this problem has been well studied. The approach proposed by Lamport was to provide the programmer with a view of memory known as “sequentially consistent” (SC). This is basically the intuitive way you would expect programs to execute. Consider the two threads below, each synthesised by an HLS tool. The shared variables x and y are both initialised to zero. The assertion is not an unreasonable expectation from a programmer: if r0 = 0, it follows that Line 2.3 has been executed (as otherwise r0 = -1). We can therefore conclude that Line 1.2 executed before Line 2.2. It’s reasonable for the programmer to assume, therefore, that Line 1.1 also executed before Line 2.3 – but then x would be 1 when it is read on Line 2.3, not zero! Within a single thread, the dependence analysis implemented as standard in high-level synthesis would be enough to ensure consistency with the sequential order of the original code, by enforcing appropriate dependences. But not so in the multi-threaded case! Indeed, putting this code into an HLS tool does result in cases where the assertion fails.

 

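The original post shows the code as an image; the following is my reconstruction of the two threads, consistent with the description above (it may differ in detail from the actual figure). Note that x and y are plain, non-atomic shared variables, which is precisely the problem; compiled as ordinary software, this program also contains a data race.

#include <assert.h>
#include <pthread.h>

int x = 0, y = 0;              /* shared variables, both initialised to zero */

void *thread1(void *arg) {
    x = 1;                     /* Line 1.1 */
    y = 1;                     /* Line 1.2 */
    return NULL;
}

void *thread2(void *arg) {
    int r0 = -1;               /* Line 2.1 */
    if (y == 1)                /* Line 2.2 */
        r0 = x;                /* Line 2.3 */
    assert(r0 != 0);           /* if we saw y == 1, we expect to see x == 1 too */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}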

My PhD student’s paper shows that we can fix this issue neatly and simply within the modular framework of scheduling threads independently, by adding judicious extra dependences before scheduling. He also shows that you can improve performance considerably by supporting the modern (C11) standard for atomic memory operations, which come in a variety of flavours, from full sequential consistency to the relaxed approach natively supported by the existing LegUp pthreads library. In particular, he shows for the first time that, on an example piece of code chaining circular buffers together, you can get essentially zero performance overhead by using the so-called acquire/release atomics defined in the C11 standard as part of an HLS flow, opening the door to the efficient synthesis of lock-free concurrent algorithms on FPGAs.
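For readers unfamiliar with C11 atomics, here is a minimal, software-only sketch of the acquire/release idiom (my own example, not the chained circular-buffer benchmark from the paper): the release store on the flag guarantees that the earlier write to data is visible to any thread whose acquire load observes the flag set, without paying for full sequential consistency.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

int data = 0;            /* plain shared data    */
atomic_int ready = 0;    /* synchronisation flag */

void *producer(void *arg) {
    data = 42;                                                  /* ordinary write */
    atomic_store_explicit(&ready, 1, memory_order_release);     /* publish it     */
    return NULL;
}

void *consumer(void *arg) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                /* spin until the flag is seen */
    printf("%d\n", data);                /* guaranteed to print 42      */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}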

As FPGAs come of age in computing, it’s important to be able to synthesise a broad range of software, including code making use of standard concurrent programming idioms. We hope this paper is a step in that direction.

POPL and VMCAI 2017: Some Highlights

Last week I was in Paris attending POPL 2017 and co-located events such as VMCAI. This was my first time at POPL, prompted by acceptance of work with John Wickerson. Here are some highlights and observations from my own, biased, position as an outsider to the programming languages community.

Polyhedral analysis

Due to my prior work with linear programming and polyhedral analysis, the polyhedral talks were particularly interesting to me.

Maréchal presented an approach to condense polyhedral representations through ray tracing, which seems to be of general interest.

While polyhedral methods in general represent a set as \{x | Ax \leq b\}, where both A and b are parameters, there are various restricted approaches which gain efficiency by fixing A and allowing only b as a parameter. Seladji presented a technique to try to discover good “templates” of this kind via Principal Component Analysis. I suspect that this is equivalent to an ellipsoidal approximation of the polyhedron in some sense.

Singh, Püschel, and Vechev presented an interesting method to speed up polyhedral abstract domain computations by partitioning, resulting in a much faster tool.

SMT Solvers

An interesting paper from Jovanović presented a nonlinear integer arithmetic solver which looks genuinely very useful for a broad class of applications.

 

Heroics with Theorem Provers

Lots of work was presented making use of Coq, some of which seemed fairly heroic to me. Of particular interest, given my work on non-negative polynomials for bounding roundoff error, was a Coq tactic from Martin-Dorel and Roux that allows the use of (necessarily approximate) floating-point computation in ways I’ve not seen before: essentially, writing a sum-of-squares polynomial p \approx z^T Q z exactly as p = z^T Q z + z^T E z = z^T (Q + E) z and then showing that Q + E is positive semidefinite for the resulting small E.

Weak Memory

John presented our work on automatically comparing memory models, which seemed to be well received.

img_9144

 

Numerics

Chiang et al. presented a paper on rigorous floating-point precision tuning. The idea is simple but elegant: perform error analysis with the individual precisions as symbolic variables and then use a global optimisation solver to look for a high-performance implementation. Back in 2001, I did a very similar thing, but for a very limited class of algorithm (LTI systems), in fixed-point rather than floating-point, and using noise power rather than worst-case instantaneous error as the metric of accuracy. However, the general ingredients (individual precisions, an explicit optimisation) are there, and it is great to see this kind of work now appearing in a very different context.

 

Cost Analysis

There were a couple of papers on automated cost analysis [Madhavan et al., Hoffmann et al.] that I found particularly interesting.

 

Community differences

As an outsider, I had the opportunity to ponder some interesting cultural differences from which the various research communities I’m involved with could potentially learn.

Mentoring: One of the co-located workshops was PLMW, the Programming Languages Mentoring Workshop. This had talks ranging from “Time management, family, and quality of life” to “Machine learning and programming languages”. Introductory material is mixed with explicitly nontechnical content valuable to early-career researchers. Co-locating with the major conference of the field, I would imagine, makes it relatively easy to pull in very high-quality speakers.

Reproducibility: POPL, amongst other CS conferences, encourages the submission of artifacts. I am a fan of this process. While it doesn’t quite hit the holy grail of research reproducibility, it takes a definite step in this direction.

Time is Precision

Most modern FPGA arithmetic designs use bit-parallel binary arithmetic – typically two’s complement for signed computation. This generally makes for fast arithmetic, but has the distinct disadvantage that silicon area scales with the precision of computation. Occasionally – much more often in the distant past, when FPGA area was a precious resource – people compute with bit-serial binary arithmetic. In this case, time scales with precision.

One problem with digit-serial binary arithmetic for multiplication and division is that you need to know, in advance, the location of your least significant digit: although the computation unfolds over time, the precision must be fixed a priori.

Miloš Ercegovac’s online arithmetic forms an interesting counterpoint to this: in this arithmetic, digits are produced most-significant-digit-first. This suggests that the longer we compute, the more precise our answer will be – and we can terminate whenever we’ve got a good enough answer. Or run out of time. The problem with this approach is that the hardware arithmetic units generally implemented for digit-serial multiplication or division still scale with precision requirements, placing an unwanted a priori bound on computational precision.
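A small sketch (my own) of why MSD-first digit streams make “good enough” termination easy: with radix-10 signed digits, the digits not yet consumed can change the value by at most the weight of the digit just consumed, so we can stop as soon as that bound drops below our accuracy target.

#include <stdio.h>

/* Consume a most-significant-digit-first stream of signed decimal digits
 * d[k] (weight 10^-(k+1)) and stop early once the unseen tail, which is
 * bounded by the weight of the digit just consumed, is below tol. */
double consume_msd_first(const int *d, int n, double tol) {
    double value = 0.0, weight = 0.1;
    for (int k = 0; k < n; k++, weight *= 0.1) {
        value += d[k] * weight;
        if (weight <= tol) {                  /* remaining digits cannot matter */
            printf("stopped after %d of %d digits\n", k + 1, n);
            break;
        }
    }
    return value;
}

int main(void) {
    int d[] = {3, 1, 4, 1, 5, 9, 2, 6};       /* 0.31415926, MSD first */
    printf("%f\n", consume_msd_first(d, 8, 1e-3));   /* stops after 3 digits */
    return 0;
}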

Enter my former BEng project student Aaron Zhao, who presented our paper “An Efficient Implementation of Online Arithmetic” at the IEEE International Conference on Field-Programmable Technology in Xi’an, China. Aaron’s contribution – for his BEng final-year project – was to design a library of online arithmetic units whose logic area is constant with precision (of course, they still need more RAM for more precision). This opens up a lot of practical possibilities for our research.

Precision (and energy) can scale elastically with time in FPGA-based compute.

Memory Consistency Models, and how to compare them automatically

John Wickerson explains our POPL 2017 paper on memory consistency


My POPL 2017 paper with Mark Batty, Tyler Sorensen, and George Constantinides is all about memory consistency models, and how we can use an automatic constraint solver to compare them. In this blog post, I will discuss:

  • What are memory consistency models, and why do we want to compare them?
  • Why is comparing memory consistency models difficult, and how do we get around those difficulties in our work?
  • What can we achieve, now that we can compare memory consistency models automatically?

What is a memory consistency model?

Modern computers achieve blistering performance by having their computational tasks divided between many ‘mini-computers’ that run simultaneously. These mini-computers all have access to a collection of shared memory locations, through which they can communicate with one another.

It is tempting to assume that when one of the mini-computers stores some data into one of these shared memory locations, that data will immediately become visible to any other mini-computer that subsequently loads from that location…


Rounding Error and Algebraic Geometry

We’re often faced with numerical computation expressed as arithmetic over the reals but implemented as finite precision arithmetic, for example over the floats. A natural question arises: how accurate is my calculated answer?

For many years, I’ve studied the closely related problem “how do I design a machine to implement a numerical computation most efficiently to a given level of accuracy?” for various definitions of “numerical computation”, “efficiently” and “accuracy”.

The latest incarnation of this work is particularly exciting, in my opinion. ACM Transactions on Mathematical Software has recently accepted a manuscript reporting joint work between my former post-doc Victor Magron, Alastair Donaldson, and myself.

The basic setting is borrowed from earlier work I did with my former PhD student David Boland. Imagine that you’re computing a + b in floating-point arithmetic. Under some technical assumptions, the answer you get will not actually be a + b, but rather (a + b)(1 + \delta) where |\delta| << 1 is bounded by a constant determined only by the precision of the arithmetic involved. The same goes for any other fundamental operator *, /, etc. Now imagine chaining a whole load of these operations together to get a computation. For straight-line code consisting only of addition and multiplication operators, the result will be polynomial in all your input variables as well as in these error variables \delta.
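As a tiny worked example (mine, not taken from the manuscript), take the computation (a + b) + c. In floating point we actually obtain \hat{s} = ((a + b)(1 + \delta_1) + c)(1 + \delta_2) = (a + b + c) + (a + b)\delta_1 + (a + b + c)\delta_2 + (a + b)\delta_1\delta_2, so the rounding error \hat{s} - (a + b + c) is itself a polynomial in a, b, c, \delta_1 and \delta_2, and bounding its worst case over the input ranges and over the bounded \delta_i is exactly a constrained polynomial optimisation problem.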

Once we know that, we can view the problem of determining worst-case accuracy as bounding a polynomial subject to constraints on its variables. There are various ways of bounding polynomials, but David and I were particularly keen to explore options that – while difficult to calculate – might lead to easily verifiable certificates. We ended up using Handelman representations, a particular result from real algebraic geometry.

The manuscript with Magron and Donaldson extends the approach to a more general semialgebraic setting, using recently developed ideas based on hierarchies of semidefinite programming problems that converge to equally certifiable solutions. Specialisations are introduced to deal with some of the complexity and numerical problems that arise, through a particular formulation that exploits sparsity and the numerical properties of the problem at hand. This has produced an open source tool as well as the paper.

This work excites me because it blends some amazing results in real algebraic geometry, some hard problems in program analysis, and some very practical implications for efficient hardware design – especially for FPGA datapath. Apart from the practicalities, algebraic sets are beautiful objects.

KAPow: Online Instrumentation of Power

Great news from the IEEE International Symposium on Field-Programmable Custom Computing Machines this week – our paper KAPow: A System Identification Approach to Online Per-module Power Estimation in FPGA Designs won the best paper award!

KAPow is all about trying to understand where your FPGA design is burning power, at run time, so that some higher level control entity can make intelligent decisions based on that information.

The problem is that we just don’t have the power sensors – we only know the total power consumed by the device, not the power consumed by each module in the design. So what to do?

Enter KAPow. Our tool (soon to be released publicly as an output from the PRiME project) will take RTL, instrument it, and return RTL that estimates its own power consumption.

The idea is fairly straightforward: automatically identify, at synthesis (compile) time, a subset of nets that you expect to correlate well with the power consumption of a module. Insert cheap counters to monitor their activity at run time, and then use online recursive least squares to adaptively learn a model of each module’s power consumption, based on the overall power consumption of the chip.
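For readers unfamiliar with recursive least squares, here is a minimal sketch of the kind of update involved (my own illustration, not the KAPow implementation): each observation pairs the activity counts of the monitored nets with the measured total power, and the per-module weights are refined so that their weighted sum tracks the measurement.

#include <stdio.h>

#define M 3                               /* number of monitored activity counters */

static double w[M];                       /* estimated power per unit activity */
static double P[M][M];                    /* inverse-correlation estimate      */
static const double lambda = 0.99;        /* forgetting factor                 */

void rls_init(void) {
    for (int i = 0; i < M; i++) {
        w[i] = 0.0;
        for (int j = 0; j < M; j++) P[i][j] = (i == j) ? 1e3 : 0.0;
    }
}

/* One recursive least-squares step: x = activity counts, y = measured total power. */
void rls_update(const double x[M], double y) {
    double Px[M], xPx = 0.0, err = y, k[M];
    for (int i = 0; i < M; i++) {                       /* Px = P * x */
        Px[i] = 0.0;
        for (int j = 0; j < M; j++) Px[i] += P[i][j] * x[j];
        xPx += x[i] * Px[i];
        err -= w[i] * x[i];                             /* prediction error y - w'x */
    }
    for (int i = 0; i < M; i++) k[i] = Px[i] / (lambda + xPx);   /* gain vector */
    for (int i = 0; i < M; i++) w[i] += k[i] * err;              /* update weights */
    for (int i = 0; i < M; i++)                         /* P = (P - k (Px)') / lambda */
        for (int j = 0; j < M; j++)
            P[i][j] = (P[i][j] - k[i] * Px[j]) / lambda;
}

int main(void) {
    rls_init();
    double x1[M] = {100, 20, 5}, x2[M] = {80, 40, 5};   /* made-up activity counts  */
    rls_update(x1, 1.30);                               /* made-up total power (W)  */
    rls_update(x2, 1.25);
    printf("w = %f %f %f\n", w[0], w[1], w[2]);
    return 0;
}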

Results are encouraging: power estimates are accurate down to about 5mW, and the monitoring infrastructure itself adds less than 4% to the total power consumed by the device.

Now, of course, the question is what to do with all this information that can now be collected at run-time. Watch this space!

Super Fast Loops

On Tuesday, my PhD student Junyi Liu presented our work (joint with John Wickerson) on Loop Splitting for Efficient Pipelining in High-Level Synthesis to the assembled audience at the IEEE International Symposium on Field-Programmable Custom Computing Machines in Washington DC.

A primary way in which FPGA applications tend to get their blindingly fast performance is through overlapping loop iterations in time – known variously as loop pipelining or software pipelining. These days, you can expect high-level synthesis tools to do this for you. Sometimes.

Unfortunately, there are cases where the tools can’t squeeze out this performance. This paper addresses two such cases in a unified framework. The first is the case where some pesky loop iterations get in the way. Consider this trivial example:


for( int i=0; i<N; i++ )
  A[2*i] = A[i] + 0.5f;

In this case the early loop iterations are the problematic ones: A[2] depends on A[1], A[4] on A[2], and so on, so for small i the gap between a write and the read that depends on it is only a few iterations. These tight dependences hinder pipelining, leading existing HLS tools to throw in the towel.
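To give a feel for the kind of transformation involved (a hand-written illustration only, not the actual output of our tool; the threshold SPLIT is made up for the example), one can split the loop so that the early, tightly dependent iterations are scheduled conservatively while the rest are pipelined aggressively:

/* Illustrative source-to-source split.  In the first loop the read A[i] may
 * refer to an element written only a few iterations earlier; within the
 * second loop, any read A[i] refers to an element written at least SPLIT
 * iterations earlier (or before the loop began), so it can be pipelined
 * aggressively. */
#define SPLIT 8   /* illustrative threshold only */

for( int i=0; i<SPLIT && i<N; i++ )
  A[2*i] = A[i] + 0.5f;

for( int i=SPLIT; i<N; i++ )
  A[2*i] = A[i] + 0.5f;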

The second case is where there are loop-invariant parameters that are not known until the loop executes. Consider the case:


for( int i=0; i<N; i++ )
  A[i+m] = A[i] + 0.5f;

Without knowing the value of m at compile time, the dependence structure is unknown – we might have no read-after-write dependences, tight read-after-write dependences, or the dependences might be so many iterations away that we just don’t care and can pipeline away to our heart’s content. A limited version of this latter issue was addressed in Junyi’s earlier paper, the predecessor of this work.

In the new FCCM 2016 paper, we show that both of these cases can be analysed using a parametric polyhedral framework, and that we can automatically derive source-to-source transformations to significantly accelerate the loops in these cases. The end result? A push-button approach that could gain you a factor of more than 4x in performance if your pipelining is being stymied by pesky dependences.