POPL and VMCAI 2017: Some Highlights

Last week I was in Paris attending POPL 2017 and co-located events such as VMCAI. This was my first time at POPL, prompted by acceptance of work with John Wickerson. Here are some highlights and observations from my own, biased, position as an outsider to the programming languages community.

Polyhedral analysis

Due to my prior work with linear programming and polyhedral analysis, the polyhedral talks were particularly interesting to me.

Maréchal presented an approach to condense polyhedral representations through ray tracing, which seems to be of general interest.

While in general, polyhedral methods represent some set as $\{x | Ax \leq b\}$ where $A$ and $b$ are parameters, there are various restricted approaches which more efficiently fix $A$ and only allow $b$ as parameters. Seladji presented a technique to try to discover such good “templates” via Principal Component Analysis. I suspect that this is equivalent to an ellipsoidal approximation of the polyhedron in some sense.

Singh, Püschel, and Vechev presented an interesting method to speed up polyhedral abstract domain computations by partitioning, resulting in a much faster tool.

SMT Solvers

An interesting paper from Jovanović presented a nonlinear integer arithmetic solver which looks genuinely very useful for a broad class of applications.

Heroics with Theorem Provers

Lots of work was presented making use of Coq, some of which seemed fairly heroic to me. Of particular interest, given my work on non-negative polynomials for bounding roundoff error, was a Coq tactic from Martin-Dorel and Roux that allows the use of (necessarily approximate) floating-point computation in ways I’ve not seen before, essentially writing a sum-of-squares polynomial $p \approx z^T Q z$ as $p = z^T Q z + z^T E z$ and showing that $Q+E$ is positive semidefinite for some small $E$ .

Weak Memory

John presented our work on automatically comparing memory models, which seemed to be well received.

Numerics

Chiang et al. presented a paper on rigorous floating-point precision tuning. The idea is simple, but elegant: perform error analysis with individual precisions as symbolic variables and then use a global optimization solver to look for a high-performance implementation. Back in 2001, I did a very similar thing but for a very limited class of algorithm (LTI systems) in fixed-point rather than floating-point and using noise power rather than worst-case instantaneous error as the metric of accuracy. However, the general setting (individual precisions, an explicit optimization) are there, and it is great to see this kind of work now appearing in a very different context.

Cost Analysis

There were a couple of papers [Madhavan et al., Hoffmann et al.] I found particularly interesting on automated cost analysis.

Community differences

As an outsider, I had the opportunity to ponder some interesting cultural differences from which the various research communities I’m involved with could potentially learn.

Mentoring: One of the co-located workshops was PLMW, the Programming Languages Mentoring Workshop. This had talks ranging from “Time management, family, and quality of life” to “Machine learning and programming languages”. Introductory material is mixed with explicitly nontechnical content valuable to early career researchers. Co-locating with the major conference of the field, I would imagine makes it relatively easy to pull in very high quality speakers.

Reproducibility: POPL, amongst other CS conferences, encourages the submission of artifacts. I am a fan of this process. While it doesn’t quite hit the holy grail of research reproducibility, it takes a definite step in this direction.

Time is Precision

Most modern FPGA arithmetic designs use bit-parallel binary arithmetic – typically two’s complement for signed computation. This generally makes for fast arithmetic, but has the distinct disadvantage that silicon area scales with precision of computation. Occasionally – much more often in the distant past when FPGA area was a precious resource – people compute with bit serial binary arithmetic. In this case, time scales with precision.

One problem with digit serial binary arithmetic for multiplication and division is that you need to know, in advance, the location of your least significant digit: while precision unfolds as time, precision is fixed a priori.

Milos Ercegovac‘s online arithmetic forms an interesting counterpoint to this: in this arithmetic, digits are produced most-significant-digit-first. This suggests that the longer we compute, the more precise our answer will be – and we can terminate whenever we’ve got a good enough answer. Or run out of time. The problem with this approach is that the hardware arithmetic units generally implemented for digit-serial multiplication or division still scale with precision requirements, placing an unwanted a priori bound on computational precision.

Enter my former BEng project student Aaron Zhao, who presented our paper “An Efficient Implementation of Online Arithmetic” at the IEEE International Conference on Field-Programmable Technology in Xi’an, China. Aaron’s contribution – for his BEng final year project – was to design a library of online arithmetic units whose logic area is constant with precision (of course, they still need more RAM for more precision.) This opens up a lot of practical possibilities for our research.

Precision (and energy) can scale elastically with time in FPGA-based compute.

Memory Consistency Models, and how to compare them automatically

John Wickerson explains our POPL 2017 paper on memory consistency

Wickopedia

My POPL 2017 paper with Mark Batty, Tyler Sorensen, and George Constantinides is all about memory consistency models, and how we can use an automatic constraint solver to compare them. In this blog post, I will discuss:

What are memory consistency models, and why do we want to compare them?
Why is comparing memory consistency models difficult, and how do we get around those difficulties in our work?
What can we achieve, now that we can compare memory consistency models automatically?

What is a memory consistency model?

Modern computers achieve blistering performance by having their computational tasks divided between many ‘mini-computers’ that run simultaneously. These mini-computers all have access to a collection of shared memory locations, through which they can communicate with one another.

It is tempting to assume that when one of the mini-computers stores some data into one of these shared memory locations, that data will immediately become visible to any other mini-computer that subsequently loads from that location…

View original post 4,172 more words

Rounding Error and Algebraic Geometry

We’re often faced with numerical computation expressed as arithmetic over the reals but implemented as finite precision arithmetic, for example over the floats. A natural question arises: how accurate is my calculated answer?

For many years, I’ve studied the closely related problem “how do I design a machine to implement a numerical computation most efficiently to a given level of accuracy?” for various definitions of “numerical computation”, “efficiently” and “accuracy”.

The latest incarnation of this work is particularly exciting, in my opinion. ACM Transactions on Mathematical Software has recently accepted a manuscript reporting joint work of my former post-doc, Victor Magron, Alastair Donaldson and myself.

The basic setting is borrowed from earlier work I did with my former PhD student David Boland. Imagine that you’re computing $a + b$ in floating-point arithmetic. Under some technical assumptions, the answer you get will not actually be $a + b$ , but rather $(a + b)(1 + \delta)$ where $|\delta| << 1$ is bounded by a constant determined only by the precision of the arithmetic involved. The same goes for any other fundamental operator *, /, etc. Now imagine chaining a whole load of these operations together to get a computation. For straight-line code consisting only of addition and multiplication operators, the result will be polynomial in all your input variables as well as in these error variables $\delta$ .

Once you know that, we can view the problem of determining worst-case accuracy as bounding a polynomial subject to constraints on its variables. There are various ways of bounding polynomials, but David and I were particularly keen to explore options that – while difficult to calculate – might lead to easily verifiable certificates. We ended up using Handelman representations, a particular result from real algebraic geometry.

The manuscript with Magron and Donaldson extends the approach to a more general semialgebraic setting, using recently developed ideas based on hierarchies of semidefinite programming problems that converge to equally certifiable solutions. Specialisations are introduced to deal with some of the complexity and numerical problems that arise, through a particular formulation that exploits sparsity and the numerical properties of the problem at hand. This has produced an open source tool as well as the paper.

This work excites me because it blends some amazing results in real algebraic geometry, some hard problems in program analysis, and some very practical implications for efficient hardware design – especially for FPGA datapath. Apart from the practicalities, algebraic sets are beautiful objects.

KAPow: Online Instrumentation of Power

Great news from the IEEE International Symposium on Field-Programmable Custom Computing Machines this week – our paper KAPow: A System Identification Approach to Online Per-module Power Estimation in FPGA Designs won the best paper award!

KAPow is all about trying to understand where your FPGA design is burning power, at run time, so that some higher level control entity can make intelligent decisions based on that information.

The problem is that we just don’t have the power sensors – we only know the total power consumed by the device, not the power consumed by each module in the design. So what to do?

Enter KAPow. Our tool (soon to be released publicly as an output from the PRiME project) will take RTL, instrument it and return back RTL that estimates its own power consumption.

The idea is fairly straight-forward: Automatically identify at synthesis (compile) time a subset of nets that you think are going to have a good correlation with power consumption of a module. Insert cheap counters to monitor their activity at run-time, and then use an online recursive least squares model to adaptively learn a model of power consumption of each module, based on the overall power consumption of the chip.

Results are good: power estimates are good down to about 5mW, and the infrastructure itself adds less than 4% to the total power consumed by the device.

Now, of course, the question is what to do with all this information that can now be collected at run-time. Watch this space!

Super Fast Loops

On Tuesday, my PhD student Junyi Liu presented our work (joint with John Wickerson) on Loop Splitting for Efficient Pipelining in High-Level Synthesis to the assembled audience at the IEEE International Symposium on Field-Programmable Custom Computing Machines in Washington DC.

A primary way in which FPGA applications tend to get their blindingly fast performance is through overlapping loop iterations in time – known variously as loop pipelining or software pipelining. These days, you can expect high-level synthesis tools to do this for you. Sometimes.

Unfortunately there are cases where the tools can’t get squeeze out performance. This paper addresses two such cases in a unified framework. The first is the case where some pesky loop iterations get in the way. Consider this trivial example:


for( int i=0; i&amp;lt;N; i++ )
  A[2*i] = A[i] + 0.5f;

In this case the early loop iterations are the problematic ones because A[2] depends on A[1], A[4] on A[2], etc. These tight dependences hinder pipelining, leading existing HLS tools to throw in the towel.

The second case is where there are loop invariant parameters that are not known until the loop executes. Consider the case:


for( int i=0; i&amp;amp;lt;N; i++ )
  A[i+m] = A[i] + 0.5f;

Without knowing the value of m at compile time, the dependence structure is unknown – we might have no read-after-write dependences, tight read-after-write dependences, or the dependences might be so many iterations away that we just don’t care and can pipeline away to our heart’s content. A limited version of this latter issue was addressed in Junyi’s earlier paper, the predecessor of this work.

In the new FCCM 2016 paper, we show that both these cases can be analysed using a parametric polyhedral framework, and show that we can automatically derive source-to-source transformations to significantly accelerate the loops in these cases. The end result? A push button approach that could gain you a factor of more than 4x in performance if your pipelining is being stymied by pesky dependences.

FPGA 2016: Some Highlights

I’ve just returned from the ACM International Symposium on FPGAs, held – as usual – in sunny Monterey, California. This year I was Finance Chair of the conference, which meant I had less running about to do than the last two years, when I was Programme Chair in 2014 and then General Chair in 2015. I therefore took my time to enjoy the papers presented, which were generally of a high quality. Intel’s acquisition of Altera last year provided an interesting backdrop to the conference, and genuinely seems to have fired the community up, as more and more people outside the FPGA “usual suspects” are becoming very interested in the potential for this technology.

Below, I provide my impressions of the highlights of the conference, which necessarily form a very biased view based on what I find particularly interesting.

Architectures and Low-Level CAD

There were a couple of very interesting papers from Altera on their Stratix 10 device. These devices come with customisable clock trees, allowing you to keep the clock distribution local to your clock regions. The results showed that for regions consisting of fewer than 1.6M LEs, it was better to use a configurable clock region rather than any fixed clock region, as the latter needs greater margining. In a separate paper, Dave Lewis presented a detailed look at the Stratix 10 pipelined routing architecture, which gave a good insight into industrial architecture exploration. They had explored a range of circuits and tried to identify the maximum retiming performance that could be achieved around loops as a function of the number of pipelining registers they need to insert in the routing muxes. The circuit designs were discussed, but for me the most interesting thing about this innovation is the way it changes the tools: focus can be placed on P&R for loops rather than feed-forward portions of the design in the case that the design can tolerate latency; this is exactly where tools should be focusing effort. The kind of timing feedback to the user also changes significantly, and for the better.

Zgheib et al from Paolo Ienne’s group at EPFL presented their FPRESSO tool available at fpresso.epfl.ch which harnesses standard cell tools to explore FPGA architectures. Their tool is open source, and this paper won the best paper prize (becoming something of a habit for Paolo!). This tool could be an interesting basis for future academic architecture exploration.

Safeen Huda from Jason Anderson’s group at U of T presented an interesting suggestion for how suppress glitches for power reasons in FPGA circuits in the presence of PVT variation.

Davis demonstrated our work on run-time estimation of power on a per-module basis, part of the PRiME project, at the relevant poster session, and I was pleased by the number of people who could see the value in run-time monitors for this purpose.

Nadesh Ramanathan

James Davis

High Level Tools

Gao’s talk from my group on automatic optimisation of numerical code for HLS using expression rewriting was very well received, especially since he has made his tool available online at https://admk.github.io/soap/. We would be delighted to receive feedback on this tool.

Ramanathan’s presentation from my group on the case for work-stealing on FPGAs using atomic memory operations in OpenCL also seemed to generate quite a buzz at the discussion session afterwards.

Applications

There were quite a few application papers this year targeting Convolutional Neural Networks (CNNs), the application of the moment. Both of the full papers in this area (from Tsinghua and from Arizona State) emphasised the need to use low precision fixed-point datapath, an approach I’ve been pushing in the FPGA compute space for the last 15 years or more. This application seems to be particularly suited to the problem, allowing computation with little impact on classification down as low as 8 bits. The work from Tsinghua university also took advantage of an SVD approach to reduce the amount of compute required. I think there’s some promise to combine this with the fixed-point quantization, as first pointed out by Bouganis in his FCCM 2005 paper.

Overlays

The conference was preceded by the OLAF workshop run by John Wawrzynek and Hayden So. I must admit that I am not a huge fan of the idea of a hardware-implemented FPGA overlay architecture. I can definitely see the possible advantages of an overlay architecture as a conceptual device, a kind of intermediate format for FPGA compilation. I find it harder to make a compelling case for implementing that architecture in the actual hardware. However, if overlay architectures make programming FPGAs easier in those hard-to-reach areas (until we’ve caught up with our HLS technology!) and therefore expand the user base, then bring it on! A bit like floating point, in fact!

Stealing Work for Your FPGA

Tomorrow, my PhD student Nadesh Ramanathan gives his first conference presentation, at the ACM International Symposium on FPGAs, claiming a place for work-stealing on FPGAs (joint work with John Wickerson and Felix Winterstein). The short paper on which the presentation is based can be found here.

Nadesh argues that we should pursue lock free approaches to load balancing for FPGAs, and shows that this can be implemented within Altera’s OpenCL framework. Initial work from an efficient K-means clustering algorithm which manipulates dynamic data structures demonstrates that this approach shows promise for the future. As we move to put more and more irregular applications on FPGAs, this kind of methodology will become increasingly important.

Fields Can Make Your Loops Run Faster

On Tuesday, my PhD student Xitong Gao will present our work (joint with John Wickerson) on Automatically Optimizing the Latency, Area, and Accuracy of C Programs for High-Level Synthesis at the ACM International Symposium on Field-Programmable Gate Arrays.

Our starting point for this work is that people often want to express their algorithms in terms of real numbers, but then want to have an implementation is some finite precision arithmetic on an FPGA after passing the results through a High-Level Synthesis tool.

One of the key limiting factors of the performance of an algorithm is the ability of the HLS tool to pipeline loops in the design in order to maximise throughput. It turns out that numerical properties of real numbers can come to the rescue!

It’s a well known (but often forgotten) fact that computer-represented numbers are not – in fact are very far from – real numbers. In particular, the main equivalence properties that the real numbers exhibit: closure, associativity, commutativity, distributivity, etc. simply do not apply for a large number of practically used data representations.

In our paper we take account of this discrepancy by automatically refactoring code to make it run faster (while tracking area and accuracy too). As a trivial example for the recurrence $x_n = 3 x_{n-1} + 1$ , a multiply and an add must be performed before executing the next iteration. But transforming this to $x_n = 9 x_{n-2} + 4$ gives much more time to execute. Combining this with expression balancing, etc., leads to a wide variety of possible implementations, which our tool can explore automatically while also proving the numerical properties of the transformed code. This latter point makes it quite unlike so-called “unsafe” optimisations commonly used in compilers such as -funsafe-math-optimizations in gcc.

HiPEAC 2016

I have just returned from HiPEAC, a major “roadshow” of primarily European research in high performance and embedded computing. As always, it was good to catch up with old friends and colleagues from around the world. I briefly describe below some of the highlights for me, as well as my contributions as part of a panel discussion held on Tuesday.

The keynote talk at PARMA-DITAM from my old friend Pedro Diniz on compiler transformations for resilience. A theme that kept coming up throughout the conference, and is also a key topic of the large project I have with Southampton, Manchester and Newcastle Universities.

Kirsten Eder’s talk at the ICT-ENERGY workshop on the work at Bristol on modelling power consumption in the XMOS architecture. An interesting choice, it turns out, because the communication in XMOS is transparent to the programmer, which makes power modelling at instruction level much more appealing.

Another old friend, Juergen Teich, gave a great keynote at IMPACT on extending scheduling to the parametric / symbolic case, in order to allow for the number and shape of a rectangular array of processing units to be determined at run time rather than at compile time. This aligns very nicely with our own drive to parametricity, in our case for run-time pipelining decisions in high-level synthesis.

Cristiano Malossi from IBM gave an interesting keynote at WAPCO on mixed precision conjugate gradient solvers, amongst other things. My former PhD student Antonio Roldao Lopes did an early investigation into how the precision of such a solver impacts its performance (lower precision, more iterations, but each iteration runs faster). It was interesting to see this extended to a mixed precision setting, where expensive parts of the algorithm are executed in low precision while cheap parts are executed in high precision.

I was invited to participate in a panel discussion in front of the audience at the WRC workshop. The other panelists were Aaron Smith, who is writing compilers for the Microsoft FPGA system now in the Bing search engine, Ronan Keryell, who does high-level design R&D at Xilinx, Peter Hofstee, from IBM, and Koen Bertels from TU Delft. The panel was moderated by another old friend, Juergen Becker. We were asked to discuss whether “the embedded revolution is dead” or whether it’s “just the microprocessor” whose death has come (!), and what the future holds. My perspective was that the embedded revolution has not started yet (how much computational intelligence is embedded in most things around us, and how much better could our lives be if there were more?), that the microprocessor is far from dead (though future microprocessors will look far more ‘FPGA-like’ as the number of cores expand beyond the point at which it becomes unreasonable to ignore the fact we’re computing in space), and that the future belongs to high performance embedded systems based on heterogeneous architectures. I don’t think there was much disagreement from other members of the panel. Juergen did his tongue-in-cheek best to start an IBM-and-Microsoft-versus-others fight by suggesting that it was actually datacentres which are dead, though I think there is a general consensus that the balance between local and cloud-based compute needs to be carefully evaluated in future system designs and that neither will kill the other.

Questions from the audience were interesting because people seem mostly worried about how to program these damn things! Which is good for those of us working on exactly this question, of course. The panel’s view was that with today’s compiler technology, the two main options were Domain Specific Languages and OpenCL. I made the point that architecture specialisation, and FPGA design in particular, is naturally aligned with static analysis – you get automatic specialisation by automatically understanding what you’re going to implement; I therefore ventured that future languages for heterogeneous computing, whatever they are, will be designed to make static analysis a simpler task.