Fields Can Make Your Loops Run Faster

On Tuesday, my PhD student Xitong Gao will present our work (joint with John Wickerson) on Automatically Optimizing the Latency, Area, and Accuracy of C Programs for High-Level Synthesis at the ACM International Symposium on Field-Programmable Gate Arrays.

Our starting point for this work is that people often want to express their algorithms in terms of real numbers, but then want an implementation in some finite-precision arithmetic on an FPGA, obtained by passing their code through a High-Level Synthesis (HLS) tool.

One of the key factors limiting the performance of an algorithm is the ability of the HLS tool to pipeline loops in the design in order to maximise throughput. It turns out that the properties of real-number arithmetic can come to the rescue!

It’s a well known (but often forgotten) fact that computer-represented numbers are not – in fact are very far from – real numbers. In particular, the main algebraic properties that the real numbers exhibit (closure, associativity, commutativity, distributivity, and so on) simply do not hold for a large number of practically used data representations.
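As a tiny illustration of this (my own, not taken from the paper), floating-point addition already fails associativity: grouping the same three operands differently gives different answers.

#include <stdio.h>

int main(void) {
    /* 1e8 and -1e8 are exactly representable as floats, but adding 1.0f
       to -1e8f rounds straight back to -1e8f, so the grouping matters. */
    float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("%f\n", (a + b) + c);  /* prints 1.000000 */
    printf("%f\n", a + (b + c));  /* prints 0.000000 */
    return 0;
}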

In our paper we take account of this discrepancy by automatically refactoring code to make it run faster (while tracking area and accuracy too). As a trivial example, for the recurrence x_n = 3 x_{n-1} + 1, a multiply and an add must complete before the next iteration can begin. But transforming this to x_n = 9 x_{n-2} + 4 doubles the time available for that computation, because the dependence now spans two iterations. Combining this with expression balancing, etc., leads to a wide variety of possible implementations, which our tool can explore automatically while also proving the numerical properties of the transformed code. This latter point makes it quite unlike the so-called “unsafe” optimisations commonly used in compilers, such as -funsafe-math-optimizations in gcc.
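To make the recurrence example concrete, here is a minimal hand-written sketch (not the output of our tool), assuming the sequence is stored in a float array x of length N:

/* Original recurrence: dependence distance 1, so each new element must
   wait for the previous multiply-add to finish. */
for (int i = 1; i < N; i++)
    x[i] = 3.0f * x[i-1] + 1.0f;

/* Transformed recurrence: x[i] now depends on x[i-2], a dependence
   distance of 2, so two iterations can overlap in the pipeline. One step
   is peeled off to seed the second chain. */
x[1] = 3.0f * x[0] + 1.0f;
for (int i = 2; i < N; i++)
    x[i] = 9.0f * x[i-2] + 4.0f;

Note that in finite-precision arithmetic the two versions are not bit-identical, which is exactly why the tool tracks accuracy alongside latency and area.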

Sizing Custom Parallel Caches

On Wednesday, my PhD student Felix Winterstein will present his paper Custom-Sized Caches in Application-Specific Memory Hierarchies at FPT 2015 in New Zealand. This is joint work with MIT and Intel, specifically the team who developed LEAP.

Imagine that your FPGA-based algorithm needs to access data. Lots of data. What do you do? Store that data in SDRAM and spend ages carefully thinking about how to make efficient use of scarce on-chip memory to get the best performance out of your design. What if that whole process could be automated?

Felix has been working for a while on a method to do just that. He’s been able to produce some incredibly exciting work, making use of the latest developments in software verification to devise an automatic flow that analyses heap-manipulating software and decides how it should be efficiently implemented in hardware. At FPGA in February this year, we were able to show how our analysis could automatically design a parallel caching scheme, and know – prove, even! – when costly coherence protocols could be avoided. The remaining part of the puzzle was to determine automatically – in a custom parallel caching scheme – how much on-chip memory should be allocated to each parallel cache. This is the piece we will present on Wednesday.


Run Fast! (When You Can): Data-Dependent Pipelining

This week sees the IEEE International Symposium on Field-Programmable Custom Computing Machines hosted in Vancouver, BC, Canada. I have sent my PhD student, Junyi Liu, to present some work we’ve done together with Sam Bayliss, my former research fellow and now a member of staff at Xilinx.

Junyi and I are interested in memory accesses about which we know something at compile time, but not quite enough to be able to have a complete picture of what they’ll be at run time. This means that existing analyses of code containing these accesses will typically result in suboptimal behaviour at run time.

The case we study in this paper is that of affine loop nests containing memory accesses whose addresses involve variables that are unknown at compile time but do not change value as the loop executes. The values of these variables can completely change the dependence structure of the loop, meaning that for some values the loop can be highly pipelined and run extremely fast, whereas for other values it must run very slowly. In the paper we give the trivial example below:


for (int i = LB; i <= UB; i++)
  A[i] = A[i + m] + 1;

The variable m does not change as the loop executes. For some values of m, each iteration of the loop cannot start until the previous one has finished: if m = -1, for example, each iteration reads the value written by the iteration immediately before it. For other values of m (say m = 0, where there is no cross-iteration dependence at all), we can pipeline the loop so that iterations overlap in time. So how can we discover this and take advantage of it automatically?

We present an approach based on an extension of the polyhedral model to parametric polyhedra, where the parameters are precisely these loop-invariant variables, like m, that are unknown at compile time. My former PhD student Qiang Liu (no relation), my colleague Kostas Masselos and I were among the first to bring the polyhedral model to FPGA-based computation in our FCCM 2007 paper (Qiang is now an associate professor at Tianjin University), but to the best of my knowledge parametric polyhedra remain relatively understudied in this area.

The basic idea of this new work is simple: use static analysis to identify regions of the parameter space where the loop can run fast, and synthesise lightweight detectors for these regions that are checked at run time on entry to the loop nest. Since my original work with the polyhedral model, high-level synthesis has become a commercial reality, so we build on top of Xilinx VivadoHLS, implementing our tool as a source-to-source transformation engine. This automatically inserts the appropriate VivadoHLS pragmas as well as the extra code for the runtime checks.
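As a rough illustration, here is a hand-written sketch of the shape of code such a flow could emit for the example above. It is not our tool’s actual output, and the condition m >= 0 is just one sufficient criterion (when m is non-negative, no iteration reads a value written by an earlier iteration); the real analysis derives the safe parameter region from the parametric polyhedral model.

void kernel(int *A, int LB, int UB, int m) {
    if (m >= 0) {
        /* Safe region of the parameter space: assert the absence of a
           loop-carried dependence and ask the HLS tool to pipeline. */
        for (int i = LB; i <= UB; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS DEPENDENCE variable=A inter false
            A[i] = A[i + m] + 1;
        }
    } else {
        /* Otherwise keep the conservative, unpipelined schedule. */
        for (int i = LB; i <= UB; i++) {
            A[i] = A[i + m] + 1;
        }
    }
}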

The results are very promising: some of our benchmarks show a 10x improvement in initiation interval (the number of cycles you must wait before starting the next iteration of the loop).

Coherence When You Need It

This week, my PhD student Felix Winterstein presented our work (joint with Intel and MIT) on how to customise memory systems to support parallel applications, at the FPGA 2015 conference.

If you are working with FPGAs, you have huge freedom to develop an arbitrary on-chip memory system to support your application. Tools are getting quite good at developing such systems for regular, array-based code. But pointer-manipulating programs that build, operate on, and destroy memory structures on the fly will not pass through commercial HLS tools, let alone produce good quality results.

We’ve shown that this issue can be addressed using a tool we’ve developed based on the theoretical foundation of Separation Logic. The tool automatically figures out when functional units can have private on-chip caches, which functional units need shared coherent caches, and – in the latter case – when commands can be reordered to aid parallelisation. It uses this analysis to automatically produce C source code that will pass through commercial HLS tools, providing the necessary hints and pragmas to parallelise the code, and automatically generates the on chip caches to support the parallel datapath.
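To give a flavour of the target (a toy example of my own, not taken from the paper), consider two functional units that each walk their own linked list. If the analysis can prove that the two heaps are disjoint – exactly the kind of fact separation logic expresses directly – each unit can be given its own private, non-coherent on-chip cache; if the lists could alias, a shared coherent cache is needed instead.

#include <stddef.h>

typedef struct node {
    int value;
    struct node *next;
} node_t;

/* Pointer-chasing kernel: traverses one list and sums its values. */
static int sum_list(const node_t *head) {
    int sum = 0;
    for (const node_t *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}

/* Intended to become two parallel functional units in hardware, one per list. */
int sum_two_lists(const node_t *list_a, const node_t *list_b) {
    return sum_list(list_a) + sum_list(list_b);
}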

I’m really very excited about this work: I view it as a big step towards efficient automatic hardware implementation of full-featured C code.

Overclocking-Friendly Arithmetic Needn’t Cost the Earth

This week, the FPT conference is being held in Shanghai. My PhD student, Kan Shi, will be presenting our work (joint with David Boland), entitled Efficient FPGA Implementation of Digit Parallel Online Arithmetic Operators.

In a previous paper, we showed how the ideas of Ercegovac’s “online arithmetic” – an arithmetic where computation proceeds from most significant digit to least significant digit, in contrast to the usual way we think about adding and multiplying – can be applied in the brave new world of clocking circuits faster than they “should” be able to run. The basic idea is simple: although sometimes beneficial, overclocking normal arithmetic – when it hurts – hurts bad, because errors tend to occur in the more significant digits. But if the flow of information were reversed, from more to less significant digits, then the errors should hurt less too.

And so it did. But with one problem: to allow most significant digits to be generated first requires a redundant number system – a way of representing numbers where each number has more than one representation, for example, where there are two different ways of writing “10”. This redundancy cost a lot of silicon area.
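As a small illustration of redundancy (my own, not from the paper), take a radix-2 signed-digit system with digit set {-1, 0, 1}: the value three can be written either as the digit string 0,1,1 or as 1,0,-1, and a most-significant-digit-first evaluator treats both the same way.

#include <stdio.h>

/* Evaluate a signed-digit string, most significant digit first. */
static int sd_value(const int *digits, int n) {
    int v = 0;
    for (int i = 0; i < n; i++)
        v = 2 * v + digits[i];
    return v;
}

int main(void) {
    int a[] = { 0, 1, 1 };    /* 0*4 + 1*2 + 1 = 3 */
    int b[] = { 1, 0, -1 };   /* 1*4 + 0*2 - 1 = 3 */
    printf("%d %d\n", sd_value(a, 3), sd_value(b, 3));  /* prints "3 3" */
    return 0;
}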

This new paper shows that, in modern FPGA architectures, with careful design the cost can be reduced significantly for adders. For multipliers, most-significant-digit-first arithmetic has the important benefit that if you only want the most significant digits of the result, you don’t need to compute the least significant ones. This is often the case: in regular binary arithmetic we frequently compute the 2n-bit result of an n-by-n-bit multiplication only to throw away the bottom n bits. We show in this paper that, with judicious design, the area penalties of the redundant arithmetic can be eliminated in this case.
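For concreteness (an everyday example of my own, not from the paper), this is exactly what a conventional fixed-point multiply does: it forms the full-width product and then discards the bottom half.

#include <stdint.h>

/* Q1.15 fixed-point multiply in ordinary binary arithmetic: compute the
   full 32-bit product, then keep only the top half. (Saturation for the
   -1 x -1 corner case is omitted for brevity.) */
static inline int16_t q15_mul(int16_t a, int16_t b) {
    int32_t full = (int32_t)a * (int32_t)b;  /* full 2n-bit product */
    return (int16_t)(full >> 15);            /* discard the bottom bits */
}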

This work removes one of the last remaining hurdles that stops online arithmetic being industrially viable.

Insure Yourself Against Your Own Inaccuracy

Last week saw the 19th World Congress of the International Federation of Automatic Control in Cape Town, South Africa. Three of my collaborators, Andrea Suardi, Stefano Longo and Eric Kerrigan, were there to present our joint paper Robust explicit MPC design under finite precision arithmetic.

The basic idea of this work is simple but interesting. Since we know we make mistakes, can we make decisions in a way that insures us against our own faulty decision making? In a control system, we typically want to control a “real world thing” – an aircraft, a gas turbine, etc. We and others have proposed very sophisticated ways to do this, but since with each finite-precision operation we might drift further away from the correct result, can we develop our algorithm in a way that provides guaranteed behaviour?

We show that since control engineering provides tools to control systems with uncertain behaviour, it’s possible to incorporate the uncertainty of the control algorithm itself into the model of the system we’re controlling, to produce a kind of self-aware system design.
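One way to picture this (my paraphrase, not the paper’s exact formulation): if the plant is modelled as x_{k+1} = A x_k + B u_k, and the finite-precision controller can only deliver its input up to a bounded error e_k, then designing instead for the perturbed model x_{k+1} = A x_k + B (u_k + e_k), with e_k treated as a bounded disturbance, is precisely the kind of uncertainty robust control already knows how to handle, so the resulting guarantees hold despite the controller’s own rounding errors.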

While the setting of this paper is so-called explicit model predictive control, there’s no reason why this general philosophy should not extend to other decision-making processes. It provides a rigorous way to think about the impact of decision quality on the overall behaviour of a system: since we can generally make decisions in any number of ways, ranging from “quick and dirty” to “slow and thoughtful”, we could decide how to decide based on ideas like this.