Book Review: Complexity and Real Computation

Over several holiday and train journeys, I’ve been slowly reading and digesting this book by Blum, Cucker, Shub, and Smale. It’s one of a number of books I’ve read that are definitively not in my research field – hence holiday reading – and yet touches it tangentially enough to get me excited. I’ve been interested in computation defined in terms of real numbers ever since my own PhD thesis back in 1998-2001, which looked at ways to take a specific class of computation defined on the reals and synthesise hardware circuits operating in very efficient fixed-point arithmetic. While my own work has often involved specifying computation on the reals, including recently exploiting the differences between real and finite precision computation for circuits, this book takes a fundamental look at computation over the reals themselves. In particular, it looks at what happens to notions of computability and complexity if you consider real addition, multiplication, division, and subtraction as basic operators in a computation model: what is computable, what resulting complexity bounds can be proved? Ever since my first exposure to algebraic geometry, especially real algebraic geometry, I see it everywhere – and have even made some small forays [1,2] into harnessing its power for hardware design. It’s a delight to see computability and algebraic geometry coming together so neatly in this text.

Part I: Basic Development

The authors define a computational model that is very natural for analysis: a computation is described by a control-flow graph made up of input nodes, computation nodes (polynomial or rational maps), branch nodes (testing equality or inequality, depending on whether the ring / field being computed over is ordered), and output nodes. For the remainder of this review, I will consider only computation over the reals, though some of the results are more general. Theorems naturally flow from this setting. My favourites: the halting set of a machine is a countable union of semialgebraic sets; there are “small” systems of polynomial inequalities, of low degree, describing the behaviour of a machine up to T time steps. As a by-product, the first practical uses of this theory are illustrated, e.g. a proof that the Mandelbrot set is not decidable over {\mathbb R}. The interplay between the uncountable (program state) and the countable (program counter state) leads to some intricate and enjoyable analysis.
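To give a flavour of the model, here is my own illustrative sketch in C (not code from the book, and of course a C double only approximates an exact real): a “machine over the reals” whose computation nodes apply the polynomial map z \mapsto z^2 + c, whose single branch node tests an inequality, and whose only discrete state is the program counter.

void escape_machine(double cx, double cy)      /* input node: the point c */
{
  double zx = 0.0, zy = 0.0;
  for (;;) {
    double tx = zx * zx - zy * zy + cx;        /* computation nodes: real and */
    double ty = 2.0 * zx * zy + cy;            /* imaginary parts of z^2 + c  */
    zx = tx;
    zy = ty;
    if (zx * zx + zy * zy > 4.0)               /* branch node: an order test over the reals */
      return;                                  /* output node: halt (the orbit escaped)     */
    /* points inside the Mandelbrot set never reach the output node */
  }
}

The machine halts exactly when the orbit escapes the radius-2 disk, so its halting set is the complement of the Mandelbrot set – a countable union of semialgebraic sets, one for each escape time – which is precisely the structure described by the theorems above.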

Chapter 3 generalises classical results for universal Turing machines, typically expressed in terms of unbounded tapes with binary storage at each location, to the case where an element of an arbitrary ring is stored at each location (imagine a real number stored at each location). Additional shift operations are introduced to deal with moving along the tape. It is shown that operating over unbounded sequences allows algorithms to be uniform across dimensionality, but adds no computational power for any fixed dimensionality. In finite time we can only explore a finite portion of the tape, so we still retain the semi-algebraic (or countable union of semi-algebraic) halting sets derived in Chapter 2.

Chapter 4 extends the usual complexity-theoretic classes P and EXP to computation over a ring. Computation over {\mathbb Z}_2 corresponds to the classical case. There are no great surprises here for a reader who has read a basic text in computational complexity together with Chapters 1-3 of this book. The authors round off the chapter by using this new machinery to cast results of Gödel and Tarski in an interesting new light. For example, Gödel’s result becomes “there exist definable undecidable sets over {\mathbb Z}”, while Tarski’s becomes “every definable set over {\mathbb R} is decidable over {\mathbb R}”. Questions of definability are then put off until the very end of this book – and this review.

Chapter 5 shows how the classical NP, NP-complete and NP-hard definitions carry forward into computation over a general ring. Then, specialising the results to particular rings and particular costs of elementary operations (unit? dependent on size of numbers?), the NP-completeness of various key problems is shown. In particular, it is shown that the classical 3-SAT NP-completeness result can be obtained by working over the ring {\mathbb Z}_2. The chapter ends with an interesting discussion of the P ?= NP question over various rings and with various costs. It’s interesting, for example, that the problem of determining whether a degree-4 polynomial has a zero is NP-complete over {\mathbb R}, while the corresponding question for degree 3 is in P – analogous to the situation with 3-SAT versus 2-SAT in the classical setting.

Chapter 6 moves on to look specifically at machines over {\mathbb Z} with bit cost, i.e. the bigger the number, the more it costs to operate on it. It is shown that these machines are polynomially equivalent to those operating over {\mathbb Z}_2 with unit cost, i.e. anything you can write in terms of algebraic operations over {\mathbb Z} you can also write under the classical Turing model of computation, and vice versa, with only polynomial blow up in execution steps. This is not too surprising, and ultimately derives from the standard possibility of writing an integer as its binary expansion. The chapter then takes a detour via Farkas’s Lemma to show that linear programming feasibility (LPF) is in NP, a rather intricate result I first saw in Schrijver’s encyclopaedic book when working with my former PhD student Qiang Liu on discrete linear algebra for hardware memory optimization. It is noted that LPF’s status depends strongly on the choice of computation model and ring: in the (equivalent) classic setting of integers with bit cost it is NP-complete, over rationals with bit cost it is in P, while its status over {\mathbb R} is apparently open.
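For readers who haven’t met it, one standard statement of Farkas’s Lemma (it has several equivalent forms; this phrasing is mine) is that for a matrix A and vector b, exactly one of the following two systems has a solution:

Ax = b, \; x \ge 0 \qquad \text{or} \qquad A^{\mathsf T} y \ge 0, \; b^{\mathsf T} y < 0.

Either the system is feasible or there is an explicit, checkable witness of infeasibility – the kind of dichotomy that underlies certificate-style arguments such as the NP membership result above.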

Part II: Some Geometry of Numerical Algorithms

This part of the book turns to look at specific algorithms such as Newton’s method and a homotopy method for nonlinear optimization. In each case, complexity results are given in terms of the number of basic real arithmetic operations required. There is some fun analysis and quite a clear exposition of homotopy methods, which I’ve worked with before in the context of convex optimization – especially barrier methods for linear and convex quadratic programming – something I looked at quite carefully when trying to design FPGA-based computer architectures for this purpose.
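As a reminder of what is being counted, here is a bare-bones Newton iteration for a single real equation f(x) = 0 – my own sketch, not the book’s pseudocode. Each step costs a fixed handful of real arithmetic operations (plus the function and derivative evaluations), and the complexity question is how many steps are needed to reach a given accuracy from a given starting point.

#include <math.h>

/* Newton's method for f(x) = 0: repeatedly apply x <- x - f(x)/f'(x). */
double newton(double (*f)(double), double (*df)(double),
              double x0, double tol, int max_iter)
{
  double x = x0;
  for (int i = 0; i < max_iter; i++) {
    double fx = f(x);
    if (fabs(fx) < tol)                /* close enough to a zero: stop */
      break;
    x -= fx / df(x);                   /* one Newton step              */
  }
  return x;
}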

The chapter on Bézout’s Theorem assumed a rather more thorough grounding in differential topology than I have, and I wasn’t entirely sure how it sat with the rest of the work, as the constructive / algorithmic content of the theorems was less apparent to me.

The final chapters of this part of the book took me back to more familiar territory – first introduced to me by the outstandingly clear encyclopaedic textbook by Nick Higham – the discussion of condition numbers for linear equations and their relationship to the loss of precision induced by solving such systems (roughly, the relative error in the right hand side of Ax = b gets amplified by the condition number of the matrix A). What was interesting here is the authors’ probabilistic analysis – looking at the expected condition number if the matrix elements are iid Gaussian variables, for example. I have skimmed another book on this very topic, also co-authored by Cucker, and I’m waiting to have the time to read it properly. Condition for polynomial problems is covered, leading to complexity results for logarithmic barrier interior point methods for linear programming over \mathbb{Q} (spoiler: it’s in P).
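In its simplest form – perturbing only the right-hand side b of Ax = b – the bound I have in mind is the standard one (my paraphrase, not a quotation from the book):

\frac{\lVert \delta x \rVert}{\lVert x \rVert} \;\le\; \kappa(A)\, \frac{\lVert \delta b \rVert}{\lVert b \rVert}, \qquad \kappa(A) = \lVert A \rVert \, \lVert A^{-1} \rVert.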

Part III: Complexity Classes over the Reals

After Part II, which was only interesting to me in parts, I started to get excited again by Part III. Lower bounds on the computational complexity of various problems are derived here based on some very interesting ideas. The basic scheme is to bound the number of connected components of a (semi-)algebraic set in terms of the degree of the polynomials and the number of variables involved. The number of connected components can then be used to lower-bound the number of steps an (algebraic) machine must take to decide membership of that set.

In Chapter 17, the framework is extended to probabilistic machines, and the standard definition of BPP is extended to computation over an arbitrary ring, giving BPP_{R}. The chapter then goes on to show that probabilistic machines over \mathbb{R} can be simulated by deterministic machines over \mathbb{R} without blowing up execution time, i.e. that P_\mathbb{R} = BPP_\mathbb{R}, whereas this is unknown for the classical setting. The essence of the argument makes use of a special real number (shown, non-constructively, to exist) encoding all the different “coin flips” that a probabilistic program makes during its execution, which is then encoded as a machine constant in the simulating machine.

Parallelism – close to my heart – is covered in Chapter 18, through the introduction of the complexity classes PL^k_\mathbb{R}, parallel polylogarithmic time, for sets that can be decided in time O(\log^k n) using a polynomial number of processors, and PAR_\mathbb{R}, parallel polynomial time, for sets that can be decided in polynomial time using an exponential number of processors (definitions following an SPMD computational model). Algorithms are derived for matrix product and inversion, and a result is cited for parallel quantifier elimination over the reals.

Chapter 22 looks at the relative power of machines over \mathbb{Z}_2, i.e. the classical model, compared to real machines which take inputs restricted to be drawn from \{0,1\}. It’s fairly obvious that the latter setting can provide a lot of additional power, for example deciding membership of a set in \mathbb{Z}^\infty_2 can be done by encoding the characteristic function of the set as the digits of some real constant in the program. It is then shown that additive real machines (a restricted form of the machines discussed above, containing no multiplication or division) that additionally branch only on equality tests can solve exactly the same problems in polynomial time as in the classical model. This is unlike the case where branching on inequalities is allowed, in which the model does provide some additional computational power.
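To make the encoding trick concrete, here is the digit-extraction idea in C – purely my own illustration: a C double has only finitely many digits, whereas a machine over the reals stores an exact real constant.

#include <math.h>

/* The characteristic function of a set S of natural numbers, stored as the
   binary digits of a single real constant. Here omega = 0.101... in binary,
   so S = {0, 2} in this toy example. */
static const double omega = 0.625;

int in_S(int n)                            /* decide membership of n in S */
{
  double shifted = ldexp(omega, n + 1);    /* omega * 2^(n+1)             */
  return (int)fmod(floor(shifted), 2.0);   /* the (n+1)-th binary digit   */
}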

Chapter 23, on descriptive complexity, is interesting: the descriptive power of (a particular) first order logic extended with fixed point and maximisation operations, and of existential second order logic, are shown to correspond to a particular subclass of P_\mathbb{R} problems and to EXP_\mathbb{R}, respectively. In the latter case, I think this essentially says that known results in the classical case carry forward to computation over the reals despite the move away from a finite ring. I am not familiar enough with the computational complexity literature to comment on how the former result compares with the classical setting, and I didn’t feel that this point was very well described in the book.


Overall, I found this book a little eclectic, especially Part II, reflective of the fact that it has clearly been put together drawing from a variety of different primary sources spanning several decades. This should not be seen as a criticism. The choice of material was wonderful for me – as holiday reading at least! – tangentially touching so many different topics I’ve seen during my career and drawing me in. In places, the development was over an arbitrary ring, in places over \mathbb{C} and in places over \mathbb{R}, and although for some sections it was absolutely clear why, for others it was less clear. Overall, though, I would very much recommend anyone who likes to keep a foot in both the continuous and discrete worlds to take a look at this book.

Book Reviews: Popular Science

A friend recently reintroduced me to the genre of popular science – specifically, popular physics – something I had left behind in my mid teens. I remember voraciously reading the books of John Gribbin as a teenager, sitting in bed in my parents’ house thinking about black holes and quantum physics. Since then I’ve been more focused on – and interested in – my particular field.

However, following prompting, I’ve recently read two popular science books: Seven Brief Lessons on Physics by Carlo Rovelli, and A Beautiful Question by Frank Wilczek. Both are good books, though very different. The former is a brief, almost poetically written waltz through relativity, quantum physics, and more recent unification efforts – it can easily be read in a single afternoon. The second is a more weighty tome, a very personal view from a Nobel Prize winner of beauty and its embodiment in the symmetries present in nature. The chapters of the latter vary in terms of how much I enjoyed them, primarily because I felt some contained too much information not to take a mathematical approach, yet this was not forthcoming. Meanwhile the former was a joy to read because in its brevity it skimmed the surface and left me wanting to dig further, specifically into ideas of loop quantum gravity and into what modern neuroscience may or may not be able to tell us about the emergence of notions of “self”.

Book Review: Essential Topology

Topology is an area of mathematics with which I have no prior experience. I guess it was felt historically that engineers don’t really need topology, and this has filtered down the generations of study. Yet I’ve always found the ideas of topology intriguing, even if I have not deeply understood them. They seem to pop up in the widest variety of places. While running a math circle for primary school kids, we ended up discussing Euler’s Polyhedron Formula, and I realised that if I wanted to explore these ideas more deeply, I would need a proper grounding in topology. From the wonderful lectures of Tadashi Tokieda, which I watch with my son, to the more theoretical end of computer science I bump up against in my day-to-day research, topology seems to be everywhere. I found this book, Essential Topology by Martin D. Crossley, while browsing in a bookshop and decided to take it with me as holiday reading.

As presented, the book naturally falls into three parts: an introductory section, a section on basic topology, and a section on more advanced algebraic topology. Although short, the book is split into a good number of chapters – 11 in total – which makes the material more digestible. Moreover, I found the two “interludes” between sections of the book to be a great innovation – invaluable as a mechanism for orienting my reading, providing the appropriate informal guidance about the journey on which the text was taking me.

The introductory section begins by looking at the question of continuity from a rigorous perspective, bringing in both the epsilon-delta approach to continuity with which I am familiar, and an approach based on open sets with which I was not – but which is easily generalised to functions other than those from \mathbb{R} to \mathbb{R}. It then moves on to define topological spaces axiomatically, along with some standard topologies and bases for topologies.
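For concreteness, the two definitions being reconciled are, in the form I remember them (notation mine):

f \text{ is continuous at } x_0 \iff \forall \varepsilon > 0 \; \exists \delta > 0 : \; |x - x_0| < \delta \implies |f(x) - f(x_0)| < \varepsilon

f \text{ is continuous} \iff f^{-1}(U) \text{ is open for every open set } U

The second form says nothing about distances, which is exactly what lets it generalise beyond \mathbb{R}.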

The second section begins to explore some of the properties that can be proved using the equipment set up in the previous chapters: connectedness, compactness, the Hausdorff property, homeomorphisms, disjoint unions, product and quotient spaces. I enjoyed this section greatly.

The third section then goes on to discuss various more advanced topics, looking at various topological invariants that have been well studied. Some highlights for me included:

Chapter 6 discusses the idea of homotopy as an equivalence between maps: two maps f,g: S \rightarrow T are homotopic iff there is a continuous function F: S \times [0,1] \rightarrow T allowing for a kind of continuous deformation of one function into the other, i.e. with F(s,0) = f(s) and F(s,1) = g(s). This is shown to be rather a broad equivalence, for example all continuous functions on \mathbb{R} are homotopic. However, working with other topological spaces, things get quite interesting. A big part of the chapter is given over to working with the circle \mathbb{S}^1, where it is shown that there is a countable set of homotopy classes of functions from \mathbb{S}^1 to \mathbb{S}^1. This is very clearly described, and I seem to remember through the mists of time some of these issues cropping up informally in my undergraduate control theory courses (or, for that matter, when fathoming {\tt unwrap} in Matlab!). The chapter ends with a proof of the fantastically named “Hairy ball theorem,” from which – amongst other things – it follows that at any given point in time, there is a point on the Earth’s surface with zero (horizontal) wind speed!
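The reason any two continuous maps f, g : \mathbb{R} \rightarrow \mathbb{R} are homotopic is simply the “straight-line” homotopy; in the notation above,

F(s,t) = (1 - t)\,f(s) + t\,g(s), \qquad F(s,0) = f(s), \quad F(s,1) = g(s).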

Chapter 7 discusses the Euler characteristic as a topological invariant and introduces some interesting theorems. After introducing the idea of a ‘triangulable space’, it is stated that if two such spaces are homotopy equivalent then they have the same Euler number. More remarkable (to me!) is that, when restricted to surfaces, it is sufficient for two such surfaces to have the same Euler number and the same orientability in order for them to be homotopy equivalent. Unfortunately, the proofs of both these results are omitted – apparently they are rather involved – though references are given. I certainly appreciated the results collected in this chapter, but I found the exercises quite hard compared to other chapters. This is possibly partly because of the missing proofs, but also because I found it hard to visualise some aspects, e.g. the triangulation of a surface of genus 2; since such surfaces had only been defined informally in the preceding chapters, I could not (easily) fall back on a purely algebraic approach to the geometry. I did find it particularly interesting to see the Euler characteristic rigorously defined for a general class of spaces – the very definition whose informal appearance in the primary school math circle had brought me to this book in the first place!
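For the record, the definition in question is the alternating count of vertices, edges and faces (and higher simplices in general); a worked instance of the surface case, my own rather than the book’s:

\chi = V - E + F : \qquad \text{tetrahedron (a sphere): } 4 - 6 + 4 = 2, \qquad \text{torus: } \chi = 0.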

Chapter 8 discusses homotopy groups. The basic idea is that one can work with homotopy classes of maps from {\mathbb S}^n, the n-dimensional sphere, to a space X (actually pointed homotopy classes of pointed maps, but let’s keep things simple), and it turns out that these classes form a group structure: they have an identity element (constant maps), an addition operation, etc. I guess the purpose here is to bring out the topological structure of X, though the special role played by {\mathbb S}^n was not totally apparent to me – why is this particular space so fruitful to study? I wonder if any readers can comment on this point for me.

Chapter 9 provides an introduction to Simplicial Homology, the idea – roughly – that those combinations of simplices within a simplicial complex which ‘could’ form boundaries of other simplices but ‘do’ not, tell us something about the topology of the simplicial complex. Homology is then introduced in a different form (Singular Homology) in Chapter 10, and it is stated that the two approaches coincide for triangulable spaces.
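In symbols, the construction I took away is the usual quotient of cycles by boundaries, with \partial_n the boundary map from n-chains to (n-1)-chains:

H_n(K) \;=\; \ker \partial_n \,/\, \operatorname{im} \partial_{n+1}.

Elements of \ker \partial_n are the combinations that “could” bound (they have no boundary themselves), while elements of \operatorname{im} \partial_{n+1} are the ones that actually “do”.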

In these latter chapters of the book, some theorems tend to be stated without proof. The author is always careful to refer to other textbooks for proof, and I guess this is a necessary aspect of trying to introduce a very large subject in a brief textbook, but for me this made them somewhat less appealing than those in the first two sections. Nevertheless, I now feel suitably armed to begin looking at the world through a more topological lens. I wonder what I will see. For now I’m off to eat a coffee cup.

2016 SATS: Scaled Scores

On the 3rd of June England’s Department for Education released information about how to turn children’s answers in their KS1 tests into a scaled score. I am profoundly disappointed by the inconsistency between this document and the fanfare produced over the abolition of levels.

By introducing these scaled scores, the DfE has produced a new level of achievement known in their paper as “the expected standard on the test”. Note that this is quite a different thing to “the expected standard” defined in the Interim Teacher Assessment Frameworks at the End of KS1. Confused? You should be.

When moving from a level-based “best fit” assessment to the new assessment framework (see my earlier blog post for my concerns on this), a key element of the new framework was that a pupil is only assessed as having met the expected standard if they have attained “all of the statements within that standard and all the statements in the preceding standards” (boldface in original). As schools up and down the country struggle to produce systems capable of tracking pupil progress, I’ve been waiting to see how the Department intends to square this assessment approach with testing. Now the answer is in: they don’t.

Let me explain why. To simplify matters, let’s look at a stripped down version of the “expected standard” for KS1 mathematics. Let’s imagine it just consists of the first two statements:

  • The pupil can partition two-digit numbers into different combinations of tens and ones. This may include using apparatus (e.g. 23 is the same as 2 tens and 3 ones which is the same as 1 ten and 13 ones).
  • The pupil can add 2 two-digit numbers within 100 (e.g. 48 + 35) and can demonstrate their method using concrete apparatus or pictorial representations.

Leaving aside the apparatus question (guidance here states that children were not allowed apparatus in the test, so quite how that’s supposed to measure the expected standard is a mystery), the question remains – how do you convert an assessment of each individual strand into an assessment of whether the expected standard as a whole has been met? Let’s assume our test has a question to test each statement. The teacher assessment guidance is straightforward, if flawed: assess each strand individually, and only if all strands have been reached has the “expected standard” been reached. Translating this into our imaginary test, this would mean: mark each question individually, and only if all questions are above their individual pass marks has the standard been met. Is this the approach taken? Not at all. The approach taken is exactly that used under levels: add up all the marks for all the questions, and if the total is above a threshold then the “expected standard on the test” has been met, i.e. it is a best-fit judgement. Yes, that’s right, exactly the kind of judgement railed against by the Department for Education and the Commission on Assessment without Levels – we are back to levels. For better or for worse.
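To make the contrast concrete, here is the difference expressed as entirely hypothetical code – the pass marks and threshold are made up; only the shape of the two rules matters:

#define NUM_STRANDS 2

int pass_mark[NUM_STRANDS] = { 3, 3 };   /* hypothetical per-question pass marks (secure fit) */
int total_threshold        = 6;          /* hypothetical whole-test threshold (compensatory)  */

/* "Secure fit", as in the teacher assessment framework: every strand must be met. */
int meets_standard_secure(const int marks[NUM_STRANDS])
{
  for (int i = 0; i < NUM_STRANDS; i++)
    if (marks[i] < pass_mark[i])
      return 0;
  return 1;
}

/* "Compensatory", as in the test: only the total matters, so a strong strand
   can compensate for a weak one -- a best-fit judgement. */
int meets_standard_compensatory(const int marks[NUM_STRANDS])
{
  int total = 0;
  for (int i = 0; i < NUM_STRANDS; i++)
    total += marks[i];
  return total >= total_threshold;
}

A pupil scoring 6 and 0, for example, meets the compensatory standard but fails the secure-fit one – which is precisely the mismatch at issue.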

The total mismatch between the approach enforced in testing and the approach enforced in teacher assessment has obviously been spotted by the DfE because they themselves say:

The tests are also compensatory: pupils score marks from any part of the test and pupils with the same total score can achieve their marks in different ways. The interim teacher assessment frameworks are different.

Frankly, this is a mess.

Key Stage 2 test results are out on the 5th of July. I expect a similar approach then, except this time those results form the basis of the school accountability system.

The NAHT is quite right to call for school level data from the flawed 2016 assessments not to be used for external purposes and to question the whole approach of “secure fit”.

KAPow: Online Instrumentation of Power

Great news from the IEEE International Symposium on Field-Programmable Custom Computing Machines this week – our paper KAPow: A System Identification Approach to Online Per-module Power Estimation in FPGA Designs won the best paper award!

KAPow is all about trying to understand where your FPGA design is burning power, at run time, so that some higher level control entity can make intelligent decisions based on that information.

The problem is that we just don’t have the power sensors – we only know the total power consumed by the device, not the power consumed by each module in the design. So what to do?

Enter KAPow. Our tool (soon to be released publicly as an output from the PRiME project) will take RTL, instrument it, and return RTL that estimates its own power consumption.

The idea is fairly straightforward: automatically identify, at synthesis (compile) time, a subset of nets that you think are going to correlate well with the power consumption of a module; insert cheap counters to monitor their activity at run time; and then use an online recursive least squares algorithm to adaptively learn a model of each module’s power consumption from the overall power consumption of the chip.
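To give a flavour of the run-time part, here is a textbook recursive least squares update with a forgetting factor, sketched in C. This is my own simplified stand-in – the signal count, names and single lumped model are assumptions, and the paper’s identification procedure works per module and is rather more careful – but it shows the kind of adaptive fitting involved: each monitoring period, the counters supply x, the board-level sensor supplies the measured total power y, and the weights w are nudged towards a per-signal power cost.

#define N_SIGNALS 4   /* number of monitored activity counters (example value) */

typedef struct {
  double w[N_SIGNALS];               /* per-signal power weights being learned */
  double P[N_SIGNALS][N_SIGNALS];    /* inverse correlation matrix estimate    */
  double lambda;                     /* forgetting factor, e.g. 0.99           */
} rls_t;

void rls_init(rls_t *m, double lambda, double p0)
{
  for (int i = 0; i < N_SIGNALS; i++) {
    m->w[i] = 0.0;
    for (int j = 0; j < N_SIGNALS; j++)
      m->P[i][j] = (i == j) ? p0 : 0.0;   /* P starts as a large multiple of the identity */
  }
  m->lambda = lambda;
}

/* One update: x holds this period's activity counts, y the measured total power.
   Returns the prediction error before the update. */
double rls_update(rls_t *m, const double x[N_SIGNALS], double y)
{
  double Px[N_SIGNALS], k[N_SIGNALS], xPx = 0.0, y_hat = 0.0;

  for (int i = 0; i < N_SIGNALS; i++) {
    Px[i] = 0.0;
    for (int j = 0; j < N_SIGNALS; j++)
      Px[i] += m->P[i][j] * x[j];
    xPx   += x[i] * Px[i];
    y_hat += m->w[i] * x[i];           /* current power prediction */
  }

  double e = y - y_hat;                /* prediction error         */
  double denom = m->lambda + xPx;

  for (int i = 0; i < N_SIGNALS; i++)
    k[i] = Px[i] / denom;              /* gain vector              */

  for (int i = 0; i < N_SIGNALS; i++)
    m->w[i] += k[i] * e;               /* weight update            */

  for (int i = 0; i < N_SIGNALS; i++)  /* P <- (P - k (Px)^T) / lambda */
    for (int j = 0; j < N_SIGNALS; j++)
      m->P[i][j] = (m->P[i][j] - k[i] * Px[j]) / m->lambda;

  return e;
}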

The results are encouraging: power estimates are accurate down to about 5mW, and the infrastructure itself adds less than 4% to the total power consumed by the device.

Now, of course, the question is what to do with all this information that can now be collected at run-time. Watch this space!

Super Fast Loops

On Tuesday, my PhD student Junyi Liu presented our work (joint with John Wickerson) on Loop Splitting for Efficient Pipelining in High-Level Synthesis to the assembled audience at the IEEE International Symposium on Field-Programmable Custom Computing Machines in Washington DC.

A primary way in which FPGA applications tend to get their blindingly fast performance is through overlapping loop iterations in time – known variously as loop pipelining or software pipelining. These days, you can expect high-level synthesis tools to do this for you. Sometimes.

Unfortunately there are cases where the tools can’t squeeze out this performance. This paper addresses two such cases in a unified framework. The first is the case where some pesky loop iterations get in the way. Consider this trivial example:

for( int i=0; i<N; i++ )
  A[2*i] = A[i] + 0.5f;

In this case the early loop iterations are the problematic ones: A[4] (written in iteration 2) depends on A[2] (written in iteration 1), A[8] on A[4], and so on, so the dependence distances start out small. These tight dependences hinder pipelining, leading existing HLS tools to throw in the towel.

The second case is where there are loop invariant parameters that are not known until the loop executes. Consider the case:

for( int i=0; i<N; i++ )
  A[i+m] = A[i] + 0.5f;

Without knowing the value of m at compile time, the dependence structure is unknown – we might have no read-after-write dependences, tight read-after-write dependences, or the dependences might be so many iterations away that we just don’t care and can pipeline away to our heart’s content. A limited version of this latter issue was addressed in Junyi’s earlier paper, the predecessor of this work.

In the new FCCM 2016 paper, we show that both these cases can be analysed using a parametric polyhedral framework, and that we can automatically derive source-to-source transformations to significantly accelerate the loops in these cases. The end result? A push-button approach that could gain you a factor of more than 4x in performance if your pipelining is being stymied by pesky dependences.
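For the first example above, the kind of transformation involved looks roughly like this. It is a hand-written sketch: the real split point comes out of the parametric polyhedral analysis, not the constant D I have picked here.

#define D 8   /* hypothetical: dependence distance beyond which the pipeline is unconstrained */

/* Prologue: the early iterations with tight dependences, left unpipelined.
   The write in iteration i is read again in iteration 2*i, so the dependence
   distance is i iterations -- tight at the start of the loop, harmless later. */
for (int i = 0; i < D && i < N; i++)
  A[2*i] = A[i] + 0.5f;

/* Main loop: every loop-carried dependence now spans at least D iterations,
   so the HLS tool can pipeline freely. */
for (int i = D; i < N; i++)
  A[2*i] = A[i] + 0.5f;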


FPGA 2016: Some Highlights

I’ve just returned from the ACM International Symposium on FPGAs, held – as usual – in sunny Monterey, California. This year I was Finance Chair of the conference, which meant I had less running about to do than the last two years, when I was Programme Chair in 2014 and then General Chair in 2015. I therefore took my time to enjoy the papers presented, which were generally of a high quality. Intel’s acquisition of Altera last year provided an interesting backdrop to the conference, and genuinely seems to have fired the community up, as more and more people outside the FPGA “usual suspects” are becoming very interested in the potential for this technology.

Below, I provide my impressions of the highlights of the conference, which necessarily form a very biased view based on what I find particularly interesting.


Architectures and Low-Level CAD

There were a couple of very interesting papers from Altera on their Stratix 10 device. These devices come with customisable clock trees, allowing you to keep the clock distribution local to your clock regions. The results showed that for regions consisting of fewer than 1.6M LEs, it was better to use a configurable clock region rather than any fixed clock region, as the latter needs greater margining. In a separate paper, Dave Lewis presented a detailed look at the Stratix 10 pipelined routing architecture, which gave a good insight into industrial architecture exploration. They had explored a range of circuits and tried to identify the maximum retiming performance that could be achieved around loops as a function of the number of pipelining registers they need to insert in the routing muxes. The circuit designs were discussed, but for me the most interesting thing about this innovation is the way it changes the tools: focus can be placed on P&R for loops rather than feed-forward portions of the design in the case that the design can tolerate latency; this is exactly where tools should be focusing effort. The kind of timing feedback to the user also changes significantly, and for the better.

Zgheib et al from Paolo Ienne’s group at EPFL presented their FPRESSO tool, which harnesses standard cell tools to explore FPGA architectures. Their tool is open source and available online, and this paper won the best paper prize (becoming something of a habit for Paolo!). It could be an interesting basis for future academic architecture exploration.

Safeen Huda from Jason Anderson’s group at U of T presented an interesting suggestion for how to suppress glitches, for power reasons, in FPGA circuits in the presence of PVT variation.

Davis demonstrated our work on run-time estimation of power on a per-module basis, part of the PRiME project, at the relevant poster session, and I was pleased by the number of people who could see the value in run-time monitors for this purpose.


High Level Tools

Gao’s talk from my group on automatic optimisation of numerical code for HLS using expression rewriting was very well received, especially since he has made his tool available online. We would be delighted to receive feedback on this tool.

Ramanathan’s presentation from my group on the case for work-stealing on FPGAs using atomic memory operations in OpenCL also seemed to generate quite a buzz at the discussion session afterwards.



There were quite a few application papers this year targeting Convolutional Neural Networks (CNNs), the application of the moment. Both of the full papers in this area (from Tsinghua and from Arizona State) emphasised the need to use low-precision fixed-point datapaths, an approach I’ve been pushing in the FPGA compute space for the last 15 years or more. This application seems particularly amenable to the approach, tolerating precisions as low as 8 bits with little impact on classification accuracy. The work from Tsinghua University also took advantage of an SVD approach to reduce the amount of compute required. I think there’s some promise in combining this with the fixed-point quantization, as first pointed out by Bouganis in his FCCM 2005 paper.



The conference was preceded by the OLAF workshop run by John Wawrzynek and Hayden So. I must admit that I am not a huge fan of the idea of a hardware-implemented FPGA overlay architecture. I can definitely see the possible advantages of an overlay architecture as a conceptual device, a kind of intermediate format for FPGA compilation. I find it harder to make a compelling case for implementing that architecture in the actual hardware. However, if overlay architectures make programming FPGAs easier in those hard-to-reach areas (until we’ve caught up with our HLS technology!) and therefore expand the user base, then bring it on! A bit like floating point, in fact!

Stealing Work for Your FPGA

Tomorrow, my PhD student Nadesh Ramanathan gives his first conference presentation, at the ACM International Symposium on FPGAs, claiming a place for work-stealing on FPGAs (joint work with John Wickerson and Felix Winterstein). The short paper on which the presentation is based can be found here.

Nadesh argues that we should pursue lock-free approaches to load balancing for FPGAs, and shows that this can be implemented within Altera’s OpenCL framework. Initial results from an efficient K-means clustering algorithm that manipulates dynamic data structures demonstrate that this approach shows promise for the future. As we move to put more and more irregular applications on FPGAs, this kind of methodology will become increasingly important.


Fields Can Make Your Loops Run Faster

On Tuesday, my PhD student Xitong Gao will present our work (joint with John Wickerson) on Automatically Optimizing the Latency, Area, and Accuracy of C Programs for High-Level Synthesis at the ACM International Symposium on Field-Programmable Gate Arrays.

Our starting point for this work is that people often want to express their algorithms in terms of real numbers, but then want an implementation in some finite-precision arithmetic on an FPGA after passing the results through a High-Level Synthesis tool.

One of the key limiting factors of the performance of an algorithm is the ability of the HLS tool to pipeline loops in the design in order to maximise throughput. It turns out that numerical properties of real numbers can come to the rescue!

It’s a well-known (but often forgotten) fact that computer-represented numbers are not – in fact are very far from – real numbers. In particular, the basic algebraic properties that the real numbers enjoy – closure, associativity, commutativity, distributivity, etc. – simply do not hold for a large number of practically used data representations.

In our paper we take account of this discrepancy by automatically refactoring code to make it run faster (while tracking area and accuracy too). As a trivial example, for the recurrence x_n = 3 x_{n-1} + 1, a multiply and an add must complete before the next iteration can begin. But transforming this to x_n = 9 x_{n-2} + 4 gives twice as long for each multiply-add to execute before its result is needed. Combining this with expression balancing, etc., leads to a wide variety of possible implementations, which our tool can explore automatically while also proving the numerical properties of the transformed code. This latter point makes it quite unlike the so-called “unsafe” optimisations commonly used in compilers, such as -funsafe-math-optimizations in gcc.
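Spelled out in code, the example above looks roughly as follows – my own hand-worked version; the tool derives this kind of rewrite automatically and, crucially, also bounds how the rounding behaviour changes.

/* Original recurrence: each iteration must wait for the previous multiply-add. */
for (int n = 1; n < N; n++)
  x[n] = 3.0f * x[n-1] + 1.0f;

/* Transformed recurrence: x[n] now depends on x[n-2], so two iterations can be
   in flight at once. Exact over the reals, but the finite-precision rounding
   differs -- which is why the tool tracks accuracy as well as latency and area. */
x[1] = 3.0f * x[0] + 1.0f;               /* peel the first iteration */
for (int n = 2; n < N; n++)
  x[n] = 9.0f * x[n-2] + 4.0f;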

HiPEAC 2016

I have just returned from HiPEAC, a major “roadshow” of primarily European research in high performance and embedded computing. As always, it was good to catch up with old friends and colleagues from around the world. I briefly describe below some of the highlights for me, as well as my contributions as part of a panel discussion held on Tuesday.

The keynote talk at PARMA-DITAM from my old friend Pedro Diniz covered compiler transformations for resilience – a theme that kept coming up throughout the conference, and also a key topic of the large project I have with Southampton, Manchester and Newcastle Universities.

Kirsten Eder’s talk at the ICT-ENERGY workshop described the work at Bristol on modelling power consumption in the XMOS architecture. An interesting choice, it turns out, because the communication in XMOS is transparent to the programmer, which makes power modelling at the instruction level much more appealing.

Another old friend, Juergen Teich, gave a great keynote at IMPACT on extending scheduling to the parametric / symbolic case, in order to allow for the number and shape of a rectangular array of processing units to be determined at run time rather than at compile time. This aligns very nicely with our own drive to parametricity, in our case for run-time pipelining decisions in high-level synthesis.

Cristiano Malossi from IBM gave an interesting keynote at WAPCO on mixed precision conjugate gradient solvers, amongst other things. My former PhD student Antonio Roldao Lopes did an early investigation into how the precision of such a solver impacts its performance (lower precision, more iterations, but each iteration runs faster). It was interesting to see this extended to a mixed precision setting, where expensive parts of the algorithm are executed in low precision while cheap parts are executed in high precision.

I was invited to participate in a panel discussion in front of the audience at the WRC workshop. The other panelists were Aaron Smith, who is writing compilers for the Microsoft FPGA system now in the Bing search engine, Ronan Keryell, who does high-level design R&D at Xilinx, Peter Hofstee, from IBM, and Koen Bertels from TU Delft. The panel was moderated by another old friend, Juergen Becker. We were asked to discuss whether “the embedded revolution is dead” or whether it’s “just the microprocessor” whose death has come (!), and what the future holds. My perspective was that the embedded revolution has not started yet (how much computational intelligence is embedded in most things around us, and how much better could our lives be if there were more?), that the microprocessor is far from dead (though future microprocessors will look far more ‘FPGA-like’ as the number of cores expand beyond the point at which it becomes unreasonable to ignore the fact we’re computing in space), and that the future belongs to high performance embedded systems based on heterogeneous architectures. I don’t think there was much disagreement from other members of the panel. Juergen did his tongue-in-cheek best to start an IBM-and-Microsoft-versus-others fight by suggesting that it was actually datacentres which are dead, though I think there is a general consensus that the balance between local and cloud-based compute needs to be carefully evaluated in future system designs and that neither will kill the other.

Questions from the audience were interesting because people seem mostly worried about how to program these damn things! Which is good for those of us working on exactly this question, of course. The panel’s view was that with today’s compiler technology, the two main options were Domain Specific Languages and OpenCL. I made the point that architecture specialisation, and FPGA design in particular, is naturally aligned with static analysis – you get automatic specialisation by automatically understanding what you’re going to implement; I therefore ventured that future languages for heterogeneous computing, whatever they are, will be designed to make static analysis a simpler task.