HLS and Power: Some FPL Contributions

This week sees the IEEE International Conference on Field-Programmable Logic and Applications, in Ghent, Belgium.  Two of my team are attending to present their research papers on High Level Synthesis and on Run-time Power Estimation. In this post, I briefly summarise the key contributions of these papers.

High-Level Synthesis (HLS) is an important technology, which aims to automatically generate hardware designs from high-level (typically software) descriptions of their behaviour. In a previous blog post, I described some work from my PhD student Junyi Liu (joint with Sam Bayliss) on extending a common paradigm for analysis memory dependences – the polyhedral model – to a parametric version, for efficient pipelining in HLS. This week, Junyi presents an alternative use for the same parametric polyhedral HLS framework: automatic loop tiling (joint work with John Wickerson). Loop tiling is a very common compiler transformation – for example it is often used in matrix-matrix multiplication. The key advantage is to make sure that you only have a small set of data you’re working with at any given moment in time (traditionally for cache, in the FPGA context for embedded scratch-pad memories). The size of this working set can be traded off against the amount of off-chip memory traffic by selection of tile sizes. In a multi-dimensional loop, there are many possible options, and navigating this space is non-trivial. Junyi’s work provides a way to produce an explicit formula for both the memory requirement and the amount of off-chip data traffic required for any given tile size. He can then use nonlinear optimisation techniques to explicitly optimise the traffic subject to any given constraint on buffer size. This work is available as an open-source tool at https://github.com/Junyi-Liu/PolyTSS.

Back in 2016, some work I did with Eddie Hung, James Davis, Josh Levine, Ed Stott and Peter Cheung won the best paper prize at FCCM 2016. We showed that it is possible to use an online (recursive least squares) algorithm to learn the instantaneous power consumption of individual components in an FPGA design, with a view to some kind of run-time manager using this information. The solution worked by monitoring certain signal activity at run-time, but the missing part of the puzzle was which signals to monitor. James’s latest paper, STRIPE, with the same co-authors, answers this question. It turns out that the answer to this problem – as with so many in engineering (and life?) – lies in linear algebra. Golub and Van Loan describe in their classic textbook how QR factorisation can be used to heuristically select a subset of “nearly linearly independent” vectors from a larger set, and it’s this approach that tends to win out when given enough data to work with.

Floating-Point Workshop

This week I attended a very enjoyable workshop at Dagstuhl on Analysis and Synthesis of Floating-Point Programs, organised by Eva Darulova, Ally Donaldson, Zvonimir Rakamarić, and Cindy Rubio González. As I noted in my talk on Algorithm-Architecture Codesign, this is one of the very few meetings I’ve been to at which delegates spanned some of – what I consider to be – the most exciting research areas at the moment: computer architecture, high-level synthesis, computer arithmetic, numerical analysis, programming languages, and formal methods. Hats off to the organisers for getting all these people in the room at once!

Below, I just choose a few of the many great talks I heard as some personal highlights of the workshop for me. Presentations and – more importantly! – debates during and after presentations were of uniformly excellent quality.

Rubio González, Hollingsworth, and Rakamarić all presented work on precision tuning. This is a topic I did some of the early work on back in 2001, in the context of fixed-point arithmetic for DSP algorithms in hardware, and have maintained an interest in ever since [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16], over the years slowly migrating from the very special class of LTI algorithms implemented in fixed-point arithmetic to much broader classes of algorithm, including proving termination of loops under finite precision floating-point and making forays into real algebraic geometry. The topic is currently exhibiting a resurgence of interest, especially for floating-point software. Of the tools presented, I only personally have some experience with FPTuner.

Damouche discussed a tool for automatically rewriting floating-point code for accuracy improvement (joint work with Martel). I’ve been aware of the interesting work of Martel for a while, and it inspired our own SOAP tool and the associated papers [17,18,19] which extend the capability to hardware where one is concerned with performance, area, and numerical error. This theme was also picked up by Panchekha, who has developed some very interesting tools for diagnosing numerical instability and correcting it.

Another common theme of the workshop was reproducibility. Many researchers (and developers!) are unhappy about the non-reproducible nature of floating-point code: change compiler or platform, and you might get a different result – more insidiously, run again on the same platform and you still might get a different result. My colleague Miriam Leeser, Michaela Taufer, Ganesh Gopalakrishnan and Thomas Wahl all spoke eloquently on this topic. Wahl’s work considered the idea of stabilising programs against platform uncertainty. The work auto-inserts pragmas in order to only determinise certain “key” properties, like ensuring the same control flow path is taken each time the program is run.

Donaldson spoke about detecting compiler bugs by inserting (precise) semantics-preserving transformations, and highlighted several such bugs his group has found. A similar theme was picked up by Nagarakatte, who has found bugs in LLVM floating-point optimisations and is proposing a DSL to specify such optimisations precisely.

Jim Demmel gave an interesting summary of proposed changes being discussed by the IEEE-754 floating-point standards body (a new rounded addition useful as a component within reproducible summation), the BLAS standards body, and progress made since his outstanding paper with Dumitriu and Holtz. This paper, when first written, inspired me to pursue the implications of this research for hardware design with my former PhD student, Theo Drane, now at Cadence. For Theo’s thesis, we used Demmel’s work to develop a design flow for hardware implementation of polynomial evaluation, given desired relative error bounds.

Titolo discussed an abstract interpretation approach to proving numerical properties in floating-point, which is also the conceptual framework utilised by our SOAP tool. Aiming at a similar goal, my former postdoctoral researcher Victor Magron presented the approach we derived together (jointly with Donaldson) for bounding error in floating-point computation, closely aligned with the approach I initially kick-started with my former PhD student David Boland back in 2010. I’ve blogged informally about this approach before – see here. Rakamarić discussed the tool FPTaylor which also targets this problem, within the context of the SMACK toolflow developed at Utah. While she didn’t give a talk at the seminar, Eva Darulova, one of the organisers, has developed an excellent paper and tool in the same area, and it was a pleasure discussing her work with her.

There was much discussion at the workshop on the topic of tool inter-operability. Tatlock and Panchekha presented a format for numerical benchmarking, and are urging the research community to cohere around this – it could be very interesting.

There was industrial representation from both Imagination Technologies and Cadence. In Drane’s talk, he made – I believe – an important observation that the research community should take note of “in my experience, if a customer has the time to do in depth verification of their numerical hardware, they also have the time to customise their hardware.”

I had a very enjoyable few days at Dagstuhl, and I hope that we find a way to keep this community together and talking to each other.