FPGA 2016: Some Highlights

I’ve just returned from the ACM International Symposium on FPGAs, held – as usual – in sunny Monterey, California. This year I was Finance Chair of the conference, which meant I had less running about to do than the last two years, when I was Programme Chair in 2014 and then General Chair in 2015. I therefore took my time to enjoy the papers presented, which were generally of a high quality. Intel’s acquisition of Altera last year provided an interesting backdrop to the conference, and genuinely seems to have fired the community up, as more and more people outside the FPGA “usual suspects” are becoming very interested in the potential for this technology.

Below, I provide my impressions of the highlights of the conference, which necessarily form a very biased view based on what I find particularly interesting.


Architectures and Low-Level CAD

There were a couple of very interesting papers from Altera on their Stratix 10 device. These devices come with customisable clock trees, allowing you to keep clock distribution local to your clock regions. The results showed that for regions consisting of fewer than 1.6M LEs, it was better to use a configurable clock region than any fixed clock region, as the latter needs greater margining.

In a separate paper, Dave Lewis presented a detailed look at the Stratix 10 pipelined routing architecture, which gave a good insight into industrial architecture exploration. They explored a range of circuits and tried to identify the maximum retiming performance achievable around loops as a function of the number of pipelining registers inserted in the routing muxes. The circuit designs were discussed, but for me the most interesting thing about this innovation is the way it changes the tools: when a design can tolerate latency, place-and-route effort can be focused on loops rather than on the feed-forward portions of the design, which is exactly where tools should be focusing effort. The kind of timing feedback given to the user also changes significantly, and for the better.

Zgheib et al. from Paolo Ienne’s group at EPFL presented their FPRESSO tool, available at fpresso.epfl.ch, which harnesses standard-cell tools to explore FPGA architectures. Their tool is open source, and this paper won the best paper prize (becoming something of a habit for Paolo!). FPRESSO could be an interesting basis for future academic architecture exploration.

Safeen Huda from Jason Anderson’s group at U of T presented an interesting suggestion for how to suppress glitches, and hence save power, in FPGA circuits in the presence of PVT variation.

Davis demonstrated our work on run-time estimation of power on a per-module basis, part of the PRiME project, at the poster session, and I was pleased by the number of people who could see the value of run-time monitors for this purpose.


High Level Tools

Gao’s talk from my group on automatic optimisation of numerical code for HLS using expression rewriting was very well received, especially since he has made his tool available online at https://admk.github.io/soap/. We would be delighted to receive feedback on this tool.

Ramanathan’s presentation from my group on the case for work-stealing on FPGAs using atomic memory operations in OpenCL also seemed to generate quite a buzz at the discussion session afterwards.


Applications

There were quite a few application papers this year targeting Convolutional Neural Networks (CNNs), the application of the moment. Both of the full papers in this area (from Tsinghua and from Arizona State) emphasised the need for a low-precision fixed-point datapath, an approach I’ve been pushing in the FPGA compute space for the last 15 years or more. CNNs seem particularly amenable to this approach, tolerating precisions as low as 8 bits with little impact on classification accuracy. The work from Tsinghua University also took advantage of an SVD approach to reduce the amount of computation required. I think there’s promise in combining this with fixed-point quantisation, as first pointed out by Bouganis in his FCCM 2005 paper.
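To make the arithmetic concrete, here is a minimal C sketch of an 8-bit fixed-point multiply-accumulate of the kind a CNN datapath might use. This is my own illustration, not code from either paper: the Q1.7 format, the wide accumulator, and the saturation choices are assumptions made for the example.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative 8-bit fixed-point multiply-accumulate (Q1.7: 1 sign bit,
 * 7 fractional bits). Weights and activations are quantised to 8 bits;
 * the accumulator is kept wider to avoid overflow, and the result is
 * rescaled and saturated back to 8 bits at the end. */

typedef int8_t  q7_t;   /* Q1.7 fixed-point value */
typedef int32_t acc_t;  /* wide accumulator       */

static q7_t to_q7(float x)               /* quantise a real value to Q1.7 */
{
    int32_t v = (int32_t)(x * 128.0f + (x >= 0 ? 0.5f : -0.5f));
    if (v >  127) v =  127;
    if (v < -128) v = -128;
    return (q7_t)v;
}

static float from_q7(q7_t x) { return (float)x / 128.0f; }

/* Dot product of n quantised weights and activations. */
static q7_t dot_q7(const q7_t *w, const q7_t *a, int n)
{
    acc_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (acc_t)w[i] * (acc_t)a[i]; /* 8x8-bit products, wide sum */
    acc >>= 7;                            /* rescale back to Q1.7       */
    if (acc >  127) acc =  127;           /* saturate                   */
    if (acc < -128) acc = -128;
    return (q7_t)acc;
}

int main(void)
{
    float wf[4] = { 0.50f, -0.25f, 0.75f, 0.10f };
    float af[4] = { 0.20f,  0.40f, 0.60f, 0.80f };
    q7_t  w[4], a[4];
    for (int i = 0; i < 4; i++) { w[i] = to_q7(wf[i]); a[i] = to_q7(af[i]); }
    printf("quantised dot product ~ %f\n", from_q7(dot_q7(w, a, 4)));
    return 0;
}
```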


Overlays

The conference was preceded by the OLAF workshop run by John Wawrzynek and Hayden So. I must admit that I am not a huge fan of the idea of a hardware-implemented FPGA overlay architecture. I can definitely see the possible advantages of an overlay architecture as a conceptual device, a kind of intermediate format for FPGA compilation. I find it harder to make a compelling case for implementing that architecture in the actual hardware. However, if overlay architectures make programming FPGAs easier in those hard-to-reach areas (until we’ve caught up with our HLS technology!) and therefore expand the user base, then bring it on! A bit like floating point, in fact!

Stealing Work for Your FPGA

Tomorrow, my PhD student Nadesh Ramanathan gives his first conference presentation, at the ACM International Symposium on FPGAs, claiming a place for work-stealing on FPGAs (joint work with John Wickerson and Felix Winterstein). The short paper on which the presentation is based can be found here.

Nadesh argues that we should pursue lock-free approaches to load balancing on FPGAs, and shows that this can be implemented within Altera’s OpenCL framework. Initial results from an efficient K-means clustering algorithm that manipulates dynamic data structures demonstrate that the approach shows promise. As we move to put more and more irregular applications on FPGAs, this kind of methodology will become increasingly important.
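To give a flavour of the idea, here is a minimal C11 sketch of lock-free work claiming using a compare-and-swap. It uses stdatomic purely as a stand-in for the OpenCL atomic built-ins the real implementation targets, and the task-pool layout and function names are my own illustrative assumptions rather than Nadesh’s actual scheme.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Simplified lock-free dynamic load balancing: a shared pool of tasks
 * from which workers claim ("steal") work with a single atomic
 * compare-and-swap, rather than by taking a lock. */

#define NUM_TASKS 16

static atomic_int next_task = 0;  /* index of the next unclaimed task */

/* Returns a task id, or -1 if the pool is empty. */
static int steal_task(void)
{
    int t = atomic_load(&next_task);
    while (t < NUM_TASKS) {
        /* Try to claim task t; on failure, t is reloaded and we retry. */
        if (atomic_compare_exchange_weak(&next_task, &t, t + 1))
            return t;
    }
    return -1;
}

int main(void)
{
    int t;
    while ((t = steal_task()) != -1)
        printf("claimed task %d\n", t);
    return 0;
}
```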


Fields Can Make Your Loops Run Faster

On Tuesday, my PhD student Xitong Gao will present our work (joint with John Wickerson) on Automatically Optimizing the Latency, Area, and Accuracy of C Programs for High-Level Synthesis at the ACM International Symposium on Field-Programmable Gate Arrays.

Our starting point for this work is that people often want to express their algorithms in terms of real numbers, but then want an implementation in some finite-precision arithmetic on an FPGA after passing the results through a High-Level Synthesis tool.

One of the key limiting factors of the performance of an algorithm is the ability of the HLS tool to pipeline loops in the design in order to maximise throughput. It turns out that numerical properties of real numbers can come to the rescue!

It’s a well-known (but often forgotten) fact that computer-represented numbers are not – in fact are very far from – real numbers. In particular, many of the algebraic properties that make the reals a field, such as associativity and distributivity, simply do not hold for many practically used data representations.
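A tiny C example (my own, purely for illustration) makes the point for associativity in single precision:

```c
#include <stdio.h>

int main(void)
{
    /* 1.0f is far below the rounding granularity of 1e20f, so where the
     * parentheses fall changes the single-precision result. */
    float a = 1e20f, b = -1e20f, c = 1.0f;
    printf("(a + b) + c = %f\n", (a + b) + c);  /* prints 1.000000 */
    printf("a + (b + c) = %f\n", a + (b + c));  /* prints 0.000000 */
    return 0;
}
```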

In our paper we take account of this discrepancy by automatically refactoring code to make it run faster (while also tracking area and accuracy). As a trivial example, for the recurrence x_n = 3 x_{n-1} + 1, a multiply and an add must complete before the next iteration can begin. But substituting the recurrence into itself gives x_n = 3(3 x_{n-2} + 1) + 1 = 9 x_{n-2} + 4, so each value now depends only on the result from two iterations back, giving the multiply and add much more time to execute. Combining this with expression balancing, etc., leads to a wide variety of possible implementations, which our tool can explore automatically while also proving the numerical properties of the transformed code. This latter point makes it quite unlike the so-called “unsafe” optimisations commonly used in compilers, such as -funsafe-math-optimizations in gcc.
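Here is a minimal C sketch of the two forms of that recurrence. The loop structure is my own illustration of the idea; how much of the extra slack an HLS tool actually exploits when pipelining will depend on the tool.

```c
#include <stdio.h>

#define N 16

/* Original recurrence: each iteration needs the previous result, so a
 * multiply and an add sit on the loop-carried dependence and limit how
 * tightly an HLS tool can pipeline the loop. */
static float recurrence_original(float x0)
{
    float x = x0;
    for (int n = 1; n <= N; n++)
        x = 3.0f * x + 1.0f;          /* x_n = 3 x_{n-1} + 1 */
    return x;
}

/* Transformed recurrence: x_n = 3(3 x_{n-2} + 1) + 1 = 9 x_{n-2} + 4,
 * so each value depends on the result from two iterations back, giving
 * the multiply and add twice as long to complete.  (Exact in the reals;
 * in finite precision the two versions can round differently, which is
 * exactly what our tool analyses.) */
static float recurrence_transformed(float x0, float x1)
{
    float xm2 = x0, xm1 = x1, x = x1;
    for (int n = 2; n <= N; n++) {
        x   = 9.0f * xm2 + 4.0f;      /* x_n = 9 x_{n-2} + 4 */
        xm2 = xm1;
        xm1 = x;
    }
    return x;
}

int main(void)
{
    float x0 = 0.0f;
    float x1 = 3.0f * x0 + 1.0f;      /* first step of the original form */
    printf("original:    %f\n", recurrence_original(x0));
    printf("transformed: %f\n", recurrence_transformed(x0, x1));
    return 0;
}
```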

HiPEAC 2016

I have just returned from HiPEAC, a major “roadshow” of primarily European research in high performance and embedded computing. As always, it was good to catch up with old friends and colleagues from around the world. I briefly describe below some of the highlights for me, as well as my contributions as part of a panel discussion held on Tuesday.

The keynote talk at PARMA-DITAM came from my old friend Pedro Diniz, on compiler transformations for resilience. This was a theme that kept coming up throughout the conference, and is also a key topic of the large project I have with Southampton, Manchester and Newcastle Universities.

Kirsten Eder’s talk at the ICT-ENERGY workshop covered the work at Bristol on modelling power consumption in the XMOS architecture. This turns out to be an interesting choice, because communication in XMOS is transparent to the programmer, which makes power modelling at the instruction level much more appealing.

Another old friend, Juergen Teich, gave a great keynote at IMPACT on extending scheduling to the parametric / symbolic case, in order to allow for the number and shape of a rectangular array of processing units to be determined at run time rather than at compile time. This aligns very nicely with our own drive to parametricity, in our case for run-time pipelining decisions in high-level synthesis.

Cristiano Malossi from IBM gave an interesting keynote at WAPCO on mixed precision conjugate gradient solvers, amongst other things. My former PhD student Antonio Roldao Lopes did an early investigation into how the precision of such a solver impacts its performance (lower precision, more iterations, but each iteration runs faster). It was interesting to see this extended to a mixed precision setting, where expensive parts of the algorithm are executed in low precision while cheap parts are executed in high precision.

I was invited to participate in a panel discussion in front of the audience at the WRC workshop. The other panelists were Aaron Smith, who is writing compilers for the Microsoft FPGA system now in the Bing search engine, Ronan Keryell, who does high-level design R&D at Xilinx, Peter Hofstee, from IBM, and Koen Bertels from TU Delft. The panel was moderated by another old friend, Juergen Becker. We were asked to discuss whether “the embedded revolution is dead” or whether it’s “just the microprocessor” whose death has come (!), and what the future holds. My perspective was that the embedded revolution has not started yet (how much computational intelligence is embedded in most things around us, and how much better could our lives be if there were more?), that the microprocessor is far from dead (though future microprocessors will look far more ‘FPGA-like’ as the number of cores expands beyond the point at which it becomes unreasonable to ignore the fact we’re computing in space), and that the future belongs to high performance embedded systems based on heterogeneous architectures. I don’t think there was much disagreement from other members of the panel. Juergen did his tongue-in-cheek best to start an IBM-and-Microsoft-versus-others fight by suggesting that it was actually datacentres which are dead, though I think there is a general consensus that the balance between local and cloud-based compute needs to be carefully evaluated in future system designs and that neither will kill the other.

Questions from the audience were interesting because people seemed mostly worried about how to program these damn things! Which is good for those of us working on exactly this question, of course. The panel’s view was that with today’s compiler technology, the two main options were Domain Specific Languages and OpenCL. I made the point that architecture specialisation, and FPGA design in particular, is naturally aligned with static analysis – you get automatic specialisation by automatically understanding what you’re going to implement; I therefore ventured that future languages for heterogeneous computing, whatever they are, will be designed to make static analysis a simpler task.

Sizing Custom Parallel Caches

On Wednesday, my PhD student Felix Winterstein will present his paper Custom-Sized Caches in Application-Specific Memory Hierarchies at FPT 2015 in New Zealand. This is joint work with MIT and Intel, specifically the team who developed LEAP.

Imagine that your FPGA-based algorithm needs to access data. Lots of data. What do you do? Store that data in SDRAM and spend ages carefully thinking about how to make efficient use of scarce on-chip memory to get the best performance out of your design. What if that whole process could be automated?

Felix has been working for a while on a method to do just that. He’s produced some incredibly exciting work, making use of the latest developments in software verification to devise an automatic flow that analyses heap-manipulating software and determines how it should be efficiently implemented in hardware. At FPGA in February this year, we were able to show how our analysis could automatically design a parallel caching scheme, and know – prove, even! – when costly coherence protocols could be avoided. The remaining piece of the puzzle was to determine automatically, within such a custom parallel caching scheme, how much on-chip memory to allocate to each parallel cache. This is what we will present on Wednesday.


Where High Performance Reconfigurable Computing Meets Embedded Systems

I have just returned from attending HiPEAC 2015 in Amsterdam. As part of the proceedings, I was asked by my long-time colleague Juergen Becker to participate in a panel debate on this topic.

Panels are always good fun, as they give one a chance to make fairly provocative statements that would be out of place in a peer-reviewed publication, so I took up the opportunity.

In my view, there are really two key areas where reconfigurable computing solutions provide an inherent advantage over most high performance computational alternatives:

  • In embedded systems, notions of correct or efficient behaviour are often defined at the application level. For example, in my work with my colleagues on control system design for aircraft, the most important notion of incorrect behaviour is “would this control system cause the aircraft to fall out of the sky?” An important metric of efficient behaviour might be “how much fuel does the aircraft consume?” These high-level specifications, which incorporate the physical world (or models thereof) in its interaction with the computational process, allow huge scope for novelty in computer architecture, and FPGAs are the ideal playground for this novelty.
  • In real-time embedded systems, it is often important to know in advance exactly how long a computation will take. Designs implemented using FPGA technology can often provide such guarantees – down to the cycle – where others fail. There is, however, a potential tension between some of the high-level design methodologies being proposed and the certainty of the timing of the resulting architecture.

Despite my best attempts to stir up controversy, there were very few dissenters to this view among the other members of the panel, and a general feeling that the world is no longer one of “embedded” versus “general purpose” computing. If indeed we can draw such divisions, they are now between “embedded / local” and “datacentre / cloud”, though power and energy concerns dominate the design process in both places.