Industry Coheres around MX?

In recent years there has been a rediscovery (sometimes multiple times!) of block floating point, a number representation used for general-purpose compute in the early days of the electronic computer, and throughout the field’s history in specialist areas such as signal processing. The main reason for this modern rediscovery is the value of block floating point, and related ideas like block minifloat, in modern machine learning implementations.

Just like in the early days of floating point, a number of different approaches and implementations have flourished in recent years, so it was interesting to see the recent launch of the Open Compute Project OCP Microscaling Formats (MX) Specification, coauthored across Microsoft, AMD, Arm, Intel, Meta, NVIDIA and Qualcomm by a number of well-known figures (including Imperial’s very own Martin Langhammer). This document will be referred to as “the specification” in the rest of this blog post. At the same time, a paper was released on arXiv, supported by a GitHub repo, to show that when used for training deep neural networks (DNNs), MX-based implementations can produce high quality results.

In this blog post, I will briefly review a few of the features of the specification that I found particularly interesting as an “old hand” in fixed point and block floating point computer arithmetic, and raise some issues that the community may wish to take forward.

Firstly, I should say that it’s great to see efforts like this to raise the profile and awareness of block floating point and related number systems. My hope is that efforts like this will drive further work on the extremely important topic of efficient numerical hardware.

Storage

  1. Building block types. The focus of the specification is primarily one of storage, although certain computations are mentioned. The specification primarily defines the construction of a type for MX computations from three parameters: a (scalar) type for scalings, a (scalar) type for elements, and a (natural) number of elements. A value inhabiting this type is a scalar scaling X together with k elements, P_i, for 1 \leq i \leq k. The numbers represented (special values aside) are v_i = X \cdot P_i. This is a very flexible starting point, which is to be welcomed, and there is much to explore under this general hood. In all the examples given, the scalar data types are either floating point or integer/fixed point, but there seems to be nothing in the specification specifically barring other representations, which could of course go far beyond traditional block floating point and more recent block exponent biases.
  2. Concrete types. In addition to specifying a generic type constructor, as above, the specification also defines certain concrete instantiations, always with a power-of-two scaling stored in a conventional 8-bit biased integer exponent, and with either floating-point or integer elements. By my reading of the specification, however, an MX-compliant implementation does not seem to be required to implement any of these concrete types. A minimal sketch of what quantising a vector into one such block might look like appears after this list.
  3. Subnormals mandated. If one of the specified concrete formats is implemented, subnormal representations must be supported.
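To make the storage side concrete, here is a minimal Python sketch of quantising a vector into a single block in the spirit of the integer-element concrete formats: a shared power-of-two scaling plus k signed integer elements. The block size, element width, rounding and clamping choices are my own illustrative assumptions, not values mandated by the specification.

```python
import math

def quantise_block(values, k=32, elem_bits=8):
    """Quantise k real values into one block: a shared power-of-two scaling X
    and k signed integer elements P_i, so that v_i ~= X * P_i.

    Illustrative only: block size, element width, round-to-nearest and
    clamping are assumptions, not requirements of the MX specification."""
    assert len(values) == k
    max_mag = max(abs(v) for v in values)
    if max_mag == 0.0:
        return 1.0, [0] * k
    elem_max = 2 ** (elem_bits - 1) - 1             # e.g. 127 for 8-bit elements
    exp = math.ceil(math.log2(max_mag / elem_max))  # power-of-two scaling exponent
    X = 2.0 ** exp
    P = [max(-elem_max, min(elem_max, round(v / X))) for v in values]
    return X, P

def dequantise_block(X, P):
    """Recover the represented values v_i = X * P_i."""
    return [X * p for p in P]

if __name__ == "__main__":
    vals = [0.013 * i - 0.2 for i in range(32)]
    X, P = quantise_block(vals)
    print(X, max(abs(a - b) for a, b in zip(vals, dequantise_block(X, P))))
```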

Computation

1. Dot products

Apart from conversion into MX, the only MX operation discussed in the specification is the dot product of two MX vectors with the same number of elements. This appears to leave some basic operations undefined, for example simple addition of two MX vectors. (It is, of course, possible to provide a reasonable semantics for this operation consistent with the scalar base types, but it is not in the specification.)

The definition given of the “Dot” operation is interesting. The specification says “The following semantics must be minimally supported”, and gives the following equation:

C = \text{Dot}(A,B) = X^{(A)}X^{(B)}\sum_{i=1}^k P_i^{(A)} \times P_i^{(B)}

where P_i^{(A)} denotes the ith element of vector A, X^{(A)} denotes the scaling of A, etc.

To my mind, the problem with this definition is that \times (on elements), the (elided) product on scalings, and the summation over element products are all left undefined. The intention, of course, is that the dot product should approximate a real dot product, in the sense that \llbracket C \rrbracket \approx \llbracket X^{(A)} \rrbracket \llbracket X^{(B)} \rrbracket \sum_{i=1}^k \llbracket P_i^{(A)} \rrbracket \times \llbracket P_i^{(B)} \rrbracket, where the semantic brackets denote interpretation in the base scalar types and the arithmetic operations are over the reals. But the nature of the approximation is left unstated. Note, in particular, that it is not a form of round-to-nearest defined on MX, as hinted at by the statement “The internal precision of the dot product and order of operations is implementation-defined.” That sentence at least suggests an assumed implementation of the multi-operand summation via two-input additions executed in some (identical?) internal precision in some defined order, though this is a very specific – and rather restrictive – way of performing multi-operand addition in hardware. There’s nothing wrong with leaving this undefined, of course, but an MX-compliant dot product would then need a very clear statement of its approximation properties – perhaps something that can be considered for the standard in due course.
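As an illustration of one possible reading (and only one: the specification deliberately leaves internal precision and operation order implementation-defined), the sketch below multiplies the integer elements exactly, accumulates the products in arbitrary-precision integer arithmetic, and applies the two scalings once at the end. The block representation follows the sketch above and the names are my own.

```python
def mx_dot(XA, PA, XB, PB):
    """One possible semantics for Dot(A, B) on two MX blocks with integer
    elements: exact integer accumulation, scalings applied once at the end.

    The specification leaves the internal precision and the order of
    operations implementation-defined; this is just one concrete choice."""
    assert len(PA) == len(PB)
    acc = 0
    for pa, pb in zip(PA, PB):   # exact product and sum of the integer elements
        acc += pa * pb
    return XA * XB * acc         # power-of-two scalings, so this is just a shift
```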

2. General Dot Products

The specification also defines an operation called “DotGeneral” which is a dot product of two vectors made up of MX-vector components. The specification defines it in the following way:

C = \text{DotGeneral}(A,B) = \sum_{j=1}^n \text{Dot}(A_j, B_j)

“where A_j and B_j are the jth MX-compliant subvector of A and B.” As in the “Dot” definition, the summation notation is not defined in the specification; this definition also does not carry the “Dot” wording on internal precision and order of operations, but I am assuming it is intended to. It is also left undefined how the component subvectors are formed. I am assuming that the authors envisaged the vector A_j consisting of the jth block of k contiguous elements of the vector A, but of course there are many other ways this could be done (e.g. strided access); a sketch under the contiguous-block assumption follows.
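Under that contiguous-block assumption, DotGeneral might look like the following sketch, reusing mx_dot from above; as with Dot, the precision and ordering of the outer summation are my choice, not the specification’s.

```python
def mx_dot_general(blocks_A, blocks_B):
    """DotGeneral under the (assumed) contiguous-block interpretation:
    blocks_A and blocks_B are equal-length lists of (X, P) block pairs,
    and the per-block Dot results are summed."""
    assert len(blocks_A) == len(blocks_B)
    return sum(mx_dot(XA, PA, XB, PB)
               for (XA, PA), (XB, PB) in zip(blocks_A, blocks_B))
```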

I would be interested to hear from any blog readers who think I have misinterpreted the standard, or who think other components of the specification – not discussed here – are more interesting. I will be watching for uptake of this specification: please do alert me to interesting uses.

Supercharging Formal with E-Graphs

Designing circuits is not easy. Mistakes can be made, and mistakes made are not easily fixed. Even leaving aside safety-critical applications, product recalls can cost upwards of hundreds of millions of US$, and nobody wants to be responsible for that. As a result, the industry invests a huge effort in formally verifying hardware designs, that is, coming up with computer-generated or computer-assisted proofs of hardware correctness.

My Intel-funded PhD student Sam Coward (jointly advised by Theo Drane) is about to head off to FMCAD 2023 in Ames, Iowa, to present a contribution to making this proof generation process speedier and more automated, while still relying on industry-standard, tried and trusted formal verification tools. Together with his Intel colleague Emiliano Morini and then-intern (and Imperial student) Bryan Tan, Sam noticed that the e-graph techniques we have been using in his PhD for optimising datapath [1, 2] also have a natural application to verifying datapath designs, akin to – but distinct from – their long-time use as a data structure in SMT solvers.

The basic idea is straightforward. Ever since our earliest ARITH paper, we’ve developed a toolbox of Intel-inspired rewrite rules that can be applied to optimise designs across parametric bitwidths. Our optimisation flow, called ROVER, applies these to a single expression we wish to optimise, and then extracts a minimal-cost implementation. Our new paper asks a different question: what if we start, not from a single expression we wish to optimise, but from two expressions we’re trying to prove to be equivalent?
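In e-graph terms, the question becomes whether the two expressions end up in the same e-class after equality saturation. Our implementation builds on egg; as a drastically simplified stand-in, the Python sketch below keeps a union-find over a handful of expression strings, merges classes using hand-instantiated consequences of generic rewrite rules, and then asks whether the two starting expressions share a class. Treat it purely as an illustration of the proof obligation, not of the real data structure, which handles congruence and rule firing automatically.

```python
class UnionFind:
    """Minimal union-find: each expression string starts in its own class."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Two candidate implementations we would like to prove equivalent.
lhs, rhs = "2*(x+1)", "(x<<1)+2"

# Hand-instantiated consequences of generic rules such as a*(b+c) -> a*b+a*c
# and 2*a -> a<<1; in egg these would be discovered by rule application.
equalities = [
    ("2*(x+1)", "2*x+2*1"),
    ("2*x+2*1", "2*x+2"),
    ("2*x",     "x<<1"),
    ("2*x+2",   "(x<<1)+2"),
]

uf = UnionFind()
for a, b in equalities:
    uf.union(a, b)

print("same class:", uf.find(lhs) == uf.find(rhs))  # True: candidate proof found
```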

Following the ROVER approach I blogged about here, if the expressions are not equivalent then they should never end up in the same equivalence class, however many rewrite iterations we run. If they are equivalent then they may or may not end up in the same equivalence class: we provide no guarantees of completeness. So let’s think about what happens in each of these cases.

If they end up in the same equivalence class, then our tool thinks it has proven the two expressions to be equivalent. But no industrial design team will trust our tool just because it says so – and rightly so! Tool writers make mistakes too. However, work by Sam and other colleagues from the Universities of Washington and Utah has provided a methodology by which we can use our tool’s reasoning to export hints, in the form of intermediate proof steps, to a trusted industrial formal verification tool. And so this is how we proceed: our tool generates many, sometimes hundreds, of hints which help state-of-the-art industrial formal verification tools to find proofs more quickly – sometimes much more quickly – than they were able to beforehand.

So what if the two expressions don’t end up in the same equivalence class, despite actually being equivalent? No problem! We can still find two expressions whose proof of equivalence, via some external tool, makes the overall proof we’re aiming for much easier – crunching the proof down to its hard essence and stripping away what we can. The main technical idea here is how to extract these expressions from the e-graph. Rather than aiming for the best possible implementation cost, as we did with ROVER, we aim to minimise the distance between the two implementations that are left for the industrial tool to verify. The overall process is shown in the figure below, where “EC” refers to an equivalence check conducted by the industrially-trusted tool and S^* and I^* are the extracted “close” expressions.

We show one example where, without the hints provided by our tool, the industrial formal verifier does not complete within a 24-hour timeout. Even when results are achievable with the industrial formal verifier alone, we still see up to a 6x speedup in proof convergence.

This work forms a really nice “power use” of e-graphs. If you are at FMCAD 2023, come along and hear Sam talk about it in person!

Discovering Special Cases

My PhD student Sam Coward (jointly advised by Theo Drane from Intel) is about to head off on a speaking tour, where he will be explaining some of our really exciting recent developments. We’ve developed an approach that allows the discovery, exploitation, and generation of datapath hardware that works well in certain special cases. We’ve developed some theory (based on abstract interpretation), some software (based on egg), and some applications (notably in floating-point hardware design). Sam will be formally presenting this work at SOAP, EGRAPHS, and DAC over the coming weeks. In this blog post, I will try to explain the essence of the work. More detail can be found in the papers, primarily [1,2], with [3] for background.

We know that sometimes we can take shortcuts in computation. As a trivial example, we know that \text{abs}(x) can just be replaced by x for non-negative values of x. Special cases abound, and are often used in complex ways to create really efficient hardware. A great example of this is the near/far-path floating-point adder. Since the publication of this idea by Oberman and Flynn in the late 1990s, designs based on this have become standard in modern hardware. These designs use the observation that there are two useful distinct regimes to consider when adding two values of differing sign. If the numbers are close in magnitude then very little work has to be done to align their mantissas, yet a lot of work might be required to renormalise the result of the addition. On the other hand, if the two numbers are far in magnitude then a lot of work might be needed to align their mantissas, yet very little is required to renormalise the result of the addition. Thus we never see both alignment and renormalisation being significant computational steps.

Readers of this blog may remember that Sam, Theo and I published a paper at ARITH 2022 that demonstrated that e-graphs can be used to discover hardware that is equivalent in functionality, but better in power, performance, and area. E-graphs are built by repeatedly using rewrite rules of the form \ell \to r, e.g. \texttt{x + x} \to \texttt{2*x}. But our original ARITH paper wasn’t able to consider special cases. What we really need for that are conditional rewrite rules, e.g. x \geq 0 \Rightarrow \texttt{abs(x)} \to \texttt{x}, where I am using math script to denote the value of a variable and teletype script to denote an expression.

So we set out to answer:

  1. how can we deal with conditional rewrites in our e-graphs?
  2. how can we evaluate whether a condition is true in a certain context?
  3. how can we make use of this to discover and optimise special cases in numerical hardware?

Based on an initial suggestion from Pavel Panchekha, Sam developed an approach to conditionality by imagining augmenting the domain in which we’re working with an additional element, let’s call it *. Now let’s imagine introducing a new operator \texttt{assume} that takes two expressions, the second of which is interpreted as a Boolean. Let’s give \texttt{assume} the following semantics: \llbracket \texttt{assume(x,c)} \rrbracket = \llbracket \texttt{x} \rrbracket \text{ if } \llbracket c \rrbracket \text{ , and } * \text{, otherwise}. In this way we can ‘lift’ equivalences in a subdomain to equivalences across the whole domain, and use e-graphs without modification to reason about these equivalences. Taking the absolute value example from previously, we can write this equivalence as \texttt{assume( abs(x), x >= 0 )} \to \texttt{assume( x, x >= 0 )}. These \texttt{assume} function symbols then appear directly within the e-graph data structure. Note that the assume on the right-hand side here is important: we need both sides of the rewrite to evaluate to the same value for all possible values of x, and they do: for negative values they both evaluate to * and for non-negative values they both evaluate to x.
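A minimal executable sketch of these semantics, with * modelled as a Python sentinel and all names my own, might look like this:

```python
STAR = object()   # the extra element '*' adjoined to the domain

def assume(x, c):
    """assume(x, c) evaluates to x when the condition c holds, and to * otherwise."""
    return x if c else STAR

def lhs(x):
    return assume(abs(x), x >= 0)   # assume( abs(x), x >= 0 )

def rhs(x):
    return assume(x, x >= 0)        # assume( x, x >= 0 )

# Both sides agree for every value of x: x when x is non-negative, * otherwise,
# which is what makes the rewrite sound over the whole (lifted) domain.
assert all(lhs(x) == rhs(x) for x in range(-5, 6))
```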

So how do we actually evaluate whether a condition is true in a given context? This is essentially a program analysis question. Here we make use of a variation of classical interval arithmetic. However, traditionally interval arithmetic has been a fairly weak program analysis method. As an example, if we know that x \in [-1,1], then a classical evaluation of x - x would give me [-2,2], due to the loss of information about the correlation between the left and right-hand sides of the subtraction operator. Once again, our e-graph setting comes to the rescue! Taking this example, a rewrite \mathtt{x - x} \to \mathtt{0} would likely fire, resulting in zero residing in the same e-class as \mathtt{x - x}. Since the interval associated with \mathtt{0} is [0,0], the same interval will automatically be associated with \mathtt{x - x} by our software, leading to a much more precise analysis.
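The sketch below shows the gap between plain interval analysis and the interval obtained once \mathtt{x - x} and \mathtt{0} share an e-class; the function names are mine, and the real implementation lives inside egg’s e-class analysis machinery rather than in standalone code like this.

```python
def interval_sub(a, b):
    """Classical interval subtraction: [a_lo, a_hi] - [b_lo, b_hi]."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return (a_lo - b_hi, a_hi - b_lo)

def intersect(a, b):
    """Expressions in one e-class denote the same value, so their intervals
    can be intersected to tighten the analysis."""
    return (max(a[0], b[0]), min(a[1], b[1]))

x = (-1, 1)
naive = interval_sub(x, x)        # (-2, 2): the correlation between x and x is lost
zero = (0, 0)                     # interval of the literal 0
refined = intersect(naive, zero)  # (0, 0) once x - x and 0 share an e-class
print(naive, refined)
```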

This interaction between rewrites and conditions goes even further: a more precise analysis leads to the possibility to fire more conditional rewrite rules, as more conditions will be known to hold; firing more rewrite rules results in an even more precise analysis. The two techniques reinforce each other in a virtuous cycle:

A virtuous cycle: greater precision leads to more rewrites, which lead to greater precision.

Our technique is able to discover, and generate RTL for, near/far-path floating-point adders from a naive RTL implementation (left transformed to right, below), resulting in a 33% performance advantage for the hardware.

Left: A naive floating-point subtractor. Right: The subtractor produced by our software.

I’m really excited by what Sam’s been able to achieve, as I really think that this kind of approach has the potential to lead to huge leaps forward in electronic design automation for word-level designs.

Put Your Resource Where it’s Needed!

We all know that we have finite resources to accomplish a variety of tasks. Some tasks are easier than others, so we tend to spend less time and energy on them. Yet – when we survey the landscape of traditional neural networks – this observation does not apply to them. Take a convolutional neural network classifying pictures: does it do any less work (floating point operations, perhaps) if the picture is “obviously a cat” than if it’s a “cat hiding in the bushes”? Traditionally, the answer is a clear “no”, because there is no input dependence in the control-flow of the inference algorithm: every single image is processed in exactly the same way and takes exactly the same work.

But there are people who have looked to address this fundamental shortcoming by designing algorithms that put in less work if possible. Chief amongst these techniques is an approach known as “early exit neural networks”. The idea of these networks is that the network itself generates a confidence measure: if it’s confident early on, then it stops computing further and produces an early result.
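Stripped of all the machine learning detail, the control flow is just an input-dependent conditional; the sketch below shows the shape of it, with the two-stage structure, the softmax-based confidence measure and the threshold all being illustrative assumptions rather than the specifics of any particular network.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit_infer(x, stage1, exit_head, stage2, final_head, threshold=0.9):
    """Run stage1 on every input, but only run the expensive stage2 when the
    early exit is not confident enough: 'easy' inputs therefore cost less work."""
    features = stage1(x)
    early_probs = softmax(exit_head(features))
    if max(early_probs) >= threshold:        # confident: stop here
        return early_probs, "early"
    return softmax(final_head(stage2(features))), "late"
```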

My student Ben Biggs (jointly advised by Christos Bouganis) has been exploring the implications of bringing FPGAs to automatically accelerate such networks. There exist several good tool flows (e.g. fpgaConvNet, HPIPE) for automatically generating neural network implementations on FPGAs, but to the best of our knowledge, none of them support such dynamic control flow. Until now. Next week at FCCM 2023, Ben will be presenting our paper ATHEENA: A Toolflow for Hardware Early-Exit Network Automation.

The basic idea of ATHEENA is that early exit networks come in chunks, each of which is utilised a certain proportion of the time (e.g. 10% of images are hard to classify, 90% are not). So we make use of an existing neural network generator to generate each chunk, but allocate different FPGA resources to the different parts of the network, in order to maintain a steady throughput with no back pressure build-up points in the inference pipeline.

This principle is illustrated below. Imagine that the first stage of a network executes no matter what the input is. Then I can create a throughput/area Pareto curve for it, shown on the top left. Equally, I can create a throughput/area Pareto curve for the second stage, and then scale up the achievable throughput according to how frequently I actually need to use that stage: if it’s only needed 10% of the time, then I can support a classification rate 10x higher than the nominal throughput of the design returned by a standard DNN generator. By combining these two throughput/area curves at the same nominal throughput, I get an allocation of resources to each part of the network. Of course, if the nominal proportion p of ‘hard’ samples I used to characterise the network differs in practice from the actual proportion q, then I may end up with somewhat different behaviour, as indicated by the purple region.
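A toy version of this allocation step is sketched below: pick, from each stage’s Pareto curve, the smallest-area design point whose throughput meets that stage’s requirement, with the second stage’s requirement scaled down by the fraction of samples that actually reach it. The numbers are made up for illustration and are not from the ATHEENA paper.

```python
def cheapest_design(pareto, required_throughput):
    """pareto: list of (area, throughput) design points for one network stage.
    Return the smallest-area point meeting the required throughput, if any."""
    feasible = [(a, t) for a, t in pareto if t >= required_throughput]
    return min(feasible) if feasible else None

def allocate(stage1_pareto, stage2_pareto, p_hard, target_rate):
    """Stage 1 sees every sample; stage 2 only sees the 'hard' fraction p_hard,
    so its nominal throughput requirement is scaled down accordingly."""
    d1 = cheapest_design(stage1_pareto, target_rate)
    d2 = cheapest_design(stage2_pareto, p_hard * target_rate)
    return d1, d2

# Illustrative (area, throughput) points only.
stage1 = [(100, 50), (200, 120), (400, 300)]
stage2 = [(150, 40), (300, 100), (600, 250)]
print(allocate(stage1, stage2, p_hard=0.1, target_rate=300))
```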

In practice, it turns out that this works really well. The graph below from Ben’s paper shows that on a real example, running on a real FPGA board, he’s able to improve throughput by more than 2x for the same area or reduce area by more than half for the same throughput.

I’m delighted that Ben’s work has been nominated for the best paper prize at FCCM this year and that it has received all three reproducibility badges available: Open Research Objects, Research Objects Reviewed and Results Reproduced, indicating Ben’s commitment to high quality reproducible research in our field.

If you plan to be in Los Angeles next week, come and hear Ben talk about it!

Industry & Academia Unite!

I’ve just returned from a very busy week in California visiting Intel and AMD in Folsom before chairing the SpatialML annual meeting in Monterey and then attending the first in-person ACM FPGA conference, also in Monterey. It was a great trip, and I think the theme for me was just how well the tech industry and academia can work together, building on each other’s strengths.

Industry Visits

I was delighted to have the first chance to pay an in-person visit to the team built by my former PhD student Theo Drane at Intel in Folsom, where he has been able to gather an exceptionally strong and diverse mix of skills. Intel is sponsoring my PhD student Sam Coward (far left of photo), who is also a member of staff in the group. Together we have been working on some really exciting new ways to optimise datapath designs using word-level rewrite rules. While on the trip, we learnt that Sam’s DAC paper has been accepted, so there will be more I will want to say publicly about the development of this work soon!

With the team at Intel

Also in Folsom, I was able to meet members of the graphics team at AMD, who have recently agreed to support two new PhD students in my lab working on the interaction between microarchitecture and machine learning. I expect them to start in October 2023, with one co-supervised by Deniz Gündüz at Imperial and one co-supervised by Wayne Luk, also at Imperial. We’re planning exciting work, so watch this space.

SpatialML

In 2019 I was awarded a grant by EPSRC to bring together Imperial and Southampton (lead there Geoff Merrett) in the UK with Toronto (Anderson, Betz, Chow), UCLA (Cong, Ercegovac) and industry (Imagination, Xilinx, Maxeler, NVIDIA, Google, Intel, Arm & Corerain) to reimagine the full stack of machine learning hardware when computing across space, not just time. You can visit the centre website here and see an up-to-date feed of our work on Twitter. Due to COVID, the last time we were able to hold an all-hands meeting was just after kick-off, in February 2020, so it was truly a delight to be able to do so again this year. In the meantime, we have been very active and have expanded the centre to include Cornell & Cornell Tech (Abdelfattah, Zhang) as well as Sydney (Boland, Leong) and Samsung AI; since then, Maxeler has also been acquired by Groq and Xilinx by AMD. Whereas in 2020 I had primarily focused the workshop on faculty and industry talks, this year I refocused it on hearing about the progress that various PhD researchers had made since 2020, as well as on very successful two-minute lightning talks from all participants, to aid networking and to build a solid foundation for our researcher exchange programme. In addition, the new participants were given the chance to highlight their work in extended talks.

Members of the EPSRC-funded International Centre for Spatial Computational Learning

We had the following speakers:

At the meeting dinner, we were also able to celebrate the announcement that our member, Jason Cong, has just won the EDAA Achievement Award, to be presented at DATE 2023. We were able to end the workshop with a productive session, planning collaboration and researcher exchange between our sites over the coming year.

FPGA 2023

The ACM International Symposium on FPGAs (known as ‘FPGA’) is one of the key dates in my calendar. As a former chair of the conference, I sit on its steering committee, and I have always been a great supporter of the genuine community around this conference, combining a strong academic and industrial presence. It was wonderful to be back in person after two years of virtual FPGAs.

This year two of my collaborators were chairing: Zhiru Zhang (Cornell) was Program Chair and Paolo Ienne (EPFL) was General Chair. They put together a great event – you can read the program here. There were many highlights, but I would particularly mention the paper by my collaborator Lana Josipović and her PhD student Jiahui Xu and collaborators on the use of model checking to remove overhead in dynamically-scheduled HLS, closely related to my student Jianyi’s stream of work which I’ve described in other blog posts. I also had the privilege to serve on the best paper award panel this year, alongside Deming Chen and Viktor Prasanna, an award that went to some work from Jane Li‘s group for their development of a plug-and-play approach to integrating SSD storage into high-level synthesis, presented by the lead author Linus Wong, during the session I chaired on “High-Level Abstraction and Tools”.

It was a personal delight to see so many of my former PhD students, and the definite highlight was getting to meet the first child of one of my former students, born over the period I’ve been out of the loop due to COVID. My former student Sam Bayliss was in attendance and presenting a tutorial on programming the AMD AI Engines using MLIR. My former student Kan Shi (Chinese Academy of Sciences) was also presenting his work on on-chip debugging, together with my other former student David Boland (Sydney) and my current undergraduate student Yuhan Diao (Imperial College). Kan is also a part-time YouTuber, and his recent YouTube interview with me was mentioned to me by several attendees – I suspect it has had far wider reach than my publications!

Some of the Imperial College staff, students and alumni attending FPGA 2023

Overall, I have come away from the trip energised by the technical conversations, with far more ideas for collaboration than I’ll be able to follow up on, and with an even further strengthened belief in the immense promise of industrial/academic collaboration in my field.

Equivalent. But better.

Ever since primary (elementary) school, we’ve known that multiplying an integer by 10 is easy. No need for the long written calculations we churn through when doing multiplication by hand. Just stick an extra zero on the end, and we’re done. Multiplication is (relatively) hard, concatenation of digits is easy. And yet, in this case, they’re equivalent in terms of the operation they perform.

Similar equivalences abound in the design of digital hardware for arithmetic computation, and my PhD student Sam Coward (jointly supervised by Theo Drane from Intel) has been devising ways to automatically take advantage of such equivalences to make Intel hardware smaller and more efficient. He will be presenting our work on this topic at the main computer arithmetic conference, ARITH, next week. The conference will be online, and registration is free: https://arith2022.arithsymposium.org.

Let’s explore this example from our early school years a bit more. I’ll use Verilog notation \{\cdot,\cdot\} to denote a function taking two bit vectors and concatenating them together. Of course ‘multiplication by 10 is easy’ becomes ‘multiplication by 2 is easy’ in binary. Putting this together, we can write 2*x \simeq \{x,0\}, meaning that multiplication by two is the same as concatenation with a zero. But what does ‘the same as’ actually mean here? Clearly they are not the same expression syntactically, and one is cheap to compute whereas the other is expensive. What we mean is that no matter which value of x I choose, the value computed on the left-hand side is the same as the value computed on the right-hand side. This is why I’ve chosen to write \simeq rather than =. \simeq clearly defines a relation on the set of expressions. This is a special kind of relation called a congruence: it’s an equivalence relation, i.e. it is symmetric, transitive, and reflexive, but it also ‘plays well’ with function application: if x \simeq y then it necessarily follows that f(x) \simeq f(y) for every function symbol f. Like any equivalence relation on a set, \simeq partitions the set into equivalence classes: in our setting a class corresponds to expressions that can be freely interchanged without changing the functionality of our hardware, even if doing so changes the performance, area or energy consumption of the resulting design.
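For a fixed bit width, this particular congruence can even be checked by brute force; the sketch below confirms that, for every 8-bit x, multiplying by two and concatenating a zero give the same 9-bit result (the widths are my choice for illustration; the rewrites in the paper are justified across parametric bitwidths rather than checked exhaustively).

```python
def mult_by_two(x, n):
    """2 * x, viewed as an (n+1)-bit result of an n-bit input."""
    return (2 * x) & ((1 << (n + 1)) - 1)

def concat_zero(x, n):
    """Verilog-style {x, 0}: shift x left by one place, appending a zero bit."""
    return (x << 1) & ((1 << (n + 1)) - 1)

n = 8
assert all(mult_by_two(x, n) == concat_zero(x, n) for x in range(1 << n))
print("2*x == {x,0} for all", 1 << n, "values of x")
```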

Our colleagues Willsey, Nandi, Wang, Flatt, Tatlock and Panchekha recently published egg, a wonderful open-source library for building and exploring data structures known as ‘e-graphs’, specifically designed to capture these relations on expressions. Sam, Theo and I have developed a set of ‘rewrites’ capturing some of the important intuition that Intel designers apply manually, and encoded these for use within egg. To give you a flavour of these rewrites, here’s the table from Sam’s paper; you can see the example we started with is hiding in there under the name ‘Mult by Two’. The subscripts are used to indicate how many digits we’re dealing with; not all these rules are true for arbitrarily-sized integers, and Sam has gone to some lengths to discover simple conditions – listed here as ‘sufficient condition’ – for when they can be applied. This is really important in hardware, where we can use as few or as many bits as the job requires.

You can imagine that, when you have this many equivalences, they all interact and you can very quickly build up a very large set of possible equivalent expressions. e-graphs help us to compactly represent this large set.

Once our tool has spent enough time building such a representation of equivalences, we need to extract an efficient hardware implementation from it. This is actually a hard problem in itself, because common subexpressions change the hardware cost. For example, if I’m calculating (x+1)*(x+1) then I wouldn’t bother to calculate x+1 twice. We describe in our paper how we address this problem via an optimisation formulation. Our tool solves this optimisation and produces synthesisable Verilog code for the resulting circuit.
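The sketch below illustrates why extraction cannot simply cost each expression as a tree: costing (x+1)*(x+1) node-by-node counts the adder twice, whereas a DAG cost counts each distinct subexpression once. The cost numbers are invented for illustration, and our tool solves an optimisation formulation rather than enumerating candidates like this.

```python
OP_COST = {"+": 1, "*": 8}   # invented relative area costs

def tree_cost(expr):
    """Cost if every operator node is implemented separately."""
    if isinstance(expr, str):
        return 0
    op, *args = expr
    return OP_COST[op] + sum(tree_cost(a) for a in args)

def dag_cost(expr, seen=None):
    """Cost if identical subexpressions share one piece of hardware."""
    seen = set() if seen is None else seen
    if isinstance(expr, str) or expr in seen:
        return 0
    seen.add(expr)
    op, *args = expr
    return OP_COST[op] + sum(dag_cost(a, seen) for a in args)

e = ("*", ("+", "x", "1"), ("+", "x", "1"))   # (x+1)*(x+1)
print(tree_cost(e), dag_cost(e))               # 10 versus 9: the adder is shared
```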

So, does it generate good circuits? It certainly does! The graph below shows the circuit area and performance achievable before (blue) and after (orange) applying our tool flow ahead of standard logic synthesis tools. For this example, silicon area can be reduced by up to around 70% – a very significant saving.

Area/Delay tradeoff for a smoothing kernel before and after our work

I’ve really enjoyed working on this topic with Sam and Theo. Lots more exciting content to follow. In the meantime, please tune in to hear Sam talk about it next week.

Keeping the Pipelines Full

On the 16th May, my PhD student Jianyi Cheng (jointly advised with John Wickerson) will present his most recent paper “Dynamic C-Slow Pipelining for HLS” at FCCM 2022 in New York, the FPGA community’s first in-person conference since the pandemic hit.

Readers of this blog may remember that Jianyi has been working on high-level synthesis techniques that combine the best of dynamic scheduling with the best of static scheduling [FPGA 2020,FCCM 2021]. The general principle underlying his work is to make the most of what information we have at compile time to develop highly efficient custom architectures, while leaving what we don’t know at compile time to influence execution at run-time.

A very common design pattern in hardware acceleration is the idea of C-slow pipelining. Pipelining tends to be taught early in undergraduate programmes, but C-slow pipelining rarely gets mentioned. The idea arises in circuits with feedback loops. The basic approach to pipelining doesn’t really work in this setting: although we can insert multiple registers into the circuit, potentially improving clock frequency at the cost of latency just as in feed-forward circuits, we cannot then overlap computation to improve throughput as we would in the feed-forward case, because of the data dependency carried around the feedback loop.

C-slow pipelining essentially says “OK, but you can use the spare capacity induced by the pipeline registers to overlap computation of independent streams of data, if you happen to have them available.”
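A toy model of the idea is sketched below: a recurrence whose feedback loop has been cut into C pipeline slots can only accept a new input from any one stream every C cycles, but C independent streams can be interleaved round-robin so that every slot stays busy. This models the scheduling intuition only; real C-slowing happens at the register level in the circuit, and the names below are mine.

```python
def c_slow_run(streams, f, cycles):
    """Interleave C = len(streams) independent recurrences through one shared,
    C-slow pipelined feedback loop: each cycle, the loop accepts the next input
    of one stream, round-robin, so the pipeline slots are never left idle."""
    C = len(streams)
    state = [0] * C              # per-stream loop-carried value
    idx = [0] * C                # how far each stream has progressed
    for cycle in range(cycles):
        s = cycle % C            # stream occupying the pipeline slot this cycle
        if idx[s] < len(streams[s]):
            state[s] = f(state[s], streams[s][idx[s]])
            idx[s] += 1
    return state

# Three independent running sums sharing one 3-slow pipelined adder loop.
streams = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
print(c_slow_run(streams, lambda acc, x: acc + x, cycles=12))   # [10, 100, 1000]
```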

Our new paper introduces a dynamic HLS flow for C-slow pipelining. This is particularly valuable in a globally dynamic environment in which certain components nonetheless exhibit static control flow and can be efficiently pipelined: for example, a deep but predictable computation that must be repeated many times, but whose arrival times and data sources may change dynamically at runtime – a perfect fit for our prior work.

Jianyi presents a way to leverage the Boogie language and tool flow from Microsoft Research to automatically prove sufficient conditions for C-slowing to be correct. He is then able to introduce a new hardware component within the Dynamatic HLS tool that allows the schedule to “run ahead” to implement certain bounded out-of-order executions corresponding to C-slowing at the circuit level.

At the cost of a small area overhead in the region of 10%, this combined software analysis and hardware transformation is able to reduce wall-clock execution time by more than half compared to the vanilla dynamic scheduling approach.

If you’ll be in NYC in mid-May, go along and hear Jianyi’s talk!

Nonlinearity is Your Friend

My former PhD student Erwei Wang and I recently teamed up with some collaborators at UCL: Dovydas Joksas, Nikolaos Barmpatsalos, Wing Ng, Tony Kenyon and Adnan Mehonic and our paper has just been published by Advanced Science (open access).

Our goal was to start to answer the question of how specific circuit and device features can be accounted for in the training of neural networks built from analogue memristive components. This is a step outside my comfort zone of digital computation, but naturally fits with the broader picture I’ve been pursuing under the auspices of the Center for Spatial Computational Learning on bringing circuit-level features into the neural network design process.

One of the really interesting aspects of deep neural networks is that the basic functional building blocks can be quite diverse and still result in excellent classification accuracy, both in theory and in practice. Typically these building blocks include linear operations and a type of nonlinear function known as an activation function, the latter being essential to the expressive power of ‘depth’ in deep neural networks. This linear/nonlinear split is something Erwei and I, together with our coauthors James Davis and Peter Cheung, challenged for FPGA-based design, where we showed that the nonlinear expressive power of Boolean lookup tables provides considerable advantages. Could we apply a similar kind of reasoning to analogue computation with memristors?

Memristive implementation of the linear part of the computation in neural networks has been proposed for some time. The computation comes about naturally, using Ohm’s law to perform scalar multiplication and Kirchhoff’s Current Law to perform addition, resulting in potentially energy-efficient analogue dot product computation in a physical structure known as a ‘crossbar array’. To get really high energy efficiency, though, devices should have high resistance. But high resistance brings nonlinearity in practice. So do we back away from high-resistance devices so that we stay close to the mathematical abstractions used in our training algorithms? We argue not. Instead, we argue that we should make our mathematical abstractions more like our devices! After all, we need nonlinearity in deep neural networks. Why not embrace the nonlinearity we have, rather than compromise energy efficiency to minimise it in linear components, only to reintroduce it later in activation functions?
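In its idealised, perfectly linear form, the analogue dot product is nothing more than Ohm’s law plus Kirchhoff’s current law, as in the sketch below; this idealised model is exactly the abstraction the paper argues we should be prepared to move away from.

```python
def crossbar_column_current(voltages, conductances):
    """Idealised crossbar column: each input voltage V_i drives a device of
    conductance G_i (Ohm's law: I_i = G_i * V_i), and the shared column wire
    sums the currents (Kirchhoff's current law), yielding a dot product."""
    assert len(voltages) == len(conductances)
    return sum(g * v for g, v in zip(voltages, conductances))

# Weights encoded as conductances (siemens), activations as voltages (volts).
print(crossbar_column_current([0.2, -0.1, 0.5], [1e-6, 3e-6, 2e-6]))
```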

MNIST classification error trading off against power consumption

I think our first experiments in this area are a great success. We have been able not only to capture a variety of behaviours traditionally considered ‘non-ideal’ and harness them for computation, but also to show very significant energy efficiency savings as a result. You can see an example of this in the figure above (refer to the paper for more detail). In high power consumption regimes, you can see little impact of our alternative training flow (green & blue) compared to the standard approach (orange), but when you try to reduce power consumption, a very significant gap opens up between the two, precisely because our approach is aware of the impact this has on devices, and the training process learns to adapt the network accordingly.

We’ve only scratched the surface of what’s possible – I’m looking forward to lots more to come! I’m also very pleased that Dovydas has open-sourced our training code and provided a script to reproduce the results in the paper: please do experiment with it.

Pruning Circuits

On Tuesday, my former PhD student Erwei Wang (now at AMD) will present our recent paper “Logic Shrinkage: Learned FPGA Netlist Sparsity for Efficient Neural Network Inference” at the ACM International Symposium on FPGAs. This is joint work with our collaborator Mohamed Abdelfattah from Cornell Tech as well as James Davis, George-Ilias Stavrou and Peter Y.K. Cheung at Imperial College.

In 2019, I published a paper in Phil. Trans. Royal Soc. A, suggesting that it would be fruitful to explore the possibilities opened up when considering the graph of a Boolean circuit – known as a netlist – as a neural network topology. The same year, in a paper at FCCM 2019 (and then with a follow-up article in IEEE Transactions on Computers), Erwei, James, Peter and I showed how to put some of these ideas into practice by learning the content of the Boolean lookup tables that form the reconfigurable fabric of an FPGA, for a fixed circuit topology.

Our new paper takes these ideas significantly further and actually learns the topology itself, by pruning circuit elements. Those working on deep neural networks will be very familiar with the idea of pruning – removing certain components of a network to achieve a more efficient implementation. Our new paper shows how to apply these ideas to prune circuits made of lookup tables, leaving a simplified circuit capable of highly-efficient inference. Such pruning can consist of reducing the number of inputs of each Boolean LUT and, in the limit, removing the LUT completely from the netlist. We show that very significant savings are possible compared to binarising a neural network, pruning that network, and only then mapping it into reconfigurable logic – the standard approach to date.
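As a toy illustration of what reducing a LUT’s input count means at the netlist level, the sketch below removes an input from a truth table whenever the table turns out not to depend on it. The learning of which inputs (and which LUTs) can be removed is, of course, the hard part and is what the paper actually addresses; everything here is hypothetical scaffolding.

```python
def prune_lut_input(truth_table, n_inputs, i):
    """truth_table: list of 2**n_inputs output bits, indexed by the input word.
    If the output never depends on input bit i, return the smaller truth table
    with that input removed; otherwise return None (cannot prune)."""
    pruned = []
    for addr in range(1 << (n_inputs - 1)):
        # Re-insert a 0 and a 1 at bit position i and compare the two outputs.
        low = addr & ((1 << i) - 1)
        high = (addr >> i) << (i + 1)
        if truth_table[high | low] != truth_table[high | (1 << i) | low]:
            return None                      # output depends on input i
        pruned.append(truth_table[high | low])
    return pruned

# A 3-input LUT that actually computes a AND c (independent of input b, bit 1).
lut = [((addr >> 2) & 1) & (addr & 1) for addr in range(8)]
print(prune_lut_input(lut, 3, 1))   # 2-input truth table for a AND c: [0, 0, 0, 1]
```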

We have open-sourced our flow, and our paper has been given a number of ACM reproducibility badges. Please try it out at https://github.com/awai54st/Logic-Shrinkage and let us know what you think. And, if you’re attending FPGA next week, reach out and say hello.

Better accuracy / silicon area tradeoffs through pruning circuits

Islands of Certainty

Readers of this blog may remember that my PhD student Jianyi Cheng (jointly supervised by John Wickerson) has been working on high-level synthesis, combining dynamic scheduling with static scheduling. His latest contribution, to be presented on 28th February at the ACM FPGA conference, is about finding islands of static control flow in a sea of dynamic behaviour.

Here’s the story so far:

So now, two years later, we are back at the same conference to present a method to do just that. We now have an automated flow to select parts of a program to statically schedule, resulting in a 4x reduction in area combined with a 13% boost in performance compared to a fully dynamic circuit, a result that is close to the best achievable — as shown by exhaustively enumerating different parts of the program to schedule statically.

The basic idea of the paper is to develop the concept of a static island — a part of a dataflow graph where making decisions on scheduling of operations once, at compile time, is likely to have minimal impact on performance (or may even improve it) while opening the door to static resource sharing. We can throw a boundary around these islands, synthesise them efficiently with commercial HLS tools (we use Xilinx Vitis HLS), and integrate the result into the overall dynamic circuit using our previous open-source compiler flow.

So what makes a good static island? Unsurprisingly, these islands should exhibit static control flow or control flow with balanced path timing, e.g. in a conditional statement the if and else branches should take the same time, and loops should have constant dependence distances (or none at all). Jianyi also shows that there is an advantage to having these islands consume their inputs at offset times, e.g. for a two-input island we may wish the static scheduler to be aware that the second input is available – on average – two cycles after the first. He shows precisely how to generate ‘wrapper’ circuits for these components, allowing them to communicate with a dynamically scheduled environment.

The overall design flow, shown below, is now fully automated – freeing the user from writing the pragmas we required two years ago.