# Arithmetic for Neural Networks

Last month, the Royal Society Phil Trans A published my paper Rethinking Arithmetic for Deep Neural Networks, part of a special issue put together by Jack Dongarra, Laura Grigori and Nick Higham. In this blog post, I aim to briefly introduce the ideas in the paper. The paper is open access, so please read it for further detail. In addition, my slides from a talk given on an early version of this work are available from Nick Higham’s blog, and an mp3 recording of me talking to these slides has been made available by the Royal Society here.

The central theme of the paper is that hardware accelerator circuits for neural networks can actually be treated as neural networks. Consider the two graphs below. One of them represents a simple deep neural network where each node performs an inner product and a ReLU operation. The other represents the result of (i) deciding to used 4-bit fixed-point arithmetic, and then (ii) synthesising the resulting network into a circuit, technology-mapped to 2-input Boolean gates.

Although there are obvious differences (in structure, in number of nodes) – there is a core commonality: a computation described as a graph operating on parameterisable functional nodes.

So what can we gain from this perspective?

1. Binarized Neural Networks are universal. The paper proves that any network you want to compute can be computed using binarized neural networks with zero loss in accuracy. It’s simply not the case that some problems need high precision. But, as a corollary, it is necessary to not be tied too closely to the original network topology if you want to be guaranteed not to lose accuracy.

2. Boolean topologies are tricky things. So if we want to rethink topologies, what first principles should we use to do so? In the paper, I suggest looking to inspiration from the theory of metric spaces as one step towards producing networks that generalise well. Topology, node functionality, and input / output encoding interact in subtle, interesting, and under-explored ways.

3. This viewpoint pays practical dividends. My PhD student Erwei Wang and collaborators James Davis and Peter Cheung and I have developed a Neural Network flow called LUTNet, which uses Boolean lookup tables as the main computational node in a neural network, leading to very efficient FPGA implementations.

I’m hugely excited by where this work could go, as well as the breadth of the fields it draws on for inspiration. Please do get in touch if you would like to collaborate on any of the open questions in this paper, or any other topic inspired by this work.

# Machine Learning at FPT 2019

Next week, the IEEE International Conference on Field-Programmable Technology (FPT) will take place in Tianjin in China. I’m proud that my former PhD student Qiang Liu will be General Chair of the conference.

I am a coauthor of two papers to be presented at FPT, one led by my former BEng student Aaron Zhao, now a PhD student at Cambridge supervised by my colleague Rob Mullins, and one led by my former postdoc, Ameer Abdelhadi, now with COHESA / UofT. The paper by Aaron is also in collaboration with two of my former PhD students, Xitong Gao, now with the Chinese Academy of Sciences, and Junyi Liu, now with Microsoft Research.

The first paper, led by Aaron, is entitled ‘Automatic Generation of Multi-precision Multi-arithmetic CNN Accelerators for FPGAs’, and can be found on arXiv here. This paper is a serious look at getting an automated CNN flow for FPGAs that makes good use of some of the arithmetic flexibility available on these devices. Powers-of-two (“free” multiplication) and fixed-point (“cheap” multiplication) are both leveraged.

The second paper, led by Ameer, looks at the computation of a set of approximate nearest neighbours. This is useful in a number of machine learning settings, both directly as a non-neural deep learning inference algorithm and indirectly within sophisticated deep learning algorithms like Neural Turing Machines. Ameer has shown that this task can be successfully accelerated in an FPGA design, and explores some interesting ways to parameterise the algorithm to make the most of the hardware, leading to tradeoffs between algorithm accuracy and performance.

If you’re at FPT this year, please go and say hello to Aaron, Ameer and Qiang.

# Approximating Circuits

Next week, Ilaria Scarabottolo, currently a visiting research student in my research group at Imperial, will present her paper “Partition and Propagate” at DAC 2019 in Las Vegas. In this post, I will provide a brief preview of her work (joint with Giovanni Ansaloni and Laura Pozzi from Lugano and me.)

I’ve been interested in approximation, and how it can be used to save resources, ever since my PhD 20 years ago, where I coined the term “lossy synthesis” to mean the synthesis of a circuit / program where error can be judiciously introduced in order to effect an improvement in performance or silicon area. Recently, this area of research has become known as “approximate computing“, and a bewildering number of ways of approximating behaviour – at the circuit and software level – have been introduced.

Some of the existing approaches for approximate circuit synthesis are point solutions for particular IP cores (e.g. our approximate multiplier work) or involve moving beyond standard digital design methodologies (e.g. our overclocking work.) However, a few pieces of work develop a systematic method for arbitrary circuits, and Ilaria’s work falls into this category.

Essentially, she studies that class of approximation that can be induced solely by removing chunks of a logic circuit, replacing dangling nets with constant values – a technique my co-authors referred to as Circuit Carving in their DATE 2018 paper.

Our DAC paper presents a methodology for bounding the error that can be induced by performing such an operation. Such error can be bounded by exhaustive simulation or SAT, but not for large circuits with many inputs due to scalability concerns. On the other hand, coarse bounds for the error can be derived very quickly. Ilaria’s work neatly explores the space between these two extremes, allowing analysis execution time to be traded for bound quality in a natural way.

Approximation’s time has definitely come, with acceptance in the current era often driven by machine-learning applications, as I explore in a previous blog post. Ilaria’s paper is an interesting and general approach to the circuit-level problem.

# Boolean Circuits are Neural Networks

On Monday, my PhD student Erwei Wang will present our work (joint also with James Davis and Peter Cheung) called LUTNet: Rethinking Inference in FPGA Soft Logic at the IEEE International Symposium on Field-Programmable Custom Computing Machines in San Diego, California.

In this paper, we take a very unusual approach to the design of a deep neural network accelerator in hardware: for us, the nodes in the neural network are Boolean lookup tables.

We were motivated initially by the fact that in very low precision FPGA neural network architectures, lookup tables are often used for arithmetic, but they are often used for very specific functions: while a $K$-LUT is capable of implementing any nonlinear Boolean function with $K$ inputs, it ends up getting used for only a tiny fraction of these $2^{2^K}$ functions. A good example is binarised neural networks (BNNs) such as FINN, where LUTs end up being used to implement XNOR gates (multiplication over $\{-1,+1\}$) and popcount functions. Our research question is therefore: rather than restricting ourselves to these functions, can we make better use of the LUTs by embracing the nonlinearity and the $K$-input support they give us?

We show that this is indeed possible. Our basic approach is to start with a weight-binarised neural network, add inputs to each node to bring them up to $K$ support, and then retrain the Boolean function implemented by that node. Retraining Boolean functions is a bit tricky, of course, because neural network training algorithms are not designed for this purpose. We generate a smooth interpolating function over the LUT entries, allowing us to use standard neural network training software (we use TensorFlow).

The end result is that the re-trained neural network is far more prunable than the original, because the extra inputs to the $K$-LUTs compensate for the removal of other nodes. Thus we end up with a much sparser neural network for the same classification accuracy. The sparsity improves our area by a factor of two or more, yet the more complex inference functions at each node are effectively provided “for free” by the FPGA architecture.

Circuit netlist? Neural network? Same thing!

# Efficient Memory via Formal Verification

My new PhD student Jianyi Cheng is presenting a very exciting paper at the ACM International Symposium on FPGAs (FPGA 2019). This is work he did for his Masters degree, and is a collaboration with Joy Chen and Jason Anderson at the University of Toronto, as well as Shane Fleming and myself at Imperial. In this blog post, I aim to summarise the main idea.

Multi-threaded programming is now a fairly mainstream activity, and has found its way into high-level synthesis tools, both through OpenCL and also LegUp pthreads support. We focus here on the latter.

At FPL 2017, Joy and Jason had a paper that automatically decided how to partition shared arrays for multi-threaded code, aiming to reduce the amount of arbitration required between hardware units and chunks of memory. Their approach used a simulation trace to identify candidate partitions, and designed the arbiters so that, for example, if accesses to partition P were only observed in that trace to come from thread T, then there is very low latency access to P from T at execution time. In this way, they were able to significantly speed up synthesised multi-threaded code making use of shared memories.

However, the arbiters were still there. They were necessary because while no access by some other thread T’ was observed during simulation, there was no guarantee that such an access might not occur at run-time. So the arbiters sat there, taking up FPGA area and – for large enough numbers of ports – hitting the critical path of the design.

Enter our work.

In our paper, we show – building on the excellent PhD thesis by Nathan Chong that I examined a few years back – how the original multi-threaded code can be translated into  single-threaded code in a verification language developed by Microsoft Research called Boogie. We then show how to automatically construct assertions in Boogie that, if passed, correspond to a formal proof that a particular thread can never access a particular partition. This lets us strip out the arbiters, gaining back the area and significantly boosting the clock frequency.

I think it’s a really neat approach. Please come and hear Jianyi give his talk and/or read the paper!

# Approximation of Boolean Functions

Approximate Computing has been a buzzphrase for a while. The idea, generally, is to trade off quality of result / solution, for something else – performance, power consumption, silicon area. This is not a new topic, of course, because in numerical computation people have generally always worked with finite precision number representations. In my early work in 2001, before the phrase “Approximate Computing” was in circulation, I introduced this as “Lossy Synthesis” – the idea that circuit synthesis can be broadened to incorporate the automated control of loss of numerical quality in exchange for reduction in area and increase in performance.

Most approximate computing frameworks focus on domains where numerical error is tolerable. Perhaps we don’t care if our answer is 1% wrong, for example, or perhaps we don’t even care if it’s out by 100%, so long as that happens very infrequently.

However, there is another interesting class of computation. Consider a function producing a Boolean output $f : \chi \to {\mathbb B}$, where ${\mathbb B} = \{T, F\}$. An interesting challenge is to produce another function $\tilde{f} : \chi \to {\mathbb T}$ with a ternary output ${\mathbb T} = \{T, F, -\}$ bearing a close resemblance to $f$. We can make the idea of bearing a close resemblance precise in the following way: if $\tilde{f}$ declares a value true (false), then so must $f$. We can think of this as relation between fibres:

$\tilde{f}^{-1}(\{T\}) \subseteq f^{-1}(\{T\})$ and $\tilde{f}^{-1}(\{F\}) \subseteq f^{-1}(\{F\})$            (1)

We can then think of the function $\tilde{f}$ as approximating $f$ if the fibre of the ‘don’t know’ element, $-$, is small in some sense, e.g. if $|\tilde{f}^{-1}(\{-\})|$ is small.

In the context of approximate computing, we can pose the following optimisation problem:

$\min_{\tilde{f}}: \mbox{Cost}(\tilde{f})$ subject to $|\tilde{f}^{-1}(\{-\})| < \tau$ and (1),

where $\mbox{Cost}$ represents the cost (energy, area, latency) of implementing a function. One application area for this kind of investigation is in computer graphics. It is often the case that, when rendering a scene, an algorithm first needs to decide which components of the scene will definitely not be visible, and therefore need not be considered further. Should this part of the graphics pipeline make a mistake by deciding a component may be visible when it is actually invisible, little harm is done – more computation is required downstream in the graphics pipelining, costing energy and time, but not a reduced quality rendering. On the other hand, if it makes a mistake by deciding that a component is invisible when it is actually visible, this may cause a significant visual artefact in the rendered scene.

Last year, I had a bright Masters student, Georgios Chatzianastasiou, who decided to explore this problem in the context of $f$ being the Slab Method in computer graphics and $\tilde{f}$ being one of a family of approximations $\tilde{f}_p$, each produced by using interval arithmetic approximations to $f$ computed in floating-point with precision $p$. In this way we get a family of approximate computing hardware IP blocks, all of which guarantee that, when given a ray and a bounding box, if the IP reports no intersection between the two, then there is provably no intersection. Yet each family member operates at a different precision, requiring different circuit area, trading off against the rate of false positives’. Georgios wrote a paper on the implementation, which was accepted by FPL 2018 – he presents it next Wednesday.

If you’re at the FPL conference, please go and say hello to Georgios. If you’re interested in working with me to deepen and broaden the scope of this work, please get in touch!

# Throwaway Digits

Tomorrow, my PhD student He Li will present our paper Digit Elision for Arbitrary-accuracy Iterative Computation (joint work with James Davis and John Wickerson) at the IEEE Symposium on Computer Arithmetic in Amherst, MA.

Readers of this blog may remember that we previously came up with a neat way of computing arbitrarily precise values of arbitrarily deep iterations of an iterative real-number computation, while only using constant-area compute hardware. This latest paper extends our previous work in the following way.

In our previous work, we computed every digit of every iteration of the computation. While for any computable real function this will give a correct result, it tends to be wasteful in practice. There are two reasons it’s wasteful. Firstly, often the reason we’re computing an iteration is because that iteration converges. Convergence can be seen as agreement in most-significant digits – after a while they don’t change. So why do we recompute them? We see this again and again in standard numerical computing – each iteration might add just a couple of new correct digits, but we still end up wasting time and energy computing all of the digits in each iteration, even the stable ones. Secondly, not all iterations may contribute equally to the overall error resulting from early termination. This paper addresses these two issues.

The first, and more general, issue is the wastefulness of computing stabilised digits. But just because they look stable, are they really stable? Maybe we’ve stabilised to 0.9, 0.99, 0.999, 0.999, and then one more iteration might kick us over to 1.0001. So can we really afford not to recompute most-significant digits? Ercegovac‘s Online Arithmetic comes to our rescue again! If we compute in an appropriate redundant number representation, then we can prove that stability of digits means we don’t need to consider them any more. This is our first contribution – to recognise this and utilise it within an appropriately modified computational architecture.

The second, and more specific, issue is that some digits are effectively ‘don’t care’. In this paper, we only analyse the specific case of stationary iterative methods (Jacobi, SOR, etc.) for this kind of digit. We show that, in these cases, for a fixed digit budget (e.g. “compute at most D digits across all iterations”), you should allocate these digits by computing a constant more digits each iteration. This constant can be estimated from the infinity norm of a certain matrix involved in the computation. Again, we modify our hardware architecture to take advantage of this pattern.

The end result is that we end up tracing out a corridor of digits, shown in the figure below, where the vertical axis is iteration and the horizontal axis is precision / digit number. Some digits have provably stabilised and no longer need computation (marked “), some are irrelevant don’t cares (marked X). This corridor radically improves the storage requirements of the original ARCHITECT scheme.

# Hardware for Rational Functions

Next Tuesday, my collaborator Silviu-Ioan Filip will present some of our recent work with Nicolas Brisebarre, Miloš Ercegovac, Matei Istoan and Jean-Michel Muller at the IEEE International Symposium on Computer Arithmetic.

In the 1970s, Miloš invented a rather nice method called the E-method for evaluating rational functions, i.e. ratios of two polynomials.  The basic idea of his method is as follows. We may solve a system of linear equations $Ay = b$ where $A$ is a matrix of a special structure formed from constants $q_i$ together with variable $x$:

$A = \begin{bmatrix} 1 & -x & 0 & 0 & \cdots & 0 & 0 \\ q_1 & 1 & -x & 0 & \cdots & 0 & 0 \\ q_2 & 0 & 1 & -x & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \ddots & \ddots & \vdots & \vdots \\ q_{n-1} & 0 & 0 & 0 & \cdots & 1 & -x \\ q_n & 0 & 0 & 0 & \cdots & 0 & 1 \end{bmatrix}$

If we further choose the vector $b = \begin{bmatrix} p_0 & \cdots & p_n \end{bmatrix}^T$, then it turns out that the first element of the solution vector is the rational function $\frac{p_n x^n + \cdots + p_0}{q_n x^n + \cdots + q_0}$.

So we can use this to evaluate such rational functions. On the face of it, that doesn’t seem very interesting: why would we go to the bother of solving a system of linear equations to evaluate a rational function?

The answer lies in the combination of this idea with another one of Miloš’s key contributions, the idea of online arithmetic – computing results most-significant-digit first. In fact, if the matrix $A$ is sufficiently well conditioned then we may use a stationary iterative method to solve the system of equations in such a way that it produces one new correct digit of the solution for each iteration of the method, leading to very efficient evaluation.

Our paper at ARITH makes two novel contributions. Firstly, we show how to find such a matrix $A$ that is sufficiently well conditioned and for which the solution is close to a given function we’re trying to approximate, improving on the previous technique of Brisebarre et al. Secondly, we show how this method can be efficiently implemented in modern FPGA hardware, when aiming for high throughput.

The main domain of interest will be functions where rational approximation provides a much better fit than polynomials, as the computation required essentially provides rational computation for the price of polynomial computation. A buy-one-get-one-free offer, if you will.

I’m pleased to say that both the rational approximation generator and the hardware IP core generator will soon be open-sourced. Watch this space! Edit: I’m pleased to say this is now available at https://github.com/sfilip/emethod.

Next week my PhD student Nadesh Ramanathan will present his paper (joint work with Wickerson) on Concurrency-Aware Scheduling for High-Level Synthesis at FCCM 2018.

This work is the latest instalment of our approach to scheduling multithreaded software in high-level synthesis while taking advantage of the weak memory behaviour allowable in the C/C++11 standard.

Our previous work analysed, and then synthesised, each thread individually. What this paper adds is the ability to perform an inter-thread analysis – while still synthesising threads individually. It is natural, in hardware synthesis, to assume knowledge of the other threads that are being synthesised at compile time. We show in this paper that such knowledge can – and often does – considerably improve high-level synthesis results, by removing redundant constraints during the scheduling process.

Readers wanting to know a little more before diving into the paper itself could also read John Wickerson’s description of our work.

# Iterating Exactly

I’m very excited to share that my PhD student, He Li, will tomorrow be presenting his paper ARCHITECT: Arbitrary-precision Constant-hardware Iterative Compute at the IEEE International Conference on Field-Programmable Technology 2017 (joint work also with James Davis and John Wickerson.)

Anyone who has done any numerical computation will sooner or later encounter a loop like this:

while( P(x) )
x = f(x);
`

Where $P(x)$ denotes a predicate determining when the loop will exit, $f$ is a function transforming the state of the loop at each iteration, and $x$ is – critically – a vector of real numbers. Such examples crop up everywhere, for example the Jacobi method, conjugate gradient, etc.

How do people tend to implement such loops? They approximate them by using a finite precision number system like floating point instead of reals.

OK, let’s say you’ve done your implementation. You run for 1000 iterations and still the loop hasn’t quit. Is that because you need to run for a few more iterations? Or is it because you computed in single precision instead of double precision? (Or double instead of quad, etc.) Do you have to throw away all your computation, go back to the first iteration, and try again in a higher precision? Often we just don’t know.

He’s paper solves this problem. As time progresses, we increase both the iteration and the accuracy to which a given iterate is known, snaking through the two-dimensional iteration / precision space, linearising two countably infinite dimensions into the single countably infinite dimension of time (clock cycle) using a trick due to Cantor.

This is the essence of our contribution.

To make it work in practice, efficiently in hardware, requires some tricks. For a start, we need to be able to support arbitrary precision arithmetic on finite computational hardware (only memory space growing with precision, not compute hardware). Secondly, we need to compute from most-significant to least-significant digit, iteratively refining our computation as we proceed. This form of computation is not supported naturally by standard binary arithmetic, but is supported by redundant arithmetic. We make use of online arithmetic to enable this transformation.

So now you don’t need to worry – rounding error will not stop you getting your answer. There’s an FPGA design for that.