# Highlights of CSE 2019

Over the second half of this week, I’ve been attending the SIAM Computational Science and Engineering conference in Spokane, Washington – a short flight north (and a radical change in weather) from my earlier conference in California this week.

This was my first SIAM conference. I was kindly invited to speak on the topic of floating-point error analysis by Pierre Blanchard, Nick Higham and Theo Mary. I very much enjoyed the sessions they organised and indeed the CSE conference, which I hope to be able to attend more regularly from now on.

My own talk was entitled Approximate Arithmetic – A Hardware perspective. I spoke about the rise of architecture specialisation as driving the need for closer collaboration between computer architects and numerical analysts, about some of our work on automatic error bounds (Boland and Constantinides, 2011; Magron, Constantinides and Donaldson, 2017) and on code refactoring (Gao and Constantinides, 2015), as well as some of our most recent work on machine learning (I will blog separately about this latter topic over the next couple of months).

The CSE conference is very large – with 30-40 small parallel sessions happening at any given moment – so I cannot begin to summarise the conference. However, I include some notes below on other talks I found particularly interesting.

### Plenary Sessions

I very much enjoyed the plenary presentation by Rachel Ward on Stochastic Gradient Descent (SGD) in Theory and Practice. She introduced the SGD method very nicely, and looked at various assumptions for convergence. She took a particularly illuminating approach, by looking at applying SGD to the simple special case of solving a system of linear equations by minimising $F(w) = \frac{1}{2}||Aw-b||^2$ in the case where $\exists w^*. Aw^* = b$. She showed that if the system is under-determined, then SGD converges to the solution of minimum 2-norm, and therefore has an inherent regularising effect. I was surprised by some of the results on overparameterised neural networks, showing that SGD finds global minimisers and that there really doesn’t tend to be much overfitting despite the huge number of parameters, pointing to the implicit regularisation caused by the SGD algorithm itself. I learnt a lot from this talk, and have several papers on my “to read” list as a result.
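The minimum-norm behaviour on an under-determined system can be checked numerically. The following is a minimal sketch of my own, not code from the talk: the matrix, right-hand side, step size and iteration count are all illustrative assumptions. It starts from $w = 0$, so every iterate stays in the row space of $A$, which is why the limit is the minimum 2-norm solution.

```python
import random

def sgd_linear(A, b, step=0.25, iterations=2000, seed=0):
    """SGD on F(w) = 1/2 * ||Aw - b||^2, sampling one row (equation) per step."""
    rng = random.Random(seed)
    w = [0.0] * len(A[0])          # starting at zero keeps w in the row space of A
    for _ in range(iterations):
        i = rng.randrange(len(A))  # pick one equation uniformly at random
        residual = sum(a * wj for a, wj in zip(A[i], w)) - b[i]
        # gradient of 1/2 * (a_i . w - b_i)^2 with respect to w is residual * a_i
        w = [wj - step * residual * a for a, wj in zip(A[i], w)]
    return w

# Two equations, three unknowns: infinitely many solutions exist.
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
b = [2.0, 3.0]

w = sgd_linear(A, b)
print(w)
```

For this particular example the minimum-norm solution $A^T(AA^T)^{-1}b$ works out by hand to $(1/3, 4/3, 5/3)$, which is what the iterates should approach.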

There was also an interesting plenary from Anima Anandkumar on the role of tensors in machine learning. The mathematical structure of tensors and multi-linear algebra are topics I’ve not explored before – mainly because I’ve not seen the need to spend time on them. Anandkumar certainly provided me with motivation to do that!

### Floating-Point Error Analysis

Theo Mary from the University of Manchester gave a very good presentation of his work with Nick Higham on probabilistic rounding error analysis, treating numerical roundoff errors as zero-mean independent random variables of arbitrary distribution, making use of Hoeffding’s inequality to produce a backward error analysis. Their work is described in more detail in their own blog post and – in more depth – in their very interesting paper. It’s a really exciting and useful direction, I think, given the greater emphasis on average-case performance from modern applications, together with both very large data sets and very low precision computation, the combination of which renders many worst-case analyses meaningless. In a similar vein, Ilse Ipsen also presented a very interesting approach: a forward error analysis, more specialised in that she only looked at inner products, but also without the assumption of independence, making use of Azuma’s inequality. The paper on this topic has not yet been finished, but I certainly look forward to reading it in due course!
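To get a feel for why worst-case bounds become meaningless at low precision, here is a small simulation sketch of my own (not taken from either talk). It sums $n$ numbers while rounding every partial sum to $t$ significant bits, and compares the observed backward error with the classical worst-case constant $\gamma_n = nu/(1-nu)$ and with the $\sqrt{n}\,u$ scale that appears in probabilistic analyses. The rounding model, the choice $t = 11$, and the omission of the constant and probability level from the probabilistic bound are all simplifying assumptions.

```python
import math
import random

def round_to_bits(x, t):
    """Round x to t significant bits (round to nearest); unit roundoff is 2**-t."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 2**t), e - t)

def low_precision_sum(xs, t):
    """Recursive summation with every partial sum rounded to t bits."""
    s = 0.0
    for x in xs:
        s = round_to_bits(s + x, t)
    return s

def backward_error(xs, t):
    """|computed - exact| / sum|x_i|, with the exact sum taken from math.fsum."""
    exact = math.fsum(xs)
    return abs(low_precision_sum(xs, t) - exact) / math.fsum(abs(x) for x in xs)

def worst_case_bound(n, u):
    return n * u / (1.0 - n * u)         # classical gamma_n, valid while n*u < 1

def probabilistic_scale(n, u):
    return math.sqrt(n) * u              # scale only; constants deliberately omitted

rng = random.Random(0)
n, t = 1000, 11
u = 2.0 ** -t
xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
print("observed  :", backward_error(xs, t))
print("worst case:", worst_case_bound(n, u))
print("sqrt(n)*u :", probabilistic_scale(n, u))
```

With these parameters $nu$ is already close to one half, so the worst-case bound says almost nothing, while the observed error can be compared directly against the $\sqrt{n}\,u$ scale.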

### Reducing Communication Costs

There were a number of interesting talks on mitigating communication costs. Lawrence Livermore National Labs presented several papers relating to the ZFP format they’ve recently proposed for (lossily) compressed floating-point vectors, at a mini-symposium organised by Alyson Fox, Jeffrey Hittinger, and James Diffenderfer. Diffenderfer’s talk developed a bound on the norm-wise relative error of vectors reconstructed from ZFP; Alyson Fox’s talk then extended this to the setting of iterative methods, noting as future work their interest in probabilistic analyses. In the same session, Nick Higham gave a crystal clear and well-motivated talk on his recent work with Srikara Pranesh and Mawussi Zunon (slides and paper are available). This work extends the applicability of Nick’s earlier work with Erin Carson to cases that would have over- or under-flowed, or led to subnormal numbers, without the scaling technique developed and analysed here. They use matrix equilibration – this reminded me of some work I did with my former PhD student Juan Jerez and colleague Eric Kerrigan, but in our case for a different algorithm kernel and targeting fixed-point arithmetic, where making use of the full dynamic range is particularly important. The Higham, Pranesh and Zunon results are both interesting and practically very useful.

In a different session, Hartwig Anzt spoke about the work he and others have been doing to explicitly decouple storage precision from compute precision in sparse linear algebra. The idea is simple but effective: take the high-order bits of the mantissa (and the sign / exponent) and store them in one chunk of data and – separately – store the low-order bits in another chunk. Perform all arithmetic in high precision (because it’s not the computation that’s the bottleneck), but convert low-precision stored data to high precision on the fly at data load (e.g. by padding the missing low-order bits with zeros). Then, at run-time, decide whether to load the full-precision data or only the low-precision data, based on current estimates of convergence. This approach could also make a good case study application for the run-time adaptation methodology we developed with U. Southampton in the PRiME project.
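The storage split can be sketched in a few lines. This is my own illustration, assuming an IEEE double cut into two 32-bit halves; the actual chunk sizes and memory layout used by Anzt and colleagues may well differ, and the function names are hypothetical.

```python
import struct

def split_double(x):
    """Split a 64-bit double into a 32-bit head and a 32-bit tail.

    The head holds the sign, the 11 exponent bits and the top 20 mantissa bits;
    the tail holds the remaining 32 low-order mantissa bits.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits >> 32, bits & 0xFFFFFFFF

def load_low(head):
    """Reconstruct a double from the head only, padding the tail with zeros."""
    return struct.unpack("<d", struct.pack("<Q", head << 32))[0]

def load_full(head, tail):
    """Reconstruct the original double from both chunks."""
    return struct.unpack("<d", struct.pack("<Q", (head << 32) | tail))[0]

value = 3.141592653589793
head, tail = split_double(value)
print(load_low(head), load_full(head, tail))
```

Loading only the head halves the memory traffic at the price of a relative truncation error below $2^{-20}$ for normal numbers, while all arithmetic is still carried out on doubles.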

### A Reflection

Beyond the technical talks, there were two things that stood out for me since I’m new to the conference. Firstly, there were many more women than in the typical engineering conferences I attend. I don’t know whether the statistics on maths versus engineering are in line with this observation, but clearly maths is doing something right from which we could learn. Secondly, there were clear sessions devoted to community building: mentoring sessions, tutorials for new research students, SIAM student chapter presentations, early career panels, presentations on funding programmes, diversity and inclusion sessions, a session on helping people improve their CV, an explicit careers fair, etc. Partly this may simply reflect the size of the conference, but even so, this seems to be something SIAM does particularly well.

# Highlights of FPGA 2019

This week, I attended the ACM FPGA 2019 conference in Seaside (nr. Monterey), California, the annual premier ACM event on FPGAs and associated technology. I’ve been involved in this conference for many years, as author, TPC member, TPC and general chair, and now steering committee member. Fashions have come and gone over this time, including in the applications of FPGA technology, but the programme at FPGA is always interesting and high quality. This year particular thanks should go to Steve Neuendorffer for organising the conference programme and to Kia Bazargan in his role as General Chair.

Below, I summarise my personal highlights of the conference. These are by no means my view of the “best” papers – they are all good – but rather those that interested me the most.

Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity, a collaboration between Tsinghua, Beihang, Harbin Institute of Technology, and Microsoft Research, tackled the problem of ensuring that an inference implementation, when sparsified, gets sparsified in a way that leads to balanced load across the various memory banks. The idea is simple but effective, and leads to an interesting tradeoff between the quality of LSTM output and performance. I think it would be interesting to try to design a training method / regulariser that encourages this kind of structured sparsity in the first place.

Kees Vissers from Xilinx presented a keynote talk summarising their new Versal architecture, which the Imperial team had previously had the pleasure of hearing about from our alumnus Sam Bayliss. This is a really very different architecture to standard FPGA fare, and readers might well be interested in taking a look at Kees’s slides to learn more.

Vaughn Betz presented a paper from the University of Toronto, Math Doesn’t Have to be Hard: Logic Block Architectures to Enhance Low Precision Multiply-Accumulate on FPGAs. This work proposed a number of relatively minor tweaks to Intel FPGA architectures which might have a significant impact on low-precision MAC performance. Vaughn began by pointing out that in this application, very general LUTs often get wasted by being used as very simple gates – he gave the example of AND gates in partial product generation, and even as buffers. A number of architectural proposals were made to avoid this issue. I find this particularly interesting at the moment, because together with my PhD student Erwei Wang and others, I have proposed a new neural network architecture called LUTNet, motivated by exactly the same concern. However, our approach is the dual of that presented by Vaughn – we keep the FPGA architecture constant but modify the basic computations performed by the neural network to be better tuned to the underlying architecture. Expect a future blog post on our approach!

Lana Josipović presented the most recent work on the dynamically scheduled HLS tool from Paolo Ienne’s group at EPFL, which they first presented at last year’s conference – see my blog post from last year. This time they have added speculative execution to their armoury. This is a very interesting line of work as HLS moves to encompass more and more complex algorithms, and Lana did a great job illustrating how it works.

Yi-Hsiang Lai presented HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing, an interesting collaboration between Zhiru Zhang’s group at Cornell and Jason Cong’s group. This work proposed separating functionality from implementation / optimisation concerns, such as datapath, precision and memory customisation, providing a cleaner level of abstraction. The approach seems very interesting, and reminded me of the aspect-oriented HLS work I contributed to in the REFLECT European project, about which João Cardoso and others have since written a book. I think it’s a promising approach, and I’d be interested to explore the potential and challenges of their tool-flow. This paper won the best paper prize of the conference – congratulations to the authors!

My PhD student Jianyi Cheng presented our own paper, EASY: Efficient Arbiter SYnthesis from Multi-Threaded Code, and did an excellent job. Our paper is described in more detail in an earlier blog post.

Other papers I found particularly interesting include Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs, Microsemi’s contribution on analytic placement, ETH Zürich’s paper on an FPGA implementation of an approximate maximum graph matching algorithm, and U. Waterloo’s paper on a lightweight NoC making use of traffic injection regulation to avoid stalls. Unfortunately I had to miss the talks after noon on Tuesday, so there may well be more of interest in that part of the programme too.

The panel discussion – chaired by Deming Chen – was on the topic of whether FPGAs have a role to play in Supercomputing. As I pointed out in the discussion, to answer this question scientifically we need to have a working definition of “FPGA” and of “Supercomputing” – both seem to be on shifting sands at the moment, and we need to resist reducing a question like this to “does LINPACK run well on a Virtex or Stratix device.”

We also had the pleasure of congratulating Deming Chen and Paul Chow on their recently awarded fellowships, awarding a best paper prize, recognising several historical FPGA papers of significance, and last but by no means least welcoming the new baby of two of the stalwarts of the FPGA community – baby complete with “I am into FPGA” T-shirt! All this led to an excellent community feeling, which we should continue to nurture.

# Efficient Memory via Formal Verification

My new PhD student Jianyi Cheng is presenting a very exciting paper at the ACM International Symposium on FPGAs (FPGA 2019). This is work he did for his Masters degree, and is a collaboration with Joy Chen and Jason Anderson at the University of Toronto, as well as Shane Fleming and myself at Imperial. In this blog post, I aim to summarise the main idea.

Multi-threaded programming is now a fairly mainstream activity, and has found its way into high-level synthesis tools, both through OpenCL and also LegUp pthreads support. We focus here on the latter.

At FPL 2017, Joy and Jason had a paper that automatically decided how to partition shared arrays for multi-threaded code, aiming to reduce the amount of arbitration required between hardware units and chunks of memory. Their approach used a simulation trace to identify candidate partitions, and designed the arbiters so that, for example, if accesses to partition P were only observed in that trace to come from thread T, then there is very low latency access to P from T at execution time. In this way, they were able to significantly speed up synthesised multi-threaded code making use of shared memories.

However, the arbiters were still there. They were necessary because while no access by some other thread T’ was observed during simulation, there was no guarantee that such an access might not occur at run-time. So the arbiters sat there, taking up FPGA area and – for large enough numbers of ports – hitting the critical path of the design.

Enter our work.

In our paper, we show – building on the excellent PhD thesis by Nathan Chong that I examined a few years back – how the original multi-threaded code can be translated into single-threaded code in a verification language developed by Microsoft Research called Boogie. We then show how to automatically construct assertions in Boogie that, if passed, correspond to a formal proof that a particular thread can never access a particular partition. This lets us strip out the arbiters, gaining back the area and significantly boosting the clock frequency.

I think it’s a really neat approach. Please come and hear Jianyi give his talk and/or read the paper!

# Neural Networks, Approximation and Hardware

My PhD student Erwei Wang, various collaborators and I have recently published a detailed survey article on this topic: Deep Neural Network Approximation for Custom Hardware: Where We’ve Been, Where We’re Going, to appear in ACM CSUR. In this post, I will informally explain my personal view of the role of approximation in supervised learning (classification), and how this links to the very active topic of DNN accelerator design in hardware.

We can think of a DNN as a graph $G$, where nodes perform computations and edges carry data. This graph can be interpreted (executed) as a function $\llbracket G \rrbracket$ mapping input data to output data. The quality of this DNN is typically judged by a loss function $\ell$. Let’s think about the supervised learning case: we typically evaluate the DNN on a set of $n$ test input data points $x_i$ and their corresponding desired output $y_i$, and compute the mean loss:

$L(G) = \frac{1}{n} \sum_{i=1}^n {\ell\left( \llbracket G \rrbracket(x_i), y_i \right)}$

Now let’s think about approximation. We can define the approximation problem as – starting with $G$ – coming up with a new graph $G'$, such that $G'$ can be somehow much more efficiently implemented than $G$, and yet $L(G')$ is not significantly greater than $L(G)$ – if at all. All the main methods for approximating NNs such as quantisation of activations and weights and sparsity – structured and unstructured – can be viewed in this way.
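As a toy illustration of comparing $L(G)$ with $L(G')$, the sketch below uses a single linear threshold unit as $G$ and a weight-quantised copy as $G'$, with 0-1 loss. Everything here (the weights, the test points, the fixed-point format) is made up for illustration and is far smaller than any real DNN.

```python
def predict(weights, x):
    """A one-node 'network': threshold on a weighted sum of the inputs."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0

def mean_loss(weights, data):
    """Mean 0-1 loss over (x_i, y_i) pairs, i.e. L(G) for this tiny G."""
    return sum(predict(weights, x) != y for x, y in data) / len(data)

def quantise(weights, frac_bits):
    """Round each weight to the nearest multiple of 2**-frac_bits (fixed point)."""
    scale = 2 ** frac_bits
    return [round(w * scale) / scale for w in weights]

# Hypothetical weights and test set, chosen only to make the example runnable.
weights = [0.73, -0.41, 0.12]
test_data = [
    ([1.0, 0.5, 0.2], 1),
    ([0.2, 1.0, 0.1], 0),
    ([0.5, 0.9, 0.3], 1),
    ([0.1, 0.1, 0.9], 1),
    ([0.3, 0.8, 0.5], 0),
]

print("L(G) =", mean_loss(weights, test_data))
for bits in (8, 4, 2, 1):
    print(bits, "fractional bits: L(G') =", mean_loss(quantise(weights, bits), test_data))
```

The interesting quantity is how far the number of fractional bits can be pushed down before $L(G')$ starts to differ noticeably from $L(G)$.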

There are a couple of interesting differences here to the different problem – often studied in approximate computing, or lossy synthesis – of approximating the original function $\llbracket G \rrbracket$. In this latter setting, we can define a distance $d(G',G)$ between $G$ and $G'$ (perhaps worst case or average case difference over the input data set), and our goal is to find a $G'$ that keeps this distance bounded while improving the performance, power consumption, or area of the implementation. Firstly, in the deep learning setting, even the original network $G$ is imperfect, i.e. $L(G) > 0$. In fact, we’re not really interested in keeping the distance between $G$ and $G'$ bounded – we’re actually interested in bounding the distance between $\llbracket G' \rrbracket$ and some oracle function defining the perfect classification behaviour. This means that there is a lot more room for approximation techniques. It also means that $L(G')$ may even improve compared to $L(G)$, as sometimes seen – for example – through the implicit regularisation behaviour of rounding error in quantised networks. Secondly, we don’t even have access to the oracle function, only to a sample (the training set). These features combine to make the DNN setting an ideal playground for novel approximation techniques, and I expect to see many such ideas emerging over the next few years, driven by the push to embed deep learning into edge devices.

I hope that the paper we’ve just published in ACM CSUR serves as a useful reference point for where we are at the moment with techniques that simultaneously affect classification performance (accuracy / loss) and computational performance (energy, throughput, area). These are currently mainly based around quantisation of the datatypes in $G$ (fixed point, binarisation, ternarisation, block floating point, etc.), topological changes to the network (pruning) and re-parametrisation of the network (weight sharing, low-rank factorisation, circulant matrices), as well as approximation of nonlinear activation functions. My view is that this is scratching the surface of the problem – expect to see many more developments in this area and consequent rapid changes in hardware architectures for neural networks!

# Approximation of Boolean Functions

Approximate Computing has been a buzzphrase for a while. The idea, generally, is to trade off quality of result / solution, for something else – performance, power consumption, silicon area. This is not a new topic, of course, because in numerical computation people have generally always worked with finite precision number representations. In my early work in 2001, before the phrase “Approximate Computing” was in circulation, I introduced this as “Lossy Synthesis” – the idea that circuit synthesis can be broadened to incorporate the automated control of loss of numerical quality in exchange for reduction in area and increase in performance.

Most approximate computing frameworks focus on domains where numerical error is tolerable. Perhaps we don’t care if our answer is 1% wrong, for example, or perhaps we don’t even care if it’s out by 100%, so long as that happens very infrequently.

However, there is another interesting class of computation. Consider a function producing a Boolean output $f : \chi \to {\mathbb B}$, where ${\mathbb B} = \{T, F\}$. An interesting challenge is to produce another function $\tilde{f} : \chi \to {\mathbb T}$ with a ternary output ${\mathbb T} = \{T, F, -\}$ bearing a close resemblance to $f$. We can make the idea of bearing a close resemblance precise in the following way: if $\tilde{f}$ declares a value true (false), then so must $f$. We can think of this as a relation between fibres:

$\tilde{f}^{-1}(\{T\}) \subseteq f^{-1}(\{T\})$ and $\tilde{f}^{-1}(\{F\}) \subseteq f^{-1}(\{F\})$            (1)

We can then think of the function $\tilde{f}$ as approximating $f$ if the fibre of the ‘don’t know’ element, $-$, is small in some sense, e.g. if $|\tilde{f}^{-1}(\{-\})|$ is small.

In the context of approximate computing, we can pose the following optimisation problem:

$\min_{\tilde{f}}: \mbox{Cost}(\tilde{f})$ subject to $|\tilde{f}^{-1}(\{-\})| < \tau$ and (1),

where $\mbox{Cost}$ represents the cost (energy, area, latency) of implementing a function. One application area for this kind of investigation is in computer graphics. It is often the case that, when rendering a scene, an algorithm first needs to decide which components of the scene will definitely not be visible, and therefore need not be considered further. Should this part of the graphics pipeline make a mistake by deciding a component may be visible when it is actually invisible, little harm is done – more computation is required downstream in the graphics pipeline, costing energy and time, but the quality of the rendering is not reduced. On the other hand, if it makes a mistake by deciding that a component is invisible when it is actually visible, this may cause a significant visual artefact in the rendered scene.

Last year, I had a bright Masters student, Georgios Chatzianastasiou, who decided to explore this problem in the context of $f$ being the Slab Method in computer graphics and $\tilde{f}$ being one of a family of approximations $\tilde{f}_p$, each produced by using interval arithmetic approximations to $f$ computed in floating-point with precision $p$. In this way we get a family of approximate computing hardware IP blocks, all of which guarantee that, when given a ray and a bounding box, if the IP reports no intersection between the two, then there is provably no intersection. Yet each family member operates at a different precision, requiring different circuit area, trading off against the rate of ‘false positives’. Georgios wrote a paper on the implementation, which was accepted by FPL 2018 – he presents it next Wednesday.
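The fibre condition (1) is easy to demonstrate on a much simpler predicate than the Slab Method. The sketch below is a stand-in of my own, not the design from the paper: it takes $f(a,b,c) = (ab > c)$ and builds $\tilde{f}_p$ by enclosing $a$, $b$ and their product in intervals whose endpoints are multiples of $2^{-p}$. Exact rational arithmetic plays the role of the reference, and the fixed-point interval format is an assumption made for brevity.

```python
import math
from fractions import Fraction

def round_down(x, p):
    """Largest multiple of 2**-p that is <= x."""
    return Fraction(math.floor(x * 2**p), 2**p)

def round_up(x, p):
    """Smallest multiple of 2**-p that is >= x."""
    return Fraction(math.ceil(x * 2**p), 2**p)

def f(a, b, c):
    """Exact Boolean predicate (exact because the inputs are Fractions)."""
    return a * b > c

def f_approx(a, b, c, p):
    """Ternary approximation: 'T' or 'F' only when the interval proves it."""
    a_lo, a_hi = round_down(a, p), round_up(a, p)
    b_lo, b_hi = round_down(b, p), round_up(b, p)
    products = [a_lo * b_lo, a_lo * b_hi, a_hi * b_lo, a_hi * b_hi]
    lo = round_down(min(products), p)   # outward rounding keeps the enclosure valid
    hi = round_up(max(products), p)
    if lo > c:
        return "T"
    if hi <= c:
        return "F"
    return "-"                          # don't know

def dont_know_count(p, c=Fraction(1, 3)):
    """Size of the fibre of '-' over a small grid of inputs."""
    grid = [Fraction(i, 7) for i in range(-7, 8)]
    return sum(f_approx(a, b, c, p) == "-" for a in grid for b in grid)

for p in (2, 4, 8):
    print(p, dont_know_count(p))
```

Lower precision gives a cheaper $\tilde{f}_p$ with a larger ‘don’t know’ fibre, but never a wrong definite answer.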

If you’re at the FPL conference, please go and say hello to Georgios. If you’re interested in working with me to deepen and broaden the scope of this work, please get in touch!

# Throwaway Digits

Tomorrow, my PhD student He Li will present our paper Digit Elision for Arbitrary-accuracy Iterative Computation (joint work with James Davis and John Wickerson) at the IEEE Symposium on Computer Arithmetic in Amherst, MA.

Readers of this blog may remember that we previously came up with a neat way of computing arbitrarily precise values of arbitrarily deep iterations of an iterative real-number computation, while only using constant-area compute hardware. This latest paper extends our previous work in the following way.

In our previous work, we computed every digit of every iteration of the computation. While for any computable real function this will give a correct result, it tends to be wasteful in practice. There are two reasons it’s wasteful. Firstly, often the reason we’re computing an iteration is because that iteration converges. Convergence can be seen as agreement in most-significant digits – after a while they don’t change. So why do we recompute them? We see this again and again in standard numerical computing – each iteration might add just a couple of new correct digits, but we still end up wasting time and energy computing all of the digits in each iteration, even the stable ones. Secondly, not all iterations may contribute equally to the overall error resulting from early termination. This paper addresses these two issues.

The first, and more general, issue is the wastefulness of computing stabilised digits. But just because they look stable, are they really stable? Maybe we’ve stabilised to 0.9, 0.99, 0.999, 0.999, and then one more iteration might kick us over to 1.0001. So can we really afford not to recompute most-significant digits? Ercegovac’s Online Arithmetic comes to our rescue again! If we compute in an appropriate redundant number representation, then we can prove that stability of digits means we don’t need to consider them any more. This is our first contribution – to recognise this and utilise it within an appropriately modified computational architecture.

The second, and more specific, issue is that some digits are effectively ‘don’t care’. In this paper, we only analyse the specific case of stationary iterative methods (Jacobi, SOR, etc.) for this kind of digit. We show that, in these cases, for a fixed digit budget (e.g. “compute at most D digits across all iterations”), you should allocate these digits by computing a constant number of additional digits in each iteration. This constant can be estimated from the infinity norm of a certain matrix involved in the computation. Again, we modify our hardware architecture to take advantage of this pattern.

The end result is that we end up tracing out a corridor of digits, shown in the figure below, where the vertical axis is iteration and the horizontal axis is precision / digit number. Some digits have provably stabilised and no longer need computation (marked “), some are irrelevant don’t cares (marked X). This corridor radically improves the storage requirements of the original ARCHITECT scheme.

# Hardware for Rational Functions

Next Tuesday, my collaborator Silviu-Ioan Filip will present some of our recent work with Nicolas Brisebarre, Miloš Ercegovac, Matei Istoan and Jean-Michel Muller at the IEEE International Symposium on Computer Arithmetic.

In the 1970s, Miloš invented a rather nice method called the E-method for evaluating rational functions, i.e. ratios of two polynomials.  The basic idea of his method is as follows. We may solve a system of linear equations $Ay = b$ where $A$ is a matrix of a special structure formed from constants $q_i$ together with variable $x$:

$A = \begin{bmatrix} 1 & -x & 0 & 0 & \cdots & 0 & 0 \\ q_1 & 1 & -x & 0 & \cdots & 0 & 0 \\ q_2 & 0 & 1 & -x & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \ddots & \ddots & \vdots & \vdots \\ q_{n-1} & 0 & 0 & 0 & \cdots & 1 & -x \\ q_n & 0 & 0 & 0 & \cdots & 0 & 1 \end{bmatrix}$

If we further choose the vector $b = \begin{bmatrix} p_0 & \cdots & p_n \end{bmatrix}^T$, then it turns out that the first element of the solution vector is the rational function $\frac{p_n x^n + \cdots + p_0}{q_n x^n + \cdots + q_0}$, where $q_0 = 1$.

So we can use this to evaluate such rational functions. On the face of it, that doesn’t seem very interesting: why would we go to the bother of solving a system of linear equations to evaluate a rational function?

The answer lies in the combination of this idea with another one of Miloš’s key contributions, the idea of online arithmetic – computing results most-significant-digit first. In fact, if the matrix $A$ is sufficiently well conditioned then we may use a stationary iterative method to solve the system of equations in such a way that it produces one new correct digit of the solution for each iteration of the method, leading to very efficient evaluation.
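The linear-algebra half of this idea can be checked with an ordinary floating-point Jacobi iteration on $Ay = b$. To be clear, this sketch is not the E-method itself, which uses a digit recurrence in online arithmetic to deliver one digit per step; it only illustrates that iterating on this particular system converges to the rational function when $|x|$ and the $|q_i|$ are small enough. The coefficients and evaluation point are arbitrary assumptions.

```python
def rational(p, q, x):
    """Direct evaluation of (p_0 + ... + p_n x^n) / (1 + q_1 x + ... + q_n x^n).

    p = [p_0, ..., p_n] and q = [q_1, ..., q_n]; q_0 is implicitly 1.
    """
    numerator = sum(pi * x**i for i, pi in enumerate(p))
    denominator = 1.0 + sum(qi * x**(i + 1) for i, qi in enumerate(q))
    return numerator / denominator

def jacobi_first_component(p, q, x, iterations=100):
    """Jacobi iteration y <- b - (A - I) y; returns y_0. Requires n >= 1."""
    n = len(p) - 1
    y = [0.0] * (n + 1)
    for _ in range(iterations):
        new = [0.0] * (n + 1)
        new[0] = p[0] + x * y[1]                             # row 0: y_0 - x y_1 = p_0
        for i in range(1, n):
            new[i] = p[i] - q[i - 1] * y[0] + x * y[i + 1]   # row i of A
        new[n] = p[n] - q[n - 1] * y[0]                      # last row: q_n y_0 + y_n = p_n
        y = new
    return y[0]

p = [1.0, 0.5, 0.25]   # p_0, p_1, p_2
q = [0.1, 0.05]        # q_1, q_2
x = 0.25
print(jacobi_first_component(p, q, x), rational(p, q, x))
```

Here the iteration matrix $I - A$ has infinity norm $\max_i(|q_i| + |x|) = 0.35$, so the error contracts at every step.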

Our paper at ARITH makes two novel contributions. Firstly, we show how to find such a matrix $A$ that is sufficiently well conditioned and for which the solution is close to a given function we’re trying to approximate, improving on the previous technique of Brisebarre et al. Secondly, we show how this method can be efficiently implemented in modern FPGA hardware, when aiming for high throughput.

The main domain of interest will be functions where rational approximation provides a much better fit than polynomials, as the computation required essentially provides rational computation for the price of polynomial computation. A buy-one-get-one-free offer, if you will.

I’m pleased to say that both the rational approximation generator and the hardware IP core generator will soon be open-sourced. Watch this space! Edit: I’m pleased to say this is now available at https://github.com/sfilip/emethod.