# Throwaway Digits

Tomorrow, my PhD student He Li will present our paper Digit Elision for Arbitrary-accuracy Iterative Computation (joint work with James Davis and John Wickerson) at the IEEE Symposium on Computer Arithmetic in Amherst, MA.

Readers of this blog may remember that we previously came up with a neat way of computing arbitrarily precise values of arbitrarily deep iterations of an iterative real-number computation, while only using constant-area compute hardware. This latest paper extends our previous work in the following way.

In our previous work, we computed every digit of every iteration of the computation. While for any computable real function this will give a correct result, it tends to be wasteful in practice. There are two reasons it’s wasteful. Firstly, often the reason we’re computing an iteration is because that iteration converges. Convergence can be seen as agreement in most-significant digits – after a while they don’t change. So why do we recompute them? We see this again and again in standard numerical computing – each iteration might add just a couple of new correct digits, but we still end up wasting time and energy computing all of the digits in each iteration, even the stable ones. Secondly, not all iterations may contribute equally to the overall error resulting from early termination. This paper addresses these two issues.

The first, and more general, issue is the wastefulness of computing stabilised digits. But just because they look stable, are they really stable? Maybe we’ve stabilised to 0.9, 0.99, 0.999, 0.999, and then one more iteration might kick us over to 1.0001. So can we really afford not to recompute most-significant digits? Ercegovac‘s Online Arithmetic comes to our rescue again! If we compute in an appropriate redundant number representation, then we can prove that stability of digits means we don’t need to consider them any more. This is our first contribution – to recognise this and utilise it within an appropriately modified computational architecture.

The second, and more specific, issue is that some digits are effectively ‘don’t care’. In this paper, we only analyse the specific case of stationary iterative methods (Jacobi, SOR, etc.) for this kind of digit. We show that, in these cases, for a fixed digit budget (e.g. “compute at most D digits across all iterations”), you should allocate these digits by computing a constant more digits each iteration. This constant can be estimated from the infinity norm of a certain matrix involved in the computation. Again, we modify our hardware architecture to take advantage of this pattern.

The end result is that we end up tracing out a corridor of digits, shown in the figure below, where the vertical axis is iteration and the horizontal axis is precision / digit number. Some digits have provably stabilised and no longer need computation (marked “), some are irrelevant don’t cares (marked X). This corridor radically improves the storage requirements of the original ARCHITECT scheme.

# Hardware for Rational Functions

Next Tuesday, my collaborator Silviu-Ioan Filip will present some of our recent work with Nicolas Brisebarre, Miloš Ercegovac, Matei Istoan and Jean-Michel Muller at the IEEE International Symposium on Computer Arithmetic.

In the 1970s, Miloš invented a rather nice method called the E-method for evaluating rational functions, i.e. ratios of two polynomials.  The basic idea of his method is as follows. We may solve a system of linear equations $Ay = b$ where $A$ is a matrix of a special structure formed from constants $q_i$ together with variable $x$:

$A = \begin{bmatrix} 1 & -x & 0 & 0 & \cdots & 0 & 0 \\ q_1 & 1 & -x & 0 & \cdots & 0 & 0 \\ q_2 & 0 & 1 & -x & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \ddots & \ddots & \vdots & \vdots \\ q_{n-1} & 0 & 0 & 0 & \cdots & 1 & -x \\ q_n & 0 & 0 & 0 & \cdots & 0 & 1 \end{bmatrix}$

If we further choose the vector $b = \begin{bmatrix} p_0 & \cdots & p_n \end{bmatrix}^T$, then it turns out that the first element of the solution vector is the rational function $\frac{p_n x^n + \cdots + p_0}{q_n x^n + \cdots + q_0}$.

So we can use this to evaluate such rational functions. On the face of it, that doesn’t seem very interesting: why would we go to the bother of solving a system of linear equations to evaluate a rational function?

The answer lies in the combination of this idea with another one of Miloš’s key contributions, the idea of online arithmetic – computing results most-significant-digit first. In fact, if the matrix $A$ is sufficiently well conditioned then we may use a stationary iterative method to solve the system of equations in such a way that it produces one new correct digit of the solution for each iteration of the method, leading to very efficient evaluation.

Our paper at ARITH makes two novel contributions. Firstly, we show how to find such a matrix $A$ that is sufficiently well conditioned and for which the solution is close to a given function we’re trying to approximate, improving on the previous technique of Brisebarre et al. Secondly, we show how this method can be efficiently implemented in modern FPGA hardware, when aiming for high throughput.

The main domain of interest will be functions where rational approximation provides a much better fit than polynomials, as the computation required essentially provides rational computation for the price of polynomial computation. A buy-one-get-one-free offer, if you will.

I’m pleased to say that both the rational approximation generator and the hardware IP core generator will soon be open-sourced. Watch this space! Edit: I’m pleased to say this is now available at https://github.com/sfilip/emethod.

Next week my PhD student Nadesh Ramanathan will present his paper (joint work with Wickerson) on Concurrency-Aware Scheduling for High-Level Synthesis at FCCM 2018.

This work is the latest instalment of our approach to scheduling multithreaded software in high-level synthesis while taking advantage of the weak memory behaviour allowable in the C/C++11 standard.

Our previous work analysed, and then synthesised, each thread individually. What this paper adds is the ability to perform an inter-thread analysis – while still synthesising threads individually. It is natural, in hardware synthesis, to assume knowledge of the other threads that are being synthesised at compile time. We show in this paper that such knowledge can – and often does – considerably improve high-level synthesis results, by removing redundant constraints during the scheduling process.

Readers wanting to know a little more before diving into the paper itself could also read John Wickerson’s description of our work.

# Iterating Exactly

I’m very excited to share that my PhD student, He Li, will tomorrow be presenting his paper ARCHITECT: Arbitrary-precision Constant-hardware Iterative Compute at the IEEE International Conference on Field-Programmable Technology 2017 (joint work also with James Davis and John Wickerson.)

Anyone who has done any numerical computation will sooner or later encounter a loop like this:

while( P(x) )
x = f(x);


Where $P(x)$ denotes a predicate determining when the loop will exit, $f$ is a function transforming the state of the loop at each iteration, and $x$ is – critically – a vector of real numbers. Such examples crop up everywhere, for example the Jacobi method, conjugate gradient, etc.

How do people tend to implement such loops? They approximate them by using a finite precision number system like floating point instead of reals.

OK, let’s say you’ve done your implementation. You run for 1000 iterations and still the loop hasn’t quit. Is that because you need to run for a few more iterations? Or is it because you computed in single precision instead of double precision? (Or double instead of quad, etc.) Do you have to throw away all your computation, go back to the first iteration, and try again in a higher precision? Often we just don’t know.

He’s paper solves this problem. As time progresses, we increase both the iteration and the accuracy to which a given iterate is known, snaking through the two-dimensional iteration / precision space, linearising two countably infinite dimensions into the single countably infinite dimension of time (clock cycle) using a trick due to Cantor.

This is the essence of our contribution.

To make it work in practice, efficiently in hardware, requires some tricks. For a start, we need to be able to support arbitrary precision arithmetic on finite computational hardware (only memory space growing with precision, not compute hardware). Secondly, we need to compute from most-significant to least-significant digit, iteratively refining our computation as we proceed. This form of computation is not supported naturally by standard binary arithmetic, but is supported by redundant arithmetic. We make use of online arithmetic to enable this transformation.

So now you don’t need to worry – rounding error will not stop you getting your answer. There’s an FPGA design for that.

# Passing Data Structures to FPGAs

Next week, my former PhD student and postdoctoral researcher, Felix Winterstein, will present our paper Pass a Pointer: Exploring Shared Virtual Memory Abstractions in OpenCL Tools for FPGAs at the IEEE International Conference on Field-Programmable Technology in Melbourne, Australia.

Before launching his current startup, Xelera, Felix and I worked together on the problem of automating the production of custom memory systems for FPGA-based accelerators. I previously blogged about some highly novel work we’d done during his PhD on high-level synthesis for code manipulating complex data structures like trees and linked lists. Full detail can be found in the book version of his PhD thesis. All this work – as exciting as it is – was based on sequential C code description as the input format to a high-level synthesis tool.

Many readers of this blog will be aware that OpenCL is rapidly becoming viewed as an alternative way to write correctness-portable code for FPGA development, with both Intel and Xilinx offering OpenCL flows based around OpenCL 1.X. However, OpenCL 2.0 offers a number of interesting features around shared virtual memory which could radically simplify programming, at the cost of making the compiler significantly more complex for FPGA-based computation. It is this issue we address in the paper Felix will present next week.

There’s lots of exciting program analysis work that could be built on top of Felix’s framework, and I’m keen to explore this further – if a reader of this blog would like to collaborate in this direction or like to do a PhD in this field, feel free to get in touch.

Perhaps most importantly, Felix’s framework is open source – check it out at https://github.com/constantinides/FPGA-shared-mem and let us know if you use it!

# HLS and Power: Some FPL Contributions

This week sees the IEEE International Conference on Field-Programmable Logic and Applications, in Ghent, Belgium.  Two of my team are attending to present their research papers on High Level Synthesis and on Run-time Power Estimation. In this post, I briefly summarise the key contributions of these papers.

High-Level Synthesis (HLS) is an important technology, which aims to automatically generate hardware designs from high-level (typically software) descriptions of their behaviour. In a previous blog post, I described some work from my PhD student Junyi Liu (joint with Sam Bayliss) on extending a common paradigm for analysis memory dependences – the polyhedral model – to a parametric version, for efficient pipelining in HLS. This week, Junyi presents an alternative use for the same parametric polyhedral HLS framework: automatic loop tiling (joint work with John Wickerson). Loop tiling is a very common compiler transformation – for example it is often used in matrix-matrix multiplication. The key advantage is to make sure that you only have a small set of data you’re working with at any given moment in time (traditionally for cache, in the FPGA context for embedded scratch-pad memories). The size of this working set can be traded off against the amount of off-chip memory traffic by selection of tile sizes. In a multi-dimensional loop, there are many possible options, and navigating this space is non-trivial. Junyi’s work provides a way to produce an explicit formula for both the memory requirement and the amount of off-chip data traffic required for any given tile size. He can then use nonlinear optimisation techniques to explicitly optimise the traffic subject to any given constraint on buffer size. This work is available as an open-source tool at https://github.com/Junyi-Liu/PolyTSS.

Back in 2016, some work I did with Eddie Hung, James Davis, Josh Levine, Ed Stott and Peter Cheung won the best paper prize at FCCM 2016. We showed that it is possible to use an online (recursive least squares) algorithm to learn the instantaneous power consumption of individual components in an FPGA design, with a view to some kind of run-time manager using this information. The solution worked by monitoring certain signal activity at run-time, but the missing part of the puzzle was which signals to monitor. James’s latest paper, STRIPE, with the same co-authors, answers this question. It turns out that the answer to this problem – as with so many in engineering (and life?) – lies in linear algebra. Golub and Van Loan describe in their classic textbook how QR factorisation can be used to heuristically select a subset of “nearly linearly independent” vectors from a larger set, and it’s this approach that tends to win out when given enough data to work with.

# Overclocking For Fun and Profit

This week at the Design, Automation and Test in Europe (DATE) conference, Kevin Murray is presenting some exciting work I’ve done in collaboration with Kevin, his supervisor Vaughn Betz at the University of Toronto, and Andrea Suardi at Imperial College.

I’ve been working for a while on the idea that one form of approximate computing derives from circuit overclocking. The idea is that if you overclock a circuit then this may induce some error. However the error may be small or rare, despite a very significant performance enhancement. We’ve shown, for example, that such tradeoffs make sense for image processing hardware and – excitingly – that the tradeoffs themselves can be improved by adopting “overclocking-friendly” number representations.

In the work I’ve done on this topic to date, the intuition that a given circuit is “overclocking friendly” for a certain set of input data has been a human one. In this latest paper we move to an automated approach.

Once we accept the possibility of overclocking, our circuit timing analysis has to totally change – we can’t any longer be content with bounding the worst-case delay of a circuit, because we’re aiming to violate this worst case by design. What we’re really after is a histogram of timing critical paths – allowing us to answer questions like “what’s the chance that we’ll see a critical path longer than this in any given clock period?” Different input values and different switching activities give rise to the sensitization of different paths, leading to different timing on each clock cycle.

This paper’s fundamental contribution is to show that the #SAT probem can be efficiently used to quantify these probabilities, giving rise to the first method for determining at synthesis time the affinity of a given circuit to approximation-by-overclocking.