Industry & Academia Unite!

I’ve just returned from a very busy week in California, visiting Intel and AMD in Folsom before chairing the SpatialML annual meeting in Monterey and then attending the first in-person ACM FPGA conference since the pandemic, also in Monterey. It was a great trip, and I think the theme for me was just how well the tech industry and academia can work together, building on each other’s strengths.

Industry Visits

I was delighted to have the first chance to pay an in-person visit to the team built by my former PhD student Theo Drane at Intel in Folsom, where he has been able to gather an exceptionally strong and diverse mix of skills. Intel is sponsoring my PhD student Sam Coward (far left of photo), who is also a member of staff in the group. Together we have been working on some really exciting new ways to optimise datapath designs using word-level rewrite rules. While on the trip, we learnt that Sam’s DAC paper has been accepted, so there will be more I will want to say publicly about the development of this work soon!

With the team at Intel

Also in Folsom, I was able to meet members of the graphics team at AMD, who have recently agreed to support two new PhD students in my lab working on the interaction between microarchitecture and machine learning. I expect them to start in October 2023, with one co-supervised by Deniz Gündüz at Imperial and one co-supervised by Wayne Luk, also at Imperial. We’re planning exciting work, so watch this space.

SpatialML

In 2019 I was awarded a grant by EPSRC to bring together Imperial and Southampton (led there by Geoff Merrett) in the UK with Toronto (Anderson, Betz, Chow), UCLA (Cong, Ercegovac) and industry (Imagination, Xilinx, Maxeler, NVIDIA, Google, Intel, Arm & Corerain) to reimagine the full stack of machine learning hardware when computing across space, not just time. You can visit the centre website here and see an up-to-date feed of our work on Twitter. Due to COVID, the last time we were able to hold an all-hands meeting was just after kick-off, in February 2020, so it was truly a delight to be able to do so again this year. In the meantime, we have been very active and have expanded the centre to include Cornell & Cornell Tech (Abdelfattah, Zhang), Sydney (Boland, Leong) and Samsung AI; since then, Maxeler has also been acquired by Groq and Xilinx by AMD. Whereas in 2020 I had primarily focused the workshop on faculty and industry talks, this year I re-focused it on the progress various PhD researchers had made since 2020, together with very successful two-minute lightning talks from all participants to aid networking and build a solid foundation for our researcher exchange programme. In addition, the new participants were given the chance to highlight their work in extended talks.

Members of the EPSRC-funded International Centre for Spatial Computational Learning

We had the following speakers:

At the meeting dinner, we were also able to celebrate the announcement that our member, Jason Cong, has just won the EDAA Achievement Award, to be presented at DATE 2023. We were able to end the workshop with a productive session, planning collaboration and researcher exchange between our sites over the coming year.

FPGA 2023

The ACM International Symposium on FPGAs (known as ‘FPGA’) is one of the key dates in my calendar. As a former chair of the conference, I sit on its steering committee, and I have always been a great supporter of the genuine community around this conference, combining a strong academic and industrial presence. It was wonderful to be back in person after two years of virtual FPGAs.

This year two of my collaborators were chairing: Zhiru Zhang (Cornell) was Program Chair and Paolo Ienne (EPFL) was General Chair. They put together a great event – you can read the program here. There were many highlights, but I would particularly mention the paper by my collaborator Lana Josipović, her PhD student Jiahui Xu and collaborators on the use of model checking to remove overhead in dynamically-scheduled HLS, closely related to my student Jianyi’s stream of work, which I’ve described in other blog posts. I also had the privilege of serving on the best paper award panel this year, alongside Deming Chen and Viktor Prasanna. The award went to work from Jane Li‘s group on a plug-and-play approach to integrating SSD storage into high-level synthesis, presented by the lead author Linus Wong during the session I chaired on “High-Level Abstraction and Tools”.

It was a personal delight to see so many of my former PhD students, and the definite highlight was getting to meet the first child of one of my former students, born over the period I’ve been out of the loop due to COVID. My former student Sam Bayliss was in attendance, presenting a tutorial on programming the AMD AI Engines using MLIR. My former student Kan Shi (Chinese Academy of Sciences) was also presenting his work on on-chip debugging, together with my other former student David Boland (Sydney) and my current undergraduate student Yuhan Diao (Imperial College). Kan is also a part-time YouTuber, and his recent YouTube interview with me was mentioned to me by several attendees – I suspect it has had far wider reach than my publications!

Some of the Imperial College staff, students and alumni attending FPGA 2023

Overall, I have come away from the trip energised by the technical conversations, with far more ideas for collaboration than I’ll be able to follow up on, and with an even further strengthened belief in the immense promise of industrial / academic collaboration in my field.

Equivalent. But better.

Ever since primary (elementary) school, we’ve known that multiplying an integer by 10 is easy. No need for the long written calculations we churn through when doing multiplication by hand. Just stick an extra zero on the end, and we’re done. Multiplication is (relatively) hard, concatenation of digits is easy. And yet, in this case, they’re equivalent in terms of the operation they perform.

Similar equivalences abound in the design of digital hardware for arithmetic computation, and my PhD student Sam Coward (jointly supervised by Theo Drane from Intel) has been devising ways to automatically take advantage of such equivalences to make Intel hardware smaller and more efficient. He will be presenting our work on this topic at the main computer arithmetic conference, ARITH, next week. The conference will be online, and registration is free: https://arith2022.arithsymposium.org.

Let’s explore this example from our early school years a bit more. I’ll use Verilog notation \{\cdot,\cdot\} to denote a function taking two bit vectors and concatenating them together. Of course ‘multiplication by 10 is easy’ becomes ‘multiplication by 2 is easy’ in binary. Putting this together, we can write 2*x \simeq \{x,0\}, meaning that multiplication by two is the same as concatenation with a zero. But what does ‘the same as’ actually mean here? Clearly they are not the same expression syntactically, and one is cheap to compute whereas the other is expensive. What we mean is that no matter which value of x I choose, the value computed on the left-hand side is the same as the value computed on the right-hand side. This is why I’ve chosen to write \simeq rather than =. \simeq clearly defines a relation on the set of expressions. This is a special kind of relation called a congruence: it’s an equivalence relation, i.e. symmetric, transitive and reflexive, but it also ‘plays well’ with function application: if x \simeq y then it necessarily follows that f(x) \simeq f(y) for every function symbol f. Like any equivalence relation on a set, \simeq partitions the set into equivalence classes: in our setting, a class corresponds to expressions that can be freely interchanged without changing the functionality of our hardware, even if the interchange changes the performance, area or energy consumption of the resulting design.
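To make this concrete, here is a small Python sketch (my illustration, not part of the paper) that exhaustively checks the ‘Mult by Two’ equivalence at small widths — exactly the kind of semantic fact \simeq is capturing:

def mult_by_two(x, n):
    return (2 * x) % (1 << (n + 1))   # multiply an n-bit value, keeping an (n+1)-bit result

def concat_zero(x, n):
    return x << 1                     # {x, 0}: append a zero as the new LSB

for n in range(1, 9):                 # widths 1..8
    for x in range(1 << n):
        assert mult_by_two(x, n) == concat_zero(x, n)
print("2*x ~ {x, 0} holds for every input up to 8 bits")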

Our colleagues Willsey, Nandi, Wang, Flatt, Tatlock and Panchekha recently published egg, a wonderful open-source library for building and exploring data structures known as ‘e-graphs’, specifically designed to capture these relations on expressions. Sam, Theo and I have developed a set of ‘rewrites’ capturing some of the important intuition that Intel designers apply manually, and encoded these for use within egg. To give you a flavour of these rewrites, here’s the table from Sam’s paper; you can see the example we started with hiding in there under the name ‘Mult by Two’. The subscripts indicate how many digits we’re dealing with; not all these rules are true for arbitrarily-sized integers, and Sam has gone to some lengths to discover simple rules – listed here as ‘sufficient condition’ – for when they can be applied. This is really important in hardware, where we can use as few or as many bits as the job requires.

You can imagine that, when you have this many equivalences, they all interact and you can very quickly build up a very large set of possible equivalent expressions. e-graphs help us to compactly represent this large set.
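To give a flavour of how, here is a toy e-graph in Python (a bare-bones illustration only — egg implements this properly, with congruence closure, rebuilding and e-matching): e-classes live in a union-find structure, and e-nodes are hash-consed with e-class identifiers as children, so a single class can stand for a huge set of syntactically distinct expressions:

class EGraph:
    def __init__(self):
        self.parent = []     # union-find over e-class ids
        self.hashcons = {}   # canonical e-node -> e-class id

    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a

    def add(self, op, *children):
        node = (op, tuple(self.find(c) for c in children))
        if node in self.hashcons:
            return self.find(self.hashcons[node])
        cid = len(self.parent)
        self.parent.append(cid)
        self.hashcons[node] = cid
        return cid

    def union(self, a, b):
        a, b = self.find(a), self.find(b)
        if a != b:
            self.parent[b] = a   # merge the two e-classes

eg = EGraph()
x    = eg.add('x')
two  = eg.add('2')
zero = eg.add('0')
mul  = eg.add('*', two, x)        # 2 * x
cat  = eg.add('concat', x, zero)  # {x, 0}
eg.union(mul, cat)                # apply the 'Mult by Two' rewrite
assert eg.find(mul) == eg.find(cat)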

Once our tool has spent enough time building such a representation of equivalences, we need to extract an efficient hardware implementation. This is actually a hard problem in itself, because common subexpressions change the hardware cost. For example, if I’m calculating (x+1)*(x+1) then I wouldn’t bother to calculate x+1 twice. We describe in our paper how we address this problem via an optimisation formulation. Our tool solves this optimisation and produces synthesisable Verilog code for the resulting circuit.
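A toy cost model (invented numbers, purely for exposition) shows why extraction cannot simply minimise node-by-node: with sharing, a subexpression is paid for once, so the cheapest choice at one node depends on choices made elsewhere in the graph:

COST = {'+': 1, '*': 8, 'var': 0, 'const': 0}   # made-up area costs

def dag_cost(node, seen=None):
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0                  # shared subexpression: already paid for
    seen.add(id(node))
    op, children = node
    return COST[op] + sum(dag_cost(c, seen) for c in children)

x, one = ('var', ()), ('const', ())
t = ('+', (x, one))
shared   = ('*', (t, t))                            # (x+1)*(x+1), x+1 computed once
unshared = ('*', (('+', (x, one)), ('+', (x, one))))
print(dag_cost(shared), dag_cost(unshared))         # 9 vs 10: sharing wins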

So, does it generate good circuits? It certainly does! The graph below shows the circuit area and performance achievable without (blue) and with (orange) our tool flow applied ahead of standard logic synthesis. For this example, silicon area can be reduced by up to around 70% – a very significant saving.

Area/Delay tradeoff for a smoothing kernel before and after our work

I’ve really enjoyed working on this topic with Sam and Theo. Lots more exciting content to follow. In the meantime, please tune in to hear Sam talk about it next week.

Keeping the Pipelines Full

On the 16th May, my PhD student Jianyi Cheng (jointly advised with John Wickerson) will present his most recent paper “Dynamic C-Slow Pipelining for HLS” at FCCM 2022 in New York, the FPGA community’s first in-person conference since the pandemic hit.

Readers of this blog may remember that Jianyi has been working on high-level synthesis techniques that combine the best of dynamic scheduling with the best of static scheduling [FPGA 2020,FCCM 2021]. The general principle underlying his work is to make the most of what information we have at compile time to develop highly efficient custom architectures, while leaving what we don’t know at compile time to influence execution at run-time.

A very common design pattern in hardware acceleration is the idea of C-slow pipelining. Pipelining tends to be taught early in undergraduate programmes, but C-slow pipelining rarely gets mentioned. The idea arises in circuits with feedback loops, where the basic approach to pipelining doesn’t really work: although we can throw multiple registers into the circuit, potentially improving clock frequency at the cost of latency just as in feed-forward circuits, the data dependency corresponding to the feedback loop means we can’t then overlap computation to achieve improved throughput.

C-slow pipelining essentially says “OK, but you can use the spare capacity induced by the pipeline registers to overlap computation of independent streams of data, if you happen to have them available.”
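Here is a tiny Python simulation of the idea (my own illustration, assuming unit-latency stages and C = 2): a two-stage pipelined feedback circuit cannot start the same stream’s next iteration every cycle, but interleaving two independent streams keeps the pipeline full while producing exactly the results sequential execution would:

def stage1(s, x): return s + x            # first pipeline stage of the update
def stage2(t):    return (t * 2) % 1000   # second pipeline stage

streams = {0: [1, 2, 3], 1: [10, 20, 30]} # two independent input streams
state   = {0: 0, 1: 0}                    # per-stream feedback state
reg = None                                # pipeline register between the stages

for c, i in [(c, i) for i in range(3) for c in (0, 1)]:  # round-robin issue
    if reg is not None:
        s, v = reg
        state[s] = stage2(v)              # retire the item in flight
    reg = (c, stage1(state[c], streams[c][i]))  # issue one new item per cycle

s, v = reg
state[s] = stage2(v)                      # drain the last item
print(state)                              # {0: 22, 1: 220} — matches sequential execution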

Our new paper introduces a dynamic HLS flow for C-slow pipelining. This is particularly valuable in a globally dynamic environment where certain components nevertheless exhibit static control flow and can be efficiently pipelined – for example, some deep but predictable computation that must be repeated many times, but where the arrival times and sources of that computation may change dynamically at runtime. A perfect fit for our prior work.

Jianyi presents a way to leverage the Boogie language and tool flow from Microsoft Research to automatically prove sufficient conditions for C-slowing to be correct. He is then able to introduce a new hardware component within the Dynamatic HLS tool that allows the schedule to “run ahead” to implement certain bounded out-of-order executions corresponding to C-slowing at the circuit level.

At the cost of a small area overhead in the region of 10%, this combined software analysis and hardware transformation is able to reduce wall-clock execution time by more than half compared to the vanilla dynamic scheduling approach.

If you’ll be in NYC in mid-May, go along and hear Jianyi’s talk!

Nonlinearity is Your Friend

My former PhD student Erwei Wang and I recently teamed up with some collaborators at UCL: Dovydas Joksas, Nikolaos Barmpatsalos, Wing Ng, Tony Kenyon and Adnan Mehonic and our paper has just been published by Advanced Science (open access).

Our goal was to start to answer the question of how specific circuit and device features can be accounted for in the training of neural networks built from analogue memristive components. This is a step outside my comfort zone of digital computation, but naturally fits with the broader picture I’ve been pursuing under the auspices of the Center for Spatial Computational Learning on bringing circuit-level features into the neural network design process.

One of the really interesting aspects of deep neural networks is that the basic functional building blocks can be quite diverse and still result in excellent classification accuracy, both in theory and in practice. Typically these building blocks include linear operations and a type of nonlinear function known as an activation function, the latter being essential to the expressive power of ‘depth’ in deep neural networks. This linear / nonlinear split is something Erwei and I, together with our coauthors James Davis and Peter Cheung, challenged for FPGA-based design, where we showed that the nonlinear expressive power of Boolean lookup tables provides considerable advantages. Could we apply a similar kind of reasoning to analogue computation with memristors?

Memristive computation for the linear part of the computation performed in neural networks has been proposed for some time. Computation essentially comes naturally, using Ohm’s law to perform scalar multiplication and Kirchhoff’s Current Law to perform addition, resulting in potentially energy-efficient analogue dot product computation in a physical structure known as a ‘crossbar array’. To get really high energy efficiency, though, devices should have high resistance. But high resistance brings nonlinearity in practice. So do we back away from high resistance devices so we can be more like our mathematical abstractions used in our training algorithms? We argue not. Instead, we argue that we should make our mathematical abstractions more like our devices! After all, we need nonlinearity in deep neural networks. Why not embrace the nonlinearity we have, rather than compromise energy efficiency to minimise it in linear components, only to reintroduce it later in activation functions?
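Here is a sketch of the contrast in Python (illustrative numbers; the sinh-shaped I-V curve is one common device model in the memristor literature, standing in for the detailed models in our paper). The point of our approach is to train against the second function rather than forcing devices to approximate the first:

import numpy as np

def crossbar_linear(G, v):
    # Ohm's law per device (I = G*V), Kirchhoff's current law per column
    return G.T @ v

def crossbar_nonlinear(G, v, b=3.0):
    # sinh-shaped I-V curve; reduces to the linear case as b -> 0
    I = G * (np.sinh(b * v) / b)[:, None]
    return I.sum(axis=0)

G = np.array([[1e-6, 2e-6],      # conductances (siemens) acting as weights
              [3e-6, 4e-6]])
v = np.array([0.2, 0.4])         # input voltages acting as activations
print(crossbar_linear(G, v))     # the dot product our maths usually assumes
print(crossbar_nonlinear(G, v))  # the dot product the device actually computes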

MNIST classification error trading off against power consumption

I think our first experiments in this area are a great success. We have been able to not only capture a variety of behaviours traditionally considered ‘non-ideal’ and harness them for computation, but also show very significant energy efficiency savings as a result. You can see an example of this in the figure above (refer to the paper for more detail). In high power consumption regimes, you can see little impact of our alternative training flow (green & blue) compared to the standard approach (orange) but when you try to reduce power consumption, a very significant gap opens up between the two precisely because our approach is aware of the impact this has on devices, and the training process learns to adapt the network accordingly.

We’ve only scratched the surface of what’s possible – I’m looking forward to lots more to come! I’m also very pleased that Dovydas has open-sourced our training code and provided a script to reproduce the results in the paper: please do experiment with it.

Pruning Circuits

On Tuesday, my former PhD student Erwei Wang (now at AMD) will present our recent paper “Logic Shrinkage: Learned FPGA Netlist Sparsity for Efficient Neural Network Inference” at the ACM International Symposium on FPGAs. This is joint work with our collaborator Mohamed Abdelfattah from Cornell Tech as well as James Davis, George-Ilias Stavrou and Peter Y.K. Cheung at Imperial College.

In 2019, I published a paper in Phil. Trans. Royal Soc. A, suggesting that it would be fruitful to explore the possibilities opened up when considering the graph of a Boolean circuit – known as a netlist – as a neural network topology. The same year, in a paper at FCCM 2019 (and then with a follow-up article in IEEE Transactions on Computers), Erwei, James, Peter and I showed how to put some of these ideas into practice by learning the content of the Boolean lookup tables that form the reconfigurable fabric of an FPGA, for a fixed circuit topology.

Our new paper takes these ideas significantly further and actually learns the topology itself, by pruning circuit elements. Those working on deep neural networks will be very familiar with the idea of pruning – removing certain components of a network to achieve a more efficient implementation. Our new paper shows how to apply these ideas to prune circuits made of lookup tables, leaving a simplified circuit capable of highly-efficient inference. Such pruning can consist of reducing the number of inputs of each Boolean LUT and, in the limit, removing the LUT completely from the netlist. We show that very significant savings are possible compared to binarising a neural network, pruning that network, and only then mapping it into reconfigurable logic – the standard approach to date.
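To give a flavour of the idea, here is a toy Python version of LUT input pruning (a simple sensitivity heuristic of my own for exposition — in the paper, what to prune is learned during training):

from itertools import product

def prune_one_input(lut, k):
    # score each input by how often flipping it changes the output
    sens = [0] * k
    for bits in product((0, 1), repeat=k):
        for i in range(k):
            flipped = tuple(b ^ (j == i) for j, b in enumerate(bits))
            sens[i] += lut[bits] != lut[flipped]
    victim = min(range(k), key=lambda i: sens[i])   # least sensitive input
    new_lut = {}
    for bits in product((0, 1), repeat=k - 1):
        full0 = bits[:victim] + (0,) + bits[victim:]
        full1 = bits[:victim] + (1,) + bits[victim:]
        # if sens[victim] == 0 the two halves agree; otherwise we accept some error
        new_lut[bits] = lut[full0] | lut[full1]
    return victim, new_lut

lut = {b: b[0] & b[1] for b in product((0, 1), repeat=3)}  # input 2 is irrelevant
print(prune_one_input(lut, 3))   # drops input 2, leaving a 2-input AND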

We have open-sourced our flow, and our paper has been given a number of ACM reproducibility badges. Please try it out at https://github.com/awai54st/Logic-Shrinkage and let us know what you think. And, if you’re attending FPGA next week, reach out and say hello.

Better accuracy / silicon area tradeoffs through pruning circuits

Islands of Certainty

Readers of this blog may remember that my PhD student Jianyi Cheng (jointly supervised by John Wickerson) has been working on high-level synthesis, combining dynamic scheduling with static scheduling. His latest contribution, to be presented on 28th February at the ACM FPGA conference, is about finding islands of static control flow in a sea of dynamic behaviour.

Here’s the story so far:

So now, two years later, we are back at the same conference to present a method to do just that. We now have an automated flow to select parts of a program to statically schedule, resulting in a 4x reduction in area combined with a 13% boost in performance compared to a fully dynamic circuit, a result that is close to the best achievable — as shown by exhaustively enumerating different parts of the program to schedule statically.

The basic idea of the paper is to develop the concept of a static island — a part of a dataflow graph where making decisions on scheduling of operations once, at compile time, is likely to have minimal impact on performance (or may even improve it) while opening the door to static resource sharing. We can throw a boundary around these islands, synthesise them efficiently with commercial HLS tools (we use Xilinx Vitis HLS), and integrate the result into the overall dynamic circuit using our previous open-source compiler flow.

So what makes a good static island? Unsurprisingly, these islands should exhibit static control flow or control flow with balanced path timing, e.g. in a conditional statement the if and else branches should take the same time, and loops should have constant dependence distances (or none at all). Jianyi also shows that there is an advantage to having these islands consume their inputs at offset times, e.g. for a two-input island we may wish the static scheduler to be aware that the second input is available – on average – two cycles after the first. He shows precisely how to generate ‘wrapper’ circuits for these components, allowing them to communicate with a dynamically scheduled environment.
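A back-of-envelope model of why offset-aware islands help (my illustration, not the paper’s formulation): if the schedule inside the island reads its second input offset_b cycles after the island starts, the island need not wait for both inputs to arrive before starting:

def start_time(arrive_a, arrive_b, offset_b):
    # the island reads A at cycle 0 of its schedule and B at cycle offset_b,
    # so it may start as soon as both reads will be satisfied in time
    return max(arrive_a, arrive_b - offset_b)

for a, b in [(0, 2), (3, 5), (4, 4)]:
    print(start_time(a, b, 0), start_time(a, b, 2))  # rigid vs offset-aware start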

The overall design flow, shown below, is now fully automated – freeing the user from writing the pragmas we required two years ago.

What’s the Rush?

At FPL 2021, my PhD student Jianyi Cheng (jointly supervised by John Wickerson) will present our short paper “Exploiting the Correlation between Dependence Distance and Latency in Loop Pipelining for HLS”. In this post, I explain the simple idea behind this paper and how it can significantly accelerate certain neglected corner cases in high-level synthesis (HLS).

By far the most significant way to extract high performance from a hardware accelerator in high-level synthesis is to use loop pipelining. Loop pipelining is the idea of starting the next iteration of a loop before the previous one finishes, allowing multiple iterations to be executing simultaneously. However, some loop iterations may need a result produced by earlier loop iterations, limiting the extent to which this can be done. HLS tools generally determine a ‘safe’ initiation interval – the number of clock cycles between starting two adjacent loop iterations – and then schedule the iterations statically at multiples of this interval.

This limit on the initiation interval of the loop essentially derives from two properties. Firstly, if it takes a long time for the computation of a loop iteration to execute, then any iterations waiting on its result must be delayed. But secondly, if an iteration’s result is only needed many iterations later, it can afford to take a long time to compute: what’s the rush? These two factors – latency and dependence distance – together determine the safe initiation interval.
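For readers who like a formula: a dependence with latency L cycles and distance D iterations constrains the initiation interval to II \geq \lceil L / D \rceil, and the safe II is the maximum of this bound over all dependences. For example, a 10-cycle computation whose result is needed 4 iterations later forces II \geq \lceil 10/4 \rceil = 3, whereas if the result is only needed 63 iterations later, II \geq \lceil 10/63 \rceil = 1 — no constraint at all.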

The simple observation of our paper is that HLS tools generally over-approximate latency and under-approximate dependence distance independently of each other. However, there are some programs where dependence distance and latency are correlated. Jianyi gives this nice motivating example in the paper:

double f( double a ) {
  return (((((a+0.64)*a+0.7)*a+0.21)*a+0.33)*a+0.25)*a+0.125;
}

void example( double vec[M] ) {

  for (int i = 0; i < N; i++) {
    double e = vec[i];
    if (e > 0) vec[i+63] = f(e);
    else vec[i*i+9] = e * e;
  }

}

In this code snippet, you can see two control paths in the loop. The if branch has a long latency (it computes the Horner scheme polynomial f) but also writes to elements of vec that only get read many iterations later. Meanwhile the else branch has a short latency but can write – in the early stages of the loop at least – to values read in nearby iterations.

The end result is that the commercial tools Jianyi tried don’t cope very well with scheduling this loop. However, Jianyi has developed an approach that uses the formal verification tool Boogie to show that this loop can actually be scheduled very efficiently by exploiting this correlation.

He has developed an LLVM pass called iiProver that proves that it is safe to use a certain II with the commercial Vitis HLS tool from Xilinx. iiProver and our benchmarks are available – please take a look: https://github.com/JianyiCheng/iiProver. And you can hear Jianyi talking about his work on YouTube here: https://www.youtube.com/watch?v=SdQeBBc85jc.

It Probably Works!

Followers of my research will know that I’ve long been interested in rounding errors and how they can be controlled to best achieve efficient hardware designs. Going back 20 years, I published a book on this topic based on my PhD dissertation, where I addressed the question of how to choose the precision / word-length (often called ‘bit width’ in the literature) of fixed point variables in a digital signal processing algorithm, in order to achieve a controlled tradeoff between signal-to-noise ratio and implementation cost.

Fast forward several years, and my wonderful collaborators Fredrik Dahlqvist, Rocco Salvia, Zvonimir Rakamarić and I have a new paper out on this topic, to be presented by Rocco and Fredrik at CAV 2021 next week. In this post, I summarise what’s new here – after all, the topic has been studied extensively since Turing!

I would characterise the key elements of this work as: (i) probabilistic, i.e. we’re interested in showing that computation probably achieves its goal, (ii) floating point (especially of the low custom-precision variety), and (iii) small-scale computation on straight-line code, i.e. we’re interested in deep analysis of small kernels rather than very large code, code with complex control structures, or code operating on very large data structures.

Why would one be interested in showing that something probably works, rather than definitely works? In short, because the worst-case behaviour of numerical algorithms is often very far from their average-case behaviour, a point discussed in depth in Higham and Mary’s SIAM paper. Often, ‘probably works’ is good enough, as we’ve seen recently with the huge growth of machine learning techniques predicated on this assumption.

In recent work targeting large-scale computation, Higham and Mary and, independently, Ipsen, have considered models of rounding error that are largely / partially independent of the statistical distribution of the error induced by a specific rounding operation. Fredrik was keen to take a fresh look at the kind of distributions one might see in practice, and in our paper has derived a ‘typical distribution’ that holds under fairly common assumptions.
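You can get an empirical feel for this in a few lines of Python (a quick illustration of mine, not the paper’s derivation): sample the relative rounding error of double-precision multiplications and histogram it over [-u, u], where u is the unit roundoff; the mass concentrates towards the middle rather than spreading uniformly:

import random
from fractions import Fraction

random.seed(1)
u = 2.0 ** -53                       # unit roundoff for IEEE double precision
buckets = [0] * 8
for _ in range(100000):
    x, y = random.uniform(1, 2), random.uniform(1, 2)
    exact = Fraction(x) * Fraction(y)              # exact rational product
    delta = float((Fraction(x * y) - exact) / exact)
    b = int((delta / u + 1) / 2 * 8)               # map [-u, u] onto buckets 0..7
    buckets[min(max(b, 0), 7)] += 1
print(buckets)                       # far more mass in the middle buckets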

Rocco and Fredrik then decided that a great way to approximate the probabilistic behaviour of the program is to sandwich whatever distribution is of interest between two other easy-to-compute distributions, utilising the prior idea of a p-box.

One of the core problems of automated analysis of numerical programs has always been that of ‘dependence’. Imagine adding together two variables, each in the range [-1,1]. Clearly their sum is in the range [-2,2]. But what if we knew, a priori, that these two variables were related somehow? Consider the expression X + (-X), which is clearly always zero. Ideally, an automated system should be able to produce a tighter result than [-2,2] for this! Over the years, many approaches to dealing with this issue have arisen, from the very simple approach of affine arithmetic to the more complex semialgebraic techniques Magron, Donaldson and I developed using sequences of semidefinite relaxations. In our CAV paper, we take the practical step of cutting out regions of the resulting probability space with zero probability using modern SMT solver technology. Another interesting aspect of our paper is the decision of which nonlinear dependences to keep and which to throw away for scalability reasons. Similar to my work with Magron, we keep first-order dependence on small rounding error variables but higher-order dependence on input program variables.
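To see the dependence problem concretely, here is a minimal Python sketch (for exposition only — our tool works with p-boxes and SMT, not this): plain interval arithmetic gives X + (-X) the range [-2, 2], while a toy affine form tracks the shared noise symbol and cancels it exactly:

class Interval:
    def __init__(self, lo, hi): self.lo, self.hi = lo, hi
    def __add__(self, o): return Interval(self.lo + o.lo, self.hi + o.hi)
    def __neg__(self): return Interval(-self.hi, -self.lo)
    def __repr__(self): return f"[{self.lo}, {self.hi}]"

class Affine:
    # value = c0 + sum_i c_i * eps_i, with each eps_i ranging over [-1, 1]
    def __init__(self, c0, terms): self.c0, self.terms = c0, dict(terms)
    def __add__(self, o):
        t = dict(self.terms)
        for k, v in o.terms.items(): t[k] = t.get(k, 0) + v
        return Affine(self.c0 + o.c0, t)
    def __neg__(self): return Affine(-self.c0, {k: -v for k, v in self.terms.items()})
    def range(self):
        r = sum(abs(v) for v in self.terms.values())
        return (self.c0 - r, self.c0 + r)

x = Interval(-1, 1)
print(x + (-x))                  # [-2, 2]: the correlation is lost
xa = Affine(0, {'eps_x': 1})     # x modelled as a single noise symbol
print((xa + (-xa)).range())      # (0, 0): the correlation cancels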

I am really excited by the end result: not only a wonderful blend of ideas from numerical analysis, programming languages, automated reasoning and hardware, but also a practical open-source tool people can use: https://github.com/soarlab/paf. Please give it a try!

Readers interested in learning more about the deeply fascinating topic of numerical properties of floating point would be well advised to read Higham’s outstanding book on the topic. Readers interested in the proofs of the theorems presented in our CAV paper should take a look at the extended version we have on arXiv. Those interested in some of the issues arising (in the worst case setting) when moving beyond straight-line code could consult this paper with Boland. Those interested in the history of this profoundly beautiful topic, especially in its links to linear algebra, would do well to read Wilkinson.

Scheduling with Probabilities

Readers of this blog may remember that Jianyi Cheng, my PhD student jointly supervised by John Wickerson, has been investigating ways to combine dynamic and static scheduling in high-level synthesis (HLS). The basic premise has been that static scheduling, when it works well due to static control, works very well indeed. Meanwhile, for programs exhibiting highly dynamic control flow, static scheduling can be very conservative, a problem addressed by our colleagues Lana Josipović, Radhika Ghosal and Paolo Ienne at EPFL. Together with Lana and Paolo, we developed a scheme to combine the best of both worlds, which we published at FPGA 2020 (and recently extended in IEEE Transactions on CAD). I blogged about this work previously here. We provided a tool flow allowing us to stitch large efficient statically-scheduled components into a dynamic circuit.

However, when scheduling a circuit statically, there are many design choices that can be made, typically to trade off time (throughput, latency) against area. So while our previous work was useful to stitch pre-existing statically-scheduled components into a dynamically-scheduled environment, we had no way of automatically designing those components to optimally fit the dynamic environment.

Enter Jianyi’s latest contribution – to be presented at FCCM 2021 next week.

In his paper “Probabilistic Scheduling in High-Level Synthesis”, Jianyi tackles this problem. He demonstrates that the dynamic environment, including data-dependent decisions and even load-store queues, can be adequately modelled using a Petri net formalism, and uses the PRISM model checker from Kwiatkowska et al. to extract an appropriate initiation interval for each statically-scheduled component.

One of Jianyi’s Petri net models of some memory accesses.
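As a crude stand-in for what the model checker computes (a Monte Carlo sketch of mine — PRISM analyses the Petri net exactly): if a component’s second operand sometimes arrives late, the appropriate initiation interval is a probability-weighted quantity rather than a worst case:

import random
random.seed(0)

def average_gap(trials=100000, p_slow=0.3):
    total = 0
    for _ in range(trials):
        a = 1                                     # first token: fixed latency
        b = 4 if random.random() < p_slow else 1  # second token: sometimes late
        total += max(a, b)                        # the transition fires when both arrive
    return total / trials

print(f"average achievable initiation interval ~ {average_gap():.2f} cycles")
# a worst-case analysis would say 4; the probability-weighted answer is ~1.9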

The initiation intervals inferred by Jianyi’s tool can then be given to a commercial HLS tool – in our case Vitis HLS – to schedule each component. The components – together with any remaining dynamically-scheduled code – are then integrated using our previously published framework, producing the complete FPGA-ready design. The whole process provides a quality of result very close to an exhaustive search over possible initiation intervals, without having to perform multiple scheduling runs, and so in a fraction of the time.

Easter Coq

This Easter I set myself a little challenge to learn a little bit of Coq – enough to construct a proof of a simple but useful theorem in computer arithmetic. Long-time readers of this blog will know that this is not my first outing with dependent types, though I’ve never used them in anger. Four years ago – also during the Easter break! – I read Stump‘s book on Agda and spent some time playing with proofs and programming, as I documented here.

This blog post documents some of the interesting things in Coq I observed over the last few days. I’ve decided to write the majority of this post in Coq itself, below, before finishing off with some concluding remarks. In this way, anyone really interested can step through the definitions and proofs themselves.


(* 
 * A first datapath identity
 * George A. Constantinides, 2/4/21
 *
 * This is an attempt to learn some basic Coq by proving a standard identity used in computer arithmetic,
 * namely \bar{x} + 1 = \bar{x - 1}.
 * 
 * This identity is useful because it allows Boolean operations to move through arithmetic operations.
 *
 * The code is for learning and teaching purposes only. It is not intended to be an efficient or elegant
 * approach, nor is it intended to make best use of existing Coq libraries. On the contrary, I have often used
 * many steps when one would do, so we can step through execution and see how it works.
 *)


Require Import Coq.Program.Equality.
Require Import Coq.Logic.Eqdep_dec.
Require Import Coq.Arith.Peano_dec.

(* Create my own bitvector type. It has a length, passed as a nat, and consists of bools. *)
Inductive bv : nat -> Set :=
| nilbv : bv 0
| consbv : forall n : nat, bool -> bv n -> bv (S n).

(* Head and tail of a bitvector, with implicit length arguments *)
Definition hdi {n : nat} (xs : bv (S n)) :=
  match xs with
  | consbv _ head _ => head
  end.

Definition tli {n : nat} (xs : bv (S n)) :=
  match xs with
  | consbv _ _ tail => tail
  end.

(* The basic carry and sum functions of a Boolean full adder *)
Definition carryfunc (a : bool) (b: bool) (c: bool) : bool :=
  orb (orb (andb a b) (andb a c)) (andb b c).

Definition sumfunc (a : bool) (b : bool) (c : bool) : bool :=
  xorb (xorb a b) c.

(*
 * A ripple carry adder, with implicit length argument
 * Note that this definition makes use of a trick known as the ‘convoy pattern’ [1]
 * to get the dependent typing to work in a match clause. We use a ‘return’ clause
 * to make the type of the match result be a function which is then applied to the unmatched
 * argument. In this way the type system can understand that x and y have the same dependent type.
 * Note also the use of Fixpoint for a recursive definition.
 *)

Fixpoint rcai {n : nat} (x : bv n) (y : bv n) (cin : bool) : (bool * bv n) :=
  match x in bv n return bv n -> ( bool * bv n ) with
  | nilbv => fun _ => (cin, nilbv) (* an empty adder passes its carry in to its carry out *)
  | consbv n1 xh xt => fun y1 =>
       let (cout, sumout) := rcai xt (tli y1) (carryfunc cin xh (hdi y1)) in
                        (cout, consbv n1 (sumfunc cin xh (hdi y1)) sumout)
  end y.

(* We define addition modulo 2^n by throwing away the carry out, using snd, and then define an infix operator *)
Definition moduloadder {n : nat} (x : bv n) (y : bv n) : (bv n) :=
  snd (rcai x y false).

Infix "+" := moduloadder.

(* Bitwise negation of a word *)
Fixpoint neg {n : nat} (x : bv n) : (bv n) :=
  match x with
  | nilbv => nilbv
  | consbv n1 xh xt => consbv n1 (negb xh) (neg xt)
  end.

(* The word-level constant zero made of n zeros *)
Fixpoint bvzero {n : nat} : (bv n) :=
  match n with
  | O => nilbv
  | (S n1) => consbv n1 false bvzero
  end.

(* The word-level constant one with n leading zeros *)
Definition bvone {n : nat} :=
  consbv n true bvzero.

(* Additive inverse of a word, defined as ‘negate all the bits and add one’  *)
Definition addinv {n : nat} (x : bv (S n)) : (bv (S n)) :=
  neg(x) + bvone.

(* Subtraction modulo 2^n is defined as addition with the additive inverse and given its own infix operator *)
Definition modulosub {n : nat} (x : bv (S n)) (y : bv (S n)) :=
  x + (addinv y).

Infix "-" := modulosub.

(* a bit vector of just ones *)
Fixpoint ones {n : nat} : (bv n) :=
  match n with
  | O => nilbv
  | S n1 => consbv n1 true ones
  end.

(* OK, now we have some definitions, let’s prove some theorems! *)

(* Our first lemma (‘Lemma’ versus ‘Theorem’ has no language significance in Coq) says that inverting a
 * bitvector of ones gives us a bitvector of zeros.
 * There’s a couple of interesting points to note even in this simple proof by induction:
 * 1. I had to use ‘dependent destruction’,
 *    which is defined in the Coq.Program.Equality library, to get the destruction of variable x to take into account
 *    the length of the bitvector.
 * 2. The second use of inversion here didn’t get me what I wanted / expected, again due to dependent typing, for
 *    reasons I found explained in [2]. The solution was to use a theorem inj_pair_eq_dec, defined in 
 *    Coq.Logic.Eqdep_dec. This left me needing to prove that equality on the naturals is decidable. Thankfully,
 *    Coq.Arith.Peano_dec has done that.
 *)


Lemma invertzeros : forall {n : nat} (x : bv n),
  x = bvzero -> neg x = ones.
Proof.
  intros n x H.
  induction n.
  dependent destruction x.
  auto. (* base case proved *)
  dependent destruction x.
  simpl.
  f_equal.
  simpl bvzero in H.

  inversion H.
  reflexivity.

  simpl bvzero in H.
  inversion H. (* inversion with dependent type starts here…          *)
  apply inj_pair2_eq_dec in H2. (* goes via this theorem                                 *)
  2: apply eq_nat_dec. (* and completes via a proof of decidability of equality *)

  apply IHn.
  apply H2.
Qed.

(* 
 * The next lemma says that if you fix one input to a ripple carry adder to zero and feed in the carry-in as zero
 * too, then the carry out will not be asserted and the sum will just equal the fixed input.
 * I proved this by induction, reasoning by case on the possible Boolean values of the LSB.
 * The wrinkle to notice here is that I didn’t know how to deal with a ‘let’ clause, but thanks to Yann Herklotz
 * (https://yannherklotz.com) who came to my aid by explaining that a ‘let’ is syntactic sugar for a match.
 *)


Lemma rcai_zero: forall (n : nat) (x : bv n),
  rcai x bvzero false = (false, x).
Proof.
  intros n x.
  induction n.
  dependent destruction x.
  auto. (* base case proved *)
  dependent destruction x.
  simpl bvzero.
  simpl rcai.
  destruct b.
  unfold sumfunc. simpl.
  unfold carryfunc. simpl.

  destruct (rcai x bvzero false) eqn: H.
  f_equal.

  rewrite IHn in H.
  inversion H.
  reflexivity.

  rewrite IHn in H.
  inversion H.
  f_equal.

  unfold sumfunc. simpl.
  unfold carryfunc. simpl.

  destruct (rcai x bvzero false) eqn: H. (* The trick Yann taught me *)
  f_equal.

  rewrite IHn in H.
  inversion H.
  reflexivity.

  rewrite IHn in H.
  inversion H.
  f_equal.
Qed.

(* 
 * The next lemma proves that -1 is a vector of ones
 * One thing to note here is that I needed to explicitly supply the implicit argument n to addinv using @.
 *)

Lemma allones: forall {n : nat}, @addinv n bvone = ones.
Proof.
  intros n.
  induction n.
  auto. (* base case proved *)

  simpl.
  unfold bvone.
  unfold addinv.
  simpl.
  unfold bvone.
  unfold "+".

  simpl.

  unfold carryfunc.
  simpl.
  unfold sumfunc.
  simpl.

  destruct (rcai (neg bvzero) bvzero false) eqn: H.
  simpl.

  f_equal.
  f_equal.

  rewrite rcai_zero in H.
  inversion H.

  apply invertzeros.
  reflexivity.
Qed.

(* 
 * This lemma captures the fact that one way you can add one to a bitvector using a ripple carry adder is
 * to add zero and assert the carry in port. 
 *)

Lemma increment_with_carry : forall (n : nat) (x : bv (S n)),
  x + bvone = snd (rcai x bvzero true).
Proof.
  intros n x.
  dependent destruction x.

  (* first peel off the LSB from the two operands *)

  simpl bvzero.
  simpl rcai.

  unfold bvone.
  unfold "+".
  simpl rcai.

  (* now case split by the LSB of x to show the same thing *)

  destruct b.

  unfold carryfunc.
  simpl.
  unfold sumfunc.
  simpl.
  reflexivity.

  unfold carryfunc.
  simpl.
  unfold sumfunc.
  simpl.
  reflexivity.
Qed.

(* This lemma says that if you add a vector of ones to a value x using a ripple carry adder, while asserting the
 * carry in port, then the sum result will just be x. Of course this is because -1 + 1 = 0, though I didn’t prove
 * it that way.
 * A neat trick I found to use in this proof is to use the tactic ‘apply (f_equal snd)’ on one of the hypotheses
 * in order to isolate the sum component in the tuple produced by the ripple carry function rcai.
 *)

Lemma rcai_ones_cin_identity : forall (n : nat) (x : bv n),
  snd (rcai x ones true) = x.
Proof.
  intros n x.
  induction n.
  dependent destruction x.
  simpl.
  reflexivity.
  dependent destruction x.
  simpl ones.
  simpl rcai.

  (* case analysis *)
  destruct b.
  unfold carryfunc.
  unfold sumfunc.
  simpl.
  destruct (rcai x ones true) eqn: H.
  simpl.
  f_equal.
  apply (f_equal snd) in H. (* a neat trick *)
  simpl in H.
  rewrite IHn in H.
  auto.

  unfold carryfunc.
  unfold sumfunc.
  simpl.
  destruct (rcai x ones true) eqn: H.
  simpl.
  f_equal.
  apply (f_equal snd) in H.
  simpl in H.
  rewrite IHn in H.
  auto.
Qed.

(* 
 * This lemma is actually the main content of what we’re trying to prove, just not wrapped up in
 * very readable form yet.
 * Note the use of ‘rewrite <-‘ to use an existing lemma to rewrite a term from the RHS of the equality
 * in the lemma to the LHS. Without the ‘<-‘ it would do it the other way round.
 *)

Lemma main_helper : forall (n : nat) (x : bv (S n)),
  neg (x + ones) = neg x + bvone.
Proof.
  intros n x.
  induction n.
  dependent destruction x.
  destruct b.
  dependent destruction x.
  auto.
  dependent destruction x.
  auto. (* base case proved *)

  dependent destruction x.
  simpl.
  unfold bvone.
  unfold "+".
  simpl rcai.

  destruct b.
  unfold carryfunc.
  unfold sumfunc.
  simpl.

  rewrite rcai_zero.

  destruct (rcai x (consbv n true ones) true) eqn: H.
  simpl neg.
  simpl snd.
  f_equal.
  f_equal.

  apply (f_equal snd) in H.
  simpl snd in H.
  rewrite rcai_ones_cin_identity in H.
  auto.

  unfold carryfunc.
  unfold sumfunc.
  simpl.

  destruct (rcai (neg x) (consbv n false bvzero) true) eqn: H.
  apply (f_equal snd) in H.
  simpl snd in H.

  rewrite <- increment_with_carry in H.

  simpl snd.

  destruct (rcai x (consbv n true ones) false) eqn: H1.
  simpl snd.
  simpl neg.
  f_equal.

  apply (f_equal snd) in H1.
  simpl snd in H1.

  rewrite <- H1.
  rewrite <- H.

  apply IHn.
Qed.

Theorem main_theorem: forall (n : nat) (x : bv (S n)),
neg x + bvone = neg (x - bvone).
Proof.
  intros n x.
  unfold "-".
  rewrite allones.
  rewrite <- main_helper.
  reflexivity.
Qed.

(* 
 * References
 * [1] http://adam.chlipala.net/cpdt/html/MoreDep.html
 * [2] https://stackoverflow.com/questions/24720137/inversion-produces-unexpected-existt-in-coq
 *)

Some Lessons

So what have I learned from this experience, beyond a little bit of Coq? Firstly, it was fun. It was a nice way to spend a couple of days of my Easter holiday. I am not sure I would want to do it under time pressure, though, as it was also frustrating at times. If I ever wanted to use Coq in anger for my work, I would want to take a couple of months – or more – to really spend time with it.

On the positive side, Coq really forced me to think about foundations. What do I actually mean when I write \overline{x} + 1 = \overline{x - 1}? Should I be thinking in {\mathbb Z}, in {\mathbb Z}/n\mathbb{Z}, or in digits, and when? How should bitvector arithmetic behave on zero-sized bitvectors? (Oh, and I certainly did not expect to be digging out a proof of decidability of natural equality from Coq’s standard library to prove this theorem!) The negative side is the same: Coq really forced me to think about foundations. And I remain to be convinced that I want to do that when I’m not on Easter holiday and in a philosophical mood.

I loved the type system and the expression of theorems. I’m lukewarm about the proof process. At least the way I wrote the proofs – which was probably intolerably amateur – it felt like someone could come along and change the tactics at some point and my proof would be broken. Maybe this is not true, but this is what it felt like. This was a different feeling from the one I remember when playing with Agda four years ago, where everything needed to be explicit but somehow felt more nailed down and permanent. In Agda, the proofs are written in the same language as the types, and I enjoyed that, too. Both languages are based on dependent types, and so – as I understand it – is Lean. My colleague Kevin Buzzard is a strong advocate of Lean. Perhaps that’s one for another Easter holiday!

Thinking about this proof from a hardware perspective – designing efficient bit-parallel arithmetic hardware – it is clear that we do not need to have proved the theorem for all n. Each bit slice occupies silicon area, and as this is a finite resource, it would be sufficient to have one proof for each feasible value of n. Of course, this makes things much easier to prove, even if it comes with much more baggage. I can fire up an SMT solver and prove the theorem completely automatically for a specific value of n. As an example, if you paste the code below into the Z3 prover (hosted at rise4fun), the solver will report unsat, i.e. there is provably no satisfying value of the variable x violating the theorem for n = 4.

(declare-fun x () (_ BitVec 4))
; note bvnot (bitwise negation, matching our 'neg'), rather than bvneg (arithmetic negation)
(assert (not (= (bvadd (bvnot x) #x1) (bvnot (bvadd x #xF)))))
(check-sat)
(exit)

There are pluses and minuses to this. On the plus side, the SMT query is fast and automatic. On the minus side, in addition to only being valid for n = 4, it gives me – and perhaps some future AI – none of the intuition as to why this theorem holds. When I read mathematics, the proofs are not incidental, they are core to the understanding of what I’m reading.

Will this also be true for future AI-driven EDA tools?


Notes

In case this is useful to anyone (or to me in the future): I got syntax highlighting playing well for Coq with WordPress.com by using coqdoc to generate HTML and CSS, then hacking at the CSS so that it didn’t affect the rest of my WordPress theme, pasting it into the WordPress.com CSS customiser, and putting the generated HTML in a WordPress.com HTML block. Take care to avoid the CSS class .comment, used by coqdoc for code comments but also used by WordPress for blog post comment formatting!

Thanks again to Yann Herklotz for help understanding let bindings in Coq.