Industry Coheres around MX?

In recent years there has been a rediscovery (sometimes multiple times!) of block floating point, a number representation used for general-purpose compute in the early days of the electronic computer, and in specialist areas such as signal processing throughout the field’s history. The main reason for this modern rediscovery is the value of block floating point, and related ideas like block minifloat, in modern machine learning implementations.

Just like in the early days of floating point, a number of different approaches and implementations have flourished in recent years, so it was interesting to see the recent launch of the Open Compute Project OCP Microscaling Formats (MX) Specification, coauthored across Microsoft, AMD, Arm, Intel, Meta, NVIDIA and Qualcomm by a number of well-known figures (including Imperial’s very own Martin Langhammer). This document will be referred to as “the specification” in the rest of this blog post. At the same time, a paper was released on arXiv, supported by a GitHub repo, to show that when used for training deep neural networks (DNNs), MX-based implementations can produce high quality results.

In this blog post, I will briefly review a few of the features of the specification that I found particularly interesting as an “old hand” in fixed point and block floating point computer arithmetic, and raise some issues that the community may wish to take forward.

Firstly, I should say that it’s great to see efforts like this to raise the profile and awareness of block floating point and related number systems. My hope is that efforts like this will drive further work on the extremely important topic of efficient numerical hardware.

Storage

  1. Building block types. The focus of the specification is primarily one of storage, although certain computations are mentioned. The specification defines the construction of a type for MX computations from three parameters: a (scalar) type for scalings, a (scalar) type for elements, and a (natural) number of elements. A value inhabiting this type is a scalar scaling X together with k elements, P_i, for 1 \leq i \leq k. The numbers represented are (except for special values) v_i = X \cdot P_i; see the decoding sketch after this list. This is a very flexible starting point, which is to be welcomed, and there is much to explore under this general hood. In all the examples given, the scalar data types are either floating point or integer/fixed point, but there seems to be nothing in the specification specifically barring other representations, which could of course go far beyond traditional block floating point and more recent block exponent biases.
  2. Concrete types. In addition to specifying a generic type constructor, as above, the specification also defines certain concrete instantiations, always with a power-of-two scaling stored in a conventional 8-bit biased integer exponent, and with either floating point or integer elements. However, by my reading of the specification, an MX-compliant implementation does not seem to need to implement any of these concrete types.
  3. Subnormals mandated. If one of the specified concrete formats is implemented, subnormal representations must be supported.
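To make the storage picture concrete, here is a minimal decoding sketch of the generic construction: a block is a shared scale plus k elements, and each represented value is v_i = X \cdot P_i. The power-of-two scale held in an 8-bit biased exponent follows my reading of the concrete formats (I assume the usual bias of 127), the element decoder is left abstract, and the toy integer element type at the end is my own choice for illustration, not one of the specification's exact formats.

```python
# Minimal sketch of decoding an MX block: shared scale X times each element P_i.
# Field names and the example element type are mine, not the specification's.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MXBlock:
    scale_bits: int                            # 8-bit biased exponent; X = 2**(scale_bits - 127) assumed
    element_bits: List[int]                    # k raw element encodings P_i
    decode_element: Callable[[int], float]     # element-type-specific decoder, left abstract here

    def values(self) -> List[float]:
        X = 2.0 ** (self.scale_bits - 127)     # power-of-two scaling, bias 127 assumed
        return [X * self.decode_element(p) for p in self.element_bits]

# Toy 8-bit two's-complement integer element type, purely for illustration.
def int8_element(p: int) -> float:
    return p - 256 if p >= 128 else p

blk = MXBlock(scale_bits=121, element_bits=[3, 250, 17, 64], decode_element=int8_element)
print(blk.values())    # each value is 2**(121-127) = 1/64 times the decoded element
```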

Computation

1. Dot Products

Apart from conversion into MX, the only MX operation discussed in the specification is the dot product of two MX vectors with an equal number of elements. This appears to leave some basic operations undefined, for example simple addition of two MX vectors. (It is, of course, possible to provide a reasonable semantics for this operation consistent with the scalar base types, but it is not in the specification.)

The definition given of the “Dot” operation is interesting. The specification says “The following semantics must be minimally supported”, and gives the following equation:

C = \text{Dot}(A,B) = X^{(A)}X^{(B)}\sum_{i=1}^k P_i^{(A)} \times P_i^{(B)}

where P_i^{(A)} denotes the ith element of vector A, X^{(A)} denotes the scaling of A, and so on.

To my mind, the problem with this definition is that \times (on elements), the (elided) product on scalings, and the sum over element products are all undefined. The intention, of course, is that we want the dot product to approximate a real dot product, in the sense that \llbracket C \rrbracket \approx \llbracket X^{(A)} \rrbracket \llbracket X^{(B)} \rrbracket \sum_{i=1}^k \llbracket P_i^{(A)} \rrbracket \times \llbracket P_i^{(B)} \rrbracket, where the semantic brackets correspond to interpretation in the base scalar types and the arithmetic operations are over the reals. But the nature of the approximation is left unstated. Note, in particular, that it is not a form of round-to-nearest defined on MX, as hinted at by the statement “The internal precision of the dot product and order of operations is implementation-defined.” This at least suggests an assumed implementation of the multi-operand summation via two-input additions executed in some (identical?) internal precision in some defined order, though this is a very specific – and rather restrictive – way of performing multi-operand addition in hardware. There’s nothing wrong with leaving this undefined, of course, but an MX-compliant dot product would need to come with a very clear statement of its approximation properties – perhaps something that can be considered for the standard in due course.
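To make the discussion concrete, here is a reference-semantics sketch of “Dot” as I read it: the scalings multiply and the element products are summed, with the internal precision and rounding being exactly what the specification leaves open. I assume everything has already been decoded to real (Python float) values; the function name and signature are mine, not the specification's.

```python
def mx_dot(X_A: float, P_A: list, X_B: float, P_B: list) -> float:
    """Reference semantics for Dot, with an implementation-chosen accumulation."""
    assert len(P_A) == len(P_B), "Dot is only defined for vectors of equal k"
    acc = 0.0                        # internal precision: a choice, not mandated
    for pa, pb in zip(P_A, P_B):     # order of operations: also a choice
        acc += pa * pb
    return X_A * X_B * acc
```

A compliant hardware implementation might instead accumulate in some narrower internal format and in a different order; as far as I can tell, nothing in the specification distinguishes between the two.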

2. General Dot Products

The specification also defines an operation called “DotGeneral” which is a dot product of two vectors made up of MX-vector components. The specification defines it in the following way:

C = \text{DotGeneral}(A,B) = \sum_{j=1}^n \text{Dot}(A_j, B_j)

“where A_j and B_j are the jth MX-compliant subvector of A and B.” As in the “Dot” definition, the summation notation is not defined in the specification, and this definition does not carry the same wording as “Dot” on internal precision and order of operations, though I am assuming it is intended to. It is also left undefined how to form the component subvectors. I am assuming that the authors envisaged the vector A_j consisting of the jth block of k contiguous elements in the vector A, but of course there are many other ways this could be done (e.g. strided access); the sketch below follows my contiguous-block reading.
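Here is a corresponding sketch of “DotGeneral” under that contiguous-block reading, reusing the mx_dot sketch above; the convention for forming subvectors is my assumption, not something the specification pins down.

```python
def mx_dot_general(scales_A, elems_A, scales_B, elems_B, k):
    """DotGeneral under a contiguous-block reading: block j holds elements j*k .. (j+1)*k - 1."""
    assert len(elems_A) == len(elems_B) == k * len(scales_A) == k * len(scales_B)
    total = 0.0                                    # again, internal precision is a choice
    for j, (X_A, X_B) in enumerate(zip(scales_A, scales_B)):
        block = slice(j * k, (j + 1) * k)
        total += mx_dot(X_A, elems_A[block], X_B, elems_B[block])
    return total
```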

I would be interested to hear from any blog readers who think I have misinterpreted the standard, or who think other components of the specification – not discussed here – are more interesting. I will be watching for uptake of this specification: please do alert me to interesting uses.

Supercharging Formal with E-Graphs

Designing circuits is not easy. Mistakes can be made, and mistakes, once made, are not easily fixed. Even leaving aside safety-critical applications, product recalls can cost upwards of hundreds of millions of US dollars, and nobody wants to be responsible for that. As a result, the industry invests a huge effort in formally verifying hardware designs, that is, in coming up with computer-generated or computer-assisted proofs of hardware correctness.

My Intel-funded PhD student Sam Coward (jointly advised by Theo Drane) is about to head off to FMCAD 2023 in Ames, Iowa, to present a contribution to making this proof generation process speedier and more automated, while still relying on industry-standard, tried-and-trusted formal verification tools. Together with his Intel colleague Emiliano Morini and then-intern (and Imperial student) Bryan Tan, Sam noticed that the e-graph techniques we have been using in his PhD for optimising datapath [1, 2] also have a natural application to verifying datapath designs, akin to – but distinct from – their long-time use as a data structure in SMT solvers.

The basic idea is straightforward. Ever since our earliest ARITH paper, we’ve been developing a toolbox of Intel-inspired rewrite rules that can be applied to optimise designs across parametric bitwidths. Our optimisation flow, called ROVER, applies these to a single expression we wish to optimise, and then we extract a minimal-cost implementation. Our new paper asks a different question: what if we start not from a single expression we wish to optimise, but from two expressions we’re trying to prove equivalent?

Following the ROVER approach I blogged about here, if the expressions are not equivalent then they should never end up in the same equivalence class, no matter how many rewrite iterations are applied. If they are equivalent then they may or may not end up in the same equivalence class: we provide no guarantees of completeness. So let’s think about what happens in each of these cases.
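Before walking through the two cases, here is a toy sketch of this sound-but-incomplete behaviour. It works on an explicit set of rewritten forms rather than a genuine e-graph (no sharing or congruence closure), and the rules and expressions are my own illustrative examples, not ROVER's rewrite set: with sound rules, finding a common form proves equivalence, while failing to find one proves nothing.

```python
# Toy illustration only: explicit rewrite closure instead of a real e-graph.
# Expressions are nested tuples, e.g. ('*', 'a', ('+', 'b', 'c')) for a*(b+c).

def top_rewrites(e):
    """Expressions equal to e via one sound rule applied at the root."""
    out = []
    if isinstance(e, tuple):
        op, a, b = e
        if op in ('+', '*'):
            out.append((op, b, a))                              # commutativity
        if op == '*' and isinstance(b, tuple) and b[0] == '+':
            out.append(('+', ('*', a, b[1]), ('*', a, b[2])))   # distribute
        if (op == '+' and isinstance(a, tuple) and isinstance(b, tuple)
                and a[0] == b[0] == '*' and a[1] == b[1]):
            out.append(('*', a[1], ('+', a[2], b[2])))          # factor
    return out

def all_rewrites(e):
    """One-step rewrites of e, applied at the root or inside any subterm."""
    yield from top_rewrites(e)
    if isinstance(e, tuple):
        op, a, b = e
        for a2 in all_rewrites(a):
            yield (op, a2, b)
        for b2 in all_rewrites(b):
            yield (op, a, b2)

def reachable(e, steps=4):
    """All forms reachable from e within a bounded number of rewrite steps."""
    seen, frontier = {e}, {e}
    for _ in range(steps):
        frontier = {r for f in frontier for r in all_rewrites(f)} - seen
        seen |= frontier
    return seen

spec = ('*', 'a', ('+', 'b', 'c'))                       # a*(b+c)
impl = ('+', ('*', 'a', 'c'), ('*', 'a', 'b'))           # a*c + a*b
# A shared form proves equivalence; no shared form proves nothing.
print(bool(reachable(spec) & reachable(impl)))           # True for this pair
```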

If they end up in the same equivalence class, then our tool thinks it has proven the two expressions to be equivalent. But no industrial design team will simply take our tool’s word for it – and rightly so! – tool writers make mistakes too. However, work by Sam and other colleagues from the Universities of Washington and Utah has provided a methodology by which we can use our tool’s reasoning to export hints, in the form of intermediate proof steps, to a trusted industrial formal verification tool. And so this is how we proceed: our tool generates many, sometimes hundreds, of hints which help state-of-the-art industrial formal verification tools to find proofs more quickly – sometimes much more quickly – than they could beforehand.
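To make the “hints” idea concrete in the same toy setting, one can record how each rewritten form was reached and export the chain of intermediate expressions connecting the two sides; each intermediate step is the kind of lemma a trusted equivalence checker could discharge. The hint format here is invented purely for illustration and is not the format used by the actual tool flow.

```python
# Continuing the toy sketch above (reuses all_rewrites, spec and impl).

def reachable_with_parents(e, steps=4):
    """Map each reachable form to the form it was rewritten from."""
    parent, frontier = {e: None}, {e}
    for _ in range(steps):
        new = {}
        for f in frontier:
            for r in all_rewrites(f):
                if r not in parent and r not in new:
                    new[r] = f
        parent.update(new)
        frontier = set(new)
    return parent

def hint_chain(start, target, steps=4):
    """Intermediate forms from start to target, if the rewrites connect them."""
    parent = reachable_with_parents(start, steps)
    if target not in parent:
        return None
    chain = [target]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return list(reversed(chain))

for step in hint_chain(spec, impl) or []:
    print(step)     # each line is a candidate lemma for the trusted tool
```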

So what if the two expressions don’t end up in the same equivalence class, despite actually being equivalent? No problem! We can still find two expressions whose proof of equivalence, via some external tool, makes the overall proof we’re aiming for much easier… crunching the proof down to its hard essence, stripping away what we can. The main technical idea here is how to extract these expressions from the e-graph. Rather than aiming for the best possible implementation cost, as we did with ROVER, we aim to minimise the distance between the two implementations that are left for the industrial tool to verify. The overall process is shown in the figure below, where “EC” refers to an equivalence check conducted by the industrially trusted tool and S^* and I^* are the extracted “close” expressions.
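Continuing the same toy setting: when the two sides never merge (here, because my example rule set deliberately lacks associativity), we can still pick the pair of rewritten forms, one from each side, that are closest to each other, and leave only that residual check to the trusted verifier. The crude size-based distance below is my own stand-in for the extraction metric in the actual work.

```python
# Continuing the toy sketch (reuses reachable from above).

def size(e):
    return 1 if not isinstance(e, tuple) else 1 + size(e[1]) + size(e[2])

def distance(x, y):
    """Structural distance: identical subterms are free, mismatches cost their size."""
    if x == y:
        return 0
    if isinstance(x, tuple) and isinstance(y, tuple) and x[0] == y[0]:
        return distance(x[1], y[1]) + distance(x[2], y[2])
    return size(x) + size(y)

def closest_pair(spec, impl, steps=3):
    """Pick the closest pair of rewritten forms to hand to the trusted checker."""
    forms_s, forms_i = reachable(spec, steps), reachable(impl, steps)
    return min(((s, i) for s in forms_s for i in forms_i),
               key=lambda pair: distance(*pair))

spec2 = ('*', 'a', ('+', 'b', ('+', 'c', 'd')))                          # a*(b+(c+d))
impl2 = ('+', ('+', ('*', 'a', 'b'), ('*', 'a', 'c')), ('*', 'a', 'd'))  # (a*b+a*c)+a*d
s_star, i_star = closest_pair(spec2, impl2)   # the S* and I* handed to the EC step
```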

We show one example where, without the hints provided by our tool, the industrial formal verifier does not complete within a 24-hour timeout. Even when results are achievable with the industrial formal verifier alone, we still see up to a 6x speedup in proof convergence.

This work forms a really nice “power use” of e-graphs. If you are at FMCAD 2023, come along and hear Sam talk about it in person!