Notes on Computational Learning Theory

This blog collects some of my notes on classical computational learning theory, based on my reading of Kearns and Vazirani. The results are (almost) all from their book, the sloganising (and mistakes, no doubt) are mine.

The Probably Approximately Correct (PAC) Framework

Definition (Instance Space). An instance space is a set, typically denoted X. It is the set of objects we are trying to learn about.

Definition (Concept). A concept c over X is a subset of the instance space X.

Although not covered in Kearns and Vazirani, in general it is possible to generalise beyond Boolean membership to some degree of uncertainty or fuzziness – I hope to cover this in a future blog post.

Definition (Concept Class). A concept class {\mathcal C} is a set of concepts, i.e. {\mathcal C} \subseteq \mathcal{P}(X), where \mathcal P denotes the power set. We will follow Kearns and Vazirani and also use c to denote the corresponding indicator function c : X \to \{0,1\}.

In PAC learning, we assume {\mathcal C} is known, but the target concept c \in {\mathcal C} is not. However, it doesn’t seem a big jump to allow for an unknown concept class too, in an appropriate approximation setting – I would welcome comments on established frameworks for this.

Definition (Target Distribution). A target distribution {\mathcal D} is a probability distribution over X.

In PAC learning, we assume {\mathcal D} is unknown.

Definition (Oracle). An oracle is a function EX(c,{\mathcal D}) taking a concept and a distribution, and returning a labelled example (x, c(x)) where x is drawn randomly and independently from {\mathcal D}.

Definition (Error). The error of a hypothesis concept h \in {\mathcal C} with reference to a target concept c \in {\mathcal C} and target distribution {\mathcal D} is \text{error}(h) = Pr_{x \in {\mathcal D}}\left\{ c(x) \neq h(x) \right\}, where Pr denotes probability.

Definition (Representation Scheme). A representation scheme for a concept class {\mathcal C} is a function {\mathcal R} : \Sigma^* \to {\mathcal C} where \Sigma is a finite alphabet of symbols (or – following the Real RAM model – a finite alphabet augmented with real numbers).

Definition (Representation Class). A representation class is a concept class together with a fixed representation scheme for that class.

Definition (Size). We associate a size \text{size}(\sigma) with each string \sigma \in \Sigma^* over a representation alphabet. We similarly associate a size with each concept c via the size of its minimal representation: \text{size}(c) = \min_{{\mathcal R}(\sigma) = c} \text{size}(\sigma).

Definition (PAC Learnable). Let {\mathcal C} and {\mathcal H} be representation classes over X, where {\mathcal C} \subseteq {\mathcal H}. We say that concept class {\mathcal C} is PAC learnable using hypothesis class {\mathcal H} if there exists an algorithm that, given access to an oracle, when learning any target concept c \in {\mathcal C} over any distribution {\mathcal D} on X, and for any given 0 < \epsilon < 1/2 and 0 < \delta < 1/2, with probability at least 1-\delta, outputs a hypothesis h \in {\mathcal H} with \text{error}(h) \leq \epsilon.

Definition (Efficiently PAC Learnable). Let X_n = \{0,1\}^n or X_n = {\mathbb R}^n, and let {\mathcal C}_n and {\mathcal H}_n be representation classes over X_n, where {\mathcal C}_n \subseteq {\mathcal H}_n for all n. Let X = \cup_{n \geq 1} X_n, {\mathcal C} = \cup_{n \geq 1} {\mathcal C_n}, and {\mathcal H} = \cup_{n \geq 1} {\mathcal H_n}. We say that concept class {\mathcal C} is efficiently PAC learnable using hypothesis class {\mathcal H} if there exists an algorithm that, given access to a constant time oracle, when learning any target concept c \in {\mathcal C}_n over any distribution {\mathcal D} on X_n, and for any given 0 < \epsilon < 1/2 and 0 < \delta < 1/2:

  • Runs in time polynomial in n, \text{size}(c), 1/\epsilon, and 1/\delta, and
  • With probability at least 1-\delta, outputs a hypothesis h \in {\mathcal H} with \text{error}(h) \leq \epsilon.

There is much of interest to unpick in these definitions. Firstly, notice that we have defined a family of classes parameterised by dimension n, allowing us to talk in terms of asymptotic behaviour as dimensionality increases. Secondly, note the key parameters of PAC learnability: \delta (the ‘probably’ bit) and \epsilon (the ‘approximately’ bit). The first of these captures the idea that we may get really unlucky with our calls to the oracle and receive misleading training data. The second captures the idea that we are not aiming for certainty in our final classification accuracy: some pre-defined tolerance is allowable. Thirdly, note the requirements of efficiency: polynomial scaling in dimension, in the size of the concept (complex concepts can be harder to learn), in inverse error tolerance (the sloppier, the easier), and in inverse probability of the algorithm failing to find a suitable hypothesis (you need to pay for more certainty). Finally, and most intricately, notice the separation of concept class from hypothesis class. We require the hypothesis class to be at least as general, so the concept we’re trying to learn is actually one of the returnable hypotheses, but it can be strictly more general. This is to avoid the case where a restricted hypothesis class is harder to learn; Kearns and Vazirani, following Pitt and Valiant, give the example that learning the concept class 3-term DNF using the hypothesis class 3-term DNF is intractable, yet learning the same concept class with the more general hypothesis class 3-CNF is efficient.
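To make these definitions concrete, here is a minimal sketch (in Python – my choice, nothing in Kearns and Vazirani mandates it) of a PAC learner for the concept class of closed intervals on the real line: the learner outputs the tightest interval around the positive examples it sees. The target interval [0.3, 0.7], the uniform target distribution, and the sample sizes are all illustrative assumptions of mine.

```python
import random

random.seed(0)  # for reproducibility of this illustration

def make_oracle(c, dist):
    """EX(c, D): draw x independently from D, return the labelled example (x, c(x))."""
    def ex():
        x = dist()
        return x, c(x)
    return ex

def learn_interval(ex, m):
    """A consistent learner for interval concepts: output the tightest
    interval containing all positive examples in the sample."""
    pos = [x for x, label in (ex() for _ in range(m)) if label]
    if not pos:
        return lambda x: False  # no positives seen: the empty hypothesis
    lo, hi = min(pos), max(pos)
    return lambda x: lo <= x <= hi

# Illustrative target concept [0.3, 0.7] and uniform target distribution on [0, 1].
c = lambda x: 0.3 <= x <= 0.7
ex = make_oracle(c, random.random)
h = learn_interval(ex, 1000)

# Empirically estimate error(h) = Pr_{x ~ D}[c(x) != h(x)].
test_points = [random.random() for _ in range(100_000)]
err = sum(c(x) != h(x) for x in test_points) / len(test_points)
print(f"estimated error: {err:.4f}")
```

The hypothesis can only under-cover the target near its endpoints, and the probability mass of those gaps shrinks as the sample grows – a concrete instance of the ‘approximately’ and ‘probably’ parameters at work.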

Occam’s Razor

Definition (Occam Algorithm). Let \alpha \geq 0 and 0 \leq \beta < 1 be real constants. An algorithm is an (\alpha,\beta)-Occam algorithm for {\mathcal C} using {\mathcal H} if, on an input sample S of cardinality m labelled by membership in c \in {\mathcal C}_n, the algorithm outputs a hypothesis h \in {\mathcal H} such that:

  • h is consistent with S, i.e. there is no misclassification on S
  • \text{size}(h) \leq \left(n \cdot \text{size}(c)\right)^\alpha m^\beta

Thus Occam algorithms produce succinct hypotheses consistent with data. Note that the size of the hypothesis is allowed to grow only mildly – if at all – with the size of the dataset (via \beta). Note, however, that there is nothing in this definition that suggests predictive power on unseen samples.

Definition (Efficient Occam Algorithm). An (\alpha,\beta)-Occam algorithm is efficient iff its running time is polynomial in n, m, and \text{size}(c).

Theorem (Occam’s Razor). Let A be an efficient (\alpha,\beta)-Occam algorithm for {\mathcal C} using {\mathcal H}. Let {\mathcal D} be the target distribution over X, let c \in {\mathcal C}_n be the target concept, 0 < \epsilon, \delta \leq 1. Then there is a constant a > 0 such that if A is given as input a random sample S of m examples drawn from oracle EX(c,{\mathcal D}), where m satisfies m \geq a \left( \frac{1}{\epsilon} \log \frac{1}{\delta} + \left(\frac{\left( n \cdot \text{size}(c) \right)^\alpha}{\epsilon}\right)^\frac{1}{1-\beta}\right), then A runs in time polynomial in n, \text{size}(c), 1/\epsilon and \frac{1}{\delta} and, with probability at least 1 - \delta, the output h of A satisfies \text{error}(h) \leq \epsilon.

This is a technically dense presentation, but it’s a philosophically beautiful result. Let’s unpick it a bit, so its essence is not obscured by notation. In summary, simple rules that are consistent with prior observations have predictive power! The ‘simple’ part here comes from (\alpha,\beta), and the predictive power comes from the bound on \text{error}(h). Of course, one needs sufficient observations (the complex lower bound on m) for this to hold. Notice that as \beta approaches 1 – so that, by the definition of an Occam algorithm, the hypothesis is allowed to grow almost as fast as the training set, i.e. we get close to being able to memorise our entire training set – we need an arbitrarily large training set (memorisation doesn’t generalise).
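The blow-up as \beta \to 1 is easy to see numerically. A quick sketch, with the caveat that the constant a is left unspecified by the theorem – I set a = 1 purely for illustration, as are the other parameter values:

```python
import math

def occam_sample_bound(eps, delta, n, size_c, alpha, beta, a=1.0):
    """Sample size sufficient for an (alpha, beta)-Occam algorithm to be a
    PAC learner. The constant a is unspecified by the theorem; a = 1 here
    is purely illustrative."""
    term1 = (1 / eps) * math.log(1 / delta)
    term2 = ((n * size_c) ** alpha / eps) ** (1 / (1 - beta))
    return a * (term1 + term2)

# As beta -> 1 the exponent 1/(1 - beta) blows up: memorisation doesn't generalise.
for beta in (0.0, 0.5, 0.9, 0.99):
    m = occam_sample_bound(eps=0.1, delta=0.05, n=10, size_c=5, alpha=1, beta=beta)
    print(f"beta = {beta}: m >= {m:.3g}")
```

Even these toy numbers show the bound exploding from hundreds of samples at \beta = 0 to astronomically many as \beta nears 1.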

Vapnik-Chervonenkis (VC) Dimension

Definition (Behaviours). The set of behaviours on S = \{x_1, \ldots, x_m\} that are realised by {\mathcal C}, is defined by \Pi_{\mathcal C}(S) = \left\{ \left(c(x_1), \ldots, c(x_m)\right) | c \in {\mathcal C} \right\}.

Each of the points in S is either included in a given concept or not. Each tuple \left(c(x_1), \ldots, c(x_m)\right) then forms a kind of fingerprint of S according to a particular concept. The set of behaviours is the set of all such fingerprints across the whole concept class.

Definition (Shattered). A set S is shattered by {\mathcal C} iff \Pi_{\mathcal C}(S) = \{0,1\}^{|S|}.

Note that \{0,1\}^{|S|} is the maximum cardinality that’s possible, i.e. the set of behaviours is the set of all possible behaviours. So we can think of a set as being shattered by a concept class iff every possible pattern of inclusion/exclusion of its points is realised by at least one concept in the class.

Definition (Vapnik-Chervonenkis Dimension). The VC dimension of {\mathcal C}, denoted VCD({\mathcal C}), is the cardinality of the largest set shattered by {\mathcal C}. If arbitrarily large finite sets can be shattered by {\mathcal C}, then VCD({\mathcal C}) = \infty.

VC dimension in this sense captures the ability of {\mathcal C} to discern between samples.
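For finite toy examples, behaviours, shattering and VC dimension can all be checked by brute force. A sketch (exponential in |X|, so toy-sized only; the threshold and interval classes are my illustrative choices, not examples from the book):

```python
from itertools import combinations

def behaviours(concepts, S):
    """Pi_C(S): the set of labellings of S realised by the concept class."""
    return {tuple(c(x) for x in S) for c in concepts}

def is_shattered(concepts, S):
    return len(behaviours(concepts, S)) == 2 ** len(S)

def vc_dimension(concepts, X):
    """Cardinality of the largest shattered subset of X (exhaustive search)."""
    return max((k for k in range(1, len(X) + 1)
                if any(is_shattered(concepts, S) for S in combinations(X, k))),
               default=0)

X = [1, 2, 3, 4, 5]
thresholds = [lambda x, t=t: int(x >= t) for t in range(7)]
intervals = [lambda x, i=i, j=j: int(i <= x <= j) for i in range(7) for j in range(7)]
print(vc_dimension(thresholds, X))  # 1: no pair can be labelled (1, 0)
print(vc_dimension(intervals, X))   # 2: no triple can be labelled (1, 0, 1)
```

The comments hint at the obstruction in each case: a threshold concept is monotone in x, and an interval cannot exclude a point lying between two included points.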

Theorem (PAC-learning in Low VC Dimension). Let {\mathcal C} be any concept class. Let {\mathcal H} be any representation class of VC dimension d. Let A be any algorithm taking a set of m labelled examples of a concept c \in {\mathcal C} and producing a concept in {\mathcal H} that is consistent with the examples. Then there exists a constant c_0 such that A is a PAC learning algorithm for {\mathcal C} using {\mathcal H} when it is given examples from EX(c,{\mathcal D}), and when m \geq c_0 \left( \frac{1}{\epsilon} \log \frac{1}{\delta} + \frac{d}{\epsilon} \log \frac{1}{\epsilon} \right).

Let’s take a look at the similarity between this theorem and Occam’s razor, presented in the last section of this blog post. Both bounds have a similar feel, but the VCD-based bound does not depend on \text{size}(c); indeed it’s possible for the hypothesis class to be infinite – with hypotheses of unbounded size – and yet for the VCD to be finite.

As the theorem below shows, the linear dependence on VCD achieved in the above theorem is actually the best one can do.

Theorem (PAC-learning Minimum Samples). Any algorithm for PAC-learning a concept class of VC dimension d must use \Omega(d/\epsilon) examples in the worst case.

Definition (Layered DAG). A layered DAG is a DAG in which each vertex is associated with a layer \ell \in {\mathbb N} and in which the edges are always from some layer \ell to the next layer \ell+1. Vertices at layer 0 have indegree 0 and are referred to as input nodes. Vertices at other layers are referred to as internal nodes. There is a single output node of outdegree 0.

Definition (G-composition). For a layered DAG G and a concept class {\mathcal C}, the G-composition of {\mathcal C} is the class of all concepts that can be obtained by: (i) associating a concept c_i \in {\mathcal C} with each internal node N_i in G, (ii) applying the concept at each internal node to the outputs of its predecessor nodes.

Notice that this way we can think of the internal nodes as forming a Boolean circuit with a single output; the G-composition is the concept class we obtain by restricting concepts to only those computable with the structure G. This is a very natural way of composing concepts – so what kind of VCD arises through this composition? This theorem provides an answer:

Theorem (VCD Compositional Bound). Let G be a layered DAG with n input nodes and s \geq 2 internal nodes, each of indegree r. Let {\mathcal C} be a concept class over {\mathbb R}^r of VC dimension d, and let {\mathcal C}_G be the G-composition of {\mathcal C}. Then VCD({\mathcal C}_G) \leq 2ds \log(es).

Weak PAC Learnability

Definition (Weak PAC Learning). Let {\mathcal C} be a concept class and let A be an algorithm that is given access to EX(c,{\mathcal D}) for target concept c \in {\mathcal C}_n and distribution {\mathcal D}. A is a weak PAC learning algorithm for {\mathcal C} using {\mathcal H} if there exist polynomials p(\cdot,\cdot) and q(\cdot,\cdot) such that A outputs a hypothesis h \in {\mathcal H} that with probability at least 1/q(n,\text{size}(c)) satisfies \text{error}(h) \leq 1/2 - 1/p(n,\text{size}(c)).

Kearns and Vazirani justifiably describe weak PAC learning as “the weakest demand we could place on an algorithm in the PAC setting without trivialising the problem”: were these exponential rather than polynomial functions in n, the problem would be trivial – take a fixed-size random sample of the concept, memorise it, and guess randomly with probability 50% outside the memorised sample. The remarkable result is that efficient weak PAC learnability and efficient PAC learnability coincide for an appropriate PAC hypothesis class, based on ternary majority trees.

Definition (Ternary Majority Tree). A ternary majority tree with leaves from {\mathcal H} is a tree where each non-leaf node computes a majority (voting) function of its three children, and each leaf is labelled with a hypothesis from {\mathcal H}.

Theorem (Weak PAC learnability is PAC learnability). Let {\mathcal C} be any concept class and {\mathcal H} any hypothesis class. Then if {\mathcal C} is efficiently weakly PAC learnable using {\mathcal H}, it follows that {\mathcal C} is efficiently PAC learnable using a hypothesis class of ternary majority trees with leaves from {\mathcal H}.

Kearns and Vazirani provide an algorithm to learn this way. The details are described in their book, but the basic principle is based on “boosting”, as developed in the lemma to follow.

Definition (Filtered Distributions). Given a distribution {\mathcal D} and a hypothesis h_1 we define {\mathcal D_2} to be the distribution obtained by flipping a fair coin and, on a heads, drawing from EX(c,{\mathcal D}) until h_1 agrees with the label; on a tails, drawing from EX(c,{\mathcal D}) until h_1 disagrees with the label. Invoking a weak learning algorithm on data from this new distribution yields a new hypothesis h_2. Similarly, we define {\mathcal D_3} to be the distribution obtained by drawing examples from EX(c,{\mathcal D}) until we find an example on which h_1 and h_2 disagree.

What’s going on in these constructions is quite clever: h_2 has been constructed so that it must contain new information about c, compared to h_1; h_1 has, by construction, no advantage over a coin flip on {\mathcal D}_2. Similarly, h_3 contains new information about c not already contained in h_1 and h_2, namely on the points where they disagree. Thus, one would expect that hypotheses that work in these three cases could be combined to give us a better overall hypothesis. This is indeed the case, as the following lemma shows.

Lemma (Boosting). Let g(\beta) = 3 \beta^2 - 2 \beta^3. Let the distributions {\mathcal D}, {\mathcal D}_2, {\mathcal D}_3 be defined above, and let h_1, h_2 and h_3 satisfy \text{error}_{\mathcal D}(h_1) \leq \beta, \text{error}_{{\mathcal D}_2}(h_2) \leq \beta, \text{error}_{{\mathcal D}_3}(h_3) \leq \beta. Then if h = \text{majority}(h_1, h_2, h_3), it follows that \text{error}_{\mathcal D}(h) \leq g(\beta).

The function g satisfies g(\beta) < \beta for all \beta \in (0,1/2). Hence by combining three hypotheses with only marginally better accuracy than flipping a coin, the boosting lemma tells us that we can obtain a strictly stronger hypothesis. The algorithm for (strong) PAC learnability therefore involves recursively calling this boosting procedure, leading to the majority-tree-based hypothesis class. Of course, one needs to show that the depth of the recursion is not too large and that we can sample from the filtered distributions with not too many calls to the overall oracle EX(c,{\mathcal D}), so that the polynomial complexity bound in the PAC definition is maintained. Kearns and Vazirani include these two results in the book.
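The rate of improvement is easy to see numerically. A quick sketch – the starting error 0.49 (a barely-better-than-coin-flip weak learner) and the target error 0.01 are illustrative choices of mine:

```python
def g(beta):
    """Error bound after one round of three-hypothesis majority boosting."""
    return 3 * beta**2 - 2 * beta**3

assert g(0.3) < 0.3  # g(beta) < beta on (0, 1/2): each round strictly helps

beta, rounds = 0.49, 0  # start barely better than a coin flip
while beta > 0.01:
    beta = g(beta)
    rounds += 1
print(rounds, beta)  # only a modest number of recursion levels is needed
```

Near \beta = 1/2 the advantage \gamma = 1/2 - \beta grows by roughly a factor of 3/2 per round, and once the error is small the convergence becomes quadratic – which is why the recursion depth stays manageable.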

Learning from Noisy Data

Up until this point, we have only dealt with correctly classified training data. The introduction of a noisy oracle allows us to move beyond this limitation.

Definition (Noisy Oracle). A noisy oracle \hat{EX}^\eta( c, {\mathcal D}) extends the earlier idea of an oracle with an additional noise parameter 0 \leq \eta < 1/2. This oracle behaves identically to EX except that it returns the wrong classification with probability \eta.

Definition (PAC Learnable from Noisy Data). Let {\mathcal C} be a concept class and let {\mathcal H} be a representation class over X. Then {\mathcal C} is PAC learnable from noisy data using {\mathcal H} if there exists an algorithm such that: for any concept c \in {\mathcal C}, any distribution {\mathcal D} on X, any 0 \leq \eta < 1/2, and any 0 < \epsilon < 1, 0 < \delta < 1 and \eta_0 with \eta \leq \eta_0 < 1/2, given access to a noisy oracle \hat{EX}^\eta( c, {\mathcal D}) and inputs \epsilon, \delta, \eta_0, with probability at least 1 - \delta the algorithm outputs a hypothesis concept h \in {\mathcal H} with \text{error}(h) \leq \epsilon. If the runtime of the algorithm is polynomial in n, 1/\epsilon, 1/\delta and 1/(1 - 2\eta_0) then {\mathcal C} is efficiently learnable from noisy data using {\mathcal H}.

Let’s unpick this definition a bit. The main difference from the PAC definition is simply the addition of noise via the oracle and an additional parameter \eta_0 which bounds the error of the oracle; thus the algorithm is allowed to know in advance an upper bound on the noisiness of the data, and an efficient algorithm is allowed to take more time on more noisy data.

Kearns and Vazirani address PAC learnability from noisy data in an indirect way, via the use of a slightly different framework, introduced below.

Definition (Statistical Oracle). A statistical oracle STAT(c, {\mathcal D}) takes queries of the form (\chi, \tau) where \chi : X \times \{0,1\} \to \{0,1\} and 0 < \tau \leq 1, and returns a value \hat{P}_\chi satisfying P_\chi - \tau \leq \hat{P}_\chi \leq P_\chi + \tau where P_\chi = Pr_{x \in {\mathcal D}}[ \chi(x, c(x)) = 1 ].

Definition (Learnable from Statistical Queries). Let {\mathcal C} be a concept class and let {\mathcal H} be a representation class over X. Then {\mathcal C} is efficiently learnable from statistical queries using {\mathcal H} if there exists a learning algorithm A and polynomials p(\cdot, \cdot, \cdot), q(\cdot, \cdot, \cdot) and r(\cdot,\cdot,\cdot) such that: for any c \in {\mathcal C}, any distribution {\mathcal D} over X and any 0 < \epsilon < 1/2, if given access to STAT(c,{\mathcal D}), the following hold. (i) For every query (\chi,\tau) made by A, the predicate \chi can be evaluated in time q(1/\epsilon, n, \text{size}(c)), and 1/\tau is bounded by r(1/\epsilon, n, \text{size}(c)), (ii) A has execution time bounded by p(1/\epsilon, n, \text{size}(c)), (iii) A outputs a hypothesis h \in {\mathcal H} that satisfies \text{error}(h) \leq \epsilon.

So a statistical oracle can be asked about a whole predicate \chi, for any given tolerance \tau. The oracle must return an estimate of the probability that this predicate holds (where the probability is over the distribution over X). It is, perhaps, not entirely obvious how to relate this back to the more obvious noisy oracle used above. However, it is worth noting that one can construct a statistical oracle that works with high probability by taking enough samples from a standard oracle, and then returning the relative frequency of \chi evaluating to 1 on that sample. Kearns and Vazirani provide an intricate construction to efficiently sample from a noisy oracle to produce a statistical oracle with high probability. In essence, this then allows an algorithm that can learn from statistical queries to be used to learn from noisy data, resulting in the following theorem.
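The noise-free half of that remark is straightforward to sketch. Below, a statistical query is answered by relative frequency over samples from a standard (noise-free) oracle, with the sample size chosen via Hoeffding’s inequality; the target concept, distribution and queries are all illustrative assumptions of mine, and Kearns and Vazirani’s more intricate construction from a *noisy* oracle is not reproduced here:

```python
import math
import random

random.seed(0)  # for reproducibility of this illustration

def simulate_stat(ex, chi, tau, delta=0.01):
    """Answer a statistical query (chi, tau) by relative frequency over samples
    from a noise-free oracle EX(c, D). By Hoeffding's inequality,
    m = ceil(ln(2/delta) / (2 tau^2)) samples keep the empirical estimate
    within tau of P_chi, except with probability at most delta."""
    m = math.ceil(math.log(2 / delta) / (2 * tau**2))
    return sum(chi(*ex()) for _ in range(m)) / m

# Illustrative target: c(x) = [x >= 0.5] under the uniform distribution on [0, 1].
c = lambda x: int(x >= 0.5)

def ex():
    x = random.random()
    return x, c(x)

# A query whose true probability is exactly 1: the label always matches the threshold.
agree = lambda x, label: int((x >= 0.5) == label)
# A query whose true probability is 1/2: the label itself.
positive = lambda x, label: label

print(simulate_stat(ex, agree, tau=0.05))     # exactly 1.0 here
print(simulate_stat(ex, positive, tau=0.05))  # close to 0.5
```

The point of the STAT abstraction is that the learner never sees individual labelled examples, only these noise-tolerant aggregate estimates – which is exactly why statistical-query algorithms survive classification noise.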

Theorem (Learnable from Statistical Queries means Learnable from Noisy Data). Let {\mathcal C} be a concept class and let {\mathcal H} be a representation class over X. Then if {\mathcal C} is efficiently learnable from statistical queries using {\mathcal H}, {\mathcal C} is also efficiently PAC learnable using {\mathcal H} in the presence of classification noise.

Hardness Results

I mentioned earlier in this post that Pitt and Valiant showed that sometimes we want hypothesis classes more general than the concept class: learning the concept class 3-term DNF using the hypothesis class 3-term DNF is intractable, yet learning the same concept class with the more general hypothesis class 3-CNF is efficient. So in their chapter Inherent Unpredictability, Kearns and Vazirani turn their attention to the case where a concept class is hard to learn independent of the choice of hypothesis class. This leads to some quite profound results for those of us interested in Boolean circuits.

We will need some kind of hardness assumption to develop hardness results for learning. In particular, note that if P = NP, then by Occam’s Razor (above) polynomially evaluable hypothesis classes are also polynomially-learnable ones. So we will need to do two things: focus our attention on polynomially evaluable hypothesis classes (or we can’t hope to learn them polynomially), and make a suitable hardness assumption. The latter requires a very brief detour into some results commonly associated with cryptography.

Let {\mathbb Z}_N^* = \{ i \; | \; 0 < i < N \; \wedge \text{gcd}(i, N) = 1 \}. We define the cubing function f_N : {\mathbb Z}_N^* \to {\mathbb Z}_N^* by f_N(x) = x^3 \text{ mod } N. Let \varphi denote Euler’s totient function. Then if \varphi(N) is not a multiple of three, it turns out that f_N is bijective, so we can talk of a unique discrete cube root.
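This is easy to play with for toy moduli. The sketch below uses tiny primes purely for illustration (the hardness assumption, of course, concerns n-bit primes): knowing \varphi(N) – equivalently, the factorisation of N – makes inversion easy, because raising to the power d = 3^{-1} \bmod \varphi(N) undoes cubing by Euler’s theorem.

```python
import math

# Toy parameters only: the assumption concerns n-bit primes.
p, q = 11, 17
N, phi = p * q, (p - 1) * (q - 1)   # phi(N) = 160, not a multiple of 3
assert phi % 3 != 0

def cube(x):
    return pow(x, 3, N)

# With d = 3^{-1} mod phi we have (x^3)^d = x (mod N) for all x in Z_N^*,
# by Euler's theorem. Without the factorisation of N, no efficient way of
# computing such a d is known -- that is the assumption.
d = pow(3, -1, phi)                 # modular inverse (Python 3.8+)
x = 42
assert math.gcd(x, N) == 1
y = cube(x)
print(y, pow(y, d, N))              # prints 36 42: the unique cube root recovers x
```

This trapdoor structure is exactly what the hardness construction exploits: evaluating f_N is easy, so concepts built from it live in small circuit classes, while inverting it is (assumed) hard.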

Definition (Discrete Cube Root Problem). Let p and q be two n-bit primes with \varphi(N) not a multiple of 3, where N = pq. Given N and f_N(x) as input, output x.

Definition (Discrete Cube Root Assumption). For every polynomial P, there is no algorithm A that runs in time P(n) and solves the discrete cube root problem with probability at least 1/P(n), where the probability is taken over the random choice of p, q and x (with N = pq), and over any internal randomisation of A.

This Discrete Cube Root Assumption is widely known and studied, and forms the basis of the learning complexity results presented by Kearns and Vazirani.

Theorem (Concepts Computed by Small, Shallow Boolean Circuits are Hard to Learn). Under the Discrete Cube Root Assumption, the representation class of polynomial-size, log-depth Boolean circuits is not efficiently PAC learnable (using any polynomially evaluable hypothesis class).

The result also holds if one removes the log-depth requirement, but this result shows that even by restricting ourselves to only log-depth circuits, hardness remains.

In case any of my blog readers knows: please contact me directly if you’re aware of any resource on positive results for the learnability of compositionally closed, non-trivial restricted classes of Boolean circuits.

The construction used to provide the result above for Boolean circuits can be generalised to neural networks:

Theorem (Concepts Computed by Neural Networks are Hard to Learn). Under the Discrete Cube Root Assumption, there is a polynomial p and an infinite family of directed acyclic graphs (neural network architectures) G = \{G_{n^2}\}_{n \geq 1} such that each G_{n^2} has n^2 Boolean inputs and at most p(n) nodes, and the depth of G_{n^2} is a constant independent of n, but the representation class {\mathcal C}_G = \cup_{n \geq 1} {\mathcal C}_{G_{n^2}} is not efficiently PAC learnable (using any polynomially evaluable hypothesis class), even when the weights are restricted to be binary.

Through an appropriate natural definition of reduction in PAC learning, Kearns and Vazirani show that the PAC-learnability of all these classes reduces to that of functions computed by deterministic finite automata. So, in particular:

Theorem (Concepts Computed by Deterministic Finite Automata are Hard to Learn). Under the Discrete Cube Root Assumption, the representation class of Deterministic Finite Automata is not efficiently PAC learnable (using any polynomially evaluable hypothesis class).

It is this result that motivates the final chapter of the book.

Experimentation in Learning

As discussed above, the PAC model utilises an oracle that returns labelled samples (x, c(x)). An interesting question is whether more learning power arises if we allow the algorithms to select x themselves, with the oracle returning c(x) – i.e. not just to be shown randomly selected examples, but to take charge and test their understanding of the concept.

Definition (Membership Query). A membership query oracle takes any instance x and returns its classification c(x).

Definition (Equivalence Query). An equivalence query oracle takes a hypothesis concept h \in {\mathcal C} and determines whether there is an instance x on which c(x) \neq h(x), returning this counterexample if so.

Definition (Learnable From Membership and Equivalence Queries). The representation class {\mathcal C} is efficiently exactly learnable from membership and equivalence queries if there is a polynomial p(\cdot,\cdot) and an algorithm with access to membership and equivalence oracles such that for any target concept c \in {\mathcal C}_n, the algorithm outputs the concept c in time p(\text{size}(c),n).
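To give a feel for how these two oracles interact, here is a toy version-space learner of my own construction – emphatically not Kearns and Vazirani’s DFA algorithm – that exactly learns a target from a finite class: it proposes a surviving candidate, and each counterexample from the equivalence oracle (together with a membership query on that point) eliminates every inconsistent candidate. The interval class over a small instance space is an illustrative choice.

```python
def equivalence_oracle(c, h, X):
    """Return a counterexample x with c(x) != h(x), or None if h agrees with c on X."""
    for x in X:
        if c(x) != h(x):
            return x
    return None

def exact_learn(concepts, c, X, membership):
    """Propose the first surviving candidate; on each counterexample, use a
    membership query to eliminate every inconsistent candidate. Terminates
    after at most |concepts| equivalence queries."""
    candidates = list(concepts)
    queries = 0
    while True:
        h = candidates[0]
        queries += 1
        x = equivalence_oracle(c, h, X)
        if x is None:
            return h, queries
        label = membership(x)
        candidates = [g for g in candidates if g(x) == label]

# Illustrative finite class: intervals [i, j] over the instance space {0, ..., 9}.
X = range(10)
concepts = [lambda x, i=i, j=j: i <= x <= j for i in range(10) for j in range(i, 10)]
target = lambda x: 3 <= x <= 7

h, q = exact_learn(concepts, target, X, membership=target)
print(q, [x for x in X if h(x)])  # exactly recovers {3, 4, 5, 6, 7}
```

The query count here is linear in the class size; the cleverness of the DFA result lies in achieving polynomial query complexity for an exponentially large class.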

There are a couple of things to note about this definition. It appears to be a much stronger requirement than PAC learning, as the concept must be exactly learnt. On the other hand, the existence of these more sophisticated oracles, especially the equivalence query oracle, appears to hand considerable power to the learner. Kearns and Vazirani encourage the reader to prove that the true strengthening over PAC-learnability is in the membership queries:

Theorem (Exact Learnability from Membership and Equivalence means PAC-learnable with only Membership). For any representation class {\mathcal C}, if {\mathcal C} is efficiently exactly learnable from membership and equivalence queries, then {\mathcal C} is also efficiently learnable in the PAC model with membership queries.

They then provide an explicit algorithm, based on these two new oracles, to efficiently exactly learn deterministic finite automata.

Theorem (Experiments Make Deterministic Finite Automata Efficiently Learnable). The representation class of Deterministic Finite Automata is efficiently exactly learnable from membership and equivalence queries.

Note the contrast with the hardness result of the previous section: through the addition of experimentation, we have gone from infeasible learnability to efficient learnability. Another very philosophically pleasing result.

Review: Neuroqueer Heresies

I recently picked up this enjoyable and thought-provoking collection of essays from Nick Walker. The book consists of a diverse collection of essays she has written over the years, starting from questions of neurodiversity, and ending by exploring how queer theory can be brought to the neurodiversity movement. Some of these essays have appeared before, especially on Walker’s website, and some are newly written for this book. Given that the original essays appeared as far back as 2011, many of them come with a partner essay in the book, reflecting on the changes since they were first published and on the impact they have had.

The book is split into three sub-collections: The Neurodiversity Paradigm, Autistic Empowerment, and Postnormal Possibilities.

Part I, The Neurodiversity Paradigm, deals with neurodiversity terminology and ideas. “Throw Away the Master’s Tools”, the first essay, sets the scene by raging against the pathology paradigm in neurodiversity: the idea that neurodivergence is somehow “less than” neurotypicality. She instead locates neurodiversity as natural, positive, and similar to racial and cultural diversity. Readers of this blog who are interested can read a lot more detail on this in my review of Chapman’s recent book Empire of Normality, where Chapman identifies possible explanations for the rise of this phenomenon. Walker argues that using the tools and language of medicalised approaches to neurodivergence is not going to lead to neurodivergent liberation. For Walker, using the right language is a key part (perhaps even the key part, given the prominence she gives it) of working towards self-empowerment of marginalised groups.

This sets the scene quite nicely for the next few essays, starting with Neurodiversity: Some Basic Terms and Definitions. These essays deal with definitional questions: neurodivergence versus neurodiversity versus the neurodiversity paradigm, neurotypicality, etc. Neurominority is an interesting term defined here by Walker – it seems to plug a hole in language I was looking for when reviewing Chapman’s work, when I was talking about “clusters”; it would be interesting to compare Chapman’s ideas on “serial collectives” to the “neurominority” definition of Walker.

In the essay Neurodivergence and Disability, Walker deals with the still controversial question of the relationship between these two concepts. I know autistic colleagues who strongly identify as disabled as a result of their neurodivergence, as well as those who shy away from the disability label. To my mind, Walker’s view on this makes perfect sense: she views disability through the social model, leading to the idea that disabled should be considered the opposite of enabled. When someone says they are enabled, we tend to imagine that someone or something outside of them (a mentor, a social structure, or even simply a mechanical tool) allows them to achieve something. Meanwhile, when someone says they are disabled, many people read that as a deficit in the individual, rather than the failure of someone or something outside of that individual to allow them to achieve something. It is also in this context that we first get a preview of Walker’s very strong rejection of “person first” language: “person with a disability” locates the disability with the person, whilst “disabled person” does not. To quote Walker, “if you ask me whether autism is a disability, I’ll say no, but if you ask me whether autistic people are disabled, I’ll say yes”.

Part I of the book finishes with an essay Reflections on Neurocosmopolitanism, which imagines a future in which the neurodiversity paradigm has been embraced by individuals and society. While I love the forward-looking nature of this, I feel it lacks the materialist grounding in class of Chapman’s work, which makes it feel somewhat disconnected from reality for me.

Part II is entitled Autistic Empowerment, and includes a diverse set of essays specifically focused on autism. This ranges from the short, accessible essay What is Autism? to the strongly-titled Person-First Language is the Language of Autistiphobic Bigots. One interesting claim made in this part of the book, in an essay entitled Autism and Social Trauma, is that autistic communication difficulties with allistics do not equally apply to communication with other autistics: that this is not an issue of autistic difficulty but rather of inter-neurotype communication difficulty. This claim is very interesting, but (in common with the other essays) a reference is not provided for evidence. If any of my readers are aware of evidence supporting or disputing this claim, I’d be really interested to hear more. Other essays are included on stimming, on parenting autistic children, on guidance for neurotypical psychotherapists working with autistic clients and a set of principles for those constructing a course on autism, amongst others.

The shortest part of the book, Part III, is called Postnormal Possibilities, and as I understand it takes a turn towards the application of queer theory to neurodiversity. This is my first exposure to queer theory, so my comments here need to be read in this context. Walker introduces the verb to neuroqueer (there is also, though less prominent in Walker’s thinking, the adjective neuroqueer). To neuroqueer covers, for Walker, a wide variety of activities, broadly involving transgressing and defying neuronormative and heteronormative behaviours simultaneously. To my mind, Walker’s best illustration of this might be provided in the final and lengthiest essay of the section, entitled A Horizon of Possibility: Some Notes on Neuroqueer Theory, in which she talks about the “policing of hands”: that neuronormative and heteronormative society insists on (at least in the societies Walker and I live in) a certain acceptable way of holding and using one’s hands. As I understand it, to consciously free oneself to stim with one’s hands, or to hold them in a manner that might cause disapproval as it “looks gay” or “looks autistic”, would be a clear example of neuroqueering.

I would recommend this book to others. It’s certainly thought-provoking, and there are parts of it that I have already found myself repeating to others, e.g. at work, I found myself in a meeting arguing that disabled is the opposite of enabled, and drawing the appropriate conclusions for inclusive practices. From an academic perspective, I find the lack of references to experimental literature or even other lived experience literature to be frustrating. Walker does herself talk about the problematic nature of academic literature in the field in this very collection, but I would love to see a version of some of these essays referencing other work, so I can build a more fully rounded picture of the evidence base for some of the claims. Having said that, I think it’s still a great jumping-off point, perhaps including for researchers into neurodivergence seeking that very evidence base. Walker comes across as a passionate advocate for her community.

Energy: Rewriting the Possibilities

In early June, my PhD student Sam Coward (co-advised by Theo Drane from Intel) will travel to ARITH 2024 in Málaga to present some of our most recent work, “Combining Power and Arithmetic Optimization via Datapath Rewriting”, a joint paper with Emiliano Morini, also of Intel. In this blog post, I will describe the fundamental idea of our work.

It’s well-known that ICT is driving a significant amount of energy consumption in the modern world. The core question of how to organise the fundamental arithmetic operations in a computer in order to reduce power (energy per unit time) has been studied for a long time, and continues to be a priority for designers across industry, including the group at Intel with whom this work has been conducted.

Readers of this blog will know that Sam has been doing great work on how to explore the space of behaviourally equivalent hardware designs automatically. First for area, then for performance, and now for power consumption!

In our latest work, Sam looks at how we can use the e-graph data structure, and the related egg tool, to tightly integrate arithmetic optimisations (like building multi-input adders in hardware) with clock gating and data gating, two techniques for power saving. Clock gating avoids clocking new values into registers in hardware if we know they’re not going to be used in a given cycle; this avoids the costly switching activity associated with propagating unused information in a digital circuit. Data gating also avoids switching, but in a different way – by replacing operands with values inducing low switching: for example, if I do not end up using the result of a \times b, then I may as well be computing a \times 0. In both cases, the fundamental issue becomes how to identify whether a value will be unused later in a computation. Intriguingly, this question is tightly related to the way a computation is performed: there are many ways of performing a given mathematical computation, and each one will have its own redundancies to exploit.
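To make the data-gating intuition concrete, here is a minimal toy model in Python (entirely my own illustration, not ROVER or the paper’s cost model): it counts how many output bits of a multiplier flip across a short trace of cycles, with and without forcing an unused operand to zero.

```python
# Toy model of data gating: when a product will be unused, replace an
# operand with 0 so the product's bits stop toggling cycle-to-cycle.

def hamming(a, b):
    """Number of bits that switch between two successive values."""
    return bin(a ^ b).count("1")

def gated_operand(b, use):
    # Data gating: substitute a low-activity value when the result is unused.
    return b if use else 0

# Toy trace of (a, b, use) per cycle; `use` says whether a*b is consumed.
trace = [(3, 5, True), (7, 9, False), (2, 4, False), (6, 1, True)]

def total_switching(gate):
    """Sum of bit flips on the product register over the trace."""
    prev, total = 0, 0
    for a, b, use in trace:
        prod = a * (gated_operand(b, use) if gate else b)
        total += hamming(prev, prod)
        prev = prod
    return total

ungated = total_switching(False)
gated = total_switching(True)
assert gated <= ungated  # gating reduces toggling on this particular trace
```

In this toy trace the gated design flips fewer output bits; a real flow must of course also prove, as the paper does via rewriting, that the result is genuinely unused before substituting the zero.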

In our ARITH 2024 paper, Sam has shown how data gating and clock gating can be expressed as rewrites over streams of Boolean data types, lifting our previous work, which looks at equivalences between bit vectors, to equivalences over streams of bit vectors. In this way, he’s able to express both traditional arithmetic equivalences like a + (b + c) = (a + b) + c and equivalences expressing clock and data gating within the same rewriting framework. A collection of these latter equivalences is shown in the table below from our paper.

Some of the rewrites between equivalent expressions used in our ARITH 2024 paper

Sam has been able to show that by combining the rewrites creatively, using arithmetic rewrites to expose new opportunities for gating, our tool ROVER is able to save some 15% to 30% of power consumption over a range of benchmark problems of industrial interest. Moreover, ROVER will automatically adjust the whole design to better suit different switching profiles, knowing that rarely-switching circuit components are less problematic for energy, and prioritising rewrites where they are most needed.

I think this is really interesting work, and shows just how general the e-graph approach to circuit optimisation can be. If you’re going to ARITH 2024, do make sure to talk to Sam and find out more. If not, make sure to read his paper!

FPGA 2024

I have just returned from a wonderful trip to California with colleagues and students, into which I managed to pack: visiting AMD for a day of presentations to each other, co-running a workshop on Spatial Machine Learning, attending the ACM FPGA 2024 conference at which my student Martin Langhammer presented, and catching up with international colleagues and with Imperial College alumni and their families. In this blog post I’ll briefly summarise some of the work-related key takeaways from this trip for me.

A few of our alumni in the San Francisco Bay Area. Immediately behind me, with white hair, is Dr Nan Zhuang who first taught me logic synthesis when I was a first-year undergraduate and he was a PhD student at Imperial College

AMD and Imperial

At AMD’s Xilinx campus, I was able to share some of our most recent work. I chose to focus on two aspects: the work we’ve done on e-graphs as an EDA data structure and their development for datapath optimisation, and the development of new highly-efficient LUT-based neural networks. It was great to catch up with AMD on the latest AI Engines developments, especially given the imminent open-sourcing of a complete design flow here. My former PhD students Sam Bayliss and Erwei Wang have been working hard on this – amongst many others, of course – and it was great to get a detailed insight into their work at AMD.

SpatialML Workshop

Readers of this blog may know that I currently lead an international centre on spatial machine learning (http://spatialml.net). This year we ran an open workshop, sharing the work we have been doing in the centre, and bringing in additional colleagues from outside the centre too. This was exceptionally well attended, with delegates from academia and industry internationally: A*STAR, AMD, Ciena, Cornell, DRW, Essen, Groq, Imperial College, Intel, KAUST, Mangoboost, Minnesota, Rhode Island, Seoul, Simon Fraser, Southampton, Tsinghua, Toronto, and UCLA. We structured the day around a number of excellent academic talks from Zhiru Zhang (Cornell), Vaughn Betz (University of Toronto), Paul Chow (University of Toronto), Lei Xun (University of Southampton), Atefeh Sohrabizadeh (UCLA), and Alex Montgomerie (Imperial College), combined with industrial keynotes from Satnam Singh (Groq) and Sam Bayliss (AMD), closing with a panel discussion on the topic of working across abstraction layers; we had Jason Anderson (University of Toronto), Jason Cong (UCLA), Steve Neuendorffer (AMD) and Wayne Luk (Imperial College) on the panel, with me chairing (sadly Theo Drane from Intel could not make it due to poor weather conditions in Northern California). Slides that are publicly sharable will be made available on the workshop website.

We learned about the use of machine learning (graph representation learning, GNNs) in EDA flows, about scaling LLM implementations across multiple devices in industry and academia, about dynamic DNNs and resource scaling, about programming models and flows for deep learning acceleration in both academia and industry (often using MLIR), and about FPGA architectural enhancements to support more efficient deep learning.

We received very positive feedback from workshop attendees. I would like to express particular thanks to my workshop co-organisers, especially the astoundingly productive Andrew Boutros of the University of Toronto, for all his hard work making this workshop happen.

Our SpatialML workshop begins
The panel discussion at the workshop. L-R: Jason H. Anderson, Jason Cong, George A. Constantinides, Wayne Luk, Stephen Neuendorffer

FPGA 2024

I consider the ACM FPGA conference as my “home” conference: I’m a steering committee member and former chair, but this year my only roles were as a member of the best paper committee and as a session chair, so I could generally sit back and enjoy listening to the high quality talks and interacting with the other delegates. This year Andrew Putnam from Microsoft was technical program chair and Zhiru Zhang from Cornell was general chair. There is of course too much presented in a conference to try to summarise it all, but here are some highlights for me.

  • Alireza Khataei and Kia Bazargan had a nice paper, “CompressedLUT: An Open Source Tool for Lossless Compression of Lookup Tables for Function Evaluation and Beyond”, on lossless compression of large lookup tables for function evaluation.
  • Ayatallah Elakhras, Andrea Guerrieri, Lana Josipović, and Paolo Ienne have done some great work, “Survival of the Fastest: Enabling More Out-of-Order Execution in Dataflow Circuits”, on enabling (and localising) out-of-order execution in dynamic high-level synthesis, extending the reach of out-of-order execution beyond the approach I took with Jianyi Cheng and John Wickerson.
  • Louis-Noël Pouchet, Emily Tucker and coauthors have developed a specialised approach to checking equivalence of two HLS programs, “Formal Verification of Source-to-Source Transformations for HLS”, for the case where there are no dynamic control decisions (common in today’s HLS code), based on symbolic execution, rewriting, and syntactic equivalence testing. It basically does what KLEE does for corner cases of particular interest to HLS, but much faster. Their paper won the best paper award at FPGA 2024.
  • Jiahui Xu and Lana Josipović had a nice paper, “Suppressing Spurious Dynamism of Dataflow Circuits via Latency and Occupancy Balancing”, allowing for a more smooth tradeoff between dynamic and static execution in high-level synthesis than was possible from the early work my PhD student Jianyi Cheng was able to achieve on this topic. They get this through balancing paths for latency and token occupancy in a hardware-efficient way.
  • Daniel Gerlinghoff and coauthors had a paper, “Table-Lookup MAC: Scalable Processing of Quantised Neural Networks in FPGA Soft Logic”, building on our LUTNet work and extensions thereof, introducing various approaches to scale down the logic utilisation via sequentialisation over some tensor dimensions and over bits in the linear (LogicNets) case.
  • My PhD student Martin Langhammer (also with Altera) presented his work, “A Statically and Dynamically Scalable Soft GPGPU”, on a super high clock frequency soft GPU for embedding into FPGA designs.
Martin Langhammer presenting our work on a super-high clock frequency compact soft GPU

Reflections

I’ve come away with many ideas from all three events: the AMD visit, the workshop, and the FPGA conference. In-person attendance at conferences is still the best for this; I didn’t get nearly as many ideas when attending FPGA 2021 or 2022 remotely. It was also particularly satisfying to see our work on soft-logic efficient deep neural networks (starting with LUTNet, most recently PolyLUT) being cited by so many people at the conference; this work appears to have really made a long-term impact.

Finally, it is always a joy to visit Monterey. This year FPGA was held at a hotel on Cannery Row, described by John Steinbeck as “a poem, a stink, a grating noise, a quality of light, a tone, a habit, a nostalgia, a dream”.

Review: Empire of Normality

Robert Chapman, a neurodivergent philosopher, published this book, “Empire of Normality: Neurodiversity and Capitalism”, in 2023. I have been reading this thought-provoking book over the last few weeks, and have a few comments.

A Historical Materialist Approach

The general thesis of Chapman’s book, as I understand it, is that the very concept of “normal human functioning” is intimately bound up with – and emerged from – the economic demands of capitalism. Their book takes the reader on a really interesting journey from ancient to modern views of disease and abnormality, taking in the rise and fall of eugenics, the anti-psychiatry movement and its cooption by neoliberal policymakers, Fordism and post-Fordist production and their impact on mental health, the rise of the neurodiversity movement, and Chapman’s views on neurodiversity and liberation. The book aims to take a historical materialist approach to these questions, emphasising the importance of material economic conditions throughout. I learnt a lot about the history of mental illness, disability, and neurodiversity.

Chapman makes the argument that modern capitalism is harmful to neurotypical and neurodivergent alike, with neurotypical workers facing extreme forms of alienation while neurodivergent potential-workers are often consigned to a “surplus class”. This term appears to be used to denote the idea that such people form a group which can be called on to re-enter the workforce during times of economic growth, akin to the traditional unemployed, but the mechanisms for this to actually happen are not fully drawn out. I’m familiar with the classical concept of a “reserve army of labour”, but not of “surplus class”; late in the book, Chapman cites Adler-Bolton and Vierkant for this concept.

Bodies as Machines, Measurement & Pathology

The metaphor of the body as a machine, and of both neurodiversity and mental illness as corresponding to “broken machines”, is criticised heavily throughout the book. Yet I think the metaphor is used in two ways by Chapman: one in which machine means a deterministic device that can be repaired or enhanced with appropriate technical skill, and another in which machine means a mass-produced device with interchangeable parts and interfaces to the outside world. It is not always clear to me which Chapman is criticising, and I think sometimes arguments against the first view are used rhetorically to support arguments against the second.

There’s a very detailed and interesting discussion on the history of measurement and its link to identifying outlying individuals. What’s less well developed, in my view, is a clear understanding of the potential of clustering of certain traits. To my mind, labels like ADHD have meaning precisely because they define one or more clusters of observable phenomena. These clusters may themselves be very far from the mean, yet identifying such clusters as clusters, rather than as isolated atypical individuals, can clearly be helpful. This point arises also much later in the book, where Chapman refers to “autistic customs” when describing the rise of the neurodiversity movement; the very idea of “customs” seems to only apply to groups of individuals, so this focus on clustering rather than on individuals seems essential to any argument made on this basis, rather than solely around the need to accommodate each unique individual. A reference is made late in the book to Sartre’s notion of ‘serial collectives’, which may hold out a basis for further analysis of this point. “Far from the mean” is also often taken implicitly to correspond to “far below the mean”, and I think the explicit thesis that “the mean is ideal”, linked to middle-class values by Chapman, is missing a fleshed-out discussion of where this leaves “far above the mean” in terms of valuation.

The “pathology paradigm” is introduced as a set of assumptions: that mental and cognitive functioning are individual (whereas, as Chapman convincingly argues, there is a clear social element), that they are based on natural abilities, and that they can be totally ordered (or, as Chapman states it, ‘ranked in relation to a statistical norm across the species’).

Chapman does an excellent job in explaining how modern production leads to separation and sorting by neurotype. The claim made that “increasingly, new forms of domination have less to do with social class, which now, to an extent, is more fluid than in the 19th century, and much more to do with where each of us falls on the new cognitive hierarchies of capitalism” seems weaker if – by domination – it is intended to mean that the separation is self-reinforcing, akin to the traditional Marxist view of capital accumulation. This conclusion is hinted at, but I don’t think the case is strongly made – nor am I sure it is correct.

Chapman’s Marxian argument is strong when utilising the concept of alienation in the “post-Fordist” service economies to argue that some traits that were typically relatively benign became associated with disablement as a result of this economic shift. What is less clear to me is whether the same could be said the other way round: are there traits we can identify that were disabling under a “Fordist” economy but are now relatively benign? I hear elsewhere, for example, much said about the shift to home-working benefiting some autistic workers significantly, especially in the software engineering industries. Their linking of disability to capitalism’s historical growth is clear and well argued, though I felt it weaker when looking to the future. For example, it is stated that “capitalist logics both produce and require disablement”; here I am unsure whether “require” is meant in the sense of directly flowing from commodification of labour or in the sense of requiring a disabled “surplus class”. I think the argument for the former appears much stronger. Similarly the term “physician class” is used at several points in the text, and I am not completely sure how physicians would constitute a class per se.

I found the discussion of the rise, and revision, of the Diagnostic and Statistical Manual of Mental Disorders (DSM) fascinating, especially the arguments for including – and then removing – homosexuality from the manual (it appeared in DSM-I and DSM-II but was eliminated under DSM-III) on the basis that it did not meet the new criterion that “the disturbance is not only in the relationship between the individual and society”. I also did not know that, according to Chapman, some 25% of the UK population has been prescribed psychotropic medication – a statistic I find extraordinary.

Some Thoughts to be Developed

I think that there is plenty that could be said about what I would describe as the scalarisation of value through the one-dimensional lens of price and how this potentially moves to limit and devalue variety, not only in neurotype but across political economy. The point is hinted at here, via ideas around the commodification of labour, but perhaps Chapman or others could be convinced to develop it further.

Given my background in school governance, I was struck by the commentary around differences in development in comparison to one’s age group becoming increasingly salient during the Fordist period. There is, I suspect, a lot more to say also about the shift to ‘teaching-to-the-age’ in the education world, rather than ‘teaching-to-the-individual’, a trend that, in England, was markedly accelerated by the 2014 national curriculum.

I feel like I need to read more about behaviourism. As someone who has studied the behaviour of dynamical systems, there is a difference between the philosophical approach Chapman ascribes to Watson, “science should focus only on what can objectively be observed”, which strikes me as a call to extensionality, and the conclusions claimed to follow (whether the inference is by Watson or by Chapman, I am not clear due to my ignorance of the field), that behaviour can be usefully directly controlled (e.g. via Applied Behaviour Analysis, an approach with a very bad reputation in the autistic community). The first claim can hold without the second having any validity, in the presence of stateful behaviour.

Looking to the Future

In contrast to the history of neurodiversity, the history of the Soviet Union Chapman provides is weak and doesn’t really seem to follow the materialist approach set out in the earlier subject matter in the book. We are told that not being career politicians contributed to the Bolshevik failure. While much space is given to the ‘state capitalist’ nature of the early Soviet state, the almost absolute discontinuity between that state and the Stalinist bureaucracy is not discussed; rather, we are told that Stalin “gave up on shifting beyond state capitalism [and] simply declared that communism had been achieved”. The idea that Stalin inherited the post-revolutionary state whole seems very odd, and I am also pretty sure he never proclaimed that (full) communism had been achieved.

The arguments Chapman makes for liberation I find less convincing than their materialist analysis of the emergence of neurodiversity as a form of disablement. In particular, there appears to me to be a jump from the well argued case that capitalism brings about groups of disadvantaged neurodiverse, to the less well argued case that such groups can form agents of change by organising in those groups – as Chapman puts it, “neurodivergent workers organising as neurodivergents” and to “empower the surplus as surplus”. Despite the materialist approach taken to historical analysis of neurodiversity and disability, this materialism appears somewhat missing from the calls to “turn[…] everyday comportment and behaviour into forms of resistance” or the statement that “we [disabled] have collective cognitive power … that could be harnessed no less than the collective power of the working class”, or that envisioning the future requires “mass consciousness-raising, critique, and collective imagining”.

Overall

Despite my gripes above, I would recommend this book. I learnt a lot of history, and I think it’s great to see this being approached from a materialist perspective. I also get the sense of Chapman, a neurodivergent philosopher, being determined to live their philosophy, and this is definitely to be celebrated in my book.

Nick Higham

I am devastated to be told today that my colleague Nick Higham has died. Nick was an outstanding research leader in numerical analysis, but he was also a deeply kind man, and I feel I need to say something about the limited time I spent with him.

I have always worked on the boundary of electronic engineering and computer science, often working on efficient hardware implementations of numerical algorithms. While as a research student I was aware of the early work of Jim Wilkinson, it was only after my PhD was complete in 2001/2 that I became aware of Nick’s work via his outstanding book Accuracy and Stability of Numerical Algorithms. There are only a few books I have purposely bought multiple copies of over the years, and this is one. I have at least three – one for home, one for work, one for my PhD students. It happens to be sitting on my desk at work next to my laptop as I write this. As does an early Muldivo mechanical calculator – another shared joy we bonded over: Nick’s office featured a Brunsviga model last time I was there.

I can’t remember how or when I first met Nick in person, but we hit it off very quickly. Despite being as eminent as they come, he took a deep interest in the work of a junior member of academic staff in a different discipline. He invited me to present in Manchester on a number of occasions, included me in SIAM conference mini-symposia he was organising, and involved me in a Royal Society event he co-organised. We had deep technical discussions, yes, but we also spoke about careers, about my role as a school governor (and about the school his son was then attending), about Jim Wilkinson’s work, and he encouraged me to think about fellowship of SIAM. One of my favourite conference experiences was attending SIAM CSE in Spokane, Washington, at Nick’s invitation, primarily because it was so different to the usual Electronics and Computing conferences I tend to attend.

Nick was a profoundly practical person as well as being deeply involved in theoretical analysis. After I told him about some experimental observations a PhD student and I had made some years earlier regarding finite precision effects in certain Krylov methods, he was surprised and a few days later had coded up an example, verified this behaviour, and written a SIAM News column about it! We connected over the interrelation of theory and practice, and he told me that he very much liked the suggestion I made that the rise of custom hardware and FPGA-based compute meant that the time had come for computer science, numerical analysis, and electronic engineering to come back together, after a long period of the disciplines not talking as much as I would like. He was also amused by my interest in “very pure” (his words) real algebraic geometry.

Nick taught me – whether he knew it or not – how to communicate to mathematicians, as well as how to use LaTeX for slides… even down to emailing me suggestions for LaTeX code hacks to make my slides better. His presentations were always a model of clarity and depth.

Many colleagues in the Numerical Analysis and Linear Algebra world will know Nick much better than I did, but he touched my research life more deeply than most. Somewhere in the back of my mind, I had the idea that I’d like to take a sabbatical leave with Nick in the future. I thought there would be plenty of time. 

PolyLUT: Ultra low latency inference

This week my PhD student Marta Andronic will present our paper “PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference” at the International Conference on Field Programmable Technology (FPT), in Yokohama, Japan, where it has been short-listed for the best paper prize. In this blog post I hope to provide an accessible introduction to this exciting work.

FPGAs are a great platform for high-performance (and predictable performance) neural network inference. There’s a huge amount of work that’s gone into both the architecture of these devices and into the development of really high quality logic synthesis tools. In our work we look afresh at the question of how to best leverage both of these in order to do really low-latency inference, inspired by the use of neural networks in environments like CERN, where the speed of particle physics experiments and the volume of data generated demands really fast classification decisions: “is this an interesting particle interaction we’ve just observed?”

Our colleagues at Xilinx Research Labs (now part of AMD) published a great paper called LogicNets in 2020 that aimed to pack the combination of linear computation and activation function in a neural network into lookup tables in Verilog, a hardware description language, which then get implemented using the logic synthesis tools on the underlying soft-logic of the FPGA. Let’s be a bit more precise. The operation of a typical neuron in an artificial neural network is to compute the real-valued function y = \phi( w^T x + b ) for its inputs x and some learned weight vector w and bias b, where \phi is typically a fixed nonlinear function such as a ReLU. In practice, of course we use finite precision approximations to x and y. The Xilinx team noted that if they restrict the length of the vector x to be small enough, then the quantised version of this entire function can be implemented by simply enumerating all possible values of the input vector x to form a truth table for y, and using the back-end logic synthesis tools to implement this function efficiently, stitching together the neurons constructed in this way to form a hardware neural network. Moreover, note that in this setting one does not need to quantise the weights and bias at all – since the computation is absorbed in a truth table, arbitrary real-valued weights are just fine.
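As a concrete illustration of this enumeration idea, here is a small Python sketch (my own reconstruction of the principle, not the LogicNets code; the function names are mine). It builds the complete truth table for one small-fan-in neuron; note that the weights and bias stay real-valued, since they are absorbed into the table.

```python
# Sketch: enumerate a quantised neuron y = ReLU(w.x + b) into a lookup table.
import itertools

def quantise(v, bits):
    """Clamp and round v to an unsigned fixed-point code with `bits` bits."""
    return max(0, min(2**bits - 1, round(v)))

def neuron_lut(w, b, in_bits, out_bits):
    """Enumerate every input combination of a neuron into a truth table.
    Weights and bias remain real-valued: they are absorbed into the table."""
    fan_in = len(w)
    lut = {}
    for x in itertools.product(range(2**in_bits), repeat=fan_in):
        pre = sum(wi * xi for wi, xi in zip(w, x)) + b
        act = max(0.0, pre)               # ReLU activation
        lut[x] = quantise(act, out_bits)  # quantised output code
    return lut

# A 2-input, 2-bit neuron: only 16 table entries, whatever the weight precision.
lut = neuron_lut(w=[0.7, -0.3], b=0.1, in_bits=2, out_bits=2)
assert len(lut) == 16
```

The table size is exponential in (fan-in × input bits), which is exactly why the fan-in must be kept small for this approach to be feasible.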

Marta and I have generalised this approach considerably. Firstly, we note that once we’re down at the level of enumerating truth tables, there’s no particular reason to limit ourselves to functions of the form y = \phi( w^T x + b ) – why not use an arbitrary function instead? From the perspective of the enumeration process and logic synthesis, it makes no difference. But from the perspective of neural network training, it certainly does. If we really wanted to look at arbitrary functions, we’d need a number of parameters to train that scales exponentially with the number of bits used to represent the entire vector x. This might be OK for very small vectors – indeed, with my former student Erwei Wang, I looked at something similar for the tiny physical lookup tables inside FPGA devices – but at the scale of neurons this is infeasible. So what family of functions should we use? Marta and I propose to use polynomials, where the total degree is fixed as a user-defined parameter we call D. In this way, we can tune a knob: turn down to D=1 for minimal expressivity but the smallest number of parameters to train, and recover LogicNets as a special case; turn up D and you get much more expressive functions you can pack into your LUTs, at the cost of more parameters to train. In a classical neural network, composition of linear layers and ReLU functions gives rise to the overall computation of a continuous piecewise linear function. In our networks, we have continuous piecewise polynomial functions. The important thing, though, is that we never have to do all the expensive multiplications and additions one typically associates with evaluating a polynomial – the implementation is all just a table lookup for each neuron, just like in the linear case.
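The polynomial generalisation can be sketched as follows (again my own illustrative reconstruction, not Marta’s released code; names and the tiny example are mine): expand the inputs into all monomials up to total degree D, then take a learned linear combination, with D=1 recovering the affine LogicNets case.

```python
# Sketch of a PolyLUT-style neuron: degree-D polynomial pre-activation + ReLU.
import itertools

def monomials(x, degree):
    """All monomials of total degree <= `degree` over inputs x (incl. the constant 1)."""
    feats = [1.0]
    for d in range(1, degree + 1):
        for combo in itertools.combinations_with_replacement(x, d):
            prod = 1.0
            for v in combo:
                prod *= v
            feats.append(prod)
    return feats

def poly_neuron(x, coeffs, degree):
    """One learned coefficient per monomial; D=1 is the affine special case."""
    feats = monomials(x, degree)
    assert len(coeffs) == len(feats)
    return max(0.0, sum(c * f for c, f in zip(coeffs, feats)))

# Two inputs at D=2 give six features: [1, x0, x1, x0^2, x0*x1, x1^2].
assert len(monomials([0.0, 0.0], 2)) == 6
```

The parameter count grows with D, but the hardware cost does not: in both cases the quantised function is enumerated into the same-sized lookup table, exactly as in the linear case above.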

So what does higher degree actually buy us? It’s well known that deep neural networks of the classical form are universal function approximators, so with enough neurons (enough depth or width), we can arbitrarily closely approximate a continuous function anyway. But by packing in more functionality into each neuron, which anyway gets enumerated and implemented using lookup tables in the FPGA (just like the classical case), we can get the same accuracy with fewer layers of network. And fewer layers of network means fewer layers of Boolean logic, which means lower latency. In fact, we show in our results that one can often at least halve the latency by using our approach: we run 2x as fast! This two-fold speedup comes from the more intricate decision boundaries one can implement with polynomial compute; look at this beautiful near-separation of the two dot colours using a curved boundary, and imagine how many small linear segments you would need to do a similar job – this provides the intuition for just why our networks perform so well.

PolyLUT is available for others to use and experiment with. Don’t just take our word for it, download Marta’s code and try it out for yourself! I’m also delighted that Marta’s paper at FPT is one of just three at the conference to have earned all available ACM badges, verifying that not only is her work public, but it has been independently verified as reproducible. Thank you to the artifact review team for their work on the verification!

Industry Coheres around MX?

In recent years there has been a rediscovery (sometimes multiple times!) of block floating point, a number representation used for general-purpose compute in the early days of the electronic computer, and in specialist areas such as signal processing throughout the field’s history. The main reason for this modern rediscovery is the value of block floating point, and related ideas like block minifloat, in modern machine learning implementations.

Just like in the early days of floating point, a number of different approaches and implementations have flourished in recent years, so it was interesting to see the recent launch of the Open Compute Project OCP Microscaling Formats (MX) Specification, coauthored across Microsoft, AMD, Arm, Intel, Meta, NVIDIA and Qualcomm by a number of well-known figures (including Imperial’s very own Martin Langhammer). This document will be referred to as “the specification” in the rest of this blog post. At the same time, a paper was released on arXiv, supported by a GitHub repo, to show that when used for training deep neural networks (DNNs), MX-based implementations can produce high quality results.

In this blog post, I will briefly review a few of the features of the specification that I found particularly interesting as an “old hand” in fixed point and block floating point computer arithmetic, and raise some issues that the community may wish to take forward.

Firstly, I should say that it’s great to see efforts like this to raise the profile and awareness of block floating point and related number systems. My hope is that efforts like this will drive further work on the extremely important topic of efficient numerical hardware.

Storage

  1. Building block types. The focus of the specification is primarily storage, although certain computations are mentioned. The specification primarily defines the construction of a type for MX computations from three parameters: a (scalar) type for scalings, a (scalar) type for elements, and a (natural) number of elements. A value inhabiting this type is a scalar scaling X together with k elements, P_i, for 1 \leq i \leq k. The numbers represented are (special values aside) v_i = X \cdot P_i. This is a very flexible starting point, which is to be welcomed, and there is much to explore under this general hood. In all the examples given, the scalar data types are either floating point or integer/fixed point, but nothing in the specification seems to specifically bar other representations, which could of course go far beyond traditional block floating point and more recent block exponent biases.
  2. Concrete types. In addition to specifying a generic type constructor, as above, the specification also defines certain concrete instantiations, always with a power-of-two scaling stored in a conventional 8-bit biased integer exponent, and with either floating point or integer elements. By my reading of the specification, however, an MX-compliant implementation does not need to implement any of these concrete types.
  3. Subnormals mandated. If one of the specified concrete formats is implemented, subnormal representations must be supported.
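As a sketch of the generic type constructor described in point 1, here is a minimal Python model (the names `MXBlock` and `to_mx` are hypothetical, and quantisation of the elements themselves is omitted; a concrete format would restrict both the scale and the elements to narrow types):

```python
import math
from dataclasses import dataclass

@dataclass
class MXBlock:
    X: float        # the shared scalar scaling
    P: list         # the k elements P_1..P_k

    def values(self):
        # the represented numbers (special values aside): v_i = X * P_i
        return [self.X * p for p in self.P]

def to_mx(v):
    # one plausible encoder: pick a power-of-two scaling from the largest
    # magnitude, then store each element relative to it
    amax = max(abs(x) for x in v)
    X = 2.0 ** math.floor(math.log2(amax)) if amax else 1.0
    return MXBlock(X, [x / X for x in v])
```

Because the elements here are unquantised floats, `to_mx` round-trips exactly; the interesting trade-offs appear once the elements are forced into a narrow concrete type.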

Computation

1. Dot products

Apart from conversion into MX, the only MX operation discussed in the specification is the dot product of MX vectors with equal numbers of elements. This appears to leave some basic operations undefined, for example the simple addition of two MX vectors. (It is, of course, possible to provide a reasonable semantics for this operation consistent with the scalar base types, but it is not in the specification.)

The definition given of the “Dot” operation is interesting. The specification says “The following semantics must be minimally supported”, and gives the following equation:

C = \text{Dot}(A,B) = X^{(A)}X^{(B)}\sum_{i=1}^k P_i^{(A)} \times P_i^{(B)}

where P_i^{(A)} denotes the ith element of vector A, X^{(A)} denotes the scaling of A, and so on.

To my mind, the problem with this definition is that \times (on elements), the (elided) product on scalings, and the sum over element products are all undefined. The intention, of course, is that the dot product should approximate a real dot product, in the sense that \llbracket C \rrbracket \approx \llbracket X^{(A)} \rrbracket \llbracket X^{(B)} \rrbracket \sum_{i=1}^k \llbracket P_i^{(A)} \rrbracket \times \llbracket P_i^{(B)} \rrbracket, where the semantic brackets denote interpretation in the base scalar types and the arithmetic operations are over the reals. But the nature of the approximation is left unstated. Note, in particular, that it is not a form of round-to-nearest defined on MX, as hinted at by the statement “The internal precision of the dot product and order of operations is implementation-defined.” This statement at least suggests an assumed implementation of the multi-operand summation via two-input additions executed in some (identical?) internal precision in some defined order, though this is a very specific, and rather restrictive, way of performing multi-operand addition in hardware. There is nothing wrong with leaving this undefined, of course, but an MX-compliant dot product would then need a very clear statement of its approximation properties; perhaps something that can be considered for the standard in due course.
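To illustrate one possible reading of the Dot equation, here is a Python sketch in which blocks are modelled as (scaling, elements) pairs; Python floats stand in for the implementation-defined internal precision, and a left-to-right loop stands in for the implementation-defined summation order. These are exactly the choices the specification leaves open.

```python
def mx_dot(A, B):
    # Dot(A, B) = X_A * X_B * sum_i( P_i_A * P_i_B ).
    # The internal precision and summation order are implementation-defined,
    # so left-to-right Python float arithmetic here is one arbitrary choice.
    (XA, PA), (XB, PB) = A, B
    assert len(PA) == len(PB), "Dot is only defined for equal element counts"
    acc = 0.0
    for pa, pb in zip(PA, PB):
        acc += pa * pb
    return XA * XB * acc
```

A different internal precision or summation tree would generally produce a (slightly) different result, which is why a clear statement of approximation properties matters.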

2. General Dot Products

The specification also defines an operation called “DotGeneral” which is a dot product of two vectors made up of MX-vector components. The specification defines it in the following way:

C = \text{DotGeneral}(A,B) = \sum_{j=1}^n \text{Dot}(A_j, B_j)

“where A_j and B_j are the jth MX-compliant subvector of A and B.” As in the “Dot” definition, the summation notation is not defined in the specification; moreover, this definition does not carry the same wording as “Dot” on internal precision and order, though I am assuming it is intended to. How to form the component subvectors is also left undefined. I assume the authors envisaged the vector A_j consisting of the jth block of k contiguous elements of the vector A, but of course there are many other ways this could be done (e.g. strided access, etc.)
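Under the contiguous-blocks reading guessed at above, DotGeneral can be sketched as follows (again with blocks as (scaling, elements) pairs, already split by the caller, and plain left-to-right float summation standing in for the undefined outer sum):

```python
def mx_dot_general(A_blocks, B_blocks):
    # DotGeneral(A, B) = sum_j Dot(A_j, B_j).
    # We assume the jth block holds the jth run of k contiguous elements,
    # and that the outer sum is left-to-right float addition, since the
    # specification pins down neither.
    def dot(block_a, block_b):  # per-block Dot, as in the spec's equation
        (XA, PA), (XB, PB) = block_a, block_b
        return XA * XB * sum(pa * pb for pa, pb in zip(PA, PB))
    assert len(A_blocks) == len(B_blocks)
    return sum(dot(a, b) for a, b in zip(A_blocks, B_blocks))
```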

I would be interested to hear from any blog readers who think I have misinterpreted the standard, or who think other components of the specification – not discussed here – are more interesting. I will be watching for uptake of this specification: please do alert me to interesting uses.

Supercharging Formal with E-Graphs

Designing circuits is not easy. Mistakes can be made, and mistakes made are not easily fixed. Even leaving aside safety-critical applications, product recalls can cost upwards of hundreds of millions of US dollars, and nobody wants to be responsible for that. As a result, the industry invests huge effort in formally verifying hardware designs; that is, in coming up with computer-generated or computer-assisted proofs of hardware correctness.

My Intel-funded PhD student Sam Coward (jointly advised by Theo Drane) is about to head off to FMCAD 2023 in Ames, Iowa, to present a contribution to making this proof generation process speedier and more automated, while still relying on industry-standard, tried and trusted formal verification tools. Together with his Intel colleague Emiliano Morini and then-intern (and Imperial student) Bryan Tan, Sam noticed that the e-graph techniques we have been using in his PhD for optimising datapath [1, 2] also have a natural application to verifying datapath designs, akin to, but distinct from, their long-time use as a data structure in SMT solvers.

The basic idea is straightforward. Ever since our earliest ARITH paper, we’ve developed a toolbox of Intel-inspired rewrite rules that can be applied to optimise designs across parametric bitwidths. Our optimisation flow, ROVER, applies these to a single expression we wish to optimise, and then extracts a minimal-cost implementation. Our new paper asks a different question: what if we start not from a single expression we wish to optimise, but from two expressions we’re trying to prove equivalent?

Following the approach I blogged about here from ROVER, if the expressions are not equivalent then they should never end up in the same equivalence class after multiple rewrite iterations. If they are equivalent then they may or may not end up in the same equivalence class: we provide no guarantees of completeness. So let’s think about what happens in each of these cases.
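The core check can be pictured with a toy union-find over expression strings (a real e-graph also tracks congruence and subterm sharing, which this sketch omits): each rewrite merges the equivalence classes of its two sides, and two expressions are "proved" equivalent when they end up with the same representative.

```python
class EClasses:
    # toy union-find: applying a rewrite merges the classes of its two sides
    def __init__(self):
        self.parent = {}

    def find(self, e):
        self.parent.setdefault(e, e)
        while self.parent[e] != e:
            e = self.parent[e]
        return e

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

ec = EClasses()
ec.union("x + x", "2*x")     # rewrite: x + x -> 2*x
ec.union("2*x", "x << 1")    # rewrite: 2*x -> x << 1
```

After these two rewrites, `x + x` and `x << 1` share a class; `3*x` does not, which, as noted above, proves nothing either way.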

If they end up in the same equivalence class, then our tool thinks it has proven the two expressions to be equivalent. But no industrial design team will trust our tool just because it says so – and rightly so! – tool writers make mistakes too. However, work by Sam and colleagues from the Universities of Washington and Utah has provided a methodology by which we can use our tool’s reasoning to export hints, in the form of intermediate proof steps, to a trusted industrial formal verification tool. And so this is how we proceed: our tool generates many, sometimes hundreds, of hints that help state-of-the-art industrial formal verification tools find proofs more quickly – sometimes much more quickly – than they could beforehand.

So what if the two expressions don’t end up in the same equivalence class, despite actually being equivalent? No problem! We can still find two expressions whose proof of equivalence, via some external tool, makes the overall proof we’re aiming for much easier, crunching the proof down to its hard essence and stripping away what we can. The main technical idea here is how to extract the expressions from the e-graph. Rather than aiming for the best possible implementation cost, as we did with ROVER, we aim to minimise the distance between the two implementations that are left for the industrial tool to verify. The overall process is shown in the figure below, where “EC” refers to an equivalence check conducted by the industrially-trusted tool and S^* and I^* are the extracted “close” expressions.

We show one example where, without the hints provided by our tool, the industrial formal verifier does not complete within a 24-hour timeout. Even when results are achievable with the industrial formal verifier alone, we still see up to a 6x speedup in proof convergence.

This work forms a really nice “power use” of e-graphs. If you are at FMCAD 2023, come along and hear Sam talk about it in person!

Discovering Special Cases

My PhD student Sam Coward (jointly advised by Theo Drane from Intel) is about to head off on a speaking tour, where he will be explaining some of our really exciting recent developments. We’ve developed an approach that allows the discovery, exploitation, and generation of datapath hardware that works well in certain special cases. We’ve developed some theory (based on abstract interpretation), some software (based on egg), and some applications (notably in floating-point hardware design). Sam will be formally presenting this work at SOAP, EGRAPHS, and DAC over the coming weeks. In this blog post, I will try to explain the essence of the work. More detail can be found in the papers, primarily [1,2], with [3] for background.

We know that sometimes we can take shortcuts in computation. As a trivial example, we know that \text{abs}(x) can just be replaced by x for non-negative values of x. Special cases abound, and are often used in complex ways to create really efficient hardware. A great example of this is the near/far-path floating-point adder. Since the publication of this idea by Oberman and Flynn in the late 1990s, designs based on this have become standard in modern hardware. These designs use the observation that there are two useful distinct regimes to consider when adding two values of differing sign. If the numbers are close in magnitude then very little work has to be done to align their mantissas, yet a lot of work might be required to renormalise the result of the addition. On the other hand, if the two numbers are far in magnitude then a lot of work might be needed to align their mantissas, yet very little is required to renormalise the result of the addition. Thus we never see both alignment and renormalisation being significant computational steps.
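The two regimes above can be illustrated with a back-of-the-envelope Python sketch (my own illustration, not production datapath code), using exponent differences as a proxy for the alignment and renormalisation shift amounts:

```python
import math

def add_regimes(a, b):
    # For an effective subtraction a + b (opposite signs), estimate the two
    # datapath costs: the alignment shift (the exponent difference) and the
    # renormalisation shift (leading bits cancelled in the result).
    ea, eb = math.frexp(a)[1], math.frexp(b)[1]
    align_shift = abs(ea - eb)
    s = a + b
    norm_shift = 0 if s == 0 else max(ea, eb) - math.frexp(s)[1]
    return align_shift, norm_shift
```

Close magnitudes (say 1.0 and -0.9999) need almost no alignment but a large renormalisation; far magnitudes (say 1.0 and -2^-30) need a large alignment but almost no renormalisation. The two shifts are never both large, which is exactly the observation the near/far-path adder exploits.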

Readers of this blog may remember that Sam, Theo and I published a paper at ARITH 2022 that demonstrated that e-graphs can be used to discover hardware that is equivalent in functionality, but better in power, performance, and area. E-graphs are built by repeatedly using rewrite rules of the form \ell \to r, e.g. \texttt{x + x} \to \texttt{2*x}. But our original ARITH paper wasn’t able to consider special cases. What we really need for that is some kind of conditional rewrite rules, e.g. x \geq 0 \Rightarrow \texttt{abs(x)} \to \texttt{x}, where I am using math script to denote the value of a variable and teletype script to denote an expression.

So we set out to answer:

  1. how can we deal with conditional rewrites in our e-graphs?
  2. how can we evaluate whether a condition is true in a certain context?
  3. how can we make use of this to discover and optimise special cases in numerical hardware?

Based on an initial suggestion from Pavel Panchekha, Sam developed an approach to conditionality by imagining augmenting the domain in which we’re working with an additional element, let’s call it *. Now let’s imagine introducing a new operator \texttt{assume} that takes two expressions, the second of which is interpreted as a Boolean. Let’s give \texttt{assume} the following semantics: \llbracket \texttt{assume(x,c)} \rrbracket = \llbracket \texttt{x} \rrbracket \text{ if } \llbracket c \rrbracket \text{ , and } * \text{, otherwise}. In this way we can ‘lift’ equivalences in a subdomain to equivalences across the whole domain, and use e-graphs without modification to reason about these equivalences. Taking the absolute value example from previously, we can write this equivalence as \texttt{assume( abs(x), x >= 0 )} \to \texttt{assume( x, x >= 0 )}. These \texttt{assume} function symbols then appear directly within the e-graph data structure. Note that the assume on the right-hand side here is important: we need both sides of the rewrite to evaluate to the same value for all possible values of x, and they do: for negative values they both evaluate to * and for non-negative values they both evaluate to x.
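A tiny Python model of these semantics makes the point concrete (with a sentinel object standing in for the extra element *; the encoding as host-language functions is mine, not how the e-graph actually represents terms):

```python
STAR = object()   # the extra domain element '*'

def assume(x, c):
    # [[assume(x, c)]] = [[x]] if [[c]] holds, and * otherwise
    return x if c else STAR

def lhs(x):   # assume( abs(x), x >= 0 )
    return assume(abs(x), x >= 0)

def rhs(x):   # assume( x, x >= 0 )
    return assume(x, x >= 0)
```

Both sides agree everywhere: for non-negative x they both evaluate to x, and for negative x they both collapse to *. That total agreement is what lets an unmodified e-graph treat the conditional rewrite as an ordinary one.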

So how do we actually evaluate whether a condition is true in a given context? This is essentially a program analysis question. Here we make use of a variation of classical interval arithmetic. Traditionally, however, interval arithmetic has been a fairly weak program analysis method. As an example, if we know that x \in [-1,1], then a classical evaluation of x - x would give [-2,2], due to the loss of information about the correlation between the left- and right-hand sides of the subtraction operator. Once again, our e-graph setting comes to the rescue! Taking this example, a rewrite \mathtt{x - x} \to \mathtt{0} would likely fire, resulting in zero residing in the same e-class as \mathtt{x - x}. Since the interval associated with \mathtt{0} is [0,0], the same interval will automatically be associated with \mathtt{x - x} by our software, leading to a much more precise analysis.
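The x - x example can be sketched in a few lines of Python, with intervals as pairs (a simplified picture of the e-class analysis, which in the real tool runs over the whole e-graph):

```python
def isub(a, b):
    # classical interval subtraction: loses the correlation between operands
    return (a[0] - b[1], a[1] - b[0])

def intersect(a, b):
    # every member of an e-class denotes the same value, so the class
    # interval is the intersection of its members' intervals
    return (max(a[0], b[0]), min(a[1], b[1]))

x = (-1.0, 1.0)
naive = isub(x, x)                      # classical analysis of x - x
refined = intersect(naive, (0.0, 0.0))  # after the rewrite x - x -> 0 fires
```

The naive analysis yields [-2, 2]; intersecting with the interval of its e-class sibling 0 recovers the exact [0, 0].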

This interaction between rewrites and conditions goes even further: a more precise analysis leads to the possibility to fire more conditional rewrite rules, as more conditions will be known to hold; firing more rewrite rules results in an even more precise analysis. The two techniques reinforce each other in a virtuous cycle:

A virtuous cycle: greater precision leads to more rewrites, which lead to greater precision.

Our technique is able to discover, and generate RTL for, near/far-path floating-point adders from a naive RTL implementation (left transformed to right, below), resulting in a 33% performance advantage for the hardware.

Left: A naive floating-point subtractor. Right: The subtractor produced by our software.

I’m really excited by what Sam’s been able to achieve, as I really think that this kind of approach has the potential to lead to huge leaps forward in electronic design automation for word-level designs.