FPGA & HPCA 2025

I recently returned from two back-to-back conferences, FPGA 2025 in Monterey, California and HPCA 2025 in Las Vegas, Nevada. In this blog post, I will summarise some of the things I found most interesting at these conferences.

Before I even got to the first conference, I was delighted to have the chance to meet in San Francisco with Cole Harry, who runs the new Imperial Global USA. They have an exciting plan of work to develop links between Imperial and academics, industrialists, alumni and VCs in the Bay Area. I would strongly recommend reaching out to Cole if you are based in the Bay Area and would like to get involved.

FPGA, the ACM International Symposium on FPGAs, is always a joy to attend. This year we had a great balance of industry and academia attending, as is often the case. The conference recently moved to introduce keynote talks. I’m on the fence about the value of keynotes at FPGA, but this year they were both exceptionally good. The first was from Steve Reinhardt (Senior Fellow, AMD) and the second was from my long-term colleague John Wawrzynek (UC Berkeley). It was very gratifying that both keynote speakers singled out our work on LUT-based machine learning, started by my PhD student Erwei Wang (now with AMD) with his 2019 paper LUTNet, as an example of where the field should be heading in the future. In Steve’s case, this was part of his overall summary of architectures for AI. In John’s case, this was part of his call to follow Carver Mead‘s advice to “listen to what the silicon is telling you!” John’s keynote was a wonderful trip down memory lane for me – he highlighted many times in the last 30 years or so where the FPGA community has been well ahead of the broader community in working through and adopting various technologies and ideas. It was great to be reminded of the papers I had seen presented – and got excited about – when I was a PhD student myself (1998-2001). John also gave a personal shout out to my PhD student Marta Andronic for the great work she is doing.

Session 1 of FPGA was on AI for FPGAs. The first paper, FlightVGM: Efficient Video Generation Model Inference with Online Sparsification and Hybrid Precision on FPGAs, was a collaboration between various Chinese universities. This paper won the best paper prize at the conference. I liked their unusual packing of DSP blocks for non-standard word-lengths. The second paper was from Alireza Khataei, the PhD student of my colleague and friend Kia Bazargan. They presented an intriguing approach to using decision trees as FPGA inference engines. The results were good, and have left me wondering how brittle they may be to linear transformations of the input space, given that DNN based work will be invariant to these transformations (modulo quantisation error) whereas the axis-alignment of these decision tree boundaries will not. The third paper was a collaboration with us at Imperial (and others) led by Olivia Weng from Ryan Kastner‘s group at UCSD. Olivia presented an empirical exploration of the ensembling of weak classifiers, including our LUT-based classifiers. The final presentation of this session was from our group, a short paper by Olly Cassidy, our undergraduate student, which I describe in an earlier blog post.

Session 2 of FPGA was on CAD. It began with my former PhD student David Boland presenting a collaboration he has undertaken with my other former PhD student Kan Shi and others on efficient (simulation-based) verification of High-level Synthesis (HLS) designs, using FPGA-based acceleration.

Session 3 of FPGA was on HLS. This session included some interesting work presented by Jinming Zhuang, from Brown (with the excellent Peipei Zhou) and Cornell (including my friend Zhiru Zhang) on MLIR for compilation targeting AMD’s AI engines, and also great work from Suhail Basalama (who used to be affiliated with my EPSRC SpatialML centre) and his advisor and my long-term colleague Jason Cong from UCLA. It was really nice to see the shared-buffer to FIFO conversion in this work.

Session 5 of FPGA was on architecture. Readers of this blog may remember the work on dynamic HLS started by Lana Josipović (now at ETH) when she was Paolo Ienne‘s PhD student at EPFL. Authors of the first paper, presented by Louis Coulon from EPFL asked the question of how one may wish to redesign FPGA architecture to better suit this model of HLS. I also liked the second talk, a collaboration between several universities, presenting a paper on incorporating $n$ -out-of- $k$ element sparsity in tensor processing tiles.

Session 6 of FPGA was also on High-level Synthesis. Notable contributions included a presentation from Stéphane Pouget (UCLA) on work with Louis-Noël Pouchet (Colorado State) and Jason Cong that proposed a MINLP to combine pragma insertion with loop transformations; back in 2009, I began to look at nonlinear programming for HLS transformations with my former student Qiang Liu – the paper at FPGA this year explored a really interesting practical design space for modern HLS. My former PhD student David Boland again presented in this session, this time presenting a collaboration between two of my other former PhD students who could not make the conference: Jianyi Cheng and Kan Shi (the latter mentioned above) and others, this time on verification of dynamically-scheduled high-level synthesis. The third talk on this session was presented by Robert Szafarczyk and coauthors from the University of Glasgow, looking at dynamic loop fusion in high-level synthesis based on an interesting monotonicity program analysis; dynamic transformations hold out a lot of promise – I began to look at this with my student Junyi Liu in 2015 – the paper at FPGA this year provides an interesting and different new direction.

HPCA in Las Vegas was a new conference to me. I was prompted to attend due to a collaboration I had with Mohamed Abdelfattah‘s group at Cornell Tech, under the auspices of my EPSRC Centre-to-Centre grant. This collaboration led to my PhD student Marta Andronic spending some months embedded in Mohamed’s group at Cornell Tech, and out of this grew a paper presented at HPCA this year by Yuzong Chen, Mohamed’s PhD student. Yuzong presented both a memory-efficient encoding and a corresponding accelerator architecture for LLM acceleration.

HPCA was co-located with PPoP and CGO, and they shared keynote sessions. Charles Leiserson – a household name in computing – gave the first keynote, associated with PPoP. The presentational style was uniquely engaging. The message was simple: the end of Moore’s Law demands more focus on performance engineering. It’s a message that’s not new, but was wonderfully delivered.

The second keynote, associated with CGO, was given by John Regehr. This was also excellent, spanning work (of his group and others) on compiler correctness and fuzzing, formalisation of compilers, alive2, souper, and bit-width independent rewrites. The latter topic is one that John’s group and mine have communicated over in the past, as it arose in the context of my PhD student Sam Coward‘s work, where we would ideally have liked bit-width independent rewrites, but settled for proving rewrites correct for reasonable bit-width ranges. John’s talk emphasised the social and economic foundations of impact in the compiler world.

The final keynote, associated with HPCA, was given by Cliff Young from Google. This talk was truly outstanding in both content and delivery. He started with a good summary of ML algorithms from an architect’s perspective. He spoke about TPUs and systems built out of TPUs at Google. Perhaps more significant, from my perspective, than the (great) technical content was the non-technical content of his talk. Cliff spoke about how the academic community is key to the long-term health of the field, and how even at Google it is literally only a handful of people who have the ability to think as long-term as academics, as people are just too busy building things. He emphasised the need for major algorithmic developments: “modern ML is not efficient, it is only effective“, was his slogan, alongside the tongue-in-cheek “I very much doubt that the transformer is all we need”. He reflected on the fallow periods in his career, and emphasised that they were absolutely necessary to enable the productive periods – a lesson that could be well learnt by research assessment processes across the world: “the only thing you can actually optimise in your career is, ‘are you enjoying yourself and learning?'” – a great manifesto. He spoke about his own imposter syndrome and about the social anxiety prevalent amongst some of the most impressive international researchers – he also had a mitigation: working together across discipline boundaries allows people to be ‘the respected expert’ in their area without the internalised expectation that you must know everything, providing an element of psychological safety. He spoke about his preference for simplicity over building complex things (something I very much share). And amusingly shared “the best piece of professional advice I was ever given”, which turned out to be “Wow, if you’d only shut the fuck up a bit, you’d be 1000x better!” This lecture was a joy to listen to.

In addition to meeting new and old friends over the past couple of weeks, it was a wonderful to meet students of former students. This year, I got to meet Muhammad Ali Farooq, who has just started off on a PhD programme with Aman Arora, but before that was the student of my former PhD student Abid Rafique at NUST.

The least joyful part of my trip was Las Vegas – a city that seems to have been designed to induce sensory overload. But no matter: the conferences definitely made up for it. And the highlight of my trip was most definitely the weekend between the two conferences where I got to spend time with the lovely family of my former PhD student Sam Bayliss in the Bay Area.

Machine Learning with Large Lookup Tables

Readers of this blog will know that I have been interested in how to bridge the worlds of Boolean logic and machine learning, ever since I published a position paper in 2019 arguing that this was the key to hardware-efficient ML.

Since then, I have been working on these ideas with several of my PhD students and collaborators, most recently my PhD student Marta Andronic‘s work forms the leading edge of the rapidly growing area of LUT-based neural networks (see previous blog posts). Central to both Marta’s PolyLUT and NeuraLUT work (and also LogicNets from AMD/Xilinx) is the idea that one should train Boolean truth tables (which we call L-LUTs for logical LUTs) which then, for an FPGA implementation, get mapped into the underlying soft logic (which we call P-LUTs, for physical LUTs).

Last Summer, Marta and I had the pleasure of supervising a bright undergraduate student at Imperial, Olly Cassidy, who worked on adapting some ideas for compressing large lookup tables coming out of the lab of my friend and colleague Kia Bazargan, together with his student Alireza Khataei at the University of Minnesota, to our setting of efficient LUT-based machine learning. Olly’s paper describing his summer project has been accepted by FPGA 2025 – the first time I’ve had the pleasure to send a second-year undergraduate student to a major international conference to present their work! In this blog post, I provide a simple introduction to Olly’s work, and explain my view of one of the most interesting aspects, ahead of the conference.

A key question in the various LUT-based machine learning frameworks we have introduced, is how to parameterise the space of the functions implemented in the LUTs. Our first work in this area, LUTNet, with my former PhD student Erwei Wang (now with AMD), took a fully general approach: if you want to learn a $K$ -input Boolean function, then learn all $2^K$ lines in that function’s truth table. Since then, Marta and I have been exploring ways of parameterising that space to decouple the complexity of the function-classes implemented from the number of inputs. This gave rise to PolyLUT (parameterised as polynomials) and NeuraLUT (parameterised as small neural networks). Once we have learnt a function $f$ , all these methods enumerate the inputs of the function for the discrete space of quantised activations to produce the L-LUT. Olly’s work introduces `don’t cares’ into the picture: if a particular combination of inputs to the function is never, or rarely, seen in the training data, then the optimisation is allowed to treat the function as a don’t care at that point.

Olly picked up CompressedLUT from Khataei and Bazargan, and investigated the injection of don’t care conditions into their decomposition process. The results are quite impressive: up to a 39% drop in the P-LUTs (area) required to implement the L-LUTs, with near zero loss in classification accuracy of the resulting neural network.

To my mind, one of the most interesting aspects of Olly’s summer work is the observation that aggressively targeting FPGA area reduction through don’t care conditions without explicitly modelling the impact on accuracy, nevertheless has a negligible or even a positive impact on test accuracy. This can be interpreted as a demonstration that (i) the generalisation capability of the LUT-based network is built into the topology of the NeuraLUT network and (ii) that, in line with Occam’s razor, simple representations – in this case, simple circuits – generalise better.

Our group is very proud of Olly!

School Accountability Reform

Following hot on the heels of the Ofsted consultation, the Department for Education has launched a new consultation on English school accountability reform. This is a system that very much does need reform! In this post, I will briefly summarise my view on the Government’s proposals.

Firstly, I think it’s unfortunate that the phrase “school accountability” has stuck. It smacks too much of “we will give you the slack to let you fail, but woe betide you if you do!”. I would much prefer something like “school improvement framework”.

Having said that, I think “Purposes and Principles” outlined by the government in the consultation are sound. But what about the detailed measures proposed?

Profiles

The proposal for School Profiles, incorporating but going beyond the new Ofsted report card (my very brief comments on Ofsted proposals here) is perfectly reasonable, but nothing particularly new (check out GIAS!). More fundamental, in my view, is the need to revisit what counts as “school performance data” (hint: Attainment 8 ain’t it!) But sadly there is nothing in the consultation about this. One would at least hope that certain data hidden behind an ASP login might become public in the short term.

Intervention

It is disappointing that by default, a maintained school placed in special measures will become an academy but there is no scope for an academy placed in special measures to become a maintained school.

On the other hand RISE Teams are a good idea, especially sign-posting of best practice, regional events for school staff, etc. They are not a new idea – remember when local authorities could actually afford support teams, anyone? – but they are a good idea nevertheless. I support the mandatory nature of some interventions, which for academies will presumably come via Section 43 of the Children’s Wellbeing and Schools Bill. I am a little worried, though, that currently the RISE teams seem to be described as brokers of support “with a high-quality organisation” rather than as actually having in-house expertise – there is a danger that RISE Teams become ways to mandate schools to buy in services from favoured MATs. The devil will be in the implementation.

It is disappointing that there has been no focus so far by this government on the structural problems present in the sector. The former government’s failed Schools Bill 2022, while having many significant problems, did at least aim to replace the patchwork of academy funding contracts signed at different times with different models with a uniform footing. And the peculiar nature of the Single Academy Trust remains an untackled issue to this day.

Overall

Overall, I would say the proposals are OK. More of a tinkering around the edges than anything profound, although the RISE proposals have some promise and could – with the right resourcing, local democratic control, and remit, genuinely help the sector with self-sustaining school improvement.

Ontology and Oppression

This Autumn I read Katharine Jenkins’ book Ontology and Oppression. The ideas and approaches taken by Jenkins resonated with me, and I find myself consciously or subconsciously applying them in many contexts beyond those she studies. I therefore thought it was worth a quick blog post to summarise the key ideas, in case others find them helpful and to recommend you also read Jenkins’ work if you do.

Jenkins studies the ontology of “social kinds” from a pluralist perspective – that there can be many different definitions of social kinds of the same name, e.g. ‘woman’, ‘Black’ – and that several of them can be useful and/or the right tool to understand the world in the right circumstances. After a general theoretical introduction, she focuses on gender and race to find examples of such kinds, but the idea is clearly applicable much more broadly.

Jenkins begins by describing her “Constraints and Enablements” framework, arguing that what it means to be a member of a social kind is at least partly determined by being subject to certain social constraints and enablements, which Jenkins classifies in certain ways. These can be imposed on you by (some subset of) society or can even be self-imposed through self-identification as a member of a given social kind. Jenkins defines two types of wrong that can come about as a result of being considered a member of a given social kind, ‘ontic injustice’, where the constraints and enablements constitute a ‘wrong’, and a proper subclass, ‘ontic oppression’, where the constraints and enablements additionally “steer individuals in this kind towards exploitation, marginalisation, powerlessness, cultural domination, violence and/or communicative curtailment”. She argues that a pluralist framework can be valuable as a philosophical tool for liberation, and studies how intersectionality arises naturally in her approach.

The race and gender kinds Jenkins studies, she classifies as ‘hegemonic kinds’, ‘interpersonal kinds’ and ‘identity kinds’. I find this classification compelling for wanting to really understand power structures and help people rather than simply shout about identity politics from the sidelines – a form of intervention that sadly characterises much of the ‘debate’ in ‘culture wars’ at the moment. It also provides a useful toolbox to understand how a social kind (e.g. ‘Black’, ‘woman’) can be both hegemonically oppressive and yet corresponding interpersonal and identity kinds can sometimes serve an emancipatory function.

Ultimately, Jenkins’ description allows us to break away from some of the more ridiculous lines of argument we’ve seen in recent years, trying to ‘define away’ issues. At the end of the book, Jenkins takes aim at the ‘ontology-first approach’: the idea that one should first settle ‘the’ meaning of a social kind, e.g. ‘what is a woman?’ and from that apply appropriate (in this case gendered) social practices. This approach, so widespread in society, Jenkins shows does not fit with her framework. She challenges us to ask: what do we actually want to change about society? And from that, to understand what kinds make sense to talk about, and in what context, and how.

NeuraLUT: Networks inside LUTs

In early September, my PhD student Marta Andronic will be off to Turin to present our latest work “NeuraLUT: Hiding Neural Network Density in Boolean Synthesizable Functions” at the Field-Programmable Logic and Applications conference. Ahead of the detailed presentation at the conference, this blog post provides a short accessible summary of her exciting work.

In 2019 I first proposed making better use of FPGA lookup tables by exposing them as trainable hardware, together with my then PhD student Erwei Wang and coauthors, in our LUTNet work. In common with AMD’s LogicNets and our PolyLUT, our new work NeuraLUT hides certain aspects of a neural network within a synthesisable Boolean lookup table (which we call an L-LUT), to achieve very efficient and very low latency inference. LogicNets hid a dot product and activation function – the clever thing in LogicNets was that, as a result, the weights can be real-valued – no quantisation needs to be performed, because the only thing that’s important is the finite truth table of the lookup table; once this has been enumerated, the real-valued weights are irrelevant, the only quantisation is at the inputs and outputs of the L-LUT. The tradeoff here is that LogicNets networks needed to be extremely sparse.

NeuraLUT takes this a step beyond by hiding whole neural networks inside Boolean lookup tables! These internal neural networks can be fully dense – or even irregularly connected – and real-valued in both weight and activation, for the same reason. The only thing that’s important is that the inputs and outputs of these “sub networks” are quantised and connections between sub networks are sparse, because these are the only parts that get exposed to the hardware design itself. One can interpret the resulting network as a standard deep neural network, with a specific hardware-friendly sparsity pattern, as illustrated in the figure below.

The increased expressive power of NeuraLUT leads to considerable reductions in latency. We’re targeting here very low latency applications like you may find in particle physics. 12 nanosecond MNIST classification, anyone? 3 nanoseconds to tag jet substructures in a particle accelerator? Come and listen to Marta’s talk to find out how!

Open Source MX Library

Readers of this blog may be aware that several key industrial players recently released the MX standard for low-precision computation, mainly targeting its use in machine learning. I reviewed the standard in an earlier blog post.

I’m pleased to report that my PhD student Ebby Samson has released an open source RTL hardware implementation of the key operators from the standard. In joint work with Naveen Mellempudi from AMD, Wayne Luk from Imperial and myself, he describes the library in our forthcoming paper at the International Conference on Field-Programmable Logic and Applications. If you will be in Turin in early September, please come and hear Ebby talking about his work.

The library supports all the concrete formats in the standard and more besides. Ebby has also released an extension to the AMD Brevitas quantisation-aware training PyTorch library that lets you train your models with eventual MX implementation in mind.

Please do read our paper, integrate our hardware designs into your work, and use our Brevitas library to do your neural network training! Links to all in the paper.

Notes on Computational Learning Theory

This blog collects some of my notes on classical computational learning theory, based on my reading of Kearns and Vazirani. The results are (almost) all from their book, the sloganising (and mistakes, no doubt) are mine.

The Probably Approximately Correct (PAC) Framework

Definition (Instance Space). An instance space is a set, typically denoted $X$ . It is the set of objects we are trying to learn about.

Definition (Concept). A concept $c$ over $X$ is a subset of the instance space $X$ .

Although not covered in Kearns and Vazirani, in general it is possible to generalise beyond Boolean membership to some degree of uncertainty or fuzziness – I hope to cover this in a future blog post.

Definition (Concept Class). A concept class ${\mathcal C}$ is a set of concepts, i.e. ${\mathcal C} \subset \mathcal{P}(X)$ , where $\mathcal P$ denotes power set. We will follow Kearns and Vazirani and also use $c$ to denote the corresponding indicator function $c : X \to \{0,1\}$ .

In PAC learning, we assume ${\mathcal C}$ is known, but the target class $c \in {\mathcal C}$ is not. However, it doesn’t seem a jump to allow for unknown target class, in an appropriate approximation setting – I would welcome comments on established frameworks for this.

Definition (Target Distribution). A target distribution ${\mathcal D}$ is a probability distribution over $X$ .

In PAC learning, we assume ${\mathcal D}$ is unknown.

Definition (Oracle). An oracle is a function $EX(c,{\mathcal D})$ taking a concept class and a distribution, and returning a labelled example $(x, c(x))$ where $x$ is drawn randomly and independently from ${\mathcal D}$ .

Definition (Error). The error of a hypothesis concept class $h \in {\mathcal C}$ with reference to a target concept class $c \in {\mathcal C}$ and target distribution ${\mathcal D}$ , is $\text{error}(h) = Pr_{x \in {\mathcal D}}\left\{ c(x) \neq h(x) \right\}$ , where $Pr$ denotes probability.

Definition (Representation Scheme). A representation scheme for a concept class ${\mathcal C}$ is a function ${\mathcal R} : \Sigma^* \to {\mathcal C}$ where $\Sigma$ is a finite alphabet of symbols (or – following the Real RAM model – a finite alphabet augmented with real numbers).

Definition (Representation Class). A representation class is a concept class together with a fixed representation scheme for that class.

Definition (Size). We associate a size $\text{size}(\sigma)$ with each string from a representation alphabet $\sigma \in \Sigma^*$ . We similarly associate a size with each concept $c$ via the size of its minimal representation $\text{size}(c) = \min_{R(\sigma) = c} \text{size}(\sigma)$ .

Definition (PAC Learnable). Let ${\mathcal C}$ and ${\mathcal H}$ be representation classes classes over $X$ , where ${\mathcal C} \subseteq {\mathcal H}$ . We say that concept class ${\mathcal C}$ is PAC learnable using hypothesis class ${\mathcal H}$ if there exists an algorithm that, given access to an oracle, when learning any target concept $c \in {\mathcal C}$ over any distribution ${\mathcal D}$ on $X$ , and for any given $0 < \epsilon < 1/2$ and $0 < \delta < 1/2$ , with probability at least $1-\delta$ , outputs a hypothesis $h \in {\mathcal H}$ with $\text{error}(h) \leq \epsilon$ .

Definition (Efficiently PAC Learnable). Let ${\mathcal C}_n$ and ${\mathcal H}_n$ be representation classes classes over $X_n$ , where ${\mathcal C}_n \subseteq {\mathcal H}_n$ for all $n$ . Let $X_n = \{0,1\}^n$ or $X_n = {\mathbb R}^n$ . Let $X = \cup_{n \geq 1} X_n$ , ${\mathcal C} = \cup_{n \geq 1} {\mathcal C_n}$ , and ${\mathcal H} = \cup_{n \geq 1} {\mathcal H_n}$ . We say that concept class ${\mathcal C}$ is efficiently PAC learnable using hypothesis class ${\mathcal H}$ if there exists an algorithm that, given access to a constant time oracle, when learning any target concept $c \in {\mathcal C}_n$ over any distribution ${\mathcal D}$ on $X$ , and for any given $0 < \epsilon < 1/2$ and $0 < \delta < 1/2$ :

Runs in time polynomial in $n$ , $\text{size}(c)$ , $1/\epsilon$ , and $1/\delta$ , and
With probability at least $1-\delta$ , outputs a hypothesis $h \in {\mathcal H}$ with $\text{error}(h) \leq \epsilon$ .

There is much of interest to unpick in these definitions. Firstly, notice that we have defined a family of classes parameterised by dimension $n$ , allowing us to talk in terms of asymptotic behaviour as dimensionality increases. Secondly, note the key parameters of PAC learnability: $\delta$ (the ‘probably’ bit) and $\epsilon$ (the ‘approximate’ bit). The first of these captures the idea that we may get really unlucky with our calls to the oracle, and get misleading training data. The second captures the idea that we are not aiming for certainty in our final classification accuracy, some pre-defined tolerance is allowable. Thirdly, note the requirements of efficiency: polynomial scaling in dimension, in size of the concept (complex concepts can be harder to learn), in error rate (the more sloppy, the easier), and in probability of algorithm failure to find a suitable hypothesis (you need to pay for more certainty). Finally, and most intricately, notice the separation of concept class from hypothesis class. We require the hypothesis class to be at least as general, so the concept we’re trying to learn is actually one of the returnable hypotheses, but it can be strictly more general. This is to avoid the case where the restricted hypothesis classes are harder to learn; Kearns and Vazirani, following Pitt and Valiant, give the example of learning the concept class 3-DNF using the hypothesis class 3-DNF is intractable, yet learning the same concept class with the more general hypothesis class 3-CNF is efficiently PAC learnable.

Occam’s Razor

Definition (Occam Algorithm). Let $\alpha \geq 0$ and $0 \leq b < 1$ be real constants. An algorithm is an $(\alpha,\beta)$ -Occam algorithm for ${\mathcal C}$ using ${\mathcal H}$ if, on an input sample $S$ of cardinality $m$ labelled by membership in $c \in {\mathcal C}_n$ , the algorithm outputs a hypothesis $h \in {\mathcal H}$ such that:

$h$ is consistent with $S$ , i.e. there is no misclassification on $S$
$\text{size}(h) \leq \left(n \cdot \text{size}(c)\right)^\alpha m^\beta$

Thus Occam algorithms produce succinct hypotheses consistent with data. Note that the size of the hypothesis is allowed to grow only mildly – if at all – with the size of the dataset (via $\beta$ ). Note, however, that there is nothing in this definition that suggests predictive power on unseen samples.

Definition (Efficient Occam Algorithm). An $(\alpha,\beta)$ -Occam algorithm is efficient iff its running time is polynomial in $n$ , $m$ , and $\text{size}(c)$ .

Theorem (Occam’s Razor). Let $A$ be an efficient $(\alpha,\beta)$ -Occam algorithm for ${\mathcal C}$ using ${\mathcal H}$ . Let ${\mathcal D}$ be the target distribution over $X$ , let $c \in {\mathcal C}_n$ be the target concept, $0 < \epsilon, \delta \leq 1$ . Then there is a constant $a > 0$ such that if $A$ is given as input a random sample $S$ of $m$ examples drawn from oracle $EX(c,{\mathcal D})$ , where $m$ satisfies $m \geq a \left( \frac{1}{\epsilon} \log \frac{1}{\delta} + \left(\frac{\left( n \cdot \text{size}(c) \right)^\alpha}{\epsilon}\right)^\frac{1}{1-\beta}\right)$ , then $A$ runs in time polynomial in $n$ , $\text{size}(c)$ , $1/\epsilon$ and $\frac{1}{\delta}$ and, with probability at least $1 - \delta$ , the output $h$ of $A$ satisfies $error(h) \leq \epsilon$ .

This is a technically dense presentation, but it’s a philosophically beautiful result. Let’s unpick it a bit, so its essence is not obscured by notation. In summary, simple rules that are consistent with prior observations have predictive power! The ‘simple’ part here comes from $(\alpha,\beta)$ , and the predictive power comes from the bound on $\text{error}(h)$ . Of course, one needs sufficient observations (the complex lower bound on $m$ ) for this to hold. Notice that as $\beta$ approaches 1, and so – by the definition of an Occam algorithm – we get close to being able to memorise our entire training set – we need an arbitrarily large training set (memorisation doesn’t generalise).

Vapnik-Chervonenkis (VC) Dimension

Definition (Behaviours). The set of behaviours on $S = \{x_1, \ldots, x_m\}$ that are realised by ${\mathcal C}$ , is defined by $\Pi_{\mathcal C}(S) = \left\{ \left(c(x_1), \ldots, c(x_m)\right) | c \in {\mathcal C} \right\}$ .

Each of the points in $S$ is either included in a given concept or not. Each tuple $\left(c(x_1), \ldots, c(x_m)\right)$ then forms a kind of fingerprint of $X$ according to a particular concept. The set of behaviours is the set of all such fingerprints across the whole concept class..

Definition (Shattered). A set $S$ is shattered by ${\mathcal C}$ iff $\Pi_{\mathcal C}(S) = \{0,1\}^{|S|}$ .

Note that $\{0,1\}^{|S|}$ is the maximum cardinality that’s possible, i.e. the set of behaviours is all possible behaviours. So we can think of a set as being shattered by a concept class iff there’s no combination of inclusion/exclusion in the concepts that isn’t represented at least once in the set.

Definition (Vapnik-Chervonenkis Dimension). The VC dimension of ${\mathcal C}$ , denoted $VCD({\mathcal C})$ , is the cardinality of the largest set shattered by ${\mathcal C}$ . If arbitrarily large finite sets can be shattered by ${\mathcal C}$ , then $VDC({\mathcal C}) = \infty$ .

VC dimension in this sense captures the ability of ${\mathcal C}$ to discern between samples.

Theorem (PAC-learning in Low VC Dimension). Let ${\mathcal C}$ be any concept class. Let ${\mathcal H}$ be any representation class off of VC dimension $d$ . Let $A$ be any algorithm taking a set of $m$ labelled examples of a concept $c \in {\mathcal C}$ and producing a concept in ${\mathcal H}$ that is consistent with the examples. Then there exists a constant $c_0$ such that $A$ is a PAC learning algorithm for ${\mathcal C}$ using ${\mathcal H}$ when it is given examples from $EX(c,{\mathcal D})$ , and when $m \geq c_0 \left( \frac{1}{\epsilon} \log \frac{1}{\delta} + \frac{d}{\epsilon} \log \frac{1}{\epsilon} \right)$ .

Let’s take a look at the similarity between this theorem and Occam’s razor, presented in the last section of this blog post. Both bounds have a similar feel, but the VCD-based bound does not depend on $\text{size}(c)$ ; indeed it’s possible that the size of hypotheses is infinite and yet the VCD is still finite.

As the theorem below shows, the linear dependence on VCD achieved in the above theorem is actually the best one can do.

Theorem (PAC-learning Minimum Samples). Any algorithm for PAC-learning a concept class of VC dimension $d$ must use $\Omega(d/\epsilon)$ examples in the worst case.

Definition (Layered DAG). A layered DAG is a DAG in which each vertex is associated with a layer $\ell \in {\mathbb N}$ and in which the edges are always from some layer $\ell$ to the next layer $\ell+1$ . Vertices at layer 0 have indegree 0 and are referred to as input nodes. Vertices at other layers are referred to as internal nodes. There is a single output node of outdegree 0.

Definition ( $G$ -composition). For a layered DAG $G$ and a concept class ${\mathcal C}$ , the G-composition of ${\mathcal C}$ is the class of all concepts that can be obtained by: (i) associating a concept $c_i \in {\mathcal C}$ with each vertex $N_i$ in $G$ , (ii) applying the concept at each node to its predecessor nodes.

Notice that this way we can think of the internal nodes as forming a Boolean circuit with a single output; the $G$ -composition is the concept class we obtain by restricting concepts to only those computable with the structure $G$ . This is a very natural way of composing concepts – so what kind of VCD arises through this composition? This theorem provides an answer:

Theorem (VCD Compositional Bound). Let $G$ be a layered DAG with $n$ input nodes and $s \geq 2$ internal nodes, each of indegree $r$ . Let ${\mathcal C}$ be a concept class over ${\mathbb R}^r$ of VC dimension $d$ , and let ${\mathcal C}_G$ be the $G$ -composition of ${\mathcal C}$ . Then $VCD({\mathcal C}_G) \leq 2ds \log(es)$ .

Weak PAC Learnability

Definition (Weak PAC Learning). Let ${\mathcal C}$ be a concept class and let $A$ be an algorithm that is given access to $EX(c,{\mathcal D})$ for target concept $c \in {\mathcal C}_n$ and distribution ${\mathcal D}$ . $A$ is a weak PAC learning algorithm for ${\mathcal C}$ using ${\mathcal H}$ if there exist polynomials $p(\cdot,\cdot)$ and $q(\cdot,\cdot)$ such that $A$ outputs a hypothesis $h \in {\mathcal H}$ that with probability at least $1/q(n,\text{size}(c))$ satisfies $\text{error}(h) \leq 1/2 - 1/p(n,\text{size}(c))$ .

Kearns and Vazirani justifiably describe weak PAC learning as “the weakest demand we could place on an algorithm in the PAC setting without trivialising the problem”: if these were exponential rather than polynomial functions in $n$ , the problem is trivial: take a fixed-size random sample of the concept and memorise it, randomly guess with probability 50% outside the memorised sample. The remarkable result is that efficient weak PAC learnability and efficient PAC learnability coincide for an appropriate PAC hypothesis class, based on ternary majority trees.

Definition (Ternary Majority Tree). A ternary majority tree with leaves from ${\mathcal H}$ is a tree where each non-leaf node computes a majority (voting) function of its three children, and each leaf is labelled with a hypothesis from ${\mathcal H}$ .

Theorem (Weak PAC learnability is PAC learnability). Let ${\mathcal C}$ be any concept class and ${\mathcal H}$ any hypothesis class. Then if ${\mathcal C}$ is efficiently weakly PAC learnable using ${\mathcal H}$ , it follows that ${\mathcal C}$ is efficiently PAC learnable using a hypothesis class of ternary majority trees with leaves from ${\mathcal H}$ .

Kearns and Varzirani provide an algorithm to learn this way. The details are described in their book, but the basic principle is based on “boosting”, as developed in the lemma to follow.

Definition (Filtered Distributions). Given a distribution ${\mathcal D}$ and a hypothesis $h_1$ we define ${\mathcal D_2}$ to be the distribution obtained by flipping a fair coin and, on a heads, drawing from $EX(c,{\mathcal D})$ until $h_1$ agrees with the label; on a tails, drawing from $EX(c,{\mathcal D})$ until $h_1$ disagrees with the label. Invoking a weak learning algorithm on data from this new distribution yields a new hypothesis $h_2$ . Similarly, we define ${\mathcal D_3}$ to be the distribution obtained by drawing examples from $EX(c,{\mathcal D})$ until we find an example on which $h_1$ and $h_2$ disagree.

What’s going on in these constructions is quite clever: $h_2$ has been constructed so that it must contain new information about $c$ , compared to $h_1$ ; $h_1$ has, by construction, no advantage over a coin flip on ${\mathcal D}_2$ . Similarly, $h_3$ contains new information about $c$ not already contained in $h_1$ and $h_2$ , namely on the points where they disagree. Thus, one would expect that hypotheses that work in these three cases could be combined to give us a better overall hypothesis. This is indeed the case, as the following lemma shows.

Lemma (Boosting). Let $g(\beta) = 3 \beta^2 - 2 \beta^3$ . Let the distributions ${\mathcal D}$ , ${\mathcal D}_2$ , ${\mathcal D}_3$ be defined above, and let $h_1$ , $h_2$ and $h_3$ satisfy $\text{error}_{\mathcal D}(h_1) \leq \beta$ , $\text{error}_{{\mathcal D}_2}(h_2) \leq \beta$ , $\text{error}_{{\mathcal D}_3}(h_3) \leq \beta$ . Then if $h = \text{majority}(h_1, h_2, h_3)$ , it follows that $\text{error}_{\mathcal D}(h) \leq g(\beta)$ .

The function $g$ is monotone and strictly decreasing over $[0,1/2)$ . Hence by combining three hypotheses with only marginally better accuracy than flipping a coin, the boosting lemma tells us that we can obtain a strictly stronger hypothesis. The algorithm for (strong) PAC learnability therefore involves recursively calling this boosting procedure, leading to the majority tree – based hypothesis class. Of course, one needs to show that the depth of the recursion is not too large and that we can sample from the filtered distributions with not too many calls to the overall oracle $EX(c,{\mathcal D})$ , so that the polynomial complexity bound in the PAC definition is maintained. Kearns and Vazirani include these two results in the book.

Learning from Noisy Data

Up until this point, we have only dealt with correctly classified training data. The introduction of a noisy oracle allows us to move beyond this limitation.

Definition (Noisy Oracle). A noisy oracle $\hat{EX}^\eta( c, {\mathcal D})$ extends the earlier idea of an oracle with an additional noise parameter $0 \leq \eta < 1/2$ . This oracle behaves in the identical way to $EX$ except that it returns the wrong classification with probability $\eta$ .

Definition (PAC Learnable from Noisy Data). Let ${\mathcal C}$ be a concept class and let ${\mathcal H}$ be a representation class over $X$ . Then ${\mathcal C}$ is PAC learnable from noisy data using ${\mathcal H}$ if there exists and algorithm such that: for any concept $c \in {\mathcal C}$ , any distribution ${\mathcal D}$ on $X$ , any $0 \leq \eta < 1/2$ , and any $0 < \epsilon < 1$ , $0 < \delta < 1$ and $\eta_0$ with $\eta \leq \eta_0 < 1/2$ , given access to a noisy oracle $\hat{EX}^\eta( c, {\mathcal D})$ and inputs $\epsilon$ , $\delta$ , $\eta_0$ , with probability at least $1 - \delta$ the algorithm outputs a hypothesis concept $h \in {\mathcal H}$ with $\text{error}(h) \leq \epsilon$ . If the runtime of the algorithm is polynomial in $n$ , $1/\epsilon$ , $1/\delta$ and $1/(1 - 2\eta_0)$ then ${\mathcal C}$ is efficiently learnable from noisy data using ${\mathcal H}$ .

Let’s unpick this definition a bit. The main difference from the PAC definition is simply the addition of noise via the oracle and an additional parameter $\eta_0$ which bounds the error of the oracle; thus the algorithm is allowed to know in advance an upper bound on the noisiness of the data, and an efficient algorithm is allowed to take more time on more noisy data.

Kearns and Vazirani address PAC learnability from noisy data in an indirect way, via the use of a slightly different framework, introduced below.

Definition (Statistical Oracle). A statistical oracle $STAT(c, {\mathcal D})$ takes queries of the form $(\chi, \tau)$ where $\chi : X \times \{0,1\} \to \{0,1\}$ and $0 < \tau \leq 1$ , and returns a value $\hat{P}_\chi$ satisfying $P_\chi - \tau \leq \hat{P}_\chi \leq P_\chi + \tau$ where $P_\chi = Pr_{x \in {\mathcal D}}[ \chi(x, c(x)) = 1 ]$ .

Definition (Learnable from Statistical Queries). Let ${\mathcal C}$ be a concept class and let ${\mathcal H}$ be a representation class over $X$ . Then ${\mathcal C}$ is efficiently learnable from statistical learning queries using ${\mathcal H}$ if there exists a learning algorithm $A$ and polynomials $p(\cdot, \cdot, \cdot)$ , $q(\cdot, \cdot, \cdot)$ and $r(\cdot,\cdot,\cdot)$ such that: for any $c \in {\mathcal C}$ , any distribution ${\mathcal D}$ over $X$ and any $0 < \epsilon < 1/2$ , if given access to $STAT(c,{\mathcal D})$ , the following hold. (i) For every query $(\chi,\tau)$ made by $A$ , the predicate $\chi$ can be evaluated in time $q(1/\epsilon, n, \text{size}(c))$ , and $\tau \leq r(1/\epsilon, n, \text{size}(c))$ , (ii) $A$ has execution time bounded by $p(1/\epsilon, n, \text{size}(c))$ , (iii) $A$ outputs a hypothesis $h \in {\mathcal H}$ that satisfies $\text{error}(h) \leq \epsilon$ .

So a statistical oracle can be asked about a whole predicate $\chi$ , for any given tolerance $\tau$ . The oracle must return an estimate of the probability that this predicate holds (where the probability is over the distribution over $X$ ). It is, perhaps, not entirely obvious how to relate this back to the more obvious noisy oracle used above. However, it is worth noting that one can construct a statistical oracle that works with high probability by taking enough samples from a standard oracle, and then returning the relative frequency of $\chi$ evaluating to 1 on that sample. Kearns and Vazirani provide an intricate construction to efficiently sample from a noisy oracle to produce a statistical oracle with high probability. In essence, this then allows an algorithm that can learn from statistical queries to be used to learn from noisy data, resulting in the following theorem.

Theorem (Learnable from Statistical Queries means Learnable from Noisy Data). Let ${\mathcal C}$ be a concept class and let ${\mathcal H}$ be a representation class over $X$ . Then if ${\mathcal C}$ is efficiently learnable from statistical queries using ${\mathcal H}$ , ${\mathcal C}$ is also efficiently PAC learnable using ${\mathcal H}$ in the presence of classification noise.

Hardness Results

I mentioned earlier in this post that Pitt and Valiant showed that sometimes we want more general hypothesis classes than concept classes: the concept class 3-DNF using the hypothesis class 3-DNF is intractable, yet learning the same concept class with the more general hypothesis class 3-CNF is efficiently PAC learnable. So in their chapter Inherent Unpredictability, Kearns and Vazirani turn their attention to the case where a concept class is hard to learn independent of the choice of a hypothesis class. This leads to some quite profound results for those of us interested in Boolean circuits.

We will need some kind of hardness assumption to develop hardness results for learning. In particular, note that if $P = NP$ , then by Occam’s Razor (above) polynomially evaluable hypothesis classes are also polynomially-learnable ones. So we will need to do two things: focus our attention on polynomially evaluable hypothesis classes (or we can’t hope to learn them polynomially), and make a suitable hardness assumption. The latter requires a very brief detour into some results commonly associated with cryptography.

Let ${\mathbb Z}_N^* = \{ i \; | \; 0 < i < N \; \wedge \text{gcd}(i, N) = 1 \}$ . We define the cubing function $f_N : {\mathbb Z}_N^* \to {\mathbb Z}_N^*$ by $f_N(x) = x^3 \text{ mod } N$ . Let $\varphi$ define Euler’s totient function. Then if $\varphi$ is not a multiple of three, it turns out that $f_N$ is bijective, so we can talk of a unique discrete cube root.

Definition (Discrete Cube Root Problem). Let $p$ and $q$ be two $n$ -bit primes with $\varphi(N)$ not a multiple of 3, where $N = pq$ . Given $N$ and $f_N(x)$ as input, output $x$ .

Definition (Discrete Cube Root Assumption). For every polynomial $P$ , there is no algorithm that runs in time $P(n)$ that solves the discrete cube root problem with probability at least $1/P(n)$ , where the probability is taken over randomisation of $p$ , $q$ , $x$ and any internal randomisation of the algorithm $A$ . (Where $N = pq$ ).

This Discrete Cube Root Assumption is widely known and studied, and forms the basis of the learning complexity results presented by Kearns and Vazirani.

Theorem (Concepts Computed by Small, Shallow Boolean Circuits are Hard to Learn). Under the Discrete Cube Root Assumption, the representation class of polynomial-size, log-depth Boolean circuits is not efficiently PAC learnable (using any polynomially evaluable hypothesis class).

The result also holds if one removes the log-depth requirement, but this result shows that even by restricting ourselves to only log-depth circuits, hardness remains.

In case any of my blog readers knows: please contact me directly if you’re aware of any resource of positive results on learnability of any compositionally closed non-trivial restricted classes of Boolean circuits.

The construction used to provide the result above for Boolean circuits can be generalised to neural networks:

Theorem (Concepts Computed by Neural Networks are Hard to Learn). Under the Discrete Cube Root Assumption, there is a polynomial $p$ and an infinite family of directed acyclic graphs (neural network architectures) $G = \{G_{n^2}\}_{n \geq 1}$ such that each $G_{n^2}$ has $n^2$ Boolean inputs and at most $p(n)$ nodes, the depth of $G_{n^2}$ is a constant independent of $n$ , but the representation class ${\mathcal C}_G = \cup_{n \geq 1} {\mathcal C}_{G_{n^2}}$ is not efficiently PAC learnable (using any polynomially evaluable hypothesis class), and even if the weights are restricted to be binary.

Through an appropriate natural definition of reduction in PAC learning, Kearns and Vazirani show that the PAC-learnability of all these classes reduce to functions computed by deterministic finite automata. So, in particular:

Theorem (Concepts Computed by Deterministic Finite Automata are Hard to Learn). Under the Discrete Cube Root Assumption, the representation class of Deterministic Finite Automata is not efficiently PAC learnable (using any polynomially evaluable hypothesis class).

It is this result that motivates the final chapter of the book.

Experimentation in Learning

As discussed above, PAC model utilises an oracle that returns labelled samples $(x, c(x))$ . An interesting question is whether more learning power arises if we allow the algorithms to be able to select $x$ themselves, with the oracle returning $c(x)$ , i.e. not just to be shown randomly selected examples but take charge and test their understanding of the concept.

Definition (Membership Query). A membership query oracle takes any instance $x$ and returns its classification $c(x)$ .

Definition (Equivalence Query). An equivalence query oracle takes a hypothesis concept $h \in {\mathcal C}$ and determines whether there is an instance $x$ on which $c(x) \neq h(x)$ , returning this counterexample if so.

Definition (Learnable From Membership and Equivalence Queries). The representation class ${\mathcal C}$ is efficiently exactly learnable from membership and equivalence queries if there is a polynomial $p(\cdot,\cdot)$ and an algorithm with access to membership and equivalence oracles such that for any target concept $c \in {\mathcal C}_n$ , the algorithm outputs the concept $c$ in time $p(\text{size}(c),n)$ .

There are a couple of things to note about this definition. It appears to be a much stronger requirement than PAC learning, as the concept must be exactly learnt. On the other hand, the existence of these more sophisticated oracles, especially the equivalence query oracle, appears to narrow the scope. Kearns and Vazirani encourage the reader to prove that the true strengthening over PAC-learnability is in the membership queries:

Theorem (Exact Learnability from Membership and Equivalence means PAC-learnable with only Membership). For any representation class ${\mathcal C}$ , if ${\mathcal C}$ is efficiently exactly learnable from membership and equivalence queries, then ${\mathcal C}$ is also efficiently learnable in the PAC model with membership queries.

They then provide an explicit algorithm, based on these two new oracles, to efficiently exactly learn deterministic finite automata.

Theorem (Experiments Make Deterministic Finite Automata Efficiently Learnable). The representation class of Deterministic Finite Automata is efficiently exactly learnable from membership and equivalence queries.

Note the contrast with the hardness result of the previous section: through the addition of experimentation, we have gone from infeasible learnability to efficient learnability. Another very philosophically pleasing result.

Review: Neuroqueer Heresies

I recently picked up this enjoyable and thought-provoking collection of essays from Nick Walker. The book consists of a diverse collection of essays she has written over the years, starting from questions of neurodiversity, and ending by exploring how queer theory can be brought to the neurodiversity movement. Some of these essays have appeared before, especially on Walker’s website, and some are newly written for this book. Given that the original essays appeared as far back as 2011, many of them come with a partner essay in the book, reflecting on the changes since first published and on the impact they have had.

The book is split into three sub-collections: The Neurodiversity Paradigm, Autistic Empowerment, and Postnormal Possibilities.

Part I, The Neurodiversity Paradigm, deals with neurodiversity terminology and ideas. “Throw Away the Master’s Tools”, the first essay, sets the scene by raging against the pathology paradigm in neurodiversity, the idea that neurodivergence is “less than” neurotypicality somehow. She instead locates neurodiversity as natural, positive, and similar to racial and cultural diversity. Readers of this blog who are interested can read a lot more detail on this in my review of Chapman’s recent book Empire of Normality, where Chapman identifies possible explanations for the rise of this phenomenon. Walker argues that using the tools and language of medicalised approaches to neurodivergence are not going to lead to neurodivergent liberation. For Walker, using the right language is a key part (perhaps even the key part, given the prominence she gives it) of working towards self-empowerment of marginalised groups.

This sets the scene quite nicely for the next few essays, starting with Neurodiversity: Some Basic Terms and Definitions. These essays deal with definitional questions: neurodivergence versus neurodiversity versus the neurodiversity paradigm, neurotypicality, etc. Neurominority is an interesting term defined here by Walker – it seems to plug a hole in language I was looking for when reviewing Chapman’s work, when I was talking about “clusters”; it would be interesting to compare Chapman’s ideas on “serial collectives” to the “neurominority” definition of Walker.

In the essay Neurodivergence and Disability, Walker deals with the still controversial question of the relationship between these two concepts. I know autistic colleagues who strongly identify as disabled as a result of their neurodivergence, as well as those who shy away from the disability label. To my mind, Walker’s view on this makes perfect sense: to view disability in the social model, leading to the idea that disabled should be considered the opposite of enabled. When someone says they are enabled, we tend to imagine that someone or something outside of them (a mentor, a social structure, or even simply a mechanical tool) allows them to achieve something. Meanwhile, when someone says they are disabled, many people read that as a deficit in the individual, rather than the failure of someone or something outside of that individual to allow them to achieve something. It is also in this context that we first get a preview of Walker’s very strong rejection of “person first” language: “person with a disability” locates the disability with the person, whilst “disabled person” does not. To quote Walker, “if you ask me whether autism is a disability, I’ll say no, but if you ask me whether autistic people are disabled, I’ll say yes”.

Part I of the book finishes with an essay Reflections on Neurocosmopolitanism, which imagines a future in which the neurodiversity paradigm has been embraced by individuals and society. While I love the forward-looking nature of this, I feel it lacks the materialist grounding in class of Chapman’s work, which makes it feel somewhat disconnected from reality for me.

Part II is entitled Autistic Empowerment, and includes a diverse set of essays specifically focused on autism. This ranges from the short, accessible essay What is Autism? to the strongly-titled Person-First Language is the Language of Autistiphobic Bigots. One interesting claim made in this part of the book, in an essay entitled Autism and Social Trauma, is that autistic communication difficulties with allistics do not equally apply to communication with other autistics: that this is not an issue of autistic difficulty but rather of inter-neurotype communication difficulty. This claim is very interesting, but (in common with the other essays) a reference is not provided for evidence. If any of my readers are aware of evidence supporting or disputing this claim, I’d be really interested to hear more. Other essays are included on stimming, on parenting autistic children, on guidance for neurotypical psychotherapists working with autistic clients and a set of principles for those constructing a course on autism, amongst others.

The shortest part of the book, Part III, is called Postnormal Possibilities, and as I understand it takes a turn towards the application of queer theory to neurodiversity. This is my first exposure to queer theory, so my comments here need to be read in this context. Walker introduces the verb to neuroqueer (also, and less prominent in Walker’s thinking, is the adjective neuroqueer). To neuroqueer covers, for Walker, a wide variety of activities, broadly involving transgressing and defying neuronormative and heteronormative behaviours simultaneously. To my mind, Walker’s best illustration of this might be provided in the final, and most lengthy essay of the section, entitled A Horizon of Possibility: Some Notes on Neuroqueer Theory, in which she talks about the “policing of hands”; that neuronormative and heteronormative society insists (at least in the societies Walker and I live in), a certain acceptable way of holding and using one’s hands. As I understand it, to consciously free oneself to stim with one’s hands, or to hold them in a manner that might cause disapproval as it “looks gay” or “looks autistic”, would be a clear example of neuroqueering.

I would recommend this book to others. It’s certainly thought provoking, and there are parts of it that I have already found myself repeating to others, e.g. at work, I found myself in a meeting arguing that disabled is the opposite of enabled, and drawing the appropriate conclusions for inclusive practices. From an academic perspective, I find the lack of references to experimental literature or even other lived experience literature to be frustrating. Walker does herself talk about the problematic nature of academic literature in the field in this very collection, but I would love to see a version of some of these essays referencing other work, so I can build a more fully rounded picture of the evidence base for some of the claims. Having said that, I think it’s still a great jumping off point, perhaps including for researchers into neurodivergence seeking that very evidence base. Walker comes across as a passionate advocate for her community.

Energy: Rewriting the Possibilities

In early June, my PhD student Sam Coward (co-advised by Theo Drane from Intel), will travel to ARITH 2024 in Málaga to present some of our most recent work, “Combining Power and Arithmetic Optimization via Datapath Rewriting”, a joint paper with Emiliano Morini, also of Intel. In this blog post, I will describe the fundamental idea of our work.

It’s well-known that ICT is driving a significant amount of energy consumption in the modern world. The core question of how to organise the fundamental arithmetic operations in a computer in order to reduce power (energy per unit time) has been studied for a long time, and continues to be a priority for designers across industry, including the group at Intel with whom this work has been conducted.

Readers of this blog will know that Sam has been doing great work on how to explore the space of behaviourally equivalent hardware designs automatically. First for area, then for performance, and now for power consumption!

In our latest work, Sam looks at how we can use the e-graph data structure, and the related egg tool, to tightly integrate arithmetic optimisations (like building multi-input adders in hardware) with clock gating and data gating, two techniques for power saving. Clock gating avoids clocking new values into registers in hardware if we know they’re not going to be used in a given cycle; this avoids the costly switching activity associated with propagating unused information in a digital circuit. Data gating also avoids switching, but in a different way – by replacing operands with values inducing low switching: for example, if I do not end up using a result of $a \times b$ , then I may as well be computing $a \times 0$ . In both cases, the fundamental issue becomes how to identify whether a value will be unused later in a computation. Intriguingly, this question is tightly related to the way a computation is performed: there are many ways of computing a given mathematical computation, and each one will have its own redundancies to exploit.

In our ARITH 2024 paper, Sam has shown how data gating and clock gating can be expressed as rewrites over streams of Boolean data types, lifting our previous work that looks at equivalences between bit vectors, to equivalences over streams of bit vectors. In this way, he’s able to express both traditional arithmetic equivalences like $a + (b + c) = (a + b) + c$ and equivalences expressing clock and data gating within the same rewriting framework. A collection of these latter equivalences are shown in the table below from our paper.

Some of the rewrites between equivalent expressions used in our ARITH 2024 paper

Sam has been able to show that by combining the rewrites creatively, using arithmetic rewrites to expose new opportunity for gating, our tool ROVER is able to save some 15% to 30% of power consumption over a range of benchmark problems of industrial interest. Moreover, ROVER will automatically adjust the whole design to better suit different switching profiles, knowing that rarely-switching circuit components are less problematic for energy, and prioritising exposing rewrites where they are needed.

I think this is really interesting work, and shows just how general the e-graph approach to circuit optimisation can be. If you’re going to ARITH 2024, do make sure to talk to Sam and find out more. If not, make sure to read his paper!

FPGA 2024

I have just returned from a wonderful trip to California with colleagues and students, into which I managed to pack: visiting AMD for a day of presentations to each other, co-running a workshop on Spatial Machine Learning, attending the ACM FPGA 2024 conference at which my student Martin Langhammer presented, and catching up with international colleagues and with Imperial College alumni and their families. In this blog post I’ll briefly summarise some of the work-related key takeaways from this trip for me.

A few of our alumni in the San Francisco Bay Area. Immediately behind me, with white hair, is Dr Nan Zhuang who first taught me logic synthesis when I was a first-year undergraduate and he was a PhD student at Imperial College

AMD and Imperial

At AMD’s Xilinx campus, I was able to share some of our most recent work. I chose to focus on two aspects: the work we’ve done on e-graphs as an EDA data-structure, and how we have developed these for datapath optimisation, and the development of new highly-efficient LUT-based neural networks. It was great to catch up with AMD on the latest AI Engines developments, especially given the imminent open-sourcing of a complete design flow here. My former PhD students Sam Bayliss and Erwei Wang have been working hard on this – amongst many others, of course – and it was great to get a detailed insight into their work at AMD.

SpatialML Workshop

Readers of this blog may know that I currently lead an international centre on spatial machine learning (http://spatialml.net). This year we ran an open workshop, sharing the work we have been doing in the centre, and bringing in additional colleagues from outside the centre too. This was exceptionally well attended: we had delegates from academia and industry internationally in attendance: A*STAR, AMD, Ciena, Cornell, DRW, Essen, Groq, Imperial College, Intel, KAUST, Mangoboost, Minnesota, Rhode Island, Seoul, Simon Fraser, Southampton, Tsinghua, Toronto, and UCLA. We structured the day around a number of excellent academic talks from Zhiru Zhang (Cornell), Vaughn Betz (University of Toronto), Paul Chow (University of Toronto), Lei Xun (University of Southampton), Atefeh Sohrabizadeh (UCLA), and Alex Montgomerie (Imperial College), combined with industrial keynotes from Satnam Singh (Groq) and Sam Bayliss (AMD), closing with a panel discussion on the topic of working across abstraction layers; we had Jason Anderson (U of T), Jason Cong (UCLA), Steve Neuendorffer (AMD) and Wayne Luk (Imperial College) on the panel, with me chairing (sadly Theo Drane from Intel could not make it due to poor weather conditions in Northern California). Slides that are publicly sharable will be made available on the workshop website.

We learned about the use of machine learning (graph representation learning, GNNs) in EDA flows, about scaling LLM implementations across multiple devices in industry and academia, about dynamic DNNs and resource scaling, about programming models and flows for deep learning acceleration in both academia and industry (often using MLIR), and about FPGA architectural enhancements to support more efficient deep learning.

We received very positive feedback from workshop attendees. I would like to express particular thanks to my workshop co-organisers, especially the astoundingly productive Andrew Boutros of the University of Toronto, for all his hard work making this workshop happen.

The panel discussion at the workshop. L-R: Jason H. Anderson, Jason Cong, George A. Constantinides, Wayne Luk, Stephen Neuendorffer

FPGA 2024

I consider the ACM FPGA conference as my “home” conference: I’m a steering committee member and former chair, but this year my only roles were as a member of the best paper committee and as a session chair, so I could generally sit back and enjoy listening to the high quality talks and interacting with the other delegates. This year Andrew Putnam from Microsoft was technical program chair and Zhiru Zhang from Cornell was general chair. There is of course too much presented in a conference to try to summarise it all, but here are some highlights for me.

Alireza Khataei and Kia Bazargan had a nice paper, “CompressedLUT: An Open Source Tool for Lossless Compression of Lookup Tables for Function Evaluation and Beyond”, on lossless compression of large lookup tables for function evaluation.
Ayatallah Elakhras, Andrea Guerrieri, Lana Josipović, and Paolo Ienne have done some great work, “Survival of the Fastest: Enabling More Out-of-Order Execution in Dataflow Circuits”, on enabling (and localising) out-of-order execution in dynamic high-level synthesis, extending the reach of out-of-order execution beyond the approach I took with Jianyi Cheng and John Wickerson.
Louis-Nöel Pouchet, Emily Tucker and coauthors have developed a specialised approach to checking equivalence of two HLS programs, “Formal Verification of Source-to-Source Transformations for HLS”, for the case where there are no dynamic control decisions (common in today’s HLS code), based on symbolic execution, rewriting, and syntactic equivalence testing. It basically does what KLEE does for corner cases of particular interest to HLS, but much faster. Their paper won the best paper award at FPGA 2024.
Jiahui Xu and Lana Josipović had a nice paper, “Suppressing Spurious Dynamism of Dataflow Circuits via Latency and Occupancy Balancing”, allowing for a more smooth tradeoff between dynamic and static execution in high-level synthesis than was possible from the early work my PhD student Jianyi Cheng was able to achieve on this topic. They get this through balancing paths for latency and token occupancy in a hardware-efficient way.
Daniel Gerlinghoff and coauthors had a paper, “Table-Lookup MAC: Scalable Processing of Quantised Neural Networks in FPGA Soft Logic”, building on our LUTNet work and extensions thereof, introducing various approaches to scale down the logic utilisation via sequentialisation over some tensor dimensions and over bits in the linear (LogicNets) case.
My PhD student Martin Langhammer (also with Altera) presented his work, “A Statically and Dynamically Scalable Soft GPGPU”, on a super high clock frequency soft GPU for embedding into FPGA designs.

Martin Langhammer presenting our work on a super-high clock frequency compact soft GPU

Reflections

I’ve come away with many ideas from all three events: the AMD visit, the workshop, and the FPGA conference. In person attendance at conferences is still the best for this; I didn’t get nearly as many ideas when attending FPGA 2021 or 2022 remotely. It was also particularly satisfying to see our work on soft-logic efficient deep neural networks (starting with LUTNet, most recently PolyLUT) being cited by so many people at the conference; this work appears to have really made a long-term impact.

Finally, it is always a joy to visit Monterey. This year FPGA was held at a hotel on Cannery Row, described by John Steinbeck as “a poem, a stink, a grating noise, a quality of light, a tone, a habit, a nostalgia, a dream”.