We Are Not Machines: Craft, Care and Collective Agency

Sarah O’Connor’s We Are Not Machines: The Fight for the Future of Work accompanied me on my summer holiday, but it has continued to unsettle how I think about work. It is an engaging, accessible book, and precisely the kind that prompts a recalibration of what you want your own work to be about. In the age of AI, it offers a useful lens for deciding what we should automate, what we should protect, and who gets to make those choices.

I am not anti-AI; I am pro-human. I routinely use AI tools, my research involves designing hardware for AI, and I welcome their use where they genuinely enhance life. The difficult questions are which parts of life they enhance, who gets to decide, and what exactly we mean by “enhancement”.

O’Connor examines the intersection of work and technology through our collective struggle to preserve some of our most cherished human traits: “to pass skill and knowledge from one generation to the next; to create, and to delight one another with our creations; to care for each other when we are too weak to care for ourselves”. The book is structured into three sections: “Mind”, “Body”, and “Soul”, which explore autonomy in cognitive work, safety in physical labour, and the intrinsic value of skill, creation, and care. Moving across warehouses, mines, care homes, and creative studios, O’Connor repeatedly returns to three questions: who shapes technology and its adoption, under what incentives, and in whose interests?

For me, the book’s deeper lesson is that these choices are rarely individual. What workers can protect depends on whether they have any collective power over how technology is introduced. The title itself comes from the 1969-70 strike by miners at the Swedish state-owned company LKAB. Their rallying cry was Vi är ej maskiner: we are not machines. Tellingly, the book’s most hopeful narratives are those in which people act collectively to reclaim agency over how their work is organised.

Craft and “Vibe Knitters”

A concept that particularly resonated with me is O’Connor’s use of Karri Saarinen’s account of craft: “the deliberate attention put into making something excellent, not because someone is checking, but because it matters to the maker”. O’Connor also cites Saarinen’s wider economic point that when something becomes cheaper to build, the default outcome is often simply to build more of it, making us less critical of what actually deserves to exist. I recognise this tension from my own vibe coding. The speed and ease are exhilarating, yet they risk detaching the act of production from the rigorous exercise of judgement.

O’Connor restores the historical context around the stockingers who became Luddites. Far from being ignorant reactionaries smashing unfamiliar machines, they were highly skilled framework knitters reacting against particular uses of machinery that degraded both their trade and its products. She describes the unapprenticed “colts” employed to turn out cheap goods as the “vibe knitters” of their day. The historical question, then as now, is about the social relations in which machines are deployed, the kinds of work they enable, and the products they are used to create.

This discussion connects to a much older debate about the division of labour. O’Connor cites Frederick Winslow Taylor, whose The Principles of Scientific Management sought to convert workers’ accumulated craft knowledge into “rules, laws, and formulae” held and applied by management, displacing individual judgement. Against this, she places John Ruskin’s 1853 warning in “The Nature of Gothic”: it is not merely labour that is divided, but people themselves, broken into “fragments and crumbs of life”.

Karl Marx’s theory of alienation has much to add to this discussion. In the 1844 manuscripts, Marx describes four connected dimensions of alienation: the worker is estranged from the product, from the activity of labour, from other people, and from the human capacity for free, conscious creation. His observation that the worker’s activity “belongs to another” and is “a loss of his self” is therefore not simply a claim about who owns the product. It highlights the worker’s estrangement from their own activity – work experienced as external, compelled, and no longer self-directed. By 1848, Marx and Friedrich Engels were writing in The Communist Manifesto that the worker “becomes an appendage of the machine”.

Each of these thinkers helps illuminate what changes when workers are separated from meaningful responsibility for both their labour and its result. Specialisation enables production on a scale impossible for a solo craftsperson. But those gains ring hollow when work is so fractured that nobody can see, or care about, the whole. Something human is extracted from the process, even as measurable output soars. Unless, perhaps, there is collective and living control over the entire enterprise.

What Should We Automate?

O’Connor quotes an interviewee who poses a striking question: why are we so eager to automate the cognitive activities our minds excel at, and which nourish our minds, rather than the physical tasks our bodies struggle with and which cause physical damage? This question stayed with me throughout the holiday.

Expanding on this, she cites Andreas Schleicher on generative AI in education. A system can effortlessly throw an answer back at us without revealing its provenance or how it was constructed. I recently experienced my own somewhat jarring version of this. I found myself spending more time investigating the intellectual lineage of a mathematical proof generated by GPT-5.5 than I spent proving the theorem from scratch in Lean. It inverted my sense of what takes time in research. Producing the formal mathematical object was almost instantaneous; establishing its origins, understanding its mechanics, and deciding whether I actually thought it illuminating remained stubbornly slow.

This is not necessarily a negative development, but it shifts where the craft lies. Unless we notice that shift, we risk mistaking possession of an answer for understanding.

Seeing the Whole Person

Perhaps the book’s most compelling case study is the Dutch home-care organisation Buurtzorg. Its model relies on small, self-managing teams of nurses who take holistic responsibility for the entire care process. A highly trained nurse might administer complex medication, dress a wound, and then make the patient a sandwich. Viewed through a strictly Taylorist lens, making a sandwich is a gross misallocation of expensive skill and ought to be delegated to cheaper labour. Viewed as part of human care, however, it becomes an opportunity to observe how that person is living and assess their broader needs. The simple task is inseparable from the skilled one.

The same pattern extends beyond nursing. I see it in the creeping deskilling of teaching in England, where the educator’s role is increasingly carved into separately managed tasks. An Ofsted study of teacher wellbeing found that limited influence over policy – teachers feeling “done to” rather than “worked with” – contributed to a sense of de-professionalisation. The same logic shapes the ticketing systems through which large institutions, including my own, manage HR and ICT support. It also shapes higher education, where interactions with staff and students are increasingly packaged, routed, and siloed. In each case, relationships that depend on continuity and judgement are divided into discrete, measurable transactions.

Specialists matter, and a functioning ticketing system is preferable to chaos. Yet any honest accounting of efficiency must include the hidden cost of nobody seeing the person – or the problem – as a whole. It must also account for what workers lose when their roles become too narrow for them to exercise judgement or care about the outcome – and, centrally, when they cannot reach beyond those roles through collective control of the production process as a whole.

Who Is the “We”?

O’Connor skewers what she describes as a cliché that technologies are “only tools” and that what matters is how we choose to use them. Her counter-question is simple: who exactly belongs to the “we” making those choices?

It is striking that almost all the battles for human-centred work detailed in the book are collective ones, even when their victories are ultimately codified as individual rights. The Swedish miners did not optimise their relationship with machinery through individual negotiation. More recently, during its 148-day strike, the Writers Guild of America won rules in the 2023 agreement governing AI: AI could not write or rewrite literary material under the agreement; AI-generated material was not source material; and writers could not be required to use AI. These were foundational choices about technology, won through organised labour.

Because such victories are codified as rights exercised by individuals, their collective origins are easy to forget. Nicos Poulantzas is useful here. In State, Power, Socialism, he treats the state as a material condensation of the balance of forces among classes and class fractions. This offers one way of seeing how rights that appear to belong to isolated individuals can bear the imprint of collective struggle. The right may be granted and exercised individually; the power that made it possible was collective.

The “we” defining technological choice must include those actually performing the work. For people divorced from the ownership and control of the technology, exhortations to preserve their craft are nearly useless if they control neither the tools nor the targets and metrics by which their work is judged.

More Than Machines

The final chapters include perhaps the book’s most unsettling idea: that what some call artificial general intelligence (AGI) might arrive not because machines advance, but because humans retreat from what we are capable of when faced with apparently superior machine performance. Organisations may treat us as machines, but we may also come to understand ourselves as defective machines, slow and inconsistent approximations of systems whose strengths we have adopted as the gold standard.

I closed the book wanting to identify – and ruthlessly prioritise – the crafts I most enjoy, at work and beyond. I remain eager to use AI to automate genuine human drudgery. Yet the current temptation is to automate whatever is easiest to automate, rather than what we actually wish to relinquish, on the vague promise that doing so will free us up for… what exactly?

Much of life’s joy resides in craft. Those of us fortunate enough to retain some autonomy over our work and lives should think carefully before surrendering it. O’Connor’s narratives show that preserving human craft requires both individual choice and collective action. Remaining more than machines is not merely a personal preference; it is a collective and deeply political task.

Sources and further reading

Quotations not otherwise linked are drawn from Sarah O’Connor’s We Are Not Machines: The Fight for the Future of Work.

AI tools were used to support copy-editing and to help collect and organise the links above.

Directionality in Low Precision

In a couple of recent posts [1,2], I have been trying to reason through the geometric properties of block number formats. The basic idea is that when a group of numbers shares a scale factor, the small low-precision numbers inside the block are no longer meaningfully trying to approximate scalar magnitudes, as the scale has already taken care of much of the magnitude information. What remains is, to a significant extent, a question of direction.

This post is about a new preprint I have recently put out with Bardia Zadeh, “Direction-Preserving Number Representations”. The question we ask is “If a block scale has already taken care of magnitude, what should the scalar values inside the block look like if our aim is to preserve direction?”

That is not the usual question in computer arithmetic, and I can’t really find a historical precedent for it in the usual places I would look. (Please let me know if you know of others who have looked at this question from an arithmetic perspective!) We usually ask about absolute error, relative error, dynamic range, rounding behaviour, underflow, overflow, and so on. But in many modern machine learning settings, it is also natural to ask a more geometric question: how well do the values available inside a block cover the possible directions of a vector?

On the research methodology side, there is another aspect of this paper that is new for me personally. This is the first paper I have written where AI tools (namely GPT-5.4, GPT-5.5 and Aristotle) made a substantive contribution to the development of proof ideas, as well as to the Lean formalisation, which I am delighted we have open sourced. The AI tools definitely did not replace mathematical judgement or lead to a push-button approach: the process still required many manual reformulations, checking, rewriting, and false starts. But it was a new and positive experience for me. In particular, the combination of exploratory AI assistance and Lean’s rock-solid proof checking made for a very different style of research interaction from the one I am used to.

I will come back to Lean and AI below. First, let me describe the mathematical object we studied.

From scalar alphabets to directions

Suppose we have a finite scalar alphabet $A$ . For example, $A$ might be the set of real values represented by a 4-bit floating-point format, or by a 4-bit integer format.

If a block has dimension $n$ , then the unscaled vectors we can represent are the product set $A^n.$

But if the block also has an independent positive scale factor, then multiplying the whole vector by a positive scalar should not really change the information encoded by the low-precision elements. The scale can absorb that. What matters is the direction.

So the relevant object becomes

$P_n(A)=\left\{\frac{x}{\|x\|_2}:x\in A^n,\;x\ne 0\right\}.$

This is a finite set of points on the unit sphere. But these points can’t be placed arbitrarily on the sphere. The set has product structure, because each coordinate is chosen independently from the scalar alphabet.
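To make the definition concrete, here is a minimal Python sketch that enumerates $P_n(A)$ by brute force for a small alphabet. The function name `product_directions` and the toy ternary alphabet are my own illustrative choices, not anything from the paper, and the approach is only sensible for tiny $n$ because $A^n$ grows exponentially.

```python
import itertools
import math

def product_directions(alphabet, n):
    """Return the distinct unit vectors x / ||x||_2 for nonzero x in alphabet^n."""
    seen = set()
    for x in itertools.product(alphabet, repeat=n):
        norm = math.sqrt(sum(v * v for v in x))
        if norm == 0.0:
            continue  # the zero vector has no direction
        # Round so that positive multiples such as (1, 1) and (2, 2) merge.
        seen.add(tuple(round(v / norm, 12) for v in x))
    return sorted(seen)

# Hypothetical toy alphabet, chosen only to keep the output small.
toy = [-1.0, 0.0, 1.0]
dirs = product_directions(toy, 2)
print(len(dirs))  # 8: the four axis directions and the four diagonals
```

Note how normalisation collapses raw vectors that differ only by a positive factor, so the number of directions can be well below the number of nonzero vectors in $A^n$.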

The natural worst-case measure is the covering radius

$F_n(A)=\sup_{u\in S^{n-1}}\min_{c\in P_n(A)}\angle(u,c).$

This asks: given any true direction $u$ , how close is the nearest direction obtainable from the alphabet $A$ ? Smaller $F_n(A)$ is better.

This lens changes the problem of how to select the alphabet. Instead of asking “which scalar values approximate real numbers well?”, we ask: which scalar values, when used coordinatewise, cover directions well?

You can see the two-dimensional directionality coverage of the floating-point alphabet with two exponent bits, one mantissa bit and one sign bit (E2M1), illustrated below. Each intersection of grid lines corresponds to a 2-vector with elements drawn from this alphabet. We may then find where the line between the origin and that intersection meets a circle (marked as red points). Note the non-uniform spacing around the circle: some regions are better covered than others.
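In two dimensions the covering radius can be computed exactly, because the worst direction always sits in the middle of the widest angular gap between neighbouring points on the circle. The sketch below does this for E2M1, assuming the usual E2M1 magnitudes 0, 0.5, 1, 1.5, 2, 3, 4 and 6; the function name `covering_radius_2d` is hypothetical.

```python
import itertools
import math

# Assumed E2M1 magnitudes (sign-symmetric, with a single zero).
E2M1_POS = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1 = sorted({0.0, *E2M1_POS, *(-v for v in E2M1_POS)})

def covering_radius_2d(alphabet):
    """Exact worst-case angle (radians) for the 2D direction set of `alphabet`."""
    angles = sorted({
        round(math.atan2(y, x), 12)
        for x, y in itertools.product(alphabet, repeat=2)
        if (x, y) != (0.0, 0.0)
    })
    # Gaps between neighbouring directions, including the wrap-around gap.
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    gaps.append(angles[0] + 2 * math.pi - angles[-1])
    # The worst direction is mid-gap, so the radius is half the widest gap.
    return max(gaps) / 2

print(math.degrees(covering_radius_2d(E2M1)))
```

Under these assumptions the widest gap lies between the slope-3/4 direction and the diagonal, which gives a worst-case error of roughly 4 degrees. This trick is specific to $n=2$; in higher dimensions there is no simple sorted-gap structure to exploit.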

If you want to play with designing your own two-dimensional alphabet in a graphical user interface, you can get more intuition into this problem using this widget Bardia created: https://bardia01.github.io/directional_coverage_explorer/.

A Lean Interlude

As I mentioned above, I’m super pleased that we’ve open-sourced formalisations of all our theorems and definitions in Lean. The formalisation is organised so that the top-level declarations correspond closely to the paper.

The key definitions look like this:

abbrev Aq (q : Nat) : Set (Finset Real) :=
  {A : Finset Real | A.card = q}

abbrev P (n : Nat) (A : Finset Real) :
  Finset (OptimalAlphabets.SpherePoint n) :=
  OptimalAlphabets.AsymmetricProduct.asymProdSphericalCode n A

abbrev F (n : Nat) (A : Finset Real) : Real :=
  OptimalAlphabets.AsymmetricProduct.F_asym n A

abbrev rhoSph (n m : Nat) : Real :=
  OptimalAlphabets.rho_sph n m

Here Aq q is the class of scalar alphabets with $q$ elements. The definition P n A is the finite spherical code obtained by normalising nonzero elements of $A^n$ . The quantity F n A is the product-code covering objective $F_n(A)$ . The definition rhoSph n m is the optimal covering radius of an unconstrained spherical code with $m$ points.

I’m giving these Lean definitions now so we can roughly follow along in Lean as we go for the rest of this blog post.

Product Codes versus Spherical Codes

It seems natural, almost obvious, that a product code should give worse directional coverage than a vector code chosen specifically to optimise directional coverage. Of course, the latter may not be practical, as the cost of decoding may be significant, but it still forms a useful baseline comparison point. Our first theoretical contribution is to quantify the gap between these two code classes.

Notice that the product structure induces a very severe geometric constraint. Even if $A$ has $q$ values, so that $A^n$ has up to $q^n$ raw vectors, those vectors are not arbitrary. They arise from independent coordinate choices from the same scalar alphabet. Meanwhile, a spherical code with $q^n$ points is free to place those points anywhere on the sphere.

The harmonic witness

A central construction in the paper is a direction that is hard for product codes to cover. Let

$H_n=\sum_{i=1}^{n}\frac{1}{i}$

be the $n$ th harmonic number, and define the unit vector

$u^{(n)}=\left(\frac{1}{\sqrt{H_n}},\frac{1}{\sqrt{2H_n}},\ldots,\frac{1}{\sqrt{nH_n}}\right).$

The entries of this vector decay slowly, making the vector awkward for a finite scalar alphabet. The resulting theorem gives a lower bound on the worst-case angular error. If $m(A)$ is the smaller of the number of positive and negative nonzero values in $A$ , then

$F_n(A)\ge \arccos\left(\min\left\{1,\;2\sqrt{\frac{m(A)}{H_n}}\right\}\right).$

The Lean statement is compact enough to include directly:

theorem theorem2_sign_count_bound {n : Nat} (hn : 2 <= n)
    (A : Finset Real) :
  Real.arccos
    (min 1 (2 * Real.sqrt (mSign A : Real) / Real.sqrt (H n))) <=
  F n A

The notation mSign A is the Lean name for $m(A)$ , the smaller nonzero sign count. The theorem is written as a lower bound on F n A, just as in the paper.

Since $H_n$ grows like $\log n$ , this bound tends towards $\pi/2$ for any fixed alphabet. In plain language: in sufficiently high dimension, every fixed scalar alphabet has some direction that it represents very badly.
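A quick numerical sketch shows how slowly this bound bites. The code below evaluates the harmonic witness and the lower bound directly from the formulas above; the function names are mine, and `m` simply stands in for $m(A)$.

```python
import math

def harmonic(n):
    """The n-th harmonic number H_n."""
    return sum(1.0 / i for i in range(1, n + 1))

def harmonic_witness(n):
    """The unit vector with entries 1 / sqrt(i * H_n), i = 1..n."""
    h = harmonic(n)
    return [1.0 / math.sqrt(i * h) for i in range(1, n + 1)]

def sign_count_lower_bound(m, n):
    """Lower bound on F_n(A) in radians, where m plays the role of m(A)."""
    return math.acos(min(1.0, 2.0 * math.sqrt(m / harmonic(n))))

# m = 1 corresponds, for example, to a ternary alphabet {-1, 0, 1}.
for n in (16, 1024, 10**6):
    print(n, math.degrees(sign_count_lower_bound(1, n)))
```

The bound is only non-trivial once $H_n > 4m(A)$. For $m(A)=1$ that first happens at $n=31$; for a sign-symmetric four-bit alphabet with $m(A)=7$ it would need $H_n > 28$, far beyond any practical block size. That is consistent with reading the theorem as a statement about asymptotic structure rather than about the block sizes used in today's hardware.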

This is a worst-case theorem. We’re not claiming that real neural network tensors look like the harmonic witness. What the theorem tells us is that if the metric is worst-case angular coverage of the entire sphere, product-structured scalar alphabets have an inherent limitation.

So What about Spherical Codes?

A product code built from a $q$ -element alphabet has at most $q^n$ raw codewords before normalisation. The fair unconstrained comparison is therefore a spherical code with $q^n$ points.

The paper proves that, for any fixed $q$ , sufficiently high-dimensional spherical codes beat every $q$ -element product code in worst-case angular covering radius.

Here is the Lean statement:

theorem theorem4_asymptotic_strict_separation_fixed_alphabet_size
    {q : Nat} (hq : 2 <= q) :
  ∃ N : Nat, 2 <= N ∧
    forall n, N <= n ->
      forall A : Finset Real, A ∈ Aq q ->
        rhoSph n (q ^ n) < F n A

Read from right to left, this says: take any scalar alphabet $A$ with $q$ elements. In all sufficiently large dimensions $n$ , the best unconstrained spherical code with $q^n$ points has strictly smaller covering radius than the product-code direction set induced by $A$ .

What about Floating and Fixed Point?

The next question we answer is more practical. Within the product-code world, are the usual scalar alphabets the best ones?

The paper studies standard floating-point, fixed-point, and two’s complement alphabets. The answer is that these conventional choices are asymptotically suboptimal for the worst-case directional metric.

Because the worst-case angle for any product code tends towards $90^\circ$ in high dimension (see above), it is more informative when comparing product codes to look at a normalised quantity such as:

$\sqrt{H_n}\cos F_n(A).$

Very roughly, this measures how slowly the worst-case angle approaches $90^\circ$ . Larger is better.

The Lean statement packages the relevant comparison as a liminf/limsup chain:

theorem theorem5_liminf_limsup_chain {b : Nat} (hb : 3 <= b) :
  arbConst b <=
      Filter.liminf (fun n : Nat => normBestAlpha n (2 ^ b)) atTop ∧
  fpConst b < arbConst b ∧
  Filter.limsup (fun n : Nat => normBestFpCos n b) atTop <= fpConst b

This is a theorem about $b$ -bit alphabets. The quantity normBestAlpha n (2 ^ b) is the best normalised performance obtainable by an arbitrary $2^b$ -element scalar alphabet. The quantity normBestFpCos n b is the corresponding floating-point quantity, optimised over valid splits of exponent and mantissa bits. The constants arbConst b and fpConst b are the two asymptotic constants being compared.

The middle inequality fpConst b < arbConst b is the key point. For $b\ge 3$ , arbitrary scalar alphabets can do strictly better than the floating-point family in this asymptotic directional metric.

For four bits, the paper obtains a concrete constant-factor separation of at least $\sqrt{7/3}\approx 1.528.$

So if the design objective is “choose scalar levels whose product code covers directions well”, there are better alphabets than the standard ones, at least in this worst-case asymptotic sense.

AI and Lean

It feels worth saying a little more about the process followed to reach the proofs of these theorems. This paper is the first time I have had a genuinely substantive AI contribution to the development of proof ideas, not just text polishing and review. AI was useful both in the Lean formalisation and in exploring how some of the mathematical arguments might be structured.

The AI tools we used needed lots of iteration, but the workflow was unexpectedly productive. GPT-5.4 and 5.5 and Aristotle from Harmonic could suggest possible routes, propose intermediate lemmas, help with translation between informal mathematics and Lean statements, and generate candidate proof fragments.

This combination was new for me. I am used to mathematical collaboration involving conversations with people, paper sketches, whiteboards, and eventually LaTeX. Here there was another kind of interaction: a fast, imperfect, but useful assistant for exploring the proof space, coupled with Lean as a formal system that refused to accept anything vague.

Mathematical judgement still matters, in what we wanted to prove as well as what counts as an informative proof. But I came away from the experience more positive about the role these tools can play in research, especially when paired with formal verification rather than used as a substitute for it.

The Experimental Side: Exploring 4-bit Alphabets

The theory says that better scalar alphabets should exist. The experiments in the paper ask what they look like. For four-bit alphabets, we impose sign symmetry and include zero. Since multiplying all scalar levels by a common positive factor does not change the represented directions, we normalise the smallest positive value to one.

For block dimension $d=16$ (as used in NVIDIA NVFP4), the optimised positive levels found in the paper are approximately

$1,\;2.12,\;3.40,\;5.04,\;7.25,\;10.5,\;13.2.$

The optimised alphabet is best across the tested dimensions. But what I find most interesting is how close E2M1 is to the optimum, especially compared with integer/fixed-point and pure powers-of-two formats.

E2M1 is the four-bit format used in NVFP4. The results suggest a geometric explanation for why it works well in block-scaled machine learning settings. The key point is that, for this bit-width and block size, the E2M1 levels lie surprisingly close to the levels obtained by directly optimising the product-code directional covering problem.
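For readers who want to poke at this comparison themselves, here is a rough Python sketch. It is not the procedure used in the paper: it samples random directions, tries a geometric grid of scales, rounds each coordinate to the nearest level, and keeps the best cosine found. The grid search is a heuristic, so each per-direction angle is an overestimate and the sampled maximum is only an indication of the true covering radius. The E2M1 magnitudes are the usual ones (assumed here), and all function names are hypothetical.

```python
import math
import random

E2M1_POS = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]           # assumed E2M1 magnitudes
OPT16_POS = [1.0, 2.12, 3.40, 5.04, 7.25, 10.5, 13.2]    # rounded values quoted above

def symmetric(pos):
    """Sign-symmetric alphabet with zero, built from positive levels."""
    return sorted([0.0] + pos + [-v for v in pos])

def angle_to_nearest(u, alphabet, scales):
    """Smallest angle (radians) found between u and round(u / s) over the given scales."""
    best = 0.0
    norm_u = math.sqrt(sum(v * v for v in u))
    for s in scales:
        c = [min(alphabet, key=lambda a: abs(a - v / s)) for v in u]
        norm_c = math.sqrt(sum(v * v for v in c))
        if norm_c == 0.0:
            continue  # everything rounded to zero at this scale
        cos = sum(a * b for a, b in zip(u, c)) / (norm_u * norm_c)
        best = max(best, cos)
    return math.acos(min(1.0, best))

def sampled_worst_angle(pos_levels, n=16, trials=50, seed=0):
    """Largest angle seen over random directions; a heuristic estimate only."""
    rng = random.Random(seed)
    alphabet = symmetric(pos_levels)
    worst = 0.0
    for _ in range(trials):
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # The smallest scale maps the largest |u_i| onto the top level;
        # larger scales shrink u towards the small levels.
        s0 = max(abs(v) for v in u) / max(pos_levels)
        scales = [s0 * 2.0 ** (k / 16) for k in range(49)]
        worst = max(worst, angle_to_nearest(u, alphabet, scales))
    return worst

for name, levels in (("E2M1", E2M1_POS), ("optimised", OPT16_POS)):
    print(name, math.degrees(sampled_worst_angle(levels)))
```

With so few samples the two numbers are noisy, so treat any difference as suggestive rather than as a reproduction of the paper's results.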

Conclusion

The main message is that the geometric lens provides value when considering how to design low-precision number formats for machine learning. Once a block scale is present, the scalar values inside the block are not merely approximating real numbers independently. Together, they are choosing a direction. The scalar alphabet therefore determines a product-structured spherical code.

There are three consequences I find useful.

First, product structure has unavoidable limitations. A coordinatewise scalar alphabet cannot cover directions as well as an unconstrained spherical code with the same number of raw codewords.

Second, standard scalar formats are not forced by the geometry. Floating-point, fixed-point, and two’s complement are natural formats for many reasons, but the directional covering objective points to other possibilities.

Third, E2M1 comes out looking very good. The optimised alphabet is better in the sampled worst-case metric, but E2M1 is close enough that its empirical success in block-scaled low-precision settings has a clean geometric explanation.

The Lean formalisation matters because it pins down the definitions, checks the asymptotic comparisons, verifies the separation from standard formats, and formalises the scale-search theorem used in the experiments.

The AI aspect matters to me for a different reason. It changed the way this paper was developed. The experience was not one of handing mathematics over to a machine, but of using AI as part of a proof-development workflow. For me, that was new, and it was positive.

So perhaps a promising design question for future low-precision formats should be phrased less like the traditional “Which scalar values approximate real numbers best?” and more like “Which scalar values, when used coordinatewise inside a block, represent directions best?” That seems like a useful question in the world of block-scaled arithmetic.

AI in Education

In the UK, the Parliamentary Select Committee on Education is currently holding an inquiry into “The use of Artificial Intelligence and EdTech in Education”. I am reproducing my submission to this inquiry below, in case it is of value to others. The text can also be found – alongside all other submissions made – at the official parliamentary website.

[Embedded document: submission AIE0024]

Block Number Formats are (Still!) Direction Preservers

In my previous post, I argued that block number formats can be understood geometrically as direction preservers. That argument relied on an idealization: once a block direction had been chosen, its scale could be set optimally as an arbitrary real number.

Real hardware formats do not usually work that way. In many practical schemes, block scales are quantized very coarsely, sometimes all the way down to powers of two. In particular, in the MX specification, all the concrete compliant formats use E8M0 scaling.

So does the directional picture I painted in my last post survive this brutal scaling? Here I will argue, in the first of what I hope will be a short sequence of follow-up blog posts, that it does.

From ideal block scales to quantized block scales

Recall the setup from the earlier post. A vector is partitioned into blocks, $v = (v_1,\dots,v_B)$ , and each block is approximated as $\hat v_b = \beta_b m_b$ , where $m_b$ is a low-precision mantissa vector and $\beta_b$ is a scalar block scale.

In the earlier post, I assumed that $\beta_b$ was an arbitrary real value, chosen optimally in the least-squares sense. That gave the ideal blockwise representation $\hat v$ .

Now let us keep the same mantissa vectors $m_b$ , but suppose that the scale factors themselves must be quantized. Write the implemented scale as $\tilde \beta_b$ , so that the represented block becomes $\tilde v_b = \tilde \beta_b m_b$ .

It is convenient to define the multiplicative scale error $x_b = \frac{\tilde \beta_b}{\beta_b}$ . Then $\tilde v_b = x_b \hat v_b$ .

Note that, of course, quantizing the block scale does not change the chosen direction of a block at all; it only changes its length. So the only directional distortion comes from the relative rescaling of different blocks.

An exact cosine formula

Let $\alpha_b = \frac{\|\hat v_b\|^2}{\sum_j \|\hat v_j\|^2}$ , so that $\alpha_b$ is the fraction of the ideal projected vector’s energy contained in block $b$ .

Then it can be shown that $\cos(\hat v,\tilde v) = \frac{\sum_b \alpha_b x_b}{\sqrt{\sum_b \alpha_b x_b^2}}$ (the proof is included at the end of this post).

So the effect of scale quantization on direction depends only on how uneven the factors $x_b$ are across blocks. If all blocks were rescaled by the same factor, direction would be unchanged.
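The formula is easy to sanity-check numerically. The sketch below builds a random blocked vector to stand in for $\hat v$, rescales each block by an arbitrary factor $x_b$, and compares the directly computed cosine with the value given by the formula. The block sizes, the range of $x_b$, and the function names are arbitrary illustrative choices.

```python
import math
import random

def cosine(a, b):
    """Cosine of the angle between two flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cosine_from_formula(blocks, x):
    """Evaluate sum(alpha_b x_b) / sqrt(sum(alpha_b x_b^2)) from block energies."""
    energies = [sum(v * v for v in blk) for blk in blocks]
    total = sum(energies)
    alpha = [e / total for e in energies]
    mu = sum(a * xb for a, xb in zip(alpha, x))
    q = sum(a * xb * xb for a, xb in zip(alpha, x))
    return mu / math.sqrt(q)

rng = random.Random(0)
blocks = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]  # stands in for v-hat
x = [rng.uniform(0.5, 2.0) for _ in blocks]                       # per-block scale errors
v_hat = [v for blk in blocks for v in blk]
v_tilde = [xb * v for blk, xb in zip(blocks, x) for v in blk]
print(cosine(v_hat, v_tilde), cosine_from_formula(blocks, x))
```

The two printed values agree to floating-point precision, and setting every $x_b$ to the same constant gives a cosine of one.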

Exponent-only power-of-two scaling

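Now consider the coarsest plausible case: each block scale is rounded to the nearest power of two, with "nearest" measured in the log domain (that is, the exponent $\log_2 \beta_b$ is rounded to the nearest integer). Then each multiplicative error satisfies $x_b \in [2^{-1/2},\,2^{1/2}]$ .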

So, from our exact cosine formula, we are interested in how small $\frac{\sum_b \alpha_b x_b}{\sqrt{\sum_b \alpha_b x_b^2}}$ can be, when all the $x_b$ lie in the interval $[2^{-1/2},\,2^{1/2}]$ .

A simple inequality shows that the answer depends only on the two extreme values of the interval. If all the block rescaling factors lie in $[\ell,u]$ , then

$\frac{\sum_b \alpha_b x_b}{\sqrt{\sum_b \alpha_b x_b^2}}\ge \frac{2\sqrt{\ell u}}{\ell+u}$ (proof at the end of this blog post).

In the power-of-two case we have $\ell=2^{-1/2}$ , $u=2^{1/2}$ , so $\ell u=1$ , and therefore

$\cos(\hat v,\tilde v)\ge \frac{2}{2^{-1/2}+2^{1/2}} = \frac{2\sqrt2}{3}\approx 0.943$ .

Equivalently, $\angle(\hat v,\tilde v)\le 20^\circ$ .

So even if every block scale is rounded to the nearest power of two, the resulting vector remains within about $20^\circ$ of the ideally scaled one.
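As a quick empirical check, the sketch below draws random block energies and random ideal scales, rounds each scale to a power of two by rounding its base-2 logarithm to the nearest integer (my reading of "nearest" here), and tracks the worst cosine seen. The function names, the number of blocks, and the range of scales are illustrative assumptions.

```python
import math
import random

def round_scale_to_power_of_two(beta):
    """Round log2(beta) to the nearest integer, so beta_tilde / beta is in [2^-1/2, 2^1/2]."""
    return 2.0 ** round(math.log2(beta))

def cosine_after_scale_rounding(energies, scales):
    """Cosine between the ideally scaled and power-of-two-scaled vectors."""
    total = sum(energies)
    x = [round_scale_to_power_of_two(b) / b for b in scales]
    mu = sum(e / total * xb for e, xb in zip(energies, x))
    q = sum(e / total * xb * xb for e, xb in zip(energies, x))
    return mu / math.sqrt(q)

BOUND = 2 * math.sqrt(2) / 3
rng = random.Random(0)
worst = 1.0
for _ in range(10_000):
    energies = [rng.random() for _ in range(8)]
    scales = [2.0 ** rng.uniform(-8, 8) for _ in range(8)]
    worst = min(worst, cosine_after_scale_rounding(energies, scales))

print(worst, BOUND, math.degrees(math.acos(BOUND)))
```

Random trials typically stay well above the bound, since hitting it requires every block to sit at one of the two extreme rounding errors with a particular energy split.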

That is the main result of this post.

One striking feature of the bound is that it does not depend on the dimension of the vector. The reason is that the worst case is already attained by a two-group energy split: some blocks rounded up, others rounded down. Once those two groups exist, adding more blocks or more dimensions does not make the bound worse, as is apparent from the proof below.
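The two-group worst case can be written down explicitly. The proof at the end of the post shows the minimum occurs when the weighted mean of the $x_b$ equals the harmonic mean of $\ell$ and $u$; solving for the energy fraction on the rounded-down group gives $u/(\ell+u)$, which is my own rearrangement rather than something stated in the post. The sketch below checks that this split attains the bound.

```python
import math

def two_group_cosine(alpha_low, low, high):
    """Cosine when a fraction alpha_low of the energy is scaled by `low`, the rest by `high`."""
    mu = alpha_low * low + (1 - alpha_low) * high
    q = alpha_low * low ** 2 + (1 - alpha_low) * high ** 2
    return mu / math.sqrt(q)

low, high = 2 ** -0.5, 2 ** 0.5
bound = 2 * math.sqrt(low * high) / (low + high)

# Energy fraction on the `low` group that makes the weighted mean equal
# the harmonic mean 2 * low * high / (low + high); equals 2/3 here.
alpha_star = high / (low + high)

print(alpha_star, two_group_cosine(alpha_star, low, high), bound)
```

So in the power-of-two case the extremal configuration puts two thirds of the energy in blocks whose scales were rounded down by a factor of $\sqrt2$ and one third in blocks rounded up by the same factor, regardless of how many blocks or dimensions are involved.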

20 degrees is less than it sounds

Our everyday intuition may tell us that this angle is not huge, but it’s not that small either. In a sense, that’s true. But angles behave very differently in high-dimensional spaces. In high dimension, most random vectors are almost orthogonal to one another: their angle is close to $90^\circ$ , so a guarantee that an approximation remains within $20^\circ$ of the original vector is much stronger than it would sound in two or three dimensions.
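The concentration of angles is easy to see empirically. This small sketch samples pairs of Gaussian random vectors and prints the range of angles between them in a few dimensions; the dimensions, sample count, and function name are arbitrary illustrative choices.

```python
import math
import random

def random_angle_degrees(dim, rng):
    """Angle in degrees between two independent Gaussian random vectors."""
    a = [rng.gauss(0, 1) for _ in range(dim)]
    b = [rng.gauss(0, 1) for _ in range(dim)]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    # Clamp to guard against rounding pushing the cosine just outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (na * nb)))))

rng = random.Random(0)
for dim in (2, 16, 1024):
    angles = [random_angle_degrees(dim, rng) for _ in range(500)]
    print(dim, round(min(angles), 1), round(max(angles), 1))
```

In two dimensions the angles spread over almost the whole range from 0 to 180 degrees, whereas in dimension 1024 they cluster within a few degrees of 90. Against that backdrop, staying within 20 degrees of the target is a strong guarantee.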

Beyond power-of-two

We’ve analysed power-of-two scaling here for two reasons: because it’s in a sense the crudest possible floating-point rounding, and because it’s commonly used in real hardware designs.

That does not mean it’s optimal. But it does raise two further questions. Firstly, we’ve assumed here that the exponent range is sufficiently wide – what if it’s not? Secondly – and relatedly – how much better can this angular bound get by spending some of the scale bits on greater precision?

My view is that the answer becomes clearer once a tensor-wide high-precision scale is introduced, something NVIDIA has recently done. In that setting, the block scales get relieved of their additional duty to capture global magnitude. This will be the subject of the next post on the topic!

Proofs

Readers not interested in the algebra can safely skip this section.

Cosine formula

Recall that $\tilde v_b = x_b \hat v_b$ for each block $b$ .

Then, because the blocks occupy disjoint coordinates, $\langle \hat v,\tilde v\rangle = \sum_b \langle \hat v_b,\tilde v_b\rangle = \sum_b x_b \|\hat v_b\|^2$ .

Therefore $\cos(\hat v,\tilde v) = \frac{\langle \hat v,\tilde v\rangle}{\|\hat v\|\,\|\tilde v\|} = \frac{\sum_b x_b \|\hat v_b\|^2}{\sqrt{\sum_b \|\hat v_b\|^2}\sqrt{\sum_b x_b^2 \|\hat v_b\|^2}}$ .

Now, as per the main blog post, define $\alpha_b = \frac{\|\hat v_b\|^2}{\sum_j \|\hat v_j\|^2}$ .

Writing $S=\sum_j \|\hat v_j\|^2$ , so that $\|\hat v_b\|^2=\alpha_b S$ , the numerator becomes $S\sum_b \alpha_b x_b$ and the denominator becomes $S\sqrt{\sum_b \alpha_b x_b^2}$ , giving

$\cos(\hat v,\tilde v)=\frac{\sum_b \alpha_b x_b}{\sqrt{\sum_b \alpha_b x_b^2}}$ .

$\square$
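As a sanity check on the cosine formula, here is a small sketch that compares the cosine computed directly from the two vectors with the value given by the formula in terms of $\alpha_b$ and $x_b$. The block count, block size and random choice of $x_b \in [2^{-1/2}, 2^{1/2}]$ are arbitrary illustrative assumptions, as are the function names.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cosine_from_formula(blocks, x):
    """sum_b alpha_b x_b / sqrt(sum_b alpha_b x_b^2), alpha_b = energy fraction of block b."""
    energies = [sum(t * t for t in blk) for blk in blocks]
    total = sum(energies)
    alpha = [e / total for e in energies]
    mu = sum(a * xb for a, xb in zip(alpha, x))
    q = sum(a * xb * xb for a, xb in zip(alpha, x))
    return mu / math.sqrt(q)

rng = random.Random(1)
# v_hat: 4 blocks of 8 coordinates each (illustrative sizes).
blocks = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
# One multiplicative error factor per block.
x = [rng.uniform(2 ** -0.5, 2 ** 0.5) for _ in blocks]

v_hat = [t for blk in blocks for t in blk]
v_tilde = [xb * t for blk, xb in zip(blocks, x) for t in blk]

direct = cosine(v_hat, v_tilde)
formula = cosine_from_formula(blocks, x)
print(direct, formula)
```

The two printed values should agree up to floating-point rounding.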

20 degree bound

Assume that all the multiplicative error factors lie in an interval $x_b \in [\ell,u]$ with $u > \ell > 0$ .

Let $\mu := \sum_b \alpha_b x_b,\qquad q := \sum_b \alpha_b x_b^2$ .

Then the cosine is just $\mu/\sqrt q$ . Since each $x_b\in[\ell,u]$ , we have

$(x_b-\ell)(x_b-u)\le 0$ .

Expanding this gives

$x_b^2 \le (\ell+u)x_b - \ell u$ .

Multiplying by $\alpha_b$ and summing over $b$ gives

$q \le (\ell+u)\mu - \ell u$ .

Therefore $\frac{\mu^2}{q}\ge \frac{\mu^2}{(\ell+u)\mu-\ell u}$ .

Now the weighted mean $\mu$ also lies in the interval $[\ell,u]$ , so it remains to minimize

$\frac{\mu^2}{(\ell+u)\mu-\ell u}$ over $\mu\in[\ell,u]$ .

Differentiating shows that the minimum occurs at $\mu=\frac{2\ell u}{\ell+u}$ , the harmonic mean of $\ell$ and $u$ .

Substituting this value gives

$\frac{\mu^2}{q}\ge \frac{4\ell u}{(\ell +u)^2}$ ,

and therefore

$\frac{\mu}{\sqrt q}\ge \frac{2\sqrt{\ell u}}{\ell +u}$ .

So we have proved that

$\frac{\sum_b \alpha_b x_b}{\sqrt{\sum_b \alpha_b x_b^2}}\ge \frac{2\sqrt{\ell u}}{\ell+u}$ .

Finally, in the power-of-two case we have

$\ell=2^{-1/2}$ and $u=2^{1/2}$ , so $\ell u=1$ , and hence

$\cos(\hat v,\tilde v)\ge \frac{2}{2^{-1/2}+2^{1/2}} = \frac{2\sqrt2}{3}$ .

Numerically,

$\frac{2\sqrt2}{3}\approx 0.943$ ,

$\angle(\hat v,\tilde v)\le \arccos\left(\frac{2\sqrt2}{3}\right) < 20^\circ$ .

$\square$
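The bound and the two-group worst case can also be checked numerically. In the sketch below, the energy fraction $\ell/(\ell+u)$ assigned to the blocks with $x_b = u$ is my own working: it is the split that makes $\mu$ equal to the harmonic mean, and for the power-of-two case it works out to one third. The function names are hypothetical.

```python
import math

def cosine_lower_bound(lo, hi):
    """2*sqrt(lo*hi)/(lo+hi): the bound proved above for x_b in [lo, hi]."""
    return 2.0 * math.sqrt(lo * hi) / (lo + hi)

def two_group_cosine(lo, hi, alpha_hi):
    """Cosine when a fraction alpha_hi of the energy has x_b = hi and the rest x_b = lo."""
    mu = alpha_hi * hi + (1.0 - alpha_hi) * lo
    q = alpha_hi * hi ** 2 + (1.0 - alpha_hi) * lo ** 2
    return mu / math.sqrt(q)

lo, hi = 2 ** -0.5, 2 ** 0.5
bound = cosine_lower_bound(lo, hi)

# This split makes mu equal to the harmonic mean 2*lo*hi/(lo+hi).
alpha_worst = lo / (lo + hi)
worst = two_group_cosine(lo, hi, alpha_worst)
print(bound, worst, math.degrees(math.acos(bound)))

# Scan other energy splits: none should fall below the bound.
scan = min(two_group_cosine(lo, hi, k / 1000) for k in range(1001))
print(scan)
```

Under these assumptions the two-group split attains the bound exactly, which is consistent with the bound being independent of dimension.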

Block Number Formats are Direction Preservers

I’ve recently returned from the SIAM PP 2026 conference and as always, conferences help provide time for research reflection. One thing I’ve been reflecting on during my journey back is the various explanations people give for why the machine learning world is so keen on block number formats (MX, NVFP, etc.) – see my earlier blog post on MX if you need a primer. Many hardware engineers tend to answer that they lead to efficient storage, or efficient arithmetic, or improved data transfer bandwidth, which are all true. But I think there’s another complementary answer that’s less well discussed (if indeed it is discussed at all). I hope this blog post might help stimulate some discussion of this complementary take.

On the numerical side, at first glance it might seem surprising that despite these formats representing numbers with very limited precision, large neural networks often tolerate them remarkably well, with little loss in accuracy. In my experience, most explanations focus on dynamic range, quantization noise, the inherent noise robustness of neural networks, or calibration techniques. But I suspect there is also a simple geometric way to think about what these formats are doing: Block number formats help preserve vector direction. And for many machine learning computations, preserving direction matters far more than preserving exact numerical values.

Block formats inherently represent direction and magnitude

Consider a vector $v$ whose coordinates are partitioned into blocks $v = (v_1, v_2, \dots, v_B)$ .

In a block format, each block is represented using a shared scale and low-precision mantissas. For ease of discussion, we’ll consider the simplest case here, where scales are allowed to take arbitrary real values. In general, they may be much more restricted, e.g. to powers of two.

Each block is approximated as $\hat v_b = \beta_b m_b$

where

$m_b$ is a vector of low-precision mantissas, and

$\beta_b$ is a scalar shared scaling factor.

In other words, each block can be thought of as a direction (encoded by the mantissas) multiplied by a magnitude (the shared scale). Strictly speaking, the mantissa vectors $m_b$ need not be normalized, and in many formats their entries may have quite different magnitudes (for example in integer mantissa formats such as MXINT). However this does not change the geometry. The representation $\hat{v}_b = \beta_b m_b$ is invariant to rescaling of $m_b$ : multiplying $m_b$ by any constant simply rescales $\beta_b$ by the inverse factor. What matters for the approximation is therefore only the direction of $m_b$ , i.e. the one-dimensional subspace it spans.

Often we don’t think of it like this, but broadly speaking this is what has happened: block scaling allows us to decouple magnitude and direction representation. This resembles the familiar decomposition $v = \|v\|\frac{v}{\|v\|}$ of a vector into its magnitude and direction, but applied locally within blocks.

If the mantissa vector $m_b$ points roughly in the same direction as the original block $v_b$ , then scaling it appropriately produces a good approximation of that block.

OK, but does preserving directions block by block actually preserve the direction of the whole vector? It turns out that the answer is yes.

Direction Preservation

Let us make the reasonable assumption that the scale of each block $\beta_b$ is not chosen arbitrarily, but rather is the best possible scale for that block in the least squares sense, for whatever mantissa vector we choose, i.e. $\beta_b = \arg\min_{\beta} \|v_b - \beta m_b\|^2$ . Then $\hat v_b$ is the orthogonal projection of $v_b$ onto the line spanned by $m_b$ .

So to what extent do the approximate and the original block vectors point in the same direction? We can measure this through the cosine similarity of each block: $\rho_b = \frac{\langle v_b,\hat v_b\rangle}{\|v_b\|\|\hat v_b\|}$.

Equally, we can measure the cosine similarity of the full vectors (the concatenation of the original blocks versus the concatenation of the approximated blocks): $\rho = \frac{\langle v,\hat v\rangle}{\|v\|\|\hat v\|}$.

My aim here is to explain why small error in direction at block level leads to small error at vector level.

First, let’s define $w_b = \frac{\|v_b\|^2}{\|v\|^2}$ , which we can think of as the fraction of the vector’s energy contained in block $b$ ; these add to 1 over the whole vector. Now we can state the result:

Theorem (Block Cosines)

Under the blockwise least-squares scaling, $\rho = \sqrt{\sum_{b=1}^{B} w_b \rho_b^2 }$ .

For proof, see end of post.

In simple terms, this theorem states that the cosine similarity of the whole vector is the energy-weighted RMS of the block cosine similarities.
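Here is a minimal numerical sketch of the theorem. The mantissa choice in `quantize_block` (rounding each entry to an integer in $[-3, 3]$ relative to the block’s peak) is a hypothetical stand-in for a real format, chosen only for illustration; the least-squares scale $\beta_b = \langle v_b, m_b\rangle / \langle m_b, m_b\rangle$ is the closed-form minimiser of $\|v_b - \beta m_b\|^2$.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def quantize_block(v_b, levels=3):
    """Hypothetical mantissa choice: integers in [-levels, levels], then least-squares scale."""
    peak = max(abs(t) for t in v_b)
    m_b = [round(levels * t / peak) for t in v_b]
    beta_b = dot(v_b, m_b) / dot(m_b, m_b)   # argmin over beta of ||v_b - beta m_b||^2
    return [beta_b * m for m in m_b]

rng = random.Random(2)
# 8 blocks of 16 coordinates each (illustrative sizes).
blocks = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
approx = [quantize_block(b) for b in blocks]

# Cosine similarity of the full vectors.
v = [t for b in blocks for t in b]
v_hat = [t for b in approx for t in b]
rho = cosine(v, v_hat)

# Energy-weighted RMS of the block cosine similarities.
total = dot(v, v)
w = [dot(b, b) / total for b in blocks]
rho_b = [cosine(b, a) for b, a in zip(blocks, approx)]
rms = math.sqrt(sum(wb * rb ** 2 for wb, rb in zip(w, rho_b)))
print(rho, rms)
```

The two printed numbers should coincide up to rounding, whatever mantissas are chosen, provided the scale is the least-squares one.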

What are the implications?

The weights $w_b$ represent how much of the vector’s energy lies in each block. Blocks that contain very little energy contribute very little to the final direction. The important consequence is that direction errors do not accumulate catastrophically across blocks. Instead, the overall directional error simply depends on a weighted average of the block direction errors. In other words, if block number formats preserve the directions of individual blocks, they automatically preserve the direction of the entire vector.

Many core operations in machine learning depend heavily on vector direction. Notably, during training, stochastic gradient descent updates are already in the form of magnitude (learning rate) + direction. We already have a knob controlling magnitude (the learning rate); what matters is that the direction is preserved. In attention mechanisms and embedding, directional similarity measures are very important. Even for the humble dot product, the workhorse of inference, preservation of direction means that small perturbations in input give rise to only small perturbations in output, so the dot product behaves robustly.

Conclusion

Block floating-point and similar formats like block mini-float, MX, NVFP, are usually explained in terms of dynamic range and quantization noise. But geometrically, I like the perspective that they do something simpler: they approximate each block of a vector as direction × magnitude.

And as long as the block directions are preserved reasonably well, the direction of the whole vector is preserved too.

I think this is a useful intuition as to why very low-precision formats can work so well in modern machine learning systems. Block number formats are, in a very real sense, direction preservers. From this perspective, such low-precision block formats succeed not because they represent individual numbers accurately, but because they preserve the geometry of vectors.

Lots of extensions of this kind of analysis are of course possible. To name just a few:

We’ve focused on vectors, but tensor-level scaling may have interesting interplay with batching during training, for example.
We made the simplifying assumption that scaling factors were real-valued, but these can be restricted, most significantly to powers of two, and the analysis would need to be modified to incorporate that change.
We’ve not discussed mantissas at all; lots more of interest could be said here.
Potentially this approach could help provide some guidance on the empirical sizing of blocks in a block representation.

If anyone would like to work with me on this topic, do let me know your ideas.

Proof of the theorem

Readers not interested in the algebra can safely skip this section.

For each block $b$ , the approximation $\hat v_b = \beta_b m_b$ with $\beta_b$ chosen by least squares is the orthogonal projection of $v_b$ onto the line spanned by $m_b$ .

So we can write $v_b = \hat v_b + r_b$ where $r_b$ is orthogonal to $\hat v_b$ .

Taking the inner product with $\hat v_b$ gives $\langle v_b,\hat v_b\rangle = \|\hat v_b\|^2$ .

Now sum over blocks. Because the blocks correspond to disjoint coordinates,

$\langle v,\hat v\rangle = \sum_b \langle v_b,\hat v_b\rangle = \sum_b \|\hat v_b\|^2 = \|\hat v\|^2$ .

Therefore

$\rho = \frac{\langle v,\hat v\rangle}{\|v\|\|\hat v\|} = \frac{\|\hat v\|}{\|v\|}$ .

Recall $\rho_b = \frac{\langle v_b,\hat v_b\rangle}{\|v_b\|\|\hat v_b\|}$ .

Using $\langle v_b,\hat v_b\rangle=\|\hat v_b\|^2$ , we obtain

$\rho_b = \frac{\|\hat v_b\|}{\|v_b\|}$ .

Hence

$\|\hat v_b\|^2 = \rho_b^2 \|v_b\|^2$ .

Summing over blocks gives

$\|\hat v\|^2 = \sum_b \|\hat v_b\|^2 = \sum_b \rho_b^2 \|v_b\|^2$ .

Dividing by $\|v\|^2$ , and writing

$w_b = \frac{\|v_b\|^2}{\|v\|^2}$ ,

gives

$\frac{\|\hat v\|^2}{\|v\|^2} = \sum_b w_b \rho_b^2$ .

Since $\rho = \|\hat v\|/\|v\|$ , we obtain

$\rho = \sqrt{\sum_b w_b \rho_b^2 }$ .

$\square$

California / FPGA 2026

This month I took a trip to California for the FPGA 2026 conference, together with my two new PhD students Ben Zhang and Bardia Zadeh, which I combined with a number of visits in the San Francisco Bay Area in the days preceding the conference. This post provides a brief summary of my visit.

First up on our travels was dinner with Rocco Salvia. Rocco was my Research Assistant – we worked together some years ago on automating the analysis of average-case numerical behaviour of reduced-precision floating-point computation. He is now working for Zoox, the robotaxi company owned by Amazon where his first-rate engineering skills are being put to good use!

The following day we went to visit Max Willsey at UC Berkeley and his PhD student Russel Arbore. Max and I (together with others) organised a Dagstuhl workshop on e-graphs recently, and we went to pick up the research conversations we left behind a month ago in Germany and spend some good quality whiteboard time together. Russel and Max are working on some really exciting problems in program analysis.

That afternoon, we had the chance to catch up with my old friend and colleague Satnam Singh, now working for the startup harmonic.fun. Harmonic is a really exciting company, combining modern AI tools with Lean-based formal theorem proving. Expect great things here.

George, Satnam, Bardia and Ben, enjoying coffee near the Harmonic office.

The following morning, we went to visit AMD, with whom I have longstanding collaborations. Amongst others, we met my two former PhD students Sam Bayliss and Erwei Wang there, and discussed our ongoing work on e-graphs and on efficient machine learning, as well as finding out the latest work in Sam’s team at AMD including their release of Triton-XDNA.

That afternoon we visited NVIDIA’s stunning HQ to meet with Rajarshi Roy and Atefeh Sohrabizadeh. I know both of them through the EPSRC International Centre-to-Centre grant I led: Rajarshi was introduced to me by Bryan Catanzaro as the author of some really interesting work on reinforcement learning for computer arithmetic design, and spoke at our EPSRC project’s annual workshop. Atefeh was a PhD student affiliated with the Centre (advised by Jason Cong, UCLA) and spent some time visiting my research group. We heard about the recent NVIDIA work on AI models to aid software engineering and on combined speech and language.

Rajarshi, Ben, Bardia, George and Atefeh at NVIDIA

It has long been a tradition that Peter Cheung, when in the Bay Area, organises a get-together of alumni of the Circuits and Systems Group (formerly Information Engineering group, when Bob Spence was Head of Group). This time was no exception – we met up with many of our department’s former students, and some came to a great dinner too. It’s always a delight to hear about the activity of our alumni, spread across the tech companies in the Bay Area.

After a flying visit to my former PhD student’s family, we then made it down to Monterey for FPGA 2026. Regular readers of this blog will know that I’ve been attending FPGA for more than 20 years, have been Program Chair, General Chair, Finance Chair and am now a Steering Committee member of the conference. So it always feels a little like “coming home”. I also love Monterey – despite the touristy bits – and am a fan of Steinbeck’s writing in which he immortalised Monterey with some of the best opening lines ever (of Cannery Row): “Cannery Row in Monterey in California is a poem, a stink, a grating noise, a quality of light, a habit, a nostalgia, a dream.”

This year, the general chair of FPGA 2026 was Jing Li, and the great programme was put together by Grace Zgheib.

My favourite paper at FPGA this year also won the best paper prize. Duc Hoang and colleagues identified that Kolmogorov-Arnold Networks are a natural fit to the LUT-based neural networks my group pioneered e.g. [1,2]. They form a really interesting design point, overcoming the exponential scaling of area with the product of precision and neuron fanin present in both my SOTA work with Marta Andronic and earlier work like Xilinx LogicNets, to produce a design that scales exponentially only in the precision. I very much enjoyed reading this paper and seeing it presented, and I think it opens up new areas of future work in this area.

Duc Hoang and his coauthor Aarush Gupta, receiving the best paper award from Grace and Jing

I also particularly enjoyed the work of Shun Katsumi, Emmet Murphy and Lana Josipović (ETH Zurich) on eager execution in elastic circuits. I previously collaborated with Lana on elastic circuits, and it’s great to see the latest work in this area and the use of formal verification tools to prove correctness of performance enhancements. I had a very nice discussion with Lana about possible ways to take this work further.

Rouzbeh Pirayadi, Ayatallah Elakhras, Mirjana Stojilović and Paolo Ienne (EPFL) had a really interesting paper on avoiding the overhead of load-store queues in dynamic high-level synthesis. (This paper was also the runner-up best paper).

From my own institution, Oliver Cosgrove, Ally Donaldson and John Wickerson had a great paper on fuzzing FPGA place and route tools, which has led a vendor to fix a bug they uncovered through their tool.

There were many other good papers, but just to mention a couple that I found particularly aligned to my own interests: EdgeSort on the design of line-rate streaming sorters and HACE on extracting CDFGs from RTL were both really interesting to hear presented.

It was great to be reunited with so many international colleagues and to provide my new students Bardia and Ben with the chance to begin their journey of integration into this welcoming community.

Me, John, Bardia, Oliver and Ben after the end of the final conference session

FCCM 2025

I’ve recently returned from the IEEE International Symposium on Field-Programmable Custom Computing Machines (known as FCCM). I used to attend FCCM regularly in the early 2000s, and while I have continued to publish there, I have not attended myself for some years. I tried a couple of years ago, but ended up isolated with COVID in Los Angeles. In contrast, I am pleased to report that the conference is in good health!

The conference kicked off on the evening of the 4th May, with a panel discussion on the topic of “The Future of FCCMs Beyond Moore’s Law”, of which I was invited to be part, alongside industrial colleagues Chris Lavin and Madhura Purnaprajna from AMD, Martin Langhammer from Altera, and Mark Shand from Waymo. Many companies have tried and failed to produce lasting post-Moore alternatives to the FPGA and the microprocessor over the decades I’ve been in the field and some of these ideas and architectures (less commonly, associated compiler flows / design tools) have been very good. But, as Keynes said, “markets can remain irrational longer than you can remain solvent”. So instead of focusing on commercial realities, I tried to steer the panel discussion towards the genuinely fantastic opportunities our academic field has for a future in which power, performance and area innovation changes become a matter of intellectual advances in architecture and compiler technology rather than riding the wave of technology miniaturisation (itself, of course, the product of great advances by others).

The evening panel, as imagined by AI. I’m 2nd to left. The AI tool was clearly unaware of Martin’s height difference!

The following day, the conference proper kicked off. Some highlights for me from other authors included the following papers aligned with my general interests:

AutoNTT: Automatic Architecture Design and Exploration for Number Theoretic Transform Acceleration on FPGAs from Simon Fraser University, presented by Zhenman Fang.
RealProbe: An Automated and Lightweight Performance Profiler for In-FPGA Execution of High-Level Synthesis Designs from Georgia Tech, presented by Jiho Kim from Callie Hao’s group.
High Throughput Matrix Transposition on HBM-Enabled FPGAs from the University of Southern California (Viktor Prasanna’s group).
ITERA-LLM: Boosting Sub-8-Bit Large Language Model Inference Through Iterative Tensor Decomposition from my colleague Christos Bouganis’ group at Imperial College, presented by Keran Zheng.
Guaranteed Yet Hard to Find: Uncovering FPGA Routing Convergence Paradox from Mirjana Stojilovic’s group at EPFL – and winner of this year’s best paper prize!

In addition, my own group had two full papers at FCCM this year:

Banked Memories for Soft SIMT Processors, joint work between Martin Langhammer (Altera) and me, where Martin has been able to augment his ultra-high-frequency soft-processor with various useful memory structures. This is probably the last paper of Martin’s PhD – he’s done great work in both developing a super-efficient soft-processor and in forcing the FPGA community to recognise that some published clock frequency results are really quite poor and that people should spend a lot longer thinking about the physical aspects of their designs if they want to get high performance.
NeuraLUT-Assemble: Hardware-aware Assembling of Sub-Neural Networks for Efficient LUT Inference, joint work between my PhD student Marta Andronic and me. I think this is a landmark paper in terms of the results that Marta has been able to achieve. Compared to her earlier NeuraLUT work which I’ve blogged on previously, she has added a way to break down large LUTs into trees of smaller LUTs, and a hardware-aware way to learn sparsity patterns that work best, localising nonlinear interactions in these neural networks to within lookup tables. The impact of these changes on the area and delay of her designs is truly impressive.

Martin explaining efficient memory structures for soft processors

Overall, it was well worth attending. Next year, Callie will be hosting FCCM in Atlanta.

Notes on Computational Learning Theory

This blog collects some of my notes on classical computational learning theory, based on my reading of Kearns and Vazirani. The results are (almost) all from their book; the sloganising (and mistakes, no doubt) are mine.

The Probably Approximately Correct (PAC) Framework

Definition (Instance Space). An instance space is a set, typically denoted $X$ . It is the set of objects we are trying to learn about.

Definition (Concept). A concept $c$ over $X$ is a subset of the instance space $X$ .

Although not covered in Kearns and Vazirani, in general it is possible to generalise beyond Boolean membership to some degree of uncertainty or fuzziness – I hope to cover this in a future blog post.

Definition (Concept Class). A concept class ${\mathcal C}$ is a set of concepts, i.e. ${\mathcal C} \subset \mathcal{P}(X)$ , where $\mathcal P$ denotes power set. We will follow Kearns and Vazirani and also use $c$ to denote the corresponding indicator function $c : X \to \{0,1\}$ .

In PAC learning, we assume ${\mathcal C}$ is known, but the target concept $c \in {\mathcal C}$ is not. However, it doesn’t seem a jump to allow for an unknown target class, in an appropriate approximation setting – I would welcome comments on established frameworks for this.

Definition (Target Distribution). A target distribution ${\mathcal D}$ is a probability distribution over $X$ .

In PAC learning, we assume ${\mathcal D}$ is unknown.

Definition (Oracle). An oracle is a function $EX(c,{\mathcal D})$ taking a concept and a distribution, and returning a labelled example $(x, c(x))$ where $x$ is drawn randomly and independently from ${\mathcal D}$.

Definition (Error). The error of a hypothesis concept $h \in {\mathcal C}$ with reference to a target concept $c \in {\mathcal C}$ and target distribution ${\mathcal D}$ is $\text{error}(h) = Pr_{x \in {\mathcal D}}\left\{ c(x) \neq h(x) \right\}$, where $Pr$ denotes probability.
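To make the oracle and error definitions concrete, here is a small sketch. The instance space $X = [0,1]$, the interval concepts, the uniform distribution and all function names are my own illustrative choices, not taken from Kearns and Vazirani. The error is estimated by Monte Carlo sampling from the oracle; for this particular pair the true error is the length of the region where the two intervals disagree, namely $0.05$.

```python
import random

def make_interval(a, b):
    """Indicator function of the concept [a, b] over the instance space X = [0, 1]."""
    return lambda x: 1 if a <= x <= b else 0

def make_oracle(c, draw):
    """EX(c, D): each call returns one labelled example (x, c(x)) with x drawn from D."""
    def ex():
        x = draw()
        return x, c(x)
    return ex

def estimate_error(h, oracle, n):
    """Monte Carlo estimate of Pr_{x ~ D}[c(x) != h(x)] using n oracle calls."""
    return sum(1 for x, label in (oracle() for _ in range(n)) if h(x) != label) / n

rng = random.Random(3)
c = make_interval(0.2, 0.7)        # target concept (unknown to a learner)
h = make_interval(0.25, 0.7)       # hypothesis
ex = make_oracle(c, rng.random)    # D = uniform on [0, 1), an illustrative choice
est = estimate_error(h, ex, 100_000)
print(est)
```

Note that the estimator only ever sees the target concept through the oracle, which mirrors the way a PAC learner interacts with it.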

Definition (Representation Scheme). A representation scheme for a concept class ${\mathcal C}$ is a function ${\mathcal R} : \Sigma^* \to {\mathcal C}$ where $\Sigma$ is a finite alphabet of symbols (or – following the Real RAM model – a finite alphabet augmented with real numbers).

Definition (Representation Class). A representation class is a concept class together with a fixed representation scheme for that class.

Definition (Size). We associate a size $\text{size}(\sigma)$ with each string from a representation alphabet $\sigma \in \Sigma^*$. We similarly associate a size with each concept $c$ via the size of its minimal representation $\text{size}(c) = \min_{{\mathcal R}(\sigma) = c} \text{size}(\sigma)$.

Definition (PAC Learnable). Let ${\mathcal C}$ and ${\mathcal H}$ be representation classes over $X$, where ${\mathcal C} \subseteq {\mathcal H}$. We say that concept class ${\mathcal C}$ is PAC learnable using hypothesis class ${\mathcal H}$ if there exists an algorithm that, given access to an oracle, when learning any target concept $c \in {\mathcal C}$ over any distribution ${\mathcal D}$ on $X$, and for any given $0 < \epsilon < 1/2$ and $0 < \delta < 1/2$, with probability at least $1-\delta$, outputs a hypothesis $h \in {\mathcal H}$ with $\text{error}(h) \leq \epsilon$.

Definition (Efficiently PAC Learnable). Let ${\mathcal C}_n$ and ${\mathcal H}_n$ be representation classes over $X_n$, where ${\mathcal C}_n \subseteq {\mathcal H}_n$ for all $n$. Let $X_n = \{0,1\}^n$ or $X_n = {\mathbb R}^n$. Let $X = \cup_{n \geq 1} X_n$, ${\mathcal C} = \cup_{n \geq 1} {\mathcal C_n}$, and ${\mathcal H} = \cup_{n \geq 1} {\mathcal H_n}$. We say that concept class ${\mathcal C}$ is efficiently PAC learnable using hypothesis class ${\mathcal H}$ if there exists an algorithm that, given access to a constant-time oracle, when learning any target concept $c \in {\mathcal C}_n$ over any distribution ${\mathcal D}$ on $X$, and for any given $0 < \epsilon < 1/2$ and $0 < \delta < 1/2$:

Runs in time polynomial in $n$ , $\text{size}(c)$ , $1/\epsilon$ , and $1/\delta$ , and
With probability at least $1-\delta$ , outputs a hypothesis $h \in {\mathcal H}$ with $\text{error}(h) \leq \epsilon$ .

There is much of interest to unpick in these definitions. Firstly, notice that we have defined a family of classes parameterised by dimension $n$, allowing us to talk in terms of asymptotic behaviour as dimensionality increases. Secondly, note the key parameters of PAC learnability: $\delta$ (the ‘probably’ bit) and $\epsilon$ (the ‘approximate’ bit). The first of these captures the idea that we may get really unlucky with our calls to the oracle, and get misleading training data. The second captures the idea that we are not aiming for certainty in our final classification accuracy; some pre-defined tolerance is allowable. Thirdly, note the requirements of efficiency: polynomial scaling in dimension, in size of the concept (complex concepts can be harder to learn), in error rate (the more sloppy, the easier), and in probability of algorithm failure to find a suitable hypothesis (you need to pay for more certainty). Finally, and most intricately, notice the separation of concept class from hypothesis class. We require the hypothesis class to be at least as general, so the concept we’re trying to learn is actually one of the returnable hypotheses, but it can be strictly more general. This is to avoid the case where the restricted hypothesis classes are harder to learn; Kearns and Vazirani, following Pitt and Valiant, give the example that learning the concept class 3-term DNF using the hypothesis class 3-term DNF is intractable (unless RP = NP), yet the same concept class is efficiently PAC learnable using the more general hypothesis class 3-CNF.
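As a toy illustration of the ‘probably approximately correct’ idea, the sketch below learns an interval concept on $X = [0,1]$ by returning the tightest interval that contains every positive example. The target interval, the uniform distribution and the function names are all illustrative assumptions of mine. Because the hypothesis always lies inside the target, its true error under the uniform distribution can be computed exactly as the total length of the two uncovered gaps.

```python
import random

def learn_interval(samples):
    """Tightest interval containing every positive example (None if there are none)."""
    positives = [x for x, label in samples if label == 1]
    if not positives:
        return None
    return min(positives), max(positives)

def true_error(h, a, b):
    """Error under uniform D on [0, 1] when the target is [a, b] and h lies inside it."""
    if h is None:
        return b - a
    return (h[0] - a) + (b - h[1])

def run(m, a=0.2, b=0.7, seed=4):
    """Draw m labelled examples of the target [a, b] and return the learner's true error."""
    rng = random.Random(seed)
    samples = []
    for _ in range(m):
        x = rng.random()
        samples.append((x, 1 if a <= x <= b else 0))
    return true_error(learn_interval(samples), a, b)

for m in (10, 100, 1000, 10000):
    print(m, run(m))
```

With more examples the error tends to shrink, but for any fixed sample size an unlucky draw can still leave a large gap, which is exactly what $\delta$ accounts for.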

Occam’s Razor

Definition (Occam Algorithm). Let $\alpha \geq 0$ and $0 \leq \beta < 1$ be real constants. An algorithm is an $(\alpha,\beta)$-Occam algorithm for ${\mathcal C}$ using ${\mathcal H}$ if, on an input sample $S$ of cardinality $m$ labelled by membership in $c \in {\mathcal C}_n$, the algorithm outputs a hypothesis $h \in {\mathcal H}$ such that:

$h$ is consistent with $S$ , i.e. there is no misclassification on $S$
$\text{size}(h) \leq \left(n \cdot \text{size}(c)\right)^\alpha m^\beta$

Thus Occam algorithms produce succinct hypotheses consistent with data. Note that the size of the hypothesis is allowed to grow only mildly – if at all – with the size of the dataset (via $\beta$ ). Note, however, that there is nothing in this definition that suggests predictive power on unseen samples.

Definition (Efficient Occam Algorithm). An $(\alpha,\beta)$ -Occam algorithm is efficient iff its running time is polynomial in $n$ , $m$ , and $\text{size}(c)$ .

Theorem (Occam’s Razor). Let $A$ be an efficient $(\alpha,\beta)$-Occam algorithm for ${\mathcal C}$ using ${\mathcal H}$. Let ${\mathcal D}$ be the target distribution over $X$, let $c \in {\mathcal C}_n$ be the target concept, $0 < \epsilon, \delta \leq 1$. Then there is a constant $a > 0$ such that if $A$ is given as input a random sample $S$ of $m$ examples drawn from oracle $EX(c,{\mathcal D})$, where $m$ satisfies $m \geq a \left( \frac{1}{\epsilon} \log \frac{1}{\delta} + \left(\frac{\left( n \cdot \text{size}(c) \right)^\alpha}{\epsilon}\right)^\frac{1}{1-\beta}\right)$, then $A$ runs in time polynomial in $n$, $\text{size}(c)$, $1/\epsilon$ and $1/\delta$ and, with probability at least $1 - \delta$, the output $h$ of $A$ satisfies $\text{error}(h) \leq \epsilon$.

This is a technically dense presentation, but it’s a philosophically beautiful result. Let’s unpick it a bit, so its essence is not obscured by notation. In summary, simple rules that are consistent with prior observations have predictive power! The ‘simple’ part here comes from $(\alpha,\beta)$, and the predictive power comes from the bound on $\text{error}(h)$. Of course, one needs sufficient observations (the complex lower bound on $m$) for this to hold. Notice that as $\beta$ approaches 1 – so that, by the definition of an Occam algorithm, we get close to being able to memorise our entire training set – we need an arbitrarily large training set (memorisation doesn’t generalise).
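The blow-up as $\beta \to 1$ is easy to see by evaluating the right-hand side of the sample-size condition. In this sketch the constant $a$ is unknown, so I simply set it to 1; the base of the logarithm is likewise absorbed into $a$, and the parameter values are arbitrary illustrations. Only the trend is meaningful, not the absolute numbers.

```python
import math

def occam_sample_bound(eps, delta, n, size_c, alpha, beta, a=1.0):
    """Right-hand side of the sample-size condition; `a` is the unspecified constant."""
    complexity = (n * size_c) ** alpha
    return a * (math.log(1.0 / delta) / eps
                + (complexity / eps) ** (1.0 / (1.0 - beta)))

# Same problem, increasingly permissive hypothesis-size growth (beta closer to 1).
for beta in (0.0, 0.5, 0.9):
    m = occam_sample_bound(eps=0.1, delta=0.05, n=10, size_c=5, alpha=1.0, beta=beta)
    print(beta, m)
```

The exponent $1/(1-\beta)$ dominates quickly: the closer the algorithm is allowed to get to memorisation, the more data it needs before consistency implies anything about unseen samples.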

Vapnik-Chervonenkis (VC) Dimension

Definition (Behaviours). The set of behaviours on $S = \{x_1, \ldots, x_m\}$ that are realised by ${\mathcal C}$ , is defined by $\Pi_{\mathcal C}(S) = \left\{ \left(c(x_1), \ldots, c(x_m)\right) | c \in {\mathcal C} \right\}$ .

Each of the points in $S$ is either included in a given concept or not. Each tuple $\left(c(x_1), \ldots, c(x_m)\right)$ then forms a kind of fingerprint of $S$ according to a particular concept. The set of behaviours is the set of all such fingerprints across the whole concept class.

Definition (Shattered). A set $S$ is shattered by ${\mathcal C}$ iff $\Pi_{\mathcal C}(S) = \{0,1\}^{|S|}$ .

Note that $\{0,1\}^{|S|}$ is the largest set of behaviours that’s possible, i.e. the set of behaviours is all possible behaviours. So we can think of a set as being shattered by a concept class iff there’s no combination of inclusion/exclusion of its points that isn’t realised by at least one concept in the class.

Definition (Vapnik-Chervonenkis Dimension). The VC dimension of ${\mathcal C}$, denoted $VCD({\mathcal C})$, is the cardinality of the largest set shattered by ${\mathcal C}$. If arbitrarily large finite sets can be shattered by ${\mathcal C}$, then $VCD({\mathcal C}) = \infty$.

VC dimension in this sense captures the ability of ${\mathcal C}$ to discern between samples.
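For a finite instance space, behaviours and shattering can be computed by brute force. The sketch below does this for a concept class of my own choosing: closed intervals with integer endpoints over $\{0,\ldots,5\}$, together with the empty concept. The function names are hypothetical, and the search is exponential, so this is only practical for tiny examples.

```python
from itertools import combinations

def behaviours(concepts, points):
    """Pi_C(S): the set of label tuples that the concepts realise on `points`."""
    return {tuple(c(x) for x in points) for c in concepts}

def is_shattered(concepts, points):
    """True iff every one of the 2^|S| labellings is realised."""
    return len(behaviours(concepts, points)) == 2 ** len(points)

def vc_dimension_on(concepts, domain):
    """Size of the largest subset of the finite `domain` shattered by `concepts`."""
    best = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(concepts, s) for s in combinations(domain, k)):
            best = k
        else:
            break   # subsets of shattered sets are shattered, so larger sets fail too
    return best

domain = list(range(6))
# Illustrative concept class: closed intervals [a, b], plus the empty concept.
intervals = [lambda x, a=a, b=b: 1 if a <= x <= b else 0
             for a in domain for b in domain if a <= b]
intervals.append(lambda x: 0)

print(vc_dimension_on(intervals, domain))
print(sorted(behaviours(intervals, (0, 2, 4))))
```

Any two points are shattered by intervals, but no three points are: the labelling that includes the outer two points while excluding the middle one cannot be realised by a single interval.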

Theorem (PAC-learning in Low VC Dimension). Let ${\mathcal C}$ be any concept class. Let ${\mathcal H}$ be any representation class of VC dimension $d$. Let $A$ be any algorithm taking a set of $m$ labelled examples of a concept $c \in {\mathcal C}$ and producing a concept in ${\mathcal H}$ that is consistent with the examples. Then there exists a constant $c_0$ such that $A$ is a PAC learning algorithm for ${\mathcal C}$ using ${\mathcal H}$ when it is given examples from $EX(c,{\mathcal D})$, and when $m \geq c_0 \left( \frac{1}{\epsilon} \log \frac{1}{\delta} + \frac{d}{\epsilon} \log \frac{1}{\epsilon} \right)$.

Let’s take a look at the similarity between this theorem and Occam’s razor, presented in the previous section of this blog post. Both bounds have a similar feel, but the VCD-based bound does not depend on $\text{size}(c)$; indeed it’s possible that the size of hypotheses is infinite and yet the VCD is still finite.

As the theorem below shows, the linear dependence on VCD achieved in the above theorem is actually the best one can do.

Theorem (PAC-learning Minimum Samples). Any algorithm for PAC-learning a concept class of VC dimension $d$ must use $\Omega(d/\epsilon)$ examples in the worst case.

Definition (Layered DAG). A layered DAG is a DAG in which each vertex is associated with a layer $\ell \in {\mathbb N}$ and in which the edges are always from some layer $\ell$ to the next layer $\ell+1$ . Vertices at layer 0 have indegree 0 and are referred to as input nodes. Vertices at other layers are referred to as internal nodes. There is a single output node of outdegree 0.

Definition ($G$-composition). For a layered DAG $G$ and a concept class ${\mathcal C}$, the $G$-composition of ${\mathcal C}$ is the class of all concepts that can be obtained by: (i) associating a concept $c_i \in {\mathcal C}$ with each vertex $N_i$ in $G$, (ii) applying the concept at each node to its predecessor nodes.

Notice that this way we can think of the internal nodes as forming a Boolean circuit with a single output; the $G$ -composition is the concept class we obtain by restricting concepts to only those computable with the structure $G$ . This is a very natural way of composing concepts – so what kind of VCD arises through this composition? This theorem provides an answer:

Theorem (VCD Compositional Bound). Let $G$ be a layered DAG with $n$ input nodes and $s \geq 2$ internal nodes, each of indegree $r$ . Let ${\mathcal C}$ be a concept class over ${\mathbb R}^r$ of VC dimension $d$ , and let ${\mathcal C}_G$ be the $G$ -composition of ${\mathcal C}$ . Then $VCD({\mathcal C}_G) \leq 2ds \log(es)$ .

Weak PAC Learnability

Definition (Weak PAC Learning). Let ${\mathcal C}$ be a concept class and let $A$ be an algorithm that is given access to $EX(c,{\mathcal D})$ for target concept $c \in {\mathcal C}_n$ and distribution ${\mathcal D}$ . $A$ is a weak PAC learning algorithm for ${\mathcal C}$ using ${\mathcal H}$ if there exist polynomials $p(\cdot,\cdot)$ and $q(\cdot,\cdot)$ such that $A$ outputs a hypothesis $h \in {\mathcal H}$ that with probability at least $1/q(n,\text{size}(c))$ satisfies $\text{error}(h) \leq 1/2 - 1/p(n,\text{size}(c))$ .

Kearns and Vazirani justifiably describe weak PAC learning as “the weakest demand we could place on an algorithm in the PAC setting without trivialising the problem”. If $p$ and $q$ were allowed to be exponential rather than polynomial functions of $n$ , the problem would be trivial: take a fixed-size random sample of the concept, memorise it, and guess each label with probability 50% outside the memorised sample. The remarkable result is that efficient weak PAC learnability and efficient PAC learnability coincide for an appropriate PAC hypothesis class, based on ternary majority trees.

Definition (Ternary Majority Tree). A ternary majority tree with leaves from ${\mathcal H}$ is a tree where each non-leaf node computes a majority (voting) function of its three children, and each leaf is labelled with a hypothesis from ${\mathcal H}$ .
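As an illustration of this definition, a ternary majority tree can be sketched as a small recursive data structure. The names `Leaf`, `Majority` and `evaluate`, and the choice of integer instances with labels in $\{0,1\}$ , are my own assumptions for the example rather than anything taken from the book.

```python
from dataclasses import dataclass
from typing import Callable, Union

Hypothesis = Callable[[int], int]  # maps an instance to a label in {0, 1}


@dataclass
class Leaf:
    h: Hypothesis  # a hypothesis drawn from H


@dataclass
class Majority:
    left: "Tree"
    middle: "Tree"
    right: "Tree"


Tree = Union[Leaf, Majority]


def evaluate(tree: Tree, x: int) -> int:
    """Evaluate the tree on x: leaves apply their hypothesis, inner nodes vote."""
    if isinstance(tree, Leaf):
        return tree.h(x)
    votes = evaluate(tree.left, x) + evaluate(tree.middle, x) + evaluate(tree.right, x)
    return 1 if votes >= 2 else 0


# Three toy leaf hypotheses (illustrative only).
always_zero = Leaf(lambda x: 0)
always_one = Leaf(lambda x: 1)
parity = Leaf(lambda x: x % 2)

# One vote for 0 and one for 1 cancel out, so parity decides.
tree = Majority(always_one, always_zero, parity)
print([evaluate(tree, x) for x in range(4)])
```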

Theorem (Weak PAC learnability is PAC learnability). Let ${\mathcal C}$ be any concept class and ${\mathcal H}$ any hypothesis class. Then if ${\mathcal C}$ is efficiently weakly PAC learnable using ${\mathcal H}$ , it follows that ${\mathcal C}$ is efficiently PAC learnable using a hypothesis class of ternary majority trees with leaves from ${\mathcal H}$ .

Kearns and Vazirani provide an algorithm to learn this way. The details are described in their book, but the basic principle is based on “boosting”, as developed in the lemma to follow.

Definition (Filtered Distributions). Given a distribution ${\mathcal D}$ and a hypothesis $h_1$ we define ${\mathcal D}_2$ to be the distribution obtained by flipping a fair coin and, on a heads, drawing from $EX(c,{\mathcal D})$ until $h_1$ agrees with the label; on a tails, drawing from $EX(c,{\mathcal D})$ until $h_1$ disagrees with the label. Invoking a weak learning algorithm on data from this new distribution yields a new hypothesis $h_2$ . Similarly, we define ${\mathcal D}_3$ to be the distribution obtained by drawing examples from $EX(c,{\mathcal D})$ until we find an example on which $h_1$ and $h_2$ disagree.
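The two filtered distributions are just rejection sampling on top of the original oracle, which the following sketch tries to make concrete. Everything named here ( `draw_from_d2` , `draw_from_d3` , the toy oracle `ex` and the toy hypothesis `h1` ) is hypothetical, and the sketch ignores the question of how many oracle calls are needed, which is exactly what the book has to control.

```python
import random
from typing import Callable, Tuple

Example = Tuple[int, int]          # (x, c(x)); integer instances are an assumption
Oracle = Callable[[], Example]     # one call stands in for one draw from EX(c, D)
Hypothesis = Callable[[int], int]


def draw_from_d2(ex: Oracle, h1: Hypothesis, rng: random.Random) -> Example:
    """Fair coin: heads waits for an example h1 labels correctly, tails for an error."""
    want_correct = rng.random() < 0.5
    while True:  # never terminates if the wanted event has probability 0 under D
        x, label = ex()
        if (h1(x) == label) == want_correct:
            return x, label


def draw_from_d3(ex: Oracle, h1: Hypothesis, h2: Hypothesis) -> Example:
    """Wait for an example on which h1 and h2 disagree."""
    while True:  # same caveat: loops forever if h1 and h2 agree everywhere
        x, label = ex()
        if h1(x) != h2(x):
            return x, label


# Toy setting: D is uniform on {0, ..., 9} and the target concept is parity.
rng = random.Random(0)


def ex() -> Example:
    x = rng.randrange(10)
    return x, x % 2


def h1(x: int) -> int:
    # Wrong only on x == 0, so its error under D is 0.1.
    return 1 if x == 0 else x % 2


sample = [draw_from_d2(ex, h1, rng) for _ in range(2000)]
error_on_d2 = sum(h1(x) != label for x, label in sample) / len(sample)
print(f"empirical error of h1 under D_2: {error_on_d2:.3f}")
```

In this toy run, $h_1$ has error 0.1 under ${\mathcal D}$ , yet its empirical error under ${\mathcal D}_2$ should sit close to one half, because the coin flip forces half of the draws to be points it gets wrong.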

What’s going on in these constructions is quite clever: $h_2$ has been constructed so that it must contain new information about $c$ , compared to $h_1$ ; $h_1$ has, by construction, no advantage over a coin flip on ${\mathcal D}_2$ . Similarly, $h_3$ contains new information about $c$ not already contained in $h_1$ and $h_2$ , namely on the points where they disagree. Thus, one would expect that hypotheses that work in these three cases could be combined to give us a better overall hypothesis. This is indeed the case, as the following lemma shows.

Lemma (Boosting). Let $g(\beta) = 3 \beta^2 - 2 \beta^3$ . Let the distributions ${\mathcal D}$ , ${\mathcal D}_2$ , ${\mathcal D}_3$ be as defined above, and let $h_1$ , $h_2$ and $h_3$ satisfy $\text{error}_{\mathcal D}(h_1) \leq \beta$ , $\text{error}_{{\mathcal D}_2}(h_2) \leq \beta$ , $\text{error}_{{\mathcal D}_3}(h_3) \leq \beta$ . Then if $h = \text{majority}(h_1, h_2, h_3)$ , it follows that $\text{error}_{\mathcal D}(h) \leq g(\beta)$ .

The function $g$ is monotonically increasing on $[0,1]$ and satisfies $g(\beta) < \beta$ for $0 < \beta < 1/2$ , since $\beta - g(\beta) = \beta(1-\beta)(1-2\beta)$ . Hence by combining three hypotheses with only marginally better accuracy than flipping a coin, the boosting lemma tells us that we can obtain a strictly stronger hypothesis. The algorithm for (strong) PAC learnability therefore involves recursively calling this boosting procedure, leading to the majority-tree-based hypothesis class. Of course, one needs to show that the depth of the recursion is not too large and that we can sample from the filtered distributions with not too many calls to the overall oracle $EX(c,{\mathcal D})$ , so that the polynomial complexity bound in the PAC definition is maintained. Kearns and Vazirani include these two results in the book.
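A quick numerical sketch shows how repeated application of $g$ drives an error bound just below one half down towards zero. The helper `levels_needed` is my own illustration: it counts applications of $g$ only, and says nothing about the sampling cost at each level, which is the harder part of the analysis in the book.

```python
def g(beta: float) -> float:
    """Error bound of the majority of three hypotheses, per the boosting lemma."""
    return 3 * beta**2 - 2 * beta**3


def levels_needed(beta: float, epsilon: float) -> int:
    """Count how many times g must be applied to bring beta down to at most epsilon."""
    if not 0 <= beta < 0.5:
        raise ValueError("beta must lie in [0, 1/2)")
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    levels = 0
    while beta > epsilon:
        beta = g(beta)  # one more level of majority-of-three
        levels += 1
    return levels


# Each level strictly improves the bound when 0 < beta < 1/2.
beta = 0.45
for level in range(5):
    print(level, round(beta, 6))
    beta = g(beta)

print("levels to reach 0.01 from 0.45:", levels_needed(0.45, 0.01))
```

Note that $g(1/2) = 1/2$ : a hypothesis with no advantage at all cannot be boosted, which is why the weak learner must beat a coin flip by at least an inverse polynomial.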

Learning from Noisy Data

Up until this point, we have only dealt with correctly classified training data. The introduction of a noisy oracle allows us to move beyond this limitation.

Definition (Noisy Oracle). A noisy oracle $\hat{EX}^\eta( c, {\mathcal D})$ extends the earlier idea of an oracle with an additional noise parameter $0 \leq \eta < 1/2$ . This oracle behaves identically to $EX$ except that it returns the wrong classification with probability $\eta$ .
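A noisy oracle is easy to mock up as a wrapper around a clean one, which may help when reading the definitions that follow. In this sketch the factory `make_noisy_oracle` and the toy oracle `ex` are hypothetical names, and the noise is applied independently on each call, which is how I read the definition above.

```python
import random
from typing import Callable, Tuple

Example = Tuple[int, int]       # (x, label)
Oracle = Callable[[], Example]  # one call stands in for one draw from EX(c, D)


def make_noisy_oracle(ex: Oracle, eta: float, rng: random.Random) -> Oracle:
    """Wrap a clean oracle so that each returned label is flipped with probability eta."""
    if not 0 <= eta < 0.5:
        raise ValueError("eta must satisfy 0 <= eta < 1/2")

    def noisy() -> Example:
        x, label = ex()
        if rng.random() < eta:
            label = 1 - label  # report the wrong classification
        return x, label

    return noisy


# Toy setting: D is uniform on {0, ..., 9} and the target concept is parity.
rng = random.Random(1)


def ex() -> Example:
    x = rng.randrange(10)
    return x, x % 2


noisy = make_noisy_oracle(ex, eta=0.2, rng=rng)
draws = [noisy() for _ in range(5000)]
flipped = sum(label != x % 2 for x, label in draws) / len(draws)
print(f"observed fraction of flipped labels: {flipped:.3f}")
```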

Definition (PAC Learnable from Noisy Data). Let ${\mathcal C}$ be a concept class and let ${\mathcal H}$ be a representation class over $X$ . Then ${\mathcal C}$ is PAC learnable from noisy data using ${\mathcal H}$ if there exists an algorithm such that: for any concept $c \in {\mathcal C}$ , any distribution ${\mathcal D}$ on $X$ , any $0 \leq \eta < 1/2$ , and any $0 < \epsilon < 1$ , $0 < \delta < 1$ and $\eta_0$ with $\eta \leq \eta_0 < 1/2$ , given access to a noisy oracle $\hat{EX}^\eta( c, {\mathcal D})$ and inputs $\epsilon$ , $\delta$ , $\eta_0$ , with probability at least $1 - \delta$ the algorithm outputs a hypothesis concept $h \in {\mathcal H}$ with $\text{error}(h) \leq \epsilon$ . If the runtime of the algorithm is polynomial in $n$ , $1/\epsilon$ , $1/\delta$ and $1/(1 - 2\eta_0)$ then ${\mathcal C}$ is efficiently learnable from noisy data using ${\mathcal H}$ .

Let’s unpick this definition a bit. The main difference from the PAC definition is simply the addition of noise via the oracle and an additional parameter $\eta_0$ which bounds the error of the oracle; thus the algorithm is allowed to know in advance an upper bound on the noisiness of the data, and an efficient algorithm is allowed to take more time on more noisy data.

Kearns and Vazirani address PAC learnability from noisy data in an indirect way, via the use of a slightly different framework, introduced below.

Definition (Statistical Oracle). A statistical oracle $STAT(c, {\mathcal D})$ takes queries of the form $(\chi, \tau)$ where $\chi : X \times \{0,1\} \to \{0,1\}$ and $0 < \tau \leq 1$ , and returns a value $\hat{P}_\chi$ satisfying $P_\chi - \tau \leq \hat{P}_\chi \leq P_\chi + \tau$ where $P_\chi = Pr_{x \in {\mathcal D}}[ \chi(x, c(x)) = 1 ]$ .

Definition (Learnable from Statistical Queries). Let ${\mathcal C}$ be a concept class and let ${\mathcal H}$ be a representation class over $X$ . Then ${\mathcal C}$ is efficiently learnable from statistical queries using ${\mathcal H}$ if there exists a learning algorithm $A$ and polynomials $p(\cdot, \cdot, \cdot)$ , $q(\cdot, \cdot, \cdot)$ and $r(\cdot,\cdot,\cdot)$ such that: for any $c \in {\mathcal C}$ , any distribution ${\mathcal D}$ over $X$ and any $0 < \epsilon < 1/2$ , if given access to $STAT(c,{\mathcal D})$ , the following hold. (i) For every query $(\chi,\tau)$ made by $A$ , the predicate $\chi$ can be evaluated in time $q(1/\epsilon, n, \text{size}(c))$ , and $1/\tau \leq r(1/\epsilon, n, \text{size}(c))$ , so that the tolerance requested is never smaller than an inverse polynomial, (ii) $A$ has execution time bounded by $p(1/\epsilon, n, \text{size}(c))$ , (iii) $A$ outputs a hypothesis $h \in {\mathcal H}$ that satisfies $\text{error}(h) \leq \epsilon$ .

So a statistical oracle can be asked about a whole predicate $\chi$ , for any given tolerance $\tau$ . The oracle must return an estimate of the probability that this predicate holds (where the probability is over the distribution over $X$ ). It is, perhaps, not entirely obvious how to relate this back to the more obvious noisy oracle used above. However, it is worth noting that one can construct a statistical oracle that works with high probability by taking enough samples from a standard oracle, and then returning the relative frequency of $\chi$ evaluating to 1 on that sample. Kearns and Vazirani provide an intricate construction to efficiently sample from a noisy oracle to produce a statistical oracle with high probability. In essence, this then allows an algorithm that can learn from statistical queries to be used to learn from noisy data, resulting in the following theorem.
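The simple noise-free simulation mentioned above (sample from a standard oracle and return a relative frequency) can be sketched as follows. The sample size comes from Hoeffding's inequality, which is my choice for the illustration; the function names and the toy oracle are hypothetical, and this is not the more intricate construction Kearns and Vazirani use for the noisy case.

```python
import math
import random
from typing import Callable, Tuple

Example = Tuple[int, int]             # (x, c(x))
Oracle = Callable[[], Example]        # one call stands in for one draw from EX(c, D)
Predicate = Callable[[int, int], int]  # chi : X x {0,1} -> {0,1}


def hoeffding_samples(tau: float, delta: float) -> int:
    """Samples m with 2 * exp(-2 * m * tau^2) <= delta, i.e. m >= ln(2/delta) / (2 tau^2)."""
    if not (0 < tau <= 1 and 0 < delta < 1):
        raise ValueError("need 0 < tau <= 1 and 0 < delta < 1")
    return math.ceil(math.log(2 / delta) / (2 * tau**2))


def simulate_stat(ex: Oracle, chi: Predicate, tau: float, delta: float) -> float:
    """Estimate P_chi within tolerance tau, with probability at least 1 - delta."""
    m = hoeffding_samples(tau, delta)
    hits = sum(chi(*ex()) for _ in range(m))
    return hits / m


# Toy setting: D is uniform on {0, ..., 9} and the target concept is parity.
rng = random.Random(2)


def ex() -> Example:
    x = rng.randrange(10)
    return x, x % 2


def chi(x: int, label: int) -> int:
    # Holds exactly for x in {1, 3}, so the true value P_chi is 0.2.
    return 1 if (label == 1 and x < 5) else 0


estimate = simulate_stat(ex, chi, tau=0.05, delta=0.001)
print(f"estimate of P_chi: {estimate:.3f} (true value 0.2)")
```

The number of samples grows like $1/\tau^2$ , which is why the definition insists that the tolerance is never smaller than an inverse polynomial.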

Theorem (Learnable from Statistical Queries means Learnable from Noisy Data). Let ${\mathcal C}$ be a concept class and let ${\mathcal H}$ be a representation class over $X$ . Then if ${\mathcal C}$ is efficiently learnable from statistical queries using ${\mathcal H}$ , ${\mathcal C}$ is also efficiently PAC learnable using ${\mathcal H}$ in the presence of classification noise.

Hardness Results

I mentioned earlier in this post that Pitt and Valiant showed that sometimes we want more general hypothesis classes than concept classes: learning the concept class 3-term DNF using the hypothesis class 3-term DNF is intractable, yet the same concept class is efficiently PAC learnable using the more general hypothesis class 3-CNF. So in their chapter Inherent Unpredictability, Kearns and Vazirani turn their attention to the case where a concept class is hard to learn independent of the choice of a hypothesis class. This leads to some quite profound results for those of us interested in Boolean circuits.

We will need some kind of hardness assumption to develop hardness results for learning. In particular, note that if $P = NP$ , then by Occam’s Razor (above) polynomially evaluable hypothesis classes are also polynomially-learnable ones. So we will need to do two things: focus our attention on polynomially evaluable hypothesis classes (or we can’t hope to learn them polynomially), and make a suitable hardness assumption. The latter requires a very brief detour into some results commonly associated with cryptography.

Let ${\mathbb Z}_N^* = \{ i \; | \; 0 < i < N \; \wedge \text{gcd}(i, N) = 1 \}$ . We define the cubing function $f_N : {\mathbb Z}_N^* \to {\mathbb Z}_N^*$ by $f_N(x) = x^3 \text{ mod } N$ . Let $\varphi$ denote Euler’s totient function. Then if $\varphi(N)$ is not a multiple of three, it turns out that $f_N$ is bijective, so we can talk of a unique discrete cube root.
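A tiny worked example may help here. The sketch below takes $N = 5 \times 11 = 55$ (my own choice of toy modulus), checks by brute force that cubing permutes ${\mathbb Z}_N^*$ , and inverts it using the exponent $d = 3^{-1} \text{ mod } \varphi(N)$ . Computing $d$ this way needs $\varphi(N)$ , and hence the factorisation of $N$ , which is precisely what is not handed to the solver in the problem defined next. The three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
from math import gcd


def units(N: int) -> list:
    """Elements of Z_N^*: the integers in (0, N) coprime to N."""
    return [i for i in range(1, N) if gcd(i, N) == 1]


def cube(x: int, N: int) -> int:
    """The cubing function f_N(x) = x^3 mod N."""
    return pow(x, 3, N)


# Toy modulus (far too small to be hard): N = p * q with phi(N) not a multiple of 3.
p, q = 5, 11
N = p * q
phi = (p - 1) * (q - 1)  # 40
assert phi % 3 != 0

# Inverse of 3 modulo phi(N); here d = 27, since 3 * 27 = 81 = 2 * 40 + 1.
d = pow(3, -1, phi)


def cube_root(y: int, N: int, d: int) -> int:
    """Invert f_N using the secret exponent d, so that (x^3)^d = x mod N."""
    return pow(y, d, N)


group = units(N)
images = sorted(cube(x, N) for x in group)
print("cubing is a bijection on Z_N^*:", images == group)
print("cube root of", cube(7, N), "mod", N, "is", cube_root(cube(7, N), N, d))
```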

Definition (Discrete Cube Root Problem). Let $p$ and $q$ be two $n$ -bit primes with $\varphi(N)$ not a multiple of 3, where $N = pq$ . Given $N$ and $f_N(x)$ as input, output $x$ .

Definition (Discrete Cube Root Assumption). For every polynomial $P$ , there is no algorithm $A$ that runs in time $P(n)$ and solves the discrete cube root problem with probability at least $1/P(n)$ , where $N = pq$ and the probability is taken over the random choice of $p$ , $q$ , $x$ and any internal randomisation of $A$ .

This Discrete Cube Root Assumption is widely known and studied, and forms the basis of the learning complexity results presented by Kearns and Vazirani.

Theorem (Concepts Computed by Small, Shallow Boolean Circuits are Hard to Learn). Under the Discrete Cube Root Assumption, the representation class of polynomial-size, log-depth Boolean circuits is not efficiently PAC learnable (using any polynomially evaluable hypothesis class).

The result also holds if one removes the log-depth requirement, but this result shows that even by restricting ourselves to only log-depth circuits, hardness remains.

In case any of my blog readers knows: please contact me directly if you’re aware of any resource of positive results on learnability of any compositionally closed non-trivial restricted classes of Boolean circuits.

The construction used to provide the result above for Boolean circuits can be generalised to neural networks:

Theorem (Concepts Computed by Neural Networks are Hard to Learn). Under the Discrete Cube Root Assumption, there is a polynomial $p$ and an infinite family of directed acyclic graphs (neural network architectures) $G = \{G_{n^2}\}_{n \geq 1}$ such that each $G_{n^2}$ has $n^2$ Boolean inputs and at most $p(n)$ nodes, the depth of $G_{n^2}$ is a constant independent of $n$ , but the representation class ${\mathcal C}_G = \cup_{n \geq 1} {\mathcal C}_{G_{n^2}}$ is not efficiently PAC learnable (using any polynomially evaluable hypothesis class), and even if the weights are restricted to be binary.

Through an appropriate natural definition of reduction in PAC learning, Kearns and Vazirani show that the PAC-learnability of all these classes reduces to that of functions computed by deterministic finite automata. So, in particular:

Theorem (Concepts Computed by Deterministic Finite Automata are Hard to Learn). Under the Discrete Cube Root Assumption, the representation class of Deterministic Finite Automata is not efficiently PAC learnable (using any polynomially evaluable hypothesis class).

It is this result that motivates the final chapter of the book.

Experimentation in Learning

As discussed above, the PAC model utilises an oracle that returns labelled samples $(x, c(x))$ . An interesting question is whether more learning power arises if we allow the algorithms to select $x$ themselves, with the oracle returning $c(x)$ , i.e. not just to be shown randomly selected examples but to take charge and test their understanding of the concept.

Definition (Membership Query). A membership query oracle takes any instance $x$ and returns its classification $c(x)$ .

Definition (Equivalence Query). An equivalence query oracle takes a hypothesis concept $h \in {\mathcal C}$ and determines whether there is an instance $x$ on which $c(x) \neq h(x)$ , returning this counterexample if so.
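To make the two oracles concrete, here is a minimal sketch over a small finite instance space, together with a deliberately naive loop that keeps asking for counterexamples and patches a lookup table. All names are hypothetical. The loop is only meant to show the query protocol: it ignores the requirement that hypotheses come from ${\mathcal C}$ , it may need one equivalence query per positive instance, and it is not the algorithm Kearns and Vazirani give for finite automata.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Concept = Callable[[int], int]  # integer instances are an assumption


def make_oracles(c: Concept, domain: Iterable[int]) -> Tuple[Callable, Callable]:
    """Build membership and equivalence oracles for target c over a finite domain."""
    points = list(domain)

    def membership(x: int) -> int:
        return c(x)

    def equivalence(h: Concept) -> Optional[int]:
        # Return some counterexample, or None if h agrees with c everywhere.
        for x in points:
            if h(x) != c(x):
                return x
        return None

    return membership, equivalence


def naive_exact_learner(membership: Callable, equivalence: Callable):
    """Patch a lookup table with each counterexample until none remains."""
    table: Dict[int, int] = {}

    def h(x: int) -> int:
        return table.get(x, 0)  # default guess is 0 on unseen points

    while (x := equivalence(h)) is not None:
        table[x] = membership(x)  # x is now classified correctly for good
    return h, table


# Toy target: multiples of three within {0, ..., 9}.
target = lambda x: 1 if x % 3 == 0 else 0
membership, equivalence = make_oracles(target, range(10))
h, table = naive_exact_learner(membership, equivalence)
print("learned table:", table)
```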

Definition (Learnable From Membership and Equivalence Queries). The representation class ${\mathcal C}$ is efficiently exactly learnable from membership and equivalence queries if there is a polynomial $p(\cdot,\cdot)$ and an algorithm with access to membership and equivalence oracles such that for any target concept $c \in {\mathcal C}_n$ , the algorithm outputs the concept $c$ in time $p(\text{size}(c),n)$ .

There are a couple of things to note about this definition. It appears to be a much stronger requirement than PAC learning, as the concept must be exactly learnt. On the other hand, the existence of these more sophisticated oracles, especially the equivalence query oracle, appears to narrow the scope. Kearns and Vazirani encourage the reader to prove that the true strengthening over PAC-learnability is in the membership queries:

Theorem (Exact Learnability from Membership and Equivalence means PAC-learnable with only Membership). For any representation class ${\mathcal C}$ , if ${\mathcal C}$ is efficiently exactly learnable from membership and equivalence queries, then ${\mathcal C}$ is also efficiently learnable in the PAC model with membership queries.

They then provide an explicit algorithm, based on these two new oracles, to efficiently exactly learn deterministic finite automata.

Theorem (Experiments Make Deterministic Finite Automata Efficiently Learnable). The representation class of Deterministic Finite Automata is efficiently exactly learnable from membership and equivalence queries.

Note the contrast with the hardness result of the previous section: through the addition of experimentation, we have gone from infeasible learnability to efficient learnability. Another very philosophically pleasing result.