Block Number Formats are (Still!) Direction Preservers

In my previous post, I argued that block number formats can be understood geometrically as direction preservers. That argument relied on an idealization: once a block direction had been chosen, its scale could be set optimally as an arbitrary real number.

Real hardware formats do not usually work that way. In many practical schemes, block scales are quantized very coarsely, sometimes all the way down to powers of two. In particular, in the MX specification, all the concrete compliant formats use E8M0 scaling.

So does the directional picture I painted in my last post survive this brutal scaling? Here I will argue, in the first of what I hope will be a short sequence of follow-up blog posts, that it does.

From ideal block scales to quantized block scales

Recall the setup from the earlier post. A vector is partitioned into blocks, v = (v_1,\dots,v_B), and each block is approximated as \hat v_b = \beta_b m_b, where m_b is a low-precision mantissa vector and \beta_b is a scalar block scale.

In the earlier post, I assumed that \beta_b was an arbitrary real value, chosen optimally in the least-squares sense. That gave the ideal blockwise representation \hat v.

Now let us keep the same mantissa vectors m_b, but suppose that the scale factors themselves must be quantized. Write the implemented scale as \tilde \beta_b, so that the represented block becomes \tilde v_b = \tilde \beta_b m_b.

It is convenient to define the multiplicative scale error x_b = \frac{\tilde \beta_b}{\beta_b}. Then \tilde v_b = x_b \hat v_b.

Note that, of course, quantizing the block scale does not change the chosen direction of a block at all; it only changes its length. So the only directional distortion comes from the relative rescaling of different blocks.

An exact cosine formula

Let \alpha_b = \frac{\|\hat v_b\|^2}{\sum_j \|\hat v_j\|^2}, so that \alpha_b is the fraction of the ideal projected vector's energy contained in block b.

Then it can be shown that \cos(\hat v,\tilde v) = \frac{\sum_b \alpha_b x_b}{\sqrt{\sum_b \alpha_b x_b^2}} (the proof is included at the end of this post).

So the effect of scale quantization on direction depends only on how uneven the factors x_b are across blocks. If all blocks were rescaled by the same factor, direction would be unchanged.
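As a sanity check, the formula is easy to verify numerically. The sketch below (NumPy; the random blocks and the rescaling factors are made up purely for illustration) compares the direct cosine between \hat v and \tilde v with the formula above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: 4 blocks of 8 coordinates, with arbitrary
# per-block rescaling factors x_b applied to an "ideal" vector v_hat.
B, k = 4, 8
v_hat_blocks = [rng.standard_normal(k) for _ in range(B)]
x = rng.uniform(0.5, 2.0, size=B)  # multiplicative scale errors
v_tilde_blocks = [x[b] * v_hat_blocks[b] for b in range(B)]

v_hat = np.concatenate(v_hat_blocks)
v_tilde = np.concatenate(v_tilde_blocks)

# Direct cosine between the two full vectors.
cos_direct = v_hat @ v_tilde / (np.linalg.norm(v_hat) * np.linalg.norm(v_tilde))

# The formula: energy fractions alpha_b, then sum(a*x)/sqrt(sum(a*x^2)).
energies = np.array([b @ b for b in v_hat_blocks])
alpha = energies / energies.sum()
cos_formula = (alpha * x).sum() / np.sqrt((alpha * x**2).sum())

assert np.isclose(cos_direct, cos_formula)
```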

Exponent-only power-of-two scaling

Now consider the coarsest plausible case: each block scale is rounded to the nearest power of two (that is, its base-two logarithm is rounded to the nearest integer). Then each multiplicative error satisfies x_b \in [2^{-1/2},\,2^{1/2}].

So, from our exact cosine formula, we are interested in how small \frac{\sum_b \alpha_b x_b}{\sqrt{\sum_b \alpha_b x_b^2}} can be, when all the x_b lie in the interval [2^{-1/2},\,2^{1/2}].

A simple inequality shows that the answer depends only on the two extreme values of the interval. If all the block rescaling factors lie in [\ell,u], then

\frac{\sum_b \alpha_b x_b}{\sqrt{\sum_b \alpha_b x_b^2}}\ge \frac{2\sqrt{\ell u}}{\ell+u} (proof at the end of this blog post).

In the power-of-two case we haveย \ell=2^{-1/2}, u=2^{1/2}, so \ell u=1, and therefore

\cos(\hat v,\tilde v)\ge \frac{2}{2^{-1/2}+2^{1/2}} = \frac{2\sqrt2}{3}\approx 0.943.

Equivalently, \angle(\hat v,\tilde v)\le 20^\circ.

So even if every block scale is rounded to the nearest power of two, the resulting vector remains within about 20^\circ of the ideally scaled one.

That is the main result of this post.
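For the sceptical, the bound is easy to probe numerically. The sketch below (NumPy; the rounding helper and the random scales and energy splits are my own illustrative choices) rounds random block scales to the nearest power of two in log space and checks that the resulting cosine never drops below 2\sqrt2/3:

```python
import numpy as np

rng = np.random.default_rng(1)

def round_pow2(s):
    """Round positive scales to the nearest power of two in log space."""
    return 2.0 ** np.round(np.log2(s))

bound = 2 * np.sqrt(2) / 3  # ~0.943, i.e. within ~20 degrees

for _ in range(1000):
    B = rng.integers(2, 16)
    beta = rng.uniform(1e-3, 1e3, size=B)  # ideal block scales
    x = round_pow2(beta) / beta            # multiplicative errors
    assert np.all((x >= 2**-0.5) & (x <= 2**0.5))
    alpha = rng.dirichlet(np.ones(B))      # random energy fractions
    cos = (alpha * x).sum() / np.sqrt((alpha * x**2).sum())
    assert cos >= bound - 1e-12
```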

One striking feature of the bound is that it does not depend on the dimension of the vector. The reason is that the worst case is already attained by a two-group energy split: some blocks rounded up, others rounded down. Once those two groups exist, adding more blocks or more dimensions does not make the bound worse, as is apparent from the proof below.
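Indeed, the worst case can be written down explicitly. With two blocks, one rounded down to x=\ell and one rounded up to x=u, and the energy split \alpha = (u,\ell)/(\ell+u) chosen so that the weighted mean of the x_b lands on the harmonic mean of \ell and u, the cosine hits the bound exactly (a small check, with the construction above as my own worked example):

```python
import numpy as np

l, u = 2**-0.5, 2**0.5

# Worst-case energy split: alpha = (u, l) / (l + u) puts the weighted
# mean of the x_b at the harmonic mean of l and u.
alpha = np.array([u, l]) / (l + u)
x = np.array([l, u])  # one block rounded down, one rounded up

cos = (alpha * x).sum() / np.sqrt((alpha * x**2).sum())
bound = 2 * np.sqrt(l * u) / (l + u)  # = 2*sqrt(2)/3 here

assert np.isclose(cos, bound)
# Adding further blocks with x_b in {l, u} and the same overall energy
# split leaves the cosine at the bound; it cannot go lower.
```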

20 degrees is less than it sounds

Everyday intuition may say that this angle is not huge, but not that small either. In a sense, that's true. But angles behave very differently in high-dimensional spaces. In high dimension, most random vectors are almost orthogonal to one another: their angle is close to 90^\circ, so a guarantee that an approximation remains within 20^\circ of the original vector is much stronger than it would sound in two or three dimensions.

Beyond power-of-two

We’ve analysed power-of-two scaling here for two reasons: because it’s in a sense the crudest possible floating-point rounding, and because it’s commonly used in real hardware designs.

That does not mean it’s optimal. But it does raise two further questions. Firstly, we’ve assumed here that the exponent range is sufficiently wide – what if it’s not? Secondly – and relatedly – how much better can this angular bound get by spending some of the scale bits on greater precision?

My view is that the answer becomes clearer once a tensor-wide high-precision scale is introduced, something NVIDIA has recently done. In that setting, the block scales get relieved of their additional duty to capture global magnitude. This will be the subject of the next post on the topic!

Proofs

Readers not interested in the algebra can safely skip this section.

Cosine formula

Recall that \tilde v_b = x_b \hat v_b for each block b.

Then, because the blocks occupy disjoint coordinates, \langle \hat v,\tilde v\rangle = \sum_b \langle \hat v_b,\tilde v_b\rangle = \sum_b x_b \|\hat v_b\|^2.

Also, \|\hat v\|^2 = \sum_b \|\hat v_b\|^2, and \|\tilde v\|^2 = \sum_b x_b^2 \|\hat v_b\|^2.

Therefore \cos(\hat v,\tilde v) = \frac{\langle \hat v,\tilde v\rangle}{\|\hat v\|\,\|\tilde v\|} = \frac{\sum_b x_b \|\hat v_b\|^2}{\sqrt{\sum_b \|\hat v_b\|^2}\sqrt{\sum_b x_b^2 \|\hat v_b\|^2}}.

Now, as per the main blog post, define \alpha_b = \frac{\|\hat v_b\|^2}{\sum_j \|\hat v_j\|^2}.

Writing S=\sum_j \|\hat v_j\|^2, so that \|\hat v_b\|^2=\alpha_b S, the numerator becomes S\sum_b \alpha_b x_b and the denominator becomes S\sqrt{\sum_b \alpha_b x_b^2}, giving

\cos(\hat v,\tilde v)=\frac{\sum_b \alpha_b x_b}{\sqrt{\sum_b \alpha_b x_b^2}}.

\square

20 degree bound

Assume that all the multiplicative error factors lie in an interval x_b \in [\ell,u] with u > \ell > 0.

Let \mu := \sum_b \alpha_b x_b,\qquad q := \sum_b \alpha_b x_b^2.

Then the cosine is just \mu/\sqrt q. Since each x_b\in[\ell,u], we have

(x_b-\ell)(x_b-u)\le 0.

Expanding this gives

x_b^2 \le (\ell+u)x_b - \ell u.

Multiplying by \alpha_b and summing over b gives

q \le (\ell+u)\mu - \ell u.

Therefore \frac{\mu^2}{q}\ge \frac{\mu^2}{(\ell+u)\mu-\ell u}.

Now the weighted mean \mu also lies in the interval [\ell,u], so it remains to minimize

\frac{\mu^2}{(\ell+u)\mu-\ell u} over \mu\in[\ell,u].

Differentiating shows that the minimum occurs at \mu=\frac{2\ell u}{\ell+u}, the harmonic mean of \ell and u.

Substituting this value gives

\frac{\mu^2}{q}\ge \frac{4\ell u}{(\ell +u)^2},

and therefore

\frac{\mu}{\sqrt q}\ge \frac{2\sqrt{\ell u}}{\ell +u}.

So we have proved that

\frac{\sum_b \alpha_b x_b}{\sqrt{\sum_b \alpha_b x_b^2}}\ge \frac{2\sqrt{\ell u}}{\ell+u}.

Finally, in the power-of-two case we have

\ell=2^{-1/2} and u=2^{1/2}, so \ell u=1, and hence

\cos(\hat v,\tilde v)\ge \frac{2}{2^{-1/2}+2^{1/2}} = \frac{2\sqrt2}{3}.

Numerically,

\frac{2\sqrt2}{3}\approx 0.943,

so

\angle(\hat v,\tilde v)\le \arccos\left(\frac{2\sqrt2}{3}\right) < 20^\circ.

\square

Block Number Formats are Direction Preservers

I’ve recently returned from the SIAM PP 2026 conference and as always, conferences help provide time for research reflection. One thing I’ve been reflecting on during my journey back is the various explanations people give for why the machine learning world is so keen on block number formats (MX, NVFP, etc.) – see my earlier blog post on MX if you need a primer. Many hardware engineers tend to answer that they lead to efficient storage, or efficient arithmetic, or improved data transfer bandwidth, which are all true. But I think there’s another complementary answer that’s less well discussed (if indeed it is discussed at all). I hope this blog post might help stimulate some discussion of this complementary take.

On the numerical side, at first glance it might seem surprising that despite these formats representing numbers with very limited precision, large neural networks often tolerate them remarkably well, with little loss in accuracy. In my experience, most explanations focus on dynamic range, quantization noise, the inherent noise robustness of neural networks, or calibration techniques. But I suspect there is also a simple geometric way to think about what these formats are doing: Block number formats help preserve vector direction. And for many machine learning computations, preserving direction matters far more than preserving exact numerical values.

Block formats inherently represent direction and magnitude

Consider a vector v whose coordinates are partitioned into blocks v = (v_1, v_2, \dots, v_B).

In a block format, each block is represented using a shared scale and low-precision mantissas. For ease of discussion, we’ll consider the simplest case here, where scales are allowed to be arbitrary real-valued. In general, they may be much more restricted, e.g. powers of two.

Each block is approximated as \hat v_b = \beta_b m_b

where

m_b is a vector of low-precision mantissas, and

\beta_b is a scalar shared scaling factor.

In other words, each block can be thought of as a direction (encoded by the mantissas) multiplied by a magnitude (the shared scale). Strictly speaking, the mantissa vectorsย m_bโ€‹ย need not be normalized, and in many formats their entries may have quite different magnitudes (for example in integer mantissa formats such as MXINT). However this does not change the geometry. The representationย \hat{v}_b = \beta_b m_b is invariant to rescaling of m_b: multiplying m_bย by any constant simply rescales \beta_bย by the inverse factor. What matters for the approximation is therefore only theย directionย ofย m_bโ€‹, i.e. the one-dimensional subspace it spans.

We don't often think of it like this, but broadly speaking this is what block scaling achieves: it decouples the representation of magnitude from the representation of direction. This resembles the familiar decomposition v = \|v\|\frac{v}{\|v\|} of a vector into its magnitude and direction, but applied locally within blocks.

If the mantissa vector m_b points roughly in the same direction as the original block v_b, then scaling it appropriately produces a good approximation of that block.

OK, but does preserving directions block by block actually preserve the direction of the whole vector? It turns out that the answer is yes.

Direction Preservation

Let us make the reasonable assumption that the scale of each block \beta_b is not chosen arbitrarily, but rather is the best possible scale for that block in the least squares sense, for whatever mantissa vector we choose, i.e. \beta_b = \arg\min_{\beta} \|v_b - \beta m_b\|^2. Then \hat v_b is the orthogonal projection of v_b onto the line spanned by m_b.
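Concretely, the least-squares scale has the closed form \beta_b = \langle v_b, m_b\rangle / \|m_b\|^2. A short sketch (NumPy; the block and mantissa vectors are made up for illustration) checks the projection property, and the rescaling invariance mentioned earlier:

```python
import numpy as np

rng = np.random.default_rng(2)
v_b = rng.standard_normal(16)  # one block of the original vector
m_b = rng.standard_normal(16)  # an illustrative mantissa vector

# Least-squares scale: beta = <v_b, m_b> / ||m_b||^2, making
# beta * m_b the orthogonal projection of v_b onto span(m_b).
beta = (v_b @ m_b) / (m_b @ m_b)
v_hat_b = beta * m_b

# The residual is orthogonal to the projection.
assert np.isclose((v_b - v_hat_b) @ v_hat_b, 0.0, atol=1e-10)

# Rescaling m_b just rescales beta inversely: the projection is unchanged.
beta2 = (v_b @ (7 * m_b)) / ((7 * m_b) @ (7 * m_b))
assert np.allclose(beta2 * (7 * m_b), v_hat_b)
```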

So to what extent do the approximate and the original block vector point in the same direction? We can measure the block cosine similarities of the blocks as: \rho_b = \frac{\langle v_b,\hat v_b\rangle}{\|v_b\|\|\hat v_b\|}.

Equally, we can measure the cosine similarity of the full vectors (the concatenation of the original blocks versus the concatenation of the approximated blocks): \rho = \frac{\langle v,\hat v\rangle}{\|v\|\|\hat v\|}.

My aim here is to explain why small error in direction at block level leads to small error at vector level.

First, let’s define w_b = \frac{\|v_b\|^2}{\|v\|^2}, which we can think of as the fraction of the vectorโ€™s energy contained in block b; these add to 1 over the whole vector. Now we can state the result:

Theorem (Block Cosines)

Under the blockwise least-squares scaling, \rho = \sqrt{\sum_{b=1}^{B} w_b \rho_b^2 }.

For proof, see end of post.

In simple terms, this theorem states that the cosine similarity of the whole vector is the energy-weighted RMS of the block cosine similarities.
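A quick numerical check of the theorem (NumPy; the random blocks and the crude integer mantissas below are invented for illustration, and aren't any particular hardware format):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative example: a 32-dim vector in 4 blocks, each block
# projected onto a low-precision integer mantissa direction.
blocks = [rng.standard_normal(8) for _ in range(4)]
mantissas = [np.round(b / np.abs(b).max() * 7) for b in blocks]

v = np.concatenate(blocks)
v_hats, rhos, ws = [], [], []
for v_b, m_b in zip(blocks, mantissas):
    beta = (v_b @ m_b) / (m_b @ m_b)  # least-squares block scale
    v_hat_b = beta * m_b
    v_hats.append(v_hat_b)
    rhos.append(v_b @ v_hat_b / (np.linalg.norm(v_b) * np.linalg.norm(v_hat_b)))
    ws.append((v_b @ v_b) / (v @ v))  # energy fraction of block b

v_hat = np.concatenate(v_hats)
rho = v @ v_hat / (np.linalg.norm(v) * np.linalg.norm(v_hat))

# Theorem: rho is the energy-weighted RMS of the block cosines.
assert np.isclose(rho, np.sqrt(sum(w * r**2 for w, r in zip(ws, rhos))))
```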

What are the implications?

The weights w_b represent how much of the vectorโ€™s energy lies in each block. Blocks that contain very little energy contribute very little to the final direction. The important consequence is that direction errors do not accumulate catastrophically across blocks. Instead, the overall directional error simply depends on a weighted average of the block direction errors. In other words, if block number formats preserve the directions of individual blocks, they automatically preserve the direction of the entire vector.

Many core operations in machine learning depend heavily on vector direction. Notably, during training, stochastic gradient descent updates already factor into a magnitude (the learning rate) and a direction. We already have a knob controlling magnitude (the learning rate); what matters is that the direction is preserved. In attention mechanisms and embeddings, directional similarity measures are very important. Even for the humble dot product, the workhorse of inference, preservation of direction means that small perturbations in input give rise to only small perturbations in output, so the dot product behaves robustly.

Conclusion

Block floating-point and similar formats like block mini-float, MX, NVFP, are usually explained in terms of dynamic range and quantization noise. But geometrically, I like the perspective that they do something simpler: they approximate each block of a vector as direction × magnitude.

And as long as the block directions are preserved reasonably well, the direction of the whole vector is preserved too.

I think this is a useful intuition as to why very low-precision formats can work so well in modern machine learning systems. Block number formats are, in a very real sense, direction preservers. From this perspective, such low-precision block formats succeed not because they represent individual numbers accurately, but because they preserve the geometry of vectors.

Lots of extensions of this kind of analysis are of course possible. To name just a few:

  • We’ve focused on vectors, but tensor-level scaling may have interesting interplay with batching during training, for example
  • We made the simplifying assumption that scaling factors were real valued, but these can be restricted, most significantly to powers of two, and the analysis would need to be modified to incorporate that change.
  • We’ve not discussed mantissas at all, lots more of interest could be said here.
  • Potentially this approach could help provide some guidance to the empirical sizing of blocks in a block representation.

If anyone would like to work with me on this topic, do let me know your ideas.


Proof of the theorem

Readers not interested in the algebra can safely skip this section.

For each block b, the approximation \hat v_b = \beta_b m_b with \beta_b chosen by least squares is the orthogonal projection of v_b onto the line spanned by m_b.

So we can write v_b = \hat v_b + r_b where r_b is orthogonal to \hat v_b.

Taking the inner product with \hat v_b gives \langle v_b,\hat v_b\rangle = \|\hat v_b\|^2.

Now sum over blocks. Because the blocks correspond to disjoint coordinates,

\langle v,\hat v\rangle = \sum_b \langle v_b,\hat v_b\rangle = \sum_b \|\hat v_b\|^2 = \|\hat v\|^2.

Therefore

\rho = \frac{\langle v,\hat v\rangle}{\|v\|\|\hat v\|} = \frac{\|\hat v\|}{\|v\|}.

Recall \rho_b = \frac{\langle v_b,\hat v_b\rangle}{\|v_b\|\|\hat v_b\|}.

Using \langle v_b,\hat v_b\rangle=\|\hat v_b\|^2, we obtain

\rho_b = \frac{\|\hat v_b\|}{\|v_b\|}.

Hence

\|\hat v_b\|^2 = \rho_b^2 \|v_b\|^2.

Summing over blocks gives

\|\hat v\|^2 = \sum_b \|\hat v_b\|^2 = \sum_b \rho_b^2 \|v_b\|^2.

Dividing by \|v\|^2, and writing

w_b = \frac{\|v_b\|^2}{\|v\|^2},

gives

\frac{\|\hat v\|^2}{\|v\|^2} = \sum_b w_b \rho_b^2.

Since \rho = \|\hat v\|/\|v\|, we obtain

\rho = \sqrt{\sum_b w_b \rho_b^2 }.

\square

Notes on Computational Learning Theory

This blog collects some of my notes on classical computational learning theory, based on my reading of Kearns and Vazirani. The results are (almost) all from their book, the sloganising (and mistakes, no doubt) are mine.

The Probably Approximately Correct (PAC) Framework

Definition (Instance Space). An instance space is a set, typically denoted X. It is the set of objects we are trying to learn about.

Definition (Concept). A concept c over X is a subset of the instance space X.

Although not covered in Kearns and Vazirani, in general it is possible to generalise beyond Boolean membership to some degree of uncertainty or fuzziness – I hope to cover this in a future blog post.

Definition (Concept Class). A concept class {\mathcal C} is a set of concepts, i.e. {\mathcal C} \subset \mathcal{P}(X), where \mathcal P denotes power set. We will follow Kearns and Vazirani and also use c to denote the corresponding indicator function c : X \to \{0,1\}.

In PAC learning, we assume {\mathcal C} is known, but the target concept c \in {\mathcal C} is not. However, it doesn't seem a big jump to allow for an unknown target concept, in an appropriate approximation setting – I would welcome comments on established frameworks for this.

Definition (Target Distribution). A target distribution {\mathcal D} is a probability distribution over X.

In PAC learning, we assume {\mathcal D} is unknown.

Definition (Oracle). An oracle is a function EX(c,{\mathcal D}) taking a concept and a distribution, and returning a labelled example (x, c(x)) where x is drawn randomly and independently from {\mathcal D}.

Definition (Error). The error of a hypothesis concept h \in {\mathcal C} with reference to a target concept c \in {\mathcal C} and target distribution {\mathcal D}, is \text{error}(h) = Pr_{x \in {\mathcal D}}\left\{ c(x) \neq h(x) \right\}, where Pr denotes probability.

Definition (Representation Scheme). A representation scheme for a concept class {\mathcal C} is a function {\mathcal R} : \Sigma^* \to {\mathcal C} where \Sigma is a finite alphabet of symbols (or – following the Real RAM model – a finite alphabet augmented with real numbers).

Definition (Representation Class). A representation class is a concept class together with a fixed representation scheme for that class.

Definition (Size). We associate a size \text{size}(\sigma) with each string \sigma \in \Sigma^* over the representation alphabet. We similarly associate a size with each concept c via the size of its minimal representation \text{size}(c) = \min_{R(\sigma) = c} \text{size}(\sigma).

Definition (PAC Learnable). Let {\mathcal C} and {\mathcal H} be representation classes over X, where {\mathcal C} \subseteq {\mathcal H}. We say that concept class {\mathcal C} is PAC learnable using hypothesis class {\mathcal H} if there exists an algorithm that, given access to an oracle, when learning any target concept c \in {\mathcal C} over any distribution {\mathcal D} on X, and for any given 0 < \epsilon < 1/2 and 0 < \delta < 1/2, with probability at least 1-\delta, outputs a hypothesis h \in {\mathcal H} with \text{error}(h) \leq \epsilon.

Definition (Efficiently PAC Learnable). Let {\mathcal C}_n and {\mathcal H}_n be representation classes over X_n, where {\mathcal C}_n \subseteq {\mathcal H}_n for all n. Let X_n = \{0,1\}^n or X_n = {\mathbb R}^n. Let X = \cup_{n \geq 1} X_n, {\mathcal C} = \cup_{n \geq 1} {\mathcal C_n}, and {\mathcal H} = \cup_{n \geq 1} {\mathcal H_n}. We say that concept class {\mathcal C} is efficiently PAC learnable using hypothesis class {\mathcal H} if there exists an algorithm that, given access to a constant time oracle, when learning any target concept c \in {\mathcal C}_n over any distribution {\mathcal D} on X, and for any given 0 < \epsilon < 1/2 and 0 < \delta < 1/2:

  • Runs in time polynomial in n, \text{size}(c), 1/\epsilon, and 1/\delta, and
  • With probability at least 1-\delta, outputs a hypothesis h \in {\mathcal H} with \text{error}(h) \leq \epsilon.

There is much of interest to unpick in these definitions. Firstly, notice that we have defined a family of classes parameterised by dimension n, allowing us to talk in terms of asymptotic behaviour as dimensionality increases. Secondly, note the key parameters of PAC learnability: \delta (the ‘probably’ bit) and \epsilon (the ‘approximate’ bit). The first of these captures the idea that we may get really unlucky with our calls to the oracle, and get misleading training data. The second captures the idea that we are not aiming for certainty in our final classification accuracy, some pre-defined tolerance is allowable. Thirdly, note the requirements of efficiency: polynomial scaling in dimension, in size of the concept (complex concepts can be harder to learn), in inverse error rate (the more sloppy, the easier), and in inverse probability of algorithm failure to find a suitable hypothesis (you need to pay for more certainty). Finally, and most intricately, notice the separation of concept class from hypothesis class. We require the hypothesis class to be at least as general, so the concept we’re trying to learn is actually one of the returnable hypotheses, but it can be strictly more general. This is to avoid the case where the restricted hypothesis classes are harder to learn; Kearns and Vazirani, following Pitt and Valiant, give the example that learning the concept class 3-term DNF using the hypothesis class 3-term DNF itself is intractable, yet learning the same concept class with the more general hypothesis class 3-CNF is efficiently PAC learnable.

Occam’s Razor

Definition (Occam Algorithm). Let \alpha \geq 0 and 0 \leq \beta < 1 be real constants. An algorithm is an (\alpha,\beta)-Occam algorithm for {\mathcal C} using {\mathcal H} if, on an input sample S of cardinality m labelled by membership in c \in {\mathcal C}_n, the algorithm outputs a hypothesis h \in {\mathcal H} such that:

  • h is consistent with S, i.e. there is no misclassification on S
  • \text{size}(h) \leq \left(n \cdot \text{size}(c)\right)^\alpha m^\beta

Thus Occam algorithms produce succinct hypotheses consistent with data. Note that the size of the hypothesis is allowed to grow only mildly – if at all – with the size of the dataset (via \beta). Note, however, that there is nothing in this definition that suggests predictive power on unseen samples.

Definition (Efficient Occam Algorithm). An (\alpha,\beta)-Occam algorithm is efficient iff its running time is polynomial in n, m, and \text{size}(c).

Theorem (Occam’s Razor). Let A be an efficient (\alpha,\beta)-Occam algorithm for {\mathcal C} using {\mathcal H}. Let {\mathcal D} be the target distribution over X, let c \in {\mathcal C}_n be the target concept, 0 < \epsilon, \delta \leq 1. Then there is a constant a > 0 such that if A is given as input a random sample S of m examples drawn from oracle EX(c,{\mathcal D}), where m satisfies m \geq a \left( \frac{1}{\epsilon} \log \frac{1}{\delta} + \left(\frac{\left( n \cdot \text{size}(c) \right)^\alpha}{\epsilon}\right)^\frac{1}{1-\beta}\right), then A runs in time polynomial in n, \text{size}(c), 1/\epsilon and \frac{1}{\delta} and, with probability at least 1 - \delta, the output h of A satisfies error(h) \leq \epsilon.

This is a technically dense presentation, but it’s a philosophically beautiful result. Let’s unpick it a bit, so its essence is not obscured by notation. In summary, simple rules that are consistent with prior observations have predictive power! The ‘simple’ part here comes from (\alpha,\beta), and the predictive power comes from the bound on \text{error}(h). Of course, one needs sufficient observations (the complex lower bound on m) for this to hold. Notice that as \beta approaches 1 – so that, by the definition of an Occam algorithm, we get close to being able to memorise our entire training set – the required training set becomes arbitrarily large (memorisation doesn’t generalise).

Vapnik-Chervonenkis (VC) Dimension

Definition (Behaviours). The set of behaviours on S = \{x_1, \ldots, x_m\} that are realised by {\mathcal C}, is defined by \Pi_{\mathcal C}(S) = \left\{ \left(c(x_1), \ldots, c(x_m)\right) | c \in {\mathcal C} \right\}.

Each of the points in S is either included in a given concept or not. Each tuple \left(c(x_1), \ldots, c(x_m)\right) then forms a kind of fingerprint of S according to a particular concept. The set of behaviours is the set of all such fingerprints across the whole concept class.

Definition (Shattered). A set S is shattered by {\mathcal C} iff \Pi_{\mathcal C}(S) = \{0,1\}^{|S|}.

Note that \{0,1\}^{|S|} is the maximum possible cardinality, i.e. the set of behaviours is all possible behaviours. So we can think of a set as being shattered by a concept class iff every possible labelling of the set’s points by inclusion/exclusion is realised by at least one concept in the class.

Definition (Vapnik-Chervonenkis Dimension). The VC dimension of {\mathcal C}, denoted VCD({\mathcal C}), is the cardinality of the largest set shattered by {\mathcal C}. If arbitrarily large finite sets can be shattered by {\mathcal C}, then VCD({\mathcal C}) = \infty.

VC dimension in this sense captures the ability of {\mathcal C} to discern between samples.
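A standard concrete example (my own illustration, not drawn from the book): take the concept class of closed intervals [a,b] on the real line. Any two points can be shattered, but no three points can, since an interval containing the outer two must contain the middle one; hence the VC dimension of intervals is 2. A small sketch:

```python
from itertools import product

# Illustrative concept class: closed intervals [a, b] on the real line.
# A point x belongs to the concept iff a <= x <= b.
def behaviours(points, endpoints):
    """All labellings of `points` realised by intervals with the given endpoints."""
    return {
        tuple(int(a <= x <= b) for x in points)
        for a, b in product(endpoints, repeat=2) if a <= b
    }

S2 = [1.0, 2.0]
S3 = [1.0, 2.0, 3.0]
# Candidate endpoints between/around the points suffice to realise
# every achievable labelling.
ends = [0.5, 1.5, 2.5, 3.5]

assert behaviours(S2, ends) == {(0, 0), (1, 0), (0, 1), (1, 1)}  # S2 shattered
assert (1, 0, 1) not in behaviours(S3, ends)  # S3 not shattered: VCD = 2
```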

Theorem (PAC-learning in Low VC Dimension). Let {\mathcal C} be any concept class. Let {\mathcal H} be any representation class of VC dimension d. Let A be any algorithm taking a set of m labelled examples of a concept c \in {\mathcal C} and producing a concept in {\mathcal H} that is consistent with the examples. Then there exists a constant c_0 such that A is a PAC learning algorithm for {\mathcal C} using {\mathcal H} when it is given examples from EX(c,{\mathcal D}), and when m \geq c_0 \left( \frac{1}{\epsilon} \log \frac{1}{\delta} + \frac{d}{\epsilon} \log \frac{1}{\epsilon} \right).

Let’s take a look at the similarity between this theorem and Occam’s razor, presented in the last section of this blog post. Both bounds have a similar feel, but the VCD-based bound does not depend on \text{size}(c); indeed it’s possible for hypotheses to have unbounded size and yet for the VCD to remain finite.

As the theorem below shows, the linear dependence on VCD achieved in the above theorem is actually the best one can do.

Theorem (PAC-learning Minimum Samples). Any algorithm for PAC-learning a concept class of VC dimension d must use \Omega(d/\epsilon) examples in the worst case.

Definition (Layered DAG). A layered DAG is a DAG in which each vertex is associated with a layer \ell \in {\mathbb N} and in which the edges are always from some layer \ell to the next layer \ell+1. Vertices at layer 0 have indegree 0 and are referred to as input nodes. Vertices at other layers are referred to as internal nodes. There is a single output node of outdegree 0.

Definition (G-composition). For a layered DAG G and a concept class {\mathcal C}, the G-composition of {\mathcal C} is the class of all concepts that can be obtained by: (i) associating a concept c_i \in {\mathcal C} with each vertex N_i in G, (ii) applying the concept at each node to the outputs of its predecessor nodes.

Notice that this way we can think of the internal nodes as forming a Boolean circuit with a single output; the G-composition is the concept class we obtain by restricting concepts to only those computable with the structure G. This is a very natural way of composing concepts – so what kind of VCD arises through this composition? This theorem provides an answer:

Theorem (VCD Compositional Bound). Let G be a layered DAG with n input nodes and s \geq 2 internal nodes, each of indegree r. Let {\mathcal C} be a concept class over {\mathbb R}^r of VC dimension d, and let {\mathcal C}_G be the G-composition of {\mathcal C}. Then VCD({\mathcal C}_G) \leq 2ds \log(es).

Weak PAC Learnability

Definition (Weak PAC Learning). Let {\mathcal C} be a concept class and let A be an algorithm that is given access to EX(c,{\mathcal D}) for target concept c \in {\mathcal C}_n and distribution {\mathcal D}. A is a weak PAC learning algorithm for {\mathcal C} using {\mathcal H} if there exist polynomials p(\cdot,\cdot) and q(\cdot,\cdot) such that A outputs a hypothesis h \in {\mathcal H} that with probability at least 1/q(n,\text{size}(c)) satisfies \text{error}(h) \leq 1/2 - 1/p(n,\text{size}(c)).

Kearns and Vazirani justifiably describe weak PAC learning as “the weakest demand we could place on an algorithm in the PAC setting without trivialising the problem”: if these were exponential rather than polynomial functions in n, the problem would be trivial: take a fixed-size random sample of the concept, memorise it, and guess randomly (with 50% success probability) outside the memorised sample. The remarkable result is that efficient weak PAC learnability and efficient PAC learnability coincide for an appropriate PAC hypothesis class, based on ternary majority trees.

Definition (Ternary Majority Tree). A ternary majority tree with leaves from {\mathcal H} is a tree where each non-leaf node computes a majority (voting) function of its three children, and each leaf is labelled with a hypothesis from {\mathcal H}.

Theorem (Weak PAC learnability is PAC learnability). Let {\mathcal C} be any concept class and {\mathcal H} any hypothesis class. Then if {\mathcal C} is efficiently weakly PAC learnable using {\mathcal H}, it follows that {\mathcal C} is efficiently PAC learnable using a hypothesis class of ternary majority trees with leaves from {\mathcal H}.

Kearns and Vazirani provide an algorithm to learn this way. The details are described in their book, but the basic principle is based on “boosting”, as developed in the lemma to follow.

Definition (Filtered Distributions). Given a distribution {\mathcal D} and a hypothesis h_1 we define {\mathcal D_2} to be the distribution obtained by flipping a fair coin and, on a heads, drawing from EX(c,{\mathcal D}) until h_1 agrees with the label; on a tails, drawing from EX(c,{\mathcal D}) until h_1 disagrees with the label. Invoking a weak learning algorithm on data from this new distribution yields a new hypothesis h_2. Similarly, we define {\mathcal D_3} to be the distribution obtained by drawing examples from EX(c,{\mathcal D}) until we find an example on which h_1 and h_2 disagree.

What’s going on in these constructions is quite clever: h_2 has been constructed so that it must contain new information about c, compared to h_1; h_1 has, by construction, no advantage over a coin flip on {\mathcal D}_2. Similarly, h_3 contains new information about c not already contained in h_1 and h_2, namely on the points where they disagree. Thus, one would expect that hypotheses that work in these three cases could be combined to give us a better overall hypothesis. This is indeed the case, as the following lemma shows.

Lemma (Boosting). Let g(\beta) = 3 \beta^2 - 2 \beta^3. Let the distributions {\mathcal D}, {\mathcal D}_2, {\mathcal D}_3 be defined above, and let h_1, h_2 and h_3 satisfy \text{error}_{\mathcal D}(h_1) \leq \beta, \text{error}_{{\mathcal D}_2}(h_2) \leq \beta, \text{error}_{{\mathcal D}_3}(h_3) \leq \beta. Then if h = \text{majority}(h_1, h_2, h_3), it follows that \text{error}_{\mathcal D}(h) \leq g(\beta).

The function g satisfies g(\beta) < \beta for all \beta \in (0,1/2). Hence by combining three hypotheses with only marginally better accuracy than flipping a coin, the boosting lemma tells us that we can obtain a strictly stronger hypothesis. The algorithm for (strong) PAC learnability therefore involves recursively calling this boosting procedure, leading to the majority tree – based hypothesis class. Of course, one needs to show that the depth of the recursion is not too large and that we can sample from the filtered distributions with not too many calls to the overall oracle EX(c,{\mathcal D}), so that the polynomial complexity bound in the PAC definition is maintained. Kearns and Vazirani include these two results in the book.
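A quick check of the boosting dynamics (plain Python; the starting error of 0.45 and the iteration count are arbitrary illustrative choices):

```python
# g(beta) = 3*beta^2 - 2*beta^3 maps an error rate beta < 1/2 to a
# strictly smaller error rate, which is what makes recursive boosting work.
def g(beta):
    return 3 * beta**2 - 2 * beta**3

# g(beta) < beta throughout (0, 1/2), with fixed points only at 0 and 1/2.
betas = [i / 1000 for i in range(1, 500)]
assert all(g(b) < b for b in betas)

# Iterating g from a weak edge over coin-flipping drives the error
# towards zero.
e = 0.45
for _ in range(10):
    e = g(e)
assert e < 0.01
```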

Learning from Noisy Data

Up until this point, we have only dealt with correctly classified training data. The introduction of a noisy oracle allows us to move beyond this limitation.

Definition (Noisy Oracle). A noisy oracle \hat{EX}^\eta( c, {\mathcal D}) extends the earlier idea of an oracle with an additional noise parameter 0 \leq \eta < 1/2. This oracle behaves identically to EX except that it returns the wrong classification with probability \eta.
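Simulating a noisy oracle on top of a clean one is a short wrapper; the sketch below is my own illustration, with `ex` a hypothetical stand-in for EX(c, {\mathcal D}) returning (x, c(x)) pairs with labels in {0, 1}.

```python
import random

def noisy_ex(ex, eta):
    """Wrap a clean oracle EX(c, D) to obtain the noisy oracle:
    each returned label is flipped independently with probability eta."""
    def oracle():
        x, label = ex()
        if random.random() < eta:
            label = 1 - label
        return x, label
    return oracle
```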

Definition (PAC Learnable from Noisy Data). Let {\mathcal C} be a concept class and let {\mathcal H} be a representation class over X. Then {\mathcal C} is PAC learnable from noisy data using {\mathcal H} if there exists an algorithm such that: for any concept c \in {\mathcal C}, any distribution {\mathcal D} on X, any 0 \leq \eta < 1/2, and any 0 < \epsilon < 1, 0 < \delta < 1 and \eta_0 with \eta \leq \eta_0 < 1/2, given access to a noisy oracle \hat{EX}^\eta( c, {\mathcal D}) and inputs \epsilon, \delta, \eta_0, with probability at least 1 - \delta the algorithm outputs a hypothesis concept h \in {\mathcal H} with \text{error}(h) \leq \epsilon. If the runtime of the algorithm is polynomial in n, 1/\epsilon, 1/\delta and 1/(1 - 2\eta_0) then {\mathcal C} is efficiently learnable from noisy data using {\mathcal H}.

Let’s unpick this definition a bit. The main difference from the PAC definition is simply the addition of noise via the oracle and an additional parameter \eta_0 which bounds the error of the oracle; thus the algorithm is allowed to know in advance an upper bound on the noisiness of the data, and an efficient algorithm is allowed to take more time on more noisy data.

Kearns and Vazirani address PAC learnability from noisy data in an indirect way, via the use of a slightly different framework, introduced below.

Definition (Statistical Oracle). A statistical oracle STAT(c, {\mathcal D}) takes queries of the form (\chi, \tau) where \chi : X \times \{0,1\} \to \{0,1\} and 0 < \tau \leq 1, and returns a value \hat{P}_\chi satisfying P_\chi - \tau \leq \hat{P}_\chi \leq P_\chi + \tau where P_\chi = Pr_{x \in {\mathcal D}}[ \chi(x, c(x)) = 1 ].

Definition (Learnable from Statistical Queries). Let {\mathcal C} be a concept class and let {\mathcal H} be a representation class over X. Then {\mathcal C} is efficiently learnable from statistical queries using {\mathcal H} if there exists a learning algorithm A and polynomials p(\cdot, \cdot, \cdot), q(\cdot, \cdot, \cdot) and r(\cdot,\cdot,\cdot) such that: for any c \in {\mathcal C}, any distribution {\mathcal D} over X and any 0 < \epsilon < 1/2, if given access to STAT(c,{\mathcal D}), the following hold. (i) For every query (\chi,\tau) made by A, the predicate \chi can be evaluated in time q(1/\epsilon, n, \text{size}(c)), and 1/\tau \leq r(1/\epsilon, n, \text{size}(c)), (ii) A has execution time bounded by p(1/\epsilon, n, \text{size}(c)), (iii) A outputs a hypothesis h \in {\mathcal H} that satisfies \text{error}(h) \leq \epsilon.

So a statistical oracle can be asked about a whole predicate \chi, for any given tolerance \tau. The oracle must return an estimate of the probability that this predicate holds (where the probability is over the distribution over X). It is, perhaps, not entirely obvious how to relate this back to the noisy oracle defined above. However, one can construct a statistical oracle that works with high probability by taking enough samples from a standard oracle and returning the relative frequency with which \chi evaluates to 1 on that sample. Kearns and Vazirani provide an intricate construction that samples efficiently from a noisy oracle to simulate a statistical oracle with high probability. In essence, this allows an algorithm that can learn from statistical queries to be used to learn from noisy data, resulting in the following theorem.
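The easy direction, simulating STAT from a clean oracle by sampling, can be sketched directly: Hoeffding's inequality gives the required sample size. This is my own illustration; the names and the failure parameter `delta` are assumptions, not notation from the book.

```python
import math
import random

def stat_from_ex(ex, chi, tau, delta=0.01):
    """Approximate STAT(c, D) by relative frequency over samples from EX.

    By Hoeffding's inequality, m >= ln(2/delta) / (2 * tau^2) samples
    suffice for the empirical frequency to lie within tau of the true
    probability P_chi with probability at least 1 - delta.
    """
    m = math.ceil(math.log(2 / delta) / (2 * tau**2))
    hits = 0
    for _ in range(m):
        x, label = ex()
        hits += chi(x, label)
    return hits / m
```

The hard direction, simulating STAT from a *noisy* oracle, is where the book's intricate construction comes in: the noise rate must be estimated and cancelled out of the frequency estimates.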

Theorem (Learnable from Statistical Queries means Learnable from Noisy Data). Let {\mathcal C} be a concept class and let {\mathcal H} be a representation class over X. Then if {\mathcal C} is efficiently learnable from statistical queries using {\mathcal H}, {\mathcal C} is also efficiently PAC learnable using {\mathcal H} in the presence of classification noise.

Hardness Results

I mentioned earlier in this post that Pitt and Valiant showed that sometimes we want more general hypothesis classes than concept classes: learning the concept class 3-term DNF using the hypothesis class 3-term DNF is intractable, yet the same concept class is efficiently PAC learnable using the more general hypothesis class 3-CNF. So in their chapter Inherent Unpredictability, Kearns and Vazirani turn their attention to the case where a concept class is hard to learn independently of the choice of hypothesis class. This leads to some quite profound results for those of us interested in Boolean circuits.

We will need some kind of hardness assumption to develop hardness results for learning. In particular, note that if P = NP, then by Occam’s Razor (above) polynomially evaluable hypothesis classes are also polynomially-learnable ones. So we will need to do two things: focus our attention on polynomially evaluable hypothesis classes (or we can’t hope to learn them polynomially), and make a suitable hardness assumption. The latter requires a very brief detour into some results commonly associated with cryptography.

Let {\mathbb Z}_N^* = \{ i \; | \; 0 < i < N \; \wedge \; \text{gcd}(i, N) = 1 \}. We define the cubing function f_N : {\mathbb Z}_N^* \to {\mathbb Z}_N^* by f_N(x) = x^3 \text{ mod } N. Let \varphi denote Euler’s totient function. Then if \varphi(N) is not a multiple of three, it turns out that f_N is bijective, so we can talk of a unique discrete cube root.
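The trapdoor structure is easy to demonstrate at toy scale. Assuming knowledge of the factorisation (tiny illustrative primes below, my own example rather than anything from the book), the cube root is recovered by inverting 3 modulo \varphi(N):

```python
# A toy instance of the cubing function and its trapdoor inverse.
# With p and q known, phi(N) = (p - 1) * (q - 1); since gcd(3, phi) = 1
# we can compute d = 3^{-1} mod phi and recover x as f_N(x)^d mod N.
# Knowing the factorisation is exactly the trapdoor that the Discrete
# Cube Root Assumption says is hard to reproduce without.
p, q = 11, 17          # phi = 10 * 16 = 160, not a multiple of 3
N = p * q
phi = (p - 1) * (q - 1)
d = pow(3, -1, phi)    # modular inverse (Python 3.8+)

x = 42                 # gcd(42, 187) = 1, so x is in Z_N^*
y = pow(x, 3, N)       # f_N(x)
assert pow(y, d, N) == x
```

The correctness follows from Euler's theorem: 3d = 1 + k\varphi(N) for some k, so y^d = x^{3d} = x \cdot (x^{\varphi(N)})^k = x \bmod N.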

Definition (Discrete Cube Root Problem). Let p and q be two n-bit primes with \varphi(N) not a multiple of 3, where N = pq. Given N and f_N(x) as input, where x \in {\mathbb Z}_N^*, output x.

Definition (Discrete Cube Root Assumption). For every polynomial P, there is no algorithm A that runs in time P(n) and solves the discrete cube root problem with probability at least 1/P(n), where the probability is taken over the random choice of p, q and x (with N = pq) and any internal randomisation of A.

This Discrete Cube Root Assumption is widely known and studied, and forms the basis of the learning complexity results presented by Kearns and Vazirani.

Theorem (Concepts Computed by Small, Shallow Boolean Circuits are Hard to Learn). Under the Discrete Cube Root Assumption, the representation class of polynomial-size, log-depth Boolean circuits is not efficiently PAC learnable (using any polynomially evaluable hypothesis class).

The result also holds if one removes the log-depth requirement; the point of stating it this way is that hardness persists even when we restrict ourselves to log-depth circuits.

In case any of my blog readers knows: please contact me directly if you’re aware of any resource collecting positive results on the learnability of compositionally closed, non-trivial restricted classes of Boolean circuits.

The construction used to provide the result above for Boolean circuits can be generalised to neural networks:

Theorem (Concepts Computed by Neural Networks are Hard to Learn). Under the Discrete Cube Root Assumption, there is a polynomial p and an infinite family of directed acyclic graphs (neural network architectures) G = \{G_{n^2}\}_{n \geq 1} such that each G_{n^2} has n^2 Boolean inputs, at most p(n) nodes, and depth bounded by a constant independent of n, yet the representation class {\mathcal C}_G = \cup_{n \geq 1} {\mathcal C}_{G_{n^2}} is not efficiently PAC learnable (using any polynomially evaluable hypothesis class), even when the weights are restricted to be binary.

Through an appropriate natural definition of reduction in PAC learning, Kearns and Vazirani show that the PAC-learnability of all these classes reduces to the PAC-learnability of functions computed by deterministic finite automata. So, in particular:

Theorem (Concepts Computed by Deterministic Finite Automata are Hard to Learn). Under the Discrete Cube Root Assumption, the representation class of Deterministic Finite Automata is not efficiently PAC learnable (using any polynomially evaluable hypothesis class).

It is this result that motivates the final chapter of the book.

Experimentation in Learning

As discussed above, the PAC model utilises an oracle that returns labelled samples (x, c(x)). An interesting question is whether more learning power arises if we allow algorithms to select x themselves, with the oracle returning c(x): that is, not just to be shown randomly selected examples, but to take charge and test their understanding of the concept.

Definition (Membership Query). A membership query oracle takes any instance x and returns its classification c(x).

Definition (Equivalence Query). An equivalence query oracle takes a hypothesis concept h \in {\mathcal C} and determines whether there is an instance x on which c(x) \neq h(x), returning this counterexample if so.

Definition (Learnable From Membership and Equivalence Queries). The representation class {\mathcal C} is efficiently exactly learnable from membership and equivalence queries if there is a polynomial p(\cdot,\cdot) and an algorithm with access to membership and equivalence oracles such that for any target concept c \in {\mathcal C}_n, the algorithm outputs the concept c in time p(\text{size}(c),n).

There are a couple of things to note about this definition. It appears to be a much stronger requirement than PAC learning, as the concept must be exactly learnt. On the other hand, the existence of these more sophisticated oracles, especially the equivalence query oracle, appears to narrow the scope. Kearns and Vazirani encourage the reader to prove that the true strengthening over PAC-learnability is in the membership queries:

Theorem (Exact Learnability from Membership and Equivalence means PAC-learnable with only Membership). For any representation class {\mathcal C}, if {\mathcal C} is efficiently exactly learnable from membership and equivalence queries, then {\mathcal C} is also efficiently learnable in the PAC model with membership queries.

They then provide an explicit algorithm, based on these two new oracles, to efficiently exactly learn deterministic finite automata.

Theorem (Experiments Make Deterministic Finite Automata Efficiently Learnable). The representation class of Deterministic Finite Automata is efficiently exactly learnable from membership and equivalence queries.

Note the contrast with the hardness result of the previous section: through the addition of experimentation, we have gone from infeasible learnability to efficient learnability. Another very philosophically pleasing result.
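The book's DFA algorithm is too long to reproduce here, but membership queries already shine in much simpler settings. As a toy illustration (my own, not from the book), a monotone conjunction over n Boolean variables can be exactly learned with just n membership queries:

```python
def learn_monotone_conjunction(n, member):
    """Exactly learn a monotone conjunction over n Boolean variables
    using n membership queries.

    `member(x)` returns c(x) for any assignment x (a tuple of 0/1).
    Variable i appears in the target conjunction iff the all-ones
    assignment with bit i cleared is classified 0.
    """
    relevant = []
    for i in range(n):
        x = tuple(1 if j != i else 0 for j in range(n))
        if member(x) == 0:
            relevant.append(i)
    return relevant
```

For the target x_0 \wedge x_2 over four variables, the learner queries (0,1,1,1), (1,0,1,1), (1,1,0,1) and (1,1,1,0) and returns the indices whose query came back 0. No randomly drawn examples are needed at all.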

Energy: Rewriting the Possibilities

In early June, my PhD student Sam Coward (co-advised by Theo Drane from Intel) will travel to ARITH 2024 in Málaga to present some of our most recent work, “Combining Power and Arithmetic Optimization via Datapath Rewriting”, a joint paper with Emiliano Morini, also of Intel. In this blog post, I will describe the fundamental idea of our work.

It’s well-known that ICT is driving a significant amount of energy consumption in the modern world. The core question of how to organise the fundamental arithmetic operations in a computer in order to reduce power (energy per unit time) has been studied for a long time, and continues to be a priority for designers across industry, including the group at Intel with whom this work has been conducted.

Readers of this blog will know that Sam has been doing great work on how to explore the space of behaviourally equivalent hardware designs automatically. First for area, then for performance, and now for power consumption!

In our latest work, Sam looks at how we can use the e-graph data structure, and the related egg tool, to tightly integrate arithmetic optimisations (like building multi-input adders in hardware) with clock gating and data gating, two techniques for power saving. Clock gating avoids clocking new values into registers in hardware if we know they’re not going to be used in a given cycle; this avoids the costly switching activity associated with propagating unused information in a digital circuit. Data gating also avoids switching, but in a different way – by replacing operands with values inducing low switching: for example, if I do not end up using the result of a \times b, then I may as well be computing a \times 0. In both cases, the fundamental issue becomes how to identify whether a value will be unused later in a computation. Intriguingly, this question is tightly related to the way a computation is performed: there are many ways of implementing a given mathematical function, and each one will have its own redundancies to exploit.
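A software caricature of the data-gating rewrite (my own sketch, not code from the paper; I adopt the convention that an unused result reads as 0 so the two variants can be compared as functions):

```python
def mul_ungated(a, b, use):
    # Baseline: the multiplier always sees live operands,
    # even when the product is ultimately discarded.
    product = a * b
    return product if use else 0

def mul_data_gated(a, b, use):
    # Data gating: when the result is unused, force an operand to 0
    # so the multiplier and any downstream logic see a constant,
    # low-activity value instead of toggling data.
    ga = a if use else 0
    return ga * b

# The two variants are functionally equivalent on every input,
# which is what licenses the rewrite in an equivalence-preserving
# framework like an e-graph.
for a in range(4):
    for b in range(4):
        for use in (False, True):
            assert mul_ungated(a, b, use) == mul_data_gated(a, b, use)
```

In hardware the payoff is that the gated operand stops switching when `use` is false; in this software model only the functional equivalence is visible.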

In our ARITH 2024 paper, Sam has shown how data gating and clock gating can be expressed as rewrites over streams of Boolean data types, lifting our previous work that looks at equivalences between bit vectors, to equivalences over streams of bit vectors. In this way, he’s able to express both traditional arithmetic equivalences like a + (b + c) = (a + b) + c and equivalences expressing clock and data gating within the same rewriting framework. A collection of these latter equivalences are shown in the table below from our paper.

Some of the rewrites between equivalent expressions used in our ARITH 2024 paper

Sam has been able to show that by combining the rewrites creatively, using arithmetic rewrites to expose new opportunities for gating, our tool ROVER is able to save some 15% to 30% of power consumption over a range of benchmark problems of industrial interest. Moreover, ROVER will automatically adjust the whole design to better suit different switching profiles, knowing that rarely-switching circuit components are less problematic for energy, and prioritising exposing rewrites where they are needed.

I think this is really interesting work, and shows just how general the e-graph approach to circuit optimisation can be. If you’re going to ARITH 2024, do make sure to talk to Sam and find out more. If not, make sure to read his paper!