Precision Matters in Block Scales

This is the third post in a sequence relating to the geometry of block number formats as angle preservers. In my previous post, I argued that block number formats remain direction preservers even when their block scales are quantised all the way down to powers of two, as is common in some number representations like the MX concrete formats. The main result there was that exponent-only block scaling perturbs direction by at most about 20^\circ, and that 20^\circ is not that much in high dimensions. So we ended last time with the nice result that very coarse block-scale quantisation is relatively benign.

But that doesn’t mean power-of-two scaling is optimal. If we have a fixed budget of bits for representing block scales, how should we spend them? Specifically, should we spend them on exponent range, or on significand precision?

This post argues that the answer becomes much clearer once a high-precision tensor-wide scale is introduced, which is exactly the kind of two-level scaling used in NVIDIA’s NVFP4 format. In NVFP4, 4-bit E2M1 values are combined with an FP8 E4M3 scale for each 16-value micro-block and a second-level FP32 scale for the tensor.

With such a tensor-wide scale, the block scales are relieved of their duty to try to capture the global magnitude of the tensor. Instead, we can ask a more focused task of them: reconstruct the relative amplitudes of the blocks, so that the global direction of the represented vector is preserved.

Since drafting this post, Bardia Zadeh and I have also written Direction-Preserving Number Representations, which I blogged about separately. That paper studies the related question of what directions can be obtained when each coordinate of a vector is drawn from a finite scalar alphabet. This post is about block scales rather than scalar elements, but the same product-structured geometry reappears one level higher.

I will argue in this post that once we look at the problem that way, precision in the block scales starts to matter much more. This leads to a rough rule of thumb for the relationship between block scale formats and vector lengths.

What a tensor-wide scale changes

Suppose, as per my previous posts, that each block is represented as \hat v_b = \beta_b^\star m_b, where m_b is a chosen mantissa vector and \beta_b^\star is the ideal real-valued block scale for that mantissa direction.

Now suppose that the final represented tensor has the form

\tilde v = \gamma \bigoplus_{b=1}^K (\tilde \beta_b m_b)

where

  • \gamma is the high-precision tensor-wide scale,
  • \tilde \beta_b is the low-precision per-block scale, and
  • \bigoplus denotes direct sum (block concatenation).

Of course, the tensor-wide scale \gamma has no effect on direction at all: it multiplies the whole tensor uniformly, so it only changes magnitude. That means the tensor-wide scale can be used to absorb the global length of the vector, leaving the block scales to encode relative block amplitudes.

In other words, once a tensor-wide scale is present, the block scales stop answering the question “how large is this tensor?” and instead answer the question “how do the blocks compare with one another?”

Exact scale-only cosine factor

Let \hat v denote the ideal blockwise representation obtained using the real-valued scales \beta_b^\star, and let \tilde v denote the represented tensor after block-scale quantisation.

Write

x_b = \frac{\tilde \beta_b}{\beta_b^\star}.

Then the represented block is simply

\tilde v_b = x_b \hat v_b.

So scale quantisation does not change any chosen block direction, it only rescales the ideal projected blocks.

Define

\alpha_b = \frac{|\hat v_b|^2}{\sum_j |\hat v_j|^2},

so that \alpha_b is the fraction of the ideal projected energy contained in block b.

Then, exactly as in the previous post, we have

\cos(\hat v,\tilde v) = \frac{\sum_b \alpha_b x_b}{\sqrt{\sum_b \alpha_b x_b^2}}.

This says that directional distortion from block-scale quantisation depends only on how uneven the multiplicative scale errors x_b are across blocks.

If all blocks were rescaled by the same factor, direction would be unchanged.

Two jobs for two different kinds of bits

A block-scale format does two things.

First, its exponent bits determine what range of relative block scales can be represented without clipping or underflow.

Second, its significand bits determine how accurately the in-range scales are represented.

So there are really two error sources:

  1. tail loss, from blocks whose scales fall outside range;
  2. in-range uneven rescaling, from finite precision within range.

The interesting question is how these trade off when the total number of scale bits is fixed.

A conservative scale-only view for fixed K

Suppose the tensor is divided into K blocks.

If we want a format-level guarantee, a natural scale-only question is:

for a given number of blocks K, how should a fixed budget of scale bits be split between exponent and significand so as to control the worst-case angular loss caused by scale quantisation?

I am deliberately saying “scale-only” here because this is not a claim about the globally optimal scalar alphabet (a problem Bardia and I cover in our preprint, linked to above), nor about the full problem of choosing the mantissa vectors. It is a conservative model of the additional angular error introduced after the block directions have already been chosen.

To make this concrete, let

  • e be the number of exponent bits,
  • p be the significand precision in bits, and
  • k=e+p be the total scale-field width.

Now make one simplifying assumption: the mantissa vectors used in different blocks all have the same norm. Under that assumption, projected block energy is proportional to the square of the ideal block scale. This lets us reason directly in terms of the block scales themselves.

If the exponent field is too narrow, then very small relative block scales may underflow to zero after tensor-wide normalisation. In the worst case, one block survives at unit scale and the remaining K-1 blocks sit just below the lower threshold and are lost.

If the smallest representable normalised scale is \tau_e, then the exponent-side cosine contribution is bounded by

\displaystyle \cos(\hat v,z)\geq \frac{1}{\sqrt{1+(K-1)\tau_e^2}}.

Here z denotes the intermediate vector obtained by zeroing the blocks that fall below the representable scale threshold.

If, on the other hand, the surviving blocks remain in range but the scale precision is limited, then the remaining error comes from uneven in-range rescaling. Suppose the multiplicative scale errors for the surviving blocks satisfy

\displaystyle \ell_p \leq x_b \leq u_p.

Then the interval argument from the previous post gives

\displaystyle \cos(z,\tilde v)\geq \frac{2\sqrt{\ell_p u_p}}{\ell_p+u_p}.

In the common symmetric relative-error model, \ell_p=1-u and u_p=1+u, so this becomes

\displaystyle \cos(z,\tilde v)\geq \sqrt{1-u^2}.

Putting the clipping and in-range effects together gives the conservative scale-only bound

\displaystyle \cos(\hat v,\tilde v)\geq \frac{2\sqrt{\ell_p u_p}}{\ell_p+u_p}\cdot \frac{1}{\sqrt{1+(K-1)\tau_e^2}}.

In the symmetric relative-error model, this simplifies to

\displaystyle \cos(\hat v,\tilde v)\geq \frac{\sqrt{1-u^2}}{\sqrt{1+(K-1)\tau_e^2}}.

This is the key scale-only design inequality.

What this says about exponent bits

The first striking feature is that exponent bits only need to control the low-energy tail.

To make clipping negligible, it is enough to ensure that

\displaystyle (K-1)\tau_e^2 \ll 1.

Now the dynamic range of a floating-point-like scale grows extremely quickly with exponent width. A format-dependent way to write this is

\displaystyle \tau_e \approx 2^{-c2^e},

where c is a positive constant depending on details such as exponent bias, reserved encodings, and whether subnormals are supported.

Substituting into the clipping condition gives

\displaystyle (K-1)2^{-2c2^e}\ll 1.

Solving this roughly gives

\displaystyle e \gtrsim \log_2\log_2 K+O(1).

The important point is the growth law. The number of exponent bits needed to control relative-tail clipping grows only like \log\log K.

That is a very slow growth law: once K is fixed, only a small number of exponent bits is needed before clipping becomes a second-order issue for direction.

What this says about significand bits

Once clipping is under control, the remaining scale error is dominated by in-range relative precision. That is the role of the significand.

In the symmetric relative-error model, the scale-only cosine contribution is

\displaystyle \cos(z,\tilde v)\geq \sqrt{1-u^2}.

Equivalently, the angular contribution is at most

\displaystyle \arcsin(u).

If a p-bit significand gives a relative scale error of the form

\displaystyle u\approx C_{\rm round}2^{-p},

where C_{\rm round} depends on the precise rounding convention, then asking the scale field alone to contribute at most an angle \theta gives the rough condition

\displaystyle C_{\rm round}2^{-p}\lesssim \sin\theta.

Equivalently,

\displaystyle p\gtrsim \log_2\left(\frac{C_{\rm round}}{\sin\theta}\right).

So once the exponent field is “good enough”, every additional scale bit is more profitably spent on significand precision than on more dynamic range. This is the central conclusion of the scale-only model.

A simple rule of thumb

The previous discussion suggests the following design rule.

Use just enough exponent bits to make clipping of important blocks negligible. Spend the rest on significand precision.

For a tensor with K blocks and a scale field of width k, a rough rule of thumb is

\displaystyle e^\star \approx \left\lceil \log_2\log_2 K \right\rceil + C_{\rm format},\qquad p^\star=k-e^\star.

Here C_{\rm format} is a small format-dependent additive correction – in a real format, it depends on details such as exponent bias, special encodings, and subnormal support. This is not meant as a precise optimum, only as a rough scale-only guide. But it makes the main point quite clearly:

exponent bits become sufficient very quickly; significand bits keep helping.

What this means for modern designs

This way of looking at the problem helps explain why a format that combines

  • a high-precision tensor-wide scale, and
  • a more precise block-scale format

looks like a very sensible design.

The tensor-wide scale deals with global magnitude. That leaves the block scales free to focus on preserving the relative block amplitudes that determine global direction.

This is exactly the tradeoff that makes NVFP4 interesting. Compared with exponent-only scaling, E4M3 block scales spend some representational power on non-power-of-two precision. The second-level FP32 tensor scale then compensates for the reduced range of the more precise E4M3 block-scale format.

There is also a useful connection with Bardia’s preprint. That paper studies the scalar alphabet inside a block, and finds that for 4-bit alphabets at the NVFP4 micro-block dimension d=16, E2M1 is close to an independently optimized direction-preserving alphabet. This post studies a complementary question one level higher: once those micro-blocks have scales, how should the scale alphabet itself spend its bits?

So the two messages reinforce one another:

  • inside a micro-block, E2M1 is a surprisingly good product-structured scalar alphabet for direction preservation;
  • across micro-blocks, a tensor-wide scale makes the relative block-scale problem more important, so non-power-of-two block-scale precision becomes valuable.

From the perspective of directional reconstruction, that seems like a very good bargain.

Conclusion

The previous post showed that even exponent-only block scaling preserves direction surprisingly well.

This post goes a step further. Once a tensor-wide high-precision scale is available, the main question is no longer whether coarse block scaling is robust enough. Instead, we should ask whether the block scales are making the best possible use of their bits.

From the point of view of reconstructing global direction,

  • exponent bits protect against clipping of low-energy tail blocks;
  • significand bits improve the relative amplitudes of all the important in-range blocks.

Since exponent range grows very quickly with exponent width, only a modest number of exponent bits is needed before clipping becomes a secondary issue. After that, precision matters more.

This is a scale-only argument. It assumes the block directions have already been chosen, and studies the additional directional error caused by quantising the relative block amplitudes.

Proof sketch of the conservative scale-only bound

Readers not interested in the algebra can safely skip this section.

Assume equal mantissa norms across blocks, so that ideal projected block energy is proportional to the square of the ideal block scale.

Let \hat v be the ideal blockwise projected vector after tensor-wide normalisation. We model low-precision block scales in two stages.

First, let z be obtained from \hat v by zeroing every block whose normalised scale falls below the smallest representable positive scale \tau_e. In the worst case, one block survives at scale 1 and the remaining K-1 blocks sit just below \tau_e and are lost. The lost projected-energy fraction is then

\displaystyle \eta=\frac{(K-1)\tau_e^2}{1+(K-1)\tau_e^2},

so

\displaystyle \cos(\hat v,z)=\sqrt{1-\eta}=\frac{1}{\sqrt{1+(K-1)\tau_e^2}}.

Second, form the final represented vector \tilde v by applying in-range rounding to the surviving block scales. If the surviving-block multiplicative errors satisfy

\displaystyle \ell_p\leq x_b\leq u_p,

then the interval bound from the previous post gives

\displaystyle \cos(z,\tilde v)\geq \frac{2\sqrt{\ell_p u_p}}{\ell_p+u_p}.

In the symmetric relative-error model, \ell_p=1-u and u_p=1+u, so this becomes

\displaystyle \cos(z,\tilde v)\geq \sqrt{1-u^2}.

Now write \hat v=z+r, where r is supported only on the clipped blocks. Since \tilde v is supported only on the surviving blocks, we have r\perp \tilde v, and hence

\displaystyle \cos(\hat v,\tilde v)=\frac{\langle z,\tilde v\rangle}{|\hat v|_2|\tilde v|_2}=\frac{|z|_2}{|\hat v|_2}\frac{\langle z,\tilde v\rangle}{|z|_2|\tilde v|_2}=\cos(\hat v,z)\cos(z,\tilde v).

Therefore

\displaystyle \cos(\hat v,\tilde v)\geq \frac{2\sqrt{\ell_p u_p}}{\ell_p+u_p}\cdot \frac{1}{\sqrt{1+(K-1)\tau_e^2}}.

In the symmetric relative-error model, this is

\displaystyle \cos(\hat v,\tilde v)\geq \frac{\sqrt{1-u^2}}{\sqrt{1+(K-1)\tau_e^2}}.

Parliamentary Inquiry on Low-Energy Computing

In the UK, the House of Commons Select Committee on Science, Innovation and Technology is currently holding an inquiry into “Low-Energy Computing”. The remit of the inquiry can be found below.

AI model sizes and data volumes are growing significantly. At the same time, areas like quantum computing and protein synthesis also require increasing amounts of computational power.

This trend is exerting increasing demands on energy supplies, and it has been suggested that new innovations in silicon photonics and neuromorphic computing could offer a solution. 

The Science, Innovation and Technology committee is examining how realistic a possibility this is, when breakthroughs might be expected to take place and what the government is doing to support research and innovation activity in this area.

This inquiry has been launched following pitches made to the committee as part of its Under the Microscope initiative. 

I am reproducing my submission to this inquiry below, in case it is of value to others. The text can also be found – alongside all other submissions made – at the official parliamentary website.

Directionality in Low Precision

In a couple of recent posts [1,2], I have been trying to reason through the geometric properties of block number formats. The basic idea is that when a group of numbers shares a scale factor, the small low-precision numbers inside the block are no longer meaningfully trying to approximate scalar magnitudes, as the scale has already taken care of much of the magnitude information. What remains is, to a significant extent, a question of direction.

This post is about a new preprint I have recently put out with Bardia Zadeh, “Direction-Preserving Number Representations”. The question we ask is “If a block scale has already taken care of magnitude, what should the scalar values inside the block look like if our aim is to preserve direction?”

That is not the usual question in computer arithmetic, and I can’t really find a historical precedent for this question in the usual places I would look (please let me know if you know others who have looked at this question from an arithmetic perspective!) We usually ask about absolute error, relative error, dynamic range, rounding behaviour, underflow, overflow, and so on. But in many modern machine learning settings, it is also natural to ask a more geometric question: how well do the values available inside a block cover the possible directions of a vector?

On the research methodology side, there is another aspect of this paper that is new for me personally. This is the first paper I have written where AI tools (namely GPT-5.4, GPT-5.5 and Aristotle) made a substantive contribution to the development of proof ideas, as well as to the Lean formalisation, which I am delighted we have open sourced. The AI tools definitely did not replace mathematical judgement or lead to a push button approach: the process still required many manual reformulations, checking, rewriting, and false starts. But it was a new and positive experience for me. In particular, the combination of exploratory AI assistance and Lean’s rock solid proof checking made for a very different style of research interaction from the one I am used to.

I will come back to Lean and AI below. First, let me describe the mathematical object we studied.

From scalar alphabets to directions

Suppose we have a finite scalar alphabet A. For example, A might be the set of real values represented by a 4-bit floating-point format, or by a 4-bit integer format.

If a block has dimension n, then the unscaled vectors we can represent are the product set A^n.

But if the block also has an independent positive scale factor, then multiplying the whole vector by a positive scalar should not really change the information encoded by the low-precision elements. The scale can absorb that. What matters is the direction.

So the relevant object becomes

P_n(A)=\left\{\frac{x}{|x|_2}:x\in A^n,\;x\ne 0\right\}.

This is a finite set of points on the unit sphere. But these points can’t be placed arbitrarily on the sphere. The set has product structure, because each coordinate is chosen independently from the scalar alphabet.

The natural worst-case measure is the covering radius

F_n(A)=\sup_{u\in S^{n-1}}\min_{c\in P_n(A)}\angle(u,c).

This asks: given any true direction u, how close is the nearest direction obtainable from the alphabet A? Smaller F_n(A) is better.

This lens changes the problem of how to select the alphabet. Instead of asking “which scalar values approximate real numbers well?”, we ask: which scalar values, when used coordinatewise, cover directions well?

You can see the two-dimensional directionality coverage of the floating-point alphabet with two exponent bits, one mantissa bit and one sign bit (E2M1), illustrated below. Each intersection of grid lines corresponds to a 2-vector with elements drawn from this alphabet. We may then find where the line between the origin and that intersection meets a circle (marked as red points). Note the non-uniform spacing around the circle: some regions are better covered than others.

Directional Coverage of E2M1

If you want to play with designing your own two-dimensional alphabet in a graphical user interface, you can get more intuition into this problem using this widget Bardia created: https://bardia01.github.io/directional_coverage_explorer/.

A Lean Interlude

As I mentioned above, I’m super pleased that we’ve open-sourced formalisations of all our theorems and definitions in Lean. The formalisation is organised so that the top-level declarations correspond closely to the paper.

The key definitions look like this:

abbrev Aq (q : Nat) : Set (Finset Real) :=
  {A : Finset Real | A.card = q}

abbrev P (n : Nat) (A : Finset Real) :
  Finset (OptimalAlphabets.SpherePoint n) :=
  OptimalAlphabets.AsymmetricProduct.asymProdSphericalCode n A

abbrev F (n : Nat) (A : Finset Real) : Real :=
  OptimalAlphabets.AsymmetricProduct.F_asym n A

abbrev rhoSph (n m : Nat) : Real :=
  OptimalAlphabets.rho_sph n m

Here Aq q is the class of scalar alphabets with q elements. The definition P n A is the finite spherical code obtained by normalising nonzero elements of A^n. The quantity F n A is the product-code covering objective F_n(A). The definition rhoSph n m is the optimal covering radius of an unconstrained spherical code with m points.

I’m giving these Lean definitions now so we can roughly follow along in Lean as we go for the rest of this blog post.

Product Codes versus Spherical Codes

It seems natural, almost obvious, that a product code should give worse directional coverage than a vector code chosen specifically to optimise directional coverage. Of course, the latter may not be practical, as the decoding process may be significant, but it still forms a useful baseline comparison point. Our first theoretical contribution is to quantify the gap between these two code classes.

Notice that the product structure induces a very severe geometric constraint. Even if A has q values, so that A^n has up to q^n raw vectors, those vectors are not arbitrary. They arise from independent coordinate choices from the same scalar alphabet. Meanwhile, a spherical code with q^n points is free to place those points anywhere on the sphere.

The harmonic witness

A central construction in the paper is a direction that is hard for product codes to cover. Let

H_n=\sum_{i=1}^{n}\frac{1}{i}

be the nth harmonic number, and define the unit vector

u^{(n)}=\left(\frac{1}{\sqrt{H_n}},\frac{1}{\sqrt{2H_n}},\ldots,\frac{1}{\sqrt{nH_n}}\right).

The entries of this vector decay slowly, making the vector awkward for a finite scalar alphabet. The resulting theorem gives a lower bound on the worst-case angular error. If m(A) is the smaller of the number of positive and negative nonzero values in A, then

F_n(A)\ge \arccos\left(\min\left\{1,\;2\sqrt{\frac{m(A)}{H_n}}\right\}\right).

The Lean statement is compact enough to include directly:

theorem theorem2_sign_count_bound {n : Nat} (hn : 2 <= n)
    (A : Finset Real) :
  Real.arccos
    (min 1 (2 * Real.sqrt (mSign A : Real) / Real.sqrt (H n))) <=
  F n A

The notation mSign A is the Lean name for m(A), the smaller nonzero sign count. The theorem is written as a lower bound on F n A, just as in the paper.

Since H_n grows like \log n, this bound tends towards \pi/2 for any fixed alphabet. In plain language: in sufficiently high dimension, every fixed scalar alphabet has some direction that it represents very badly.

This is a worst-case theorem. We’re not claiming that real neural network tensors look like the harmonic witness. What the theorem tells us is that if the metric is worst-case angular coverage of the entire sphere, product-structured scalar alphabets have an inherent limitation.

So What about Spherical Codes?

A product code built from a q-element alphabet has at most q^n raw codewords before normalisation. The fair unconstrained comparison is therefore a spherical code with q^n points.

The paper proves that, for any fixed q, sufficiently high-dimensional spherical codes beat every q-element product code in worst-case angular covering radius.

Here is the Lean statement:

theorem theorem4_asymptotic_strict_separation_fixed_alphabet_size
    {q : Nat} (hq : 2 <= q) :
   N : Nat, 2 <= N 
    forall n, N <= n ->
      forall A : Finset Real, A  Aq q ->
        rhoSph n (q ^ n) < F n A

Read from right to left, this says: take any scalar alphabet A with q elements. In all sufficiently large dimensions n, the best unconstrained spherical code with q^n points has strictly smaller covering radius than the product-code direction set induced by A.

What about Floating and Fixed Point?

The next question we answer is more practical. Within the product-code world, are the usual scalar alphabets the best ones?

The paper studies standard floating-point, fixed-point, and two’s complement alphabets. The answer is that these conventional choices are asymptotically suboptimal for the worst-case directional metric.

Because the worst-case angle for any product code tends towards 90^\circ in high dimension (see above), it is more informative when comparing product codes to look at a normalised quantity such as:

\sqrt{H_n}\cos F_n(A).

Very roughly, this measures how slowly the worst-case angle approaches 90^\circ. Larger is better.

The Lean statement packages the relevant comparison as a liminf/limsup chain:

theorem theorem5_liminf_limsup_chain {b : Nat} (hb : 3 <= b) :
  arbConst b <=
      Filter.liminf (fun n : Nat => normBestAlpha n (2 ^ b)) atTop 
  fpConst b < arbConst b 
  Filter.limsup (fun n : Nat => normBestFpCos n b) atTop <= fpConst b

This is a theorem about b-bit alphabets. The quantity normBestAlpha n (2 ^ b) is the best normalised performance obtainable by an arbitrary 2^b-element scalar alphabet. The quantity normBestFpCos n b is the corresponding floating-point quantity, optimised over valid splits of exponent and mantissa bits. The constants arbConst b and fpConst b are the two asymptotic constants being compared.

The middle inequality fpConst b < arbConst b is the key point. For b\ge 3, arbitrary scalar alphabets can do strictly better than the floating-point family in this asymptotic directional metric.

For four bits, the paper obtains a concrete constant-factor separation of at least \sqrt{7/3}\approx 1.528.

So if the design objective is “choose scalar levels whose product code covers directions well”, there are better alphabets than the standard ones, at least in this worst-case asymptotic sense.

AI and Lean

It feels worth saying a little more about the process followed to reach the proofs of these theorems. This paper is the first time I have had a genuinely substantive AI contribution to the development of proof ideas, not just text polishing and review. AI was useful both in the Lean formalisation and in exploring how some of the mathematical arguments might be structured.

The AI tools we used needed lots of iteration, but the workflow was unexpectedly productive. GPT-5.4 and 5.5 and Aristotle from Harmonic could suggest possible routes, propose intermediate lemmas, help with translation between informal mathematics and Lean statements, and generate candidate proof fragments.

This combination was new for me. I am used to mathematical collaboration involving conversations with people, paper sketches, whiteboards, and eventually LaTeX. Here there was another kind of interaction: a fast, imperfect, but useful assistant for exploring the proof space, coupled with Lean as a formal system that refused to accept anything vague.

Mathematical judgement still matters, in what we wanted to prove as well as what counts as an informative proof. But I came away from the experience more positive about the role these tools can play in research, especially when paired with formal verification rather than used as a substitute for it.

The Experimental Side: Exploring 4-bit Alphabets

The theory says that better scalar alphabets should exist. The experiments in the paper ask what they look like. For four-bit alphabets, we impose sign symmetry and include zero. Since multiplying all scalar levels by a common positive factor does not change the represented directions, we normalise the smallest positive value to one.

For block dimension d=16 (as used in NVIDIA NVFP4), the optimised positive levels found in the paper are approximately

1,\;2.12,\;3.40,\;5.04,\;7.25,\;10.5,\;13.2.

The optimised alphabet is best across the tested dimensions. But what I find most interesting is how close E2M1 is to the optimum, especially compared with integer/fixed-point and pure powers-of-two formats.

E2M1 is the four-bit format used in NVFP4. The results suggest a geometric explanation for why it works well in block-scaled machine learning settings. The key point is that, for this bit-width and block size, the E2M1 levels lie surprisingly close to the levels obtained by directly optimising the product-code directional covering problem.

Conclusion

The main message is that the geometric lens provides value when considering how to design low-precision number formats for machine learning. Once a block scale is present, the scalar values inside the block are not merely approximating real numbers independently. Together, they are choosing a direction. The scalar alphabet therefore determines a product-structured spherical code.

There are three consequences I find useful.

First, product structure has unavoidable limitations. A coordinatewise scalar alphabet cannot cover directions as well as an unconstrained spherical code with the same number of raw codewords.

Second, standard scalar formats are not forced by the geometry. Floating-point, fixed-point, and two’s complement are natural formats for many reasons, but the directional covering objective points to other possibilities.

Third, E2M1 comes out looking very good. The optimised alphabet is better in the sampled worst-case metric, but E2M1 is close enough that its empirical success in block-scaled low-precision settings has a clean geometric explanation.

The Lean formalisation matters because it pins down the definitions, checks the asymptotic comparisons, verifies the separation from standard formats, and formalises the scale-search theorem used in the experiments.

The AI aspect matters to me for a different reason. It changed the way this paper was developed. The experience was not one of handing mathematics over to a machine, but of using AI as part of a proof-development workflow. For me, that was new, and it was positive.

So perhaps a promising design question for future low-precision formats should be phrased less like the traditional “Which scalar values approximate real numbers best?” and more like “Which scalar values, when used coordinatewise inside a block, represent directions best?” That seems like a useful question in the world of block-scaled arithmetic.