Directionality in Low Precision

In a couple of recent posts [1,2], I have been trying to reason through the geometric properties of block number formats. The basic idea is that when a group of numbers shares a scale factor, the small low-precision numbers inside the block are no longer meaningfully trying to approximate scalar magnitudes, as the scale has already taken care of much of the magnitude information. What remains is, to a significant extent, a question of direction.

This post is about a new preprint I have recently put out with Bardia Zadeh, “Direction-Preserving Number Representations”. The question we ask is “If a block scale has already taken care of magnitude, what should the scalar values inside the block look like if our aim is to preserve direction?”

That is not the usual question in computer arithmetic, and I can’t really find a historical precedent for this question in the usual places I would look (please let me know if you know others who have looked at this question from an arithmetic perspective!) We usually ask about absolute error, relative error, dynamic range, rounding behaviour, underflow, overflow, and so on. But in many modern machine learning settings, it is also natural to ask a more geometric question: how well do the values available inside a block cover the possible directions of a vector?

On the research methodology side, there is another aspect of this paper that is new for me personally. This is the first paper I have written where AI tools (namely GPT-5.4, GPT-5.5 and Aristotle) made a substantive contribution to the development of proof ideas, as well as to the Lean formalisation, which I am delighted we have open sourced. The AI tools definitely did not replace mathematical judgement or lead to a push button approach: the process still required many manual reformulations, checking, rewriting, and false starts. But it was a new and positive experience for me. In particular, the combination of exploratory AI assistance and Lean’s rock solid proof checking made for a very different style of research interaction from the one I am used to.

I will come back to Lean and AI below. First, let me describe the mathematical object we studied.

From scalar alphabets to directions

Suppose we have a finite scalar alphabet $A$ . For example, $A$ might be the set of real values represented by a 4-bit floating-point format, or by a 4-bit integer format.

If a block has dimension $n$ , then the unscaled vectors we can represent are the product set $A^n.$

But if the block also has an independent positive scale factor, then multiplying the whole vector by a positive scalar should not really change the information encoded by the low-precision elements. The scale can absorb that. What matters is the direction.

So the relevant object becomes

$P_n(A)=\left\{\frac{x}{|x|_2}:x\in A^n,\;x\ne 0\right\}.$

This is a finite set of points on the unit sphere. But these points can’t be placed arbitrarily on the sphere. The set has product structure, because each coordinate is chosen independently from the scalar alphabet.

The natural worst-case measure is the covering radius

$F_n(A)=\sup_{u\in S^{n-1}}\min_{c\in P_n(A)}\angle(u,c).$

This asks: given any true direction $u$ , how close is the nearest direction obtainable from the alphabet $A$ ? Smaller $F_n(A)$ is better.

This lens changes the problem of how to select the alphabet. Instead of asking “which scalar values approximate real numbers well?”, we ask: which scalar values, when used coordinatewise, cover directions well?

You can see the two-dimensional directionality coverage of the floating-point alphabet with two exponent bits, one mantissa bit and one sign bit (E2M1), illustrated below. Each intersection of grid lines corresponds to a 2-vector with elements drawn from this alphabet. We may then find where the line between the origin and that intersection meets a circle (marked as red points). Note the non-uniform spacing around the circle: some regions are better covered than others.

If you want to play with designing your own two-dimensional alphabet in a graphical user interface, you can get more intuition into this problem using this widget Bardia created: https://bardia01.github.io/directional_coverage_explorer/.

A Lean Interlude

As I mentioned above, I’m super pleased that we’ve open-sourced formalisations of all our theorems and definitions in Lean. The formalisation is organised so that the top-level declarations correspond closely to the paper.

The key definitions look like this:

abbrev Aq (q : Nat) : Set (Finset Real) :=
  {A : Finset Real | A.card = q}

abbrev P (n : Nat) (A : Finset Real) :
  Finset (OptimalAlphabets.SpherePoint n) :=
  OptimalAlphabets.AsymmetricProduct.asymProdSphericalCode n A

abbrev F (n : Nat) (A : Finset Real) : Real :=
  OptimalAlphabets.AsymmetricProduct.F_asym n A

abbrev rhoSph (n m : Nat) : Real :=
  OptimalAlphabets.rho_sph n m

Here Aq q is the class of scalar alphabets with $q$ elements. The definition P n A is the finite spherical code obtained by normalising nonzero elements of $A^n$ . The quantity F n A is the product-code covering objective $F_n(A)$ . The definition rhoSph n m is the optimal covering radius of an unconstrained spherical code with $m$ points.

I’m giving these Lean definitions now so we can roughly follow along in Lean as we go for the rest of this blog post.

Product Codes versus Spherical Codes

It seems natural, almost obvious, that a product code should give worse directional coverage than a vector code chosen specifically to optimise directional coverage. Of course, the latter may not be practical, as the decoding process may be significant, but it still forms a useful baseline comparison point. Our first theoretical contribution is to quantify the gap between these two code classes.

Notice that the product structure induces a very severe geometric constraint. Even if $A$ has $q$ values, so that $A^n$ has up to $q^n$ raw vectors, those vectors are not arbitrary. They arise from independent coordinate choices from the same scalar alphabet. Meanwhile, a spherical code with $q^n$ points is free to place those points anywhere on the sphere.

The harmonic witness

A central construction in the paper is a direction that is hard for product codes to cover. Let

$H_n=\sum_{i=1}^{n}\frac{1}{i}$

be the $n$ th harmonic number, and define the unit vector

$u^{(n)}=\left(\frac{1}{\sqrt{H_n}},\frac{1}{\sqrt{2H_n}},\ldots,\frac{1}{\sqrt{nH_n}}\right).$

The entries of this vector decay slowly, making the vector awkward for a finite scalar alphabet. The resulting theorem gives a lower bound on the worst-case angular error. If $m(A)$ is the smaller of the number of positive and negative nonzero values in $A$ , then

$F_n(A)\ge \arccos\left(\min\left\{1,\;2\sqrt{\frac{m(A)}{H_n}}\right\}\right).$

The Lean statement is compact enough to include directly:

theorem theorem2_sign_count_bound {n : Nat} (hn : 2 <= n)
    (A : Finset Real) :
  Real.arccos
    (min 1 (2 * Real.sqrt (mSign A : Real) / Real.sqrt (H n))) <=
  F n A

The notation mSign A is the Lean name for $m(A)$ , the smaller nonzero sign count. The theorem is written as a lower bound on F n A, just as in the paper.

Since $H_n$ grows like $\log n$ , this bound tends towards $\pi/2$ for any fixed alphabet. In plain language: in sufficiently high dimension, every fixed scalar alphabet has some direction that it represents very badly.

This is a worst-case theorem. We’re not claiming that real neural network tensors look like the harmonic witness. What the theorem tells us is that if the metric is worst-case angular coverage of the entire sphere, product-structured scalar alphabets have an inherent limitation.

So What about Spherical Codes?

A product code built from a $q$ -element alphabet has at most $q^n$ raw codewords before normalisation. The fair unconstrained comparison is therefore a spherical code with $q^n$ points.

The paper proves that, for any fixed $q$ , sufficiently high-dimensional spherical codes beat every $q$ -element product code in worst-case angular covering radius.

Here is the Lean statement:

theorem theorem4_asymptotic_strict_separation_fixed_alphabet_size
    {q : Nat} (hq : 2 <= q) :
  ∃ N : Nat, 2 <= N ∧
    forall n, N <= n ->
      forall A : Finset Real, A ∈ Aq q ->
        rhoSph n (q ^ n) < F n A

Read from right to left, this says: take any scalar alphabet $A$ with $q$ elements. In all sufficiently large dimensions $n$ , the best unconstrained spherical code with $q^n$ points has strictly smaller covering radius than the product-code direction set induced by $A$ .

What about Floating and Fixed Point?

The next question we answer is more practical. Within the product-code world, are the usual scalar alphabets the best ones?

The paper studies standard floating-point, fixed-point, and two’s complement alphabets. The answer is that these conventional choices are asymptotically suboptimal for the worst-case directional metric.

Because the worst-case angle for any product code tends towards $90^\circ$ in high dimension (see above), it is more informative when comparing product codes to look at a normalised quantity such as:

$\sqrt{H_n}\cos F_n(A).$

Very roughly, this measures how slowly the worst-case angle approaches $90^\circ$ . Larger is better.

The Lean statement packages the relevant comparison as a liminf/limsup chain:

theorem theorem5_liminf_limsup_chain {b : Nat} (hb : 3 <= b) :
  arbConst b <=
      Filter.liminf (fun n : Nat => normBestAlpha n (2 ^ b)) atTop ∧
  fpConst b < arbConst b ∧
  Filter.limsup (fun n : Nat => normBestFpCos n b) atTop <= fpConst b

This is a theorem about $b$ -bit alphabets. The quantity normBestAlpha n (2 ^ b) is the best normalised performance obtainable by an arbitrary $2^b$ -element scalar alphabet. The quantity normBestFpCos n b is the corresponding floating-point quantity, optimised over valid splits of exponent and mantissa bits. The constants arbConst b and fpConst b are the two asymptotic constants being compared.

The middle inequality fpConst b < arbConst b is the key point. For $b\ge 3$ , arbitrary scalar alphabets can do strictly better than the floating-point family in this asymptotic directional metric.

For four bits, the paper obtains a concrete constant-factor separation of at least $\sqrt{7/3}\approx 1.528.$

So if the design objective is “choose scalar levels whose product code covers directions well”, there are better alphabets than the standard ones, at least in this worst-case asymptotic sense.

AI and Lean

It feels worth saying a little more about the process followed to reach the proofs of these theorems. This paper is the first time I have had a genuinely substantive AI contribution to the development of proof ideas, not just text polishing and review. AI was useful both in the Lean formalisation and in exploring how some of the mathematical arguments might be structured.

The AI tools we used needed lots of iteration, but the workflow was unexpectedly productive. GPT-5.4 and 5.5 and Aristotle from Harmonic could suggest possible routes, propose intermediate lemmas, help with translation between informal mathematics and Lean statements, and generate candidate proof fragments.

This combination was new for me. I am used to mathematical collaboration involving conversations with people, paper sketches, whiteboards, and eventually LaTeX. Here there was another kind of interaction: a fast, imperfect, but useful assistant for exploring the proof space, coupled with Lean as a formal system that refused to accept anything vague.

Mathematical judgement still matters, in what we wanted to prove as well as what counts as an informative proof. But I came away from the experience more positive about the role these tools can play in research, especially when paired with formal verification rather than used as a substitute for it.

The Experimental Side: Exploring 4-bit Alphabets

The theory says that better scalar alphabets should exist. The experiments in the paper ask what they look like. For four-bit alphabets, we impose sign symmetry and include zero. Since multiplying all scalar levels by a common positive factor does not change the represented directions, we normalise the smallest positive value to one.

For block dimension $d=16$ (as used in NVIDIA NVFP4), the optimised positive levels found in the paper are approximately

$1,\;2.12,\;3.40,\;5.04,\;7.25,\;10.5,\;13.2.$

The optimised alphabet is best across the tested dimensions. But what I find most interesting is how close E2M1 is to the optimum, especially compared with integer/fixed-point and pure powers-of-two formats.

E2M1 is the four-bit format used in NVFP4. The results suggest a geometric explanation for why it works well in block-scaled machine learning settings. The key point is that, for this bit-width and block size, the E2M1 levels lie surprisingly close to the levels obtained by directly optimising the product-code directional covering problem.

Conclusion

The main message is that the geometric lens provides value when considering how to design low-precision number formats for machine learning. Once a block scale is present, the scalar values inside the block are not merely approximating real numbers independently. Together, they are choosing a direction. The scalar alphabet therefore determines a product-structured spherical code.

There are three consequences I find useful.

First, product structure has unavoidable limitations. A coordinatewise scalar alphabet cannot cover directions as well as an unconstrained spherical code with the same number of raw codewords.

Second, standard scalar formats are not forced by the geometry. Floating-point, fixed-point, and two’s complement are natural formats for many reasons, but the directional covering objective points to other possibilities.

Third, E2M1 comes out looking very good. The optimised alphabet is better in the sampled worst-case metric, but E2M1 is close enough that its empirical success in block-scaled low-precision settings has a clean geometric explanation.

The Lean formalisation matters because it pins down the definitions, checks the asymptotic comparisons, verifies the separation from standard formats, and formalises the scale-search theorem used in the experiments.

The AI aspect matters to me for a different reason. It changed the way this paper was developed. The experience was not one of handing mathematics over to a machine, but of using AI as part of a proof-development workflow. For me, that was new, and it was positive.

So perhaps a promising design question for future low-precision formats should be phrased less like the traditional “Which scalar values approximate real numbers best?” and more like “Which scalar values, when used coordinatewise inside a block, represent directions best?” That seems like a useful question in the world of block-scaled arithmetic.

Block Number Formats are (Still!) Direction Preservers

In my previous post, I argued that block number formats can be understood geometrically as direction preservers. That argument relied on an idealization: once a block direction had been chosen, its scale could be set optimally as an arbitrary real number.

Real hardware formats do not usually work that way. In many practical schemes, block scales are quantized very coarsely, sometimes all the way down to powers of two. In particular, in the MX specification, all the concrete compliant formats use E8M0 scaling.

So does the directional picture I painted in my last post survive this brutal scaling? Here I will argue, in the first of what I hope will be a short sequence of follow-up blog posts, that it does.

From ideal block scales to quantized block scales

Recall the setup from the earlier post. A vector is partitioned into blocks, $v = (v_1,\dots,v_B)$ , and each block is approximated as $\hat v_b = \beta_b m_b$ , where $m_b$ is a low-precision mantissa vector and $\beta_b$ is a scalar block scale.

In the earlier post, I assumed that $\beta_b$ was an arbitrary real value, chosen optimally in the least-squares sense. That gave the ideal blockwise representation $\hat v$ .

Now let us keep the same mantissa vectors $m_b$ , but suppose that the scale factors themselves must be quantized. Write the implemented scale as $\tilde \beta_b$ , so that the represented block becomes $\tilde v_b = \tilde \beta_b m_b$ .

It is convenient to define the multiplicative scale error $x_b = \frac{\tilde \beta_b}{\beta_b}$ . Then $\tilde v_b = x_b \hat v_b$ .

Note that, of course, quantizing the block scale does not change the chosen direction of a block at all; it only changes its length. So the only directional distortion comes from the relative rescaling of different blocks.

An exact cosine formula

Let $\alpha_b = \frac{\|\hat v_b\|^2}{\sum_j \|\hat v_j\|^2}$ , so that $\alpha_b$ is the fraction of the ideal projected vector’s energy contained in block $b$ .

Then it can be shown that $\cos(\hat v,\tilde v) = \frac{\sum_b \alpha_b x_b}{\sqrt{\sum_b \alpha_b x_b^2}}$ (the proof is included at the end of this post).

So the effect of scale quantization on direction depends only on how uneven the factors $x_b$ are across blocks. If all blocks were rescaled by the same factor, direction would be unchanged.

Exponent-only power-of-two scaling

Now consider the coarsest plausible case: each block scale is rounded to the nearest power of two. Then each multiplicative error satisfies $x_b \in [2^{-1/2},\,2^{1/2}]$ .

So, from our exact cosine formula, we are interested in how small $\frac{\sum_b \alpha_b x_b}{\sqrt{\sum_b \alpha_b x_b^2}}$ can be, when all the $x_b$ lie in the interval $[2^{-1/2},\,2^{1/2}]$ .

A simple inequality shows that the answer depends only on the two extreme values of the interval. If all the block rescaling factors lie in $[\ell,u]$ , then

$\frac{\sum_b \alpha_b x_b}{\sqrt{\sum_b \alpha_b x_b^2}}\ge \frac{2\sqrt{\ell u}}{\ell+u}$ (proof at the end of this blog post).

In the power-of-two case we have $\ell=2^{-1/2}$ , $u=2^{1/2}$ , so $\ell u=1$ , and therefore

$\cos(\hat v,\tilde v)\ge \frac{2}{2^{-1/2}+2^{1/2}} = \frac{2\sqrt2}{3}\approx 0.943$ .

Equivalently, $\angle(\hat v,\tilde v)\le 20^\circ$ .

So even if every block scale is rounded to the nearest power of two, the resulting vector remains within about $20^\circ$ of the ideally scaled one.

That is the main result of this post.

One striking feature of the bound is that it does not depend on the dimension of the vector. The reason is that the worst case is already attained by a two-group energy split: some blocks rounded up, others rounded down. Once those two groups exist, adding more blocks or more dimensions does not make the bound worse, as is apparent from the proof below.

20 degrees is less than it sounds

Our everyday intuition may tell us that this angle is not huge, but it’s not that small either. In a sense, that’s true. But angles behave very differently in high-dimensional spaces. In high dimension, most random vectors are almost orthogonal to one another: their angle is close to $90^\circ$ , so a guarantee that an approximation remains within $20^\circ$ of the original vector is much stronger than it would sound in two or three dimensions.

Beyond power-of-two

We’ve analysed power-of-two scaling here for two reasons: because it’s in a sense the crudest possible floating-point rounding, and because it’s commonly used in real hardware designs.

That does not mean it’s optimal. But it does raise two further questions. Firstly, we’ve assumed here that the exponent range is sufficiently wide – what if it’s not? Secondly – and relatedly – how much better can this angular bound get by spending some of the scale bits on greater precision?

My view is that the answer becomes clearer once a tensor-wide high-precision scale is introduced, something NVIDIA has recently done. In that setting, the block scales get relieved of their additional duty to capture global magnitude. This will be the subject of the next post on the topic!

Proofs

Readers not interested in the algebra can safely skip this section.

Cosine formula

Recall that $\tilde v_b = x_b \hat v_b$ for each block $b$ .

Then, because the blocks occupy disjoint coordinates, $\langle \hat v,\tilde v\rangle = \sum_b \langle \hat v_b,\tilde v_b\rangle = \sum_b x_b \|\hat v_b\|^2$ .

Therefore $\cos(\hat v,\tilde v) = \frac{\langle \hat v,\tilde v\rangle}{\|\hat v\|\,\|\tilde v\|} = \frac{\sum_b x_b \|\hat v_b\|^2}{\sqrt{\sum_b \|\hat v_b\|^2}\sqrt{\sum_b x_b^2 \|\hat v_b\|^2}}$ .

Now, as per the main blog post, define $\alpha_b = \frac{\|\hat v_b\|^2}{\sum_j \|\hat v_j\|^2}$ .

Writing $S=\sum_j \|\hat v_j\|^2$ , so that $\|\hat v_b\|^2=\alpha_b S$ , the numerator becomes $S\sum_b \alpha_b x_b$ and the denominator becomes $S\sqrt{\sum_b \alpha_b x_b^2}$ , giving