Block Number Formats are Direction Preservers

I’ve recently returned from the SIAM PP 2026 conference and as always, conferences help provide time for research reflection. One thing I’ve been reflecting on during my journey back is the various explanations people give for why the machine learning world is so keen on block number formats (MX, NVFP, etc.) – see my earlier blog post on MX if you need a primer. Many hardware engineers tend to answer that they lead to efficient storage, or efficient arithmetic, or improved data transfer bandwidth, which are all true. But I think there’s another complementary answer that’s less well discussed (if indeed it is discussed at all). I hope this blog post might help stimulate some discussion of this complementary take.

On the numerical side, at first glance it might seem surprising that despite these formats representing numbers with very limited precision, large neural networks often tolerate them remarkably well, with little loss in accuracy. In my experience, most explanations focus on dynamic range, quantization noise, the inherent noise robustness of neural networks, or calibration techniques. But I suspect there is also a simple geometric way to think about what these formats are doing: Block number formats help preserve vector direction. And for many machine learning computations, preserving direction matters far more than preserving exact numerical values.

Block formats inherently represent direction and magnitude

Consider a vector v whose coordinates are partitioned into blocks v = (v_1, v_2, \dots, v_B).

In a block format, each block is represented using a shared scale and low-precision mantissas. For ease of discussion, we’ll consider the simplest case here, where the scales are allowed to take arbitrary real values. In general they may be much more restricted, e.g. to powers of two.

Each block is approximated as \hat v_b = \beta_b m_b, where

  • m_b is a vector of low-precision mantissas, and
  • \beta_b is a scalar shared scaling factor.

In other words, each block can be thought of as a direction (encoded by the mantissas) multiplied by a magnitude (the shared scale). Strictly speaking, the mantissa vectors m_b need not be normalized, and in many formats their entries may have quite different magnitudes (for example in integer mantissa formats such as MXINT). However, this does not change the geometry. The representation \hat{v}_b = \beta_b m_b is invariant to rescaling of m_b: multiplying m_b by any nonzero constant simply rescales \beta_b by the inverse factor. What matters for the approximation is therefore only the direction of m_b, i.e. the one-dimensional subspace it spans.

We don’t usually frame it this way, but broadly speaking this is what block scaling achieves: it decouples the representation of magnitude from the representation of direction. This resembles the familiar decomposition v = \|v\|\frac{v}{\|v\|} of a vector into its magnitude and direction, but applied locally within blocks.

If the mantissa vector m_b points roughly in the same direction as the original block v_b, then scaling it appropriately produces a good approximation of that block.
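As a small sketch of this idea, here is an MXINT-style blockwise quantizer. The block size, bit width, and max-abs scaling rule are illustrative assumptions rather than any particular standard, and the scale is kept real-valued as in the discussion above:

```python
import numpy as np

def quantize_block(v_b, mantissa_bits=8):
    """Approximate a block as beta_b * m_b: a shared real-valued scale
    times a vector of low-precision integer mantissas (MXINT-style)."""
    qmax = 2 ** (mantissa_bits - 1) - 1       # e.g. 127 for 8-bit mantissas
    beta = np.max(np.abs(v_b)) / qmax         # shared scale for the block
    m = np.round(v_b / beta)                  # integer mantissas in [-qmax, qmax]
    return beta * m                           # magnitude (beta) x direction (m)

rng = np.random.default_rng(0)
v_b = rng.standard_normal(32)
v_hat = quantize_block(v_b)
cos = v_b @ v_hat / (np.linalg.norm(v_b) * np.linalg.norm(v_hat))
print(cos)  # close to 1: the block's direction is well preserved
```

Even though each mantissa carries only a few bits, the cosine similarity between the block and its approximation is very close to 1.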

OK, but does preserving directions block by block actually preserve the direction of the whole vector? It turns out that the answer is yes.

Direction Preservation

Let us make the reasonable assumption that the scale of each block \beta_b is not chosen arbitrarily, but rather is the best possible scale for that block in the least squares sense, for whatever mantissa vector we choose, i.e. \beta_b = \arg\min_{\beta} \|v_b - \beta m_b\|^2. Then \hat v_b is the orthogonal projection of v_b onto the line spanned by m_b.
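The least-squares scale has the closed form \beta_b = \langle v_b, m_b\rangle / \|m_b\|^2, and with that choice the residual v_b - \beta_b m_b is orthogonal to m_b. A quick numerical check, using a hypothetical mantissa vector obtained by simple rounding:

```python
import numpy as np

rng = np.random.default_rng(1)
v_b = rng.standard_normal(16)
m_b = np.round(4 * v_b)          # a hypothetical low-precision mantissa vector

# Closed form of the least-squares scale: beta = <v_b, m_b> / ||m_b||^2
beta = (v_b @ m_b) / (m_b @ m_b)
v_hat = beta * m_b

# With this scale the residual is orthogonal to m_b, i.e. v_hat is the
# orthogonal projection of v_b onto the line spanned by m_b.
print(abs((v_b - v_hat) @ m_b))  # ~0, up to floating-point rounding
```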

So to what extent do the approximated and original blocks point in the same direction? We can measure the cosine similarity of each block as: \rho_b = \frac{\langle v_b,\hat v_b\rangle}{\|v_b\|\|\hat v_b\|}.

Equally, we can measure the cosine similarity of the full vectors (the concatenation of the original blocks versus the concatenation of the approximated blocks): \rho = \frac{\langle v,\hat v\rangle}{\|v\|\|\hat v\|}.

My aim here is to explain why a small directional error at the block level leads to only a small directional error at the vector level.

First, let’s define w_b = \frac{\|v_b\|^2}{\|v\|^2}, which we can think of as the fraction of the vector’s energy contained in block b; these add to 1 over the whole vector. Now we can state the result:

Theorem (Block Cosines)

Under the blockwise least-squares scaling, \rho = \sqrt{\sum_{b=1}^{B} w_b \rho_b^2 }.

For the proof, see the end of the post.

In simple terms, this theorem states that the cosine similarity of the whole vector is the energy-weighted RMS of the block cosine similarities.
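As a sanity check on the theorem, here is a small numerical sketch. The 4-bit integer mantissas, block size, and max-abs mantissa rule are illustrative assumptions; the scale is the least-squares one the theorem assumes:

```python
import numpy as np

def ls_quantize_block(v_b, mantissa_bits=4):
    """Integer mantissas with the least-squares-optimal shared scale."""
    qmax = 2 ** (mantissa_bits - 1) - 1
    m = np.round(v_b / (np.max(np.abs(v_b)) / qmax))  # mantissa direction
    beta = (v_b @ m) / (m @ m)                        # least-squares scale
    return beta * m

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(2)
B, n = 8, 16                                  # 8 blocks of 16 coordinates
v = rng.standard_normal(B * n)
blocks = v.reshape(B, n)
v_hat = np.concatenate([ls_quantize_block(b) for b in blocks])
hat_blocks = v_hat.reshape(B, n)

rho = cosine(v, v_hat)                                 # full-vector cosine
w = np.sum(blocks ** 2, axis=1) / np.sum(v ** 2)       # energy weights w_b
rho_b = np.array([cosine(a, b) for a, b in zip(blocks, hat_blocks)])

# The theorem: rho equals the energy-weighted RMS of the block cosines.
print(np.isclose(rho, np.sqrt(np.sum(w * rho_b ** 2))))  # True
```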

What are the implications?

The weights w_b represent how much of the vector’s energy lies in each block. Blocks that contain very little energy contribute very little to the final direction. The important consequence is that direction errors do not accumulate catastrophically across blocks. Instead, the overall directional error simply depends on a weighted average of the block direction errors. In other words, if block number formats preserve the directions of individual blocks, they automatically preserve the direction of the entire vector.

Many core operations in machine learning depend heavily on vector direction. Notably, during training, stochastic gradient descent updates already take the form of a magnitude (the learning rate) times a direction. We already have a knob controlling magnitude; what matters is that the direction is preserved. In attention mechanisms and embeddings, directional similarity measures are very important. Even for the humble dot product, the workhorse of inference, preservation of direction means that small perturbations in the input give rise to only small perturbations in the output, so the dot product behaves robustly.
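To illustrate the dot-product point, a minimal sketch: block-quantizing one operand perturbs the result only slightly relative to the natural scale \|w\|\,\|x\|. The block size and bit width below are illustrative choices:

```python
import numpy as np

def block_quantize(v, block=32, mantissa_bits=4):
    """Blockwise quantization: shared max-abs scale + integer mantissas."""
    qmax = 2 ** (mantissa_bits - 1) - 1
    out = []
    for b in v.reshape(-1, block):
        beta = np.max(np.abs(b)) / qmax
        out.append(beta * np.round(b / beta))
    return np.concatenate(out)

rng = np.random.default_rng(3)
w = rng.standard_normal(256)       # e.g. a row of a weight matrix
x = rng.standard_normal(256)       # an activation vector

w_hat = block_quantize(w)
err = abs(w @ x - w_hat @ x) / (np.linalg.norm(w) * np.linalg.norm(x))
print(err)  # small relative to the natural scale ||w|| * ||x||
```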

Conclusion

Block floating-point and similar formats (block mini-float, MX, NVFP) are usually explained in terms of dynamic range and quantization noise. But geometrically, I like the perspective that they do something simpler: they approximate each block of a vector as direction × magnitude.

And as long as the block directions are preserved reasonably well, the direction of the whole vector is preserved too.

I think this is a useful intuition as to why very low-precision formats can work so well in modern machine learning systems. Block number formats are, in a very real sense, direction preservers. From this perspective, such low-precision block formats succeed not because they represent individual numbers accurately, but because they preserve the geometry of vectors.

Lots of extensions of this kind of analysis are of course possible. To name just a few:

  • We’ve focused on vectors, but tensor-level scaling may, for example, have interesting interplay with batching during training.
  • We made the simplifying assumption that scaling factors were real-valued, but in practice they can be restricted, most significantly to powers of two, and the analysis would need to be modified to incorporate that change.
  • We’ve not discussed mantissas at all; there is lots more of interest to be said here.
  • Potentially this approach could help provide some guidance to the empirical sizing of blocks in a block representation.

If anyone would like to work with me on this topic, do let me know your ideas.


Proof of the theorem

Readers not interested in the algebra can safely skip this section.

For each block b, the approximation \hat v_b = \beta_b m_b with \beta_b chosen by least squares is the orthogonal projection of v_b onto the line spanned by m_b.

So we can write v_b = \hat v_b + r_b where r_b is orthogonal to \hat v_b.

Taking the inner product with \hat v_b gives \langle v_b,\hat v_b\rangle = \|\hat v_b\|^2.

Now sum over blocks. Because the blocks correspond to disjoint coordinates,

\langle v,\hat v\rangle = \sum_b \langle v_b,\hat v_b\rangle = \sum_b \|\hat v_b\|^2 = \|\hat v\|^2.

Therefore

\rho = \frac{\langle v,\hat v\rangle}{\|v\|\|\hat v\|} = \frac{\|\hat v\|}{\|v\|}.

Recall \rho_b = \frac{\langle v_b,\hat v_b\rangle}{\|v_b\|\|\hat v_b\|}.

Using \langle v_b,\hat v_b\rangle=\|\hat v_b\|^2, we obtain

\rho_b = \frac{\|\hat v_b\|}{\|v_b\|}.

Hence

\|\hat v_b\|^2 = \rho_b^2 \|v_b\|^2.

Summing over blocks gives

\|\hat v\|^2 = \sum_b \|\hat v_b\|^2 = \sum_b \rho_b^2 \|v_b\|^2.

Dividing by \|v\|^2, and writing

w_b = \frac{\|v_b\|^2}{\|v\|^2},

gives

\frac{\|\hat v\|^2}{\|v\|^2} = \sum_b w_b \rho_b^2.

Since \rho = \|\hat v\|/\|v\|, we obtain

\rho = \sqrt{\sum_b w_b \rho_b^2 }.

\square