Precision Matters in Block Scales

This is the third post in a sequence relating to the geometry of block number formats as angle preservers. In my previous post, I argued that block number formats remain direction preservers even when their block scales are quantised all the way down to powers of two, as is common in some number representations like the MX concrete formats. The main result there was that exponent-only block scaling perturbs direction by at most about $20^\circ$, and that $20^\circ$ is not that much in high dimensions. So we ended last time with the nice result that very coarse block-scale quantisation is relatively benign.

But that doesn’t mean power-of-two scaling is optimal. If we have a fixed budget of bits for representing block scales, how should we spend them? Specifically, should we spend them on exponent range, or on significand precision?

This post argues that the answer becomes much clearer once a high-precision tensor-wide scale is introduced, which is exactly the kind of two-level scaling used in NVIDIA’s NVFP4 format. In NVFP4, 4-bit E2M1 values are combined with an FP8 E4M3 scale for each 16-value micro-block and a second-level FP32 scale for the tensor.

With such a tensor-wide scale, the block scales are relieved of their duty to try to capture the global magnitude of the tensor. Instead, we can ask a more focused task of them: reconstruct the relative amplitudes of the blocks, so that the global direction of the represented vector is preserved.

Since drafting this post, Bardia Zadeh and I have also written Direction-Preserving Number Representations, which I blogged about separately. That paper studies the related question of what directions can be obtained when each coordinate of a vector is drawn from a finite scalar alphabet. This post is about block scales rather than scalar elements, but the same product-structured geometry reappears one level higher.

I will argue in this post that once we look at the problem that way, precision in the block scales starts to matter much more. This leads to a rough rule of thumb for the relationship between block scale formats and vector lengths.

What a tensor-wide scale changes

Suppose, as per my previous posts, that each block is represented as $\hat v_b = \beta_b^\star m_b$, where $m_b$ is a chosen mantissa vector and $\beta_b^\star$ is the ideal real-valued block scale for that mantissa direction.

Now suppose that the final represented tensor has the form

$\tilde v = \gamma \bigoplus_{b=1}^K (\tilde \beta_b m_b)$

where

$\gamma$ is the high-precision tensor-wide scale,
$\tilde \beta_b$ is the low-precision per-block scale, and
$\bigoplus$ denotes direct sum (block concatenation).

Of course, the tensor-wide scale $\gamma$ has no effect on direction at all: it multiplies the whole tensor uniformly, so it only changes magnitude. That means the tensor-wide scale can be used to absorb the global length of the vector, leaving the block scales to encode relative block amplitudes.

In other words, once a tensor-wide scale is present, the block scales stop answering the question “how large is this tensor?” and instead answer the question “how do the blocks compare with one another?”

Exact scale-only cosine factor

Let $\hat v$ denote the ideal blockwise representation obtained using the real-valued scales $\beta_b^\star$, and let $\tilde v$ denote the represented tensor after block-scale quantisation.

Write

$x_b = \frac{\tilde \beta_b}{\beta_b^\star}.$

Then the represented block is simply

$\tilde v_b = x_b \hat v_b.$

So scale quantisation does not change any chosen block direction; it only rescales the ideal projected blocks.

Define

$\alpha_b = \frac{\|\hat v_b\|^2}{\sum_j \|\hat v_j\|^2},$

so that $\alpha_b$ is the fraction of the ideal projected energy contained in block $b$.

Then, exactly as in the previous post, we have

$\cos(\hat v,\tilde v) = \frac{\sum_b \alpha_b x_b}{\sqrt{\sum_b \alpha_b x_b^2}}.$

This says that directional distortion from block-scale quantisation depends only on how uneven the multiplicative scale errors $x_b$ are across blocks.

If all blocks were rescaled by the same factor, direction would be unchanged.
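As a quick sanity check on this formula, here is a minimal Python sketch comparing it with a directly computed cosine. The toy block values, the scale errors and the helper names `cosine` and `scale_only_cosine` are illustrative assumptions of mine, not taken from any real format.

```python
import math

def cosine(a, b):
    """Cosine similarity between two flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def scale_only_cosine(alpha, x):
    """Formula from the text: sum(alpha * x) / sqrt(sum(alpha * x^2))."""
    num = sum(a * xb for a, xb in zip(alpha, x))
    den = math.sqrt(sum(a * xb * xb for a, xb in zip(alpha, x)))
    return num / den

# Toy ideal projected blocks (hypothetical values) and per-block scale errors x_b.
ideal_blocks = [[3.0, 4.0], [1.0, 2.0], [0.5, -0.5]]
x = [1.0, 0.75, 1.25]

# alpha_b: fraction of the ideal projected energy in block b.
energy = [sum(c * c for c in block) for block in ideal_blocks]
alpha = [e / sum(energy) for e in energy]

# Concatenate the blocks, with and without the multiplicative scale errors.
v_hat = [c for block in ideal_blocks for c in block]
v_tilde = [xb * c for xb, block in zip(x, ideal_blocks) for c in block]

direct = cosine(v_hat, v_tilde)            # cosine computed from the vectors
formula = scale_only_cosine(alpha, x)      # cosine computed from alpha and x only
uniform = scale_only_cosine(alpha, [0.7] * len(alpha))  # same error in every block
print(direct, formula, uniform)
```

The first two numbers should agree to rounding error, and the third should be 1 up to rounding, reflecting that a uniform rescaling leaves direction unchanged.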

Two jobs for two different kinds of bits

A block-scale format does two things.

First, its exponent bits determine what range of relative block scales can be represented without clipping or underflow.

Second, its significand bits determine how accurately the in-range scales are represented.

So there are really two error sources:

tail loss, from blocks whose scales fall outside range;
in-range uneven rescaling, from finite precision within range.

The interesting question is how these trade off when the total number of scale bits is fixed.

A conservative scale-only view for fixed $K$

Suppose the tensor is divided into $K$ blocks.

If we want a format-level guarantee, a natural scale-only question is:

for a given number of blocks $K$, how should a fixed budget of scale bits be split between exponent and significand so as to control the worst-case angular loss caused by scale quantisation?

I am deliberately saying “scale-only” here because this is not a claim about the globally optimal scalar alphabet (a problem Bardia and I cover in our preprint, linked to above), nor about the full problem of choosing the mantissa vectors. It is a conservative model of the additional angular error introduced after the block directions have already been chosen.

To make this concrete, let

$e$ be the number of exponent bits,
$p$ be the significand precision in bits, and
$k=e+p$ be the total scale-field width.

Now make one simplifying assumption: the mantissa vectors used in different blocks all have the same norm. Under that assumption, projected block energy is proportional to the square of the ideal block scale. This lets us reason directly in terms of the block scales themselves.

If the exponent field is too narrow, then very small relative block scales may underflow to zero after tensor-wide normalisation. In the worst case, one block survives at unit scale and the remaining $K-1$ blocks sit just below the lower threshold and are lost.

If the smallest representable normalised scale is $\tau_e$, then the exponent-side cosine contribution is bounded by

$\displaystyle \cos(\hat v,z)\geq \frac{1}{\sqrt{1+(K-1)\tau_e^2}}.$

Here $z$ denotes the intermediate vector obtained by zeroing the blocks that fall below the representable scale threshold.

If, on the other hand, the surviving blocks remain in range but the scale precision is limited, then the remaining error comes from uneven in-range rescaling. Suppose the multiplicative scale errors for the surviving blocks satisfy

$\displaystyle \ell_p \leq x_b \leq u_p.$

Then the interval argument from the previous post gives

$\displaystyle \cos(z,\tilde v)\geq \frac{2\sqrt{\ell_p u_p}}{\ell_p+u_p}.$

In the common symmetric relative-error model, $\ell_p=1-u$ and $u_p=1+u$, so this becomes

$\displaystyle \cos(z,\tilde v)\geq \sqrt{1-u^2}.$

Putting the clipping and in-range effects together gives the conservative scale-only bound

$\displaystyle \cos(\hat v,\tilde v)\geq \frac{2\sqrt{\ell_p u_p}}{\ell_p+u_p}\cdot \frac{1}{\sqrt{1+(K-1)\tau_e^2}}.$

In the symmetric relative-error model, this simplifies to

$\displaystyle \cos(\hat v,\tilde v)\geq \frac{\sqrt{1-u^2}}{\sqrt{1+(K-1)\tau_e^2}}.$

This is the key scale-only design inequality.
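To get a feel for the sizes involved, the inequality can be evaluated directly. The following is a minimal sketch; the function names and the example values of $K$, $\tau_e$ and $u$ are assumptions chosen purely for illustration.

```python
import math

def scale_only_bound(K, tau_e, lo, hi):
    """Lower bound on cos(v_hat, v_tilde): clipping factor times in-range factor."""
    clip = 1.0 / math.sqrt(1.0 + (K - 1) * tau_e ** 2)
    in_range = 2.0 * math.sqrt(lo * hi) / (lo + hi)
    return clip * in_range

def symmetric_bound(K, tau_e, u):
    """Same bound in the symmetric relative-error model, x_b in [1 - u, 1 + u]."""
    return math.sqrt(1.0 - u * u) / math.sqrt(1.0 + (K - 1) * tau_e ** 2)

# Hypothetical example: 1024 blocks, smallest normalised scale 2^-8, u = 2^-4.
K, tau_e, u = 1024, 2.0 ** -8, 2.0 ** -4
general = scale_only_bound(K, tau_e, 1.0 - u, 1.0 + u)
symmetric = symmetric_bound(K, tau_e, u)
print(general, symmetric, math.degrees(math.acos(symmetric)))
```

The two functions agree in the symmetric case, and the last printed value is the corresponding worst-case angle in degrees under this model.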

What this says about exponent bits

The first striking feature is that exponent bits only need to control the low-energy tail.

To make clipping negligible, it is enough to ensure that

$\displaystyle (K-1)\tau_e^2 \ll 1.$

Now the dynamic range of a floating-point-like scale grows extremely quickly with exponent width. A format-dependent way to write this is

$\displaystyle \tau_e \approx 2^{-c2^e},$

where $c$ is a positive constant depending on details such as exponent bias, reserved encodings, and whether subnormals are supported.

Substituting into the clipping condition gives

$\displaystyle (K-1)2^{-2c2^e}\ll 1.$

Solving this roughly gives

$\displaystyle e \gtrsim \log_2\log_2 K+O(1).$

The important point is the growth law. The number of exponent bits needed to control relative-tail clipping grows only like $\log\log K$.

That is a very slow growth law: once $K$ is fixed, only a small number of exponent bits is needed before clipping becomes a second-order issue for direction.
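The slow growth is easy to see numerically. This sketch assumes the idealised range model $\tau_e = 2^{-c2^e}$ with $c=1$, and interprets "$\ll 1$" as "at most `eps`" with `eps = 0.01`; both choices, and the function name, are mine and only illustrative.

```python
def min_exponent_bits(K, c=1.0, eps=0.01):
    """Smallest e with (K - 1) * tau_e^2 <= eps, where tau_e = 2^(-c * 2^e)."""
    e = 0
    while (K - 1) * 2.0 ** (-2.0 * c * 2 ** e) > eps:
        e += 1
    return e

# Going from 16 blocks to 2^40 blocks costs only two extra exponent bits here.
for log2_K in (4, 20, 40):
    print(log2_K, min_exponent_bits(2 ** log2_K))
# prints: 4 3, then 20 4, then 40 5
```

Under these assumptions, squaring the number of blocks costs roughly one extra exponent bit, which is the $\log\log K$ behaviour described above.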

What this says about significand bits

Once clipping is under control, the remaining scale error is dominated by in-range relative precision. That is the role of the significand.

In the symmetric relative-error model, the scale-only cosine contribution is

$\displaystyle \cos(z,\tilde v)\geq \sqrt{1-u^2}.$

Equivalently, the angular contribution is at most

$\displaystyle \arcsin(u).$

If a $p$-bit significand gives a relative scale error of the form

$\displaystyle u\approx C_{\rm round}2^{-p},$

where $C_{\rm round}$ depends on the precise rounding convention, then asking the scale field alone to contribute at most an angle $\theta$ gives the rough condition

$\displaystyle C_{\rm round}2^{-p}\lesssim \sin\theta.$

Equivalently,

$\displaystyle p\gtrsim \log_2\left(\frac{C_{\rm round}}{\sin\theta}\right).$

So once the exponent field is “good enough”, every additional scale bit is more profitably spent on significand precision than on more dynamic range. This is the central conclusion of the scale-only model.
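The precision requirement can be sketched in the same way. Here I assume $C_{\rm round}=1/2$, which is one plausible convention for round-to-nearest but is an assumption rather than a property of any particular format; the function name is also hypothetical.

```python
import math

def min_significand_bits(theta_deg, c_round=0.5):
    """Smallest p with c_round * 2^-p <= sin(theta), for a target angle in degrees."""
    ratio = c_round / math.sin(math.radians(theta_deg))
    return max(0, math.ceil(math.log2(ratio)))

for theta_deg in (5.0, 1.0):
    p = min_significand_bits(theta_deg)
    u = 0.5 * 2.0 ** -p
    # arcsin(u) is the worst-case scale-only angle in the symmetric model.
    print(theta_deg, p, math.degrees(math.asin(u)))
```

With this convention, a $5^\circ$ target needs 3 bits and a $1^\circ$ target needs 5 bits: each halving of the target angle costs about one more bit, with no saturation.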

A simple rule of thumb

The previous discussion suggests the following design rule.

Use just enough exponent bits to make clipping of important blocks negligible. Spend the rest on significand precision.

For a tensor with $K$ blocks and a scale field of width $k$, a rough rule of thumb is

$\displaystyle e^\star \approx \left\lceil \log_2\log_2 K \right\rceil + C_{\rm format},\qquad p^\star=k-e^\star.$

Here $C_{\rm format}$ is a small format-dependent additive correction – in a real format, it depends on details such as exponent bias, special encodings, and subnormal support. This is not meant as a precise optimum, only as a rough scale-only guide. But it makes the main point quite clearly:

exponent bits become sufficient very quickly; significand bits keep helping.
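As a small illustration, the rule of thumb can be written as a function. The name `split_scale_bits`, the default `c_format=1` and the clamping to the range $[0,k]$ are assumptions for the sake of the sketch, and it requires $K \geq 2$.

```python
import math

def split_scale_bits(K, k, c_format=1):
    """Rule-of-thumb split of a k-bit scale field into (exponent, significand) bits."""
    e = math.ceil(math.log2(math.log2(K))) + c_format
    e = min(max(e, 0), k)   # keep the exponent width inside the available field
    return e, k - e

# Hypothetical example: an 8-bit scale field for various block counts.
for K in (16, 1024, 2 ** 20):
    print(K, split_scale_bits(K, 8))
```

The output should be read as a rough guide only: the additive constant depends on format details that this sketch does not model.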

What this means for modern designs

This way of looking at the problem helps explain why a format that combines

a high-precision tensor-wide scale, and
a more precise block-scale format

looks like a very sensible design.

The tensor-wide scale deals with global magnitude. That leaves the block scales free to focus on preserving the relative block amplitudes that determine global direction.

This is exactly the tradeoff that makes NVFP4 interesting. Compared with exponent-only scaling, E4M3 block scales spend some representational power on non-power-of-two precision. The second-level FP32 tensor scale then compensates for the reduced range of the more precise E4M3 block-scale format.
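The effect of non-power-of-two precision can be illustrated with a toy experiment. The sketch below rounds randomly drawn block scales either to the nearest power of two (`frac_bits=0`) or to a grid with three fractional significand bits (`frac_bits=3`, the same mantissa width as E4M3). It is not an implementation of E4M3 or of any MX scale format: it assumes unlimited exponent range, equal mantissa norms, log-normally distributed scales and round-to-nearest, all of which are my simplifying assumptions.

```python
import math
import random

def round_scale(beta, frac_bits):
    """Round a positive scale to `frac_bits` fractional significand bits."""
    m, e = math.frexp(beta)            # beta = m * 2**e with 0.5 <= m < 1
    steps = 2 ** (frac_bits + 1)       # grid spacing on [0.5, 1] is 2**-(frac_bits + 1)
    return math.ldexp(round(m * steps) / steps, e)

def scale_only_cosine(betas, frac_bits):
    """Cosine between ideal and scale-quantised vectors, assuming equal mantissa norms."""
    x = [round_scale(b, frac_bits) / b for b in betas]
    num = sum(b * b * xb for b, xb in zip(betas, x))
    den = math.sqrt(sum(b * b for b in betas) * sum((b * xb) ** 2 for b, xb in zip(betas, x)))
    return num / den

rng = random.Random(0)
betas = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(256)]  # hypothetical ideal scales

for frac_bits in (0, 3):
    cos = scale_only_cosine(betas, frac_bits)
    worst = math.degrees(math.asin(2.0 ** -(frac_bits + 1)))  # arcsin(u), u = 2^-(frac_bits+1)
    print(frac_bits, math.degrees(math.acos(min(1.0, cos))), worst)
```

For each setting, the measured angle must sit below the worst-case angle $\arcsin(u)$ printed alongside it, since this rounding rule keeps every $x_b$ within $[1-u, 1+u]$ for $u = 2^{-(\text{frac\_bits}+1)}$.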

There is also a useful connection with our preprint. That paper studies the scalar alphabet inside a block, and finds that for 4-bit alphabets at the NVFP4 micro-block dimension $d=16$, E2M1 is close to an independently optimized direction-preserving alphabet. This post studies a complementary question one level higher: once those micro-blocks have scales, how should the scale alphabet itself spend its bits?

So the two messages reinforce one another:

inside a micro-block, E2M1 is a surprisingly good product-structured scalar alphabet for direction preservation;
across micro-blocks, a tensor-wide scale makes the relative block-scale problem more important, so non-power-of-two block-scale precision becomes valuable.

From the perspective of directional reconstruction, that seems like a very good bargain.

Conclusion

The previous post showed that even exponent-only block scaling preserves direction surprisingly well.

This post goes a step further. Once a tensor-wide high-precision scale is available, the main question is no longer whether coarse block scaling is robust enough. Instead, we should ask whether the block scales are making the best possible use of their bits.

From the point of view of reconstructing global direction,

exponent bits protect against clipping of low-energy tail blocks;
significand bits improve the relative amplitudes of all the important in-range blocks.

Since exponent range grows very quickly with exponent width, only a modest number of exponent bits is needed before clipping becomes a secondary issue. After that, precision matters more.

This is a scale-only argument. It assumes the block directions have already been chosen, and studies the additional directional error caused by quantising the relative block amplitudes.

Proof sketch of the conservative scale-only bound

Readers not interested in the algebra can safely skip this section.

Assume equal mantissa norms across blocks, so that ideal projected block energy is proportional to the square of the ideal block scale.

Let $\hat v$ be the ideal blockwise projected vector after tensor-wide normalisation. We model low-precision block scales in two stages.

First, let $z$ be obtained from $\hat v$ by zeroing every block whose normalised scale falls below the smallest representable positive scale $\tau_e$. In the worst case, one block survives at scale $1$ and the remaining $K-1$ blocks sit just below $\tau_e$ and are lost. The lost projected-energy fraction is then

$\displaystyle \eta=\frac{(K-1)\tau_e^2}{1+(K-1)\tau_e^2},$

and, because $z$ simply keeps a subset of the coordinates of $\hat v$, the cosine between them is the square root of the retained energy fraction:

$\displaystyle \cos(\hat v,z)=\sqrt{1-\eta}=\frac{1}{\sqrt{1+(K-1)\tau_e^2}}.$

Second, form the final represented vector $\tilde v$ by applying in-range rounding to the surviving block scales. If the surviving-block multiplicative errors satisfy

$\displaystyle \ell_p\leq x_b\leq u_p,$

then the interval bound from the previous post gives

$\displaystyle \cos(z,\tilde v)\geq \frac{2\sqrt{\ell_p u_p}}{\ell_p+u_p}.$

In the symmetric relative-error model, $\ell_p=1-u$ and $u_p=1+u$, so this becomes

$\displaystyle \cos(z,\tilde v)\geq \sqrt{1-u^2}.$

Now write $\hat v=z+r$, where $r$ is supported only on the clipped blocks. Since $\tilde v$ is supported only on the surviving blocks, we have $r\perp \tilde v$, and hence

$\displaystyle \cos(\hat v,\tilde v)=\frac{\langle z,\tilde v\rangle}{\|\hat v\|_2\|\tilde v\|_2}=\frac{\|z\|_2}{\|\hat v\|_2}\frac{\langle z,\tilde v\rangle}{\|z\|_2\|\tilde v\|_2}=\cos(\hat v,z)\cos(z,\tilde v).$

Therefore

$\displaystyle \cos(\hat v,\tilde v)\geq \frac{2\sqrt{\ell_p u_p}}{\ell_p+u_p}\cdot \frac{1}{\sqrt{1+(K-1)\tau_e^2}}.$

In the symmetric relative-error model, this is

$\displaystyle \cos(\hat v,\tilde v)\geq \frac{\sqrt{1-u^2}}{\sqrt{1+(K-1)\tau_e^2}}.$
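The two-stage argument can also be checked numerically. In the sketch below, each block is a single coordinate with unit mantissa (so mantissa norms are trivially equal), the scales are normalised so the largest is exactly 1, and the in-range errors are drawn uniformly from $[1-u,1+u]$. The values of $K$, $\tau_e$ and $u$ and the distribution of scales are arbitrary assumptions of mine.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

rng = random.Random(0)
K, tau_e, u = 64, 2.0 ** -4, 2.0 ** -3      # hypothetical parameters

# Ideal scales after tensor-wide normalisation: the largest block sits at 1.
raw = [math.exp(rng.gauss(0.0, 2.0)) for _ in range(K)]
top = max(raw)
v_hat = [s / top for s in raw]

# Stage 1: zero every block below the smallest representable scale.
z = [s if s >= tau_e else 0.0 for s in v_hat]

# Stage 2: apply an in-range multiplicative error to each surviving block.
v_tilde = [s * rng.uniform(1.0 - u, 1.0 + u) for s in z]

total = cosine(v_hat, v_tilde)
product = cosine(v_hat, z) * cosine(z, v_tilde)
bound = math.sqrt(1.0 - u * u) / math.sqrt(1.0 + (K - 1) * tau_e ** 2)
print(total, product, bound)
```

The first two printed values should coincide up to rounding (the product identity), and both should be at least the third (the conservative bound).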

Block Number Formats are Direction Preservers

I’ve recently returned from the SIAM PP 2026 conference and as always, conferences help provide time for research reflection. One thing I’ve been reflecting on during my journey back is the various explanations people give for why the machine learning world is so keen on block number formats (MX, NVFP, etc.) – see my earlier blog post on MX if you need a primer. Many hardware engineers tend to answer that they lead to efficient storage, or efficient arithmetic, or improved data transfer bandwidth, which are all true. But I think there’s another complementary answer that’s less well discussed (if indeed it is discussed at all). I hope this blog post might help stimulate some discussion of this complementary take.

On the numerical side, at first glance it might seem surprising that despite these formats representing numbers with very limited precision, large neural networks often tolerate them remarkably well, with little loss in accuracy. In my experience, most explanations focus on dynamic range, quantization noise, the inherent noise robustness of neural networks, or calibration techniques. But I suspect there is also a simple geometric way to think about what these formats are doing: Block number formats help preserve vector direction. And for many machine learning computations, preserving direction matters far more than preserving exact numerical values.

Block formats inherently represent direction and magnitude

Consider a vector $v$ whose coordinates are partitioned into blocks $v = (v_1, v_2, \dots, v_B)$.

In a block format, each block is represented using a shared scale and low-precision mantissas. For ease of discussion, we’ll consider the simplest case here, where scales are allowed to be arbitrary real-valued. In general, they may be much more restricted, e.g. powers of two.

Each block is approximated as $\hat v_b = \beta_b m_b$

where

$m_b$ is a vector of low-precision mantissas, and

$\beta_b$ is a scalar shared scaling factor.

In other words, each block can be thought of as a direction (encoded by the mantissas) multiplied by a magnitude (the shared scale). Strictly speaking, the mantissa vectors $m_b$ need not be normalized, and in many formats their entries may have quite different magnitudes (for example in integer mantissa formats such as MXINT). However, this does not change the geometry. The representation $\hat{v}_b = \beta_b m_b$ is invariant to rescaling of $m_b$: multiplying $m_b$ by any nonzero constant simply rescales $\beta_b$ by the inverse factor. What matters for the approximation is therefore only the direction of $m_b$, i.e. the one-dimensional subspace it spans.

Often we don’t think of it like this, but broadly speaking this is what has happened: block scaling allows us to decouple magnitude and direction representation. This resembles the familiar decomposition $v = \|v\|\frac{v}{\|v\|}$ of a vector into its magnitude and direction, but applied locally within blocks.

If the mantissa vector $m_b$ points roughly in the same direction as the original block $v_b$, then scaling it appropriately produces a good approximation of that block.

OK, but does preserving directions block by block actually preserve the direction of the whole vector? It turns out that the answer is yes.

Direction Preservation

Let us make the reasonable assumption that the scale of each block $\beta_b$ is not chosen arbitrarily, but rather is the best possible scale for that block in the least squares sense, for whatever mantissa vector we choose, i.e. $\beta_b = \arg\min_{\beta} \|v_b - \beta m_b\|^2$. Then $\hat v_b$ is the orthogonal projection of $v_b$ onto the line spanned by $m_b$.

So to what extent do the approximate and original blocks point in the same direction? We can measure this with the block cosine similarities $\rho_b = \frac{\langle v_b,\hat v_b\rangle}{\|v_b\|\|\hat v_b\|}$.

Equally, we can measure the cosine similarity of the full vectors (the concatenation of the original blocks versus the concatenation of the approximated blocks): $\rho = \frac{\langle v,\hat v\rangle}{\|v\|\|\hat v\|}$.
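For a single block, the least-squares scale has the standard closed form $\beta_b = \langle v_b, m_b\rangle / \langle m_b, m_b\rangle$. Here is a minimal sketch with made-up block and mantissa values (the numbers and helper names are illustrative assumptions), checking that the residual is orthogonal to $m_b$ and that rescaling $m_b$ does not change the approximation.

```python
def dot(a, b):
    """Inner product of two flat vectors."""
    return sum(x * y for x, y in zip(a, b))

def best_scale(v, m):
    """Least-squares scale: argmin over beta of ||v - beta * m||^2."""
    return dot(v, m) / dot(m, m)

# Hypothetical block and an integer mantissa vector pointing roughly the same way.
v_b = [0.9, -2.1, 3.2, 0.4]
m_b = [1, -2, 3, 0]

beta = best_scale(v_b, m_b)
v_hat_b = [beta * q for q in m_b]
residual = [x - y for x, y in zip(v_b, v_hat_b)]

# Rescaling the mantissas changes beta by the inverse factor, not the approximation.
m_scaled = [2 * q for q in m_b]
v_hat_scaled = [best_scale(v_b, m_scaled) * q for q in m_scaled]

print(beta, dot(residual, m_b), v_hat_b, v_hat_scaled)
```

The inner product of the residual with $m_b$ should vanish up to rounding, which is exactly the orthogonal-projection property used in what follows.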

My aim here is to explain why small error in direction at block level leads to small error at vector level.

First, let’s define $w_b = \frac{\|v_b\|^2}{\|v\|^2}$, which we can think of as the fraction of the vector’s energy contained in block $b$; these add to 1 over the whole vector. Now we can state the result:

Theorem (Block Cosines)

Under the blockwise least-squares scaling, $\rho = \sqrt{\sum_{b=1}^{B} w_b \rho_b^2 }$.

For proof, see end of post.

In simple terms, this theorem states that the cosine similarity of the whole vector is the energy-weighted RMS of the block cosine similarities.
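The theorem is easy to check numerically. The sketch below builds a random vector of $B$ blocks, picks integer mantissas by rounding each block onto a 7-level-per-sign grid (an INT4-like choice that I am assuming purely for illustration), applies least-squares scales, and compares the whole-vector cosine with the energy-weighted RMS of the block cosines.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

rng = random.Random(1)
B, d = 8, 16                                   # hypothetical block count and block size

blocks, approx = [], []
for _ in range(B):
    amp = 10.0 ** rng.uniform(-2.0, 0.0)       # blocks with very different energies
    v_b = [amp * rng.gauss(0.0, 1.0) for _ in range(d)]
    peak = max(abs(c) for c in v_b)
    m_b = [round(7 * c / peak) for c in v_b]   # integer mantissas in [-7, 7]
    beta = dot(v_b, m_b) / dot(m_b, m_b)       # least-squares block scale
    blocks.append(v_b)
    approx.append([beta * q for q in m_b])

v = [c for blk in blocks for c in blk]
v_hat = [c for blk in approx for c in blk]

rho = dot(v, v_hat) / (norm(v) * norm(v_hat))                    # whole-vector cosine
rho_b = [dot(a, b) / (norm(a) * norm(b)) for a, b in zip(blocks, approx)]
w_b = [dot(a, a) / dot(v, v) for a in blocks]                    # energy fractions
rms = math.sqrt(sum(w * r * r for w, r in zip(w_b, rho_b)))      # weighted RMS

print(rho, rms)
```

The two printed values should agree to rounding error, whatever mantissa rule is used, as long as the scales are chosen by least squares.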

What are the implications?

The weights $w_b$ represent how much of the vector’s energy lies in each block. Blocks that contain very little energy contribute very little to the final direction. The important consequence is that direction errors do not accumulate catastrophically across blocks. Instead, the overall directional error is an energy-weighted average of the block direction errors: since the weights sum to 1, the theorem is equivalent to $1-\rho^2 = \sum_b w_b (1-\rho_b^2)$, so the squared sine of the overall angle is the weighted mean of the squared sines of the block angles. In other words, if block number formats preserve the directions of individual blocks, they automatically preserve the direction of the entire vector.

Many core operations in machine learning depend heavily on vector direction. Notably, during training, stochastic gradient descent updates are already in the form of magnitude (learning rate) + direction. We already have a knob controlling magnitude (the learning rate); what matters is that the direction is preserved. In attention mechanisms and embeddings, directional similarity measures are very important. Even for the humble dot product, the workhorse of inference, preservation of direction means that small perturbations in input give rise to only small perturbations in output, so the dot product behaves robustly.

Conclusion

Block floating-point and similar formats, such as block mini-float, MX and NVFP, are usually explained in terms of dynamic range and quantization noise. But geometrically, I like the perspective that they do something simpler: they approximate each block of a vector as direction × magnitude.

And as long as the block directions are preserved reasonably well, the direction of the whole vector is preserved too.

I think this is a useful intuition as to why very low-precision formats can work so well in modern machine learning systems. Block number formats are, in a very real sense, direction preservers. From this perspective, such low-precision block formats succeed not because they represent individual numbers accurately, but because they preserve the geometry of vectors.

Lots of extensions of this kind of analysis are of course possible. To name just a few:

We’ve focused on vectors, but tensor-level scaling may have interesting interplay with batching during training, for example.
We made the simplifying assumption that scaling factors were real valued, but these can be restricted, most significantly to powers of two, and the analysis would need to be modified to incorporate that change.
We’ve not discussed mantissas at all; lots more of interest could be said here.
Potentially this approach could help provide some guidance to the empirical sizing of blocks in a block representation.

If anyone would like to work with me on this topic, do let me know your ideas.

Proof of the theorem

Readers not interested in the algebra can safely skip this section.

For each block $b$, the approximation $\hat v_b = \beta_b m_b$ with $\beta_b$ chosen by least squares is the orthogonal projection of $v_b$ onto the line spanned by $m_b$.

So we can write $v_b = \hat v_b + r_b$ where $r_b$ is orthogonal to $\hat v_b$.

Taking the inner product with $\hat v_b$ gives $\langle v_b,\hat v_b\rangle = \|\hat v_b\|^2$.

Now sum over blocks. Because the blocks correspond to disjoint coordinates,

$\langle v,\hat v\rangle = \sum_b \langle v_b,\hat v_b\rangle = \sum_b \|\hat v_b\|^2 = \|\hat v\|^2$.

Therefore

$\rho = \frac{\langle v,\hat v\rangle}{\|v\|\|\hat v\|} = \frac{\|\hat v\|}{\|v\|}$.

Recall $\rho_b = \frac{\langle v_b,\hat v_b\rangle}{\|v_b\|\|\hat v_b\|}$.

Using $\langle v_b,\hat v_b\rangle=\|\hat v_b\|^2$, we obtain

$\rho_b = \frac{\|\hat v_b\|}{\|v_b\|}$.

Hence

$\|\hat v_b\|^2 = \rho_b^2 \|v_b\|^2$.

Summing over blocks gives

$\|\hat v\|^2 = \sum_b \|\hat v_b\|^2 = \sum_b \rho_b^2 \|v_b\|^2$.

Dividing by $\|v\|^2$, and writing

$w_b = \frac{\|v_b\|^2}{\|v\|^2}$,

gives

$\frac{\|\hat v\|^2}{\|v\|^2} = \sum_b w_b \rho_b^2$.

Since $\rho = \|\hat v\|/\|v\|$, we obtain

$\rho = \sqrt{\sum_b w_b \rho_b^2 }$.

$\square$

FCCM 2025

I’ve recently returned from the IEEE International Symposium on Field-Programmable Custom Computing Machines (known as FCCM). I used to attend FCCM regularly in the early 2000s, and while I have continued to publish there, I have not attended myself for some years. I tried a couple of years ago, but ended up isolated with COVID in Los Angeles. In contrast, I am pleased to report that the conference is in good health!

The conference kicked off on the evening of the 4th May, with a panel discussion on the topic of “The Future of FCCMs Beyond Moore’s Law”, of which I was invited to be part, alongside industrial colleagues Chris Lavin and Madhura Purnaprajna from AMD, Martin Langhammer from Altera, and Mark Shand from Waymo. Many companies have tried and failed to produce lasting post-Moore alternatives to the FPGA and the microprocessor over the decades I’ve been in the field, and some of these ideas and architectures (less commonly, associated compiler flows / design tools) have been very good. But, as Keynes said, “markets can remain irrational longer than you can remain solvent”. So instead of focusing on commercial realities, I tried to steer the panel discussion towards the genuinely fantastic opportunities our academic field has for a future in which power, performance and area improvements become a matter of intellectual advances in architecture and compiler technology rather than riding the wave of technology miniaturisation (itself, of course, the product of great advances by others).

The evening panel, as imagined by AI. I’m 2nd to left. The AI tool was clearly unaware of Martin’s height difference!

The following day, the conference proper kicked off. Some highlights for me from other authors included the following papers aligned with my general interests:

AutoNTT: Automatic Architecture Design and Exploration for Number Theoretic Transform Acceleration on FPGAs from Simon Fraser University, presented by Zhenman Fang.
RealProbe: An Automated and Lightweight Performance Profiler for In-FPGA Execution of High-Level Synthesis Designs from Georgia Tech, presented by Jiho Kim from Callie Hao’s group.
High Throughput Matrix Transposition on HBM-Enabled FPGAs from the University of Southern California (Viktor Prasanna’s group).
ITERA-LLM: Boosting Sub-8-Bit Large Language Model Inference Through Iterative Tensor Decomposition from my colleague Christos Bouganis’ group at Imperial College, presented by Keran Zheng.
Guaranteed Yet Hard to Find: Uncovering FPGA Routing Convergence Paradox from Mirjana Stojilovic’s group at EPFL – and winner of this year’s best paper prize!

In addition, my own group had two full papers at FCCM this year:

Banked Memories for Soft SIMT Processors, joint work between Martin Langhammer (Altera) and me, where Martin has been able to augment his ultra-high-frequency soft-processor with various useful memory structures. This is probably the last paper of Martin’s PhD – he’s done great work in both developing a super-efficient soft-processor and in forcing the FPGA community to recognise that some published clock frequency results are really quite poor and that people should spend a lot longer thinking about the physical aspects of their designs if they want to get high performance.
NeuraLUT-Assemble: Hardware-aware Assembling of Sub-Neural Networks for Efficient LUT Inference, joint work between my PhD student Marta Andronic and me. I think this is a landmark paper in terms of the results that Marta has been able to achieve. Compared to her earlier NeuraLUT work which I’ve blogged on previously, she has added a way to break down large LUTs into trees of smaller LUTs, and a hardware-aware way to learn sparsity patterns that work best, localising nonlinear interactions in these neural networks to within lookup tables. The impact of these changes on the area and delay of her designs is truly impressive.