In my previous post, I argued that block number formats can be understood geometrically as direction preservers. That argument relied on an idealization: once a block direction had been chosen, its scale could be set optimally as an arbitrary real number.
Real hardware formats do not usually work that way. In many practical schemes, block scales are quantized very coarsely, sometimes all the way down to powers of two. In particular, in the MX specification, all the concrete compliant formats use E8M0 scaling.
So does the directional picture I painted in my last post survive this brutal scaling? Here I will argue, in the first of what I hope will be a short sequence of follow-up blog posts, that it does.
From ideal block scales to quantized block scales
Recall the setup from the earlier post. A vector $x$ is partitioned into blocks, $x = (x_1, \dots, x_B)$, and each block $x_b$ is approximated as

$$\hat{x}_b = s_b m_b,$$

where $m_b$ is a low-precision mantissa vector and $s_b$ is a scalar block scale.

In the earlier post, I assumed that $s_b$ was an arbitrary real value, chosen optimally in the least-squares sense. That gave the ideal blockwise representation

$$\hat{x}_b = s_b^\star m_b, \qquad s_b^\star = \frac{\langle x_b, m_b \rangle}{\|m_b\|^2}.$$
Now let us keep the same mantissa vectors $m_b$, but suppose that the scale factors themselves must be quantized. Write the implemented scale as $\tilde{s}_b = Q(s_b^\star)$, so that the represented block becomes

$$\tilde{x}_b = \tilde{s}_b m_b.$$

It is convenient to define the multiplicative scale error $\rho_b = \tilde{s}_b / s_b^\star$. Then

$$\tilde{x}_b = \rho_b \hat{x}_b.$$
Note that, of course, quantizing the block scale does not change the chosen direction of a block at all; it only changes its length. So the only directional distortion comes from the relative rescaling of different blocks.
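To make this concrete, here is a minimal numpy sketch of a single block (the int8-style mantissa grid and the variable names are illustrative stand-ins, not a format from the post): it computes the least-squares scale, rounds it to the nearest power of two in the log domain, and recovers the multiplicative error $\rho_b$.

```python
import numpy as np

rng = np.random.default_rng(0)
x_b = rng.normal(size=32)                    # one block of the vector

# Stand-in mantissa: rescale into [-127, 127] and round (int8-like grid).
m_b = np.round(x_b / np.max(np.abs(x_b)) * 127)

# Ideal least-squares scale: s* = <x_b, m_b> / ||m_b||^2.
s_star = x_b @ m_b / (m_b @ m_b)

# Quantize the scale to the nearest power of two (round in log2 space).
s_tilde = 2.0 ** np.round(np.log2(s_star))

rho = s_tilde / s_star                       # multiplicative scale error
assert 2 ** -0.5 <= rho <= 2 ** 0.5
```

Because the rounding happens in log space, the error $\rho_b$ is automatically confined to $[2^{-1/2}, 2^{1/2}]$, which is the interval used later in the post.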
An exact cosine formula
Let $\hat{x}$ and $\tilde{x}$ denote the full vectors assembled from the blocks $\hat{x}_b$ and $\tilde{x}_b$, and let $w_b = \|\hat{x}_b\|^2 / \|\hat{x}\|^2$, so that $w_b$ is the fraction of the ideal projected vector's energy contained in block $b$.

Then it can be shown that the angle $\theta$ between $\hat{x}$ and $\tilde{x}$ satisfies

$$\cos\theta = \frac{\sum_b w_b \rho_b}{\sqrt{\sum_b w_b \rho_b^2}}$$

(the proof is included at the end of this post).
So the effect of scale quantization on direction depends only on how uneven the factors $\rho_b$ are across blocks. If all blocks were rescaled by the same factor, the direction would be unchanged.
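The formula is easy to check numerically. The numpy sketch below (block count, block size, and the range of the error factors are arbitrary choices of mine) compares it against the directly computed cosine between the flattened vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
B, k = 8, 16                                 # 8 blocks of 16 elements
x_hat = rng.normal(size=(B, k))              # ideal blocks (rows)
rho = rng.uniform(0.7, 1.4, size=B)          # per-block scale errors
x_tilde = rho[:, None] * x_hat               # rescaled blocks

# Direct cosine between the flattened vectors.
u, v = x_hat.ravel(), x_tilde.ravel()
cos_direct = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Formula: cos(theta) = sum_b w_b rho_b / sqrt(sum_b w_b rho_b^2),
# with w_b the energy fraction of block b.
w = np.sum(x_hat**2, axis=1) / np.sum(x_hat**2)
cos_formula = (w @ rho) / np.sqrt(w @ rho**2)

assert np.isclose(cos_direct, cos_formula)
```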
Exponent-only power-of-two scaling
Now consider the coarsest plausible case: each block scale is rounded to the nearest power of two, i.e. to the nearest integer exponent in the log domain. Then each multiplicative error satisfies $\rho_b \in [2^{-1/2}, 2^{1/2}]$.

So, from our exact cosine formula, we are interested in how small

$$\frac{\sum_b w_b \rho_b}{\sqrt{\sum_b w_b \rho_b^2}}$$

can be, when all the $\rho_b$ lie in the interval $[2^{-1/2}, 2^{1/2}]$.
A simple inequality shows that the answer depends only on the two extreme values of the interval. If all the block rescaling factors lie in $[a, b]$ with $0 < a \le b$, then

$$\cos\theta \ge \frac{2\sqrt{ab}}{a + b}$$

(proof at the end of this blog post).

In the power-of-two case we have $a = 2^{-1/2}$, $b = 2^{1/2}$, so $\sqrt{ab} = 1$, $a + b = \frac{3}{\sqrt{2}}$, and therefore

$$\cos\theta \ge \frac{2\sqrt{2}}{3} \approx 0.943.$$

Equivalently, $\theta \le \arccos\!\left(\frac{2\sqrt{2}}{3}\right) \approx 19.47^\circ$.

So even if every block scale is rounded to the nearest power of two, the resulting vector remains within about $20^\circ$ of the ideally scaled one.
That is the main result of this post.
One striking feature of the bound is that it does not depend on the dimension of the vector. The reason is that the worst case is already attained by a two-group energy split: some blocks rounded up, others rounded down. Once those two groups exist, adding more blocks or more dimensions does not make the bound worse, as is apparent from the proof below.
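Both the bound and its tightness can be checked numerically. In the numpy sketch below (trial count and block count are arbitrary choices), random configurations never exceed $\arccos(2\sqrt{2}/3) \approx 19.47^\circ$, and the two-group energy split reproduces the worst case exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
bound_deg = np.degrees(np.arccos(2 * np.sqrt(2) / 3))   # ≈ 19.47 degrees

# Monte Carlo: random energy fractions w_b and random power-of-two
# rounding errors rho_b in [2**-0.5, 2**0.5] stay under the bound.
worst = 0.0
for _ in range(1000):
    w = rng.dirichlet(np.ones(32))           # random energy fractions
    rho = 2.0 ** rng.uniform(-0.5, 0.5, 32)  # log2 rounding errors
    cos_t = (w @ rho) / np.sqrt(w @ rho**2)
    worst = max(worst, np.degrees(np.arccos(min(cos_t, 1.0))))
assert worst <= bound_deg

# The bound is tight: two groups, one rounded down and one up,
# with energy split (b, a) / (a + b) between them.
a, b = 2**-0.5, 2**0.5
w2 = np.array([b, a]) / (a + b)
rho2 = np.array([a, b])
cos2 = (w2 @ rho2) / np.sqrt(w2 @ rho2**2)
assert np.isclose(np.degrees(np.arccos(cos2)), bound_deg)
```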
20 degrees is less than it sounds
Our everyday intuition may tell us that $20^\circ$ is not huge, but not that small either. In a sense, that's true. But angles behave very differently in high-dimensional spaces. In high dimension, most random vectors are almost orthogonal to one another: their angle is close to $90^\circ$, so a guarantee that an approximation remains within $20^\circ$ of the original vector is much stronger than it would sound in two or three dimensions.
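This concentration is easy to see empirically. In the short numpy sketch below (the dimensions are arbitrary choices), the angle between two independent random unit vectors tightens around $90^\circ$ as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(3)
angles = {}
for n in (3, 100, 10_000):
    # Two independent random unit vectors in dimension n.
    u = rng.normal(size=n); u /= np.linalg.norm(u)
    v = rng.normal(size=n); v /= np.linalg.norm(v)
    angles[n] = np.degrees(np.arccos(np.clip(u @ v, -1.0, 1.0)))
    print(f"n={n:6d}: angle between random vectors ≈ {angles[n]:.1f}°")
```

For $n = 10{,}000$ the angle lands within a fraction of a degree of $90^\circ$; in dimension 3 it can be almost anything.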
Beyond power-of-two
We’ve analysed power-of-two scaling here for two reasons: because it’s in a sense the crudest possible floating-point rounding, and because it’s commonly used in real hardware designs.
That does not mean it’s optimal. But it does raise two further questions. Firstly, we’ve assumed here that the exponent range is sufficiently wide – what if it’s not? Secondly – and relatedly – how much better can this angular bound get by spending some of the scale bits on greater precision?
My view is that the answer becomes clearer once a tensor-wide high-precision scale is introduced, something NVIDIA has recently done. In that setting, the block scales get relieved of their additional duty to capture global magnitude. This will be the subject of the next post on the topic!
Proofs
Readers not interested in the algebra can safely skip this section.
Cosine formula
Recall that for each block

$$\tilde{x}_b = \rho_b \hat{x}_b.$$

Then, because the blocks occupy disjoint coordinates,

$$\langle \hat{x}, \tilde{x} \rangle = \sum_b \rho_b \|\hat{x}_b\|^2.$$

Also, $\|\hat{x}\|^2 = \sum_b \|\hat{x}_b\|^2$, and

$$\|\tilde{x}\|^2 = \sum_b \rho_b^2 \|\hat{x}_b\|^2.$$

Therefore

$$\cos\theta = \frac{\langle \hat{x}, \tilde{x} \rangle}{\|\hat{x}\|\,\|\tilde{x}\|} = \frac{\sum_b \rho_b \|\hat{x}_b\|^2}{\sqrt{\sum_b \|\hat{x}_b\|^2}\,\sqrt{\sum_b \rho_b^2 \|\hat{x}_b\|^2}}.$$
Now, as per the main blog post, define $w_b = \|\hat{x}_b\|^2 / \|\hat{x}\|^2$. Writing $\|\hat{x}_b\|^2 = w_b \|\hat{x}\|^2$, so that $\sum_b w_b = 1$, the numerator becomes $\|\hat{x}\|^2 \sum_b w_b \rho_b$ and the denominator becomes $\|\hat{x}\|^2 \sqrt{\sum_b w_b \rho_b^2}$, giving

$$\cos\theta = \frac{\sum_b w_b \rho_b}{\sqrt{\sum_b w_b \rho_b^2}}.$$
20 degree bound
Assume that all the multiplicative error factors $\rho_b$ lie in an interval $[a, b]$ with $0 < a \le b$. Let

$$\mu = \sum_b w_b \rho_b.$$

Then the cosine is just $\mu / \sqrt{\sum_b w_b \rho_b^2}$. Since each $\rho_b \in [a, b]$, we have

$$(\rho_b - a)(b - \rho_b) \ge 0.$$

Expanding this gives

$$\rho_b^2 \le (a + b)\rho_b - ab.$$

Multiplying by $w_b$ and summing over $b$ gives

$$\sum_b w_b \rho_b^2 \le (a + b)\mu - ab.$$

Therefore

$$\cos\theta \ge \frac{\mu}{\sqrt{(a + b)\mu - ab}}.$$
Now the weighted mean $\mu$ also lies in the interval $[a, b]$, so it remains to minimize

$$f(\mu) = \frac{\mu}{\sqrt{(a + b)\mu - ab}}$$

over $\mu \in [a, b]$. Differentiating shows that the minimum occurs at

$$\mu = \frac{2ab}{a + b},$$

the harmonic mean of $a$ and $b$. Substituting this value gives

$$(a + b)\mu - ab = ab,$$

and therefore

$$f(\mu) = \frac{2ab/(a + b)}{\sqrt{ab}} = \frac{2\sqrt{ab}}{a + b}.$$

So we have proved that

$$\cos\theta \ge \frac{2\sqrt{ab}}{a + b}.$$
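As a sanity check on the minimization step, one can evaluate $f$ on a fine grid (the interval endpoints below are arbitrary) and confirm that the minimum sits at the harmonic mean, with the closed-form value $2\sqrt{ab}/(a+b)$:

```python
import numpy as np

a, b = 0.6, 1.7                              # any interval with 0 < a <= b
mu = np.linspace(a, b, 100_001)              # fine grid over [a, b]
f = mu / np.sqrt((a + b) * mu - a * b)       # the function to minimize

mu_min = mu[np.argmin(f)]
assert np.isclose(mu_min, 2 * a * b / (a + b), atol=1e-4)     # harmonic mean
assert np.isclose(f.min(), 2 * np.sqrt(a * b) / (a + b))      # closed form
```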
Finally, in the power-of-two case we have $a = 2^{-1/2}$ and $b = 2^{1/2}$, so $\sqrt{ab} = 1$ and $a + b = \frac{3}{\sqrt{2}}$, and hence

$$\cos\theta \ge \frac{2\sqrt{2}}{3}.$$

Numerically,

$$\frac{2\sqrt{2}}{3} \approx 0.9428,$$

so

$$\theta \le \arccos\!\left(\frac{2\sqrt{2}}{3}\right) \approx 19.47^\circ.$$