RMSNorm (Normalization)

Rescales a vector so that its root mean square is roughly 1 (for a zero-mean vector, this is the same as unit variance). This keeps values from growing too large or too small as they pass through multiple layers:

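function rmsnorm(input: Value[]): Value[] {
  // Mean of squares: sum(x_i^2) / n
  const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
  // Reciprocal root mean square: 1 / sqrt(ms + 1e-5)
  const scale = ms.add(1e-5).pow(-0.5);
  // Multiplying by the reciprocal divides each element by the RMS
  return input.map((xi) => xi.mul(scale));
}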

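It computes the root mean square of the vector, then divides each element by it. The 1e-5 is added to the mean of squares before taking the square root, which prevents division by zero when the input is all zeros. The example below ignores this small term because it does not change the rounded result: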

input = [3.0, 4.0]

mean of squares:  (9 + 16) / 2 = 12.5
root mean square: √12.5 ≈ 3.54
divide each:      [3.0 / 3.54, 4.0 / 3.54] ≈ [0.85, 1.13]

The values have been rescaled so they sit near 1, but their relative proportions (3:4) are preserved. Without normalization, activations can explode or vanish across layers, making training unstable.
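The arithmetic above can be checked with a plain-number sketch. This version works on number rather than Value, so it does not track gradients; the name rmsnormPlain and the eps parameter are illustrative assumptions, not part of the book's code.

```typescript
// Plain-number sketch of rmsnorm (hypothetical helper, no autograd).
function rmsnormPlain(input: number[], eps: number = 1e-5): number[] {
  // Mean of squares: sum(x_i^2) / n
  const ms = input.reduce((acc, xi) => acc + xi * xi, 0) / input.length;
  // Reciprocal root mean square, mirroring ms.add(1e-5).pow(-0.5)
  const scale = Math.pow(ms + eps, -0.5);
  return input.map((xi) => xi * scale);
}

const out = rmsnormPlain([3.0, 4.0]);
// Rounded to two decimals this gives 0.85 and 1.13
console.log(out.map((v) => v.toFixed(2)));
// The ratio between the two elements is still 3:4
console.log((out[0] / out[1]).toFixed(2));
```

Because eps is added before the square root, an all-zero input returns all zeros instead of NaN.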

Real RMSNorm includes a learnable per-element scale parameter (gamma) that lets each dimension adjust its magnitude after normalization. We omit it for simplicity; our model trains fine without it at this scale.
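For comparison, here is a minimal sketch of what the gamma variant could look like. It uses plain numbers instead of Value to stay self-contained, and the name rmsnormWithGain and its signature are hypothetical; in a trainable version gamma would be a list of Value parameters, typically initialized to 1.

```typescript
// Sketch of RMSNorm with a per-element gain (hypothetical helper, no autograd).
function rmsnormWithGain(
  input: number[],
  gamma: number[],
  eps: number = 1e-5
): number[] {
  if (gamma.length !== input.length) {
    throw new Error("gamma must have one entry per input element");
  }
  // Same normalization as before: divide by the root mean square
  const ms = input.reduce((acc, xi) => acc + xi * xi, 0) / input.length;
  const scale = Math.pow(ms + eps, -0.5);
  // After normalizing, each dimension is rescaled by its own gamma
  return input.map((xi, i) => xi * scale * gamma[i]);
}

// With gamma set to all ones, the result matches the version without gamma
console.log(rmsnormWithGain([3.0, 4.0], [1.0, 1.0]));
```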

Summary

These three primitives (linear, softmax, and rmsnorm) are the building blocks the model assembles in the next chapter. Each one is built entirely from Value operations, so the autograd engine can compute gradients through all of them.

| Primitive | What it does | Where it is used |
| --- | --- | --- |
| linear | Matrix-vector multiply | Attention projections, MLP layers, output head |
| softmax | Scores to probabilities | Attention weights, training loss, generation sampling |
| rmsnorm | Normalize to unit variance | Before each transformer sub-block |