RMSNorm (Normalization)

Rescales a vector to have roughly unit variance. This prevents numbers from growing too large or too small as they pass through multiple layers:

function rmsnorm(input: Value[]): Value[] {
  const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
  const scale = ms.add(1e-5).pow(-0.5);
  return input.map((xi) => xi.mul(scale));
}

It computes the root mean square of the vector, then divides each element by it. The 1e-5 prevents division by zero. For example:

input = [3.0, 4.0]

mean of squares:  (9 + 16) / 2 = 12.5
root mean square: √12.5 ≈ 3.54
divide each:      [3.0 / 3.54, 4.0 / 3.54] ≈ [0.85, 1.13]

The values have been rescaled so they sit near 1, but their relative proportions (3:4) are preserved. Without normalization, activations can explode or vanish across layers, making training unstable.

Real RMSNorm includes a learnable per-element scale parameter (gamma) that lets each dimension adjust its magnitude after normalization. We omit it for simplicity; our model trains fine without it at this scale.

Summary

These three primitives (linear, softmax, and rmsnorm) are the building blocks the model assembles in the next chapter. Each one is built entirely from Value operations, so the autograd engine can compute gradients through all of them.

Primitive	What it does	Where it is used
`linear`	Matrix-vector multiply	Attention projections, MLP layers, output head
`softmax`	Scores to probabilities	Attention weights, training loss, generation sampling
`rmsnorm`	Normalize to unit variance	Before each transformer sub-block

Keyboard shortcuts

LLMs, the Hard Way

RMSNorm (Normalization)

Summary