The Computation Graph

As the model processes a token, every addition, multiplication, exponentiation and so on creates a new Value node. The result is a computation graph, a tree-like structure connecting the parameters at the leaves to a single loss value at the root.

Building the Graph

Consider a tiny example using the Value operations from the previous section:

const x = new Value(2.0);
const y = new Value(3.0);
const z = x.mul(y);          // z = 6.0
const loss = z.add(1);       // loss = 7.0

Each operation creates a new Value that records its children and local gradients. After these four lines, the graph looks like this:

x(2.0) ──┐
          mul ──→ z(6.0) ──┐
y(3.0) ──┘                 add ──→ loss(7.0)
                  1.0  ────┘

And the internal linkages are:

Node	`.data`	`.children`	`.localGrads`
`x`	2.0	[]	[]
`y`	3.0	[]	[]
`z`	6.0	[x, y]	[3.0, 2.0]
`loss`	7.0	[z, Value(1)]	[1, 1]

Leaf nodes (x and y) have no children; they are the inputs. The mul node records both inputs as children and stores [y.data, x.data] as local gradients (the product rule). The add node stores [1, 1] because addition passes gradients through unchanged.

This is the entire data structure that backward() will walk. No separate graph object, no registration step. The graph is just the chain of .children pointers from the loss back to the leaves.

From Loss to Parameters

In the real model, the graph is much larger (thousands of nodes) but the structure is the same. The loss at the root points to the softmax outputs, which point to the logits, which point to the attention and MLP operations, which eventually point to the embedding lookups and weight matrices. Every parameter is reachable by following .children pointers from the loss.

The backward() method traverses this graph in reverse (from loss to leaves), using each node’s .localGrads to propagate the gradient signal. That is the subject of the next section.

Keyboard shortcuts

LLMs, the Hard Way

The Computation Graph

Building the Graph

From Loss to Parameters