LLMs, the Hard Way
Build a GPT language model from scratch in TypeScript — no frameworks, no libraries, just math.
This tutorial walks you through every piece of a working language model: from tokenization to attention to training to generation. By the end you will have trained a tiny GPT that writes grade-school sentences and can be fine-tuned to generate questions.
The code is intentionally minimal. The entire model fits in a few hundred lines of TypeScript. Every operation — every matrix multiply, every gradient — is written by hand so you can see exactly what happens inside a transformer.
What you will build
- A tokenizer that maps words to integers
- An autograd engine that computes gradients automatically
- Neural network primitives: linear layers, softmax, normalization
- A transformer model with attention, MLPs, and residual connections
- A training loop with the Adam optimizer
- Text generation with temperature, top-k, and top-p sampling
- Fine-tuning to adapt the model to a new task
How to read this tutorial
The tutorial is split into three parts:
- Building the Model — the components that make up a transformer
- Training and Inference — teaching the model and using it
- Putting It to Work — end-to-end training, generation, and fine-tuning
Each section builds on the last. The code is cumulative: by the final chapter you have a complete, working system.
Prerequisites
In this section you will set up the tools needed to complete this tutorial and examine the training data that the model will learn from.
Software
This tutorial requires Node.js (v18 or later) and TypeScript. Verify they are installed:
node --version
v22.15.0
Clone the Repository
Clone the tutorial repository and install dependencies:
git clone https://github.com/irbull/llms-the-hard-way.git
cd llms-the-hard-way
npm install
The Training Data
Every language model starts with a question: what do we want it to learn?
We want ours to learn how simple English sentences are put together. The
training corpus is 30,000 grade-1-level sentences in
data/grade1_sentences.txt:
wc -l data/grade1_sentences.txt
30000 data/grade1_sentences.txt
Look at a few:
head -10 data/grade1_sentences.txt
nan has the nut
the shy bed is old
nan mixed
the bird is wide
the goat likes to go
the cow yells
we dance up the garden
we swing up the store
the cat eats a muffin
the hen walks at dusk
Each sentence is short (2-8 words), uses basic vocabulary, and follows simple subject-verb-object patterns. The entire vocabulary is only 596 unique words, a fraction of what a real LLM would handle, but enough to see every concept in action.
What the Model Will Learn
We will never tell the model anything about English grammar. We will not define “noun” or “verb” or “sentence.” We will simply show it thousands of sentences and ask: given the words so far, what word comes next?
That is all a language model does. It learns to predict the next token. And from this single task, next-token prediction, structure emerges: the model learns that “the” is often followed by a noun, that sentences end after a handful of words, that certain words cluster together.
It’s All Just Scale
Our model has 596 words, 2 layers, and around 63,000 parameters. A model like GPT-4 has a vocabulary of 100,000+ tokens, nearly a hundred layers, and over a trillion parameters. It trains on trillions of words scraped from books, code repositories, scientific papers, and the open web.
But the architecture is the same. Embeddings, attention, feed-forward layers, residual connections, softmax over the vocabulary — every piece you will build in this tutorial is a piece that ships in production LLMs. The difference is not a difference in kind. It is a difference in scale.
When a large model writes working code, answers a medical question, or translates between languages, it is doing exactly what our tiny model does: predicting the next token. It just has enough parameters and has seen enough data that “predict the next token” becomes indistinguishable from understanding.
There is no hidden trick. No secret module that “really” does the thinking. The code you are about to write is the whole story.
The Pipeline
Here is the path from raw text to a model that generates new sentences:
training data -> tokenizer -> model -> training -> saved weights
                                                        |
              tokenizer -> model -> generation <-- load weights
We build each piece from scratch, in order. Each chapter introduces one component and ends with a Complete Code page containing the finished source for that stage.
What You Will Build
| Chapter | You will create | Pipeline stage |
|---|---|---|
| The Tokenizer | tokenizer.ts | Turns words into numbers and back |
| The Autograd Engine | autograd.ts | Automatic differentiation (makes training possible) |
| Neural Network Primitives | nn.ts | Linear layers, softmax, normalization |
| The Model | model.ts, rng.ts | The GPT architecture: config, weights, forward pass |
| Training | train.ts | The training loop and optimizer |
| Saving the Model | saveModel, loadModel | Serialize trained weights to disk and load them back |
| Generation | generate.ts | Inference: turning a trained model into sentences |
| Smoke Test | phrases-train.ts, phrases-generate.ts | Entry points to train and generate |
| Fine-Tuning | phrases-fine-tune.ts | Adapt a trained model to new data |
The Tokenizer
Neural networks operate on numbers, not words. The tokenizer is the bridge: it assigns every word in our vocabulary a unique integer, then translates sentences back and forth between text and number sequences.
Building the Vocabulary
We scan every sentence in the training data and collect every unique word:
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
This gives us a sorted list of 596 words:
Index 0: "a"
Index 1: "all"
Index 2: "am"
...
Index 75: "cat"
Index 76: "catch"
...
Index 525: "the"
Index 526: "then"
...
Index 595: "zoo"
We then add one special token: BOS (Beginning of Sequence), assigned index 596. This marker tells the model “a sentence starts here” and “a sentence ends here.” Production models typically use separate BOS and EOS (End of Sequence) tokens. We use a single token for both roles — the model sees BOS at the start and learns that predicting BOS means the sentence is over.
Total vocabulary size: 597 (596 words + 1 BOS token).
Encoding and Decoding
Encoding: Text to Numbers
To encode a sentence, we wrap it in BOS tokens and replace each word with its index:
"the cat eats a muffin"
| encode
[596, 525, 75, 152, 0, 324, 596]
BOS the cat eats a muffin BOS
The BOS token appears at both ends. The opening BOS gives the model a consistent starting signal. The closing BOS tells the model the sentence is complete. During training, the model learns that after certain patterns of words, the next token should be BOS, meaning “stop.”
Decoding: Numbers Back to Text
Decoding reverses the process. Given [525, 75, 152, 0, 324], we look up
each index in the vocabulary and join with spaces:
[525, 75, 152, 0, 324]
| decode
"the cat eats a muffin"
The Code
The tokenizer’s API is captured in a Tokenizer interface:
// tokenizer.ts
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
A factory function builds one from the training corpus:
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
The tokenizer has no knowledge of English. It does not know that “the” is an article or that “cat” is a noun. It just maps strings to integers. All the meaning will come from training.
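Here is a usage sketch of the factory over a two-sentence toy corpus (both sentences invented for illustration). The function body is an inline copy of createWordTokenizer from above so the snippet runs on its own:

```typescript
// Inline copy of the factory above, applied to a tiny made-up corpus.
interface Tokenizer {
  vocabSize: number;
  BOS: number;
  encode(sentence: string): number[];
  decode(tokens: number[]): string;
}

function createWordTokenizer(sentences: string[]): Tokenizer {
  const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
  const BOS = words.length;
  return {
    vocabSize: words.length + 1,
    BOS,
    encode: (s) => [BOS, ...s.split(" ").map((w) => words.indexOf(w)), BOS],
    decode: (tokens) => tokens.map((t) => words[t]).join(" "),
  };
}

const tok = createWordTokenizer(["the cat eats", "the dog runs"]);
// Sorted vocabulary: ["cat", "dog", "eats", "runs", "the"], so BOS = 5.
const ids = tok.encode("the cat runs");
console.log(ids);                          // [5, 4, 0, 3, 5]
console.log(tok.decode(ids.slice(1, -1))); // "the cat runs"
```

Note that the round trip only works for words that appeared in the corpus; an unseen word maps to index -1, which this minimal tokenizer does not guard against.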
Why Word-Level Tokens?
An alternative approach is character-level tokenization, where each letter is a token. Our vocabulary would be just 28 tokens (a-z, space, plus BOS), but the sequences become much longer: “the cat” becomes 7 character tokens instead of 2 word tokens. With only 596 unique words in our corpus, word-level tokenization gives us short sequences that are fast to train on and easy to reason about.
Production LLMs like GPT-4 use a middle ground called subword tokenization (BPE), which splits rare words into pieces while keeping common words whole. Our word-level approach is the simplest version of this idea.
Complete Code
Here is the complete source code for the tokenizer.
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
* Builds a word-level vocabulary from the training corpus, with a BOS
* (Beginning of Sequence) delimiter appended on each side.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
The Autograd Engine
Training a neural network requires answering one question for each of the model’s parameters: if I nudge this number up a tiny bit, does the model get better or worse?
The answer is the parameter’s gradient: the direction and magnitude of change that would reduce the model’s error. Computing gradients by hand for tens of thousands of interconnected parameters would be impossible. Instead, we use automatic differentiation: we build a record of every computation the model performs, then trace backwards through that record to compute all gradients at once.
The Math You Need
Training a neural network relies on a small set of ideas from calculus. If your calculus is rusty — or you never took it — this section covers everything you need before we start writing code.
What Is a Derivative?
A derivative answers one question: if I nudge the input by a tiny amount, how much does the output change?
Formally, df/dx means “the rate of change of f with respect to x.” Intuitively, it is the slope of the function at a given point.
If f(x) = x², then:
- f(3) = 9
- f(3.001) = 9.006001
- Change in output / change in input = 0.006001 / 0.001 ≈ 6.0
The derivative of x² at x = 3 is 6. That number tells you: a tiny nudge to x produces roughly 6 times that nudge in the output.
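You can run this nudge test yourself. A throwaway finite-difference check (not part of the tutorial code):

```typescript
// Numerically estimate the slope of f(x) = x^2 at x = 3.
const f = (x: number) => x * x;
const h = 0.001;
const slope = (f(3 + h) - f(3)) / h;
console.log(slope); // ≈ 6.001 — close to the true derivative 2x = 6 at x = 3
```

Shrinking h makes the estimate converge toward the exact derivative.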
Reading the Notation
The notation trips people up because it appears in several forms that all mean the same thing.
dy/dx — Read as “the derivative of y with respect to x.” How much does y change when you nudge x? Think of it as “a tiny change in y divided by a tiny change in x” — which is exactly what a slope is.
d/dx [f] — The d/dx part is an operator meaning “take the derivative of whatever follows, with respect to x.” So d/dx [x²] asks “what is the derivative of x² with respect to x?” Answer: 2x. This is just another way of writing the same thing as dy/dx.
d(loss)/dz — “The derivative of loss with respect to z.” You see different letters in the denominator because you are asking about different variables at different points in the computation. At one node you ask “how does loss change if I nudge z?” At the next you ask “how does loss change if I nudge x?”
| Notation | Read as |
|---|---|
| dy/dx | derivative of y with respect to x |
| d/dx [f] | take the derivative of f with respect to x |
| d(loss)/dz | derivative of loss with respect to z |
The d just means “a tiny change in.”
The Rules You Need
You need four derivative rules. That’s it.
Power rule: d/dx [xⁿ] = n · xⁿ⁻¹
The exponent drops down as a coefficient and decreases by one:
- d/dx [x²] = 2x
- d/dx [x³] = 3x²
Constant factor rule: constants pass through, bare constants vanish:
- d/dx [3x] = 3
- d/dx [7] = 0
Addition rule: derivatives distribute over sums:
- d/dx [f + g] = df/dx + dg/dx
Product rule: for f · g:
- d/dx [f · g] = f · dg/dx + g · df/dx
These four rules, plus the chain rule below, are enough to differentiate everything our autograd engine will encounter.
The Chain Rule
This is the rule that makes backpropagation work. When you have nested functions:
y = f(g(x))
the derivative is:
dy/dx = df/dg · dg/dx
In plain English: multiply the local derivatives along the path.
Think of it like unit conversion. If nudging x by 1 changes g by 3, and nudging g by 1 changes y by 5, then nudging x by 1 changes y by 3 × 5 = 15. You just multiply the rates together.
For a longer chain a → b → c → loss, the chain rule says:
d(loss)/da = d(loss)/dc · dc/db · db/da
Each node only needs to know two things: its own local derivative and the gradient flowing in from the node ahead of it. That is exactly what our backward pass will compute.
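The unit-conversion picture is easy to verify numerically. A throwaway sketch using the same rates as above (inner rate 3, outer rate 5):

```typescript
// g scales its input by 3, f scales by 5; the chain rule predicts dy/dx = 3 × 5 = 15.
const g = (x: number) => 3 * x;
const f = (u: number) => 5 * u;
const y = (x: number) => f(g(x));

const h = 1e-6;
const slope = (y(2 + h) - y(2)) / h;
console.log(slope); // ≈ 15
```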
A Worked Example
Forget neural networks for a moment. Consider four lines of arithmetic:
x = 2.0
y = 3.0
z = x * y // z = 6.0
loss = z + 1 // loss = 7.0
The computation graph:
x(2.0) ──┐
         [*] ──→ z(6.0) ──┐
y(3.0) ──┘                [+] ──→ loss(7.0)
                   1 ─────┘
Forward pass (left to right)
Just arithmetic. Compute each value and remember which operation produced it.
Backward pass (right to left)
We want to know: how sensitive is loss to each input? Start at loss.
By convention, d(loss)/d(loss) = 1.0 — the loss is perfectly sensitive to
itself.
Step 1 — through the + node.
loss = z + 1, so d(loss)/dz = 1. Adding a constant doesn’t change the slope — a nudge to z passes through to loss unchanged.
z.grad = 1.0
Step 2 — through the × node.
z = x × y. Using the product rule, dz/dx = y and dz/dy = x. Apply the chain rule:
d(loss)/dx = d(loss)/dz · dz/dx = 1.0 × y = 1.0 × 3.0 = 3.0
d(loss)/dy = d(loss)/dz · dz/dy = 1.0 × x = 1.0 × 2.0 = 2.0
So x.grad = 3.0 and y.grad = 2.0.
Sanity check
We used the chain rule to compute x.grad = 3.0. Can we verify that without any calculus? Yes — just run the computation twice, once normally and once with x nudged by a tiny amount, and measure how much the loss actually changed:
Original: z = 2.0 × 3.0 = 6.0, loss = 7.0
Nudge x: z = 2.001 × 3.0 = 6.003, loss = 7.003
Change in loss / change in x = 0.003 / 0.001 = 3.0 ✓
The brute-force answer (3.0) matches the chain-rule answer (3.0). Two completely independent methods give the same result, which confirms the calculus was correct.
This is what autograd does automatically for every node in the graph, no matter how large the computation grows — except it uses the chain rule, not brute force, because running the full computation twice per parameter would be far too slow.
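The same sanity check as code, treating the four-line computation as an ordinary function of its inputs (throwaway code, not part of the engine):

```typescript
// Brute-force gradients: nudge each input separately and measure the loss change.
const loss = (x: number, y: number) => x * y + 1;
const h = 0.001;
const xGrad = (loss(2 + h, 3) - loss(2, 3)) / h; // ≈ 3.0 (= y)
const yGrad = (loss(2, 3 + h) - loss(2, 3)) / h; // ≈ 2.0 (= x)
console.log(xGrad, yGrad);
```

Note this needed one extra forward pass per input; the chain rule gets both gradients from a single backward sweep.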
The Value Class: A Number That Remembers
Every number in the model is a Value rather than a plain number. A Value stores three things:

- data: the actual number (e.g., 0.03)
- grad: the gradient, filled in later during the backward pass
- children and localGrads: how this Value was computed from other Values
When we perform any operation on Values, the result is a new Value that remembers its children (the input Values) and the local gradients, the partial derivatives of that operation with respect to each input. These local gradients answer: “if I nudge this input, how much does the output change?”
children and localGrads are parallel arrays — one local gradient per
child. A binary operation like add or mul takes two inputs, so both arrays
have two entries. A unary operation like pow, log, or relu takes one
input, so both arrays have one entry. The backward pass pairs them up: the
gradient for children[i] is computed using localGrads[i].
Here is the constructor:
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
A leaf Value (like a model parameter) is created with just a number:
new Value(0.03). The children and localGrads default to empty arrays.
When an operation creates a new Value, it passes in the inputs and their local
gradients.
Every primitive operation in our engine records these derivatives. Here they all are, with the intuition for why each gradient is what it is.
Addition: a + b (local gradients [1, 1])
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
- d(a + b) / da = 1
- d(a + b) / db = 1
If you nudge a up by 0.001, the sum goes up by exactly 0.001. The output
changes by the same amount as the input, regardless of what the values are.
Hence [1, 1].
| a | b | a + b | Nudge a to 3.001 | New result | Change | Local grad |
|---|---|---|---|---|---|---|
| 3 | 5 | 8 | 3.001 + 5 | 8.001 | 0.001 | 1 |
Multiplication: a * b (local gradients [b, a])
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
- d(a * b) / da = b
- d(a * b) / db = a
If you nudge a up by 0.001, the product goes up by 0.001 * b. The
sensitivity to a depends on how large b is, and vice versa. Hence
[o.data, this.data], meaning the gradient for each input is the other input’s
value.
| a | b | a * b | Nudge a to 3.001 | New result | Change | Local grad |
|---|---|---|---|---|---|---|
| 3 | 5 | 15 | 3.001 * 5 | 15.005 | 0.005 | 5 (= b) |
Power: a ^ n (local gradient [n * a^(n-1)])
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
- d(a^n) / da = n * a^(n-1)
This is the classic power rule from calculus. The exponent drops down as a
coefficient, and the power decreases by one. For a^2, the gradient is 2a.
The larger a is, the more sensitive the square is to small changes.
| a | n | a ^ n | Nudge a to 3.001 | New result | Change | Local grad |
|---|---|---|---|---|---|---|
| 3 | 2 | 9 | 3.001^2 | 9.006001 | ~0.006 | 6 (= 2 * 3) |
Log: ln(a) (local gradient [1 / a])
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
- d(ln(a)) / da = 1 / a
The log function is steep when a is small and flat when a is large. A tiny
nudge to a small number produces a big change in the log; the same nudge to a
large number barely moves it. Note: log(0) is -Infinity and the gradient
1/0 poisons the computation. In our model, log() is only called on softmax
outputs, which are always positive, so this is safe in practice.
| a | ln(a) | Nudge a by 0.001 | New result | Change | Local grad |
|---|---|---|---|---|---|
| 0.1 | -2.303 | ln(0.101) | -2.293 | ~0.010 | 10 (= 1 / 0.1) |
| 10.0 | 2.303 | ln(10.001) | 2.3027 | ~0.0001 | 0.1 (= 1 / 10) |
Exp: e^a (local gradient [e^a])
exp(): Value {
return new Value(Math.exp(this.data), [this], [Math.exp(this.data)]);
}
- d(e^a) / da = e^a
The exponential function is its own derivative. The larger the output is, the
faster it grows. A nudge to a changes the output by an amount proportional
to the output itself.
| a | e^a | Nudge a to 2.001 | New result | Change | Local grad |
|---|---|---|---|---|---|
| 2 | 7.389 | e^2.001 | 7.396 | ~0.007 | 7.389 (= e^2) |
ReLU: max(0, a) (local gradient [1 if a > 0, else 0])
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
- d(relu(a)) / da = 1 if a > 0, 0 if a <= 0
ReLU is the simplest nonlinearity: it passes positive values through unchanged
and clamps negatives to zero. When a is positive, the gradient is 1 (the
nudge passes through). When a is negative, the gradient is 0 (the output is
stuck at zero, so nudging a does nothing).
| a | relu(a) | Nudge a by 0.001 | New result | Change | Local grad |
|---|---|---|---|---|---|
| 3 | 3 | relu(3.001) | 3.001 | 0.001 | 1 |
| -2 | 0 | relu(-1.999) | 0 | 0 | 0 |
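Every table above follows the same recipe, so one loop can compare each analytic local gradient against a finite difference (a standalone check, not part of the engine):

```typescript
// [name, f, analytic derivative, test point]
const checks: Array<[string, (a: number) => number, (a: number) => number, number]> = [
  ["pow (n=2)", (a) => a ** 2, (a) => 2 * a, 3],
  ["log", (a) => Math.log(a), (a) => 1 / a, 0.1],
  ["exp", (a) => Math.exp(a), (a) => Math.exp(a), 2],
  ["relu", (a) => Math.max(0, a), (a) => (a > 0 ? 1 : 0), 3],
];
const h = 1e-6;
for (const [name, f, df, a] of checks) {
  const numeric = (f(a + h) - f(a)) / h; // brute-force slope
  console.log(`${name}: numeric ${numeric.toFixed(3)}, analytic ${df(a).toFixed(3)}`);
}
```

Each numeric slope should agree with the analytic local gradient to several decimal places.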
Derived Operations
The remaining operations (neg, sub, div) don’t need their own gradient
rules. They are composed from the primitives above:
- neg(a) = a * (-1), uses mul
- sub(a, b) = a + (-b), uses add and neg
- div(a, b) = a * b^(-1), uses mul and pow
Because every step is tracked in the computation graph, the chain rule handles these compositions automatically.
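For instance, composing mul and pow for division yields the familiar quotient gradients, which a quick finite-difference check confirms (throwaway code):

```typescript
// div(a, b) = a * b^(-1). Chaining the mul and pow rules gives:
//   d(a/b)/da = b^(-1)          and   d(a/b)/db = a * (-1) * b^(-2)
const a = 6.0, b = 3.0, h = 1e-6;
const dA = ((a + h) / b - a / b) / h; // ≈  1/3
const dB = (a / (b + h) - a / b) / h; // ≈ -6/9
console.log(dA, dB);
```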
The Computation Graph
As the model processes a token, every addition, multiplication, exponentiation and so on creates a new Value node. The result is a computation graph, a tree-like structure connecting the parameters at the leaves to a single loss value at the root.
Building the Graph
Consider a tiny example using the Value operations from the previous section:
const x = new Value(2.0);
const y = new Value(3.0);
const z = x.mul(y); // z = 6.0
const loss = z.add(1); // loss = 7.0
Each operation creates a new Value that records its children and local gradients. After these four lines, the graph looks like this:
x(2.0) ──┐
         mul ──→ z(6.0) ──┐
y(3.0) ──┘                add ──→ loss(7.0)
                 1.0 ─────┘
And the internal linkages are:
| Node | .data | .children | .localGrads |
|---|---|---|---|
| x | 2.0 | [] | [] |
| y | 3.0 | [] | [] |
| z | 6.0 | [x, y] | [3.0, 2.0] |
| loss | 7.0 | [z, Value(1)] | [1, 1] |
Leaf nodes (x and y) have no children — they are the inputs. The mul
node records both inputs as children and stores [y.data, x.data] as local
gradients (the product rule). The add node stores [1, 1] because addition
passes gradients through unchanged.
This is the entire data structure that backward() will walk. No separate
graph object, no registration step — the graph is just the chain of .children
pointers from the loss back to the leaves.
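To make the table concrete, the following sketch inlines just enough of the Value class (the fields plus add and mul, copied from the previous section) to build the four-line graph and inspect its linkages:

```typescript
// Minimal inline Value: enough to build and inspect the example graph.
class Value {
  grad = 0;
  constructor(
    public data: number,
    public children: Value[] = [],
    public localGrads: number[] = [],
  ) {}
  add(other: Value | number): Value {
    const o = typeof other === "number" ? new Value(other) : other;
    return new Value(this.data + o.data, [this, o], [1, 1]);
  }
  mul(other: Value | number): Value {
    const o = typeof other === "number" ? new Value(other) : other;
    return new Value(this.data * o.data, [this, o], [o.data, this.data]);
  }
}

const x = new Value(2.0);
const y = new Value(3.0);
const z = x.mul(y);
const loss = z.add(1);

console.log(z.localGrads);           // [3, 2] — the product rule, recorded at build time
console.log(loss.localGrads);        // [1, 1] — addition passes gradients through
console.log(loss.children[0] === z); // true — the graph is just .children pointers
```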
From Loss to Parameters
In the real model, the graph is much larger — thousands of nodes — but the
structure is the same. The loss at the root points to the softmax outputs,
which point to the logits, which point to the attention and MLP operations,
which eventually point to the embedding lookups and weight matrices. Every
parameter is reachable by following .children pointers from the loss.
The backward() method traverses this graph in reverse (from loss to leaves),
using each node’s .localGrads to propagate the gradient signal. That is the
subject of the next section.
Backward Pass: The Chain Rule
The backward() method answers the question this chapter opened with: how does the loss change when each parameter is nudged? Starting from the loss (with gradient 1.0), it walks backward through every node in the graph, applying the chain rule:
gradient of child += (gradient of parent) * (local gradient)
The += is critical: when a Value feeds into multiple operations, its gradient
must sum contributions from all paths. Using = would overwrite earlier
contributions and produce incorrect gradients.
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
First it sorts the graph topologically (inputs before outputs), then walks it in reverse. For each node, it multiplies its own gradient by each local gradient and accumulates it onto the child’s gradient.
After backward() runs, every Value in the entire computation, including
all model parameters, has its .grad field filled in. Each gradient says:
“increase this parameter slightly and the loss changes by this much.”
Note that backward() does not zero out gradients before running — it sets
this.grad = 1 at the root and accumulates onto whatever .grad values
already exist. Calling it twice on the same graph would double-count every
gradient. In our training loop this is not a problem: each step builds a fresh
computation graph, and we explicitly reset every parameter’s gradient to zero
before the next step.
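Both behaviors can be demonstrated with the engine itself. The sketch below inlines the Value class from this chapter (add, mul, and backward only) so it runs standalone; it reproduces the worked example, then shows why += matters when one Value feeds both inputs of the same node:

```typescript
// Minimal inline copy of the Value class from this chapter.
class Value {
  grad = 0;
  constructor(
    public data: number,
    public children: Value[] = [],
    public localGrads: number[] = [],
  ) {}
  add(other: Value | number): Value {
    const o = typeof other === "number" ? new Value(other) : other;
    return new Value(this.data + o.data, [this, o], [1, 1]);
  }
  mul(other: Value | number): Value {
    const o = typeof other === "number" ? new Value(other) : other;
    return new Value(this.data * o.data, [this, o], [o.data, this.data]);
  }
  backward(): void {
    const topo: Value[] = [];
    const visited = new Set<Value>();
    const build = (v: Value): void => {
      if (!visited.has(v)) {
        visited.add(v);
        for (const c of v.children) build(c);
        topo.push(v);
      }
    };
    build(this);
    this.grad = 1;
    for (const v of topo.reverse()) {
      for (let i = 0; i < v.children.length; i++) {
        v.children[i].grad += v.localGrads[i] * v.grad;
      }
    }
  }
}

// The worked example: gradients match the chain-rule derivation.
const x = new Value(2.0);
const y = new Value(3.0);
x.mul(y).add(1).backward();
console.log(x.grad, y.grad); // 3 2

// One Value on two paths: s = w * w. Both children of s are w, so backward
// accumulates w.data twice — the correct derivative 2w = 10. With = instead
// of +=, the second write would overwrite the first and report 5.
const w = new Value(5.0);
w.mul(w).backward();
console.log(w.grad); // 10
```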
The vsum Helper
One more utility: vsum adds a list of Values through the computation graph,
so the sum is also differentiable:
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
This is used throughout: for dot products, for summing losses, and for normalization.
Complete Code
Here is the complete source code for the autograd engine, along with the tokenizer from the previous chapter.
autograd.ts
/**
* Autograd Engine
*
* A scalar-valued automatic differentiation engine. Each Value node records:
* - its forward-pass result (data)
* - its gradient w.r.t. the loss (grad), filled in by backward()
* - its children in the computation graph and the local derivatives
*
* The backward() method applies the chain rule via reverse-mode autodiff:
* topologically sort the graph, then propagate gradients from output to inputs.
*
* "If I nudge this parameter slightly, how does the loss change?"
*/
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
exp(): Value {
return new Value(Math.exp(this.data), [this], [Math.exp(this.data)]);
}
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
neg(): Value {
return this.mul(-1);
}
sub(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.add(o.neg());
}
div(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.mul(o.pow(-1));
}
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
}
/** Sum a list of Values through the computation graph. */
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
* Builds a word-level vocabulary from the training corpus, with a BOS
* (Beginning of Sequence) delimiter appended on each side.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
Neural Network Primitives
On top of the autograd engine from the previous chapter, we build three operations that the model uses throughout. These are general-purpose building blocks, the same operations found in every neural network framework. A neural network only needs a surprisingly small toolkit:
- A way to combine inputs. Take several numbers, multiply each by a learned weight, and add the results together. This is just a weighted sum, the same idea as computing a course grade from weighted assignment scores. That is what linear does.
- A way to make decisions. The model needs to express confidence: “I think the next word is 70% likely to be cat and 20% likely to be dog.” Raw scores can be any number, but probabilities must be positive and sum to 1. That is what softmax does. It turns arbitrary scores into a proper probability distribution.
- A way to stay stable. Numbers that pass through dozens of weighted sums tend to either explode toward infinity or shrink toward zero. That is what rmsnorm does. It rescales a vector so the values stay in a reasonable range, similar to normalizing a set of exam scores to have a consistent spread.
Linear (Matrix-Vector Multiply)
The fundamental neural net operation. It transforms a vector by multiplying it
with a weight matrix. First, a type alias — a Matrix is just a 2D array of
Values:
type Matrix = Value[][];
function linear(input: Value[], weights: Matrix): Value[] {
return weights.map((row) => vsum(row.map((w, i) => w.mul(input[i]))));
}
Each output element is a dot product: multiply each input by the corresponding weight, then sum the results. Here is a concrete example with a 3-element input and a 2×3 weight matrix:
input = [2, 3, 1]
weights = [[1, 0, -1],
[0, 2, 1]]
output[0] = (1×2) + (0×3) + (-1×1) = 1
output[1] = (0×2) + (2×3) + ( 1×1) = 7
output = [1, 7]
Each row of the weight matrix produces one output element. A weight matrix with shape 2×3 takes a 3-element input and produces a 2-element output. In the actual model, a weight matrix with shape 128×32 takes a 32-element input vector and produces a 128-element output vector. The operation is the same — just more rows and longer dot products.
This is how the model mixes information across dimensions.
Because every multiplication and addition uses Value nodes, the entire
operation is differentiable, so gradients flow back through it during training.
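To check the arithmetic without autograd, here is a plain-number mirror of linear (the helper name linearNum is ours, for illustration only):

```typescript
// Plain-number mirror of linear: each output is the dot product of one
// weight row with the input vector.
const linearNum = (input: number[], weights: number[][]): number[] =>
  weights.map((row) => row.reduce((acc, w, i) => acc + w * input[i], 0));

console.log(linearNum([2, 3, 1], [[1, 0, -1], [0, 2, 1]])); // [ 1, 7 ]
```

The output matches the worked example above; the Value version performs exactly this arithmetic, one graph node per multiply and add.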
Softmax (Scores to Probabilities)
Neural networks often produce raw, unnormalized scores — numbers that can be positive, negative, large, or small. On their own these scores don’t mean much. Softmax converts a vector of scores into a probability distribution: a vector of positive numbers that sum to 1.
function softmax(logits: Value[]): Value[] {
const maxVal = Math.max(...logits.map((v) => v.data));
const exps = logits.map((v) => v.sub(maxVal).exp());
const total = vsum(exps);
return exps.map((e) => e.div(total));
}
The steps: subtract the maximum value (for numerical stability), exponentiate each element, then divide by the sum. For example:
scores = [2.0, 1.0, 0.1]
subtract max: [0.0, -1.0, -1.9]
exponentiate: [1.0, 0.37, 0.15]
divide by sum: [0.66, 0.24, 0.10]
The largest score gets the highest probability, but every element stays positive and they all sum to 1.
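A plain-number mirror of softmax (the name softmaxNum is ours, for illustration) reproduces the example:

```typescript
// Plain-number mirror of softmax: subtract max, exponentiate, normalize.
const softmaxNum = (scores: number[]): number[] => {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
};

const probs = softmaxNum([2.0, 1.0, 0.1]);
console.log(probs.map((p) => p.toFixed(2)));  // [ '0.66', '0.24', '0.10' ]
console.log(probs.reduce((a, b) => a + b, 0)); // 1 (up to floating-point rounding)
```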
Softmax is used in three places:
- Inside attention, to turn attention scores into attention weights
- During training, to turn output scores into probabilities for computing the loss
- During generation, to turn output scores into a probability distribution for sampling
RMSNorm (Normalization)
Rescales a vector to have roughly unit variance. This prevents numbers from growing too large or too small as they pass through multiple layers:
function rmsnorm(input: Value[]): Value[] {
const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
const scale = ms.add(1e-5).pow(-0.5);
return input.map((xi) => xi.mul(scale));
}
It computes the root mean square of the vector, then divides each element by
it. The 1e-5 prevents division by zero. For example:
input = [3.0, 4.0]
mean of squares: (9 + 16) / 2 = 12.5
root mean square: √12.5 ≈ 3.54
divide each: [3.0 / 3.54, 4.0 / 3.54] ≈ [0.85, 1.13]
The values have been rescaled so they sit near 1 — but their relative proportions (3:4) are preserved. Without normalization, activations can explode or vanish across layers, making training unstable.
Real RMSNorm includes a learnable per-element scale parameter (gamma) that lets each dimension adjust its magnitude after normalization. We omit it for simplicity — our model trains fine without it at this scale.
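A plain-number mirror of rmsnorm (the name rmsnormNum is ours, for illustration) reproduces the example:

```typescript
// Plain-number mirror of rmsnorm: divide by the root mean square,
// with 1e-5 added to avoid division by zero (matching pow(-0.5) above).
const rmsnormNum = (input: number[]): number[] => {
  const ms = input.reduce((acc, x) => acc + x * x, 0) / input.length;
  const scale = 1 / Math.sqrt(ms + 1e-5);
  return input.map((x) => x * scale);
};

const out = rmsnormNum([3.0, 4.0]);
console.log(out.map((v) => v.toFixed(2))); // [ '0.85', '1.13' ]
console.log(out[1] / out[0]);              // ≈ 1.333… — the 3:4 ratio survives
```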
Summary
These three primitives (linear, softmax, and rmsnorm) are the
building blocks the model assembles in the next chapter. Each one is built entirely from
Value operations, so the autograd engine can compute gradients through all
of them.
| Primitive | What it does | Where it is used |
|---|---|---|
| linear | Matrix-vector multiply | Attention projections, MLP layers, output head |
| softmax | Scores to probabilities | Attention weights, training loss, generation sampling |
| rmsnorm | Normalize to unit RMS | Before each transformer sub-block |
Complete Code
Here is the complete source code for the neural network primitives, along with all files from previous chapters.
nn.ts
/**
* Neural Network Primitives
*
* The building blocks that the GPT architecture assembles:
* - linear: matrix-vector multiply (the fundamental neural net operation)
* - softmax: convert raw scores into a probability distribution
* - rmsnorm: normalize activations to stabilize training
*
* These mirror PyTorch's torch.nn.functional — general-purpose operations
* used by the model, training loop, and inference.
*/
import { Value, vsum } from "./autograd.js";
export type Matrix = Value[][];
/** y = Wx: multiply a weight matrix by an input vector. */
export function linear(input: Value[], weights: Matrix): Value[] {
return weights.map((row) => vsum(row.map((w, i) => w.mul(input[i]))));
}
/** Convert raw logits to probabilities. Subtracts max for numerical stability. */
export function softmax(logits: Value[]): Value[] {
const maxVal = Math.max(...logits.map((v) => v.data));
const exps = logits.map((v) => v.sub(maxVal).exp());
const total = vsum(exps);
return exps.map((e) => e.div(total));
}
/** Root Mean Square normalization: rescale activations to unit RMS. */
export function rmsnorm(input: Value[]): Value[] {
const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
const scale = ms.add(1e-5).pow(-0.5);
return input.map((xi) => xi.mul(scale));
}
autograd.ts
/**
* Autograd Engine
*
* A scalar-valued automatic differentiation engine. Each Value node records:
* - its forward-pass result (data)
* - its gradient w.r.t. the loss (grad), filled in by backward()
* - its children in the computation graph and the local derivatives
*
* The backward() method applies the chain rule via reverse-mode autodiff:
* topologically sort the graph, then propagate gradients from output to inputs.
*
* "If I nudge this parameter slightly, how does the loss change?"
*/
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
exp(): Value {
return new Value(Math.exp(this.data), [this], [Math.exp(this.data)]);
}
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
neg(): Value {
return this.mul(-1);
}
sub(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.add(o.neg());
}
div(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.mul(o.pow(-1));
}
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
}
/** Sum a list of Values through the computation graph. */
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
* Builds a word-level vocabulary from the training corpus, with a BOS
* (Beginning of Sequence) delimiter appended on each side.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
The Model
With the autograd engine and neural net primitives in hand, we can build the model. What is a model, concretely?
A model is a configuration plus a large collection of numbers (parameters).
Before training, these numbers are random. After training, they encode everything the model has learned about language. The architecture (how those numbers are wired together) determines what the model can learn. The training process determines what it does learn.
What the Model Contains
Concretely, the model is a nested structure of weight matrices — arrays of
arrays of Value nodes. Each matrix has a specific shape and role. Here is
every piece:
Model
├── config: { nLayer: 2, nEmbd: 32, vocabSize: 597, ... }
│
├── weights
│ ├── tokenEmbedding [597 × 32] one row per word in the vocabulary
│ ├── positionEmbedding [16 × 32] one row per position in a sentence
│ │
│ ├── layers[0]
│ │ ├── attention
│ │ │ ├── query [32 × 32] three identical projections
│ │ │ ├── key [32 × 32] (roles emerge from training,
│ │ │ ├── value [32 × 32] not from the code)
│ │ │ └── output [32 × 32] combine attention results
│ │ └── mlp
│ │ ├── hidden [128 × 32] expand to 128 dimensions
│ │ └── output [32 × 128] compress back to 32
│ │
│ ├── layers[1] (same structure as layer 0)
│ │
│ └── output [597 × 32] map back to vocabulary scores
│
└── params: all 63,296 Values in a flat array (for the optimizer)
Every number in every matrix is a Value. That means the autograd engine can
compute gradients for all 63,296 of them. The rest of this chapter explains
what each piece does and how they connect.
The Forward Pass
Here is what happens when we feed a token into the model:
Token ID (e.g. 541 = "the") + Position (e.g. 2)
|
[Token Embedding] + [Position Embedding] -> 32-dim vector
|
[RMSNorm] -> normalize the vector
|
+--- Transformer Layer 0 ----------------------+
| [RMSNorm] |
| [Multi-Head Attention] -> look at context |
| [+ Residual Connection] |
| [RMSNorm] |
| [MLP: expand -> ReLU -> compress] -> process|
| [+ Residual Connection] |
+-----------------------------------------------+
|
+--- Transformer Layer 1 ----------------------+
| (same structure) |
+-----------------------------------------------+
|
[output projection] -> 597 raw scores (one per word)
|
"logits", unnormalized predictions for the next word
The output is 597 raw scores, one per word in the vocabulary. These scores are
called logits — unnormalized numbers that can be positive, negative, large,
or small. A higher logit means the model thinks that word is more likely to
come next. On their own logits are not probabilities; they become probabilities
when passed through softmax (from the previous chapter) during training or generation.
Before training, logits are essentially random. After training, if you feed the model “the cat”, the logits for “runs,” “eats,” and “sits” will be much higher than “the” or “zoo.”
Simplifications
Our architecture follows the same structure as GPT-2 and LLaMA, with a few simplifications that keep the code short without changing how the model works:
| Our model | Standard GPT-2 / LLaMA | Why we simplify |
|---|---|---|
| RMSNorm | LayerNorm (GPT-2) / RMSNorm (LLaMA) | Fewer operations, same effect at this scale |
| ReLU activation | GELU (GPT-2) / SiLU (LLaMA) | Simpler gradient, easier to understand |
| No bias terms | Bias on every linear layer (GPT-2) | Fewer parameters, modern models drop them too |
| No learnable norm scale | Learnable gamma per element | One less thing to train, works fine here |
| No final norm before output | RMSNorm before output projection | Skipped for brevity |
None of these affect the core ideas. The architecture, the attention mechanism, the training loop — all identical. A reader who understands this model understands the real thing.
Configuration
The configuration defines the shape of the model:
const model = createModel({
nLayer: 2, // number of transformer layers
nEmbd: 32, // embedding dimension (size of internal vectors)
blockSize: 16, // maximum sequence length (longest sentence we can process)
nHead: 4, // number of attention heads
headDim: 8, // dimension per attention head (nEmbd / nHead)
vocabSize: 597, // our tokenizer's vocabulary size
});
These are small numbers. Production models use nEmbd in the thousands and
stack dozens of layers. But the architecture is the same; ours just fits in
memory and trains in minutes instead of months.
A note on nHead: with 32 embedding dimensions, 4 heads is a good balance.
Each head gets 32 / 4 = 8 dimensions to work with. Two heads would give 16
dims each (fewer distinct attention patterns), and 8 heads would give 4 dims
each (very little room per head at this small scale).
Parameters: The Model’s Memory
When we create a model, we allocate a set of weight matrices filled with small random numbers:
function matrix(nout: number, nin: number, std = 0.08): Matrix {
return Array.from({ length: nout }, () =>
Array.from({ length: nin }, () => new Value(gauss(0, std)))
);
}
Each weight matrix serves a specific role. Here is every matrix in our model and what it does:
| Matrix | Shape | Purpose |
|---|---|---|
| weights.tokenEmbedding | 597 × 32 | Token embeddings: one 32-dim vector per word |
| weights.positionEmbedding | 16 × 32 | Position embeddings: one 32-dim vector per position |
| weights.output | 597 × 32 | Output projection: maps back to vocabulary |
| layers[0].attention.query | 32 × 32 | Attention query weights (layer 0) |
| layers[0].attention.key | 32 × 32 | Attention key weights (layer 0) |
| layers[0].attention.value | 32 × 32 | Attention value weights (layer 0) |
| layers[0].attention.output | 32 × 32 | Attention output weights (layer 0) |
| layers[0].mlp.hidden | 128 × 32 | MLP hidden layer (layer 0) |
| layers[0].mlp.output | 32 × 128 | MLP output layer (layer 0) |
| layers[1].attention.query | 32 × 32 | Attention query weights (layer 1) |
| layers[1].attention.key | 32 × 32 | Attention key weights (layer 1) |
| layers[1].attention.value | 32 × 32 | Attention value weights (layer 1) |
| layers[1].attention.output | 32 × 32 | Attention output weights (layer 1) |
| layers[1].mlp.hidden | 128 × 32 | MLP hidden layer (layer 1) |
| layers[1].mlp.output | 32 × 128 | MLP output layer (layer 1) |
Total: 63,296 parameters. Every one of these numbers will be adjusted during training.
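You can reproduce that total directly from the shapes in the table:

```typescript
// Reproduce the parameter count from the config shapes.
const nEmbd = 32, nLayer = 2, blockSize = 16, vocabSize = 597;
const embeddings = vocabSize * nEmbd + blockSize * nEmbd; // token + position
const outputHead = vocabSize * nEmbd;                     // 597 × 32
const perLayerAttention = 4 * nEmbd * nEmbd;              // Q, K, V, output
const perLayerMlp = 2 * (4 * nEmbd) * nEmbd;              // 128×32 expand + 32×128 compress
const total =
  embeddings + outputHead + nLayer * (perLayerAttention + perLayerMlp);
// total === 63296
```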
Embeddings: How Token IDs Become Vectors
This is a crucial concept. The model cannot do math on the integer 541
directly. It needs a representation, a vector of numbers that captures
something about the meaning of the word.
The token embedding table (tokenEmbedding) is a 597 x 32 matrix. Each row is a
32-dimensional vector representing one word in our vocabulary. To look up the
embedding for “the” (token 541):
tokenEmbedding[541] -> [0.03, -0.11, 0.07, ..., 0.02] (32 numbers)
Before training, these vectors are random. The word “the” and the word “cat” have random, meaningless embeddings. After training, words that appear in similar contexts will have similar embeddings. The model discovers relationships between words entirely from the data.
The position embedding table (positionEmbedding) works the same way but for positions
instead of words. positionEmbedding[0] is a 32-dim vector meaning “first word,”
positionEmbedding[3] means “fourth word.” This is how the model knows word order. Without
position embeddings, it would not know whether “the cat chased the dog” and
“the dog chased the cat” are different.
The first thing the model does with any input token is combine both embeddings:
const tokenVector: Value[] = weights.tokenEmbedding[tokenId]; // what word is this?
const positionVector: Value[] = weights.positionEmbedding[posId]; // where in the sentence is it?
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i])); // combine
This combined vector — hidden — is 32 numbers encoding both the word identity
and its position. It is the hidden state that flows through the rest of the
network.
Notice that embedding lookup is array indexing, not a matrix multiply. When the model processes token 541, only row 541 of the embedding table enters the computation graph. The other 596 rows are never touched — they receive no gradients during the backward pass and are not updated. Over many training sentences different words appear, so different rows get updated at different times. Words that appear frequently develop better embeddings than rare ones, simply because they get more gradient updates.
Multi-Head Attention
Each transformer layer has two blocks, both using the primitives from the previous chapter. The first is multi-head attention — the mechanism that lets the model look at previous tokens to inform its prediction.
Three Identical Projections
The model computes three vectors from the current hidden state using linear:
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
These are three matrix multiplications with three different weight matrices, all
the same shape ([32 × 32]). Before training, they contain random numbers.
There is nothing in this code that makes query behave differently from key
or value — they are structurally identical linear projections of the same
hidden state.
The names describe what these projections become, not what they are. The labels Q, K, and V are aspirational: they describe roles that emerge through training, not roles assigned by the code. What forces them into distinct roles is the math that follows.
The Attention Computation
Here is what happens to Q, K, and V once they exist:
attention = softmax(Q · K^T / √headDim) × V
This structure imposes hard constraints:
- Q and K always get dot-producted to produce a similarity score. They are structurally forced into a “matching” role — whatever W_q and W_k produce will be compared against each other.
- V always gets weighted-summed using those scores. It is structurally forced into a “what gets retrieved” role — the scores from Q·K gate how much of each V vector flows through.
What’s learned is everything else: what Q and K match on (syntactic role? position? something we can’t name?), and what V carries (whatever information turned out to be useful to retrieve). Q/K/V really are like query/key/value in a database lookup — but the schema of that database, what counts as a match, what’s worth storing — all of that is discovered by training.
The √headDim scaling prevents the dot products from growing so large that
softmax saturates into near-zero gradients, which would kill learning.
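A quick numeric illustration of the saturation problem (plain numbers; the score values are arbitrary, chosen only to show the effect):

```typescript
// Softmax over two scores, to show how large magnitudes saturate it.
const softmax2 = (a: number, b: number): number[] => {
  const m = Math.max(a, b);
  const ea = Math.exp(a - m), eb = Math.exp(b - m);
  return [ea / (ea + eb), eb / (ea + eb)];
};
const unscaled = softmax2(8, 0);               // ≈ [0.9997, 0.0003]: nearly one-hot, gradients ~0
const scaled = softmax2(8 / Math.sqrt(8), 0);  // ≈ [0.944, 0.056]: still has usable slope
```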
Here is how the math notation maps to our code:
| Math | Code | Shape |
|---|---|---|
| Q | query (or headQuery per head) | 32-dim (8 per head) |
| K | key (or headKeys per head) | 32-dim (8 per head) |
| V | value (or headValues per head) | 32-dim (8 per head) |
| W_q, W_k, W_v | layer.attention.query, .key, .value | [32 × 32] |
| √d_k | Math.pow(headDim, -0.5) | scalar (the code multiplies by 1/√8) |
Worked Example: “the cat eats”
Suppose we’re processing the sentence “BOS the cat eats” and we’re on “eats” at position 3, trying to predict what comes next. The 32-dim hidden state for “eats” gets projected into Q, K, and V. Each head works on its own 8-dim slice, so let’s follow one head.
Q_eats = hidden @ W_q → an 8-dim vector for this head
The KV cache already holds the key and value vectors for every previous position (more on the cache later). So we have:
K_BOS, K_the, K_cat, K_eats (cached keys for positions 0, 1, 2, 3)
V_BOS, V_the, V_cat, V_eats (cached values for positions 0, 1, 2, 3)
Compute attention scores — how much should “eats” attend to each position:
score("eats", BOS) = Q_eats · K_BOS / √8 = 0.1 (low)
score("eats", "the") = Q_eats · K_the / √8 = 0.5 (low)
score("eats", "cat") = Q_eats · K_cat / √8 = 2.5 (high)
score("eats", "eats") = Q_eats · K_eats / √8 = 1.0 (medium)
After softmax, these become weights that sum to 1.0. The output is a weighted sum of the value vectors:
output = 0.06 × V_BOS + 0.09 × V_the + 0.69 × V_cat + 0.15 × V_eats
“cat” dominates — the information encoded in V_cat flows through strongly. This is how the model routes information from relevant earlier tokens to the current position.
To be honest: those scores are made up. What actually causes Q_eats · K_cat to be high? The weight matrices W_q and W_k learned — from millions of training examples — to project words into a space where verbs and their subjects end up pointing in similar directions. We can observe that this happens. We can’t easily read off why from the weight matrices.
Causal Masking
Our model is autoregressive — it predicts the next token from previous ones. Each position can only attend to itself and earlier positions, never future ones. In our implementation this happens naturally: the KV cache only contains entries for tokens that have already been processed, so there are no future keys to attend to.
Multiple Heads
The “multi-head” part means we split the 32-dim vectors into 4 heads of 8 dimensions each. Each head has its own slice of W_q, W_k, and W_v, so each head runs its own independent attention computation in parallel:
for (let h = 0; h < nHead; h++) {
const headStart = h * headDim;
const headQuery = query.slice(headStart, headStart + headDim); // this head's 8-dim query
const headKeys = keys[li].map((ki) => ki.slice(headStart, headStart + headDim));
const headValues = values[li].map((vi) => vi.slice(headStart, headStart + headDim));
// compute attention scores and weighted sum for this head
}
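The commented-out step can be sketched like this, using plain numbers instead of Value nodes (so this is a numeric illustration, not the repo's implementation, which builds the same computation out of Value operations so gradients flow through it):

```typescript
// One head's attention: score the query against every cached key,
// softmax the scores, then take a weighted sum of the cached values.
function headAttention(
  headQuery: number[],
  headKeys: number[][],
  headValues: number[][],
): number[] {
  const scale = 1 / Math.sqrt(headQuery.length); // 1/√headDim
  const scores = headKeys.map((k) =>
    k.reduce((s, ki, i) => s + ki * headQuery[i], 0) * scale,
  );
  // Softmax with max subtraction for numerical stability
  const m = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - m));
  const total = exps.reduce((a, b) => a + b, 0);
  const weights = exps.map((e) => e / total); // attention weights, sum to 1
  // Weighted sum of value vectors: one output element per head dimension
  return headQuery.map((_, d) =>
    headValues.reduce((s, v, t) => s + weights[t] * v[d], 0),
  );
}
```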
Different heads learn to attend to different things. On “the cat eats”, the four heads might specialize like this:
Head 0 — tracks grammatical subject:
score("eats" → "cat") = 0.9 → output mostly carries V_cat
Head 1 — tracks recency / local context:
score("eats" → "eats") = 0.8 → output mostly carries V_eats
Head 2 — tracks determiner–noun relationships:
score("cat" → "the") = 0.9 → output mostly carries V_the
Head 3 — honestly, nobody knows:
output = some mix that turns out to be useful
The outputs from all four heads are concatenated back into a 32-dim vector
and projected through one more linear layer (layer.attention.output) to
produce the final attention output.
By the time we’re done, the representation of “eats” has been enriched by: who the subject is, where we are in the sequence, what the determiner structure looks like, and whatever Head 3 noticed.
Why Heads Don’t All Learn the Same Thing
Each head starts with different random weights. Gradient descent is local — it nudges each head’s weights based on where they started, so they drift in different directions. And there’s no benefit to convergence: redundant heads contribute identical information, giving the model no extra expressive power. The loss landscape rewards diversity because diverse heads capture more signal.
One consequence: the capabilities are reproducible but their locations are not. Train the same model twice with different random seeds and you’ll get subject-verb tracking somewhere, but not necessarily in the same head or layer. The capability emerges; its address doesn’t.
MLP (Feed-Forward Network)
Each transformer layer has two blocks. Attention was the first — it mixes information between tokens. The MLP is the second, and it does something complementary: it processes each token’s vector independently.
That distinction matters. Attention is how “cat” gets routed into the representation of “eats.” But attention is mostly a weighted average — it selects and combines, it doesn’t do much computation on its own. The MLP is where per-token nonlinear computation happens. It transforms each vector through learned functions that training has shaped to be useful.
Attention moves information around. The MLP does something with it.
The Code
hidden = linear(hidden, layer.mlp.hidden); // 32 -> 128
hidden = hidden.map((h) => h.relu()); // non-linearity
hidden = linear(hidden, layer.mlp.output); // 128 -> 32
Three operations: expand, apply nonlinearity, compress.
Why Expand and Compress?
The first linear layer projects the 32-dim vector into 128 dimensions — a 4x expansion. The second projects it back down to 32. Why not just do a single 32→32 transformation?
The expanded 128-dim space gives the model room to compute. In 32 dimensions, each element has to carry the final representation directly. In 128 dimensions, the model can build intermediate features — combinations and interactions that are useful for computing the final output but don’t need to survive in the compressed result.
Think of it as scratch space. The expansion gives 128 dimensions to work in; the compression selects what’s worth keeping.
Why ReLU Is Critical
ReLU sets negative values to zero: relu(x) = max(0, x). This is the
nonlinearity between the two linear layers, and without it, the MLP would be
pointless.
Here’s why: a linear transformation followed by another linear transformation
is just… a single linear transformation. Matrix multiplication composes.
W₂(W₁x) = (W₂W₁)x — you could replace the two weight matrices with one
matrix that does the same thing. No matter how many linear layers you stack,
they collapse into one.
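You can verify the collapse numerically with a toy 2×2 example (plain numbers, values arbitrary):

```typescript
// Two stacked linear layers equal one linear layer with W = W2·W1.
const matVec = (W: number[][], x: number[]): number[] =>
  W.map((row) => row.reduce((s, w, i) => s + w * x[i], 0));
const matMul = (A: number[][], B: number[][]): number[][] =>
  A.map((row) => B[0].map((_, j) => row.reduce((s, a, k) => s + a * B[k][j], 0)));

const W1 = [[1, 2], [3, 4]];
const W2 = [[0, 1], [1, 1]];
const x = [5, -2];
const stacked = matVec(W2, matVec(W1, x));   // W2(W1 x)
const collapsed = matVec(matMul(W2, W1), x); // (W2 W1) x
// stacked and collapsed are both [7, 8]: the second layer added no expressive power
```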
ReLU breaks this. By zeroing out different elements depending on the input, it makes the transformation input-dependent — different inputs activate different subsets of the 128 hidden neurons. This is what lets the MLP compute functions that aren’t just weighted sums. It’s what makes depth useful.
What Does It Compute?
Same honesty as attention: we can observe what the MLP does in aggregate, but the individual weight values aren’t interpretable. After attention has gathered context into the “eats” vector — enriching it with information about the subject, position, and determiner structure — the MLP applies a learned nonlinear transformation to that enriched representation.
What patterns does it recognize? What features does it compute? Training shaped the weights to reduce prediction error. The results work. The internals are opaque.
Residual Connections
After each block (attention and MLP), the original input is added back:
hidden = hidden.map((h, i) => h.add(beforeBlock[i]));
where beforeBlock is the hidden state saved before the block started. This
is element-wise addition. The output of each block is added to the input
that went in, not substituted for it. The consequence: a block doesn’t
have to reproduce the entire representation from scratch. It only needs to
learn a useful correction — a delta to add on top of what’s already there.
If a block has nothing useful to contribute, its weights can settle near zero
and the input passes through unchanged.
Why This Matters for Training
During training, the model adjusts weights using gradients — signals that flow backward through the network saying “this weight should go up” or “this weight should go down.” Without residual connections, those signals have to pass through every layer’s transformations on the way back. Each layer can shrink the signal (vanishing gradients) or amplify it out of control (exploding gradients). By the time the signal reaches the early layers, it’s often degraded beyond usefulness.
The residual connection gives gradients a direct path. Because the input is
added to the output (y = f(x) + x), the gradient of y with respect to x
always includes a term of 1 — the identity. No matter what f does to the
gradient, the skip connection ensures the raw signal can still flow backward
unimpeded. This is what makes training deep networks practical.
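A small numeric check of that claim, using finite differences to estimate the slopes (the toy f is chosen to have a near-zero derivative):

```typescript
// With a skip connection y = f(x) + x, dy/dx = f'(x) + 1, so the
// gradient survives even when f'(x) is nearly zero.
const f = (x: number) => 0.0001 * x * x; // a block with a tiny local slope
const y = (x: number) => f(x) + x;       // the same block plus a residual
const slope = (g: (v: number) => number, x: number, h = 1e-6) =>
  (g(x + h) - g(x - h)) / (2 * h);
// slope(f, 1) ≈ 0.0002  (the block alone: gradient nearly vanished)
// slope(y, 1) ≈ 1.0002  (with the skip: gradient intact)
```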
The KV Cache
The model processes one token at a time. If we have the sentence “the cat eats…”, when we feed it “cat” at position 2, it does not automatically know that “the” came before it. But attention — the mechanism that lets the model look at previous words — needs access to those earlier tokens. How?
The answer is that we cache the key and value vectors computed at every previous position. As we saw in the attention section, each token produces three vectors: a query (Q), a key (K), and a value (V). The query is used immediately and discarded. But the keys and values are needed again every time a future token wants to attend to this position.
The KV cache is just a list that grows by one entry each time we process a token:
After processing BOS at position 0:
keys: [ K_bos ]
values: [ V_bos ]
After processing "the" at position 1:
keys: [ K_bos, K_the ]
values: [ V_bos, V_the ]
After processing "cat" at position 2:
keys: [ K_bos, K_the, K_cat ]
values: [ V_bos, V_the, V_cat ]
When the model processes “cat”, its query vector is compared against all cached keys (BOS, “the”, and “cat” itself) to compute attention scores. Those scores determine how much to weight each cached value vector. This is how “cat” can attend to “the” even though “the” was processed in a previous call.
This is also how causal masking works in our implementation: the cache only contains entries for tokens already processed, so there are no future keys to attend to.
We create a fresh, empty cache at the start of each sentence:
export function createKVCache(model: Model): {
keys: Value[][][];
values: Value[][][];
} {
return {
keys: Array.from({ length: model.config.nLayer }, () => []),
values: Array.from({ length: model.config.nLayer }, () => []),
};
}
The type Value[][][] is one list per layer (we have 2 layers, so 2 lists).
Each list starts empty and grows as tokens are processed. A fresh cache means
a fresh sentence — the model has no memory of previous sentences.
Creating the Model
The createModel function allocates all the weight matrices and collects
every individual number into a flat params array:
export function createModel(config: GPTConfig): Model {
const { nLayer, nEmbd, blockSize, vocabSize } = config;
const weights: Weights = {
tokenEmbedding: matrix(vocabSize, nEmbd),
positionEmbedding: matrix(blockSize, nEmbd),
output: matrix(vocabSize, nEmbd),
layers: Array.from({ length: nLayer }, () => ({
attention: {
query: matrix(nEmbd, nEmbd),
key: matrix(nEmbd, nEmbd),
value: matrix(nEmbd, nEmbd),
output: matrix(nEmbd, nEmbd),
},
mlp: {
hidden: matrix(4 * nEmbd, nEmbd),
output: matrix(nEmbd, 4 * nEmbd),
},
})),
};
// Collect all matrices into a flat param array
const allMatrices: Matrix[] = [
weights.tokenEmbedding,
weights.positionEmbedding,
weights.output,
...weights.layers.flatMap((layer) => [
layer.attention.query, layer.attention.key,
layer.attention.value, layer.attention.output,
layer.mlp.hidden, layer.mlp.output,
]),
];
const params = allMatrices.flatMap((mat) => mat.flatMap((row) => row));
return { config, weights, params };
}
The params array is what the optimizer will update during training. The
weights object is a typed view into those same parameters. When training updates
params[i], the corresponding entry in weights changes too, because they
are the same Value objects.
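The aliasing can be seen with a toy object standing in for a Value node:

```typescript
// params and weights reference the same objects, so an optimizer update
// made through params is immediately visible through weights.
const cell = { data: 0.5, grad: 0 };          // stand-in for one Value node
const weights = { tokenEmbedding: [[cell]] }; // typed, nested view
const params = [cell];                        // flat view over the same objects
params[0].data -= 0.1;                        // an "optimizer step"
// weights.tokenEmbedding[0][0].data is now 0.4: same object, same update
```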
Putting It All Together
Every operation (embedding lookup, linear transform, softmax,
rmsnorm, addition, ReLU) is built from Value nodes. The entire forward
pass builds one enormous computation graph. When we call backward() on the
loss, the gradients for all 63,296 parameters are computed in a single sweep
through this graph.
This is what makes neural network training possible: the autograd engine turns the question “how should I change 63,296 numbers to make my predictions better?” into a mechanical, automatic computation.
At this point we have 63,296 random numbers and a blueprint for how to wire them together. The model can process tokens, but its output is nonsense. To make it useful, we need to train it.
Running the Model
The gpt() function wires together every piece from this chapter: it takes a
single token and its position, runs it through the full architecture, and
returns 597 logits. Let’s walk through it before looking at the code.
Step by Step
The function takes five arguments: the model, a token ID, a position, and the KV cache (keys and values) introduced earlier in this chapter.
1. Embedding lookup. Look up the token embedding and the position embedding, and add them together to produce a single 32-element vector:
const tokenVector: Value[] = weights.tokenEmbedding[tokenId];
const positionVector: Value[] = weights.positionEmbedding[posId];
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i]));
At this point hidden is a 32-dim vector that encodes both which word this
is and where it appears in the sentence.
2. Normalize. Apply rmsnorm to keep the numbers at a stable scale:
hidden = rmsnorm(hidden);
3. Transformer layers. The model has 2 layers. Each one does the same thing: attention, then MLP, each with a residual connection.
3a. Attention. Save the hidden state for the residual connection, normalize, then compute query, key, and value vectors. Push the key and value into the KV cache so future tokens can attend to this position. Notice that normalization comes before each block, not after — this is called pre-norm and is what GPT-2 and LLaMA use. The original transformer paper normalized after each block (post-norm), but pre-norm trains more stably because the residual connection carries unnormalized values, giving gradients a cleaner path backward.
const beforeAttention: Value[] = hidden;
hidden = rmsnorm(hidden);
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
keys[li].push(key);
values[li].push(value);
Then split into heads (4 heads of 8 dimensions each). For each head, compute attention scores by comparing this token’s query against all cached keys, run softmax to get weights, and take a weighted sum of the cached values. This is how the model looks at previous tokens to gather context.
Finally, project the concatenated head outputs back to 32 dimensions and add the residual:
hidden = linear(attentionOutput, layer.attention.output);
hidden = hidden.map((h, i) => h.add(beforeAttention[i]));
3b. MLP. Save the hidden state for the residual, normalize, expand from 32 to 128 dimensions, apply ReLU, compress back to 32, and add the residual:
const beforeMLP: Value[] = hidden;
hidden = rmsnorm(hidden);
hidden = linear(hidden, layer.mlp.hidden); // 32 -> 128
hidden = hidden.map((h) => h.relu()); // non-linearity
hidden = linear(hidden, layer.mlp.output); // 128 -> 32
hidden = hidden.map((h, i) => h.add(beforeMLP[i]));
4. Output projection. After all layers, project the 32-dim vector to 597 dimensions — one score per word in the vocabulary:
return linear(hidden, weights.output);
These 597 numbers are the logits: the model’s raw, unnormalized prediction for what word comes next.
The Complete Function
Here is the full gpt() function with all the pieces together:
export function gpt(
model: Model,
tokenId: number,
posId: number,
keys: Value[][][],
values: Value[][][],
): Value[] {
const { nLayer, nHead, headDim } = model.config;
const { weights } = model;
// Step 1: Embedding lookup
// Combine "what word is this?" with "where does it appear?" into a single
// vector. This is the hidden state that flows through the rest of the network.
const tokenVector: Value[] = weights.tokenEmbedding[tokenId];
const positionVector: Value[] = weights.positionEmbedding[posId];
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i]));
// Normalize before the first layer to keep values at a stable scale
hidden = rmsnorm(hidden);
// Step 2: Transformer layers
// Each layer has two blocks: attention (gather context from other tokens)
// followed by MLP (process the gathered information). Both use residual
// connections so the input is added back to the output of each block.
for (let li = 0; li < nLayer; li++) {
const layer = weights.layers[li];
// --- Attention block: look at previous tokens to gather context ---
// Save the hidden state so we can add it back after the block (residual)
const beforeAttention: Value[] = hidden;
hidden = rmsnorm(hidden);
// Project the hidden state into query, key, and value vectors.
// These are structurally identical projections — training teaches them
// to play different roles: query asks "what am I looking for?",
// key advertises "what do I contain?", value carries "what to retrieve".
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
// Cache the key and value so future tokens can attend to this position
keys[li].push(key);
values[li].push(value);
// Each head independently attends to a different slice of the vectors,
// allowing the model to track multiple relationships at once
const attentionOutput: Value[] = [];
for (let h = 0; h < nHead; h++) {
const headStart = h * headDim;
const headQuery = query.slice(headStart, headStart + headDim);
const headKeys = keys[li].map((ki) => ki.slice(headStart, headStart + headDim));
const headValues = values[li].map((vi) => vi.slice(headStart, headStart + headDim));
// Scaled dot-product attention: score = (query · key) / √headDim
const attnLogits = headKeys.map((cachedKey) =>
vsum(headQuery.map((q, j) => q.mul(cachedKey[j]))).div(Math.sqrt(headDim))
);
// Softmax converts scores into weights that sum to 1
const attnWeights = softmax(attnLogits);
// Weighted sum of value vectors — high-scoring positions contribute more
for (let j = 0; j < headDim; j++) {
attentionOutput.push(vsum(attnWeights.map((w, t) => w.mul(headValues[t][j]))));
}
}
// Project concatenated head outputs back to the hidden dimension
hidden = linear(attentionOutput, layer.attention.output);
// Residual connection: add back what we had before attention
hidden = hidden.map((h, i) => h.add(beforeAttention[i]));
// --- MLP block: process each token's representation independently ---
const beforeMLP: Value[] = hidden;
hidden = rmsnorm(hidden);
hidden = linear(hidden, layer.mlp.hidden); // expand to 4x wider
hidden = hidden.map((h) => h.relu()); // nonlinearity
hidden = linear(hidden, layer.mlp.output); // compress back
// Residual connection: add back what we had before the MLP
hidden = hidden.map((h, i) => h.add(beforeMLP[i]));
}
// Step 3: Output projection
// Project the final hidden state to vocabulary size — one score per word.
// These raw scores (logits) will be passed through softmax later to get
// a probability distribution over the next token.
// Note: standard GPT models apply one more normalization here, before this
// linear (a LayerNorm in GPT-2, an RMSNorm in LLaMA). We skip it for simplicity.
return linear(hidden, weights.output);
}
Every operation here uses Value nodes, so the entire forward pass builds a
computation graph. When we call backward() on the loss later, gradients flow
back through every step — from the output projection, through the MLP and
attention blocks, all the way to the embedding lookup.
Complete Code
Here is the complete source code for the model, along with all files from previous chapters.
model.ts
/**
* GPT Model
*
* Defines the model's structure: configuration, weight matrices, and
* parameter collection. Includes the forward pass (gpt) and KV cache
* for inference. Save/load is introduced in a later chapter.
*/
import { Value, vsum } from "./autograd.js";
import { type Matrix, linear, softmax, rmsnorm } from "./nn.js";
import { gauss } from "./rng.js";
// --- Configuration ---
export interface GPTConfig {
nLayer: number;
nEmbd: number;
blockSize: number;
nHead: number;
headDim: number;
vocabSize: number;
}
interface Attention {
query: Matrix;
key: Matrix;
value: Matrix;
output: Matrix;
}
interface MLP {
hidden: Matrix;
output: Matrix;
}
interface Layer {
attention: Attention;
mlp: MLP;
}
export interface Weights {
tokenEmbedding: Matrix;
positionEmbedding: Matrix;
output: Matrix;
layers: Layer[];
}
/** The trained (or untrained) model: config + weights + flattened parameter list. */
export interface Model {
config: GPTConfig;
weights: Weights;
params: Value[];
}
// --- Model Creation ---
function matrix(nout: number, nin: number, std = 0.08): Matrix {
return Array.from({ length: nout }, () =>
Array.from({ length: nin }, () => new Value(gauss(0, std)))
);
}
/** Create a new model with randomly initialized weights. */
export function createModel(config: GPTConfig): Model {
const { nEmbd, nLayer, blockSize, vocabSize } = config;
const weights: Weights = {
tokenEmbedding: matrix(vocabSize, nEmbd),
positionEmbedding: matrix(blockSize, nEmbd),
output: matrix(vocabSize, nEmbd),
layers: Array.from({ length: nLayer }, () => ({
attention: {
query: matrix(nEmbd, nEmbd),
key: matrix(nEmbd, nEmbd),
value: matrix(nEmbd, nEmbd),
output: matrix(nEmbd, nEmbd),
},
mlp: {
hidden: matrix(4 * nEmbd, nEmbd),
output: matrix(nEmbd, 4 * nEmbd),
},
})),
};
const allMatrices: Matrix[] = [
weights.tokenEmbedding,
weights.positionEmbedding,
weights.output,
...weights.layers.flatMap((layer) => [
layer.attention.query, layer.attention.key,
layer.attention.value, layer.attention.output,
layer.mlp.hidden, layer.mlp.output,
]),
];
const params = allMatrices.flatMap((mat) => mat.flatMap((row) => row));
return { config, weights, params };
}
// --- KV Cache ---
/** Create fresh key/value caches for a new sequence. Must be called per-sequence. */
export function createKVCache(model: Model): {
keys: Value[][][];
values: Value[][][];
} {
return {
keys: Array.from({ length: model.config.nLayer }, () => []),
values: Array.from({ length: model.config.nLayer }, () => []),
};
}
// --- Forward Pass ---
/**
* Run one step of the GPT: given a token at a position, return logits
* over the vocabulary for the next token.
*
* The keys/values caches are mutated (appended to) on each call —
* this is the KV cache that avoids recomputing attention for past positions.
*/
export function gpt(
model: Model,
tokenId: number,
posId: number,
keys: Value[][][],
values: Value[][][],
): Value[] {
const { nLayer, nHead, headDim } = model.config;
const { weights } = model;
// Step 1: Embedding lookup
// Combine "what word is this?" with "where does it appear?" into a single
// vector. This is the hidden state that flows through the rest of the network.
const tokenVector: Value[] = weights.tokenEmbedding[tokenId];
const positionVector: Value[] = weights.positionEmbedding[posId];
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i]));
// Normalize before the first layer to keep values at a stable scale
hidden = rmsnorm(hidden);
// Step 2: Transformer layers
// Each layer has two blocks: attention (gather context from other tokens)
// followed by MLP (process the gathered information). Both use residual
// connections so the input is added back to the output of each block.
for (let li = 0; li < nLayer; li++) {
const layer = weights.layers[li];
// --- Attention block: look at previous tokens to gather context ---
// Save the hidden state so we can add it back after the block (residual)
const beforeAttention: Value[] = hidden;
hidden = rmsnorm(hidden);
// Project the hidden state into query, key, and value vectors.
// These are structurally identical projections — training teaches them
// to play different roles: query asks "what am I looking for?",
// key advertises "what do I contain?", value carries "what to retrieve".
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
// Cache the key and value so future tokens can attend to this position
keys[li].push(key);
values[li].push(value);
// Each head independently attends to a different slice of the vectors,
// allowing the model to track multiple relationships at once
const attentionOutput: Value[] = [];
for (let h = 0; h < nHead; h++) {
const headStart = h * headDim;
const headQuery = query.slice(headStart, headStart + headDim);
const headKeys = keys[li].map((ki) => ki.slice(headStart, headStart + headDim));
const headValues = values[li].map((vi) => vi.slice(headStart, headStart + headDim));
// Scaled dot-product attention: score = (query · key) / √headDim
const attnLogits = headKeys.map((cachedKey) =>
vsum(headQuery.map((q, j) => q.mul(cachedKey[j]))).div(Math.sqrt(headDim))
);
// Softmax converts scores into weights that sum to 1
const attnWeights = softmax(attnLogits);
// Weighted sum of value vectors — high-scoring positions contribute more
for (let j = 0; j < headDim; j++) {
attentionOutput.push(vsum(attnWeights.map((w, t) => w.mul(headValues[t][j]))));
}
}
// Project concatenated head outputs back to the hidden dimension
hidden = linear(attentionOutput, layer.attention.output);
// Residual connection: add back what we had before attention
hidden = hidden.map((h, i) => h.add(beforeAttention[i]));
// --- MLP block: process each token's representation independently ---
const beforeMLP: Value[] = hidden;
hidden = rmsnorm(hidden);
hidden = linear(hidden, layer.mlp.hidden); // expand to 4x wider
hidden = hidden.map((h) => h.relu()); // nonlinearity
hidden = linear(hidden, layer.mlp.output); // compress back
// Residual connection: add back what we had before the MLP
hidden = hidden.map((h, i) => h.add(beforeMLP[i]));
}
// Step 3: Output projection
// Project the final hidden state to vocabulary size — one score per word.
// These raw scores (logits) will be passed through softmax later to get
// a probability distribution over the next token.
return linear(hidden, weights.output);
}
rng.ts
/**
* Seeded Random Number Generator
*
* JavaScript has no built-in seeded RNG, so we roll our own:
* - mulberry32 for uniform random numbers
* - Box-Muller transform for Gaussian (normal) distribution
* - Fisher-Yates for array shuffling
* - Weighted random choice for sampling from probability distributions
*/
let _rngState = 0;
export function seed(s: number): void {
_rngState = s | 0;
}
/** Mulberry32: a fast, seedable 32-bit PRNG. Returns a uniform float in [0, 1). */
export function random(): number {
_rngState = (_rngState + 0x6d2b79f5) | 0;
let t = _rngState;
t = Math.imul(t ^ (t >>> 15), t | 1);
t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
}
/** Box-Muller transform: convert two uniform samples into a Gaussian sample. */
export function gauss(mean: number, std: number): number {
let u1 = random();
while (u1 === 0) u1 = random();
const u2 = random();
return mean + std * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}
/** Fisher-Yates shuffle: randomly reorder an array in-place. */
export function shuffle<T>(arr: T[]): void {
for (let i = arr.length - 1; i > 0; i--) {
const j = Math.floor(random() * (i + 1));
[arr[i], arr[j]] = [arr[j], arr[i]];
}
}
/** Sample an index from a weighted distribution. Used for token sampling during inference. */
export function weightedChoice(weights: number[]): number {
const total = weights.reduce((a, b) => a + b, 0);
let r = random() * total;
for (let i = 0; i < weights.length; i++) {
r -= weights[i];
if (r <= 0) return i;
}
return weights.length - 1;
}
nn.ts
/**
* Neural Network Primitives
*
* The building blocks that the GPT architecture assembles:
* - linear: matrix-vector multiply (the fundamental neural net operation)
* - softmax: convert raw scores into a probability distribution
* - rmsnorm: normalize activations to stabilize training
*
* These mirror PyTorch's torch.nn.functional — general-purpose operations
* used by the model, training loop, and inference.
*/
import { Value, vsum } from "./autograd.js";
export type Matrix = Value[][];
/** y = Wx: multiply a weight matrix by an input vector. */
export function linear(input: Value[], weights: Matrix): Value[] {
return weights.map((row) => vsum(row.map((w, i) => w.mul(input[i]))));
}
/** Convert raw logits to probabilities. Subtracts max for numerical stability. */
export function softmax(logits: Value[]): Value[] {
const maxVal = Math.max(...logits.map((v) => v.data));
const exps = logits.map((v) => v.sub(maxVal).exp());
const total = vsum(exps);
return exps.map((e) => e.div(total));
}
/** Root Mean Square normalization: scale activations to unit root-mean-square. */
export function rmsnorm(input: Value[]): Value[] {
const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
const scale = ms.add(1e-5).pow(-0.5);
return input.map((xi) => xi.mul(scale));
}
autograd.ts
/**
* Autograd Engine
*
* A scalar-valued automatic differentiation engine. Each Value node records:
* - its forward-pass result (data)
* - its gradient w.r.t. the loss (grad), filled in by backward()
* - its children in the computation graph and the local derivatives
*
* The backward() method applies the chain rule via reverse-mode autodiff:
* topologically sort the graph, then propagate gradients from output to inputs.
*
* "If I nudge this parameter slightly, how does the loss change?"
*/
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
exp(): Value {
return new Value(Math.exp(this.data), [this], [Math.exp(this.data)]);
}
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
neg(): Value {
return this.mul(-1);
}
sub(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.add(o.neg());
}
div(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.mul(o.pow(-1));
}
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
}
/** Sum a list of Values through the computation graph. */
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
* Builds a word-level vocabulary from the training corpus and wraps each
* encoded sentence in BOS (Beginning of Sequence) delimiters.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
// Assumes every word is in the vocabulary; an unknown word would map to -1
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
Training
We have data, a tokenizer, a model that can turn tokens into predictions, and the math to compute gradients. Now we train.
Training means showing the model sentences and adjusting its parameters to make better predictions. Each step processes one sentence through five stages:
"the cat eats a muffin"
|
[Tokenize] -> [596, 541, 89, 152, 0, 358, 596]
|
[Feed tokens one at a time through gpt()]
| position 0: feed BOS -> predict 597 scores -> target: "the"
| position 1: feed "the" -> predict 597 scores -> target: "cat"
| position 2: feed "cat" -> predict 597 scores -> target: "eats"
| ...
|
[Compute loss] -> how wrong was each prediction?
| -log(probability of the correct word)
|
[Backward pass] -> loss.backward()
| fills in .grad on all 63,296 parameters
|
[Adam optimizer] -> nudge each parameter to reduce the loss
|
repeat 5,000 times
The gpt() function from the model chapter handles the forward pass — turning
a token into 597 scores. This chapter covers what happens around it: measuring
how wrong those predictions are, computing gradients, and updating the weights.
Before training, the model’s predictions are random noise. After 5,000 steps, it has learned enough about English sentence structure to predict the next word with roughly 100x better accuracy than chance.
The Training Configuration
const model = train(
untrained,
{
numSteps: 5000, // how many sentences to learn from
learningRate: 0.01, // how aggressively to update (decays to 0)
beta1: 0.85, // momentum decay rate
beta2: 0.99, // gradient magnitude decay rate
epsAdam: 1e-8, // numerical stability term
},
sentences,
tokenizer,
);
5,000 steps takes a few minutes on a laptop. Each step processes one sentence, building a full computation graph from tokens to loss, running backpropagation, and updating all parameters.
What Each Parameter Controls
numSteps is how many sentences the model trains on. The training loop
cycles through the dataset one sentence per step (sentences[step % sentences.length]). One full pass through every sentence is called an epoch.
If you had ~400 unique sentences, 5,000 steps would mean roughly 12 epochs —
cycling through the data 12 times. Our dataset has 30,000 sentences, so 5,000
steps covers only the first sixth with no cycling at all. You can increase
numSteps or reduce the dataset size to get more epochs, but be careful:
cycling through the same sentences too many times risks overfitting, where the
model memorizes the training data instead of learning general patterns.
learningRate controls how large each parameter update is. At 0.01, each
step nudges parameters by roughly 1% of the gradient signal. This decays
linearly to 0 over training — large steps early to learn quickly, small steps
later to fine-tune. If the learning rate is too high, the model overshoots and
loss oscillates instead of decreasing. Too low and learning is painfully slow.
beta1 and beta2 control the Adam optimizer’s memory. beta1 = 0.85
means momentum remembers 85% of recent gradient directions and incorporates 15%
of the new one — this smooths out noisy gradients. beta2 = 0.99 means the
magnitude tracker is more conservative, updating slowly to get a stable
estimate of each parameter’s typical gradient size.
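A quick standalone sketch shows the smoothing effect: five toy gradients, one of which points the wrong way, run through the momentum update with beta1 = 0.85. (The gradient sequence here is invented for illustration.)

```typescript
// Momentum smoothing in isolation (toy gradient sequence, not model data).
const beta1 = 0.85;
let m = 0;
const grads = [1, 1, -1, 1, 1]; // step 2 is a noisy outlier
const smoothed = grads.map((g) => (m = beta1 * m + (1 - beta1) * g));
// Even at the outlier step, the smoothed direction stays positive,
// because the history contributes 85% and the new gradient only 15%.
```

This is exactly the "momentum keeps pushing left even if one step says right" behavior described above.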
epsAdam is a tiny constant (0.00000001) added to prevent division by zero
in the Adam update. It never meaningfully affects the training — it is purely a
safety net for numerical stability.
The Training Loop
Each iteration of the training loop processes one sentence through five stages: pick a sentence, feed its tokens through the model, measure the error, compute gradients, and update the weights. Let’s walk through each part, then see the complete loop.
1. Pick a Sentence and Tokenize It
const sentence = sentences[step % sentences.length];
const tokens = tokenizer.encode(sentence);
const n = Math.min(model.config.blockSize, tokens.length - 1);
We cycle through the training sentences using modular arithmetic — step 0 picks
sentence 0, step 1 picks sentence 1, and so on, wrapping back to 0 when we
reach the end. The tokenizer turns “the cat eats a muffin” into
[596, 541, 89, 152, 0, 358, 596] — a BOS marker, then each word, then BOS
again.
The variable n is the number of predictions we will make. We subtract 1 from
the token count because the last token has no “next word” to predict. We also
cap at blockSize (16) because the position embedding table only has 16 rows —
the model cannot handle longer sequences.
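Concretely, for the running example sentence (using the token ids from earlier in this chapter):

```typescript
// Step 1 on the running example sentence.
const blockSize = 16;
// "the cat eats a muffin" -> BOS, the, cat, eats, a, muffin, BOS
const tokens = [596, 541, 89, 152, 0, 358, 596];
// 7 tokens, but the last has no next word to predict, so n = 6
const n = Math.min(blockSize, tokens.length - 1);
console.log(n); // 6
```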
2. Feed Tokens and Predict the Next Word
We create a fresh KV cache for each sentence (each sentence is a standalone training example — we don’t carry context between sentences). Then we process the sentence token by token. At each position, the model sees all previous tokens through the cache and outputs 597 scores as its prediction for what comes next.
Because we have the complete sentence, we already know the right answer at every position. That is what makes training possible: we can measure how wrong each prediction is.
Position 0: feed BOS
context: [BOS]
model predicts → 597 scores
correct answer: "the"
Position 1: feed "the"
context: [BOS, "the"]
model predicts → 597 scores
correct answer: "cat"
Position 2: feed "cat"
context: [BOS, "the", "cat"]
model predicts → 597 scores
correct answer: "eats"
Position 3: feed "eats"
context: [BOS, "the", "cat", "eats"]
model predicts → 597 scores
correct answer: "a"
Position 4: feed "a"
context: [BOS, "the", "cat", "eats", "a"]
model predicts → 597 scores
correct answer: "muffin"
Position 5: feed "muffin"
context: [BOS, "the", "cat", "eats", "a", "muffin"]
model predicts → 597 scores
correct answer: BOS (end of sentence)
Each call to gpt() builds a computation graph of Value nodes — thousands of
multiply, add, and relu operations connected by pointers. All six predictions
within a sentence share the same graph (through the KV cache), so a single
backward() call later can propagate gradients through all of them at once.
Early on, the model’s scores are essentially random — “the” might rank 400th out of 597. The loss measures exactly how bad that is.
3. Compute the Loss
The loss measures how wrong the model is. We use cross-entropy loss: for each position, we convert the logits to probabilities (via softmax), then take the negative log of the probability assigned to the correct next token:
const probs = softmax(logits);
losses.push(probs[targetId].log().neg());
Why negative log? Because it has exactly the shape we want:
- When the model assigns probability 1.0 to the right answer: loss = -log(1) = 0. Perfect — no penalty.
- When the model assigns probability 0.5 — uncertain, but in the ballpark: loss = -log(0.5) = 0.693. A gentle nudge.
- When the model assigns probability 0.01 — almost entirely wrong: loss = -log(0.01) = 4.605. A hard shove.
The penalty grows slowly at first, then accelerates toward infinity. This is what pushes the model to avoid being confidently wrong — a model that puts 90% probability on the wrong word gets punished far more than one that spreads its bets.
| Model assigns to correct word | Loss | Meaning |
|---|---|---|
| 0.90 | 0.105 | Low loss — strong, correct prediction |
| 0.50 | 0.693 | Moderate loss — uncertain but reasonable |
| 0.17 | 1.772 | High loss — mostly wrong |
| 0.01 | 4.605 | Very high loss — confidently wrong |
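The table values are just -log(p), which you can reproduce directly:

```typescript
// Reproducing the loss table (natural log, as in cross-entropy).
const loss = (p: number) => -Math.log(p);
console.log(loss(0.9).toFixed(3));  // "0.105"
console.log(loss(0.5).toFixed(3));  // "0.693"
console.log(loss(0.17).toFixed(3)); // "1.772"
console.log(loss(0.01).toFixed(3)); // "4.605"
```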
The total loss for the sentence is the average across all positions:
const loss = vsum(losses).div(n);
Averaging means the model is equally penalized whether the sentence has 3 words or 15 — otherwise long sentences would dominate the gradient signal and short sentences would barely contribute to learning.
4. Compute Gradients (Backward Pass)
One call to backward() fills in the gradient for every parameter:
loss.backward();
backward() returns nothing — it works entirely through side effects. It walks
the computation graph built during the forward pass (all those Value nodes
connected by pointers) and sets the .grad field on every node that
participated in the computation. After it runs, each parameter’s .grad
answers the question: if I increase this parameter by a tiny amount, how much
does the loss change?
A positive gradient means “increasing this parameter makes the loss worse” — so the optimizer will decrease it. A negative gradient means “increasing this parameter makes the loss better” — so the optimizer will increase it. A large gradient means this parameter has a big effect on the current prediction; a small gradient means it barely matters for this sentence.
Sparse Embedding Gradients
Not all 63,296 parameters get meaningful gradients from every sentence. The
layer weights (attention, MLP, output projection) are used in full on every
forward pass — linear() computes a dot product with every row — so they
always receive gradients. But embedding lookup is just array indexing: only the
rows for tokens that appear in the current sentence enter the computation graph.
For a 7-token sentence like “the cat eats a muffin”, only 6 unique token
embedding rows are touched (plus 6 position embedding rows). The remaining 591
token embedding rows keep their .grad at 0 and are not updated this step.
Over the full training run, each word appears in many sentences, so every
embedding row eventually gets trained — but frequent words get far more updates
than rare ones.
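You can count the touched rows directly from the token ids of the running example:

```typescript
// Counting touched embedding rows for the example sentence.
const vocabSize = 597;
const tokens = [596, 541, 89, 152, 0, 358, 596]; // BOS the cat eats a muffin BOS
const fed = tokens.slice(0, -1); // the 6 tokens actually fed to gpt()
const touched = new Set(fed).size;
console.log(touched, vocabSize - touched); // 6 591
```

Only those 6 token embedding rows enter the graph this step; the other 591 receive no gradient.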
5. Update Parameters (Adam Optimizer)
The simplest approach would be param -= learningRate * grad — nudge each
parameter in the direction opposite its gradient. But this has two problems:
- Noisy gradients. Each step only sees one sentence, so the gradient for any single step might point in a slightly wrong direction. We want to smooth out that noise by remembering which direction gradients have been pointing recently.
- One size doesn’t fit all. Some parameters have consistently large gradients (they are sensitive), others have small ones (they are stable). A single learning rate either over-shoots the sensitive ones or under-shoots the stable ones.
The Adam optimizer solves both problems. It maintains two running averages for each parameter:
mBuf[i] = beta1 * mBuf[i] + (1 - beta1) * params[i].grad;
vBuf[i] = beta2 * vBuf[i] + (1 - beta2) * params[i].grad ** 2;
- mBuf (momentum): a running average of the gradient direction. With beta1 = 0.85, each new gradient contributes 15% and the history contributes 85%. This smooths out noise — if the gradient has been pointing left for several steps, the momentum keeps pushing left even if one step says right.
- vBuf (velocity): a running average of the squared gradient magnitude. This tracks how large the gradients have been for this parameter. Parameters with consistently large gradients get a smaller effective step size, and parameters with small gradients get a larger one.
Bias Correction
At the start of training, both mBuf and vBuf are initialized to zero. That
makes the early running averages severely biased toward zero — after one step,
mBuf is only 15% of the first gradient. The bias correction fixes this:
const mHat = mBuf[i] / (1 - beta1 ** (step + 1));
const vHat = vBuf[i] / (1 - beta2 ** (step + 1));
At step 0: 1 - 0.85^1 = 0.15, so mHat is mBuf divided by 0.15 — effectively
undoing the 85% shrinkage. By step 20, 1 - 0.85^21 ≈ 0.97, and the
correction is negligible. This ensures the early updates are the right size
from the start.
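The correction factors are plain arithmetic and easy to check:

```typescript
// Bias-correction denominators at a few steps (beta1 = 0.85).
const beta1 = 0.85;
const mCorrection = (step: number) => 1 - beta1 ** (step + 1);
console.log(mCorrection(0).toFixed(2));  // "0.15"
console.log(mCorrection(20).toFixed(2)); // "0.97"
```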
The Actual Update
params[i].data -= lrT * mHat / (Math.sqrt(vHat) + epsAdam);
params[i].grad = 0; // reset for next step
The update divides the smoothed gradient (mHat) by the square root of the
smoothed squared gradient (vHat). This ratio normalizes the step: parameters
where gradients are consistently large take proportionally smaller steps.
The epsAdam (1e-8) prevents division by zero when vHat is tiny.
Learning Rate Decay
The learning rate also decays linearly from 0.01 to 0 over the course of training:
const lrT = learningRate * (1 - step / numSteps);
At step 0, lrT = 0.01. At step 2500 (halfway), lrT = 0.005. At step 4999,
lrT ≈ 0. Large steps early let the model learn quickly. Small steps later
let it fine-tune without overshooting.
Finally, we reset the gradient to zero so the next sentence starts with a clean slate. Without this, gradients would accumulate across sentences, and each update would be influenced by all previous sentences rather than just the current one.
The Complete Loop
Here is the full training loop with all five stages together:
for (let step = 0; step < numSteps; step++) {
// --- Step 1: Pick a sentence and tokenize ---
// Cycle through training data; the tokenizer wraps the sentence with BOS markers
const sentence = sentences[step % sentences.length];
const tokens = tokenizer.encode(sentence);
const n = Math.min(model.config.blockSize, tokens.length - 1);
// --- Step 2: Forward pass ---
// Feed tokens one at a time, building a computation graph.
// Each gpt() call adds to the KV cache so later tokens can attend to earlier ones.
const { keys, values } = createKVCache(model);
const losses: Value[] = [];
for (let posId = 0; posId < n; posId++) {
const tokenId = tokens[posId];
const targetId = tokens[posId + 1]; // the word we want the model to predict
const logits = gpt(model, tokenId, posId, keys, values);
const probs = softmax(logits);
// Cross-entropy loss: -log(probability assigned to the correct word)
// High probability on the right word → low loss; low probability → high loss
losses.push(probs[targetId].log().neg());
}
// Average loss over the sentence so short and long sentences contribute equally
const loss = vsum(losses).div(n);
// --- Step 3: Backward pass ---
// Walk the computation graph and fill in .grad on every Value node.
// After this, each parameter knows how much it contributed to the loss.
loss.backward();
// --- Step 4: Adam optimizer with linear learning rate decay ---
// Large steps early (lr = 0.01) to learn quickly, decaying to 0 to fine-tune
const lrT = learningRate * (1 - step / numSteps);
for (let i = 0; i < params.length; i++) {
// Update momentum (smoothed gradient direction)
mBuf[i] = beta1 * mBuf[i] + (1 - beta1) * params[i].grad;
// Update velocity (smoothed gradient magnitude)
vBuf[i] = beta2 * vBuf[i] + (1 - beta2) * params[i].grad ** 2;
// Bias correction: counteract the zero-initialization of mBuf/vBuf
const mHat = mBuf[i] / (1 - beta1 ** (step + 1));
const vHat = vBuf[i] / (1 - beta2 ** (step + 1));
// Update: step in the smoothed gradient direction, scaled by the inverse
// of the gradient magnitude so sensitive parameters take smaller steps
params[i].data -= lrT * mHat / (Math.sqrt(vHat) + epsAdam);
params[i].grad = 0; // reset for next sentence
}
}
Every operation in the forward pass uses Value nodes, so the entire path from
token to loss is a single computation graph. When backward() runs, gradients
flow back through every step — from the loss, through softmax, through every
layer’s attention and MLP blocks, all the way to the embedding lookup. That is
the whole training loop: predict, measure, learn, repeat.
Watching It Learn
When training runs, the loss steadily decreases:
step 1 / 5000 | loss 6.3917
step 500 / 5000 | loss 3.2184
step 1000 / 5000 | loss 2.8549
step 2000 / 5000 | loss 2.4012
step 3000 / 5000 | loss 2.1538
step 4000 / 5000 | loss 1.9107
step 5000 / 5000 | loss 1.7623
What the Numbers Mean
The initial loss of ~6.39 corresponds to random guessing: -log(1/597) ≈ 6.39.
The model has no idea which of the 597 words comes next, so it assigns roughly
equal probability to all of them.
As training progresses, the model learns patterns — “the” often follows BOS, “cat” often follows “the”, verbs follow nouns — and the loss drops. Here is what each milestone roughly means:
| Loss | Probability on correct word | What’s happening |
|---|---|---|
| 6.39 | 0.17% (1/597) | Random guessing — no knowledge |
| 3.22 | 4.0% | Learning common words and sentence starters |
| 2.85 | 5.8% | Picking up basic word-order patterns |
| 2.40 | 9.1% | Noun-verb associations forming |
| 1.76 | 17.2% | Solid grasp of sentence structure |
A loss of ~1.76 means the model is, on average, assigning about e^(-1.76) ≈ 17% probability to the correct next word. That may sound low, but remember:
it is choosing among 597 words. Assigning 17% to the right answer means it has
narrowed the field from 597 equally likely candidates to roughly 6 plausible
ones. That is a 100x improvement over where it started.
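The table's probability column is just e to the minus loss, and the "plausible candidates" figure is its reciprocal (the perplexity). A few lines reproduce the numbers:

```typescript
// Convert a cross-entropy loss to the average probability assigned to the
// correct word (e^-loss) and to an effective candidate count (1 / e^-loss).
for (const loss of [6.39, 3.22, 2.85, 2.4, 1.76]) {
  const p = Math.exp(-loss);
  console.log(`loss ${loss} -> ${(100 * p).toFixed(1)}% correct, ~${Math.round(1 / p)} candidates`);
}
```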
Where the Learning Happens
The steepest drop occurs in the first 1,000 steps. This is when the model learns the easiest patterns: which words are common, which words tend to start sentences, and basic category associations (articles before nouns, nouns before verbs). The later steps produce smaller improvements as the model works on harder patterns — less frequent words, longer-range dependencies, and the subtle differences between similar sentence structures.
This is typical of neural network training: easy patterns are learned first, and each subsequent improvement is harder to achieve than the last.
Complete Code
Here is the complete source code for the training loop, along with all files from previous chapters.
train.ts
/**
* Training Loop
*
* Each step processes one sentence through five stages:
* 1. Pick a sentence and tokenize it
* 2. Forward: feed tokens one at a time, predicting the next token at each position
* 3. Compute cross-entropy loss (how surprised the model is at the correct answer)
* 4. Backward: compute gradients via the chain rule
* 5. Adam optimizer: update parameters to reduce loss
*
* Loss starts at ~6.39 (random guessing among 597 tokens = -log(1/597))
* and decreases to ~1.76 over 5,000 steps.
*/
import { Value, vsum } from "./autograd.js";
import { softmax } from "./nn.js";
import { type Model, gpt, createKVCache } from "./model.js";
import type { Tokenizer } from "./tokenizer.js";
export interface TrainConfig {
numSteps: number;
learningRate: number;
beta1: number;
beta2: number;
epsAdam: number;
}
/** Train the model on the dataset and return the trained model. */
export function train(
model: Model,
trainConfig: TrainConfig,
sentences: string[],
tokenizer: Tokenizer,
): Model {
const { numSteps, learningRate, beta1, beta2, epsAdam } = trainConfig;
const { params } = model;
// Adam optimizer maintains two running averages per parameter:
// mBuf: smoothed gradient direction (momentum)
// vBuf: smoothed gradient magnitude (for adaptive step sizes)
const mBuf = new Float64Array(params.length);
const vBuf = new Float64Array(params.length);
for (let step = 0; step < numSteps; step++) {
// --- Step 1: Pick a sentence and tokenize ---
// Cycle through training data; the tokenizer wraps the sentence with BOS markers
const sentence = sentences[step % sentences.length];
const tokens = tokenizer.encode(sentence);
const n = Math.min(model.config.blockSize, tokens.length - 1);
// --- Step 2: Forward pass ---
// Feed tokens one at a time, building a computation graph.
// Each gpt() call adds to the KV cache so later tokens can attend to earlier ones.
const { keys, values } = createKVCache(model);
const losses: Value[] = [];
for (let posId = 0; posId < n; posId++) {
const tokenId = tokens[posId];
const targetId = tokens[posId + 1]; // the word we want the model to predict
const logits = gpt(model, tokenId, posId, keys, values);
const probs = softmax(logits);
// Cross-entropy loss: -log(probability assigned to the correct word)
// High probability on the right word → low loss; low probability → high loss
losses.push(probs[targetId].log().neg());
}
// Average loss over the sentence so short and long sentences contribute equally
const loss = vsum(losses).div(n);
// --- Step 3: Backward pass ---
// Walk the computation graph and fill in .grad on every Value node.
// After this, each parameter knows how much it contributed to the loss.
loss.backward();
// --- Step 4: Adam optimizer with linear learning rate decay ---
// Large steps early (lr = 0.01) to learn quickly, decaying to 0 to fine-tune
const lrT = learningRate * (1 - step / numSteps);
for (let i = 0; i < params.length; i++) {
// Update momentum (smoothed gradient direction)
mBuf[i] = beta1 * mBuf[i] + (1 - beta1) * params[i].grad;
// Update velocity (smoothed gradient magnitude)
vBuf[i] = beta2 * vBuf[i] + (1 - beta2) * params[i].grad ** 2;
// Bias correction: counteract the zero-initialization of mBuf/vBuf
const mHat = mBuf[i] / (1 - beta1 ** (step + 1));
const vHat = vBuf[i] / (1 - beta2 ** (step + 1));
// Update: step in the smoothed gradient direction, scaled by the inverse
// of the gradient magnitude so sensitive parameters take smaller steps
params[i].data -= lrT * mHat / (Math.sqrt(vHat) + epsAdam);
params[i].grad = 0; // reset for next sentence
}
process.stdout.write(
`\rstep ${String(step + 1).padStart(4)} / ${numSteps} | loss ${loss.data.toFixed(4)}`
);
}
return model;
}
model.ts
/**
* GPT Model
*
* The transformer architecture: a function that maps input tokens to a
* probability distribution over what comes next.
*
* Follows GPT-2 with minor simplifications:
* - RMSNorm instead of LayerNorm
* - No biases
* - ReLU instead of GeLU
*
* Config: 32 embedding dims, 4 attention heads, 2 layers, 16 max context
* → 63,296 parameters total
*/
import { readFileSync, writeFileSync } from "node:fs";
import { Value, vsum } from "./autograd.js";
import { type Matrix, linear, softmax, rmsnorm } from "./nn.js";
import { gauss } from "./rng.js";
// --- Configuration ---
export interface GPTConfig {
nLayer: number;
nEmbd: number;
blockSize: number;
nHead: number;
headDim: number;
vocabSize: number;
}
interface Attention {
query: Matrix;
key: Matrix;
value: Matrix;
output: Matrix;
}
interface MLP {
hidden: Matrix;
output: Matrix;
}
interface Layer {
attention: Attention;
mlp: MLP;
}
export interface Weights {
tokenEmbedding: Matrix;
positionEmbedding: Matrix;
output: Matrix;
layers: Layer[];
}
/** The trained (or untrained) model: config + weights + flattened parameter list. */
export interface Model {
config: GPTConfig;
weights: Weights;
params: Value[];
}
// --- Model Creation ---
function matrix(nout: number, nin: number, std = 0.08): Matrix {
return Array.from({ length: nout }, () =>
Array.from({ length: nin }, () => new Value(gauss(0, std)))
);
}
/** Create a new model with randomly initialized weights. */
export function createModel(config: GPTConfig): Model {
const { nEmbd, nLayer, blockSize, vocabSize } = config;
const weights: Weights = {
tokenEmbedding: matrix(vocabSize, nEmbd),
positionEmbedding: matrix(blockSize, nEmbd),
output: matrix(vocabSize, nEmbd),
layers: Array.from({ length: nLayer }, () => ({
attention: {
query: matrix(nEmbd, nEmbd),
key: matrix(nEmbd, nEmbd),
value: matrix(nEmbd, nEmbd),
output: matrix(nEmbd, nEmbd),
},
mlp: {
hidden: matrix(4 * nEmbd, nEmbd),
output: matrix(nEmbd, 4 * nEmbd),
},
})),
};
const allMatrices: Matrix[] = [
weights.tokenEmbedding,
weights.positionEmbedding,
weights.output,
...weights.layers.flatMap((layer) => [
layer.attention.query, layer.attention.key,
layer.attention.value, layer.attention.output,
layer.mlp.hidden, layer.mlp.output,
]),
];
const params = allMatrices.flatMap((mat) => mat.flatMap((row) => row));
return { config, weights, params };
}
// --- Save / Load ---
/** Save a trained model to a JSON file (config + parameter values). */
export function saveModel(model: Model, path: string): void {
const data = {
config: model.config,
weights: model.params.map((p) => p.data),
};
writeFileSync(path, JSON.stringify(data));
}
/** Load a model from a JSON file. Recreates the structure, then fills in the learned weights. */
export function loadModel(path: string): Model {
const data = JSON.parse(readFileSync(path, "utf-8"));
const model = createModel(data.config);
for (let i = 0; i < model.params.length; i++) {
model.params[i].data = data.weights[i];
}
return model;
}
// --- KV Cache ---
/** Create fresh key/value caches for a new sequence. Must be called per-sequence. */
export function createKVCache(model: Model): {
keys: Value[][][];
values: Value[][][];
} {
return {
keys: Array.from({ length: model.config.nLayer }, () => []),
values: Array.from({ length: model.config.nLayer }, () => []),
};
}
// --- Forward Pass ---
/**
* Run one step of the GPT: given a token at a position, return logits
* over the vocabulary for the next token.
*
* The keys/values caches are mutated (appended to) on each call —
* this is the KV cache that avoids recomputing attention for past positions.
*/
export function gpt(
model: Model,
tokenId: number,
posId: number,
keys: Value[][][],
values: Value[][][],
): Value[] {
const { nLayer, nHead, headDim } = model.config;
const { weights } = model;
// Step 1: Embedding lookup
// Combine "what word is this?" with "where does it appear?" into a single
// vector. This is the hidden state that flows through the rest of the network.
const tokenVector: Value[] = weights.tokenEmbedding[tokenId];
const positionVector: Value[] = weights.positionEmbedding[posId];
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i]));
// Normalize before the first layer to keep values at a stable scale
hidden = rmsnorm(hidden);
// Step 2: Transformer layers
// Each layer has two blocks: attention (gather context from other tokens)
// followed by MLP (process the gathered information). Both use residual
// connections so the input is added back to the output of each block.
for (let li = 0; li < nLayer; li++) {
const layer = weights.layers[li];
// --- Attention block: look at previous tokens to gather context ---
// Save the hidden state so we can add it back after the block (residual)
const beforeAttention: Value[] = hidden;
hidden = rmsnorm(hidden);
// Project the hidden state into query, key, and value vectors.
// These are structurally identical projections — training teaches them
// to play different roles: query asks "what am I looking for?",
// key advertises "what do I contain?", value carries "what to retrieve".
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
// Cache the key and value so future tokens can attend to this position
keys[li].push(key);
values[li].push(value);
// Each head independently attends to a different slice of the vectors,
// allowing the model to track multiple relationships at once
const attentionOutput: Value[] = [];
for (let h = 0; h < nHead; h++) {
const headStart = h * headDim;
const headQuery = query.slice(headStart, headStart + headDim);
const headKeys = keys[li].map((ki) => ki.slice(headStart, headStart + headDim));
const headValues = values[li].map((vi) => vi.slice(headStart, headStart + headDim));
// Scaled dot-product attention: score = (query · key) / √headDim
const attnLogits = headKeys.map((cachedKey) =>
vsum(headQuery.map((q, j) => q.mul(cachedKey[j]))).div(Math.sqrt(headDim))
);
// Softmax converts scores into weights that sum to 1
const attnWeights = softmax(attnLogits);
// Weighted sum of value vectors — high-scoring positions contribute more
for (let j = 0; j < headDim; j++) {
attentionOutput.push(vsum(attnWeights.map((w, t) => w.mul(headValues[t][j]))));
}
}
// Project concatenated head outputs back to the hidden dimension
hidden = linear(attentionOutput, layer.attention.output);
// Residual connection: add back what we had before attention
hidden = hidden.map((h, i) => h.add(beforeAttention[i]));
// --- MLP block: process each token's representation independently ---
const beforeMLP: Value[] = hidden;
hidden = rmsnorm(hidden);
hidden = linear(hidden, layer.mlp.hidden); // expand to 4x wider
hidden = hidden.map((h) => h.relu()); // nonlinearity
hidden = linear(hidden, layer.mlp.output); // compress back
// Residual connection: add back what we had before the MLP
hidden = hidden.map((h, i) => h.add(beforeMLP[i]));
}
// Step 3: Output projection
// Project the final hidden state to vocabulary size — one score per word.
// These raw scores (logits) will be passed through softmax later to get
// a probability distribution over the next token.
return linear(hidden, weights.output);
}
rng.ts
/**
* Seeded Random Number Generator
*
* JavaScript has no built-in seeded RNG, so we roll our own:
* - mulberry32 for uniform random numbers
* - Box-Muller transform for Gaussian (normal) distribution
* - Fisher-Yates for array shuffling
* - Weighted random choice for sampling from probability distributions
*/
let _rngState = 0;
export function seed(s: number): void {
_rngState = s | 0;
}
/** Mulberry32: a fast, seedable 32-bit PRNG. Returns a uniform float in [0, 1). */
export function random(): number {
_rngState = (_rngState + 0x6d2b79f5) | 0;
let t = _rngState;
t = Math.imul(t ^ (t >>> 15), t | 1);
t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
}
/** Box-Muller transform: convert two uniform samples into a Gaussian sample. */
export function gauss(mean: number, std: number): number {
let u1 = random();
while (u1 === 0) u1 = random();
const u2 = random();
return mean + std * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}
/** Fisher-Yates shuffle: randomly reorder an array in-place. */
export function shuffle<T>(arr: T[]): void {
for (let i = arr.length - 1; i > 0; i--) {
const j = Math.floor(random() * (i + 1));
[arr[i], arr[j]] = [arr[j], arr[i]];
}
}
/** Sample an index from a weighted distribution. Used for token sampling during inference. */
export function weightedChoice(weights: number[]): number {
const total = weights.reduce((a, b) => a + b, 0);
let r = random() * total;
for (let i = 0; i < weights.length; i++) {
r -= weights[i];
if (r <= 0) return i;
}
return weights.length - 1;
}
nn.ts
/**
* Neural Network Primitives
*
* The building blocks that the GPT architecture assembles:
* - linear: matrix-vector multiply (the fundamental neural net operation)
* - softmax: convert raw scores into a probability distribution
* - rmsnorm: normalize activations to stabilize training
*
* These mirror PyTorch's torch.nn.functional — general-purpose operations
* used by the model, training loop, and inference.
*/
import { Value, vsum } from "./autograd.js";
export type Matrix = Value[][];
/** y = Wx: multiply a weight matrix by an input vector. */
export function linear(input: Value[], weights: Matrix): Value[] {
return weights.map((row) => vsum(row.map((w, i) => w.mul(input[i]))));
}
/** Convert raw logits to probabilities. Subtracts max for numerical stability. */
export function softmax(logits: Value[]): Value[] {
const maxVal = Math.max(...logits.map((v) => v.data));
const exps = logits.map((v) => v.sub(maxVal).exp());
const total = vsum(exps);
return exps.map((e) => e.div(total));
}
/** Root Mean Square normalization: scale activations to unit root-mean-square. */
export function rmsnorm(input: Value[]): Value[] {
const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
const scale = ms.add(1e-5).pow(-0.5);
return input.map((xi) => xi.mul(scale));
}
autograd.ts
/**
* Autograd Engine
*
* A scalar-valued automatic differentiation engine. Each Value node records:
* - its forward-pass result (data)
* - its gradient w.r.t. the loss (grad), filled in by backward()
* - its children in the computation graph and the local derivatives
*
* The backward() method applies the chain rule via reverse-mode autodiff:
* topologically sort the graph, then propagate gradients from output to inputs.
*
* "If I nudge this parameter slightly, how does the loss change?"
*/
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
exp(): Value {
return new Value(Math.exp(this.data), [this], [Math.exp(this.data)]);
}
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
neg(): Value {
return this.mul(-1);
}
sub(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.add(o.neg());
}
div(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.mul(o.pow(-1));
}
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
}
/** Sum a list of Values through the computation graph. */
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
* Builds a word-level vocabulary from the training corpus, with a BOS
 * (Beginning of Sequence) delimiter added on each side.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
Saving the Model
Training is expensive. We do not want to retrain every time we want to generate sentences. So we save the trained model to disk and load it later.
What Gets Saved
A model is just configuration (the shape) plus parameters (the learned numbers). We save both as JSON:
export function saveModel(model: Model, path: string): void {
const data = {
config: model.config,
weights: model.params.map((p) => p.data),
};
writeFileSync(path, JSON.stringify(data));
}
The config tells us the architecture (nLayer: 2, nEmbd: 32, ...) and the
weights array holds all 63,296 parameter values as plain numbers. The file is
about 1.5 MB of JSON.
Loading the Model
To load, we recreate the model structure from the config, then fill in the learned weights:
export function loadModel(path: string): Model {
const data = JSON.parse(readFileSync(path, "utf-8"));
const model = createModel(data.config);
for (let i = 0; i < model.params.length; i++) {
model.params[i].data = data.weights[i];
}
return model;
}
createModel(data.config) allocates all the weight matrices with random values
(just like before training), then we overwrite every parameter with the learned
value. The ordering is deterministic (createModel always creates matrices in
the same order), so the flat weights array maps back to the correct matrices.
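The same idea works in miniature. A standalone sketch (hypothetical `FlatModel` type, not from the tutorial's files): as long as flattening and rebuilding visit the matrix in the same order, the round trip is lossless.

```typescript
// Miniature version of saveModel/loadModel: a matrix becomes (shape + flat
// array), and deterministic ordering makes the round trip exact.
interface FlatModel {
  rows: number;
  cols: number;
  weights: number[];
}

function save(m: number[][]): FlatModel {
  return { rows: m.length, cols: m[0].length, weights: m.flat() };
}

function load(f: FlatModel): number[][] {
  return Array.from({ length: f.rows }, (_, r) =>
    f.weights.slice(r * f.cols, (r + 1) * f.cols),
  );
}

const original = [[0.1, -0.3], [0.8, 0.02]];
const restored = load(save(original));
// restored is element-for-element identical to original
```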
Why This Matters
This separation (train once, generate many times) is how all production LLMs work. Training a large model can cost hundreds of millions of dollars in compute. But once trained, the model weights are saved and can be loaded for inference on any machine. The training code, the optimizer state, the gradient buffers: none of that is needed for inference. Just the architecture and the numbers.
Training (expensive, done once):
train -> phrases-model.json
Inference (cheap, done many times):
phrases-model.json -> generate -> sentences
We will wire up these scripts in the smoke test chapter.
Generation
The model is trained. The weights are saved. Now we load them and generate new sentences.
Generation works one word at a time. Start with the BOS token, feed it to the model, get 597 scores back, pick a word, feed that word back in, and repeat. Each step asks the same question: given everything so far, what word comes next? The output of one step becomes the input to the next — this is called autoregressive generation.
The interesting problem is how to pick the next word. The model outputs 597 raw scores (logits), but choosing the highest-scoring word every time produces dull, repetitive text. Choosing completely at random produces nonsense. The sampling strategy — temperature, top-k, and top-p — controls where the output lands between those extremes.
BOS → model → 597 logits → temperature → top-k → top-p → softmax → sample → next token
 ↑                                                                               │
 └──────────────── the sampled token feeds back in as the next input ────────────┘
This chapter covers the generation loop, the KV cache that makes it efficient, and the sampling knobs that shape the output.
The Generation Loop
The loop starts with the BOS token and repeatedly asks the model for the next word:
const { keys, values } = createKVCache(model); // fresh KV cache per sequence
let tokenId = tokenizer.BOS; // start with BOS
const tokens: number[] = [];
for (let posId = 0; posId < model.config.blockSize; posId++) {
const logits = gpt(model, tokenId, posId, keys, values);
// 1. Temperature scaling
let scores = logits.map((l) => l.data / temperature);
// 2. Top-k: keep only the k highest scores
if (topK > 0 && topK < scores.length) {
const sorted = [...scores].sort((a, b) => b - a);
const cutoff = sorted[topK - 1]; // the k-th highest score
scores = scores.map((s) => (s >= cutoff ? s : -Infinity));
}
// 3. Top-p: keep smallest set whose probabilities sum to p
if (topP < 1.0) { /* accumulate sorted probs, zero out the rest */ }
// 4. Softmax and sample (plain math — no autograd needed at inference)
const maxS = Math.max(...scores.filter((s) => s !== -Infinity));
const exps = scores.map((s) => (s === -Infinity ? 0 : Math.exp(s - maxS)));
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((e) => e / total);
tokenId = weightedChoice(probs);
if (tokenId === tokenizer.BOS) break;
tokens.push(tokenId);
}
return tokenizer.decode(tokens);
The full pipeline for turning logits into a sampled token:
597 logits -> temperature scaling -> top-k filter -> top-p filter -> softmax -> weighted sample
Step by Step
1. Forward pass. Feed the current token and position into the model. It returns 597 logits, one raw score per word in the vocabulary.
2. Temperature scaling. Divide all logits by the temperature. This controls how “creative” the model is:
| Temperature | Effect |
|---|---|
| 0.3 (low) | Sharper probabilities. Model picks the most likely word almost every time. Repetitive but safe. |
| 0.8 (medium) | Balanced. Some variety, mostly coherent. |
| 1.5 (high) | Flatter probabilities. Model considers unlikely words. More creative but may produce nonsense. |
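To see the effect on concrete numbers, here is a standalone sketch (plain-number softmax, not the tutorial's `nn.ts`) of the same logits at three temperatures:

```typescript
// Temperature scaling: dividing logits by T < 1 sharpens the softmax,
// dividing by T > 1 flattens it.
function softmaxNums(xs: number[]): number[] {
  const max = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const logits = [2.0, 1.0, 0.1];
for (const t of [0.3, 0.8, 1.5]) {
  const probs = softmaxNums(logits.map((l) => l / t));
  console.log(`T=${t}:`, probs.map((p) => p.toFixed(2)).join(" "));
}
```

At T=0.3 the top word takes almost all of the probability mass; at T=1.5 the three options are much closer together.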
3. Top-k filter. If topK is set (e.g., 10), sort the scores and set
everything outside the top k to negative infinity. Those words get zero
probability after softmax and can never be chosen.
597 scores -> keep top 10 -> set 587 to -infinity
This hard cutoff prevents the model from ever picking a very unlikely word. The downside is that a fixed k does not adapt. Sometimes the model is very confident and only 2-3 words make sense; other times, 50 words are plausible.
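On a concrete score vector (a standalone sketch, not the tutorial's `generate.ts`), the filter looks like this:

```typescript
// Top-k filter: scores below the k-th highest become -Infinity, so they
// receive zero probability after softmax and can never be sampled.
function topKFilter(scores: number[], k: number): number[] {
  const sorted = [...scores].sort((a, b) => b - a);
  const cutoff = sorted[k - 1]; // the k-th highest score
  return scores.map((s) => (s >= cutoff ? s : -Infinity));
}

const filtered = topKFilter([3.1, -0.2, 1.7, 0.4, 2.2], 3);
// survivors: 3.1, 1.7, 2.2; the other two scores become -Infinity
```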
4. Top-p filter (nucleus sampling). If topP is set (e.g., 0.9),
compute probabilities from the current scores, sort by probability, and
accumulate until the sum reaches p. Zero out everything past the cutoff.
scores -> softmax -> sort -> accumulate until sum >= 0.9 -> zero out the rest
If the model is confident (“the” has probability 0.85), maybe only 2-3 words survive. If the model is uncertain and the distribution is flat, maybe 50 words survive. Top-p adapts to the model’s confidence at each position, which is why it is generally preferred over top-k.
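A standalone sketch of the adaptive behavior (operating directly on probabilities for clarity; the tutorial's `generate.ts` filters scores instead): given a confident distribution, only two entries survive a p of 0.9.

```typescript
// Nucleus (top-p) filter on a probability vector: keep the smallest set of
// highest-probability entries whose cumulative sum exceeds p; zero the rest.
function topPFilter(probs: number[], p: number): number[] {
  const order = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const kept = new Set<number>();
  let cum = 0;
  for (const i of order) {
    kept.add(i); // the entry that crosses the threshold is itself kept
    cum += probs[i];
    if (cum > p) break;
  }
  return probs.map((q, i) => (kept.has(i) ? q : 0));
}

const confident = topPFilter([0.85, 0.07, 0.05, 0.02, 0.01], 0.9);
// only 0.85 and 0.07 survive (0.85 alone does not exceed 0.9)
```

With a flatter input, say five entries of 0.2 each, the same call would keep five entries: the cutoff adapts to the distribution.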
5. Softmax and sample. Convert the filtered scores to a probability distribution and randomly pick a word weighted by those probabilities. The word “the” might have probability 0.3 and “a” might have 0.15. The model does not always pick the top choice, which gives the output variety.
6. Check for BOS. If the sampled token is BOS, the sentence is done. Otherwise, feed this new token back in and repeat.
7. Decode. Convert the collected token IDs back to words.
Comparing Sampling Strategies
| Strategy | What it does | Trade-off |
|---|---|---|
| Temperature | Reshapes the entire distribution | Simple, but low-probability junk words can still be picked |
| Top-k | Hard cutoff at k words | Fixed k does not adapt to model confidence |
| Top-p | Adaptive cutoff by cumulative probability | Handles both confident and uncertain positions well |
In practice, these are combined. A typical production configuration might use
temperature=0.7, top_p=0.9, and top_k=50 together. Temperature reshapes
the distribution, then top-k and top-p trim the tail. Our implementation
supports all three, individually or combined.
The KV Cache
Generation uses the same KV cache introduced in the model chapter. We
create a fresh cache before generating and pass it to every gpt() call:
const { keys, values } = createKVCache(model);
Each call appends the current token’s key and value vectors to the cache, so the next call can attend to everything generated so far. The model never reprocesses old tokens — it just reads their cached K/V vectors.
During training, the KV cache saves redundant work within a single sentence. During generation, it is essential: without it, generating a 16-token sentence would mean reprocessing the entire sequence from scratch at every position — 1 + 2 + 3 + … + 16 = 136 forward passes instead of 16.
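The arithmetic generalizes: without the cache, generating n tokens costs n(n+1)/2 token-forwards, versus n with it. A quick check of the claim above:

```typescript
// Forward-pass count for an n-token sequence without a KV cache: position t
// must reprocess all t tokens seen so far, so the total is 1 + 2 + ... + n.
function forwardsWithoutCache(n: number): number {
  let total = 0;
  for (let t = 1; t <= n; t++) total += t;
  return total;
}

console.log(forwardsWithoutCache(16)); // 136, versus 16 with the cache
```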
Example Output
With temperature 0.8, the trained model produces sentences like:
1. the brown horse sits
2. we fly to the store
3. the girl reads a card
4. mom cooked the muffin
5. the kitten runs fast
6. seven ducks swim at dusk
7. i like the tall tree
8. the cow eats
These are not sentences from the training data. The model invented them. But they follow the same patterns: articles before nouns, verbs in the right place, reasonable word combinations. The model learned the structure of simple English from just 5,000 training steps.
Complete Code
Here is the complete source code for text generation, along with all files from previous chapters.
generate.ts
/**
* Inference / Text Generation
*
* Starting from the BOS token, the model predicts one token at a time:
* 1. Forward the current token through the model to get logits
* 2. Apply temperature scaling (lower = more conservative, higher = more creative)
* 3. Apply top-k filtering (keep only the k most likely tokens)
* 4. Apply top-p / nucleus filtering (keep smallest set summing to p)
* 5. Convert to probabilities via softmax
* 6. Randomly sample the next token from that distribution
* 7. Stop when BOS is produced (end of sequence) or max length is reached
*/
import { type Model, gpt, createKVCache } from "./model.js";
import { weightedChoice } from "./rng.js";
import type { Tokenizer } from "./tokenizer.js";
export interface GenerateOptions {
temperature?: number;
topK?: number;
topP?: number;
}
/** Generate new text samples from a trained model. */
export function generate(
model: Model,
tokenizer: Tokenizer,
numSamples: number,
options: GenerateOptions = {},
): string[] {
const { temperature = 0.8, topK = 0, topP = 1.0 } = options;
const samples: string[] = [];
for (let i = 0; i < numSamples; i++) {
const { keys, values } = createKVCache(model);
let tokenId = tokenizer.BOS;
const tokens: number[] = [];
for (let posId = 0; posId < model.config.blockSize; posId++) {
const logits = gpt(model, tokenId, posId, keys, values);
// Temperature scaling
let scores = logits.map((l) => l.data / temperature);
// Top-k: keep only the k highest scores
if (topK > 0 && topK < scores.length) {
const sorted = [...scores].sort((a, b) => b - a);
const cutoff = sorted[topK - 1];
scores = scores.map((s) => (s >= cutoff ? s : -Infinity));
}
// Top-p (nucleus): keep smallest set whose probabilities sum to p
if (topP < 1.0) {
const indices = scores.map((s, idx) => idx);
indices.sort((a, b) => scores[b] - scores[a]);
// Compute softmax on current scores to get probabilities for filtering
const maxS = Math.max(...scores.filter((s) => s !== -Infinity));
const exps = scores.map((s) => (s === -Infinity ? 0 : Math.exp(s - maxS)));
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((e) => e / total);
let cumSum = 0;
let exceeded = false;
for (const idx of indices) {
if (exceeded) {
scores[idx] = -Infinity;
}
cumSum += probs[idx];
if (cumSum > topP) {
exceeded = true;
}
}
}
// Softmax and sample (using plain numbers, not Value, for efficiency)
const maxS = Math.max(...scores.filter((s) => s !== -Infinity));
const exps = scores.map((s) => (s === -Infinity ? 0 : Math.exp(s - maxS)));
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((e) => e / total);
tokenId = weightedChoice(probs);
if (tokenId === tokenizer.BOS) break;
tokens.push(tokenId);
}
samples.push(tokenizer.decode(tokens));
}
return samples;
}
train.ts
/**
* Training Loop
*
* Each step:
* 1. Pick a sentence and tokenize it
* 2. Forward each token through the model, predicting the next token
* 3. Compute cross-entropy loss (how surprised the model is at the correct answer)
* 4. Backward pass: compute gradients via the chain rule
* 5. Adam optimizer: update parameters to reduce loss
*
* Loss starts at ~6.39 (random guessing among 597 tokens = -log(1/597))
* and decreases to ~1.76 over 5,000 steps.
*/
import { Value, vsum } from "./autograd.js";
import { softmax } from "./nn.js";
import { type Model, gpt, createKVCache } from "./model.js";
import type { Tokenizer } from "./tokenizer.js";
export interface TrainConfig {
numSteps: number;
learningRate: number;
beta1: number;
beta2: number;
epsAdam: number;
}
/** Train the model on the dataset and return the trained model. */
export function train(
model: Model,
trainConfig: TrainConfig,
sentences: string[],
tokenizer: Tokenizer,
): Model {
const { numSteps, learningRate, beta1, beta2, epsAdam } = trainConfig;
const { params } = model;
// Adam optimizer moment buffers
const mBuf = new Float64Array(params.length);
const vBuf = new Float64Array(params.length);
for (let step = 0; step < numSteps; step++) {
// Tokenize a single sentence, surrounded by BOS on both sides
const sentence = sentences[step % sentences.length];
const tokens = tokenizer.encode(sentence);
const n = Math.min(model.config.blockSize, tokens.length - 1);
// Forward: build the computation graph from tokens to loss
const { keys, values } = createKVCache(model);
const losses: Value[] = [];
for (let posId = 0; posId < n; posId++) {
const tokenId = tokens[posId];
const targetId = tokens[posId + 1];
const logits = gpt(model, tokenId, posId, keys, values);
const probs = softmax(logits);
losses.push(probs[targetId].log().neg());
}
// Average cross-entropy loss over the sequence
const loss = vsum(losses).div(n);
// Backward: walk the computation graph and fill in .grad on every Value node.
// Returns void — gradients are stored as a side effect on the same param objects
// that the Adam loop below reads from.
loss.backward();
// Adam update with linear learning rate decay
const lrT = learningRate * (1 - step / numSteps);
for (let i = 0; i < params.length; i++) {
mBuf[i] = beta1 * mBuf[i] + (1 - beta1) * params[i].grad;
vBuf[i] = beta2 * vBuf[i] + (1 - beta2) * params[i].grad ** 2;
const mHat = mBuf[i] / (1 - beta1 ** (step + 1));
const vHat = vBuf[i] / (1 - beta2 ** (step + 1));
params[i].data -= lrT * mHat / (Math.sqrt(vHat) + epsAdam);
params[i].grad = 0;
}
process.stdout.write(
`\rstep ${String(step + 1).padStart(4)} / ${numSteps} | loss ${loss.data.toFixed(4)}`
);
}
return model;
}
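Worked by hand, the Adam update above is just a few multiplies. A minimal sketch of one step for a single parameter on plain numbers (the names m, v, mHat, vHat mirror the buffers in train.ts; the hyperparameters are the tutorial's defaults):

```typescript
// One Adam step for a single parameter, mirroring the loop in train.ts.
const beta1 = 0.85, beta2 = 0.99, eps = 1e-8, lr = 0.01;

let param = 0.5;  // current parameter value
const grad = 0.2; // gradient filled in by backward()

// Moment buffers start at zero, so after one step:
const m = beta1 * 0 + (1 - beta1) * grad;      // 0.03
const v = beta2 * 0 + (1 - beta2) * grad ** 2; // 0.0004

// Bias correction: early steps would otherwise be biased toward zero
const mHat = m / (1 - beta1 ** 1); // back to 0.2
const vHat = v / (1 - beta2 ** 1); // back to 0.04

param -= (lr * mHat) / (Math.sqrt(vHat) + eps);
// On step 1 the update magnitude is ≈ lr regardless of gradient scale,
// because mHat / sqrt(vHat) = grad / |grad| = ±1
```

This is why Adam's early steps are well behaved even before the moment estimates have warmed up.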
model.ts
/**
* GPT Model
*
* The transformer architecture: a function that maps input tokens to a
* probability distribution over what comes next.
*
* Follows GPT-2 with minor simplifications:
* - RMSNorm instead of LayerNorm
* - No biases
* - ReLU instead of GeLU
*
* Config: 32 embedding dims, 4 attention heads, 2 layers, 16 max context
* → 63,296 parameters total
*/
import { readFileSync, writeFileSync } from "node:fs";
import { Value, vsum } from "./autograd.js";
import { type Matrix, linear, softmax, rmsnorm } from "./nn.js";
import { gauss } from "./rng.js";
// --- Configuration ---
export interface GPTConfig {
nLayer: number;
nEmbd: number;
blockSize: number;
nHead: number;
headDim: number;
vocabSize: number;
}
interface Attention {
query: Matrix;
key: Matrix;
value: Matrix;
output: Matrix;
}
interface MLP {
hidden: Matrix;
output: Matrix;
}
interface Layer {
attention: Attention;
mlp: MLP;
}
export interface Weights {
tokenEmbedding: Matrix;
positionEmbedding: Matrix;
output: Matrix;
layers: Layer[];
}
/** The trained (or untrained) model: config + weights + flattened parameter list. */
export interface Model {
config: GPTConfig;
weights: Weights;
params: Value[];
}
// --- Model Creation ---
function matrix(nout: number, nin: number, std = 0.08): Matrix {
return Array.from({ length: nout }, () =>
Array.from({ length: nin }, () => new Value(gauss(0, std)))
);
}
/** Create a new model with randomly initialized weights. */
export function createModel(config: GPTConfig): Model {
const { nEmbd, nLayer, blockSize, vocabSize } = config;
const weights: Weights = {
tokenEmbedding: matrix(vocabSize, nEmbd),
positionEmbedding: matrix(blockSize, nEmbd),
output: matrix(vocabSize, nEmbd),
layers: Array.from({ length: nLayer }, () => ({
attention: {
query: matrix(nEmbd, nEmbd),
key: matrix(nEmbd, nEmbd),
value: matrix(nEmbd, nEmbd),
output: matrix(nEmbd, nEmbd),
},
mlp: {
hidden: matrix(4 * nEmbd, nEmbd),
output: matrix(nEmbd, 4 * nEmbd),
},
})),
};
const allMatrices: Matrix[] = [
weights.tokenEmbedding,
weights.positionEmbedding,
weights.output,
...weights.layers.flatMap((layer) => [
layer.attention.query, layer.attention.key,
layer.attention.value, layer.attention.output,
layer.mlp.hidden, layer.mlp.output,
]),
];
const params = allMatrices.flatMap((mat) => mat.flatMap((row) => row));
return { config, weights, params };
}
// --- Save / Load ---
/** Save a trained model to a JSON file (config + parameter values). */
export function saveModel(model: Model, path: string): void {
const data = {
config: model.config,
weights: model.params.map((p) => p.data),
};
writeFileSync(path, JSON.stringify(data));
}
/** Load a model from a JSON file. Recreates the structure, then fills in the learned weights. */
export function loadModel(path: string): Model {
const data = JSON.parse(readFileSync(path, "utf-8"));
const model = createModel(data.config);
for (let i = 0; i < model.params.length; i++) {
model.params[i].data = data.weights[i];
}
return model;
}
// --- KV Cache ---
/** Create fresh key/value caches for a new sequence. Must be called per-sequence. */
export function createKVCache(model: Model): {
keys: Value[][][];
values: Value[][][];
} {
return {
keys: Array.from({ length: model.config.nLayer }, () => []),
values: Array.from({ length: model.config.nLayer }, () => []),
};
}
// --- Forward Pass ---
/**
* Run one step of the GPT: given a token at a position, return logits
* over the vocabulary for the next token.
*
* The keys/values caches are mutated (appended to) on each call —
* this is the KV cache that avoids recomputing attention for past positions.
*/
export function gpt(
model: Model,
tokenId: number,
posId: number,
keys: Value[][][],
values: Value[][][],
): Value[] {
const { nLayer, nHead, headDim } = model.config;
const { weights } = model;
// Step 1: Embedding lookup
// Combine "what word is this?" with "where does it appear?" into a single
// vector. This is the hidden state that flows through the rest of the network.
const tokenVector: Value[] = weights.tokenEmbedding[tokenId];
const positionVector: Value[] = weights.positionEmbedding[posId];
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i]));
// Normalize before the first layer to keep values at a stable scale
hidden = rmsnorm(hidden);
// Step 2: Transformer layers
// Each layer has two blocks: attention (gather context from other tokens)
// followed by MLP (process the gathered information). Both use residual
// connections so the input is added back to the output of each block.
for (let li = 0; li < nLayer; li++) {
const layer = weights.layers[li];
// --- Attention block: look at previous tokens to gather context ---
// Save the hidden state so we can add it back after the block (residual)
const beforeAttention: Value[] = hidden;
hidden = rmsnorm(hidden);
// Project the hidden state into query, key, and value vectors.
// These are structurally identical projections — training teaches them
// to play different roles: query asks "what am I looking for?",
// key advertises "what do I contain?", value carries "what to retrieve".
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
// Cache the key and value so future tokens can attend to this position
keys[li].push(key);
values[li].push(value);
// Each head independently attends to a different slice of the vectors,
// allowing the model to track multiple relationships at once
const attentionOutput: Value[] = [];
for (let h = 0; h < nHead; h++) {
const headStart = h * headDim;
const headQuery = query.slice(headStart, headStart + headDim);
const headKeys = keys[li].map((ki) => ki.slice(headStart, headStart + headDim));
const headValues = values[li].map((vi) => vi.slice(headStart, headStart + headDim));
// Scaled dot-product attention: score = (query · key) / √headDim
const attnLogits = headKeys.map((cachedKey) =>
vsum(headQuery.map((q, j) => q.mul(cachedKey[j]))).div(Math.sqrt(headDim))
);
// Softmax converts scores into weights that sum to 1
const attnWeights = softmax(attnLogits);
// Weighted sum of value vectors — high-scoring positions contribute more
for (let j = 0; j < headDim; j++) {
attentionOutput.push(vsum(attnWeights.map((w, t) => w.mul(headValues[t][j]))));
}
}
// Project concatenated head outputs back to the hidden dimension
hidden = linear(attentionOutput, layer.attention.output);
// Residual connection: add back what we had before attention
hidden = hidden.map((h, i) => h.add(beforeAttention[i]));
// --- MLP block: process each token's representation independently ---
const beforeMLP: Value[] = hidden;
hidden = rmsnorm(hidden);
hidden = linear(hidden, layer.mlp.hidden); // expand to 4x wider
hidden = hidden.map((h) => h.relu()); // nonlinearity
hidden = linear(hidden, layer.mlp.output); // compress back
// Residual connection: add back what we had before the MLP
hidden = hidden.map((h, i) => h.add(beforeMLP[i]));
}
// Step 3: Output projection
// Project the final hidden state to vocabulary size — one score per word.
// These raw scores (logits) will be passed through softmax later to get
// a probability distribution over the next token.
return linear(hidden, weights.output);
}
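The per-head loop above is easier to follow with concrete numbers. A minimal sketch of scaled dot-product attention for one head, using toy 2-dimensional vectors rather than the model's learned weights:

```typescript
// Scaled dot-product attention for one query over two cached positions.
const headDim = 2;
const query = [1, 0];
const cachedKeys = [[1, 0], [0, 1]];   // keys for positions 0 and 1
const cachedValues = [[10, 0], [0, 10]];

// score_t = (query · key_t) / sqrt(headDim)
const scores = cachedKeys.map((k) =>
  k.reduce((acc, kj, j) => acc + query[j] * kj, 0) / Math.sqrt(headDim)
);

// softmax turns scores into weights that sum to 1
const maxS = Math.max(...scores);
const exps = scores.map((s) => Math.exp(s - maxS));
const total = exps.reduce((a, b) => a + b, 0);
const attnWeights = exps.map((e) => e / total);

// output_j = sum_t attnWeights[t] * value_t[j]
const output = [0, 1].map((j) =>
  attnWeights.reduce((acc, w, t) => acc + w * cachedValues[t][j], 0)
);
```

Because the query lines up with the first cached key, the first position gets most of the attention weight, and the output leans toward its value vector.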
rng.ts
/**
* Seeded Random Number Generator
*
* JavaScript has no built-in seeded RNG, so we roll our own:
* - mulberry32 for uniform random numbers
* - Box-Muller transform for Gaussian (normal) distribution
* - Fisher-Yates for array shuffling
* - Weighted random choice for sampling from probability distributions
*/
let _rngState = 0;
export function seed(s: number): void {
_rngState = s | 0;
}
/** Mulberry32: a fast, seedable 32-bit PRNG. Returns a uniform float in [0, 1). */
export function random(): number {
_rngState = (_rngState + 0x6d2b79f5) | 0;
let t = _rngState;
t = Math.imul(t ^ (t >>> 15), t | 1);
t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
}
/** Box-Muller transform: convert two uniform samples into a Gaussian sample. */
export function gauss(mean: number, std: number): number {
let u1 = random();
while (u1 === 0) u1 = random();
const u2 = random();
return mean + std * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}
/** Fisher-Yates shuffle: randomly reorder an array in-place. */
export function shuffle<T>(arr: T[]): void {
for (let i = arr.length - 1; i > 0; i--) {
const j = Math.floor(random() * (i + 1));
[arr[i], arr[j]] = [arr[j], arr[i]];
}
}
/** Sample an index from a weighted distribution. Used for token sampling during inference. */
export function weightedChoice(weights: number[]): number {
const total = weights.reduce((a, b) => a + b, 0);
let r = random() * total;
for (let i = 0; i < weights.length; i++) {
r -= weights[i];
if (r <= 0) return i;
}
return weights.length - 1;
}
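Seeding is what makes training runs reproducible. A quick self-contained check that the mulberry32 recurrence above is deterministic (the generator is inlined as a factory here so two instances can be compared side by side):

```typescript
// Inlined mulberry32, same algorithm as random() above.
function mulberry32(seedValue: number): () => number {
  let state = seedValue | 0;
  return () => {
    state = (state + 0x6d2b79f5) | 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const a = mulberry32(42);
const b = mulberry32(42);
const seqA = [a(), a(), a()];
const seqB = [b(), b(), b()];
// Same seed → identical streams, and every value lies in [0, 1)
```

This is exactly why phrases-train.ts calls seed(42): rerunning training replays the same shuffle and the same weight initialization.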
nn.ts
/**
* Neural Network Primitives
*
* The building blocks that the GPT architecture assembles:
* - linear: matrix-vector multiply (the fundamental neural net operation)
* - softmax: convert raw scores into a probability distribution
* - rmsnorm: normalize activations to stabilize training
*
* These mirror PyTorch's torch.nn.functional — general-purpose operations
* used by the model, training loop, and inference.
*/
import { Value, vsum } from "./autograd.js";
export type Matrix = Value[][];
/** y = Wx: multiply a weight matrix by an input vector. */
export function linear(input: Value[], weights: Matrix): Value[] {
return weights.map((row) => vsum(row.map((w, i) => w.mul(input[i]))));
}
/** Convert raw logits to probabilities. Subtracts max for numerical stability. */
export function softmax(logits: Value[]): Value[] {
const maxVal = Math.max(...logits.map((v) => v.data));
const exps = logits.map((v) => v.sub(maxVal).exp());
const total = vsum(exps);
return exps.map((e) => e.div(total));
}
/** Root Mean Square normalization: scale activations to unit root-mean-square. */
export function rmsnorm(input: Value[]): Value[] {
const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
const scale = ms.add(1e-5).pow(-0.5);
return input.map((xi) => xi.mul(scale));
}
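Two properties of softmax are worth verifying: the outputs always sum to 1, and subtracting the max (as softmax above does) leaves the result mathematically unchanged while preventing overflow. A plain-number sketch (the helper name is ours, not part of the codebase):

```typescript
// Softmax on plain numbers, in the numerically stable form.
function softmaxPlain(logits: number[]): number[] {
  const maxVal = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - maxVal)); // shift by max
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const p = softmaxPlain([2, 1, 0]);
const shifted = softmaxPlain([1002, 1001, 1000]); // same distribution
const sum = p.reduce((a, b) => a + b, 0);
// p equals shifted element-wise, and sum is 1. Without the max
// subtraction, Math.exp(1002) would overflow to Infinity.
```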
autograd.ts
/**
* Autograd Engine
*
* A scalar-valued automatic differentiation engine. Each Value node records:
* - its forward-pass result (data)
* - its gradient w.r.t. the loss (grad), filled in by backward()
* - its children in the computation graph and the local derivatives
*
* The backward() method applies the chain rule via reverse-mode autodiff:
* topologically sort the graph, then propagate gradients from output to inputs.
*
* "If I nudge this parameter slightly, how does the loss change?"
*/
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
exp(): Value {
return new Value(Math.exp(this.data), [this], [Math.exp(this.data)]);
}
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
neg(): Value {
return this.mul(-1);
}
sub(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.add(o.neg());
}
div(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.mul(o.pow(-1));
}
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
}
/** Sum a list of Values through the computation graph. */
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
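Reverse-mode autodiff can be sanity-checked against a finite difference. The sketch below inlines a trimmed-down copy of the Value class above (add, mul, backward only) to differentiate f(x) = x·x + 3x at x = 2, where the analytic derivative is 2x + 3 = 7:

```typescript
// Minimal scalar autograd, same structure as the Value class above.
class V {
  grad = 0;
  data: number;
  children: V[];
  localGrads: number[];
  constructor(data: number, children: V[] = [], localGrads: number[] = []) {
    this.data = data;
    this.children = children;
    this.localGrads = localGrads;
  }
  add(o: V): V { return new V(this.data + o.data, [this, o], [1, 1]); }
  mul(o: V): V { return new V(this.data * o.data, [this, o], [o.data, this.data]); }
  backward(): void {
    const topo: V[] = [];
    const seen = new Set<V>();
    const build = (v: V): void => {
      if (seen.has(v)) return;
      seen.add(v);
      v.children.forEach(build);
      topo.push(v);
    };
    build(this);
    this.grad = 1;
    for (const v of topo.reverse())
      v.children.forEach((c, i) => (c.grad += v.localGrads[i] * v.grad));
  }
}

// f(x) = x*x + 3x  →  f'(x) = 2x + 3, so f'(2) = 7
const x = new V(2);
const y = x.mul(x).add(new V(3).mul(x));
y.backward();

// Numerical check via central difference
const f = (t: number): number => t * t + 3 * t;
const h = 1e-6;
const numeric = (f(2 + h) - f(2 - h)) / (2 * h);
```

Note how x.mul(x) contributes twice to x.grad (once per child slot); the += accumulation in backward() is what makes shared nodes like this come out right.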
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
 * Builds a word-level vocabulary from the training corpus; sentences are
 * encoded with a BOS (Beginning of Sequence) delimiter at both ends.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
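A round trip through the tokenizer should be lossless for in-vocabulary text. A self-contained sketch of the same construction as createWordTokenizer, on a toy three-sentence corpus:

```typescript
// Word-level tokenizer, built the same way as createWordTokenizer above.
const corpus = ["the cat sat", "the dog ran", "a cat ran"];
const words = [...new Set(corpus.flatMap((s) => s.split(" ")))].sort();
const BOS = words.length; // one id past the last word

const encode = (s: string): number[] => [
  BOS,
  ...s.split(" ").map((w) => words.indexOf(w)),
  BOS,
];
const decode = (tokens: number[]): string =>
  tokens.map((t) => words[t]).join(" ");

const tokens = encode("the cat ran");
// Strip the BOS delimiters before decoding
const roundTrip = decode(tokens.slice(1, -1));
```

The sorted, deduplicated word list is why phrases-generate.ts can rebuild an identical tokenizer from the same corpus file: the vocabulary is a pure function of the corpus.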
Smoke Test
Time to run it. In this lab you will train the model, inspect the saved weights, and generate sentences with different sampling strategies.
Train the Model
npx tsx src/phrases-train.ts
You should see output like this:
num sentences: 30000
vocab size: 597 (596 words + BOS)
num params: 63296
step 1 / 5000 | loss 6.3917
step 500 / 5000 | loss 3.2184
step 1000 / 5000 | loss 2.8549
step 2000 / 5000 | loss 2.4012
step 3000 / 5000 | loss 2.1538
step 4000 / 5000 | loss 1.9107
step 5000 / 5000 | loss 1.7623
model saved to phrases-model.json
Verify: Loss Decreases
The loss should start near 6.39 (random guessing: -log(1/597)) and drop
below 2.0 by step 5000. If the loss is not decreasing, something is wrong.
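That starting loss is easy to predict from first principles: a uniform guess over 597 tokens has cross-entropy -log(1/597). A one-line check:

```typescript
// Cross-entropy of a uniform guess over a 597-token vocabulary
const vocabSize = 597;
const initialLoss = -Math.log(1 / vocabSize); // same as Math.log(597), ≈ 6.39
```

The randomly initialized model is not exactly uniform, which is why the printed step-1 loss (6.3917) differs from this in the fourth decimal place.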
Verify: Model File Exists
ls -lh phrases-model.json
The file should be roughly 1.5 MB. It contains the model configuration and all 63,296 learned parameter values as JSON.
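The 63,296 figure can be derived from the config in phrases-train.ts: three embedding/output matrices plus six weight matrices per layer. A quick sketch of the arithmetic:

```typescript
// Parameter count for the config used in phrases-train.ts
const nEmbd = 32, nLayer = 2, blockSize = 16, vocabSize = 597;

const tokenEmbedding = vocabSize * nEmbd;    // 597 × 32 = 19,104
const positionEmbedding = blockSize * nEmbd; //  16 × 32 =    512
const outputProj = vocabSize * nEmbd;        // 597 × 32 = 19,104
const perLayer =
  4 * nEmbd * nEmbd +     // query, key, value, output projections
  4 * nEmbd * nEmbd * 2;  // mlp hidden (4n×n) + mlp output (n×4n)

const totalParams =
  tokenEmbedding + positionEmbedding + outputProj + nLayer * perLayer;
console.log(totalParams); // 63296
```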
Generate Sentences
Load the saved model and generate with default settings:
npx tsx src/phrases-generate.ts
loaded phrases-model.json: 63296 params, vocab 597
generating 20 sentences (temperature=0.8):
1. the puppy wants to jump
2. the pig swims by the park
3. i hold the pie
4. the fish skips
5. the teacher is sweet
...
Verify: Output Makes Sense
The generated sentences should:
- Start with plausible words (“the”, “a”, “i”, “we”, “mom”, etc.)
- Follow basic subject-verb patterns
- Use words from the training vocabulary
- End naturally (not cut off mid-thought)
They will not be perfect. You will see occasional oddities like “the fish quacks” or “i toes”. The model has only 63K parameters and 5,000 training steps. But the overall structure should clearly resemble English sentences.
Try Different Sampling Strategies
Low Temperature (Conservative)
npx tsx src/phrases-generate.ts 10 --temp=0.3
With low temperature, the model almost always picks the highest-probability word. Output will be repetitive but grammatically safer.
High Temperature (Creative)
npx tsx src/phrases-generate.ts 10 --temp=1.5
With high temperature, the model considers unlikely words. Output will be more varied but may include nonsensical combinations.
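The effect of temperature is easy to see on a toy distribution: dividing logits by a temperature below 1 sharpens the softmax, while a value above 1 flattens it. A plain-number sketch (the helper name is ours, not part of the codebase):

```typescript
// Temperature scaling, same math as generate.ts
function softmaxAtTemp(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const maxS = Math.max(...scaled);
  const exps = scaled.map((s) => Math.exp(s - maxS));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const logits = [2, 1, 0];
const cold = softmaxAtTemp(logits, 0.3); // sharper: top token dominates
const base = softmaxAtTemp(logits, 1.0);
const hot = softmaxAtTemp(logits, 1.5);  // flatter: tail gets more mass
```

With these logits the top token's probability goes from roughly two thirds at temperature 1.0 to over 95% at 0.3, while 1.5 pushes probability mass into the tail.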
Top-k Filtering
npx tsx src/phrases-generate.ts 10 --top-k=10
Only the 10 most likely words can be chosen at each position. This trims the long tail of unlikely words.
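Top-k is just a cutoff on sorted scores. A self-contained sketch mirroring the filter in generate.ts (the helper name is ours):

```typescript
// Top-k filtering: keep only the k highest scores, mask the rest
function topKFilter(scores: number[], k: number): number[] {
  if (k <= 0 || k >= scores.length) return scores;
  const sorted = [...scores].sort((a, b) => b - a);
  const cutoff = sorted[k - 1]; // k-th highest score
  return scores.map((s) => (s >= cutoff ? s : -Infinity));
}

const filtered = topKFilter([3.0, 1.5, 0.2, -0.7, 2.1], 2);
// Only the two highest scores (3.0 and 2.1) survive; the rest become
// -Infinity, which softmax turns into probability 0
```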
Nucleus Sampling (Top-p)
npx tsx src/phrases-generate.ts 10 --top-p=0.9
Keep the smallest set of words whose probabilities sum to 0.9. Adapts to the model’s confidence at each position.
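Nucleus sampling keeps the smallest set of tokens whose cumulative probability exceeds p. A self-contained sketch of the same ordering logic as generate.ts, applied to ready-made probabilities (the helper name is ours):

```typescript
// Top-p: walk tokens from most to least probable, accumulate probability,
// and zero out everything after the running total first exceeds p.
function topPFilter(probs: number[], p: number): number[] {
  const indices = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const kept = [...probs];
  let cumSum = 0;
  let exceeded = false;
  for (const idx of indices) {
    if (exceeded) kept[idx] = 0;
    cumSum += probs[idx];
    if (cumSum > p) exceeded = true;
  }
  return kept;
}

const kept = topPFilter([0.5, 0.3, 0.15, 0.05], 0.9);
// 0.5 + 0.3 = 0.8 ≤ 0.9, so 0.15 is still included (total 0.95 > 0.9);
// only the 0.05 tail is dropped
```

When the model is confident, the nucleus shrinks to one or two tokens; when it is unsure, more tokens stay in play. That adaptivity is what top-k lacks.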
Combined
npx tsx src/phrases-generate.ts 10 --temp=0.7 --top-p=0.9 --top-k=50
All three strategies working together. This is how production LLMs typically configure their sampling.
What You Have Built
A language model. The same architecture that, at wildly different scale, powers ChatGPT, Claude, and every other LLM.
The entire system is ~450 lines of TypeScript across 9 files, with zero dependencies beyond the TypeScript compiler. Every piece is explicit:
- You can read exactly how token 541 becomes a 32-dimensional vector
- You can trace exactly how attention computes which words to focus on
- You can inspect exactly how each gradient flows backward through the network
- You can see exactly how the optimizer adjusts each of the 63,296 parameters
The only difference between this and a production LLM is scale. Same concepts. Same math. Same architecture. Just more parameters, more data, more compute, and a few engineering optimizations to make it fast.
- our model: 597-word vocab, 63K params, 30K sentences, minutes on a laptop
- production LLM: ~100K vocab, ~1T+ params, trillions of tokens, months on a cluster
The hard part is not the size. The hard part is understanding what happens inside, and now you do.
Complete Code
Here is the complete source code for the smoke test entry points, along with all files from previous chapters.
phrases-train.ts
/**
* MicroGPT — Train on Phrases
*
* Trains a word-level GPT on grade 1 English sentences and saves the model.
* Run inference afterwards with: npx tsx phrases-generate.ts
*
* npx tsx phrases-train.ts
*/
import { readFileSync } from "node:fs";
import { seed, shuffle } from "./rng.js";
import { createWordTokenizer } from "./tokenizer.js";
import { createModel, saveModel } from "./model.js";
import { train } from "./train.js";
// 1. Seed the RNG for reproducibility
seed(42);
// 2. Load dataset: 30K grade 1 sentences
const sentences = readFileSync("data/grade1_sentences.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
shuffle(sentences);
console.log(`num sentences: ${sentences.length}`);
// 3. Build word-level tokenizer from the corpus
const tokenizer = createWordTokenizer(sentences);
console.log(`vocab size: ${tokenizer.vocabSize} (${tokenizer.vocabSize - 1} words + BOS)`);
// 4. Create the model — bigger than the names model to handle the larger vocabulary
const untrained = createModel({
nLayer: 2,
nEmbd: 32,
blockSize: 16,
nHead: 4,
headDim: 8,
vocabSize: tokenizer.vocabSize,
});
console.log(`num params: ${untrained.params.length}`);
// 5. Train the model
const model = train(
untrained,
{ numSteps: 5000, learningRate: 0.01, beta1: 0.85, beta2: 0.99, epsAdam: 1e-8 },
sentences,
tokenizer,
);
// 6. Save the trained model to disk
saveModel(model, "phrases-model.json");
console.log("\nmodel saved to phrases-model.json");
phrases-generate.ts
/**
* MicroGPT — Generate Phrases
*
* Loads a trained model and generates new sentences.
* Train first with: npx tsx phrases-train.ts
*
* npx tsx phrases-generate.ts
* npx tsx phrases-generate.ts 50 # generate 50 sentences
* npx tsx phrases-generate.ts 20 --temp=0.3 # low temperature
* npx tsx phrases-generate.ts 20 --top-k=10 # top-k filtering
* npx tsx phrases-generate.ts 20 --top-p=0.9 # nucleus sampling
* npx tsx phrases-generate.ts 20 --temp=0.7 --top-p=0.9 --top-k=50
* npx tsx phrases-generate.ts 20 --model=phrases-fine-tuned-model.json
*/
import { readFileSync } from "node:fs";
import { seed } from "./rng.js";
import { createWordTokenizer } from "./tokenizer.js";
import { loadModel } from "./model.js";
import { generate, type GenerateOptions } from "./generate.js";
// Parse CLI arguments
const args = process.argv.slice(2);
const positional = args.filter((a) => !a.startsWith("--"));
const flags = Object.fromEntries(
args.filter((a) => a.startsWith("--")).map((a) => {
const [k, v] = a.slice(2).split("=");
return [k, v];
})
);
const numSamples = parseInt(positional[0] ?? "20", 10);
const options: GenerateOptions = {
temperature: parseFloat(flags["temp"] ?? "0.8"),
topK: parseInt(flags["top-k"] ?? "0", 10),
topP: parseFloat(flags["top-p"] ?? "1.0"),
};
// 1. Seed the RNG
seed(Date.now());
// 2. Rebuild the tokenizer from the same corpus (needed for encode/decode)
const corpus = readFileSync("data/grade1_sentences.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
const tokenizer = createWordTokenizer(corpus);
// 3. Load the trained model
const modelPath = flags["model"] ?? "phrases-model.json";
const model = loadModel(modelPath);
console.log(`loaded ${modelPath}: ${model.params.length} params, vocab ${model.config.vocabSize}`);
// 4. Generate
const parts = [`temperature=${options.temperature}`];
if (options.topK! > 0) parts.push(`top-k=${options.topK}`);
if (options.topP! < 1.0) parts.push(`top-p=${options.topP}`);
console.log(`generating ${numSamples} sentences (${parts.join(", ")}):\n`);
const sentences = generate(model, tokenizer, numSamples, options);
sentences.forEach((s, i) =>
console.log(` ${String(i + 1).padStart(2)}. ${s}`)
);
generate.ts
/**
* Inference / Text Generation
*
* Starting from the BOS token, the model predicts one token at a time:
* 1. Forward the current token through the model to get logits
* 2. Apply temperature scaling (lower = more conservative, higher = more creative)
* 3. Apply top-k filtering (keep only the k most likely tokens)
* 4. Apply top-p / nucleus filtering (keep smallest set summing to p)
* 5. Convert to probabilities via softmax
* 6. Randomly sample the next token from that distribution
* 7. Stop when BOS is produced (end of sequence) or max length is reached
*/
import { type Model, gpt, createKVCache } from "./model.js";
import { weightedChoice } from "./rng.js";
import type { Tokenizer } from "./tokenizer.js";
export interface GenerateOptions {
temperature?: number;
topK?: number;
topP?: number;
}
/** Generate new text samples from a trained model. */
export function generate(
model: Model,
tokenizer: Tokenizer,
numSamples: number,
options: GenerateOptions = {},
): string[] {
const { temperature = 0.8, topK = 0, topP = 1.0 } = options;
const samples: string[] = [];
for (let i = 0; i < numSamples; i++) {
const { keys, values } = createKVCache(model);
let tokenId = tokenizer.BOS;
const tokens: number[] = [];
for (let posId = 0; posId < model.config.blockSize; posId++) {
const logits = gpt(model, tokenId, posId, keys, values);
// Temperature scaling
let scores = logits.map((l) => l.data / temperature);
// Top-k: keep only the k highest scores
if (topK > 0 && topK < scores.length) {
const sorted = [...scores].sort((a, b) => b - a);
const cutoff = sorted[topK - 1];
scores = scores.map((s) => (s >= cutoff ? s : -Infinity));
}
// Top-p (nucleus): keep smallest set whose probabilities sum to p
if (topP < 1.0) {
const indices = scores.map((s, idx) => idx);
indices.sort((a, b) => scores[b] - scores[a]);
// Compute softmax on current scores to get probabilities for filtering
const maxS = Math.max(...scores.filter((s) => s !== -Infinity));
const exps = scores.map((s) => (s === -Infinity ? 0 : Math.exp(s - maxS)));
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((e) => e / total);
let cumSum = 0;
let exceeded = false;
for (const idx of indices) {
if (exceeded) {
scores[idx] = -Infinity;
}
cumSum += probs[idx];
if (cumSum > topP) {
exceeded = true;
}
}
}
// Softmax and sample (using plain numbers, not Value, for efficiency)
const maxS = Math.max(...scores.filter((s) => s !== -Infinity));
const exps = scores.map((s) => (s === -Infinity ? 0 : Math.exp(s - maxS)));
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((e) => e / total);
tokenId = weightedChoice(probs);
if (tokenId === tokenizer.BOS) break;
tokens.push(tokenId);
}
samples.push(tokenizer.decode(tokens));
}
return samples;
}
train.ts
/**
* Training Loop
*
* Each step:
* 1. Pick a sentence and tokenize it
* 2. Forward each token through the model, predicting the next token
* 3. Compute cross-entropy loss (how surprised the model is at the correct answer)
* 4. Backward pass: compute gradients via the chain rule
* 5. Adam optimizer: update parameters to reduce loss
*
* Loss starts at -log(1/vocabSize) (random guessing) and decreases as
* the model learns to predict the next word.
*/
import { Value, vsum } from "./autograd.js";
import { softmax } from "./nn.js";
import { type Model, gpt, createKVCache } from "./model.js";
import type { Tokenizer } from "./tokenizer.js";
export interface TrainConfig {
numSteps: number;
learningRate: number;
beta1: number;
beta2: number;
epsAdam: number;
}
/** Train the model on the dataset and return the trained model. */
export function train(
model: Model,
trainConfig: TrainConfig,
sentences: string[],
tokenizer: Tokenizer,
): Model {
const { numSteps, learningRate, beta1, beta2, epsAdam } = trainConfig;
const { params } = model;
// Adam optimizer moment buffers
const mBuf = new Float64Array(params.length);
const vBuf = new Float64Array(params.length);
for (let step = 0; step < numSteps; step++) {
// Tokenize a single sentence, surrounded by BOS on both sides
const sentence = sentences[step % sentences.length];
const tokens = tokenizer.encode(sentence);
const n = Math.min(model.config.blockSize, tokens.length - 1);
// Forward: build the computation graph from tokens to loss
const { keys, values } = createKVCache(model);
const losses: Value[] = [];
for (let posId = 0; posId < n; posId++) {
const tokenId = tokens[posId];
const targetId = tokens[posId + 1];
const logits = gpt(model, tokenId, posId, keys, values);
const probs = softmax(logits);
losses.push(probs[targetId].log().neg());
}
// Average cross-entropy loss over the sequence
const loss = vsum(losses).div(n);
// Backward: walk the computation graph and fill in .grad on every Value node.
// Returns void — gradients are stored as a side effect on the same param objects
// that the Adam loop below reads from.
loss.backward();
// Adam update with linear learning rate decay
const lrT = learningRate * (1 - step / numSteps);
for (let i = 0; i < params.length; i++) {
mBuf[i] = beta1 * mBuf[i] + (1 - beta1) * params[i].grad;
vBuf[i] = beta2 * vBuf[i] + (1 - beta2) * params[i].grad ** 2;
const mHat = mBuf[i] / (1 - beta1 ** (step + 1));
const vHat = vBuf[i] / (1 - beta2 ** (step + 1));
params[i].data -= lrT * mHat / (Math.sqrt(vHat) + epsAdam);
params[i].grad = 0;
}
process.stdout.write(
`\rstep ${String(step + 1).padStart(4)} / ${numSteps} | loss ${loss.data.toFixed(4)}`
);
}
return model;
}
model.ts
/**
* GPT Model
*
* The transformer architecture: a function that maps input tokens to a
* probability distribution over what comes next.
*
* Follows GPT-2 with minor simplifications:
* - RMSNorm instead of LayerNorm
* - No biases
* - ReLU instead of GeLU
*
* Config: 32 embedding dims, 4 attention heads, 2 layers, 16 max context
* → 63,296 parameters total
*/
import { readFileSync, writeFileSync } from "node:fs";
import { Value, vsum } from "./autograd.js";
import { type Matrix, linear, softmax, rmsnorm } from "./nn.js";
import { gauss } from "./rng.js";
// --- Configuration ---
export interface GPTConfig {
nLayer: number;
nEmbd: number;
blockSize: number;
nHead: number;
headDim: number;
vocabSize: number;
}
interface Attention {
query: Matrix;
key: Matrix;
value: Matrix;
output: Matrix;
}
interface MLP {
hidden: Matrix;
output: Matrix;
}
interface Layer {
attention: Attention;
mlp: MLP;
}
export interface Weights {
tokenEmbedding: Matrix;
positionEmbedding: Matrix;
output: Matrix;
layers: Layer[];
}
/** The trained (or untrained) model: config + weights + flattened parameter list. */
export interface Model {
config: GPTConfig;
weights: Weights;
params: Value[];
}
// --- Model Creation ---
function matrix(nout: number, nin: number, std = 0.08): Matrix {
return Array.from({ length: nout }, () =>
Array.from({ length: nin }, () => new Value(gauss(0, std)))
);
}
/** Create a new model with randomly initialized weights. */
export function createModel(config: GPTConfig): Model {
const { nEmbd, nLayer, blockSize, vocabSize } = config;
const weights: Weights = {
tokenEmbedding: matrix(vocabSize, nEmbd),
positionEmbedding: matrix(blockSize, nEmbd),
output: matrix(vocabSize, nEmbd),
layers: Array.from({ length: nLayer }, () => ({
attention: {
query: matrix(nEmbd, nEmbd),
key: matrix(nEmbd, nEmbd),
value: matrix(nEmbd, nEmbd),
output: matrix(nEmbd, nEmbd),
},
mlp: {
hidden: matrix(4 * nEmbd, nEmbd),
output: matrix(nEmbd, 4 * nEmbd),
},
})),
};
const allMatrices: Matrix[] = [
weights.tokenEmbedding,
weights.positionEmbedding,
weights.output,
...weights.layers.flatMap((layer) => [
layer.attention.query, layer.attention.key,
layer.attention.value, layer.attention.output,
layer.mlp.hidden, layer.mlp.output,
]),
];
const params = allMatrices.flatMap((mat) => mat.flatMap((row) => row));
return { config, weights, params };
}
// --- Save / Load ---
/** Save a trained model to a JSON file (config + parameter values). */
export function saveModel(model: Model, path: string): void {
const data = {
config: model.config,
weights: model.params.map((p) => p.data),
};
writeFileSync(path, JSON.stringify(data));
}
/** Load a model from a JSON file. Recreates the structure, then fills in the learned weights. */
export function loadModel(path: string): Model {
const data = JSON.parse(readFileSync(path, "utf-8"));
const model = createModel(data.config);
for (let i = 0; i < model.params.length; i++) {
model.params[i].data = data.weights[i];
}
return model;
}
// --- KV Cache ---
/** Create fresh key/value caches for a new sequence. Must be called per-sequence. */
export function createKVCache(model: Model): {
keys: Value[][][];
values: Value[][][];
} {
return {
keys: Array.from({ length: model.config.nLayer }, () => []),
values: Array.from({ length: model.config.nLayer }, () => []),
};
}
// --- Forward Pass ---
/**
* Run one step of the GPT: given a token at a position, return logits
* over the vocabulary for the next token.
*
* The keys/values caches are mutated (appended to) on each call —
* this is the KV cache that avoids recomputing attention for past positions.
*/
export function gpt(
model: Model,
tokenId: number,
posId: number,
keys: Value[][][],
values: Value[][][],
): Value[] {
const { nLayer, nHead, headDim } = model.config;
const { weights } = model;
// Step 1: Embedding lookup
// Combine "what word is this?" with "where does it appear?" into a single
// vector. This is the hidden state that flows through the rest of the network.
const tokenVector: Value[] = weights.tokenEmbedding[tokenId];
const positionVector: Value[] = weights.positionEmbedding[posId];
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i]));
// Normalize before the first layer to keep values at a stable scale
hidden = rmsnorm(hidden);
// Step 2: Transformer layers
// Each layer has two blocks: attention (gather context from other tokens)
// followed by MLP (process the gathered information). Both use residual
// connections so the input is added back to the output of each block.
for (let li = 0; li < nLayer; li++) {
const layer = weights.layers[li];
// --- Attention block: look at previous tokens to gather context ---
// Save the hidden state so we can add it back after the block (residual)
const beforeAttention: Value[] = hidden;
hidden = rmsnorm(hidden);
// Project the hidden state into query, key, and value vectors.
// These are structurally identical projections — training teaches them
// to play different roles: query asks "what am I looking for?",
// key advertises "what do I contain?", value carries "what to retrieve".
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
// Cache the key and value so future tokens can attend to this position
keys[li].push(key);
values[li].push(value);
// Each head independently attends to a different slice of the vectors,
// allowing the model to track multiple relationships at once
const attentionOutput: Value[] = [];
for (let h = 0; h < nHead; h++) {
const headStart = h * headDim;
const headQuery = query.slice(headStart, headStart + headDim);
const headKeys = keys[li].map((ki) => ki.slice(headStart, headStart + headDim));
const headValues = values[li].map((vi) => vi.slice(headStart, headStart + headDim));
// Scaled dot-product attention: score = (query · key) / √headDim
const attnLogits = headKeys.map((cachedKey) =>
vsum(headQuery.map((q, j) => q.mul(cachedKey[j]))).div(Math.sqrt(headDim))
);
// Softmax converts scores into weights that sum to 1
const attnWeights = softmax(attnLogits);
// Weighted sum of value vectors — high-scoring positions contribute more
for (let j = 0; j < headDim; j++) {
attentionOutput.push(vsum(attnWeights.map((w, t) => w.mul(headValues[t][j]))));
}
}
// Project concatenated head outputs back to the hidden dimension
hidden = linear(attentionOutput, layer.attention.output);
// Residual connection: add back what we had before attention
hidden = hidden.map((h, i) => h.add(beforeAttention[i]));
// --- MLP block: process each token's representation independently ---
const beforeMLP: Value[] = hidden;
hidden = rmsnorm(hidden);
hidden = linear(hidden, layer.mlp.hidden); // expand to 4x wider
hidden = hidden.map((h) => h.relu()); // nonlinearity
hidden = linear(hidden, layer.mlp.output); // compress back
// Residual connection: add back what we had before the MLP
hidden = hidden.map((h, i) => h.add(beforeMLP[i]));
}
// Step 3: Output projection
// Project the final hidden state to vocabulary size — one score per word.
// These raw scores (logits) will be passed through softmax later to get
// a probability distribution over the next token.
return linear(hidden, weights.output);
}
rng.ts
/**
* Seeded Random Number Generator
*
* JavaScript has no built-in seeded RNG, so we roll our own:
* - mulberry32 for uniform random numbers
* - Box-Muller transform for Gaussian (normal) distribution
* - Fisher-Yates for array shuffling
* - Weighted random choice for sampling from probability distributions
*/
let _rngState = 0;
export function seed(s: number): void {
_rngState = s | 0;
}
/** Mulberry32: a fast, seedable 32-bit PRNG. Returns a uniform float in [0, 1). */
export function random(): number {
_rngState = (_rngState + 0x6d2b79f5) | 0;
let t = _rngState;
t = Math.imul(t ^ (t >>> 15), t | 1);
t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
}
/** Box-Muller transform: convert two uniform samples into a Gaussian sample. */
export function gauss(mean: number, std: number): number {
let u1 = random();
while (u1 === 0) u1 = random();
const u2 = random();
return mean + std * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}
/** Fisher-Yates shuffle: randomly reorder an array in-place. */
export function shuffle<T>(arr: T[]): void {
for (let i = arr.length - 1; i > 0; i--) {
const j = Math.floor(random() * (i + 1));
[arr[i], arr[j]] = [arr[j], arr[i]];
}
}
/** Sample an index from a weighted distribution. Used for token sampling during inference. */
export function weightedChoice(weights: number[]): number {
const total = weights.reduce((a, b) => a + b, 0);
let r = random() * total;
for (let i = 0; i < weights.length; i++) {
r -= weights[i];
if (r <= 0) return i;
}
return weights.length - 1;
}
nn.ts
/**
* Neural Network Primitives
*
* The building blocks that the GPT architecture assembles:
* - linear: matrix-vector multiply (the fundamental neural net operation)
* - softmax: convert raw scores into a probability distribution
* - rmsnorm: normalize activations to stabilize training
*
* These mirror PyTorch's torch.nn.functional — general-purpose operations
* used by the model, training loop, and inference.
*/
import { Value, vsum } from "./autograd.js";
export type Matrix = Value[][];
/** y = Wx: multiply a weight matrix by an input vector. */
export function linear(input: Value[], weights: Matrix): Value[] {
return weights.map((row) => vsum(row.map((w, i) => w.mul(input[i]))));
}
/** Convert raw logits to probabilities. Subtracts max for numerical stability. */
export function softmax(logits: Value[]): Value[] {
const maxVal = Math.max(...logits.map((v) => v.data));
const exps = logits.map((v) => v.sub(maxVal).exp());
const total = vsum(exps);
return exps.map((e) => e.div(total));
}
/** Root Mean Square normalization: scale activations so their root mean square is 1. */
export function rmsnorm(input: Value[]): Value[] {
const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
const scale = ms.add(1e-5).pow(-0.5);
return input.map((xi) => xi.mul(scale));
}
autograd.ts
/**
* Autograd Engine
*
* A scalar-valued automatic differentiation engine. Each Value node records:
* - its forward-pass result (data)
* - its gradient w.r.t. the loss (grad), filled in by backward()
* - its children in the computation graph and the local derivatives
*
* The backward() method applies the chain rule via reverse-mode autodiff:
* topologically sort the graph, then propagate gradients from output to inputs.
*
* "If I nudge this parameter slightly, how does the loss change?"
*/
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
exp(): Value {
return new Value(Math.exp(this.data), [this], [Math.exp(this.data)]);
}
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
neg(): Value {
return this.mul(-1);
}
sub(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.add(o.neg());
}
div(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.mul(o.pow(-1));
}
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
}
/** Sum a list of Values through the computation graph. */
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
* Builds a word-level vocabulary from the training corpus, with a BOS
 * (Beginning of Sequence) delimiter added at both ends of each encoded sentence.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
Fine-Tuning
Our model generates declarative sentences: “the cat runs to the park.” But what if we want it to generate questions instead? We could train a new model from scratch on question data, but there is a much better way: fine-tuning.
Fine-tuning means taking a model that already knows a language and continuing to train it on a small, specialized dataset. The model keeps its existing knowledge (what words mean, how sentences are structured) and learns a new pattern (how to form questions) on top of it.
This is exactly how ChatGPT works. OpenAI first trains a base model on trillions of tokens of internet text (pre-training), then fine-tunes it on a much smaller dataset of conversations (fine-tuning). The base model learns language; the fine-tuning teaches it to be a helpful assistant.
The Question Dataset
Our vocabulary has 596 words. We cannot introduce new words without resizing the embedding tables, so the question dataset must use only words the model already knows. This is realistic: real fine-tuning rarely changes the tokenizer.
The vocabulary includes four question words: can, do, is, and where.
That is enough for four question patterns:
can the cat run
do the birds fly
is the dog big
where is the ball
The file data/questions.txt contains 150 sentences following these patterns.
Each sentence uses only words from the existing vocabulary.
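The constraint is easy to check mechanically. Here is a minimal sketch, using tiny inline stand-ins for data/grade1_sentences.txt and data/questions.txt:

```typescript
// Verify that a fine-tuning dataset introduces no out-of-vocabulary words.
// These inline corpora are illustrative stand-ins for the real data files.
const pretrainCorpus = ["the cat can run to the park", "the dog is big", "birds fly fast"];
const questionCorpus = ["can the cat run", "is the dog big"];

const vocab = new Set(pretrainCorpus.flatMap((s) => s.split(" ")));
const oov = [...new Set(questionCorpus.flatMap((s) => s.split(" ")))]
  .filter((w) => !vocab.has(w));

if (oov.length > 0) {
  throw new Error(`out-of-vocabulary words: ${oov.join(", ")}`);
}
console.log("all fine-tuning words exist in the pre-training vocabulary");
```

Running the same check over the real files before fine-tuning catches any stray word that would otherwise map to index -1 in the tokenizer.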
Run the Fine-Tuning
The Fine-Tuning Script
Fine-tuning reuses the exact same train() function from pre-training. The only
differences are the inputs:
// Load the ORIGINAL corpus to build the SAME tokenizer.
// The tokenizer must match the one used during pre-training,
// because the model's embeddings are tied to specific token IDs.
const originalDocs = readFileSync("data/grade1_sentences.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
const tokenizer = createWordTokenizer(originalDocs);
This is critical. The tokenizer assigns each word an integer index by sorting all unique words alphabetically. If we built the tokenizer from the question dataset instead, the word “cat” might get a different index. The model’s embedding for index 89 means “cat” because that is what it learned during pre-training. Using a different tokenizer would scramble every word.
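To see why, run the same alphabetical-ID scheme over two different corpora (toy word lists, for illustration):

```typescript
// Word IDs come from sorting the corpus's unique words alphabetically,
// so a word's ID depends on which other words happen to be in the corpus.
const wordId = (corpus: string[], word: string): number =>
  [...new Set(corpus.flatMap((s) => s.split(" ")))].sort().indexOf(word);

const idInOriginal = wordId(["the cat runs"], "cat");     // vocab: cat, runs, the
const idInQuestions = wordId(["can the cat run"], "cat"); // vocab: can, cat, run, the

// Same word, different ID: embeddings learned under one mapping are
// meaningless under the other.
console.log(idInOriginal, idInQuestions);
```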
// Load the pre-trained model
const model = loadModel("phrases-model.json");
// Load the fine-tuning dataset
const questionDocs = readFileSync("data/questions.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
Then we call the same train() function with two key changes to the
configuration:
const fineTuned = train(
model,
{
numSteps: 1000, // 1,000 vs 5,000 for pre-training
learningRate: 0.001, // 0.001 vs 0.01 — 10x lower
beta1: 0.85,
beta2: 0.99,
epsAdam: 1e-8,
},
questionDocs,
tokenizer,
);
Why Fewer Steps?
The pre-trained model already knows that “the” comes before nouns and that “cat” is an animal. It does not need to relearn any of this. It only needs to learn that sentences can start with “can”, “do”, “is”, or “where” and that what follows is a slightly different word order. 1,000 steps is enough to shift the distribution without destroying what the model already knows.
Why a Lower Learning Rate?
The learning rate controls how aggressively each gradient update changes the parameters. During pre-training, we used 0.01 because the model was starting from random noise and needed to move fast. During fine-tuning, the parameters are already in a good place. A large learning rate would overshoot, destroying the carefully learned representations. A 10x smaller learning rate (0.001) nudges the model gently toward question patterns while preserving its existing knowledge.
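Ignoring Adam's moment scaling for a moment, the effect on a single parameter looks like this (illustrative numbers):

```typescript
// One SGD-style view of an update: newParam = param - learningRate * grad.
// Adam rescales the gradient, but the learning rate still sets the step size.
const param = 0.42;
const grad = 0.8;

const pretrainStep = 0.01 * grad;  // 0.008: big moves, fine from random init
const finetuneStep = 0.001 * grad; // 0.0008: gentle nudges that preserve structure

console.log(param - pretrainStep, param - finetuneStep);
```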
Running It
npx tsx src/phrases-fine-tune.ts
You should see output like this:
vocab size: 597 (596 words + BOS)
loaded pre-trained model: 63296 params
fine-tuning sentences: 150
step 1 / 1000 | loss 2.5073
step 500 / 1000 | loss 0.9592
step 1000 / 1000 | loss 1.0530
model saved to phrases-fine-tuned-model.json
Notice two things:
- The loss starts around 2.5, not 6.39. The pre-trained model is not guessing randomly. It already assigns reasonable probability to the correct next word, even for question patterns it has rarely seen. This head start is the whole point of fine-tuning.
- The loss drops quickly. In pre-training, the model needed 5,000 steps to go from 6.39 to ~1.76. Here it reaches similar loss in just 1,000 steps because it already understands the building blocks.
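The 6.39 figure is not arbitrary: it is the cross-entropy of a uniform guess over the 597-token vocabulary, which you can verify in one line:

```typescript
// A model that knows nothing assigns probability 1/vocabSize to every token,
// so its loss is -log(1/vocabSize) = log(vocabSize).
const vocabSize = 597; // 596 words + BOS
const randomGuessLoss = -Math.log(1 / vocabSize);
console.log(randomGuessLoss.toFixed(2)); // ≈ 6.39
```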
Generate Questions
Use the --model flag to generate from the fine-tuned model:
npx tsx src/phrases-generate.ts 20 --model=phrases-fine-tuned-model.json
Compare this with the base model:
npx tsx src/phrases-generate.ts 20
The fine-tuned model should produce noticeably more questions: sentences starting with “can”, “do”, “is”, and “where”. The base model will mostly produce declarative sentences.
The fine-tuned model will still produce some declarative sentences too. It has not completely forgotten how to make statements; it has just shifted its probability distribution to favor question patterns.
Catastrophic Forgetting
What happens if we fine-tune too aggressively? Try increasing the learning rate or step count:
{ numSteps: 5000, learningRate: 0.01, ... } // same as pre-training
With these settings, the model will almost exclusively produce questions and lose its ability to generate varied declarative sentences. This is called catastrophic forgetting: the new data completely overwrites the old knowledge.
This is one of the fundamental tensions in fine-tuning. You want the model to learn the new task, but you do not want it to forget everything else. In practice, this is managed by:
- Low learning rates (what we did here)
- Few training steps (what we did here)
- Freezing layers (not updating early transformer layers, only the later ones)
- LoRA (adding small adapter matrices instead of modifying the original weights)
Our approach is the simplest version: just be gentle with the learning rate and stop early. It works well for small models and small distribution shifts.
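For a sense of how the LoRA option works, here is a minimal sketch (not part of this repository): the pre-trained weight matrix stays frozen, and only a low-rank correction B·A is trained.

```typescript
// LoRA sketch: effectiveW(x) = Wx + B(Ax), with A of shape r x n and
// B of shape n x r for a small rank r. W is frozen; only A and B train.
// Plain numbers here for clarity; the tutorial's real weights are Value matrices.
type Mat = number[][];

const matVec = (m: Mat, x: number[]): number[] =>
  m.map((row) => row.reduce((acc, w, i) => acc + w * x[i], 0));

function loraForward(W: Mat, A: Mat, B: Mat, x: number[]): number[] {
  const base = matVec(W, x);             // frozen pre-trained path
  const delta = matVec(B, matVec(A, x)); // trainable low-rank path
  return base.map((v, i) => v + delta[i]);
}

// Rank-1 adapter on a 2x2 weight. At the tutorial's nEmbd = 32, a rank-1
// adapter would train 64 numbers per matrix instead of 1,024.
const W: Mat = [[1, 0], [0, 1]];
const A: Mat = [[0.5, 0.5]]; // 1 x 2
const B: Mat = [[0.1], [0.1]]; // 2 x 1
const y = loraForward(W, A, B, [1, 2]);
console.log(y); // base output [1, 2] plus a small learned correction
```

Because the base weights never change, the original knowledge cannot be overwritten; at worst, the adapter learns nothing useful.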
What This Teaches
Fine-tuning is the most common way LLMs are deployed in the real world:
Pre-training (expensive, done once):
trillions of tokens -> base model
Fine-tuning (cheap, done many times):
base model + small dataset -> specialized model
GPT-4 is fine-tuned for conversation. Code Llama is fine-tuned for programming. Med-PaLM is fine-tuned for medical questions. Each starts from the same kind of base model and adapts it with a small, targeted dataset.
The entire fine-tuning pipeline here is five lines of setup and one call to
train(). The same function, the same optimizer, the same math. The only
differences are the data and the hyperparameters. That simplicity is not a
shortcut. It is how fine-tuning actually works.
Complete Code
Here is the complete source code for fine-tuning, along with all files from previous chapters.
phrases-fine-tune.ts
/**
* MicroGPT — Fine-Tune on Questions
*
* Loads a pre-trained model and fine-tunes it on a small question dataset.
* The model learns to produce questions while preserving its existing knowledge.
*
* npx tsx src/phrases-fine-tune.ts
*/
import { readFileSync } from "node:fs";
import { seed, shuffle } from "./rng.js";
import { createWordTokenizer } from "./tokenizer.js";
import { loadModel, saveModel } from "./model.js";
import { train } from "./train.js";
// 1. Seed the RNG for reproducibility
seed(42);
// 2. Load the ORIGINAL corpus to build the SAME tokenizer.
// The tokenizer must match the one used during pre-training,
// because the model's embeddings are tied to specific token IDs.
const originalDocs = readFileSync("data/grade1_sentences.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
const tokenizer = createWordTokenizer(originalDocs);
console.log(`vocab size: ${tokenizer.vocabSize} (${tokenizer.vocabSize - 1} words + BOS)`);
// 3. Load the pre-trained model
const model = loadModel("phrases-model.json");
console.log(`loaded pre-trained model: ${model.params.length} params`);
// 4. Load the fine-tuning dataset (questions)
const questionDocs = readFileSync("data/questions.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
shuffle(questionDocs);
console.log(`fine-tuning sentences: ${questionDocs.length}`);
// 5. Fine-tune: fewer steps, lower learning rate.
// - Lower learning rate (0.001 vs 0.01) preserves existing knowledge
// - Fewer steps (1000 vs 5000) since we're adapting, not learning from scratch
const fineTuned = train(
model,
{ numSteps: 1000, learningRate: 0.001, beta1: 0.85, beta2: 0.99, epsAdam: 1e-8 },
questionDocs,
tokenizer,
);
// 6. Save the fine-tuned model
saveModel(fineTuned, "phrases-fine-tuned-model.json");
console.log("\nmodel saved to phrases-fine-tuned-model.json");
phrases-train.ts
/**
* MicroGPT — Train on Phrases
*
* Trains a word-level GPT on grade 1 English sentences and saves the model.
* Run inference afterwards with: npx tsx phrases-generate.ts
*
* npx tsx phrases-train.ts
*/
import { readFileSync } from "node:fs";
import { seed, shuffle } from "./rng.js";
import { createWordTokenizer } from "./tokenizer.js";
import { createModel, saveModel } from "./model.js";
import { train } from "./train.js";
// 1. Seed the RNG for reproducibility
seed(42);
// 2. Load dataset: 30K grade 1 sentences
const sentences = readFileSync("data/grade1_sentences.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
shuffle(sentences);
console.log(`num sentences: ${sentences.length}`);
// 3. Build word-level tokenizer from the corpus
const tokenizer = createWordTokenizer(sentences);
console.log(`vocab size: ${tokenizer.vocabSize} (${tokenizer.vocabSize - 1} words + BOS)`);
// 4. Create the model — bigger than the names model to handle the larger vocabulary
const untrained = createModel({
nLayer: 2,
nEmbd: 32,
blockSize: 16,
nHead: 4,
headDim: 8,
vocabSize: tokenizer.vocabSize,
});
console.log(`num params: ${untrained.params.length}`);
// 5. Train the model
const model = train(
untrained,
{ numSteps: 5000, learningRate: 0.01, beta1: 0.85, beta2: 0.99, epsAdam: 1e-8 },
sentences,
tokenizer,
);
// 6. Save the trained model to disk
saveModel(model, "phrases-model.json");
console.log("\nmodel saved to phrases-model.json");
phrases-generate.ts
/**
* MicroGPT — Generate Phrases
*
* Loads a trained model and generates new sentences.
* Train first with: npx tsx phrases-train.ts
*
* npx tsx phrases-generate.ts
* npx tsx phrases-generate.ts 50 # generate 50 sentences
* npx tsx phrases-generate.ts 20 --temp=0.3 # low temperature
* npx tsx phrases-generate.ts 20 --top-k=10 # top-k filtering
* npx tsx phrases-generate.ts 20 --top-p=0.9 # nucleus sampling
* npx tsx phrases-generate.ts 20 --temp=0.7 --top-p=0.9 --top-k=50
* npx tsx phrases-generate.ts 20 --model=phrases-fine-tuned-model.json
*/
import { readFileSync } from "node:fs";
import { seed } from "./rng.js";
import { createWordTokenizer } from "./tokenizer.js";
import { loadModel } from "./model.js";
import { generate, type GenerateOptions } from "./generate.js";
// Parse CLI arguments
const args = process.argv.slice(2);
const positional = args.filter((a) => !a.startsWith("--"));
const flags = Object.fromEntries(
args.filter((a) => a.startsWith("--")).map((a) => {
const [k, v] = a.slice(2).split("=");
return [k, v];
})
);
const numSamples = parseInt(positional[0] ?? "20", 10);
const options: GenerateOptions = {
temperature: parseFloat(flags["temp"] ?? "0.8"),
topK: parseInt(flags["top-k"] ?? "0", 10),
topP: parseFloat(flags["top-p"] ?? "1.0"),
};
// 1. Seed the RNG
seed(Date.now());
// 2. Rebuild the tokenizer from the same corpus (needed for encode/decode)
const corpus = readFileSync("data/grade1_sentences.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
const tokenizer = createWordTokenizer(corpus);
// 3. Load the trained model
const modelPath = flags["model"] ?? "phrases-model.json";
const model = loadModel(modelPath);
console.log(`loaded ${modelPath}: ${model.params.length} params, vocab ${model.config.vocabSize}`);
// 4. Generate
const parts = [`temperature=${options.temperature}`];
if (options.topK! > 0) parts.push(`top-k=${options.topK}`);
if (options.topP! < 1.0) parts.push(`top-p=${options.topP}`);
console.log(`generating ${numSamples} sentences (${parts.join(", ")}):\n`);
const sentences = generate(model, tokenizer, numSamples, options);
sentences.forEach((s, i) =>
console.log(` ${String(i + 1).padStart(2)}. ${s}`)
);
generate.ts
/**
* Inference / Text Generation
*
* Starting from the BOS token, the model predicts one token at a time:
* 1. Forward the current token through the model to get logits
* 2. Apply temperature scaling (lower = more conservative, higher = more creative)
* 3. Apply top-k filtering (keep only the k most likely tokens)
* 4. Apply top-p / nucleus filtering (keep smallest set summing to p)
* 5. Convert to probabilities via softmax
* 6. Randomly sample the next token from that distribution
* 7. Stop when BOS is produced (end of sequence) or max length is reached
*/
import { type Model, gpt, createKVCache } from "./model.js";
import { weightedChoice } from "./rng.js";
import type { Tokenizer } from "./tokenizer.js";
export interface GenerateOptions {
temperature?: number;
topK?: number;
topP?: number;
}
/** Generate new text samples from a trained model. */
export function generate(
model: Model,
tokenizer: Tokenizer,
numSamples: number,
options: GenerateOptions = {},
): string[] {
const { temperature = 0.8, topK = 0, topP = 1.0 } = options;
const samples: string[] = [];
for (let i = 0; i < numSamples; i++) {
const { keys, values } = createKVCache(model);
let tokenId = tokenizer.BOS;
const tokens: number[] = [];
for (let posId = 0; posId < model.config.blockSize; posId++) {
const logits = gpt(model, tokenId, posId, keys, values);
// Temperature scaling
let scores = logits.map((l) => l.data / temperature);
// Top-k: keep only the k highest scores
if (topK > 0 && topK < scores.length) {
const sorted = [...scores].sort((a, b) => b - a);
const cutoff = sorted[topK - 1];
scores = scores.map((s) => (s >= cutoff ? s : -Infinity));
}
// Top-p (nucleus): keep smallest set whose probabilities sum to p
if (topP < 1.0) {
const indices = scores.map((s, idx) => idx);
indices.sort((a, b) => scores[b] - scores[a]);
// Compute softmax on current scores to get probabilities for filtering
const maxS = Math.max(...scores.filter((s) => s !== -Infinity));
const exps = scores.map((s) => (s === -Infinity ? 0 : Math.exp(s - maxS)));
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((e) => e / total);
let cumSum = 0;
let exceeded = false;
for (const idx of indices) {
if (exceeded) {
scores[idx] = -Infinity;
}
cumSum += probs[idx];
if (cumSum > topP) {
exceeded = true;
}
}
}
// Softmax and sample (using plain numbers, not Value, for efficiency)
const maxS = Math.max(...scores.filter((s) => s !== -Infinity));
const exps = scores.map((s) => (s === -Infinity ? 0 : Math.exp(s - maxS)));
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((e) => e / total);
tokenId = weightedChoice(probs);
if (tokenId === tokenizer.BOS) break;
tokens.push(tokenId);
}
samples.push(tokenizer.decode(tokens));
}
return samples;
}
train.ts
/**
* Training Loop
*
* Each step:
* 1. Pick a sentence and tokenize it
* 2. Forward each token through the model, predicting the next token
* 3. Compute cross-entropy loss (how surprised the model is at the correct answer)
* 4. Backward pass: compute gradients via the chain rule
* 5. Adam optimizer: update parameters to reduce loss
*
* Loss starts at -log(1/vocabSize) (random guessing) and decreases as
* the model learns to predict the next word.
*/
import { Value, vsum } from "./autograd.js";
import { softmax } from "./nn.js";
import { type Model, gpt, createKVCache } from "./model.js";
import type { Tokenizer } from "./tokenizer.js";
export interface TrainConfig {
numSteps: number;
learningRate: number;
beta1: number;
beta2: number;
epsAdam: number;
}
/** Train the model on the dataset and return the trained model. */
export function train(
model: Model,
trainConfig: TrainConfig,
sentences: string[],
tokenizer: Tokenizer,
): Model {
const { numSteps, learningRate, beta1, beta2, epsAdam } = trainConfig;
const { params } = model;
// Adam optimizer moment buffers
const mBuf = new Float64Array(params.length);
const vBuf = new Float64Array(params.length);
for (let step = 0; step < numSteps; step++) {
// Tokenize a single sentence, surrounded by BOS on both sides
const sentence = sentences[step % sentences.length];
const tokens = tokenizer.encode(sentence);
const n = Math.min(model.config.blockSize, tokens.length - 1);
// Forward: build the computation graph from tokens to loss
const { keys, values } = createKVCache(model);
const losses: Value[] = [];
for (let posId = 0; posId < n; posId++) {
const tokenId = tokens[posId];
const targetId = tokens[posId + 1];
const logits = gpt(model, tokenId, posId, keys, values);
const probs = softmax(logits);
losses.push(probs[targetId].log().neg());
}
// Average cross-entropy loss over the sequence
const loss = vsum(losses).div(n);
// Backward: walk the computation graph and fill in .grad on every Value node.
// Returns void — gradients are stored as a side effect on the same param objects
// that the Adam loop below reads from.
loss.backward();
// Adam update with linear learning rate decay
const lrT = learningRate * (1 - step / numSteps);
for (let i = 0; i < params.length; i++) {
mBuf[i] = beta1 * mBuf[i] + (1 - beta1) * params[i].grad;
vBuf[i] = beta2 * vBuf[i] + (1 - beta2) * params[i].grad ** 2;
const mHat = mBuf[i] / (1 - beta1 ** (step + 1));
const vHat = vBuf[i] / (1 - beta2 ** (step + 1));
params[i].data -= lrT * mHat / (Math.sqrt(vHat) + epsAdam);
params[i].grad = 0;
}
process.stdout.write(
`\rstep ${String(step + 1).padStart(4)} / ${numSteps} | loss ${loss.data.toFixed(4)}`
);
}
return model;
}
model.ts
/**
* GPT Model
*
* The transformer architecture: a function that maps input tokens to a
* probability distribution over what comes next.
*
* Follows GPT-2 with minor simplifications:
* - RMSNorm instead of LayerNorm
* - No biases
* - ReLU instead of GeLU
*
* Config: 32 embedding dims, 4 attention heads, 2 layers, 16 max context
* → 63,296 parameters total
*/
import { readFileSync, writeFileSync } from "node:fs";
import { Value, vsum } from "./autograd.js";
import { type Matrix, linear, softmax, rmsnorm } from "./nn.js";
import { gauss } from "./rng.js";
// --- Configuration ---
export interface GPTConfig {
nLayer: number;
nEmbd: number;
blockSize: number;
nHead: number;
headDim: number;
vocabSize: number;
}
interface Attention {
query: Matrix;
key: Matrix;
value: Matrix;
output: Matrix;
}
interface MLP {
hidden: Matrix;
output: Matrix;
}
interface Layer {
attention: Attention;
mlp: MLP;
}
export interface Weights {
tokenEmbedding: Matrix;
positionEmbedding: Matrix;
output: Matrix;
layers: Layer[];
}
/** The trained (or untrained) model: config + weights + flattened parameter list. */
export interface Model {
config: GPTConfig;
weights: Weights;
params: Value[];
}
// --- Model Creation ---
function matrix(nout: number, nin: number, std = 0.08): Matrix {
return Array.from({ length: nout }, () =>
Array.from({ length: nin }, () => new Value(gauss(0, std)))
);
}
/** Create a new model with randomly initialized weights. */
export function createModel(config: GPTConfig): Model {
const { nEmbd, nLayer, blockSize, vocabSize } = config;
const weights: Weights = {
tokenEmbedding: matrix(vocabSize, nEmbd),
positionEmbedding: matrix(blockSize, nEmbd),
output: matrix(vocabSize, nEmbd),
layers: Array.from({ length: nLayer }, () => ({
attention: {
query: matrix(nEmbd, nEmbd),
key: matrix(nEmbd, nEmbd),
value: matrix(nEmbd, nEmbd),
output: matrix(nEmbd, nEmbd),
},
mlp: {
hidden: matrix(4 * nEmbd, nEmbd),
output: matrix(nEmbd, 4 * nEmbd),
},
})),
};
const allMatrices: Matrix[] = [
weights.tokenEmbedding,
weights.positionEmbedding,
weights.output,
...weights.layers.flatMap((layer) => [
layer.attention.query, layer.attention.key,
layer.attention.value, layer.attention.output,
layer.mlp.hidden, layer.mlp.output,
]),
];
const params = allMatrices.flatMap((mat) => mat.flatMap((row) => row));
return { config, weights, params };
}
// --- Save / Load ---
/** Save a trained model to a JSON file (config + parameter values). */
export function saveModel(model: Model, path: string): void {
const data = {
config: model.config,
weights: model.params.map((p) => p.data),
};
writeFileSync(path, JSON.stringify(data));
}
/** Load a model from a JSON file. Recreates the structure, then fills in the learned weights. */
export function loadModel(path: string): Model {
const data = JSON.parse(readFileSync(path, "utf-8"));
const model = createModel(data.config);
for (let i = 0; i < model.params.length; i++) {
model.params[i].data = data.weights[i];
}
return model;
}
// --- KV Cache ---
/** Create fresh key/value caches for a new sequence. Must be called per-sequence. */
export function createKVCache(model: Model): {
keys: Value[][][];
values: Value[][][];
} {
return {
keys: Array.from({ length: model.config.nLayer }, () => []),
values: Array.from({ length: model.config.nLayer }, () => []),
};
}
// --- Forward Pass ---
/**
* Run one step of the GPT: given a token at a position, return logits
* over the vocabulary for the next token.
*
* The keys/values caches are mutated (appended to) on each call —
* this is the KV cache that avoids recomputing attention for past positions.
*/
export function gpt(
model: Model,
tokenId: number,
posId: number,
keys: Value[][][],
values: Value[][][],
): Value[] {
const { nLayer, nHead, headDim } = model.config;
const { weights } = model;
// Step 1: Embedding lookup
// Combine "what word is this?" with "where does it appear?" into a single
// vector. This is the hidden state that flows through the rest of the network.
const tokenVector: Value[] = weights.tokenEmbedding[tokenId];
const positionVector: Value[] = weights.positionEmbedding[posId];
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i]));
// Normalize before the first layer to keep values at a stable scale
hidden = rmsnorm(hidden);
// Step 2: Transformer layers
// Each layer has two blocks: attention (gather context from other tokens)
// followed by MLP (process the gathered information). Both use residual
// connections so the input is added back to the output of each block.
for (let li = 0; li < nLayer; li++) {
const layer = weights.layers[li];
// --- Attention block: look at previous tokens to gather context ---
// Save the hidden state so we can add it back after the block (residual)
const beforeAttention: Value[] = hidden;
hidden = rmsnorm(hidden);
// Project the hidden state into query, key, and value vectors.
// These are structurally identical projections — training teaches them
// to play different roles: query asks "what am I looking for?",
// key advertises "what do I contain?", value carries "what to retrieve".
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
// Cache the key and value so future tokens can attend to this position
keys[li].push(key);
values[li].push(value);
// Each head independently attends to a different slice of the vectors,
// allowing the model to track multiple relationships at once
const attentionOutput: Value[] = [];
for (let h = 0; h < nHead; h++) {
const headStart = h * headDim;
const headQuery = query.slice(headStart, headStart + headDim);
const headKeys = keys[li].map((ki) => ki.slice(headStart, headStart + headDim));
const headValues = values[li].map((vi) => vi.slice(headStart, headStart + headDim));
// Scaled dot-product attention: score = (query · key) / √headDim
const attnLogits = headKeys.map((cachedKey) =>
vsum(headQuery.map((q, j) => q.mul(cachedKey[j]))).div(Math.sqrt(headDim))
);
// Softmax converts scores into weights that sum to 1
const attnWeights = softmax(attnLogits);
// Weighted sum of value vectors — high-scoring positions contribute more
for (let j = 0; j < headDim; j++) {
attentionOutput.push(vsum(attnWeights.map((w, t) => w.mul(headValues[t][j]))));
}
}
// Project concatenated head outputs back to the hidden dimension
hidden = linear(attentionOutput, layer.attention.output);
// Residual connection: add back what we had before attention
hidden = hidden.map((h, i) => h.add(beforeAttention[i]));
// --- MLP block: process each token's representation independently ---
const beforeMLP: Value[] = hidden;
hidden = rmsnorm(hidden);
hidden = linear(hidden, layer.mlp.hidden); // expand to 4x the width
hidden = hidden.map((h) => h.relu()); // nonlinearity
hidden = linear(hidden, layer.mlp.output); // compress back
// Residual connection: add back what we had before the MLP
hidden = hidden.map((h, i) => h.add(beforeMLP[i]));
}
// Step 3: Output projection
// Project the final hidden state to vocabulary size — one score per word.
// These raw scores (logits) will be passed through softmax later to get
// a probability distribution over the next token.
return linear(hidden, weights.output);
}
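Before moving on, it helps to run the attention arithmetic with concrete numbers. The sketch below mirrors one head's computation (scaled dot product, softmax, weighted sum of values) using plain numbers instead of `Value` nodes, with a made-up `headDim` of 2 and a cache of two positions:

```typescript
// Plain-number sketch of one head's attention step (no autograd).
// score_t = (q · k_t) / sqrt(headDim); softmax over cached positions;
// output_j = sum_t weight_t * value_t[j].
const q = [1, 0];
const cachedKeys = [[1, 0], [0, 1]];
const cachedValues = [[10, 0], [0, 10]];
const headDim = q.length;

const logits = cachedKeys.map(
  (k) => (q[0] * k[0] + q[1] * k[1]) / Math.sqrt(headDim)
);
const maxVal = Math.max(...logits);
const exps = logits.map((x) => Math.exp(x - maxVal));
const total = exps.reduce((a, b) => a + b, 0);
const attnWeights = exps.map((e) => e / total);

const out = [0, 1].map((j) =>
  attnWeights.reduce((acc, w, t) => acc + w * cachedValues[t][j], 0)
);
console.log(attnWeights, out); // ≈ [0.67, 0.33], [6.70, 3.30]
```

The query points along the first key's direction, so position 0 receives roughly twice the weight of position 1, and the output leans toward the first value vector. That is the whole mechanism: alignment between query and key decides how much of each value flows through.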
rng.ts
/**
* Seeded Random Number Generator
*
* JavaScript has no built-in seeded RNG, so we roll our own:
* - mulberry32 for uniform random numbers
* - Box-Muller transform for Gaussian (normal) distribution
* - Fisher-Yates for array shuffling
* - Weighted random choice for sampling from probability distributions
*/
let _rngState = 0;
export function seed(s: number): void {
_rngState = s | 0;
}
/** Mulberry32: a fast, seedable 32-bit PRNG. Returns a uniform float in [0, 1). */
export function random(): number {
_rngState = (_rngState + 0x6d2b79f5) | 0;
let t = _rngState;
t = Math.imul(t ^ (t >>> 15), t | 1);
t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
}
/** Box-Muller transform: convert two uniform samples into a Gaussian sample. */
export function gauss(mean: number, std: number): number {
let u1 = random();
while (u1 === 0) u1 = random();
const u2 = random();
return mean + std * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}
/** Fisher-Yates shuffle: randomly reorder an array in-place. */
export function shuffle<T>(arr: T[]): void {
for (let i = arr.length - 1; i > 0; i--) {
const j = Math.floor(random() * (i + 1));
[arr[i], arr[j]] = [arr[j], arr[i]];
}
}
/** Sample an index from a weighted distribution. Used for token sampling during inference. */
export function weightedChoice(weights: number[]): number {
const total = weights.reduce((a, b) => a + b, 0);
let r = random() * total;
for (let i = 0; i < weights.length; i++) {
r -= weights[i];
if (r <= 0) return i;
}
return weights.length - 1;
}
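To see how `weightedChoice` lands on an index, here is the same subtraction loop with the RNG draw passed in explicitly. (`weightedChoiceWith` is a sketch for illustration, not part of the repository.) With weights `[1, 2, 1]` the total is 4, so a draw of 0.6 scales to r = 2.4: subtracting weight 1 leaves 1.4, subtracting weight 2 leaves -0.6, and index 1 is returned.

```typescript
// Same algorithm as weightedChoice, but with the uniform draw supplied
// directly so each outcome can be traced by hand.
function weightedChoiceWith(r01: number, weights: number[]): number {
  const total = weights.reduce((a, b) => a + b, 0);
  let r = r01 * total;
  for (let i = 0; i < weights.length; i++) {
    r -= weights[i];
    if (r <= 0) return i;
  }
  return weights.length - 1;
}

console.log(weightedChoiceWith(0.6, [1, 2, 1])); // 1
```

Draws near 0 select the first index and draws near 1 select the last, with each index winning in proportion to its weight, which is exactly what token sampling needs.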
nn.ts
/**
* Neural Network Primitives
*
* The building blocks that the GPT architecture assembles:
* - linear: matrix-vector multiply (the fundamental neural net operation)
* - softmax: convert raw scores into a probability distribution
* - rmsnorm: normalize activations to stabilize training
*
* These mirror PyTorch's torch.nn.functional — general-purpose operations
* used by the model, training loop, and inference.
*/
import { Value, vsum } from "./autograd.js";
export type Matrix = Value[][];
/** y = Wx: multiply a weight matrix by an input vector. */
export function linear(input: Value[], weights: Matrix): Value[] {
return weights.map((row) => vsum(row.map((w, i) => w.mul(input[i]))));
}
/** Convert raw logits to probabilities. Subtracts max for numerical stability. */
export function softmax(logits: Value[]): Value[] {
const maxVal = Math.max(...logits.map((v) => v.data));
const exps = logits.map((v) => v.sub(maxVal).exp());
const total = vsum(exps);
return exps.map((e) => e.div(total));
}
/** Root Mean Square normalization: scale activations so their root mean square is 1. */
export function rmsnorm(input: Value[]): Value[] {
const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
const scale = ms.add(1e-5).pow(-0.5);
return input.map((xi) => xi.mul(scale));
}
autograd.ts
/**
* Autograd Engine
*
* A scalar-valued automatic differentiation engine. Each Value node records:
* - its forward-pass result (data)
* - its gradient w.r.t. the loss (grad), filled in by backward()
* - its children in the computation graph and the local derivatives
*
* The backward() method applies the chain rule via reverse-mode autodiff:
* topologically sort the graph, then propagate gradients from output to inputs.
*
* "If I nudge this parameter slightly, how does the loss change?"
*/
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
exp(): Value {
const e = Math.exp(this.data); // compute once: eˣ is both the result and the local gradient
return new Value(e, [this], [e]);
}
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
neg(): Value {
return this.mul(-1);
}
sub(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.add(o.neg());
}
div(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.mul(o.pow(-1));
}
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
}
/** Sum a list of Values through the computation graph. */
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
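To convince yourself the chain-rule machinery works, here is a hand-checkable example. It uses a stripped-down copy of the `Value` class above (only `add`, `mul`, and `backward`) so it runs standalone: for f(x) = x·x + 3x, calculus gives df/dx = 2x + 3, which is 7 at x = 2. Note that x appears in three places in the graph, and its gradient correctly accumulates contributions from all of them.

```typescript
// Minimal copy of the autograd Value class, enough to verify backward().
class V {
  grad = 0;
  constructor(public data: number, public children: V[] = [], public localGrads: number[] = []) {}
  add(o: V): V { return new V(this.data + o.data, [this, o], [1, 1]); }
  mul(o: V): V { return new V(this.data * o.data, [this, o], [o.data, this.data]); }
  backward(): void {
    const topo: V[] = [];
    const visited = new Set<V>();
    const build = (v: V): void => {
      if (visited.has(v)) return;
      visited.add(v);
      for (const c of v.children) build(c);
      topo.push(v);
    };
    build(this);
    this.grad = 1;
    for (const v of topo.reverse()) {
      v.children.forEach((c, i) => (c.grad += v.localGrads[i] * v.grad));
    }
  }
}

const x = new V(2);
const f = x.mul(x).add(x.mul(new V(3))); // f(x) = x² + 3x
f.backward();
console.log(f.data, x.grad); // 10 7
```

The x·x term contributes 2 + 2 to the gradient (x is both children of that node) and the 3x term contributes 3, summing to 7. This accumulation via `+=` is why shared nodes work without any special casing.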
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
* Builds a word-level vocabulary from the training corpus. Encoding wraps
* each sentence in BOS (Beginning of Sequence) delimiter tokens.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
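A round trip on a toy corpus shows the scheme in action. The snippet inlines the same logic as `createWordTokenizer` so it runs standalone; the two sentences are made up for illustration. The vocabulary is the sorted set of unique words, and the BOS id is one past the last word index:

```typescript
// Inline version of the word-level tokenizer on a two-sentence corpus.
const sentences = ["the cat sat", "the dog ran"];
const words = [...new Set(sentences.flatMap((s) => s.split(" ")))].sort();
const BOS = words.length; // first id after the word vocabulary

const encode = (s: string): number[] =>
  [BOS, ...s.split(" ").map((w) => words.indexOf(w)), BOS];
const decode = (tokens: number[]): string =>
  tokens.map((t) => words[t]).join(" ");

console.log(words);                 // ["cat", "dog", "ran", "sat", "the"]
console.log(encode("the cat sat")); // [5, 4, 0, 3, 5]
```

Decoding the inner tokens (dropping the BOS markers at each end) recovers the original sentence exactly. One caveat worth knowing: a word that never appeared in the corpus has no index, so `indexOf` returns -1; the tutorial only ever encodes corpus words, so this case does not arise here.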