LLMs, the Hard Way
Build a GPT language model from scratch in TypeScript — no frameworks, no libraries, just math.
This tutorial walks you through every piece of a working language model: from tokenization to attention to training to generation. By the end you will have trained a tiny GPT that writes grade-school sentences and can be fine-tuned to generate questions.
The code is intentionally minimal. The entire model fits in a few hundred lines of TypeScript. Every operation — every matrix multiply, every gradient — is written by hand so you can see exactly what happens inside a transformer.
What you will build
- A tokenizer that maps words to integers
- An autograd engine that computes gradients automatically
- Neural network primitives: linear layers, softmax, normalization
- A transformer model with attention, MLPs, and residual connections
- A training loop with the Adam optimizer
- Text generation with temperature, top-k, and top-p sampling
- Fine-tuning to adapt the model to a new task
How to read this tutorial
The tutorial is split into three parts:
- Building the Model — the components that make up a transformer
- Training and Inference — teaching the model and using it
- Putting It to Work — end-to-end training, generation, and fine-tuning
Each section builds on the last. The code is cumulative: by the final chapter you have a complete, working system.
Prerequisites
In this section you will set up the tools needed to complete this tutorial and examine the training data that the model will learn from.
Software
This tutorial requires Node.js (v18 or later) and TypeScript. Verify they are installed:
node --version
v22.15.0
Clone the Repository
Clone the tutorial repository and install dependencies:
git clone https://github.com/irbull/llms-the-hard-way.git
cd llms-the-hard-way
npm install
The Training Data
Every language model starts with a question: what do we want it to learn?
We want ours to learn how simple English sentences are put together. The
training corpus is 30,000 grade-1-level sentences in
data/grade1_sentences.txt:
wc -l data/grade1_sentences.txt
30000 data/grade1_sentences.txt
Look at a few:
head -10 data/grade1_sentences.txt
nan has the nut
the shy bed is old
nan mixed
the bird is wide
the goat likes to go
the cow yells
we dance up the garden
we swing up the store
the cat eats a muffin
the hen walks at dusk
Each sentence is short (2-8 words), uses basic vocabulary, and follows simple subject-verb-object patterns. The entire vocabulary is only 596 unique words, a fraction of what a real LLM would handle, but enough to see every concept in action.
What the Model Will Learn
We will never tell the model anything about English grammar. We will not define “noun” or “verb” or “sentence.” We will simply show it thousands of sentences and ask: given the words so far, what word comes next?
That is all a language model does. It learns to predict the next token. And from this single task, next-token prediction, structure emerges: the model learns that “the” is often followed by a noun, that sentences end after a handful of words, that certain words cluster together.
It’s All Just Scale
Our model has 596 words, 2 layers, and around 63,000 parameters. A model like GPT-4 has a vocabulary of 100,000+ tokens, nearly a hundred layers, and over a trillion parameters. It trains on trillions of words scraped from books, code repositories, scientific papers, and the open web.
But the architecture is the same. Embeddings, attention, feed-forward layers, residual connections, softmax over the vocabulary — every piece you will build in this tutorial is a piece that ships in production LLMs. The difference is not a difference in kind. It is a difference in scale.
When a large model writes working code, answers a medical question, or translates between languages, it is doing exactly what our tiny model does: predicting the next token. It just has enough parameters and has seen enough data that “predict the next token” becomes indistinguishable from understanding.
There is no hidden trick. No secret module that “really” does the thinking. The code you are about to write is the whole story.
The Pipeline
Here is the path from raw text to a model that generates new sentences:
training data -> tokenizer -> model -> training -> saved weights
                                                        |
              tokenizer -> model -> generation <-- load weights
We build each piece from scratch, in order. Each chapter introduces one component and ends with a Complete Code page containing the finished source for that stage.
What You Will Build
| Chapter | You will create | Pipeline stage |
|---|---|---|
| The Tokenizer | tokenizer.ts | Turns words into numbers and back |
| The Autograd Engine | autograd.ts | Automatic differentiation (makes training possible) |
| Neural Network Primitives | nn.ts | Linear layers, softmax, normalization |
| The Model | model.ts, rng.ts | The GPT architecture: config, weights, forward pass |
| Training | train.ts | The training loop and optimizer |
| Saving the Model | saveModel, loadModel | Serialize trained weights to disk and load them back |
| Generation | generate.ts | Inference: turning a trained model into sentences |
| Smoke Test | phrases-train.ts, phrases-generate.ts | Entry points to train and generate |
| Fine-Tuning | phrases-fine-tune.ts | Adapt a trained model to new data |
The Tokenizer
Neural networks operate on numbers, not words. The tokenizer is the bridge: it assigns every word in our vocabulary a unique integer, then translates sentences back and forth between text and number sequences.
Building the Vocabulary
We scan every sentence in the training data and collect every unique word:
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
This gives us a sorted list of 596 words:
Index 0: "a"
Index 1: "all"
Index 2: "am"
...
Index 75: "cat"
Index 76: "catch"
...
Index 525: "the"
Index 526: "then"
...
Index 595: "zoo"
We then add one special token: BOS (Beginning of Sequence), assigned index 596. This marker tells the model “a sentence starts here” and “a sentence ends here.” Production models typically use separate BOS and EOS (End of Sequence) tokens. We use a single token for both roles — the model sees BOS at the start and learns that predicting BOS means the sentence is over.
Total vocabulary size: 597 (596 words + 1 BOS token).
Encoding and Decoding
Encoding: Text to Numbers
To encode a sentence, we wrap it in BOS tokens and replace each word with its index:
"the cat eats a muffin"
| encode
[596, 525, 75, 152, 0, 324, 596]
BOS the cat eats a muffin BOS
The BOS token appears at both ends. The opening BOS gives the model a consistent starting signal. The closing BOS tells the model the sentence is complete. During training, the model learns that after certain patterns of words, the next token should be BOS, meaning “stop.”
Decoding: Numbers Back to Text
Decoding reverses the process. Given [525, 75, 152, 0, 324], we look up
each index in the vocabulary and join with spaces:
[525, 75, 152, 0, 324]
| decode
"the cat eats a muffin"
The Code
The tokenizer’s API is captured in a Tokenizer interface:
// tokenizer.ts
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
A factory function builds one from the training corpus:
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
The tokenizer has no knowledge of English. It does not know that “the” is an article or that “cat” is a noun. It just maps strings to integers. All the meaning will come from training.
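Here is a usage sketch of the factory over a two-sentence toy corpus (both sentences invented for illustration). The function body is an inline copy of createWordTokenizer from above so the snippet runs on its own:

```typescript
// Inline copy of the factory above, applied to a tiny made-up corpus.
interface Tokenizer {
  vocabSize: number;
  BOS: number;
  encode(sentence: string): number[];
  decode(tokens: number[]): string;
}

function createWordTokenizer(sentences: string[]): Tokenizer {
  const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
  const BOS = words.length;
  return {
    vocabSize: words.length + 1,
    BOS,
    encode: (s) => [BOS, ...s.split(" ").map((w) => words.indexOf(w)), BOS],
    decode: (tokens) => tokens.map((t) => words[t]).join(" "),
  };
}

const tok = createWordTokenizer(["the cat eats", "the dog runs"]);
// Sorted vocabulary: ["cat", "dog", "eats", "runs", "the"], so BOS = 5.
const ids = tok.encode("the cat runs");
console.log(ids);                          // [5, 4, 0, 3, 5]
console.log(tok.decode(ids.slice(1, -1))); // "the cat runs"
```

Note that the round trip only works for words that appeared in the corpus; an unseen word maps to index -1, which this minimal tokenizer does not guard against.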
Why Word-Level Tokens?
An alternative approach is character-level tokenization, where each letter is a token. Our vocabulary would be just 28 tokens (a-z, space, plus BOS), but the sequences become much longer: “the cat” becomes 7 character tokens instead of 2 word tokens. With only 596 unique words in our corpus, word-level tokenization gives us short sequences that are fast to train on and easy to reason about.
Production LLMs like GPT-4 use a middle ground called subword tokenization (BPE), which splits rare words into pieces while keeping common words whole. Our word-level approach is the simplest version of this idea.
Complete Code
Here is the complete source code for the tokenizer.
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
* Builds a word-level vocabulary from the training corpus, with a BOS
* (Beginning of Sequence) delimiter appended on each side.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
The Autograd Engine
Training a neural network requires answering one question for each of the model’s parameters: if I nudge this number up a tiny bit, does the model get better or worse?
The answer is the parameter’s gradient: the direction and magnitude of change that would reduce the model’s error. Computing gradients by hand for tens of thousands of interconnected parameters would be impossible. Instead, we use automatic differentiation: we build a record of every computation the model performs, then trace backwards through that record to compute all gradients at once.
The Math You Need
Training a neural network relies on a small set of ideas from calculus. If your calculus is rusty — or you never took it — this section covers everything you need before we start writing code.
What Is a Derivative?
A derivative answers one question: if I nudge the input by a tiny amount, how much does the output change?
Formally, df/dx means “the rate of change of f with respect to x.” Intuitively, it is the slope of the function at a given point.
If f(x) = x², then:
- f(3) = 9
- f(3.001) = 9.006001
- Change in output / change in input = 0.006001 / 0.001 ≈ 6.0
The derivative of x² at x = 3 is 6. That number tells you: a tiny nudge to x produces roughly 6 times that nudge in the output.
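You can run this nudge test yourself. A throwaway finite-difference check (not part of the tutorial code):

```typescript
// Numerically estimate the slope of f(x) = x^2 at x = 3.
const f = (x: number) => x * x;
const h = 0.001;
const slope = (f(3 + h) - f(3)) / h;
console.log(slope); // ≈ 6.001 — close to the true derivative 2x = 6 at x = 3
```

Shrinking h makes the estimate converge toward the exact derivative.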
Reading the Notation
The notation trips people up because it appears in several forms that all mean the same thing.
dy/dx — Read as “the derivative of y with respect to x.” How much does y change when you nudge x? Think of it as “a tiny change in y divided by a tiny change in x” — which is exactly what a slope is.
d/dx [f] — The d/dx part is an operator meaning “take the derivative of whatever follows, with respect to x.” So d/dx [x²] asks “what is the derivative of x² with respect to x?” Answer: 2x. This is just another way of writing the same thing as dy/dx.
d(loss)/dz — “The derivative of loss with respect to z.” You see different letters in the denominator because you are asking about different variables at different points in the computation. At one node you ask “how does loss change if I nudge z?” At the next you ask “how does loss change if I nudge x?”
| Notation | Read as |
|---|---|
| dy/dx | derivative of y with respect to x |
| d/dx [f] | take the derivative of f with respect to x |
| d(loss)/dz | derivative of loss with respect to z |
The d just means “a tiny change in.”
The Rules You Need
You need four derivative rules. That’s it.
Power rule: d/dx [xⁿ] = n · xⁿ⁻¹
The exponent drops down as a coefficient and decreases by one:
- d/dx [x²] = 2x
- d/dx [x³] = 3x²
Constant factor rule: constants pass through, bare constants vanish:
- d/dx [3x] = 3
- d/dx [7] = 0
Addition rule: derivatives distribute over sums:
- d/dx [f + g] = df/dx + dg/dx
Product rule: for f · g:
- d/dx [f · g] = f · dg/dx + g · df/dx
These four rules, plus the chain rule below, are enough to differentiate everything our autograd engine will encounter.
The Chain Rule
This is the rule that makes backpropagation work. When you have nested functions:
y = f(g(x))
the derivative is:
dy/dx = df/dg · dg/dx
In plain English: multiply the local derivatives along the path.
Think of it like unit conversion. If nudging x by 1 changes g by 3, and nudging g by 1 changes y by 5, then nudging x by 1 changes y by 3 × 5 = 15. You just multiply the rates together.
For a longer chain a → b → c → loss, the chain rule says:
d(loss)/da = d(loss)/dc · dc/db · db/da
Each node only needs to know two things: its own local derivative and the gradient flowing in from the node ahead of it. That is exactly what our backward pass will compute.
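The unit-conversion picture is easy to verify numerically. A throwaway sketch using the same rates as above (inner rate 3, outer rate 5):

```typescript
// g scales its input by 3, f scales by 5; the chain rule predicts dy/dx = 3 × 5 = 15.
const g = (x: number) => 3 * x;
const f = (u: number) => 5 * u;
const y = (x: number) => f(g(x));

const h = 1e-6;
const slope = (y(2 + h) - y(2)) / h;
console.log(slope); // ≈ 15
```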
A Worked Example
Forget neural networks for a moment. Consider four lines of arithmetic:
x = 2.0
y = 3.0
z = x * y // z = 6.0
loss = z + 1 // loss = 7.0
The computation graph:
x(2.0) ──┐
         [*] ──→ z(6.0) ──┐
y(3.0) ──┘                [+] ──→ loss(7.0)
                   1 ─────┘
Forward pass (left to right)
Just arithmetic. Compute each value and remember which operation produced it.
Backward pass (right to left)
We want to know: how sensitive is loss to each input? Start at loss.
By convention, d(loss)/d(loss) = 1.0 — the loss is perfectly sensitive to
itself.
Step 1 — through the + node.
loss = z + 1, so d(loss)/dz = 1. Adding a constant doesn’t change the slope — a nudge to z passes through to loss unchanged.
z.grad = 1.0
Step 2 — through the × node.
z = x × y. Using the product rule, dz/dx = y and dz/dy = x. Apply the chain rule:
d(loss)/dx = d(loss)/dz · dz/dx = 1.0 × y = 1.0 × 3.0 = 3.0
d(loss)/dy = d(loss)/dz · dz/dy = 1.0 × x = 1.0 × 2.0 = 2.0
So x.grad = 3.0 and y.grad = 2.0.
Sanity check
We used the chain rule to compute x.grad = 3.0. Can we verify that without any calculus? Yes — just run the computation twice, once normally and once with x nudged by a tiny amount, and measure how much the loss actually changed:
Original: z = 2.0 × 3.0 = 6.0, loss = 7.0
Nudge x: z = 2.001 × 3.0 = 6.003, loss = 7.003
Change in loss / change in x = 0.003 / 0.001 = 3.0 ✓
The brute-force answer (3.0) matches the chain-rule answer (3.0). Two completely independent methods give the same result, which confirms the calculus was correct.
This is what autograd does automatically for every node in the graph, no matter how large the computation grows — except it uses the chain rule, not brute force, because running the full computation twice per parameter would be far too slow.
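The same sanity check as code, treating the four-line computation as an ordinary function of its inputs (throwaway code, not part of the engine):

```typescript
// Brute-force gradients: nudge each input separately and measure the loss change.
const loss = (x: number, y: number) => x * y + 1;
const h = 0.001;
const xGrad = (loss(2 + h, 3) - loss(2, 3)) / h; // ≈ 3.0 (= y)
const yGrad = (loss(2, 3 + h) - loss(2, 3)) / h; // ≈ 2.0 (= x)
console.log(xGrad, yGrad);
```

Note this needed one extra forward pass per input; the chain rule gets both gradients from a single backward sweep.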
The Value Class: A Number That Remembers
Every number in the model is a Value rather than a plain number. A Value stores three things:

- data: the actual number (e.g., 0.03)
- grad: the gradient, filled in later during the backward pass
- children and localGrads: how this Value was computed from other Values
When we perform any operation on Values, the result is a new Value that remembers its children (the input Values) and the local gradients, the partial derivatives of that operation with respect to each input. These local gradients answer: “if I nudge this input, how much does the output change?”
children and localGrads are parallel arrays — one local gradient per
child. A binary operation like add or mul takes two inputs, so both arrays
have two entries. A unary operation like pow, log, or relu takes one
input, so both arrays have one entry. The backward pass pairs them up: the
gradient for children[i] is computed using localGrads[i].
Here is the constructor:
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
A leaf Value (like a model parameter) is created with just a number:
new Value(0.03). The children and localGrads default to empty arrays.
When an operation creates a new Value, it passes in the inputs and their local
gradients.
Every primitive operation in our engine records these derivatives. Here they all are, with the intuition for why each gradient is what it is.
Addition: a + b (local gradients [1, 1])
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
- d(a + b) / da = 1
- d(a + b) / db = 1
If you nudge a up by 0.001, the sum goes up by exactly 0.001. The output
changes by the same amount as the input, regardless of what the values are.
Hence [1, 1].
| a | b | a + b | Nudge a to 3.001 | New result | Change | Local grad |
|---|---|---|---|---|---|---|
| 3 | 5 | 8 | 3.001 + 5 | 8.001 | 0.001 | 1 |
Multiplication: a * b (local gradients [b, a])
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
- d(a * b) / da = b
- d(a * b) / db = a
If you nudge a up by 0.001, the product goes up by 0.001 * b. The
sensitivity to a depends on how large b is, and vice versa. Hence
[o.data, this.data], meaning the gradient for each input is the other input’s
value.
| a | b | a * b | Nudge a to 3.001 | New result | Change | Local grad |
|---|---|---|---|---|---|---|
| 3 | 5 | 15 | 3.001 * 5 | 15.005 | 0.005 | 5 (= b) |
Power: a ^ n (local gradient [n * a^(n-1)])
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
- d(a^n) / da = n * a^(n-1)
This is the classic power rule from calculus. The exponent drops down as a
coefficient, and the power decreases by one. For a^2, the gradient is 2a.
The larger a is, the more sensitive the square is to small changes.
| a | n | a ^ n | Nudge a to 3.001 | New result | Change | Local grad |
|---|---|---|---|---|---|---|
| 3 | 2 | 9 | 3.001^2 | 9.006001 | ~0.006 | 6 (= 2 * 3) |
Log: ln(a) (local gradient [1 / a])
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
- d(ln(a)) / da = 1 / a
The log function is steep when a is small and flat when a is large. A tiny
nudge to a small number produces a big change in the log; the same nudge to a
large number barely moves it. Note: log(0) is -Infinity and the gradient
1/0 poisons the computation. In our model, log() is only called on softmax
outputs, which are always positive, so this is safe in practice.
| a | ln(a) | Nudge a by 0.001 | New result | Change | Local grad |
|---|---|---|---|---|---|
| 0.1 | -2.303 | ln(0.101) | -2.293 | ~0.010 | 10 (= 1 / 0.1) |
| 10.0 | 2.303 | ln(10.001) | 2.3027 | ~0.0001 | 0.1 (= 1 / 10) |
Exp: e^a (local gradient [e^a])
exp(): Value {
return new Value(Math.exp(this.data), [this], [Math.exp(this.data)]);
}
- d(e^a) / da = e^a
The exponential function is its own derivative. The larger the output is, the
faster it grows. A nudge to a changes the output by an amount proportional
to the output itself.
| a | e^a | Nudge a to 2.001 | New result | Change | Local grad |
|---|---|---|---|---|---|
| 2 | 7.389 | e^2.001 | 7.396 | ~0.007 | 7.389 (= e^2) |
ReLU: max(0, a) (local gradient [1 if a > 0, else 0])
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
- d(relu(a)) / da = 1 if a > 0, 0 if a <= 0
ReLU is the simplest nonlinearity: it passes positive values through unchanged
and clamps negatives to zero. When a is positive, the gradient is 1 (the
nudge passes through). When a is negative, the gradient is 0 (the output is
stuck at zero, so nudging a does nothing).
| a | relu(a) | Nudge a by 0.001 | New result | Change | Local grad |
|---|---|---|---|---|---|
| 3 | 3 | relu(3.001) | 3.001 | 0.001 | 1 |
| -2 | 0 | relu(-1.999) | 0 | 0 | 0 |
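Every table above follows the same recipe, so one loop can compare each analytic local gradient against a finite difference (a standalone check, not part of the engine):

```typescript
// [name, f, analytic derivative, test point]
const checks: Array<[string, (a: number) => number, (a: number) => number, number]> = [
  ["pow (n=2)", (a) => a ** 2, (a) => 2 * a, 3],
  ["log", (a) => Math.log(a), (a) => 1 / a, 0.1],
  ["exp", (a) => Math.exp(a), (a) => Math.exp(a), 2],
  ["relu", (a) => Math.max(0, a), (a) => (a > 0 ? 1 : 0), 3],
];
const h = 1e-6;
for (const [name, f, df, a] of checks) {
  const numeric = (f(a + h) - f(a)) / h; // brute-force slope
  console.log(`${name}: numeric ${numeric.toFixed(3)}, analytic ${df(a).toFixed(3)}`);
}
```

Each numeric slope should agree with the analytic local gradient to several decimal places.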
Derived Operations
The remaining operations (neg, sub, div) don’t need their own gradient
rules. They are composed from the primitives above:
- neg(a) = a * (-1), uses mul
- sub(a, b) = a + (-b), uses add and neg
- div(a, b) = a * b^(-1), uses mul and pow
Because every step is tracked in the computation graph, the chain rule handles these compositions automatically.
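For instance, composing mul and pow for division yields the familiar quotient gradients, which a quick finite-difference check confirms (throwaway code):

```typescript
// div(a, b) = a * b^(-1). Chaining the mul and pow rules gives:
//   d(a/b)/da = b^(-1)          and   d(a/b)/db = a * (-1) * b^(-2)
const a = 6.0, b = 3.0, h = 1e-6;
const dA = ((a + h) / b - a / b) / h; // ≈  1/3
const dB = (a / (b + h) - a / b) / h; // ≈ -6/9
console.log(dA, dB);
```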
The Computation Graph
As the model processes a token, every addition, multiplication, exponentiation and so on creates a new Value node. The result is a computation graph, a tree-like structure connecting the parameters at the leaves to a single loss value at the root.
Building the Graph
Consider a tiny example using the Value operations from the previous section:
const x = new Value(2.0);
const y = new Value(3.0);
const z = x.mul(y); // z = 6.0
const loss = z.add(1); // loss = 7.0
Each operation creates a new Value that records its children and local gradients. After these four lines, the graph looks like this:
x(2.0) ──┐
         mul ──→ z(6.0) ──┐
y(3.0) ──┘                add ──→ loss(7.0)
                 1.0 ─────┘
And the internal linkages are:
| Node | .data | .children | .localGrads |
|---|---|---|---|
| x | 2.0 | [] | [] |
| y | 3.0 | [] | [] |
| z | 6.0 | [x, y] | [3.0, 2.0] |
| loss | 7.0 | [z, Value(1)] | [1, 1] |
Leaf nodes (x and y) have no children — they are the inputs. The mul
node records both inputs as children and stores [y.data, x.data] as local
gradients (the product rule). The add node stores [1, 1] because addition
passes gradients through unchanged.
This is the entire data structure that backward() will walk. No separate
graph object, no registration step — the graph is just the chain of .children
pointers from the loss back to the leaves.
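To make the table concrete, the following sketch inlines just enough of the Value class (the fields plus add and mul, copied from the previous section) to build the four-line graph and inspect its linkages:

```typescript
// Minimal inline Value: enough to build and inspect the example graph.
class Value {
  grad = 0;
  constructor(
    public data: number,
    public children: Value[] = [],
    public localGrads: number[] = [],
  ) {}
  add(other: Value | number): Value {
    const o = typeof other === "number" ? new Value(other) : other;
    return new Value(this.data + o.data, [this, o], [1, 1]);
  }
  mul(other: Value | number): Value {
    const o = typeof other === "number" ? new Value(other) : other;
    return new Value(this.data * o.data, [this, o], [o.data, this.data]);
  }
}

const x = new Value(2.0);
const y = new Value(3.0);
const z = x.mul(y);
const loss = z.add(1);

console.log(z.localGrads);           // [3, 2] — the product rule, recorded at build time
console.log(loss.localGrads);        // [1, 1] — addition passes gradients through
console.log(loss.children[0] === z); // true — the graph is just .children pointers
```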
From Loss to Parameters
In the real model, the graph is much larger — thousands of nodes — but the
structure is the same. The loss at the root points to the softmax outputs,
which point to the logits, which point to the attention and MLP operations,
which eventually point to the embedding lookups and weight matrices. Every
parameter is reachable by following .children pointers from the loss.
The backward() method traverses this graph in reverse (from loss to leaves),
using each node’s .localGrads to propagate the gradient signal. That is the
subject of the next section.
Backward Pass: The Chain Rule
The backward() method answers the question this chapter opened with: how does the loss change when each parameter is nudged? Starting from the loss (with gradient 1.0), it walks backward through every node in the graph, applying the chain rule:
gradient of child += (gradient of parent) * (local gradient)
The += is critical: when a Value feeds into multiple operations, its gradient
must sum contributions from all paths. Using = would overwrite earlier
contributions and produce incorrect gradients.
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
First it sorts the graph topologically (inputs before outputs), then walks it in reverse. For each node, it multiplies its own gradient by each local gradient and accumulates it onto the child’s gradient.
After backward() runs, every Value in the entire computation, including
all model parameters, has its .grad field filled in. Each gradient says:
“increase this parameter slightly and the loss changes by this much.”
Note that backward() does not zero out gradients before running — it sets
this.grad = 1 at the root and accumulates onto whatever .grad values
already exist. Calling it twice on the same graph would double-count every
gradient. In our training loop this is not a problem: each step builds a fresh
computation graph, and we explicitly reset every parameter’s gradient to zero
before the next step.
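Both behaviors can be demonstrated with the engine itself. The sketch below inlines the Value class from this chapter (add, mul, and backward only) so it runs standalone; it reproduces the worked example, then shows why += matters when one Value feeds both inputs of the same node:

```typescript
// Minimal inline copy of the Value class from this chapter.
class Value {
  grad = 0;
  constructor(
    public data: number,
    public children: Value[] = [],
    public localGrads: number[] = [],
  ) {}
  add(other: Value | number): Value {
    const o = typeof other === "number" ? new Value(other) : other;
    return new Value(this.data + o.data, [this, o], [1, 1]);
  }
  mul(other: Value | number): Value {
    const o = typeof other === "number" ? new Value(other) : other;
    return new Value(this.data * o.data, [this, o], [o.data, this.data]);
  }
  backward(): void {
    const topo: Value[] = [];
    const visited = new Set<Value>();
    const build = (v: Value): void => {
      if (!visited.has(v)) {
        visited.add(v);
        for (const c of v.children) build(c);
        topo.push(v);
      }
    };
    build(this);
    this.grad = 1;
    for (const v of topo.reverse()) {
      for (let i = 0; i < v.children.length; i++) {
        v.children[i].grad += v.localGrads[i] * v.grad;
      }
    }
  }
}

// The worked example: gradients match the chain-rule derivation.
const x = new Value(2.0);
const y = new Value(3.0);
x.mul(y).add(1).backward();
console.log(x.grad, y.grad); // 3 2

// One Value on two paths: s = w * w. Both children of s are w, so backward
// accumulates w.data twice — the correct derivative 2w = 10. With = instead
// of +=, the second write would overwrite the first and report 5.
const w = new Value(5.0);
w.mul(w).backward();
console.log(w.grad); // 10
```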
The vsum Helper
One more utility: vsum adds a list of Values through the computation graph,
so the sum is also differentiable:
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
This is used throughout: for dot products, for summing losses, and for normalization.
Complete Code
Here is the complete source code for the autograd engine, along with the tokenizer from the previous chapter.
autograd.ts
/**
* Autograd Engine
*
* A scalar-valued automatic differentiation engine. Each Value node records:
* - its forward-pass result (data)
* - its gradient w.r.t. the loss (grad), filled in by backward()
* - its children in the computation graph and the local derivatives
*
* The backward() method applies the chain rule via reverse-mode autodiff:
* topologically sort the graph, then propagate gradients from output to inputs.
*
* "If I nudge this parameter slightly, how does the loss change?"
*/
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
exp(): Value {
return new Value(Math.exp(this.data), [this], [Math.exp(this.data)]);
}
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
neg(): Value {
return this.mul(-1);
}
sub(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.add(o.neg());
}
div(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.mul(o.pow(-1));
}
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
}
/** Sum a list of Values through the computation graph. */
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
* Builds a word-level vocabulary from the training corpus, with a BOS
* (Beginning of Sequence) delimiter appended on each side.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
Neural Network Primitives
On top of the autograd engine from the previous chapter, we build three operations that the model uses throughout. These are general-purpose building blocks, the same operations found in every neural network framework. A neural network only needs a surprisingly small toolkit:
- A way to combine inputs. Take several numbers, multiply each by a learned weight, and add the results together. This is just a weighted sum, the same idea as computing a course grade from weighted assignment scores. That is what linear does.
- A way to make decisions. The model needs to express confidence: “I think the next word is 70% likely to be cat and 20% likely to be dog.” Raw scores can be any number, but probabilities must be positive and sum to 1. That is what softmax does. It turns arbitrary scores into a proper probability distribution.
- A way to stay stable. Numbers that pass through dozens of weighted sums tend to either explode toward infinity or shrink toward zero. That is what rmsnorm does. It rescales a vector so the values stay in a reasonable range, similar to normalizing a set of exam scores to have a consistent spread.
Linear (Matrix-Vector Multiply)
The fundamental neural net operation. It transforms a vector by multiplying it
with a weight matrix. First, a type alias — a Matrix is just a 2D array of
Values:
type Matrix = Value[][];
function linear(input: Value[], weights: Matrix): Value[] {
return weights.map((row) => vsum(row.map((w, i) => w.mul(input[i]))));
}
Each output element is a dot product: multiply each input by the corresponding weight, then sum the results. Here is a concrete example with a 3-element input and a 2×3 weight matrix:
input = [2, 3, 1]
weights = [[1, 0, -1],
[0, 2, 1]]
output[0] = (1×2) + (0×3) + (-1×1) = 1
output[1] = (0×2) + (2×3) + ( 1×1) = 7
output = [1, 7]
Each row of the weight matrix produces one output element. A weight matrix with shape 2×3 takes a 3-element input and produces a 2-element output. In the actual model, a weight matrix with shape 128×32 takes a 32-element input vector and produces a 128-element output vector. The operation is the same — just more rows and longer dot products.
This is how the model mixes information across dimensions.
Because every multiplication and addition uses Value nodes, the entire
operation is differentiable, so gradients flow back through it during training.
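To check the arithmetic without autograd, here is a plain-number mirror of linear (the helper name linearNum is ours, for illustration only):

```typescript
// Plain-number mirror of linear: each output is the dot product of one
// weight row with the input vector.
const linearNum = (input: number[], weights: number[][]): number[] =>
  weights.map((row) => row.reduce((acc, w, i) => acc + w * input[i], 0));

console.log(linearNum([2, 3, 1], [[1, 0, -1], [0, 2, 1]])); // [ 1, 7 ]
```

The output matches the worked example above; the Value version performs exactly this arithmetic, one graph node per multiply and add.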
Softmax (Scores to Probabilities)
Neural networks often produce raw, unnormalized scores — numbers that can be positive, negative, large, or small. On their own these scores don’t mean much. Softmax converts a vector of scores into a probability distribution: a vector of positive numbers that sum to 1.
function softmax(logits: Value[]): Value[] {
const maxVal = Math.max(...logits.map((v) => v.data));
const exps = logits.map((v) => v.sub(maxVal).exp());
const total = vsum(exps);
return exps.map((e) => e.div(total));
}
The steps: subtract the maximum value (for numerical stability), exponentiate each element, then divide by the sum. For example:
scores = [2.0, 1.0, 0.1]
subtract max: [0.0, -1.0, -1.9]
exponentiate: [1.0, 0.37, 0.15]
divide by sum: [0.66, 0.24, 0.10]
The largest score gets the highest probability, but every element stays positive and they all sum to 1.
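A plain-number mirror of softmax (the name softmaxNum is ours, for illustration) reproduces the example:

```typescript
// Plain-number mirror of softmax: subtract max, exponentiate, normalize.
const softmaxNum = (scores: number[]): number[] => {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
};

const probs = softmaxNum([2.0, 1.0, 0.1]);
console.log(probs.map((p) => p.toFixed(2)));  // [ '0.66', '0.24', '0.10' ]
console.log(probs.reduce((a, b) => a + b, 0)); // 1 (up to floating-point rounding)
```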
Softmax is used in three places:
- Inside attention, to turn attention scores into attention weights
- During training, to turn output scores into probabilities for computing the loss
- During generation, to turn output scores into a probability distribution for sampling
RMSNorm (Normalization)
Rescales a vector to have roughly unit variance. This prevents numbers from growing too large or too small as they pass through multiple layers:
function rmsnorm(input: Value[]): Value[] {
const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
const scale = ms.add(1e-5).pow(-0.5);
return input.map((xi) => xi.mul(scale));
}
It computes the root mean square of the vector, then divides each element by
it. The 1e-5 prevents division by zero. For example:
input = [3.0, 4.0]
mean of squares: (9 + 16) / 2 = 12.5
root mean square: √12.5 ≈ 3.54
divide each: [3.0 / 3.54, 4.0 / 3.54] ≈ [0.85, 1.13]
The values have been rescaled so they sit near 1 — but their relative proportions (3:4) are preserved. Without normalization, activations can explode or vanish across layers, making training unstable.
Real RMSNorm includes a learnable per-element scale parameter (gamma) that lets each dimension adjust its magnitude after normalization. We omit it for simplicity — our model trains fine without it at this scale.
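A plain-number mirror of rmsnorm (the name rmsnormNum is ours, for illustration) reproduces the example:

```typescript
// Plain-number mirror of rmsnorm: divide by the root mean square,
// with 1e-5 added to avoid division by zero (matching pow(-0.5) above).
const rmsnormNum = (input: number[]): number[] => {
  const ms = input.reduce((acc, x) => acc + x * x, 0) / input.length;
  const scale = 1 / Math.sqrt(ms + 1e-5);
  return input.map((x) => x * scale);
};

const out = rmsnormNum([3.0, 4.0]);
console.log(out.map((v) => v.toFixed(2))); // [ '0.85', '1.13' ]
console.log(out[1] / out[0]);              // ≈ 1.333… — the 3:4 ratio survives
```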
Summary
These three primitives (linear, softmax, and rmsnorm) are the
building blocks the model assembles in the next chapter. Each one is built entirely from
Value operations, so the autograd engine can compute gradients through all
of them.
| Primitive | What it does | Where it is used |
|---|---|---|
| linear | Matrix-vector multiply | Attention projections, MLP layers, output head |
| softmax | Scores to probabilities | Attention weights, training loss, generation sampling |
| rmsnorm | Normalize to unit RMS | Before each transformer sub-block |
Complete Code
Here is the complete source code for the neural network primitives, along with all files from previous chapters.
nn.ts
/**
* Neural Network Primitives
*
* The building blocks that the GPT architecture assembles:
* - linear: matrix-vector multiply (the fundamental neural net operation)
* - softmax: convert raw scores into a probability distribution
* - rmsnorm: normalize activations to stabilize training
*
* These mirror PyTorch's torch.nn.functional — general-purpose operations
* used by the model, training loop, and inference.
*/
import { Value, vsum } from "./autograd.js";
export type Matrix = Value[][];
/** y = Wx: multiply a weight matrix by an input vector. */
export function linear(input: Value[], weights: Matrix): Value[] {
return weights.map((row) => vsum(row.map((w, i) => w.mul(input[i]))));
}
/** Convert raw logits to probabilities. Subtracts max for numerical stability. */
export function softmax(logits: Value[]): Value[] {
const maxVal = Math.max(...logits.map((v) => v.data));
const exps = logits.map((v) => v.sub(maxVal).exp());
const total = vsum(exps);
return exps.map((e) => e.div(total));
}
/** Root Mean Square normalization: rescale activations to unit RMS. */
export function rmsnorm(input: Value[]): Value[] {
const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
const scale = ms.add(1e-5).pow(-0.5);
return input.map((xi) => xi.mul(scale));
}
autograd.ts
/**
* Autograd Engine
*
* A scalar-valued automatic differentiation engine. Each Value node records:
* - its forward-pass result (data)
* - its gradient w.r.t. the loss (grad), filled in by backward()
* - its children in the computation graph and the local derivatives
*
* The backward() method applies the chain rule via reverse-mode autodiff:
* topologically sort the graph, then propagate gradients from output to inputs.
*
* "If I nudge this parameter slightly, how does the loss change?"
*/
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
exp(): Value {
return new Value(Math.exp(this.data), [this], [Math.exp(this.data)]);
}
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
neg(): Value {
return this.mul(-1);
}
sub(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.add(o.neg());
}
div(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.mul(o.pow(-1));
}
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
}
/** Sum a list of Values through the computation graph. */
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
* Builds a word-level vocabulary from the training corpus, with a BOS
* (Beginning of Sequence) delimiter appended on each side.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
The Model
With the autograd engine and neural net primitives in hand, we can build the model. What is a model, concretely?
A model is a configuration plus a large collection of numbers (parameters).
Before training, these numbers are random. After training, they encode everything the model has learned about language. The architecture (how those numbers are wired together) determines what the model can learn. The training process determines what it does learn.
What the Model Contains
Concretely, the model is a nested structure of weight matrices — arrays of
arrays of Value nodes. Each matrix has a specific shape and role. Here is
every piece:
Model
├── config: { nLayer: 2, nEmbd: 32, vocabSize: 597, ... }
│
├── weights
│ ├── tokenEmbedding [597 × 32] one row per word in the vocabulary
│ ├── positionEmbedding [16 × 32] one row per position in a sentence
│ │
│ ├── layers[0]
│ │ ├── attention
│ │ │ ├── query [32 × 32] three identical projections
│ │ │ ├── key [32 × 32] (roles emerge from training,
│ │ │ ├── value [32 × 32] not from the code)
│ │ │ └── output [32 × 32] combine attention results
│ │ └── mlp
│ │ ├── hidden [128 × 32] expand to 128 dimensions
│ │ └── output [32 × 128] compress back to 32
│ │
│ ├── layers[1] (same structure as layer 0)
│ │
│ └── output [597 × 32] map back to vocabulary scores
│
└── params: all 63,296 Values in a flat array (for the optimizer)
Every number in every matrix is a Value. That means the autograd engine can
compute gradients for all 63,296 of them. The rest of this chapter explains
what each piece does and how they connect.
The Forward Pass
Here is what happens when we feed a token into the model:
Token ID (e.g. 541 = "the") + Position (e.g. 2)
|
[Token Embedding] + [Position Embedding] -> 32-dim vector
|
[RMSNorm] -> normalize the vector
|
+--- Transformer Layer 0 ----------------------+
| [RMSNorm] |
| [Multi-Head Attention] -> look at context |
| [+ Residual Connection] |
| [RMSNorm] |
| [MLP: expand -> ReLU -> compress] -> process|
| [+ Residual Connection] |
+-----------------------------------------------+
|
+--- Transformer Layer 1 ----------------------+
| (same structure) |
+-----------------------------------------------+
|
[output projection] -> 597 raw scores (one per word)
|
"logits", unnormalized predictions for the next word
The output is 597 raw scores, one per word in the vocabulary. These scores are
called logits — unnormalized numbers that can be positive, negative, large,
or small. A higher logit means the model thinks that word is more likely to
come next. On their own logits are not probabilities; they become probabilities
when passed through softmax (from the previous chapter) during training or generation.
Before training, logits are essentially random. After training, if you feed the model “the cat”, the logits for “runs,” “eats,” and “sits” will be much higher than “the” or “zoo.”
Simplifications
Our architecture follows the same structure as GPT-2 and LLaMA, with a few simplifications that keep the code short without changing how the model works:
| Our model | Standard GPT-2 / LLaMA | Why we simplify |
|---|---|---|
| RMSNorm | LayerNorm (GPT-2) / RMSNorm (LLaMA) | Fewer operations, same effect at this scale |
| ReLU activation | GELU (GPT-2) / SiLU (LLaMA) | Simpler gradient, easier to understand |
| No bias terms | Bias on every linear layer (GPT-2) | Fewer parameters, modern models drop them too |
| No learnable norm scale | Learnable gamma per element | One less thing to train, works fine here |
| No final norm before output | RMSNorm before output projection | Skipped for brevity |
None of these affect the core ideas. The architecture, the attention mechanism, the training loop — all identical. A reader who understands this model understands the real thing.
Configuration
The configuration defines the shape of the model:
const model = createModel({
nLayer: 2, // number of transformer layers
nEmbd: 32, // embedding dimension (size of internal vectors)
blockSize: 16, // maximum sequence length (longest sentence we can process)
nHead: 4, // number of attention heads
headDim: 8, // dimension per attention head (nEmbd / nHead)
vocabSize: 597, // our tokenizer's vocabulary size
});
These are small numbers. Production models use nEmbd in the thousands and
stack dozens of layers. But the architecture is the same; ours just fits in
memory and trains in minutes instead of months.
A note on nHead: with 32 embedding dimensions, 4 heads is a good balance.
Each head gets 32 / 4 = 8 dimensions to work with. Two heads would give 16
dims each (fewer distinct attention patterns), and 8 heads would give 4 dims
each (very little room per head at this small scale).
Parameters: The Model’s Memory
When we create a model, we allocate a set of weight matrices filled with small random numbers:
function matrix(nout: number, nin: number, std = 0.08): Matrix {
return Array.from({ length: nout }, () =>
Array.from({ length: nin }, () => new Value(gauss(0, std)))
);
}
Each weight matrix serves a specific role. Here is every matrix in our model and what it does:
| Matrix | Shape | Purpose |
|---|---|---|
| weights.tokenEmbedding | 597 × 32 | Token embeddings: one 32-dim vector per word |
| weights.positionEmbedding | 16 × 32 | Position embeddings: one 32-dim vector per position |
| weights.output | 597 × 32 | Output projection: maps back to vocabulary |
| layers[0].attention.query | 32 × 32 | Attention query weights (layer 0) |
| layers[0].attention.key | 32 × 32 | Attention key weights (layer 0) |
| layers[0].attention.value | 32 × 32 | Attention value weights (layer 0) |
| layers[0].attention.output | 32 × 32 | Attention output weights (layer 0) |
| layers[0].mlp.hidden | 128 × 32 | MLP hidden layer (layer 0) |
| layers[0].mlp.output | 32 × 128 | MLP output layer (layer 0) |
| layers[1].attention.query | 32 × 32 | Attention query weights (layer 1) |
| layers[1].attention.key | 32 × 32 | Attention key weights (layer 1) |
| layers[1].attention.value | 32 × 32 | Attention value weights (layer 1) |
| layers[1].attention.output | 32 × 32 | Attention output weights (layer 1) |
| layers[1].mlp.hidden | 128 × 32 | MLP hidden layer (layer 1) |
| layers[1].mlp.output | 32 × 128 | MLP output layer (layer 1) |
Total: 63,296 parameters. Every one of these numbers will be adjusted during training.
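You can reproduce that total directly from the shapes in the table:

```typescript
// Reproduce the parameter count from the config shapes.
const nEmbd = 32, nLayer = 2, blockSize = 16, vocabSize = 597;
const embeddings = vocabSize * nEmbd + blockSize * nEmbd; // token + position
const outputHead = vocabSize * nEmbd;                     // 597 × 32
const perLayerAttention = 4 * nEmbd * nEmbd;              // Q, K, V, output
const perLayerMlp = 2 * (4 * nEmbd) * nEmbd;              // 128×32 expand + 32×128 compress
const total =
  embeddings + outputHead + nLayer * (perLayerAttention + perLayerMlp);
// total === 63296
```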
Embeddings: How Token IDs Become Vectors
This is a crucial concept. The model cannot do math on the integer 541
directly. It needs a representation, a vector of numbers that captures
something about the meaning of the word.
The token embedding table (tokenEmbedding) is a 597 x 32 matrix. Each row is a
32-dimensional vector representing one word in our vocabulary. To look up the
embedding for “the” (token 541):
tokenEmbedding[541] -> [0.03, -0.11, 0.07, ..., 0.02] (32 numbers)
Before training, these vectors are random. The word “the” and the word “cat” have random, meaningless embeddings. After training, words that appear in similar contexts will have similar embeddings. The model discovers relationships between words entirely from the data.
The position embedding table (positionEmbedding) works the same way but for positions
instead of words. positionEmbedding[0] is a 32-dim vector meaning “first word,”
positionEmbedding[3] means “fourth word.” This is how the model knows word order. Without
position embeddings, it would not know whether “the cat chased the dog” and
“the dog chased the cat” are different.
The first thing the model does with any input token is combine both embeddings:
const tokenVector: Value[] = weights.tokenEmbedding[tokenId]; // what word is this?
const positionVector: Value[] = weights.positionEmbedding[posId]; // where in the sentence is it?
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i])); // combine
This combined vector — hidden — is 32 numbers encoding both the word identity
and its position. It is the hidden state that flows through the rest of the
network.
Notice that embedding lookup is array indexing, not a matrix multiply. When the model processes token 541, only row 541 of the embedding table enters the computation graph. The other 596 rows are never touched — they receive no gradients during the backward pass and are not updated. Over many training sentences different words appear, so different rows get updated at different times. Words that appear frequently develop better embeddings than rare ones, simply because they get more gradient updates.
Multi-Head Attention
Each transformer layer has two blocks, both using the primitives from the previous chapter. The first is multi-head attention — the mechanism that lets the model look at previous tokens to inform its prediction.
Three Identical Projections
The model computes three vectors from the current hidden state using linear:
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
These are three matrix multiplications with three different weight matrices, all
the same shape ([32 × 32]). Before training, they contain random numbers.
There is nothing in this code that makes query behave differently from key
or value — they are structurally identical linear projections of the same
hidden state.
The names describe what these projections become, not what they are. The labels Q, K, and V are aspirational: they describe roles that emerge through training, not roles assigned by the code. What forces them into distinct roles is the math that follows.
The Attention Computation
Here is what happens to Q, K, and V once they exist:
attention = softmax(Q · K^T / √headDim) × V
This structure imposes hard constraints:
- Q and K always get dot-producted to produce a similarity score. They are structurally forced into a “matching” role — whatever W_q and W_k produce will be compared against each other.
- V always gets weighted-summed using those scores. It is structurally forced into a “what gets retrieved” role — the scores from Q·K gate how much of each V vector flows through.
What’s learned is everything else: what Q and K match on (syntactic role? position? something we can’t name?), and what V carries (whatever information turned out to be useful to retrieve). Q/K/V really are like query/key/value in a database lookup — but the schema of that database, what counts as a match, what’s worth storing — all of that is discovered by training.
The √headDim scaling prevents the dot products from growing so large that
softmax saturates into near-zero gradients, which would kill learning.
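A quick numeric illustration of the saturation problem (plain numbers; the score values are arbitrary, chosen only to show the effect):

```typescript
// Softmax over two scores, to show how large magnitudes saturate it.
const softmax2 = (a: number, b: number): number[] => {
  const m = Math.max(a, b);
  const ea = Math.exp(a - m), eb = Math.exp(b - m);
  return [ea / (ea + eb), eb / (ea + eb)];
};
const unscaled = softmax2(8, 0);               // ≈ [0.9997, 0.0003]: nearly one-hot, gradients ~0
const scaled = softmax2(8 / Math.sqrt(8), 0);  // ≈ [0.944, 0.056]: still has usable slope
```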
Here is how the math notation maps to our code:
| Math | Code | Shape |
|---|---|---|
| Q | query (or headQuery per head) | 32-dim (8 per head) |
| K | key (or headKeys per head) | 32-dim (8 per head) |
| V | value (or headValues per head) | 32-dim (8 per head) |
| W_q, W_k, W_v | layer.attention.query, .key, .value | [32 × 32] |
| √d_k | Math.pow(headDim, -0.5) | scalar (the code multiplies by 1/√8) |
Worked Example: “the cat eats”
Suppose we’re processing the sentence “BOS the cat eats” and we’re on “eats” at position 3, trying to predict what comes next. The 32-dim hidden state for “eats” gets projected into Q, K, and V. Each head works on its own 8-dim slice, so let’s follow one head.
Q_eats = hidden @ W_q → an 8-dim vector for this head
The KV cache already holds the key and value vectors for every previous position (more on the cache later). So we have:
K_BOS, K_the, K_cat, K_eats (cached keys for positions 0, 1, 2, 3)
V_BOS, V_the, V_cat, V_eats (cached values for positions 0, 1, 2, 3)
Compute attention scores — how much should “eats” attend to each position:
score("eats", BOS) = Q_eats · K_BOS / √8 = 0.1 (low)
score("eats", "the") = Q_eats · K_the / √8 = 0.5 (low)
score("eats", "cat") = Q_eats · K_cat / √8 = 2.5 (high)
score("eats", "eats") = Q_eats · K_eats / √8 = 1.0 (medium)
After softmax, these become weights that sum to 1.0. The output is a weighted sum of the value vectors:
output = 0.06 × V_BOS + 0.09 × V_the + 0.69 × V_cat + 0.15 × V_eats
“cat” dominates — the information encoded in V_cat flows through strongly. This is how the model routes information from relevant earlier tokens to the current position.
To be honest: those scores are made up. What actually causes Q_eats · K_cat to be high? The weight matrices W_q and W_k learned — from millions of training examples — to project words into a space where verbs and their subjects end up pointing in similar directions. We can observe that this happens. We can’t easily read off why from the weight matrices.
Causal Masking
Our model is autoregressive — it predicts the next token from previous ones. Each position can only attend to itself and earlier positions, never future ones. In our implementation this happens naturally: the KV cache only contains entries for tokens that have already been processed, so there are no future keys to attend to.
Multiple Heads
The “multi-head” part means we split the 32-dim vectors into 4 heads of 8 dimensions each. Each head has its own slice of W_q, W_k, and W_v, so each head runs its own independent attention computation in parallel:
for (let h = 0; h < nHead; h++) {
const headStart = h * headDim;
const headQuery = query.slice(headStart, headStart + headDim); // this head's 8-dim query
const headKeys = keys[li].map((ki) => ki.slice(headStart, headStart + headDim));
const headValues = values[li].map((vi) => vi.slice(headStart, headStart + headDim));
// compute attention scores and weighted sum for this head
}
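The commented-out step can be sketched like this, using plain numbers instead of Value nodes (so this is a numeric illustration, not the repo's implementation, which builds the same computation out of Value operations so gradients flow through it):

```typescript
// One head's attention: score the query against every cached key,
// softmax the scores, then take a weighted sum of the cached values.
function headAttention(
  headQuery: number[],
  headKeys: number[][],
  headValues: number[][],
): number[] {
  const scale = 1 / Math.sqrt(headQuery.length); // 1/√headDim
  const scores = headKeys.map((k) =>
    k.reduce((s, ki, i) => s + ki * headQuery[i], 0) * scale,
  );
  // Softmax with max subtraction for numerical stability
  const m = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - m));
  const total = exps.reduce((a, b) => a + b, 0);
  const weights = exps.map((e) => e / total); // attention weights, sum to 1
  // Weighted sum of value vectors: one output element per head dimension
  return headQuery.map((_, d) =>
    headValues.reduce((s, v, t) => s + weights[t] * v[d], 0),
  );
}
```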
Different heads learn to attend to different things. On “the cat eats”, the four heads might specialize like this:
Head 0 — tracks grammatical subject:
score("eats" → "cat") = 0.9 → output mostly carries V_cat
Head 1 — tracks recency / local context:
score("eats" → "eats") = 0.8 → output mostly carries V_eats
Head 2 — tracks determiner–noun relationships:
score("cat" → "the") = 0.9 → output mostly carries V_the
Head 3 — honestly, nobody knows:
output = some mix that turns out to be useful
The outputs from all four heads are concatenated back into a 32-dim vector
and projected through one more linear layer (layer.attention.output) to
produce the final attention output.
By the time we’re done, the representation of “eats” has been enriched by: who the subject is, where we are in the sequence, what the determiner structure looks like, and whatever Head 3 noticed.
Why Heads Don’t All Learn the Same Thing
Each head starts with different random weights. Gradient descent is local — it nudges each head’s weights based on where they started, so they drift in different directions. And there’s no benefit to convergence: redundant heads contribute identical information, giving the model no extra expressive power. The loss landscape rewards diversity because diverse heads capture more signal.
One consequence: the capabilities are reproducible but their locations are not. Train the same model twice with different random seeds and you’ll get subject-verb tracking somewhere, but not necessarily in the same head or layer. The capability emerges; its address doesn’t.
MLP (Feed-Forward Network)
Each transformer layer has two blocks. Attention was the first — it mixes information between tokens. The MLP is the second, and it does something complementary: it processes each token’s vector independently.
That distinction matters. Attention is how “cat” gets routed into the representation of “eats.” But attention is mostly a weighted average — it selects and combines, it doesn’t do much computation on its own. The MLP is where per-token nonlinear computation happens. It transforms each vector through learned functions that training has shaped to be useful.
Attention moves information around. The MLP does something with it.
The Code
hidden = linear(hidden, layer.mlp.hidden); // 32 -> 128
hidden = hidden.map((h) => h.relu()); // non-linearity
hidden = linear(hidden, layer.mlp.output); // 128 -> 32
Three operations: expand, apply nonlinearity, compress.
Why Expand and Compress?
The first linear layer projects the 32-dim vector into 128 dimensions — a 4x expansion. The second projects it back down to 32. Why not just do a single 32→32 transformation?
The expanded 128-dim space gives the model room to compute. In 32 dimensions, each element has to carry the final representation directly. In 128 dimensions, the model can build intermediate features — combinations and interactions that are useful for computing the final output but don’t need to survive in the compressed result.
Think of it as scratch space. The expansion gives 128 dimensions to work in; the compression selects what’s worth keeping.
Why ReLU Is Critical
ReLU sets negative values to zero: relu(x) = max(0, x). This is the
nonlinearity between the two linear layers, and without it, the MLP would be
pointless.
Here’s why: a linear transformation followed by another linear transformation
is just… a single linear transformation. Matrix multiplication composes.
W₂(W₁x) = (W₂W₁)x — you could replace the two weight matrices with one
matrix that does the same thing. No matter how many linear layers you stack,
they collapse into one.
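You can verify the collapse numerically with a toy 2×2 example (plain numbers, values arbitrary):

```typescript
// Two stacked linear layers equal one linear layer with W = W2·W1.
const matVec = (W: number[][], x: number[]): number[] =>
  W.map((row) => row.reduce((s, w, i) => s + w * x[i], 0));
const matMul = (A: number[][], B: number[][]): number[][] =>
  A.map((row) => B[0].map((_, j) => row.reduce((s, a, k) => s + a * B[k][j], 0)));

const W1 = [[1, 2], [3, 4]];
const W2 = [[0, 1], [1, 1]];
const x = [5, -2];
const stacked = matVec(W2, matVec(W1, x));   // W2(W1 x)
const collapsed = matVec(matMul(W2, W1), x); // (W2 W1) x
// stacked and collapsed are both [7, 8]: the second layer added no expressive power
```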
ReLU breaks this. By zeroing out different elements depending on the input, it makes the transformation input-dependent — different inputs activate different subsets of the 128 hidden neurons. This is what lets the MLP compute functions that aren’t just weighted sums. It’s what makes depth useful.
What Does It Compute?
Same honesty as attention: we can observe what the MLP does in aggregate, but the individual weight values aren’t interpretable. After attention has gathered context into the “eats” vector — enriching it with information about the subject, position, and determiner structure — the MLP applies a learned nonlinear transformation to that enriched representation.
What patterns does it recognize? What features does it compute? Training shaped the weights to reduce prediction error. The results work. The internals are opaque.
Residual Connections
After each block (attention and MLP), the original input is added back:
hidden = hidden.map((h, i) => h.add(beforeBlock[i]));
where beforeBlock is the hidden state saved before the block started. This
is element-wise addition. The output of each block is added to the input
that went in, not substituted for it. The consequence: a block doesn’t
have to reproduce the entire representation from scratch. It only needs to
learn a useful correction — a delta to add on top of what’s already there.
If a block has nothing useful to contribute, its weights can settle near zero
and the input passes through unchanged.
Why This Matters for Training
During training, the model adjusts weights using gradients — signals that flow backward through the network saying “this weight should go up” or “this weight should go down.” Without residual connections, those signals have to pass through every layer’s transformations on the way back. Each layer can shrink the signal (vanishing gradients) or amplify it out of control (exploding gradients). By the time the signal reaches the early layers, it’s often degraded beyond usefulness.
The residual connection gives gradients a direct path. Because the input is
added to the output (y = f(x) + x), the gradient of y with respect to x
always includes a term of 1 — the identity. No matter what f does to the
gradient, the skip connection ensures the raw signal can still flow backward
unimpeded. This is what makes training deep networks practical.
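A small numeric check of that claim, using finite differences to estimate the slopes (the toy f is chosen to have a near-zero derivative):

```typescript
// With a skip connection y = f(x) + x, dy/dx = f'(x) + 1, so the
// gradient survives even when f'(x) is nearly zero.
const f = (x: number) => 0.0001 * x * x; // a block with a tiny local slope
const y = (x: number) => f(x) + x;       // the same block plus a residual
const slope = (g: (v: number) => number, x: number, h = 1e-6) =>
  (g(x + h) - g(x - h)) / (2 * h);
// slope(f, 1) ≈ 0.0002  (the block alone: gradient nearly vanished)
// slope(y, 1) ≈ 1.0002  (with the skip: gradient intact)
```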
The KV Cache
The model processes one token at a time. If we have the sentence “the cat eats…”, when we feed it “cat” at position 2, it does not automatically know that “the” came before it. But attention — the mechanism that lets the model look at previous words — needs access to those earlier tokens. How?
The answer is that we cache the key and value vectors computed at every previous position. As we saw in the attention section, each token produces three vectors: a query (Q), a key (K), and a value (V). The query is used immediately and discarded. But the keys and values are needed again every time a future token wants to attend to this position.
The KV cache is just a list that grows by one entry each time we process a token:
After processing BOS at position 0:
keys: [ K_bos ]
values: [ V_bos ]
After processing "the" at position 1:
keys: [ K_bos, K_the ]
values: [ V_bos, V_the ]
After processing "cat" at position 2:
keys: [ K_bos, K_the, K_cat ]
values: [ V_bos, V_the, V_cat ]
When the model processes “cat”, its query vector is compared against all cached keys (BOS, “the”, and “cat” itself) to compute attention scores. Those scores determine how much to weight each cached value vector. This is how “cat” can attend to “the” even though “the” was processed in a previous call.
This is also how causal masking works in our implementation: the cache only contains entries for tokens already processed, so there are no future keys to attend to.
We create a fresh, empty cache at the start of each sentence:
export function createKVCache(model: Model): {
keys: Value[][][];
values: Value[][][];
} {
return {
keys: Array.from({ length: model.config.nLayer }, () => []),
values: Array.from({ length: model.config.nLayer }, () => []),
};
}
The type Value[][][] is one list per layer (we have 2 layers, so 2 lists).
Each list starts empty and grows as tokens are processed. A fresh cache means
a fresh sentence — the model has no memory of previous sentences.
Creating the Model
The createModel function allocates all the weight matrices and collects
every individual number into a flat params array:
export function createModel(config: GPTConfig): Model {
const { nLayer, nEmbd, blockSize, vocabSize } = config;
const weights: Weights = {
tokenEmbedding: matrix(vocabSize, nEmbd),
positionEmbedding: matrix(blockSize, nEmbd),
output: matrix(vocabSize, nEmbd),
layers: Array.from({ length: nLayer }, () => ({
attention: {
query: matrix(nEmbd, nEmbd),
key: matrix(nEmbd, nEmbd),
value: matrix(nEmbd, nEmbd),
output: matrix(nEmbd, nEmbd),
},
mlp: {
hidden: matrix(4 * nEmbd, nEmbd),
output: matrix(nEmbd, 4 * nEmbd),
},
})),
};
// Collect all matrices into a flat param array
const allMatrices: Matrix[] = [
weights.tokenEmbedding,
weights.positionEmbedding,
weights.output,
...weights.layers.flatMap((layer) => [
layer.attention.query, layer.attention.key,
layer.attention.value, layer.attention.output,
layer.mlp.hidden, layer.mlp.output,
]),
];
const params = allMatrices.flatMap((mat) => mat.flatMap((row) => row));
return { config, weights, params };
}
The params array is what the optimizer will update during training. The
weights object is a typed view into those same parameters. When training updates
params[i], the corresponding entry in weights changes too, because they
are the same Value objects.
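The aliasing can be seen with a toy object standing in for a Value node:

```typescript
// params and weights reference the same objects, so an optimizer update
// made through params is immediately visible through weights.
const cell = { data: 0.5, grad: 0 };          // stand-in for one Value node
const weights = { tokenEmbedding: [[cell]] }; // typed, nested view
const params = [cell];                        // flat view over the same objects
params[0].data -= 0.1;                        // an "optimizer step"
// weights.tokenEmbedding[0][0].data is now 0.4: same object, same update
```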
Putting It All Together
Every operation (embedding lookup, linear transform, softmax,
rmsnorm, addition, ReLU) is built from Value nodes. The entire forward
pass builds one enormous computation graph. When we call backward() on the
loss, the gradients for all 63,296 parameters are computed in a single sweep
through this graph.
This is what makes neural network training possible: the autograd engine turns the question “how should I change 63,296 numbers to make my predictions better?” into a mechanical, automatic computation.
At this point we have 63,296 random numbers and a blueprint for how to wire them together. The model can process tokens, but its output is nonsense. To make it useful, we need to train it.
Running the Model
The gpt() function wires together every piece from this chapter: it takes a
single token and its position, runs it through the full architecture, and
returns 597 logits. Let’s walk through it before looking at the code.
Step by Step
The function takes five arguments: the model, a token ID, a position, and the KV cache (keys and values) introduced earlier in this chapter.
1. Embedding lookup. Look up the token embedding and the position embedding, and add them together to produce a single 32-element vector:
const tokenVector: Value[] = weights.tokenEmbedding[tokenId];
const positionVector: Value[] = weights.positionEmbedding[posId];
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i]));
At this point hidden is a 32-dim vector that encodes both which word this
is and where it appears in the sentence.
2. Normalize. Apply rmsnorm to keep the numbers at a stable scale:
hidden = rmsnorm(hidden);
3. Transformer layers. The model has 2 layers. Each one does the same thing: attention, then MLP, each with a residual connection.
3a. Attention. Save the hidden state for the residual connection, normalize, then compute query, key, and value vectors. Push the key and value into the KV cache so future tokens can attend to this position. Notice that normalization comes before each block, not after — this is called pre-norm and is what GPT-2 and LLaMA use. The original transformer paper normalized after each block (post-norm), but pre-norm trains more stably because the residual connection carries unnormalized values, giving gradients a cleaner path backward.
const beforeAttention: Value[] = hidden;
hidden = rmsnorm(hidden);
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
keys[li].push(key);
values[li].push(value);
Then split into heads (4 heads of 8 dimensions each). For each head, compute attention scores by comparing this token’s query against all cached keys, run softmax to get weights, and take a weighted sum of the cached values. This is how the model looks at previous tokens to gather context.
Finally, project the concatenated head outputs back to 32 dimensions and add the residual:
hidden = linear(attentionOutput, layer.attention.output);
hidden = hidden.map((h, i) => h.add(beforeAttention[i]));
3b. MLP. Save the hidden state for the residual, normalize, expand from 32 to 128 dimensions, apply ReLU, compress back to 32, and add the residual:
const beforeMLP: Value[] = hidden;
hidden = rmsnorm(hidden);
hidden = linear(hidden, layer.mlp.hidden); // 32 -> 128
hidden = hidden.map((h) => h.relu()); // non-linearity
hidden = linear(hidden, layer.mlp.output); // 128 -> 32
hidden = hidden.map((h, i) => h.add(beforeMLP[i]));
4. Output projection. After all layers, project the 32-dim vector to 597 dimensions — one score per word in the vocabulary:
return linear(hidden, weights.output);
These 597 numbers are the logits: the model’s raw, unnormalized prediction for what word comes next.
The Complete Function
Here is the full gpt() function with all the pieces together:
export function gpt(
model: Model,
tokenId: number,
posId: number,
keys: Value[][][],
values: Value[][][],
): Value[] {
const { nLayer, nHead, headDim } = model.config;
const { weights } = model;
// Step 1: Embedding lookup
// Combine "what word is this?" with "where does it appear?" into a single
// vector. This is the hidden state that flows through the rest of the network.
const tokenVector: Value[] = weights.tokenEmbedding[tokenId];
const positionVector: Value[] = weights.positionEmbedding[posId];
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i]));
// Normalize before the first layer to keep values at a stable scale
hidden = rmsnorm(hidden);
// Step 2: Transformer layers
// Each layer has two blocks: attention (gather context from other tokens)
// followed by MLP (process the gathered information). Both use residual
// connections so the input is added back to the output of each block.
for (let li = 0; li < nLayer; li++) {
const layer = weights.layers[li];
// --- Attention block: look at previous tokens to gather context ---
// Save the hidden state so we can add it back after the block (residual)
const beforeAttention: Value[] = hidden;
hidden = rmsnorm(hidden);
// Project the hidden state into query, key, and value vectors.
// These are structurally identical projections — training teaches them
// to play different roles: query asks "what am I looking for?",
// key advertises "what do I contain?", value carries "what to retrieve".
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
// Cache the key and value so future tokens can attend to this position
keys[li].push(key);
values[li].push(value);
// Each head independently attends to a different slice of the vectors,
// allowing the model to track multiple relationships at once
const attentionOutput: Value[] = [];
for (let h = 0; h < nHead; h++) {
const headStart = h * headDim;
const headQuery = query.slice(headStart, headStart + headDim);
const headKeys = keys[li].map((ki) => ki.slice(headStart, headStart + headDim));
const headValues = values[li].map((vi) => vi.slice(headStart, headStart + headDim));
// Scaled dot-product attention: score = (query · key) / √headDim
const attnLogits = headKeys.map((cachedKey) =>
vsum(headQuery.map((q, j) => q.mul(cachedKey[j]))).div(Math.sqrt(headDim))
);
// Softmax converts scores into weights that sum to 1
const attnWeights = softmax(attnLogits);
// Weighted sum of value vectors — high-scoring positions contribute more
for (let j = 0; j < headDim; j++) {
attentionOutput.push(vsum(attnWeights.map((w, t) => w.mul(headValues[t][j]))));
}
}
// Project concatenated head outputs back to the hidden dimension
hidden = linear(attentionOutput, layer.attention.output);
// Residual connection: add back what we had before attention
hidden = hidden.map((h, i) => h.add(beforeAttention[i]));
// --- MLP block: process each token's representation independently ---
const beforeMLP: Value[] = hidden;
hidden = rmsnorm(hidden);
hidden = linear(hidden, layer.mlp.hidden); // expand to 4x wider
hidden = hidden.map((h) => h.relu()); // nonlinearity
hidden = linear(hidden, layer.mlp.output); // compress back
// Residual connection: add back what we had before the MLP
hidden = hidden.map((h, i) => h.add(beforeMLP[i]));
}
// Step 3: Output projection
// Project the final hidden state to vocabulary size — one score per word.
// These raw scores (logits) will be passed through softmax later to get
// a probability distribution over the next token.
// Note: standard GPT models apply one more normalization here, before this
// linear (a LayerNorm in GPT-2, an RMSNorm in LLaMA). We skip it for simplicity.
return linear(hidden, weights.output);
}
Every operation here uses Value nodes, so the entire forward pass builds a
computation graph. When we call backward() on the loss later, gradients flow
back through every step — from the output projection, through the MLP and
attention blocks, all the way to the embedding lookup.
Complete Code
Here is the complete source code for the model, along with all files from previous chapters.
model.ts
/**
* GPT Model
*
* Defines the model's structure: configuration, weight matrices, and
* parameter collection. Includes the forward pass (gpt) and KV cache
* for inference. Save/load is introduced in a later chapter.
*/
import { Value, vsum } from "./autograd.js";
import { type Matrix, linear, softmax, rmsnorm } from "./nn.js";
import { gauss } from "./rng.js";
// --- Configuration ---
export interface GPTConfig {
nLayer: number;
nEmbd: number;
blockSize: number;
nHead: number;
headDim: number;
vocabSize: number;
}
interface Attention {
query: Matrix;
key: Matrix;
value: Matrix;
output: Matrix;
}
interface MLP {
hidden: Matrix;
output: Matrix;
}
interface Layer {
attention: Attention;
mlp: MLP;
}
export interface Weights {
tokenEmbedding: Matrix;
positionEmbedding: Matrix;
output: Matrix;
layers: Layer[];
}
/** The trained (or untrained) model: config + weights + flattened parameter list. */
export interface Model {
config: GPTConfig;
weights: Weights;
params: Value[];
}
// --- Model Creation ---
function matrix(nout: number, nin: number, std = 0.08): Matrix {
return Array.from({ length: nout }, () =>
Array.from({ length: nin }, () => new Value(gauss(0, std)))
);
}
/** Create a new model with randomly initialized weights. */
export function createModel(config: GPTConfig): Model {
const { nEmbd, nLayer, blockSize, vocabSize } = config;
const weights: Weights = {
tokenEmbedding: matrix(vocabSize, nEmbd),
positionEmbedding: matrix(blockSize, nEmbd),
output: matrix(vocabSize, nEmbd),
layers: Array.from({ length: nLayer }, () => ({
attention: {
query: matrix(nEmbd, nEmbd),
key: matrix(nEmbd, nEmbd),
value: matrix(nEmbd, nEmbd),
output: matrix(nEmbd, nEmbd),
},
mlp: {
hidden: matrix(4 * nEmbd, nEmbd),
output: matrix(nEmbd, 4 * nEmbd),
},
})),
};
const allMatrices: Matrix[] = [
weights.tokenEmbedding,
weights.positionEmbedding,
weights.output,
...weights.layers.flatMap((layer) => [
layer.attention.query, layer.attention.key,
layer.attention.value, layer.attention.output,
layer.mlp.hidden, layer.mlp.output,
]),
];
const params = allMatrices.flatMap((mat) => mat.flatMap((row) => row));
return { config, weights, params };
}
// --- KV Cache ---
/** Create fresh key/value caches for a new sequence. Must be called per-sequence. */
export function createKVCache(model: Model): {
keys: Value[][][];
values: Value[][][];
} {
return {
keys: Array.from({ length: model.config.nLayer }, () => []),
values: Array.from({ length: model.config.nLayer }, () => []),
};
}
// --- Forward Pass ---
/**
* Run one step of the GPT: given a token at a position, return logits
* over the vocabulary for the next token.
*
* The keys/values caches are mutated (appended to) on each call —
* this is the KV cache that avoids recomputing attention for past positions.
*/
export function gpt(
model: Model,
tokenId: number,
posId: number,
keys: Value[][][],
values: Value[][][],
): Value[] {
const { nLayer, nHead, headDim } = model.config;
const { weights } = model;
// Step 1: Embedding lookup
// Combine "what word is this?" with "where does it appear?" into a single
// vector. This is the hidden state that flows through the rest of the network.
const tokenVector: Value[] = weights.tokenEmbedding[tokenId];
const positionVector: Value[] = weights.positionEmbedding[posId];
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i]));
// Normalize before the first layer to keep values at a stable scale
hidden = rmsnorm(hidden);
// Step 2: Transformer layers
// Each layer has two blocks: attention (gather context from other tokens)
// followed by MLP (process the gathered information). Both use residual
// connections so the input is added back to the output of each block.
for (let li = 0; li < nLayer; li++) {
const layer = weights.layers[li];
// --- Attention block: look at previous tokens to gather context ---
// Save the hidden state so we can add it back after the block (residual)
const beforeAttention: Value[] = hidden;
hidden = rmsnorm(hidden);
// Project the hidden state into query, key, and value vectors.
// These are structurally identical projections — training teaches them
// to play different roles: query asks "what am I looking for?",
// key advertises "what do I contain?", value carries "what to retrieve".
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
// Cache the key and value so future tokens can attend to this position
keys[li].push(key);
values[li].push(value);
// Each head independently attends to a different slice of the vectors,
// allowing the model to track multiple relationships at once
const attentionOutput: Value[] = [];
for (let h = 0; h < nHead; h++) {
const headStart = h * headDim;
const headQuery = query.slice(headStart, headStart + headDim);
const headKeys = keys[li].map((ki) => ki.slice(headStart, headStart + headDim));
const headValues = values[li].map((vi) => vi.slice(headStart, headStart + headDim));
// Scaled dot-product attention: score = (query · key) / √headDim
const attnLogits = headKeys.map((cachedKey) =>
vsum(headQuery.map((q, j) => q.mul(cachedKey[j]))).div(Math.sqrt(headDim))
);
// Softmax converts scores into weights that sum to 1
const attnWeights = softmax(attnLogits);
// Weighted sum of value vectors — high-scoring positions contribute more
for (let j = 0; j < headDim; j++) {
attentionOutput.push(vsum(attnWeights.map((w, t) => w.mul(headValues[t][j]))));
}
}
// Project concatenated head outputs back to the hidden dimension
hidden = linear(attentionOutput, layer.attention.output);
// Residual connection: add back what we had before attention
hidden = hidden.map((h, i) => h.add(beforeAttention[i]));
// --- MLP block: process each token's representation independently ---
const beforeMLP: Value[] = hidden;
hidden = rmsnorm(hidden);
hidden = linear(hidden, layer.mlp.hidden); // expand to 4x wider
hidden = hidden.map((h) => h.relu()); // nonlinearity
hidden = linear(hidden, layer.mlp.output); // compress back
// Residual connection: add back what we had before the MLP
hidden = hidden.map((h, i) => h.add(beforeMLP[i]));
}
// Step 3: Output projection
// Project the final hidden state to vocabulary size — one score per word.
// These raw scores (logits) will be passed through softmax later to get
// a probability distribution over the next token.
return linear(hidden, weights.output);
}
rng.ts
/**
* Seeded Random Number Generator
*
* JavaScript has no built-in seeded RNG, so we roll our own:
* - mulberry32 for uniform random numbers
* - Box-Muller transform for Gaussian (normal) distribution
* - Fisher-Yates for array shuffling
* - Weighted random choice for sampling from probability distributions
*/
let _rngState = 0;
export function seed(s: number): void {
_rngState = s | 0;
}
/** Mulberry32: a fast, seedable 32-bit PRNG. Returns a uniform float in [0, 1). */
export function random(): number {
_rngState = (_rngState + 0x6d2b79f5) | 0;
let t = _rngState;
t = Math.imul(t ^ (t >>> 15), t | 1);
t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
}
/** Box-Muller transform: convert two uniform samples into a Gaussian sample. */
export function gauss(mean: number, std: number): number {
let u1 = random();
while (u1 === 0) u1 = random();
const u2 = random();
return mean + std * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}
/** Fisher-Yates shuffle: randomly reorder an array in-place. */
export function shuffle<T>(arr: T[]): void {
for (let i = arr.length - 1; i > 0; i--) {
const j = Math.floor(random() * (i + 1));
[arr[i], arr[j]] = [arr[j], arr[i]];
}
}
/** Sample an index from a weighted distribution. Used for token sampling during inference. */
export function weightedChoice(weights: number[]): number {
const total = weights.reduce((a, b) => a + b, 0);
let r = random() * total;
for (let i = 0; i < weights.length; i++) {
r -= weights[i];
if (r <= 0) return i;
}
return weights.length - 1;
}
nn.ts
/**
* Neural Network Primitives
*
* The building blocks that the GPT architecture assembles:
* - linear: matrix-vector multiply (the fundamental neural net operation)
* - softmax: convert raw scores into a probability distribution
* - rmsnorm: normalize activations to stabilize training
*
* These mirror PyTorch's torch.nn.functional — general-purpose operations
* used by the model, training loop, and inference.
*/
import { Value, vsum } from "./autograd.js";
export type Matrix = Value[][];
/** y = Wx: multiply a weight matrix by an input vector. */
export function linear(input: Value[], weights: Matrix): Value[] {
return weights.map((row) => vsum(row.map((w, i) => w.mul(input[i]))));
}
/** Convert raw logits to probabilities. Subtracts max for numerical stability. */
export function softmax(logits: Value[]): Value[] {
const maxVal = Math.max(...logits.map((v) => v.data));
const exps = logits.map((v) => v.sub(maxVal).exp());
const total = vsum(exps);
return exps.map((e) => e.div(total));
}
/** Root Mean Square normalization: scale activations to unit root-mean-square. */
export function rmsnorm(input: Value[]): Value[] {
const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
const scale = ms.add(1e-5).pow(-0.5);
return input.map((xi) => xi.mul(scale));
}
autograd.ts
/**
* Autograd Engine
*
* A scalar-valued automatic differentiation engine. Each Value node records:
* - its forward-pass result (data)
* - its gradient w.r.t. the loss (grad), filled in by backward()
* - its children in the computation graph and the local derivatives
*
* The backward() method applies the chain rule via reverse-mode autodiff:
* topologically sort the graph, then propagate gradients from output to inputs.
*
* "If I nudge this parameter slightly, how does the loss change?"
*/
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
exp(): Value {
return new Value(Math.exp(this.data), [this], [Math.exp(this.data)]);
}
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
neg(): Value {
return this.mul(-1);
}
sub(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.add(o.neg());
}
div(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.mul(o.pow(-1));
}
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
}
/** Sum a list of Values through the computation graph. */
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
* Builds a word-level vocabulary from the training corpus and wraps each
* encoded sentence in BOS (Beginning of Sequence) delimiters.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
// Assumes every word is in the vocabulary; an unknown word would map to -1
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
Training
We have data, a tokenizer, a model that can turn tokens into predictions, and the math to compute gradients. Now we train.
Training means showing the model sentences and adjusting its parameters to make better predictions. Each step processes one sentence through five stages:
"the cat eats a muffin"
|
[Tokenize] -> [596, 541, 89, 152, 0, 358, 596]
|
[Feed tokens one at a time through gpt()]
| position 0: feed BOS -> predict 597 scores -> target: "the"
| position 1: feed "the" -> predict 597 scores -> target: "cat"
| position 2: feed "cat" -> predict 597 scores -> target: "eats"
| ...
|
[Compute loss] -> how wrong was each prediction?
| -log(probability of the correct word)
|
[Backward pass] -> loss.backward()
| fills in .grad on all 63,296 parameters
|
[Adam optimizer] -> nudge each parameter to reduce the loss
|
repeat 5,000 times
The gpt() function from the model chapter handles the forward pass — turning
a token into 597 scores. This chapter covers what happens around it: measuring
how wrong those predictions are, computing gradients, and updating the weights.
Before training, the model’s predictions are random noise. After 5,000 steps, it has learned enough about English sentence structure to predict the next word with roughly 100x better accuracy than chance.
The Training Configuration
const model = train(
untrained,
{
numSteps: 5000, // how many sentences to learn from
learningRate: 0.01, // how aggressively to update (decays to 0)
beta1: 0.85, // momentum decay rate
beta2: 0.99, // gradient magnitude decay rate
epsAdam: 1e-8, // numerical stability term
},
sentences,
tokenizer,
);
5,000 steps takes a few minutes on a laptop. Each step processes one sentence, building a full computation graph from tokens to loss, running backpropagation, and updating all parameters.
What Each Parameter Controls
numSteps is how many sentences the model trains on. The training loop
cycles through the dataset one sentence per step (sentences[step % sentences.length]). One full pass through every sentence is called an epoch.
If you had ~400 unique sentences, 5,000 steps would mean roughly 12 epochs —
cycling through the data 12 times. Our dataset has 30,000 sentences, so 5,000
steps covers only the first sixth with no cycling at all. You can increase
numSteps or reduce the dataset size to get more epochs, but be careful:
cycling through the same sentences too many times risks overfitting, where the
model memorizes the training data instead of learning general patterns.
learningRate controls how large each parameter update is. At 0.01, each
step nudges parameters by roughly 1% of the gradient signal. This decays
linearly to 0 over training — large steps early to learn quickly, small steps
later to fine-tune. If the learning rate is too high, the model overshoots and
loss oscillates instead of decreasing. Too low and learning is painfully slow.
beta1 and beta2 control the Adam optimizer’s memory. beta1 = 0.85
means momentum remembers 85% of recent gradient directions and incorporates 15%
of the new one — this smooths out noisy gradients. beta2 = 0.99 means the
magnitude tracker is more conservative, updating slowly to get a stable
estimate of each parameter’s typical gradient size.
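A quick standalone sketch shows the smoothing effect: five toy gradients, one of which points the wrong way, run through the momentum update with beta1 = 0.85. (The gradient sequence here is invented for illustration.)

```typescript
// Momentum smoothing in isolation (toy gradient sequence, not model data).
const beta1 = 0.85;
let m = 0;
const grads = [1, 1, -1, 1, 1]; // step 2 is a noisy outlier
const smoothed = grads.map((g) => (m = beta1 * m + (1 - beta1) * g));
// Even at the outlier step, the smoothed direction stays positive,
// because the history contributes 85% and the new gradient only 15%.
```

This is exactly the "momentum keeps pushing left even if one step says right" behavior described above.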
epsAdam is a tiny constant (0.00000001) added to prevent division by zero
in the Adam update. It never meaningfully affects the training — it is purely a
safety net for numerical stability.
The Training Loop
Each iteration of the training loop processes one sentence through five stages: pick a sentence, feed its tokens through the model, measure the error, compute gradients, and update the weights. Let’s walk through each part, then see the complete loop.
1. Pick a Sentence and Tokenize It
const sentence = sentences[step % sentences.length];
const tokens = tokenizer.encode(sentence);
const n = Math.min(model.config.blockSize, tokens.length - 1);
We cycle through the training sentences using modular arithmetic — step 0 picks
sentence 0, step 1 picks sentence 1, and so on, wrapping back to 0 when we
reach the end. The tokenizer turns “the cat eats a muffin” into
[596, 541, 89, 152, 0, 358, 596] — a BOS marker, then each word, then BOS
again.
The variable n is the number of predictions we will make. We subtract 1 from
the token count because the last token has no “next word” to predict. We also
cap at blockSize (16) because the position embedding table only has 16 rows —
the model cannot handle longer sequences.
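Concretely, for the running example sentence (using the token ids from earlier in this chapter):

```typescript
// Step 1 on the running example sentence.
const blockSize = 16;
// "the cat eats a muffin" -> BOS, the, cat, eats, a, muffin, BOS
const tokens = [596, 541, 89, 152, 0, 358, 596];
// 7 tokens, but the last has no next word to predict, so n = 6
const n = Math.min(blockSize, tokens.length - 1);
console.log(n); // 6
```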
2. Feed Tokens and Predict the Next Word
We create a fresh KV cache for each sentence (each sentence is a standalone training example — we don’t carry context between sentences). Then we process the sentence token by token. At each position, the model sees all previous tokens through the cache and outputs 597 scores as its prediction for what comes next.
Because we have the complete sentence, we already know the right answer at every position. That is what makes training possible: we can measure how wrong each prediction is.
Position 0: feed BOS
context: [BOS]
model predicts → 597 scores
correct answer: "the"
Position 1: feed "the"
context: [BOS, "the"]
model predicts → 597 scores
correct answer: "cat"
Position 2: feed "cat"
context: [BOS, "the", "cat"]
model predicts → 597 scores
correct answer: "eats"
Position 3: feed "eats"
context: [BOS, "the", "cat", "eats"]
model predicts → 597 scores
correct answer: "a"
Position 4: feed "a"
context: [BOS, "the", "cat", "eats", "a"]
model predicts → 597 scores
correct answer: "muffin"
Position 5: feed "muffin"
context: [BOS, "the", "cat", "eats", "a", "muffin"]
model predicts → 597 scores
correct answer: BOS (end of sentence)
Each call to gpt() builds a computation graph of Value nodes — thousands of
multiply, add, and relu operations connected by pointers. All six predictions
within a sentence share the same graph (through the KV cache), so a single
backward() call later can propagate gradients through all of them at once.
Early on, the model’s scores are essentially random — “the” might rank 400th out of 597. The loss measures exactly how bad that is.
3. Compute the Loss
The loss measures how wrong the model is. We use cross-entropy loss: for each position, we convert the logits to probabilities (via softmax), then take the negative log of the probability assigned to the correct next token:
const probs = softmax(logits);
losses.push(probs[targetId].log().neg());
Why negative log? Because it has exactly the shape we want:
- When the model assigns probability 1.0 to the right answer: loss = -log(1) = 0. Perfect — no penalty.
- When the model assigns probability 0.5 — uncertain, but in the ballpark: loss = -log(0.5) = 0.693. A gentle nudge.
- When the model assigns probability 0.01 — almost entirely wrong: loss = -log(0.01) = 4.605. A hard shove.
The penalty grows slowly at first, then accelerates toward infinity. This is what pushes the model to avoid being confidently wrong — a model that puts 90% probability on the wrong word gets punished far more than one that spreads its bets.
| Model assigns to correct word | Loss | Meaning |
|---|---|---|
| 0.90 | 0.105 | Low loss — strong, correct prediction |
| 0.50 | 0.693 | Moderate loss — uncertain but reasonable |
| 0.17 | 1.772 | High loss — mostly wrong |
| 0.01 | 4.605 | Very high loss — confidently wrong |
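The table values are just -log(p), which you can reproduce directly:

```typescript
// Reproducing the loss table (natural log, as in cross-entropy).
const loss = (p: number) => -Math.log(p);
console.log(loss(0.9).toFixed(3));  // "0.105"
console.log(loss(0.5).toFixed(3));  // "0.693"
console.log(loss(0.17).toFixed(3)); // "1.772"
console.log(loss(0.01).toFixed(3)); // "4.605"
```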
The total loss for the sentence is the average across all positions:
const loss = vsum(losses).div(n);
Averaging means the model is equally penalized whether the sentence has 3 words or 15 — otherwise long sentences would dominate the gradient signal and short sentences would barely contribute to learning.
4. Compute Gradients (Backward Pass)
One call to backward() fills in the gradient for every parameter:
loss.backward();
backward() returns nothing — it works entirely through side effects. It walks
the computation graph built during the forward pass (all those Value nodes
connected by pointers) and sets the .grad field on every node that
participated in the computation. After it runs, each parameter’s .grad
answers the question: if I increase this parameter by a tiny amount, how much
does the loss change?
A positive gradient means “increasing this parameter makes the loss worse” — so the optimizer will decrease it. A negative gradient means “increasing this parameter makes the loss better” — so the optimizer will increase it. A large gradient means this parameter has a big effect on the current prediction; a small gradient means it barely matters for this sentence.
Sparse Embedding Gradients
Not all 63,296 parameters get meaningful gradients from every sentence. The
layer weights (attention, MLP, output projection) are used in full on every
forward pass — linear() computes a dot product with every row — so they
always receive gradients. But embedding lookup is just array indexing: only the
rows for tokens that appear in the current sentence enter the computation graph.
For a 7-token sentence like “the cat eats a muffin”, only 6 unique token
embedding rows are touched (plus 6 position embedding rows). The remaining 591
token embedding rows keep their .grad at 0 and are not updated this step.
Over the full training run, each word appears in many sentences, so every
embedding row eventually gets trained — but frequent words get far more updates
than rare ones.
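You can count the touched rows directly from the token ids of the running example:

```typescript
// Counting touched embedding rows for the example sentence.
const vocabSize = 597;
const tokens = [596, 541, 89, 152, 0, 358, 596]; // BOS the cat eats a muffin BOS
const fed = tokens.slice(0, -1); // the 6 tokens actually fed to gpt()
const touched = new Set(fed).size;
console.log(touched, vocabSize - touched); // 6 591
```

Only those 6 token embedding rows enter the graph this step; the other 591 receive no gradient.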
5. Update Parameters (Adam Optimizer)
The simplest approach would be param -= learningRate * grad — nudge each
parameter in the direction opposite its gradient. But this has two problems:
- Noisy gradients. Each step only sees one sentence, so the gradient for any single step might point in a slightly wrong direction. We want to smooth out that noise by remembering which direction gradients have been pointing recently.
- One size doesn’t fit all. Some parameters have consistently large gradients (they are sensitive), others have small ones (they are stable). A single learning rate either over-shoots the sensitive ones or under-shoots the stable ones.
The Adam optimizer solves both problems. It maintains two running averages for each parameter:
mBuf[i] = beta1 * mBuf[i] + (1 - beta1) * params[i].grad;
vBuf[i] = beta2 * vBuf[i] + (1 - beta2) * params[i].grad ** 2;
- mBuf (momentum): a running average of the gradient direction. With beta1 = 0.85, each new gradient contributes 15% and the history contributes 85%. This smooths out noise — if the gradient has been pointing left for several steps, the momentum keeps pushing left even if one step says right.
- vBuf (velocity): a running average of the squared gradient magnitude. This tracks how large the gradients have been for this parameter. Parameters with consistently large gradients get a smaller effective step size, and parameters with small gradients get a larger one.
Bias Correction
At the start of training, both mBuf and vBuf are initialized to zero. That
makes the early running averages severely biased toward zero — after one step,
mBuf is only 15% of the first gradient. The bias correction fixes this:
const mHat = mBuf[i] / (1 - beta1 ** (step + 1));
const vHat = vBuf[i] / (1 - beta2 ** (step + 1));
At step 0: 1 - 0.85^1 = 0.15, so mHat is mBuf divided by 0.15 — effectively
undoing the 85% shrinkage. By step 20, 1 - 0.85^21 ≈ 0.97, and the
correction is negligible. This ensures the early updates are the right size
from the start.
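The correction factors are plain arithmetic and easy to check:

```typescript
// Bias-correction denominators at a few steps (beta1 = 0.85).
const beta1 = 0.85;
const mCorrection = (step: number) => 1 - beta1 ** (step + 1);
console.log(mCorrection(0).toFixed(2));  // "0.15"
console.log(mCorrection(20).toFixed(2)); // "0.97"
```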
The Actual Update
params[i].data -= lrT * mHat / (Math.sqrt(vHat) + epsAdam);
params[i].grad = 0; // reset for next step
The update divides the smoothed gradient (mHat) by the square root of the
smoothed squared gradient (vHat). This ratio normalizes the step: parameters
where gradients are consistently large take proportionally smaller steps.
The epsAdam (1e-8) prevents division by zero when vHat is tiny.
Learning Rate Decay
The learning rate also decays linearly from 0.01 to 0 over the course of training:
const lrT = learningRate * (1 - step / numSteps);
At step 0, lrT = 0.01. At step 2500 (halfway), lrT = 0.005. At step 4999,
lrT ≈ 0. Large steps early let the model learn quickly. Small steps later
let it fine-tune without overshooting.
Finally, we reset the gradient to zero so the next sentence starts with a clean slate. Without this, gradients would accumulate across sentences, and each update would be influenced by all previous sentences rather than just the current one.
The Complete Loop
Here is the full training loop with all five stages together:
for (let step = 0; step < numSteps; step++) {
// --- Step 1: Pick a sentence and tokenize ---
// Cycle through training data; the tokenizer wraps the sentence with BOS markers
const sentence = sentences[step % sentences.length];
const tokens = tokenizer.encode(sentence);
const n = Math.min(model.config.blockSize, tokens.length - 1);
// --- Step 2: Forward pass ---
// Feed tokens one at a time, building a computation graph.
// Each gpt() call adds to the KV cache so later tokens can attend to earlier ones.
const { keys, values } = createKVCache(model);
const losses: Value[] = [];
for (let posId = 0; posId < n; posId++) {
const tokenId = tokens[posId];
const targetId = tokens[posId + 1]; // the word we want the model to predict
const logits = gpt(model, tokenId, posId, keys, values);
const probs = softmax(logits);
// Cross-entropy loss: -log(probability assigned to the correct word)
// High probability on the right word → low loss; low probability → high loss
losses.push(probs[targetId].log().neg());
}
// Average loss over the sentence so short and long sentences contribute equally
const loss = vsum(losses).div(n);
// --- Step 3: Backward pass ---
// Walk the computation graph and fill in .grad on every Value node.
// After this, each parameter knows how much it contributed to the loss.
loss.backward();
// --- Step 4: Adam optimizer with linear learning rate decay ---
// Large steps early (lr = 0.01) to learn quickly, decaying to 0 to fine-tune
const lrT = learningRate * (1 - step / numSteps);
for (let i = 0; i < params.length; i++) {
// Update momentum (smoothed gradient direction)
mBuf[i] = beta1 * mBuf[i] + (1 - beta1) * params[i].grad;
// Update velocity (smoothed gradient magnitude)
vBuf[i] = beta2 * vBuf[i] + (1 - beta2) * params[i].grad ** 2;
// Bias correction: counteract the zero-initialization of mBuf/vBuf
const mHat = mBuf[i] / (1 - beta1 ** (step + 1));
const vHat = vBuf[i] / (1 - beta2 ** (step + 1));
// Update: step in the smoothed gradient direction, scaled by the inverse
// of the gradient magnitude so sensitive parameters take smaller steps
params[i].data -= lrT * mHat / (Math.sqrt(vHat) + epsAdam);
params[i].grad = 0; // reset for next sentence
}
}
Every operation in the forward pass uses Value nodes, so the entire path from
token to loss is a single computation graph. When backward() runs, gradients
flow back through every step — from the loss, through softmax, through every
layer’s attention and MLP blocks, all the way to the embedding lookup. That is
the whole training loop: predict, measure, learn, repeat.
Watching It Learn
When training runs, the loss steadily decreases:
step 1 / 5000 | loss 6.3917
step 500 / 5000 | loss 3.2184
step 1000 / 5000 | loss 2.8549
step 2000 / 5000 | loss 2.4012
step 3000 / 5000 | loss 2.1538
step 4000 / 5000 | loss 1.9107
step 5000 / 5000 | loss 1.7623
What the Numbers Mean
The initial loss of ~6.39 corresponds to random guessing: -log(1/597) ≈ 6.39.
The model has no idea which of the 597 words comes next, so it assigns roughly
equal probability to all of them.
As training progresses, the model learns patterns — “the” often follows BOS, “cat” often follows “the”, verbs follow nouns — and the loss drops. Here is what each milestone roughly means:
| Loss | Probability on correct word | What’s happening |
|---|---|---|
| 6.39 | 0.17% (1/597) | Random guessing — no knowledge |
| 3.22 | 4.0% | Learning common words and sentence starters |
| 2.85 | 5.8% | Picking up basic word-order patterns |
| 2.40 | 9.1% | Noun-verb associations forming |
| 1.76 | 17.2% | Solid grasp of sentence structure |
A loss of ~1.76 means the model is, on average, assigning about e^(-1.76) ≈ 17% probability to the correct next word. That may sound low, but remember:
it is choosing among 597 words. Assigning 17% to the right answer means it has
narrowed the field from 597 equally likely candidates to roughly 6 plausible
ones. That is a 100x improvement over where it started.
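The table's probability column is just e to the minus loss, and the "plausible candidates" figure is its reciprocal (the perplexity). A few lines reproduce the numbers:

```typescript
// Convert a cross-entropy loss to the average probability assigned to the
// correct word (e^-loss) and to an effective candidate count (1 / e^-loss).
for (const loss of [6.39, 3.22, 2.85, 2.4, 1.76]) {
  const p = Math.exp(-loss);
  console.log(`loss ${loss} -> ${(100 * p).toFixed(1)}% correct, ~${Math.round(1 / p)} candidates`);
}
```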
Where the Learning Happens
The steepest drop occurs in the first 1,000 steps. This is when the model learns the easiest patterns: which words are common, which words tend to start sentences, and basic category associations (articles before nouns, nouns before verbs). The later steps produce smaller improvements as the model works on harder patterns — less frequent words, longer-range dependencies, and the subtle differences between similar sentence structures.
This is typical of neural network training: easy patterns are learned first, and each subsequent improvement is harder to achieve than the last.
Complete Code
Here is the complete source code for the training loop, along with all files from previous chapters.
train.ts
/**
* Training Loop
*
* Each step processes one sentence through five stages:
* 1. Pick a sentence and tokenize it
* 2. Forward: feed tokens one at a time, predicting the next token at each position
* 3. Compute cross-entropy loss (how surprised the model is at the correct answer)
* 4. Backward: compute gradients via the chain rule
* 5. Adam optimizer: update parameters to reduce loss
*
* Loss starts at ~6.39 (random guessing among 597 tokens = -log(1/597))
* and decreases to ~1.76 over 5,000 steps.
*/
import { Value, vsum } from "./autograd.js";
import { softmax } from "./nn.js";
import { type Model, gpt, createKVCache } from "./model.js";
import type { Tokenizer } from "./tokenizer.js";
export interface TrainConfig {
numSteps: number;
learningRate: number;
beta1: number;
beta2: number;
epsAdam: number;
}
/** Train the model on the dataset and return the trained model. */
export function train(
model: Model,
trainConfig: TrainConfig,
sentences: string[],
tokenizer: Tokenizer,
): Model {
const { numSteps, learningRate, beta1, beta2, epsAdam } = trainConfig;
const { params } = model;
// Adam optimizer maintains two running averages per parameter:
// mBuf: smoothed gradient direction (momentum)
// vBuf: smoothed gradient magnitude (for adaptive step sizes)
const mBuf = new Float64Array(params.length);
const vBuf = new Float64Array(params.length);
for (let step = 0; step < numSteps; step++) {
// --- Step 1: Pick a sentence and tokenize ---
// Cycle through training data; the tokenizer wraps the sentence with BOS markers
const sentence = sentences[step % sentences.length];
const tokens = tokenizer.encode(sentence);
const n = Math.min(model.config.blockSize, tokens.length - 1);
// --- Step 2: Forward pass ---
// Feed tokens one at a time, building a computation graph.
// Each gpt() call adds to the KV cache so later tokens can attend to earlier ones.
const { keys, values } = createKVCache(model);
const losses: Value[] = [];
for (let posId = 0; posId < n; posId++) {
const tokenId = tokens[posId];
const targetId = tokens[posId + 1]; // the word we want the model to predict
const logits = gpt(model, tokenId, posId, keys, values);
const probs = softmax(logits);
// Cross-entropy loss: -log(probability assigned to the correct word)
// High probability on the right word → low loss; low probability → high loss
losses.push(probs[targetId].log().neg());
}
// Average loss over the sentence so short and long sentences contribute equally
const loss = vsum(losses).div(n);
// --- Step 3: Backward pass ---
// Walk the computation graph and fill in .grad on every Value node.
// After this, each parameter knows how much it contributed to the loss.
loss.backward();
// --- Step 4: Adam optimizer with linear learning rate decay ---
// Large steps early (lr = 0.01) to learn quickly, decaying to 0 to fine-tune
const lrT = learningRate * (1 - step / numSteps);
for (let i = 0; i < params.length; i++) {
// Update momentum (smoothed gradient direction)
mBuf[i] = beta1 * mBuf[i] + (1 - beta1) * params[i].grad;
// Update velocity (smoothed gradient magnitude)
vBuf[i] = beta2 * vBuf[i] + (1 - beta2) * params[i].grad ** 2;
// Bias correction: counteract the zero-initialization of mBuf/vBuf
const mHat = mBuf[i] / (1 - beta1 ** (step + 1));
const vHat = vBuf[i] / (1 - beta2 ** (step + 1));
// Update: step in the smoothed gradient direction, scaled by the inverse
// of the gradient magnitude so sensitive parameters take smaller steps
params[i].data -= lrT * mHat / (Math.sqrt(vHat) + epsAdam);
params[i].grad = 0; // reset for next sentence
}
process.stdout.write(
`\rstep ${String(step + 1).padStart(4)} / ${numSteps} | loss ${loss.data.toFixed(4)}`
);
}
return model;
}
model.ts
/**
* GPT Model
*
* The transformer architecture: a function that maps input tokens to a
* probability distribution over what comes next.
*
* Follows GPT-2 with minor simplifications:
* - RMSNorm instead of LayerNorm
* - No biases
* - ReLU instead of GeLU
*
* Config: 32 embedding dims, 4 attention heads, 2 layers, 16 max context
* → 63,296 parameters total
*/
import { readFileSync, writeFileSync } from "node:fs";
import { Value, vsum } from "./autograd.js";
import { type Matrix, linear, softmax, rmsnorm } from "./nn.js";
import { gauss } from "./rng.js";
// --- Configuration ---
export interface GPTConfig {
nLayer: number;
nEmbd: number;
blockSize: number;
nHead: number;
headDim: number;
vocabSize: number;
}
interface Attention {
query: Matrix;
key: Matrix;
value: Matrix;
output: Matrix;
}
interface MLP {
hidden: Matrix;
output: Matrix;
}
interface Layer {
attention: Attention;
mlp: MLP;
}
export interface Weights {
tokenEmbedding: Matrix;
positionEmbedding: Matrix;
output: Matrix;
layers: Layer[];
}
/** The trained (or untrained) model: config + weights + flattened parameter list. */
export interface Model {
config: GPTConfig;
weights: Weights;
params: Value[];
}
// --- Model Creation ---
function matrix(nout: number, nin: number, std = 0.08): Matrix {
return Array.from({ length: nout }, () =>
Array.from({ length: nin }, () => new Value(gauss(0, std)))
);
}
/** Create a new model with randomly initialized weights. */
export function createModel(config: GPTConfig): Model {
const { nEmbd, nLayer, blockSize, vocabSize } = config;
const weights: Weights = {
tokenEmbedding: matrix(vocabSize, nEmbd),
positionEmbedding: matrix(blockSize, nEmbd),
output: matrix(vocabSize, nEmbd),
layers: Array.from({ length: nLayer }, () => ({
attention: {
query: matrix(nEmbd, nEmbd),
key: matrix(nEmbd, nEmbd),
value: matrix(nEmbd, nEmbd),
output: matrix(nEmbd, nEmbd),
},
mlp: {
hidden: matrix(4 * nEmbd, nEmbd),
output: matrix(nEmbd, 4 * nEmbd),
},
})),
};
const allMatrices: Matrix[] = [
weights.tokenEmbedding,
weights.positionEmbedding,
weights.output,
...weights.layers.flatMap((layer) => [
layer.attention.query, layer.attention.key,
layer.attention.value, layer.attention.output,
layer.mlp.hidden, layer.mlp.output,
]),
];
const params = allMatrices.flatMap((mat) => mat.flatMap((row) => row));
return { config, weights, params };
}
// --- Save / Load ---
/** Save a trained model to a JSON file (config + parameter values). */
export function saveModel(model: Model, path: string): void {
const data = {
config: model.config,
weights: model.params.map((p) => p.data),
};
writeFileSync(path, JSON.stringify(data));
}
/** Load a model from a JSON file. Recreates the structure, then fills in the learned weights. */
export function loadModel(path: string): Model {
const data = JSON.parse(readFileSync(path, "utf-8"));
const model = createModel(data.config);
for (let i = 0; i < model.params.length; i++) {
model.params[i].data = data.weights[i];
}
return model;
}
// --- KV Cache ---
/** Create fresh key/value caches for a new sequence. Must be called per-sequence. */
export function createKVCache(model: Model): {
keys: Value[][][];
values: Value[][][];
} {
return {
keys: Array.from({ length: model.config.nLayer }, () => []),
values: Array.from({ length: model.config.nLayer }, () => []),
};
}
// --- Forward Pass ---
/**
* Run one step of the GPT: given a token at a position, return logits
* over the vocabulary for the next token.
*
* The keys/values caches are mutated (appended to) on each call —
* this is the KV cache that avoids recomputing attention for past positions.
*/
export function gpt(
model: Model,
tokenId: number,
posId: number,
keys: Value[][][],
values: Value[][][],
): Value[] {
const { nLayer, nHead, headDim } = model.config;
const { weights } = model;
// Step 1: Embedding lookup
// Combine "what word is this?" with "where does it appear?" into a single
// vector. This is the hidden state that flows through the rest of the network.
const tokenVector: Value[] = weights.tokenEmbedding[tokenId];
const positionVector: Value[] = weights.positionEmbedding[posId];
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i]));
// Normalize before the first layer to keep values at a stable scale
hidden = rmsnorm(hidden);
// Step 2: Transformer layers
// Each layer has two blocks: attention (gather context from other tokens)
// followed by MLP (process the gathered information). Both use residual
// connections so the input is added back to the output of each block.
for (let li = 0; li < nLayer; li++) {
const layer = weights.layers[li];
// --- Attention block: look at previous tokens to gather context ---
// Save the hidden state so we can add it back after the block (residual)
const beforeAttention: Value[] = hidden;
hidden = rmsnorm(hidden);
// Project the hidden state into query, key, and value vectors.
// These are structurally identical projections — training teaches them
// to play different roles: query asks "what am I looking for?",
// key advertises "what do I contain?", value carries "what to retrieve".
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
// Cache the key and value so future tokens can attend to this position
keys[li].push(key);
values[li].push(value);
// Each head independently attends to a different slice of the vectors,
// allowing the model to track multiple relationships at once
const attentionOutput: Value[] = [];
for (let h = 0; h < nHead; h++) {
const headStart = h * headDim;
const headQuery = query.slice(headStart, headStart + headDim);
const headKeys = keys[li].map((ki) => ki.slice(headStart, headStart + headDim));
const headValues = values[li].map((vi) => vi.slice(headStart, headStart + headDim));
// Scaled dot-product attention: score = (query · key) / √headDim
const attnLogits = headKeys.map((cachedKey) =>
vsum(headQuery.map((q, j) => q.mul(cachedKey[j]))).div(Math.sqrt(headDim))
);
// Softmax converts scores into weights that sum to 1
const attnWeights = softmax(attnLogits);
// Weighted sum of value vectors — high-scoring positions contribute more
for (let j = 0; j < headDim; j++) {
attentionOutput.push(vsum(attnWeights.map((w, t) => w.mul(headValues[t][j]))));
}
}
// Project concatenated head outputs back to the hidden dimension
hidden = linear(attentionOutput, layer.attention.output);
// Residual connection: add back what we had before attention
hidden = hidden.map((h, i) => h.add(beforeAttention[i]));
// --- MLP block: process each token's representation independently ---
const beforeMLP: Value[] = hidden;
hidden = rmsnorm(hidden);
hidden = linear(hidden, layer.mlp.hidden); // expand to 4x wider
hidden = hidden.map((h) => h.relu()); // nonlinearity
hidden = linear(hidden, layer.mlp.output); // compress back
// Residual connection: add back what we had before the MLP
hidden = hidden.map((h, i) => h.add(beforeMLP[i]));
}
// Step 3: Output projection
// Project the final hidden state to vocabulary size — one score per word.
// These raw scores (logits) will be passed through softmax later to get
// a probability distribution over the next token.
return linear(hidden, weights.output);
}
rng.ts
/**
* Seeded Random Number Generator
*
* JavaScript has no built-in seeded RNG, so we roll our own:
* - mulberry32 for uniform random numbers
* - Box-Muller transform for Gaussian (normal) distribution
* - Fisher-Yates for array shuffling
* - Weighted random choice for sampling from probability distributions
*/
let _rngState = 0;
export function seed(s: number): void {
_rngState = s | 0;
}
/** Mulberry32: a fast, seedable 32-bit PRNG. Returns a uniform float in [0, 1). */
export function random(): number {
_rngState = (_rngState + 0x6d2b79f5) | 0;
let t = _rngState;
t = Math.imul(t ^ (t >>> 15), t | 1);
t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
}
/** Box-Muller transform: convert two uniform samples into a Gaussian sample. */
export function gauss(mean: number, std: number): number {
let u1 = random();
while (u1 === 0) u1 = random();
const u2 = random();
return mean + std * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}
/** Fisher-Yates shuffle: randomly reorder an array in-place. */
export function shuffle<T>(arr: T[]): void {
for (let i = arr.length - 1; i > 0; i--) {
const j = Math.floor(random() * (i + 1));
[arr[i], arr[j]] = [arr[j], arr[i]];
}
}
/** Sample an index from a weighted distribution. Used for token sampling during inference. */
export function weightedChoice(weights: number[]): number {
const total = weights.reduce((a, b) => a + b, 0);
let r = random() * total;
for (let i = 0; i < weights.length; i++) {
r -= weights[i];
if (r <= 0) return i;
}
return weights.length - 1;
}
nn.ts
/**
* Neural Network Primitives
*
* The building blocks that the GPT architecture assembles:
* - linear: matrix-vector multiply (the fundamental neural net operation)
* - softmax: convert raw scores into a probability distribution
* - rmsnorm: normalize activations to stabilize training
*
* These mirror PyTorch's torch.nn.functional — general-purpose operations
* used by the model, training loop, and inference.
*/
import { Value, vsum } from "./autograd.js";
export type Matrix = Value[][];
/** y = Wx: multiply a weight matrix by an input vector. */
export function linear(input: Value[], weights: Matrix): Value[] {
return weights.map((row) => vsum(row.map((w, i) => w.mul(input[i]))));
}
/** Convert raw logits to probabilities. Subtracts max for numerical stability. */
export function softmax(logits: Value[]): Value[] {
const maxVal = Math.max(...logits.map((v) => v.data));
const exps = logits.map((v) => v.sub(maxVal).exp());
const total = vsum(exps);
return exps.map((e) => e.div(total));
}
/** Root Mean Square normalization: scale activations to unit root-mean-square. */
export function rmsnorm(input: Value[]): Value[] {
const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
const scale = ms.add(1e-5).pow(-0.5);
return input.map((xi) => xi.mul(scale));
}
autograd.ts
/**
* Autograd Engine
*
* A scalar-valued automatic differentiation engine. Each Value node records:
* - its forward-pass result (data)
* - its gradient w.r.t. the loss (grad), filled in by backward()
* - its children in the computation graph and the local derivatives
*
* The backward() method applies the chain rule via reverse-mode autodiff:
* topologically sort the graph, then propagate gradients from output to inputs.
*
* "If I nudge this parameter slightly, how does the loss change?"
*/
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
exp(): Value {
return new Value(Math.exp(this.data), [this], [Math.exp(this.data)]);
}
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
neg(): Value {
return this.mul(-1);
}
sub(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.add(o.neg());
}
div(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.mul(o.pow(-1));
}
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
}
/** Sum a list of Values through the computation graph. */
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
* Builds a word-level vocabulary from the training corpus, with a BOS
 * (Beginning of Sequence) delimiter added on each side.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
Saving the Model
Training is expensive. We do not want to retrain every time we want to generate sentences. So we save the trained model to disk and load it later.
What Gets Saved
A model is just configuration (the shape) plus parameters (the learned numbers). We save both as JSON:
export function saveModel(model: Model, path: string): void {
const data = {
config: model.config,
weights: model.params.map((p) => p.data),
};
writeFileSync(path, JSON.stringify(data));
}
The config tells us the architecture (nLayer: 2, nEmbd: 32, ...) and the
weights array holds all 63,296 parameter values as plain numbers. The file is
about 1.5 MB of JSON.
Loading the Model
To load, we recreate the model structure from the config, then fill in the learned weights:
export function loadModel(path: string): Model {
const data = JSON.parse(readFileSync(path, "utf-8"));
const model = createModel(data.config);
for (let i = 0; i < model.params.length; i++) {
model.params[i].data = data.weights[i];
}
return model;
}
createModel(data.config) allocates all the weight matrices with random values
(just like before training), then we overwrite every parameter with the learned
value. The ordering is deterministic (createModel always creates matrices in
the same order), so the flat weights array maps back to the correct matrices.
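The same idea works in miniature. A standalone sketch (hypothetical `FlatModel` type, not from the tutorial's files): as long as flattening and rebuilding visit the matrix in the same order, the round trip is lossless.

```typescript
// Miniature version of saveModel/loadModel: a matrix becomes (shape + flat
// array), and deterministic ordering makes the round trip exact.
interface FlatModel {
  rows: number;
  cols: number;
  weights: number[];
}

function save(m: number[][]): FlatModel {
  return { rows: m.length, cols: m[0].length, weights: m.flat() };
}

function load(f: FlatModel): number[][] {
  return Array.from({ length: f.rows }, (_, r) =>
    f.weights.slice(r * f.cols, (r + 1) * f.cols),
  );
}

const original = [[0.1, -0.3], [0.8, 0.02]];
const restored = load(save(original));
// restored is element-for-element identical to original
```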
Why This Matters
This separation (train once, generate many times) is how all production LLMs work. Training a large model can cost hundreds of millions of dollars in compute. But once trained, the model weights are saved and can be loaded for inference on any machine. The training code, the optimizer state, the gradient buffers: none of that is needed for inference. Just the architecture and the numbers.
Training (expensive, done once):
train -> phrases-model.json
Inference (cheap, done many times):
phrases-model.json -> generate -> sentences
We will wire up these scripts in the smoke test chapter.
Generation
The model is trained. The weights are saved. Now we load them and generate new sentences.
Generation works one word at a time. Start with the BOS token, feed it to the model, get 597 scores back, pick a word, feed that word back in, and repeat. Each step asks the same question: given everything so far, what word comes next? The output of one step becomes the input to the next — this is called autoregressive generation.
The interesting problem is how to pick the next word. The model outputs 597 raw scores (logits), but choosing the highest-scoring word every time produces dull, repetitive text. Choosing completely at random produces nonsense. The sampling strategy — temperature, top-k, and top-p — controls where the output lands between those extremes.
BOS → model → 597 logits → temperature → top-k → top-p → softmax → sample → next token
 ↑                                                                               │
 └──────────────── the sampled token feeds back in as the next input ────────────┘
This chapter covers the generation loop, the KV cache that makes it efficient, and the sampling knobs that shape the output.
The Generation Loop
The loop starts with the BOS token and repeatedly asks the model for the next word:
const { keys, values } = createKVCache(model); // fresh KV cache per sequence
let tokenId = tokenizer.BOS; // start with BOS
const tokens: number[] = [];
for (let posId = 0; posId < model.config.blockSize; posId++) {
const logits = gpt(model, tokenId, posId, keys, values);
// 1. Temperature scaling
let scores = logits.map((l) => l.data / temperature);
// 2. Top-k: keep only the k highest scores
if (topK > 0 && topK < scores.length) {
const sorted = [...scores].sort((a, b) => b - a);
const cutoff = sorted[topK - 1]; // the k-th highest score
scores = scores.map((s) => (s >= cutoff ? s : -Infinity));
}
// 3. Top-p: keep smallest set whose probabilities sum to p
if (topP < 1.0) { /* accumulate sorted probs, zero out the rest */ }
// 4. Softmax and sample (plain math — no autograd needed at inference)
const maxS = Math.max(...scores.filter((s) => s !== -Infinity));
const exps = scores.map((s) => (s === -Infinity ? 0 : Math.exp(s - maxS)));
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((e) => e / total);
tokenId = weightedChoice(probs);
if (tokenId === tokenizer.BOS) break;
tokens.push(tokenId);
}
return tokenizer.decode(tokens);
The full pipeline for turning logits into a sampled token:
597 logits -> temperature scaling -> top-k filter -> top-p filter -> softmax -> weighted sample
Step by Step
1. Forward pass. Feed the current token and position into the model. It returns 597 logits, one raw score per word in the vocabulary.
2. Temperature scaling. Divide all logits by the temperature. This controls how “creative” the model is:
| Temperature | Effect |
|---|---|
| 0.3 (low) | Sharper probabilities. Model picks the most likely word almost every time. Repetitive but safe. |
| 0.8 (medium) | Balanced. Some variety, mostly coherent. |
| 1.5 (high) | Flatter probabilities. Model considers unlikely words. More creative but may produce nonsense. |
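To see the effect on concrete numbers, here is a standalone sketch (plain-number softmax, not the tutorial's `nn.ts`) of the same logits at three temperatures:

```typescript
// Temperature scaling: dividing logits by T < 1 sharpens the softmax,
// dividing by T > 1 flattens it.
function softmaxNums(xs: number[]): number[] {
  const max = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const logits = [2.0, 1.0, 0.1];
for (const t of [0.3, 0.8, 1.5]) {
  const probs = softmaxNums(logits.map((l) => l / t));
  console.log(`T=${t}:`, probs.map((p) => p.toFixed(2)).join(" "));
}
```

At T=0.3 the top word takes almost all of the probability mass; at T=1.5 the three options are much closer together.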
3. Top-k filter. If topK is set (e.g., 10), sort the scores and set
everything outside the top k to negative infinity. Those words get zero
probability after softmax and can never be chosen.
597 scores -> keep top 10 -> set 587 to -infinity
This hard cutoff prevents the model from ever picking a very unlikely word. The downside is that a fixed k does not adapt. Sometimes the model is very confident and only 2-3 words make sense; other times, 50 words are plausible.
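On a concrete score vector (a standalone sketch, not the tutorial's `generate.ts`), the filter looks like this:

```typescript
// Top-k filter: scores below the k-th highest become -Infinity, so they
// receive zero probability after softmax and can never be sampled.
function topKFilter(scores: number[], k: number): number[] {
  const sorted = [...scores].sort((a, b) => b - a);
  const cutoff = sorted[k - 1]; // the k-th highest score
  return scores.map((s) => (s >= cutoff ? s : -Infinity));
}

const filtered = topKFilter([3.1, -0.2, 1.7, 0.4, 2.2], 3);
// survivors: 3.1, 1.7, 2.2; the other two scores become -Infinity
```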
4. Top-p filter (nucleus sampling). If topP is set (e.g., 0.9),
compute probabilities from the current scores, sort by probability, and
accumulate until the sum reaches p. Zero out everything past the cutoff.
scores -> softmax -> sort -> accumulate until sum >= 0.9 -> zero out the rest
If the model is confident (“the” has probability 0.85), maybe only 2-3 words survive. If the model is uncertain and the distribution is flat, maybe 50 words survive. Top-p adapts to the model’s confidence at each position, which is why it is generally preferred over top-k.
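A standalone sketch of the adaptive behavior (operating directly on probabilities for clarity; the tutorial's `generate.ts` filters scores instead): given a confident distribution, only two entries survive a p of 0.9.

```typescript
// Nucleus (top-p) filter on a probability vector: keep the smallest set of
// highest-probability entries whose cumulative sum exceeds p; zero the rest.
function topPFilter(probs: number[], p: number): number[] {
  const order = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const kept = new Set<number>();
  let cum = 0;
  for (const i of order) {
    kept.add(i); // the entry that crosses the threshold is itself kept
    cum += probs[i];
    if (cum > p) break;
  }
  return probs.map((q, i) => (kept.has(i) ? q : 0));
}

const confident = topPFilter([0.85, 0.07, 0.05, 0.02, 0.01], 0.9);
// only 0.85 and 0.07 survive (0.85 alone does not exceed 0.9)
```

With a flatter input, say five entries of 0.2 each, the same call would keep five entries: the cutoff adapts to the distribution.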
5. Softmax and sample. Convert the filtered scores to a probability distribution and randomly pick a word weighted by those probabilities. The word “the” might have probability 0.3 and “a” might have 0.15. The model does not always pick the top choice, which gives the output variety.
6. Check for BOS. If the sampled token is BOS, the sentence is done. Otherwise, feed this new token back in and repeat.
7. Decode. Convert the collected token IDs back to words.
Comparing Sampling Strategies
| Strategy | What it does | Trade-off |
|---|---|---|
| Temperature | Reshapes the entire distribution | Simple, but low-probability junk words can still be picked |
| Top-k | Hard cutoff at k words | Fixed k does not adapt to model confidence |
| Top-p | Adaptive cutoff by cumulative probability | Handles both confident and uncertain positions well |
In practice, these are combined. A typical production configuration might use
temperature=0.7, top_p=0.9, and top_k=50 together. Temperature reshapes
the distribution, then top-k and top-p trim the tail. Our implementation
supports all three, individually or combined.
The KV Cache
Generation uses the same KV cache introduced in the model chapter. We
create a fresh cache before generating and pass it to every gpt() call:
const { keys, values } = createKVCache(model);
Each call appends the current token’s key and value vectors to the cache, so the next call can attend to everything generated so far. The model never reprocesses old tokens — it just reads their cached K/V vectors.
During training, the KV cache saves redundant work within a single sentence. During generation, it is essential: without it, generating a 16-token sentence would mean reprocessing the entire sequence from scratch at every position — 1 + 2 + 3 + … + 16 = 136 forward passes instead of 16.
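The arithmetic generalizes: without the cache, generating n tokens costs n(n+1)/2 token-forwards, versus n with it. A quick check of the claim above:

```typescript
// Forward-pass count for an n-token sequence without a KV cache: position t
// must reprocess all t tokens seen so far, so the total is 1 + 2 + ... + n.
function forwardsWithoutCache(n: number): number {
  let total = 0;
  for (let t = 1; t <= n; t++) total += t;
  return total;
}

console.log(forwardsWithoutCache(16)); // 136, versus 16 with the cache
```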
Example Output
With temperature 0.8, the trained model produces sentences like:
1. the brown horse sits
2. we fly to the store
3. the girl reads a card
4. mom cooked the muffin
5. the kitten runs fast
6. seven ducks swim at dusk
7. i like the tall tree
8. the cow eats
These are not sentences from the training data. The model invented them. But they follow the same patterns: articles before nouns, verbs in the right place, reasonable word combinations. The model learned the structure of simple English from just 5,000 training steps.
Complete Code
Here is the complete source code for text generation, along with all files from previous chapters.
generate.ts
/**
* Inference / Text Generation
*
* Starting from the BOS token, the model predicts one token at a time:
* 1. Forward the current token through the model to get logits
* 2. Apply temperature scaling (lower = more conservative, higher = more creative)
* 3. Apply top-k filtering (keep only the k most likely tokens)
* 4. Apply top-p / nucleus filtering (keep smallest set summing to p)
* 5. Convert to probabilities via softmax
* 6. Randomly sample the next token from that distribution
* 7. Stop when BOS is produced (end of sequence) or max length is reached
*/
import { type Model, gpt, createKVCache } from "./model.js";
import { weightedChoice } from "./rng.js";
import type { Tokenizer } from "./tokenizer.js";
export interface GenerateOptions {
temperature?: number;
topK?: number;
topP?: number;
}
/** Generate new text samples from a trained model. */
export function generate(
model: Model,
tokenizer: Tokenizer,
numSamples: number,
options: GenerateOptions = {},
): string[] {
const { temperature = 0.8, topK = 0, topP = 1.0 } = options;
const samples: string[] = [];
for (let i = 0; i < numSamples; i++) {
const { keys, values } = createKVCache(model);
let tokenId = tokenizer.BOS;
const tokens: number[] = [];
for (let posId = 0; posId < model.config.blockSize; posId++) {
const logits = gpt(model, tokenId, posId, keys, values);
// Temperature scaling
let scores = logits.map((l) => l.data / temperature);
// Top-k: keep only the k highest scores
if (topK > 0 && topK < scores.length) {
const sorted = [...scores].sort((a, b) => b - a);
const cutoff = sorted[topK - 1];
scores = scores.map((s) => (s >= cutoff ? s : -Infinity));
}
// Top-p (nucleus): keep smallest set whose probabilities sum to p
if (topP < 1.0) {
const indices = scores.map((s, idx) => idx);
indices.sort((a, b) => scores[b] - scores[a]);
// Compute softmax on current scores to get probabilities for filtering
const maxS = Math.max(...scores.filter((s) => s !== -Infinity));
const exps = scores.map((s) => (s === -Infinity ? 0 : Math.exp(s - maxS)));
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((e) => e / total);
let cumSum = 0;
let exceeded = false;
for (const idx of indices) {
if (exceeded) {
scores[idx] = -Infinity;
}
cumSum += probs[idx];
if (cumSum > topP) {
exceeded = true;
}
}
}
// Softmax and sample (using plain numbers, not Value, for efficiency)
const maxS = Math.max(...scores.filter((s) => s !== -Infinity));
const exps = scores.map((s) => (s === -Infinity ? 0 : Math.exp(s - maxS)));
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((e) => e / total);
tokenId = weightedChoice(probs);
if (tokenId === tokenizer.BOS) break;
tokens.push(tokenId);
}
samples.push(tokenizer.decode(tokens));
}
return samples;
}
train.ts
/**
* Training Loop
*
* Each step:
* 1. Pick a sentence and tokenize it
* 2. Forward each token through the model, predicting the next token
* 3. Compute cross-entropy loss (how surprised the model is at the correct answer)
* 4. Backward pass: compute gradients via the chain rule
* 5. Adam optimizer: update parameters to reduce loss
*
* Loss starts at ~6.39 (random guessing among 597 tokens = -log(1/597))
* and decreases to ~1.76 over 5,000 steps.
*/
import { Value, vsum } from "./autograd.js";
import { softmax } from "./nn.js";
import { type Model, gpt, createKVCache } from "./model.js";
import type { Tokenizer } from "./tokenizer.js";
export interface TrainConfig {
numSteps: number;
learningRate: number;
beta1: number;
beta2: number;
epsAdam: number;
}
/** Train the model on the dataset and return the trained model. */
export function train(
model: Model,
trainConfig: TrainConfig,
sentences: string[],
tokenizer: Tokenizer,
): Model {
const { numSteps, learningRate, beta1, beta2, epsAdam } = trainConfig;
const { params } = model;
// Adam optimizer moment buffers
const mBuf = new Float64Array(params.length);
const vBuf = new Float64Array(params.length);
for (let step = 0; step < numSteps; step++) {
// Tokenize a single sentence, surrounded by BOS on both sides
const sentence = sentences[step % sentences.length];
const tokens = tokenizer.encode(sentence);
const n = Math.min(model.config.blockSize, tokens.length - 1);
// Forward: build the computation graph from tokens to loss
const { keys, values } = createKVCache(model);
const losses: Value[] = [];
for (let posId = 0; posId < n; posId++) {
const tokenId = tokens[posId];
const targetId = tokens[posId + 1];
const logits = gpt(model, tokenId, posId, keys, values);
const probs = softmax(logits);
losses.push(probs[targetId].log().neg());
}
// Average cross-entropy loss over the sequence
const loss = vsum(losses).div(n);
// Backward: walk the computation graph and fill in .grad on every Value node.
// Returns void — gradients are stored as a side effect on the same param objects
// that the Adam loop below reads from.
loss.backward();
// Adam update with linear learning rate decay
const lrT = learningRate * (1 - step / numSteps);
for (let i = 0; i < params.length; i++) {
mBuf[i] = beta1 * mBuf[i] + (1 - beta1) * params[i].grad;
vBuf[i] = beta2 * vBuf[i] + (1 - beta2) * params[i].grad ** 2;
const mHat = mBuf[i] / (1 - beta1 ** (step + 1));
const vHat = vBuf[i] / (1 - beta2 ** (step + 1));
params[i].data -= lrT * mHat / (Math.sqrt(vHat) + epsAdam);
params[i].grad = 0;
}
process.stdout.write(
`\rstep ${String(step + 1).padStart(4)} / ${numSteps} | loss ${loss.data.toFixed(4)}`
);
}
return model;
}
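Worked by hand, the Adam update above is just a few multiplies. A minimal sketch of one step for a single parameter on plain numbers (the names m, v, mHat, vHat mirror the buffers in train.ts; the hyperparameters are the tutorial's defaults):

```typescript
// One Adam step for a single parameter, mirroring the loop in train.ts.
const beta1 = 0.85, beta2 = 0.99, eps = 1e-8, lr = 0.01;

let param = 0.5;  // current parameter value
const grad = 0.2; // gradient filled in by backward()

// Moment buffers start at zero, so after one step:
const m = beta1 * 0 + (1 - beta1) * grad;      // 0.03
const v = beta2 * 0 + (1 - beta2) * grad ** 2; // 0.0004

// Bias correction: early steps would otherwise be biased toward zero
const mHat = m / (1 - beta1 ** 1); // back to 0.2
const vHat = v / (1 - beta2 ** 1); // back to 0.04

param -= (lr * mHat) / (Math.sqrt(vHat) + eps);
// On step 1 the update magnitude is ≈ lr regardless of gradient scale,
// because mHat / sqrt(vHat) = grad / |grad| = ±1
```

This is why Adam's early steps are well behaved even before the moment estimates have warmed up.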
model.ts
/**
* GPT Model
*
* The transformer architecture: a function that maps input tokens to a
* probability distribution over what comes next.
*
* Follows GPT-2 with minor simplifications:
* - RMSNorm instead of LayerNorm
* - No biases
* - ReLU instead of GeLU
*
* Config: 32 embedding dims, 4 attention heads, 2 layers, 16 max context
* → 63,296 parameters total
*/
import { readFileSync, writeFileSync } from "node:fs";
import { Value, vsum } from "./autograd.js";
import { type Matrix, linear, softmax, rmsnorm } from "./nn.js";
import { gauss } from "./rng.js";
// --- Configuration ---
export interface GPTConfig {
nLayer: number;
nEmbd: number;
blockSize: number;
nHead: number;
headDim: number;
vocabSize: number;
}
interface Attention {
query: Matrix;
key: Matrix;
value: Matrix;
output: Matrix;
}
interface MLP {
hidden: Matrix;
output: Matrix;
}
interface Layer {
attention: Attention;
mlp: MLP;
}
export interface Weights {
tokenEmbedding: Matrix;
positionEmbedding: Matrix;
output: Matrix;
layers: Layer[];
}
/** The trained (or untrained) model: config + weights + flattened parameter list. */
export interface Model {
config: GPTConfig;
weights: Weights;
params: Value[];
}
// --- Model Creation ---
function matrix(nout: number, nin: number, std = 0.08): Matrix {
return Array.from({ length: nout }, () =>
Array.from({ length: nin }, () => new Value(gauss(0, std)))
);
}
/** Create a new model with randomly initialized weights. */
export function createModel(config: GPTConfig): Model {
const { nEmbd, nLayer, blockSize, vocabSize } = config;
const weights: Weights = {
tokenEmbedding: matrix(vocabSize, nEmbd),
positionEmbedding: matrix(blockSize, nEmbd),
output: matrix(vocabSize, nEmbd),
layers: Array.from({ length: nLayer }, () => ({
attention: {
query: matrix(nEmbd, nEmbd),
key: matrix(nEmbd, nEmbd),
value: matrix(nEmbd, nEmbd),
output: matrix(nEmbd, nEmbd),
},
mlp: {
hidden: matrix(4 * nEmbd, nEmbd),
output: matrix(nEmbd, 4 * nEmbd),
},
})),
};
const allMatrices: Matrix[] = [
weights.tokenEmbedding,
weights.positionEmbedding,
weights.output,
...weights.layers.flatMap((layer) => [
layer.attention.query, layer.attention.key,
layer.attention.value, layer.attention.output,
layer.mlp.hidden, layer.mlp.output,
]),
];
const params = allMatrices.flatMap((mat) => mat.flatMap((row) => row));
return { config, weights, params };
}
// --- Save / Load ---
/** Save a trained model to a JSON file (config + parameter values). */
export function saveModel(model: Model, path: string): void {
const data = {
config: model.config,
weights: model.params.map((p) => p.data),
};
writeFileSync(path, JSON.stringify(data));
}
/** Load a model from a JSON file. Recreates the structure, then fills in the learned weights. */
export function loadModel(path: string): Model {
const data = JSON.parse(readFileSync(path, "utf-8"));
const model = createModel(data.config);
for (let i = 0; i < model.params.length; i++) {
model.params[i].data = data.weights[i];
}
return model;
}
// --- KV Cache ---
/** Create fresh key/value caches for a new sequence. Must be called per-sequence. */
export function createKVCache(model: Model): {
keys: Value[][][];
values: Value[][][];
} {
return {
keys: Array.from({ length: model.config.nLayer }, () => []),
values: Array.from({ length: model.config.nLayer }, () => []),
};
}
// --- Forward Pass ---
/**
* Run one step of the GPT: given a token at a position, return logits
* over the vocabulary for the next token.
*
* The keys/values caches are mutated (appended to) on each call —
* this is the KV cache that avoids recomputing attention for past positions.
*/
export function gpt(
model: Model,
tokenId: number,
posId: number,
keys: Value[][][],
values: Value[][][],
): Value[] {
const { nLayer, nHead, headDim } = model.config;
const { weights } = model;
// Step 1: Embedding lookup
// Combine "what word is this?" with "where does it appear?" into a single
// vector. This is the hidden state that flows through the rest of the network.
const tokenVector: Value[] = weights.tokenEmbedding[tokenId];
const positionVector: Value[] = weights.positionEmbedding[posId];
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i]));
// Normalize before the first layer to keep values at a stable scale
hidden = rmsnorm(hidden);
// Step 2: Transformer layers
// Each layer has two blocks: attention (gather context from other tokens)
// followed by MLP (process the gathered information). Both use residual
// connections so the input is added back to the output of each block.
for (let li = 0; li < nLayer; li++) {
const layer = weights.layers[li];
// --- Attention block: look at previous tokens to gather context ---
// Save the hidden state so we can add it back after the block (residual)
const beforeAttention: Value[] = hidden;
hidden = rmsnorm(hidden);
// Project the hidden state into query, key, and value vectors.
// These are structurally identical projections — training teaches them
// to play different roles: query asks "what am I looking for?",
// key advertises "what do I contain?", value carries "what to retrieve".
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
// Cache the key and value so future tokens can attend to this position
keys[li].push(key);
values[li].push(value);
// Each head independently attends to a different slice of the vectors,
// allowing the model to track multiple relationships at once
const attentionOutput: Value[] = [];
for (let h = 0; h < nHead; h++) {
const headStart = h * headDim;
const headQuery = query.slice(headStart, headStart + headDim);
const headKeys = keys[li].map((ki) => ki.slice(headStart, headStart + headDim));
const headValues = values[li].map((vi) => vi.slice(headStart, headStart + headDim));
// Scaled dot-product attention: score = (query · key) / √headDim
const attnLogits = headKeys.map((cachedKey) =>
vsum(headQuery.map((q, j) => q.mul(cachedKey[j]))).div(Math.sqrt(headDim))
);
// Softmax converts scores into weights that sum to 1
const attnWeights = softmax(attnLogits);
// Weighted sum of value vectors — high-scoring positions contribute more
for (let j = 0; j < headDim; j++) {
attentionOutput.push(vsum(attnWeights.map((w, t) => w.mul(headValues[t][j]))));
}
}
// Project concatenated head outputs back to the hidden dimension
hidden = linear(attentionOutput, layer.attention.output);
// Residual connection: add back what we had before attention
hidden = hidden.map((h, i) => h.add(beforeAttention[i]));
// --- MLP block: process each token's representation independently ---
const beforeMLP: Value[] = hidden;
hidden = rmsnorm(hidden);
hidden = linear(hidden, layer.mlp.hidden); // expand to 4x wider
hidden = hidden.map((h) => h.relu()); // nonlinearity
hidden = linear(hidden, layer.mlp.output); // compress back
// Residual connection: add back what we had before the MLP
hidden = hidden.map((h, i) => h.add(beforeMLP[i]));
}
// Step 3: Output projection
// Project the final hidden state to vocabulary size — one score per word.
// These raw scores (logits) will be passed through softmax later to get
// a probability distribution over the next token.
return linear(hidden, weights.output);
}
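The per-head loop above is easier to follow with concrete numbers. A minimal sketch of scaled dot-product attention for one head, using toy 2-dimensional vectors rather than the model's learned weights:

```typescript
// Scaled dot-product attention for one query over two cached positions.
const headDim = 2;
const query = [1, 0];
const cachedKeys = [[1, 0], [0, 1]];   // keys for positions 0 and 1
const cachedValues = [[10, 0], [0, 10]];

// score_t = (query · key_t) / sqrt(headDim)
const scores = cachedKeys.map((k) =>
  k.reduce((acc, kj, j) => acc + query[j] * kj, 0) / Math.sqrt(headDim)
);

// softmax turns scores into weights that sum to 1
const maxS = Math.max(...scores);
const exps = scores.map((s) => Math.exp(s - maxS));
const total = exps.reduce((a, b) => a + b, 0);
const attnWeights = exps.map((e) => e / total);

// output_j = sum_t attnWeights[t] * value_t[j]
const output = [0, 1].map((j) =>
  attnWeights.reduce((acc, w, t) => acc + w * cachedValues[t][j], 0)
);
```

Because the query lines up with the first cached key, the first position gets most of the attention weight, and the output leans toward its value vector.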
rng.ts
/**
* Seeded Random Number Generator
*
* JavaScript has no built-in seeded RNG, so we roll our own:
* - mulberry32 for uniform random numbers
* - Box-Muller transform for Gaussian (normal) distribution
* - Fisher-Yates for array shuffling
* - Weighted random choice for sampling from probability distributions
*/
let _rngState = 0;
export function seed(s: number): void {
_rngState = s | 0;
}
/** Mulberry32: a fast, seedable 32-bit PRNG. Returns a uniform float in [0, 1). */
export function random(): number {
_rngState = (_rngState + 0x6d2b79f5) | 0;
let t = _rngState;
t = Math.imul(t ^ (t >>> 15), t | 1);
t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
}
/** Box-Muller transform: convert two uniform samples into a Gaussian sample. */
export function gauss(mean: number, std: number): number {
let u1 = random();
while (u1 === 0) u1 = random();
const u2 = random();
return mean + std * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}
/** Fisher-Yates shuffle: randomly reorder an array in-place. */
export function shuffle<T>(arr: T[]): void {
for (let i = arr.length - 1; i > 0; i--) {
const j = Math.floor(random() * (i + 1));
[arr[i], arr[j]] = [arr[j], arr[i]];
}
}
/** Sample an index from a weighted distribution. Used for token sampling during inference. */
export function weightedChoice(weights: number[]): number {
const total = weights.reduce((a, b) => a + b, 0);
let r = random() * total;
for (let i = 0; i < weights.length; i++) {
r -= weights[i];
if (r <= 0) return i;
}
return weights.length - 1;
}
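Seeding is what makes training runs reproducible. A quick self-contained check that the mulberry32 recurrence above is deterministic (the generator is inlined as a factory here so two instances can be compared side by side):

```typescript
// Inlined mulberry32, same algorithm as random() above.
function mulberry32(seedValue: number): () => number {
  let state = seedValue | 0;
  return () => {
    state = (state + 0x6d2b79f5) | 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const a = mulberry32(42);
const b = mulberry32(42);
const seqA = [a(), a(), a()];
const seqB = [b(), b(), b()];
// Same seed → identical streams, and every value lies in [0, 1)
```

This is exactly why phrases-train.ts calls seed(42): rerunning training replays the same shuffle and the same weight initialization.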
nn.ts
/**
* Neural Network Primitives
*
* The building blocks that the GPT architecture assembles:
* - linear: matrix-vector multiply (the fundamental neural net operation)
* - softmax: convert raw scores into a probability distribution
* - rmsnorm: normalize activations to stabilize training
*
* These mirror PyTorch's torch.nn.functional — general-purpose operations
* used by the model, training loop, and inference.
*/
import { Value, vsum } from "./autograd.js";
export type Matrix = Value[][];
/** y = Wx: multiply a weight matrix by an input vector. */
export function linear(input: Value[], weights: Matrix): Value[] {
return weights.map((row) => vsum(row.map((w, i) => w.mul(input[i]))));
}
/** Convert raw logits to probabilities. Subtracts max for numerical stability. */
export function softmax(logits: Value[]): Value[] {
const maxVal = Math.max(...logits.map((v) => v.data));
const exps = logits.map((v) => v.sub(maxVal).exp());
const total = vsum(exps);
return exps.map((e) => e.div(total));
}
/** Root Mean Square normalization: scale activations to unit root-mean-square. */
export function rmsnorm(input: Value[]): Value[] {
const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
const scale = ms.add(1e-5).pow(-0.5);
return input.map((xi) => xi.mul(scale));
}
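Two properties of softmax are worth verifying: the outputs always sum to 1, and subtracting the max (as softmax above does) leaves the result mathematically unchanged while preventing overflow. A plain-number sketch (the helper name is ours, not part of the codebase):

```typescript
// Softmax on plain numbers, in the numerically stable form.
function softmaxPlain(logits: number[]): number[] {
  const maxVal = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - maxVal)); // shift by max
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const p = softmaxPlain([2, 1, 0]);
const shifted = softmaxPlain([1002, 1001, 1000]); // same distribution
const sum = p.reduce((a, b) => a + b, 0);
// p equals shifted element-wise, and sum is 1. Without the max
// subtraction, Math.exp(1002) would overflow to Infinity.
```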
autograd.ts
/**
* Autograd Engine
*
* A scalar-valued automatic differentiation engine. Each Value node records:
* - its forward-pass result (data)
* - its gradient w.r.t. the loss (grad), filled in by backward()
* - its children in the computation graph and the local derivatives
*
* The backward() method applies the chain rule via reverse-mode autodiff:
* topologically sort the graph, then propagate gradients from output to inputs.
*
* "If I nudge this parameter slightly, how does the loss change?"
*/
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
exp(): Value {
return new Value(Math.exp(this.data), [this], [Math.exp(this.data)]);
}
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
neg(): Value {
return this.mul(-1);
}
sub(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.add(o.neg());
}
div(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.mul(o.pow(-1));
}
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
}
/** Sum a list of Values through the computation graph. */
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
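Reverse-mode autodiff can be sanity-checked against a finite difference. The sketch below inlines a trimmed-down copy of the Value class above (add, mul, backward only) to differentiate f(x) = x·x + 3x at x = 2, where the analytic derivative is 2x + 3 = 7:

```typescript
// Minimal scalar autograd, same structure as the Value class above.
class V {
  grad = 0;
  data: number;
  children: V[];
  localGrads: number[];
  constructor(data: number, children: V[] = [], localGrads: number[] = []) {
    this.data = data;
    this.children = children;
    this.localGrads = localGrads;
  }
  add(o: V): V { return new V(this.data + o.data, [this, o], [1, 1]); }
  mul(o: V): V { return new V(this.data * o.data, [this, o], [o.data, this.data]); }
  backward(): void {
    const topo: V[] = [];
    const seen = new Set<V>();
    const build = (v: V): void => {
      if (seen.has(v)) return;
      seen.add(v);
      v.children.forEach(build);
      topo.push(v);
    };
    build(this);
    this.grad = 1;
    for (const v of topo.reverse())
      v.children.forEach((c, i) => (c.grad += v.localGrads[i] * v.grad));
  }
}

// f(x) = x*x + 3x  →  f'(x) = 2x + 3, so f'(2) = 7
const x = new V(2);
const y = x.mul(x).add(new V(3).mul(x));
y.backward();

// Numerical check via central difference
const f = (t: number): number => t * t + 3 * t;
const h = 1e-6;
const numeric = (f(2 + h) - f(2 - h)) / (2 * h);
```

Note how x.mul(x) contributes twice to x.grad (once per child slot); the += accumulation in backward() is what makes shared nodes like this come out right.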
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
 * Builds a word-level vocabulary from the training corpus; sentences are
 * encoded with a BOS (Beginning of Sequence) delimiter at both ends.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
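A round trip through the tokenizer should be lossless for in-vocabulary text. A self-contained sketch of the same construction as createWordTokenizer, on a toy three-sentence corpus:

```typescript
// Word-level tokenizer, built the same way as createWordTokenizer above.
const corpus = ["the cat sat", "the dog ran", "a cat ran"];
const words = [...new Set(corpus.flatMap((s) => s.split(" ")))].sort();
const BOS = words.length; // one id past the last word

const encode = (s: string): number[] => [
  BOS,
  ...s.split(" ").map((w) => words.indexOf(w)),
  BOS,
];
const decode = (tokens: number[]): string =>
  tokens.map((t) => words[t]).join(" ");

const tokens = encode("the cat ran");
// Strip the BOS delimiters before decoding
const roundTrip = decode(tokens.slice(1, -1));
```

The sorted, deduplicated word list is why phrases-generate.ts can rebuild an identical tokenizer from the same corpus file: the vocabulary is a pure function of the corpus.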
Smoke Test
Time to run it. In this lab you will train the model, inspect the saved weights, and generate sentences with different sampling strategies.
Train the Model
npx tsx src/phrases-train.ts
You should see output like this:
num sentences: 30000
vocab size: 597 (596 words + BOS)
num params: 63296
step 1 / 5000 | loss 6.3917
step 500 / 5000 | loss 3.2184
step 1000 / 5000 | loss 2.8549
step 2000 / 5000 | loss 2.4012
step 3000 / 5000 | loss 2.1538
step 4000 / 5000 | loss 1.9107
step 5000 / 5000 | loss 1.7623
model saved to phrases-model.json
Verify: Loss Decreases
The loss should start near 6.39 (random guessing: -log(1/597)) and drop
below 2.0 by step 5000. If the loss is not decreasing, something is wrong.
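That starting loss is easy to predict from first principles: a uniform guess over 597 tokens has cross-entropy -log(1/597). A one-line check:

```typescript
// Cross-entropy of a uniform guess over a 597-token vocabulary
const vocabSize = 597;
const initialLoss = -Math.log(1 / vocabSize); // same as Math.log(597), ≈ 6.39
```

The randomly initialized model is not exactly uniform, which is why the printed step-1 loss (6.3917) differs from this in the fourth decimal place.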
Verify: Model File Exists
ls -lh phrases-model.json
The file should be roughly 1.5 MB. It contains the model configuration and all 63,296 learned parameter values as JSON.
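The 63,296 figure can be derived from the config in phrases-train.ts: three embedding/output matrices plus six weight matrices per layer. A quick sketch of the arithmetic:

```typescript
// Parameter count for the config used in phrases-train.ts
const nEmbd = 32, nLayer = 2, blockSize = 16, vocabSize = 597;

const tokenEmbedding = vocabSize * nEmbd;    // 597 × 32 = 19,104
const positionEmbedding = blockSize * nEmbd; //  16 × 32 =    512
const outputProj = vocabSize * nEmbd;        // 597 × 32 = 19,104
const perLayer =
  4 * nEmbd * nEmbd +     // query, key, value, output projections
  4 * nEmbd * nEmbd * 2;  // mlp hidden (4n×n) + mlp output (n×4n)

const totalParams =
  tokenEmbedding + positionEmbedding + outputProj + nLayer * perLayer;
console.log(totalParams); // 63296
```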
Generate Sentences
Load the saved model and generate with default settings:
npx tsx src/phrases-generate.ts
loaded phrases-model.json: 63296 params, vocab 597
generating 20 sentences (temperature=0.8):
1. the puppy wants to jump
2. the pig swims by the park
3. i hold the pie
4. the fish skips
5. the teacher is sweet
...
Verify: Output Makes Sense
The generated sentences should:
- Start with plausible words (“the”, “a”, “i”, “we”, “mom”, etc.)
- Follow basic subject-verb patterns
- Use words from the training vocabulary
- End naturally (not cut off mid-thought)
They will not be perfect. You will see occasional oddities like “the fish quacks” or “i toes”. The model has only 63K parameters and 5,000 training steps. But the overall structure should clearly resemble English sentences.
Try Different Sampling Strategies
Low Temperature (Conservative)
npx tsx src/phrases-generate.ts 10 --temp=0.3
With low temperature, the model almost always picks the highest-probability word. Output will be repetitive but grammatically safer.
High Temperature (Creative)
npx tsx src/phrases-generate.ts 10 --temp=1.5
With high temperature, the model considers unlikely words. Output will be more varied but may include nonsensical combinations.
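The effect of temperature is easy to see on a toy distribution: dividing logits by a temperature below 1 sharpens the softmax, while a value above 1 flattens it. A plain-number sketch (the helper name is ours, not part of the codebase):

```typescript
// Temperature scaling, same math as generate.ts
function softmaxAtTemp(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const maxS = Math.max(...scaled);
  const exps = scaled.map((s) => Math.exp(s - maxS));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const logits = [2, 1, 0];
const cold = softmaxAtTemp(logits, 0.3); // sharper: top token dominates
const base = softmaxAtTemp(logits, 1.0);
const hot = softmaxAtTemp(logits, 1.5);  // flatter: tail gets more mass
```

With these logits the top token's probability goes from roughly two thirds at temperature 1.0 to over 95% at 0.3, while 1.5 pushes probability mass into the tail.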
Top-k Filtering
npx tsx src/phrases-generate.ts 10 --top-k=10
Only the 10 most likely words can be chosen at each position. This trims the long tail of unlikely words.
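Top-k is just a cutoff on sorted scores. A self-contained sketch mirroring the filter in generate.ts (the helper name is ours):

```typescript
// Top-k filtering: keep only the k highest scores, mask the rest
function topKFilter(scores: number[], k: number): number[] {
  if (k <= 0 || k >= scores.length) return scores;
  const sorted = [...scores].sort((a, b) => b - a);
  const cutoff = sorted[k - 1]; // k-th highest score
  return scores.map((s) => (s >= cutoff ? s : -Infinity));
}

const filtered = topKFilter([3.0, 1.5, 0.2, -0.7, 2.1], 2);
// Only the two highest scores (3.0 and 2.1) survive; the rest become
// -Infinity, which softmax turns into probability 0
```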
Nucleus Sampling (Top-p)
npx tsx src/phrases-generate.ts 10 --top-p=0.9
Keep the smallest set of words whose probabilities sum to 0.9. Adapts to the model’s confidence at each position.
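Nucleus sampling keeps the smallest set of tokens whose cumulative probability exceeds p. A self-contained sketch of the same ordering logic as generate.ts, applied to ready-made probabilities (the helper name is ours):

```typescript
// Top-p: walk tokens from most to least probable, accumulate probability,
// and zero out everything after the running total first exceeds p.
function topPFilter(probs: number[], p: number): number[] {
  const indices = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const kept = [...probs];
  let cumSum = 0;
  let exceeded = false;
  for (const idx of indices) {
    if (exceeded) kept[idx] = 0;
    cumSum += probs[idx];
    if (cumSum > p) exceeded = true;
  }
  return kept;
}

const kept = topPFilter([0.5, 0.3, 0.15, 0.05], 0.9);
// 0.5 + 0.3 = 0.8 ≤ 0.9, so 0.15 is still included (total 0.95 > 0.9);
// only the 0.05 tail is dropped
```

When the model is confident, the nucleus shrinks to one or two tokens; when it is unsure, more tokens stay in play. That adaptivity is what top-k lacks.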
Combined
npx tsx src/phrases-generate.ts 10 --temp=0.7 --top-p=0.9 --top-k=50
All three strategies working together. This is how production LLMs typically configure their sampling.
What You Have Built
A language model. The same architecture that, at wildly different scale, powers ChatGPT, Claude, and every other LLM.
The entire system is ~450 lines of TypeScript across 9 files, with zero dependencies beyond the TypeScript compiler. Every piece is explicit:
- You can read exactly how token 541 becomes a 32-dimensional vector
- You can trace exactly how attention computes which words to focus on
- You can inspect exactly how each gradient flows backward through the network
- You can see exactly how the optimizer adjusts each of the 63,296 parameters
The only difference between this and a production LLM is scale. Same concepts. Same math. Same architecture. Just more parameters, more data, more compute, and a few engineering optimizations to make it fast.
- our model: 597-word vocab, 63K params, 30K sentences, minutes on a laptop
- production LLM: ~100K vocab, ~1T+ params, trillions of tokens, months on a cluster
The hard part is not the size. The hard part is understanding what happens inside, and now you do.
Complete Code
Here is the complete source code for the smoke test entry points, along with all files from previous chapters.
phrases-train.ts
/**
* MicroGPT — Train on Phrases
*
* Trains a word-level GPT on grade 1 English sentences and saves the model.
* Run inference afterwards with: npx tsx phrases-generate.ts
*
* npx tsx phrases-train.ts
*/
import { readFileSync } from "node:fs";
import { seed, shuffle } from "./rng.js";
import { createWordTokenizer } from "./tokenizer.js";
import { createModel, saveModel } from "./model.js";
import { train } from "./train.js";
// 1. Seed the RNG for reproducibility
seed(42);
// 2. Load dataset: 30K grade 1 sentences
const sentences = readFileSync("data/grade1_sentences.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
shuffle(sentences);
console.log(`num sentences: ${sentences.length}`);
// 3. Build word-level tokenizer from the corpus
const tokenizer = createWordTokenizer(sentences);
console.log(`vocab size: ${tokenizer.vocabSize} (${tokenizer.vocabSize - 1} words + BOS)`);
// 4. Create the model — bigger than the names model to handle the larger vocabulary
const untrained = createModel({
nLayer: 2,
nEmbd: 32,
blockSize: 16,
nHead: 4,
headDim: 8,
vocabSize: tokenizer.vocabSize,
});
console.log(`num params: ${untrained.params.length}`);
// 5. Train the model
const model = train(
untrained,
{ numSteps: 5000, learningRate: 0.01, beta1: 0.85, beta2: 0.99, epsAdam: 1e-8 },
sentences,
tokenizer,
);
// 6. Save the trained model to disk
saveModel(model, "phrases-model.json");
console.log("\nmodel saved to phrases-model.json");
phrases-generate.ts
/**
* MicroGPT — Generate Phrases
*
* Loads a trained model and generates new sentences.
* Train first with: npx tsx phrases-train.ts
*
* npx tsx phrases-generate.ts
* npx tsx phrases-generate.ts 50 # generate 50 sentences
* npx tsx phrases-generate.ts 20 --temp=0.3 # low temperature
* npx tsx phrases-generate.ts 20 --top-k=10 # top-k filtering
* npx tsx phrases-generate.ts 20 --top-p=0.9 # nucleus sampling
* npx tsx phrases-generate.ts 20 --temp=0.7 --top-p=0.9 --top-k=50
* npx tsx phrases-generate.ts 20 --model=phrases-fine-tuned-model.json
*/
import { readFileSync } from "node:fs";
import { seed } from "./rng.js";
import { createWordTokenizer } from "./tokenizer.js";
import { loadModel } from "./model.js";
import { generate, type GenerateOptions } from "./generate.js";
// Parse CLI arguments
const args = process.argv.slice(2);
const positional = args.filter((a) => !a.startsWith("--"));
const flags = Object.fromEntries(
args.filter((a) => a.startsWith("--")).map((a) => {
const [k, v] = a.slice(2).split("=");
return [k, v];
})
);
const numSamples = parseInt(positional[0] ?? "20", 10);
const options: GenerateOptions = {
temperature: parseFloat(flags["temp"] ?? "0.8"),
topK: parseInt(flags["top-k"] ?? "0", 10),
topP: parseFloat(flags["top-p"] ?? "1.0"),
};
// 1. Seed the RNG
seed(Date.now());
// 2. Rebuild the tokenizer from the same corpus (needed for encode/decode)
const corpus = readFileSync("data/grade1_sentences.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
const tokenizer = createWordTokenizer(corpus);
// 3. Load the trained model
const modelPath = flags["model"] ?? "phrases-model.json";
const model = loadModel(modelPath);
console.log(`loaded ${modelPath}: ${model.params.length} params, vocab ${model.config.vocabSize}`);
// 4. Generate
const parts = [`temperature=${options.temperature}`];
if (options.topK! > 0) parts.push(`top-k=${options.topK}`);
if (options.topP! < 1.0) parts.push(`top-p=${options.topP}`);
console.log(`generating ${numSamples} sentences (${parts.join(", ")}):\n`);
const sentences = generate(model, tokenizer, numSamples, options);
sentences.forEach((s, i) =>
console.log(` ${String(i + 1).padStart(2)}. ${s}`)
);
generate.ts
/**
* Inference / Text Generation
*
* Starting from the BOS token, the model predicts one token at a time:
* 1. Forward the current token through the model to get logits
* 2. Apply temperature scaling (lower = more conservative, higher = more creative)
* 3. Apply top-k filtering (keep only the k most likely tokens)
* 4. Apply top-p / nucleus filtering (keep smallest set summing to p)
* 5. Convert to probabilities via softmax
* 6. Randomly sample the next token from that distribution
* 7. Stop when BOS is produced (end of sequence) or max length is reached
*/
import { type Model, gpt, createKVCache } from "./model.js";
import { weightedChoice } from "./rng.js";
import type { Tokenizer } from "./tokenizer.js";
export interface GenerateOptions {
temperature?: number;
topK?: number;
topP?: number;
}
/** Generate new text samples from a trained model. */
export function generate(
model: Model,
tokenizer: Tokenizer,
numSamples: number,
options: GenerateOptions = {},
): string[] {
const { temperature = 0.8, topK = 0, topP = 1.0 } = options;
const samples: string[] = [];
for (let i = 0; i < numSamples; i++) {
const { keys, values } = createKVCache(model);
let tokenId = tokenizer.BOS;
const tokens: number[] = [];
for (let posId = 0; posId < model.config.blockSize; posId++) {
const logits = gpt(model, tokenId, posId, keys, values);
// Temperature scaling
let scores = logits.map((l) => l.data / temperature);
// Top-k: keep only the k highest scores
if (topK > 0 && topK < scores.length) {
const sorted = [...scores].sort((a, b) => b - a);
const cutoff = sorted[topK - 1];
scores = scores.map((s) => (s >= cutoff ? s : -Infinity));
}
// Top-p (nucleus): keep smallest set whose probabilities sum to p
if (topP < 1.0) {
const indices = scores.map((s, idx) => idx);
indices.sort((a, b) => scores[b] - scores[a]);
// Compute softmax on current scores to get probabilities for filtering
const maxS = Math.max(...scores.filter((s) => s !== -Infinity));
const exps = scores.map((s) => (s === -Infinity ? 0 : Math.exp(s - maxS)));
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((e) => e / total);
let cumSum = 0;
let exceeded = false;
for (const idx of indices) {
if (exceeded) {
scores[idx] = -Infinity;
}
cumSum += probs[idx];
if (cumSum > topP) {
exceeded = true;
}
}
}
// Softmax and sample (using plain numbers, not Value, for efficiency)
const maxS = Math.max(...scores.filter((s) => s !== -Infinity));
const exps = scores.map((s) => (s === -Infinity ? 0 : Math.exp(s - maxS)));
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((e) => e / total);
tokenId = weightedChoice(probs);
if (tokenId === tokenizer.BOS) break;
tokens.push(tokenId);
}
samples.push(tokenizer.decode(tokens));
}
return samples;
}
train.ts
/**
* Training Loop
*
* Each step:
* 1. Pick a sentence and tokenize it
* 2. Forward each token through the model, predicting the next token
* 3. Compute cross-entropy loss (how surprised the model is at the correct answer)
* 4. Backward pass: compute gradients via the chain rule
* 5. Adam optimizer: update parameters to reduce loss
*
* Loss starts at -log(1/vocabSize) (random guessing) and decreases as
* the model learns to predict the next word.
*/
import { Value, vsum } from "./autograd.js";
import { softmax } from "./nn.js";
import { type Model, gpt, createKVCache } from "./model.js";
import type { Tokenizer } from "./tokenizer.js";
export interface TrainConfig {
numSteps: number;
learningRate: number;
beta1: number;
beta2: number;
epsAdam: number;
}
/** Train the model on the dataset and return the trained model. */
export function train(
model: Model,
trainConfig: TrainConfig,
sentences: string[],
tokenizer: Tokenizer,
): Model {
const { numSteps, learningRate, beta1, beta2, epsAdam } = trainConfig;
const { params } = model;
// Adam optimizer moment buffers
const mBuf = new Float64Array(params.length);
const vBuf = new Float64Array(params.length);
for (let step = 0; step < numSteps; step++) {
// Tokenize a single sentence, surrounded by BOS on both sides
const sentence = sentences[step % sentences.length];
const tokens = tokenizer.encode(sentence);
const n = Math.min(model.config.blockSize, tokens.length - 1);
// Forward: build the computation graph from tokens to loss
const { keys, values } = createKVCache(model);
const losses: Value[] = [];
for (let posId = 0; posId < n; posId++) {
const tokenId = tokens[posId];
const targetId = tokens[posId + 1];
const logits = gpt(model, tokenId, posId, keys, values);
const probs = softmax(logits);
losses.push(probs[targetId].log().neg());
}
// Average cross-entropy loss over the sequence
const loss = vsum(losses).div(n);
// Backward: walk the computation graph and fill in .grad on every Value node.
// Returns void — gradients are stored as a side effect on the same param objects
// that the Adam loop below reads from.
loss.backward();
// Adam update with linear learning rate decay
const lrT = learningRate * (1 - step / numSteps);
for (let i = 0; i < params.length; i++) {
mBuf[i] = beta1 * mBuf[i] + (1 - beta1) * params[i].grad;
vBuf[i] = beta2 * vBuf[i] + (1 - beta2) * params[i].grad ** 2;
const mHat = mBuf[i] / (1 - beta1 ** (step + 1));
const vHat = vBuf[i] / (1 - beta2 ** (step + 1));
params[i].data -= lrT * mHat / (Math.sqrt(vHat) + epsAdam);
params[i].grad = 0;
}
process.stdout.write(
`\rstep ${String(step + 1).padStart(4)} / ${numSteps} | loss ${loss.data.toFixed(4)}`
);
}
return model;
}
model.ts
/**
* GPT Model
*
* The transformer architecture: a function that maps input tokens to a
* probability distribution over what comes next.
*
* Follows GPT-2 with minor simplifications:
* - RMSNorm instead of LayerNorm
* - No biases
* - ReLU instead of GeLU
*
* Config: 32 embedding dims, 4 attention heads, 2 layers, 16 max context
* → 63,296 parameters total
*/
import { readFileSync, writeFileSync } from "node:fs";
import { Value, vsum } from "./autograd.js";
import { type Matrix, linear, softmax, rmsnorm } from "./nn.js";
import { gauss } from "./rng.js";
// --- Configuration ---
export interface GPTConfig {
nLayer: number;
nEmbd: number;
blockSize: number;
nHead: number;
headDim: number;
vocabSize: number;
}
interface Attention {
query: Matrix;
key: Matrix;
value: Matrix;
output: Matrix;
}
interface MLP {
hidden: Matrix;
output: Matrix;
}
interface Layer {
attention: Attention;
mlp: MLP;
}
export interface Weights {
tokenEmbedding: Matrix;
positionEmbedding: Matrix;
output: Matrix;
layers: Layer[];
}
/** The trained (or untrained) model: config + weights + flattened parameter list. */
export interface Model {
config: GPTConfig;
weights: Weights;
params: Value[];
}
// --- Model Creation ---
function matrix(nout: number, nin: number, std = 0.08): Matrix {
return Array.from({ length: nout }, () =>
Array.from({ length: nin }, () => new Value(gauss(0, std)))
);
}
/** Create a new model with randomly initialized weights. */
export function createModel(config: GPTConfig): Model {
const { nEmbd, nLayer, blockSize, vocabSize } = config;
const weights: Weights = {
tokenEmbedding: matrix(vocabSize, nEmbd),
positionEmbedding: matrix(blockSize, nEmbd),
output: matrix(vocabSize, nEmbd),
layers: Array.from({ length: nLayer }, () => ({
attention: {
query: matrix(nEmbd, nEmbd),
key: matrix(nEmbd, nEmbd),
value: matrix(nEmbd, nEmbd),
output: matrix(nEmbd, nEmbd),
},
mlp: {
hidden: matrix(4 * nEmbd, nEmbd),
output: matrix(nEmbd, 4 * nEmbd),
},
})),
};
const allMatrices: Matrix[] = [
weights.tokenEmbedding,
weights.positionEmbedding,
weights.output,
...weights.layers.flatMap((layer) => [
layer.attention.query, layer.attention.key,
layer.attention.value, layer.attention.output,
layer.mlp.hidden, layer.mlp.output,
]),
];
const params = allMatrices.flatMap((mat) => mat.flatMap((row) => row));
return { config, weights, params };
}
// --- Save / Load ---
/** Save a trained model to a JSON file (config + parameter values). */
export function saveModel(model: Model, path: string): void {
const data = {
config: model.config,
weights: model.params.map((p) => p.data),
};
writeFileSync(path, JSON.stringify(data));
}
/** Load a model from a JSON file. Recreates the structure, then fills in the learned weights. */
export function loadModel(path: string): Model {
const data = JSON.parse(readFileSync(path, "utf-8"));
const model = createModel(data.config);
for (let i = 0; i < model.params.length; i++) {
model.params[i].data = data.weights[i];
}
return model;
}
// --- KV Cache ---
/** Create fresh key/value caches for a new sequence. Must be called per-sequence. */
export function createKVCache(model: Model): {
keys: Value[][][];
values: Value[][][];
} {
return {
keys: Array.from({ length: model.config.nLayer }, () => []),
values: Array.from({ length: model.config.nLayer }, () => []),
};
}
// --- Forward Pass ---
/**
* Run one step of the GPT: given a token at a position, return logits
* over the vocabulary for the next token.
*
* The keys/values caches are mutated (appended to) on each call —
* this is the KV cache that avoids recomputing attention for past positions.
*/
export function gpt(
model: Model,
tokenId: number,
posId: number,
keys: Value[][][],
values: Value[][][],
): Value[] {
const { nLayer, nHead, headDim } = model.config;
const { weights } = model;
// Step 1: Embedding lookup
// Combine "what word is this?" with "where does it appear?" into a single
// vector. This is the hidden state that flows through the rest of the network.
const tokenVector: Value[] = weights.tokenEmbedding[tokenId];
const positionVector: Value[] = weights.positionEmbedding[posId];
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i]));
// Normalize before the first layer to keep values at a stable scale
hidden = rmsnorm(hidden);
// Step 2: Transformer layers
// Each layer has two blocks: attention (gather context from other tokens)
// followed by MLP (process the gathered information). Both use residual
// connections so the input is added back to the output of each block.
for (let li = 0; li < nLayer; li++) {
const layer = weights.layers[li];
// --- Attention block: look at previous tokens to gather context ---
// Save the hidden state so we can add it back after the block (residual)
const beforeAttention: Value[] = hidden;
hidden = rmsnorm(hidden);
// Project the hidden state into query, key, and value vectors.
// These are structurally identical projections — training teaches them
// to play different roles: query asks "what am I looking for?",
// key advertises "what do I contain?", value carries "what to retrieve".
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
// Cache the key and value so future tokens can attend to this position
keys[li].push(key);
values[li].push(value);
// Each head independently attends to a different slice of the vectors,
// allowing the model to track multiple relationships at once
const attentionOutput: Value[] = [];
for (let h = 0; h < nHead; h++) {
const headStart = h * headDim;
const headQuery = query.slice(headStart, headStart + headDim);
const headKeys = keys[li].map((ki) => ki.slice(headStart, headStart + headDim));
const headValues = values[li].map((vi) => vi.slice(headStart, headStart + headDim));
// Scaled dot-product attention: score = (query · key) / √headDim
const attnLogits = headKeys.map((cachedKey) =>
vsum(headQuery.map((q, j) => q.mul(cachedKey[j]))).div(Math.sqrt(headDim))
);
// Softmax converts scores into weights that sum to 1
const attnWeights = softmax(attnLogits);
// Weighted sum of value vectors — high-scoring positions contribute more
for (let j = 0; j < headDim; j++) {
attentionOutput.push(vsum(attnWeights.map((w, t) => w.mul(headValues[t][j]))));
}
}
// Project concatenated head outputs back to the hidden dimension
hidden = linear(attentionOutput, layer.attention.output);
// Residual connection: add back what we had before attention
hidden = hidden.map((h, i) => h.add(beforeAttention[i]));
// --- MLP block: process each token's representation independently ---
const beforeMLP: Value[] = hidden;
hidden = rmsnorm(hidden);
hidden = linear(hidden, layer.mlp.hidden); // expand to 4x wider
hidden = hidden.map((h) => h.relu()); // nonlinearity
hidden = linear(hidden, layer.mlp.output); // compress back
// Residual connection: add back what we had before the MLP
hidden = hidden.map((h, i) => h.add(beforeMLP[i]));
}
// Step 3: Output projection
// Project the final hidden state to vocabulary size — one score per word.
// These raw scores (logits) will be passed through softmax later to get
// a probability distribution over the next token.
return linear(hidden, weights.output);
}
rng.ts
/**
* Seeded Random Number Generator
*
* JavaScript has no built-in seeded RNG, so we roll our own:
* - mulberry32 for uniform random numbers
* - Box-Muller transform for Gaussian (normal) distribution
* - Fisher-Yates for array shuffling
* - Weighted random choice for sampling from probability distributions
*/
let _rngState = 0;
export function seed(s: number): void {
_rngState = s | 0;
}
/** Mulberry32: a fast, seedable 32-bit PRNG. Returns a uniform float in [0, 1). */
export function random(): number {
_rngState = (_rngState + 0x6d2b79f5) | 0;
let t = _rngState;
t = Math.imul(t ^ (t >>> 15), t | 1);
t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
}
/** Box-Muller transform: convert two uniform samples into a Gaussian sample. */
export function gauss(mean: number, std: number): number {
let u1 = random();
while (u1 === 0) u1 = random();
const u2 = random();
return mean + std * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}
/** Fisher-Yates shuffle: randomly reorder an array in-place. */
export function shuffle<T>(arr: T[]): void {
for (let i = arr.length - 1; i > 0; i--) {
const j = Math.floor(random() * (i + 1));
[arr[i], arr[j]] = [arr[j], arr[i]];
}
}
/** Sample an index from a weighted distribution. Used for token sampling during inference. */
export function weightedChoice(weights: number[]): number {
const total = weights.reduce((a, b) => a + b, 0);
let r = random() * total;
for (let i = 0; i < weights.length; i++) {
r -= weights[i];
if (r <= 0) return i;
}
return weights.length - 1;
}
nn.ts
/**
* Neural Network Primitives
*
* The building blocks that the GPT architecture assembles:
* - linear: matrix-vector multiply (the fundamental neural net operation)
* - softmax: convert raw scores into a probability distribution
* - rmsnorm: normalize activations to stabilize training
*
* These mirror PyTorch's torch.nn.functional — general-purpose operations
* used by the model, training loop, and inference.
*/
import { Value, vsum } from "./autograd.js";
export type Matrix = Value[][];
/** y = Wx: multiply a weight matrix by an input vector. */
export function linear(input: Value[], weights: Matrix): Value[] {
return weights.map((row) => vsum(row.map((w, i) => w.mul(input[i]))));
}
/** Convert raw logits to probabilities. Subtracts max for numerical stability. */
export function softmax(logits: Value[]): Value[] {
const maxVal = Math.max(...logits.map((v) => v.data));
const exps = logits.map((v) => v.sub(maxVal).exp());
const total = vsum(exps);
return exps.map((e) => e.div(total));
}
/** Root Mean Square normalization: scale activations so their root mean square is 1. */
export function rmsnorm(input: Value[]): Value[] {
const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
const scale = ms.add(1e-5).pow(-0.5);
return input.map((xi) => xi.mul(scale));
}
autograd.ts
/**
* Autograd Engine
*
* A scalar-valued automatic differentiation engine. Each Value node records:
* - its forward-pass result (data)
* - its gradient w.r.t. the loss (grad), filled in by backward()
* - its children in the computation graph and the local derivatives
*
* The backward() method applies the chain rule via reverse-mode autodiff:
* topologically sort the graph, then propagate gradients from output to inputs.
*
* "If I nudge this parameter slightly, how does the loss change?"
*/
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
exp(): Value {
return new Value(Math.exp(this.data), [this], [Math.exp(this.data)]);
}
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
neg(): Value {
return this.mul(-1);
}
sub(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.add(o.neg());
}
div(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.mul(o.pow(-1));
}
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
}
/** Sum a list of Values through the computation graph. */
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
* Builds a word-level vocabulary from the training corpus, with a BOS
 * (Beginning of Sequence) delimiter added at both ends of each encoded sentence.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
Fine-Tuning
Our model generates declarative sentences: “the cat runs to the park.” But what if we want it to generate questions instead? We could train a new model from scratch on question data, but there is a much better way: fine-tuning.
Fine-tuning means taking a model that already knows a language and continuing to train it on a small, specialized dataset. The model keeps its existing knowledge (what words mean, how sentences are structured) and learns a new pattern (how to form questions) on top of it.
This is exactly how ChatGPT works. OpenAI first trains a base model on trillions of tokens of internet text (pre-training), then fine-tunes it on a much smaller dataset of conversations (fine-tuning). The base model learns language; the fine-tuning teaches it to be a helpful assistant.
The Question Dataset
Our vocabulary has 596 words. We cannot introduce new words without resizing the embedding tables, so the question dataset must use only words the model already knows. This is realistic: real fine-tuning rarely changes the tokenizer.
The vocabulary includes four question words: can, do, is, and where.
That is enough for four question patterns:
can the cat run
do the birds fly
is the dog big
where is the ball
The file data/questions.txt contains 150 sentences following these patterns.
Each sentence uses only words from the existing vocabulary.
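The constraint is easy to check mechanically. Here is a minimal sketch, using tiny inline stand-ins for data/grade1_sentences.txt and data/questions.txt:

```typescript
// Verify that a fine-tuning dataset introduces no out-of-vocabulary words.
// These inline corpora are illustrative stand-ins for the real data files.
const pretrainCorpus = ["the cat can run to the park", "the dog is big", "birds fly fast"];
const questionCorpus = ["can the cat run", "is the dog big"];

const vocab = new Set(pretrainCorpus.flatMap((s) => s.split(" ")));
const oov = [...new Set(questionCorpus.flatMap((s) => s.split(" ")))]
  .filter((w) => !vocab.has(w));

if (oov.length > 0) {
  throw new Error(`out-of-vocabulary words: ${oov.join(", ")}`);
}
console.log("all fine-tuning words exist in the pre-training vocabulary");
```

Running the same check over the real files before fine-tuning catches any stray word that would otherwise map to index -1 in the tokenizer.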
Run the Fine-Tuning
The Fine-Tuning Script
Fine-tuning reuses the exact same train() function from pre-training. The only
differences are the inputs:
// Load the ORIGINAL corpus to build the SAME tokenizer.
// The tokenizer must match the one used during pre-training,
// because the model's embeddings are tied to specific token IDs.
const originalDocs = readFileSync("data/grade1_sentences.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
const tokenizer = createWordTokenizer(originalDocs);
This is critical. The tokenizer assigns each word an integer index by sorting all unique words alphabetically. If we built the tokenizer from the question dataset instead, the word “cat” might get a different index. The model’s embedding for index 89 means “cat” because that is what it learned during pre-training. Using a different tokenizer would scramble every word.
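To see why, run the same alphabetical-ID scheme over two different corpora (toy word lists, for illustration):

```typescript
// Word IDs come from sorting the corpus's unique words alphabetically,
// so a word's ID depends on which other words happen to be in the corpus.
const wordId = (corpus: string[], word: string): number =>
  [...new Set(corpus.flatMap((s) => s.split(" ")))].sort().indexOf(word);

const idInOriginal = wordId(["the cat runs"], "cat");     // vocab: cat, runs, the
const idInQuestions = wordId(["can the cat run"], "cat"); // vocab: can, cat, run, the

// Same word, different ID: embeddings learned under one mapping are
// meaningless under the other.
console.log(idInOriginal, idInQuestions);
```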
// Load the pre-trained model
const model = loadModel("phrases-model.json");
// Load the fine-tuning dataset
const questionDocs = readFileSync("data/questions.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
Then we call the same train() function with two key changes to the
configuration:
const fineTuned = train(
model,
{
numSteps: 1000, // 1,000 vs 5,000 for pre-training
learningRate: 0.001, // 0.001 vs 0.01 — 10x lower
beta1: 0.85,
beta2: 0.99,
epsAdam: 1e-8,
},
questionDocs,
tokenizer,
);
Why Fewer Steps?
The pre-trained model already knows that “the” comes before nouns and that “cat” is an animal. It does not need to relearn any of this. It only needs to learn that sentences can start with “can”, “do”, “is”, or “where” and that what follows is a slightly different word order. 1,000 steps is enough to shift the distribution without destroying what the model already knows.
Why a Lower Learning Rate?
The learning rate controls how aggressively each gradient update changes the parameters. During pre-training, we used 0.01 because the model was starting from random noise and needed to move fast. During fine-tuning, the parameters are already in a good place. A large learning rate would overshoot, destroying the carefully learned representations. A 10x smaller learning rate (0.001) nudges the model gently toward question patterns while preserving its existing knowledge.
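Ignoring Adam's moment scaling for a moment, the effect on a single parameter looks like this (illustrative numbers):

```typescript
// One SGD-style view of an update: newParam = param - learningRate * grad.
// Adam rescales the gradient, but the learning rate still sets the step size.
const param = 0.42;
const grad = 0.8;

const pretrainStep = 0.01 * grad;  // 0.008: big moves, fine from random init
const finetuneStep = 0.001 * grad; // 0.0008: gentle nudges that preserve structure

console.log(param - pretrainStep, param - finetuneStep);
```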
Running It
npx tsx src/phrases-fine-tune.ts
You should see output like this:
vocab size: 597 (596 words + BOS)
loaded pre-trained model: 63296 params
fine-tuning sentences: 150
step 1 / 1000 | loss 2.5073
step 500 / 1000 | loss 0.9592
step 1000 / 1000 | loss 1.0530
model saved to phrases-fine-tuned-model.json
Notice two things:
- The loss starts around 2.5, not 6.39. The pre-trained model is not guessing randomly. It already assigns reasonable probability to the correct next word, even for question patterns it has rarely seen. This head start is the whole point of fine-tuning.
- The loss drops quickly. In pre-training, the model needed 5,000 steps to go from 6.39 to ~1.76. Here it reaches similar loss in just 1,000 steps because it already understands the building blocks.
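The 6.39 figure is not arbitrary: it is the cross-entropy of a uniform guess over the 597-token vocabulary, which you can verify in one line:

```typescript
// A model that knows nothing assigns probability 1/vocabSize to every token,
// so its loss is -log(1/vocabSize) = log(vocabSize).
const vocabSize = 597; // 596 words + BOS
const randomGuessLoss = -Math.log(1 / vocabSize);
console.log(randomGuessLoss.toFixed(2)); // ≈ 6.39
```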
Generate Questions
Use the --model flag to generate from the fine-tuned model:
npx tsx src/phrases-generate.ts 20 --model=phrases-fine-tuned-model.json
Compare this with the base model:
npx tsx src/phrases-generate.ts 20
The fine-tuned model should produce noticeably more questions: sentences starting with “can”, “do”, “is”, and “where”. The base model will mostly produce declarative sentences.
The fine-tuned model will still produce some declarative sentences too. It has not completely forgotten how to make statements; it has just shifted its probability distribution to favor question patterns.
Catastrophic Forgetting
What happens if we fine-tune too aggressively? Try increasing the learning rate or step count:
{ numSteps: 5000, learningRate: 0.01, ... } // same as pre-training
With these settings, the model will almost exclusively produce questions and lose its ability to generate varied declarative sentences. This is called catastrophic forgetting: the new data completely overwrites the old knowledge.
This is one of the fundamental tensions in fine-tuning. You want the model to learn the new task, but you do not want it to forget everything else. In practice, this is managed by:
- Low learning rates (what we did here)
- Few training steps (what we did here)
- Freezing layers (not updating early transformer layers, only the later ones)
- LoRA (adding small adapter matrices instead of modifying the original weights)
Our approach is the simplest version: just be gentle with the learning rate and stop early. It works well for small models and small distribution shifts.
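For a sense of how the LoRA option works, here is a minimal sketch (not part of this repository): the pre-trained weight matrix stays frozen, and only a low-rank correction B·A is trained.

```typescript
// LoRA sketch: effectiveW(x) = Wx + B(Ax), with A of shape r x n and
// B of shape n x r for a small rank r. W is frozen; only A and B train.
// Plain numbers here for clarity; the tutorial's real weights are Value matrices.
type Mat = number[][];

const matVec = (m: Mat, x: number[]): number[] =>
  m.map((row) => row.reduce((acc, w, i) => acc + w * x[i], 0));

function loraForward(W: Mat, A: Mat, B: Mat, x: number[]): number[] {
  const base = matVec(W, x);             // frozen pre-trained path
  const delta = matVec(B, matVec(A, x)); // trainable low-rank path
  return base.map((v, i) => v + delta[i]);
}

// Rank-1 adapter on a 2x2 weight. At the tutorial's nEmbd = 32, a rank-1
// adapter would train 64 numbers per matrix instead of 1,024.
const W: Mat = [[1, 0], [0, 1]];
const A: Mat = [[0.5, 0.5]]; // 1 x 2
const B: Mat = [[0.1], [0.1]]; // 2 x 1
const y = loraForward(W, A, B, [1, 2]);
console.log(y); // base output [1, 2] plus a small learned correction
```

Because the base weights never change, the original knowledge cannot be overwritten; at worst, the adapter learns nothing useful.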
What This Teaches
Fine-tuning is the most common way LLMs are deployed in the real world:
Pre-training (expensive, done once):
trillions of tokens -> base model
Fine-tuning (cheap, done many times):
base model + small dataset -> specialized model
GPT-4 is fine-tuned for conversation. Code Llama is fine-tuned for programming. Med-PaLM is fine-tuned for medical questions. Each starts from the same kind of base model and adapts it with a small, targeted dataset.
The entire fine-tuning pipeline here is five lines of setup and one call to
train(). The same function, the same optimizer, the same math. The only
differences are the data and the hyperparameters. That simplicity is not a
shortcut. It is how fine-tuning actually works.
Complete Code
Here is the complete source code for fine-tuning, along with all files from previous chapters.
phrases-fine-tune.ts
/**
* MicroGPT — Fine-Tune on Questions
*
* Loads a pre-trained model and fine-tunes it on a small question dataset.
* The model learns to produce questions while preserving its existing knowledge.
*
* npx tsx src/phrases-fine-tune.ts
*/
import { readFileSync } from "node:fs";
import { seed, shuffle } from "./rng.js";
import { createWordTokenizer } from "./tokenizer.js";
import { loadModel, saveModel } from "./model.js";
import { train } from "./train.js";
// 1. Seed the RNG for reproducibility
seed(42);
// 2. Load the ORIGINAL corpus to build the SAME tokenizer.
// The tokenizer must match the one used during pre-training,
// because the model's embeddings are tied to specific token IDs.
const originalDocs = readFileSync("data/grade1_sentences.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
const tokenizer = createWordTokenizer(originalDocs);
console.log(`vocab size: ${tokenizer.vocabSize} (${tokenizer.vocabSize - 1} words + BOS)`);
// 3. Load the pre-trained model
const model = loadModel("phrases-model.json");
console.log(`loaded pre-trained model: ${model.params.length} params`);
// 4. Load the fine-tuning dataset (questions)
const questionDocs = readFileSync("data/questions.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
shuffle(questionDocs);
console.log(`fine-tuning sentences: ${questionDocs.length}`);
// 5. Fine-tune: fewer steps, lower learning rate.
// - Lower learning rate (0.001 vs 0.01) preserves existing knowledge
// - Fewer steps (1000 vs 5000) since we're adapting, not learning from scratch
const fineTuned = train(
model,
{ numSteps: 1000, learningRate: 0.001, beta1: 0.85, beta2: 0.99, epsAdam: 1e-8 },
questionDocs,
tokenizer,
);
// 6. Save the fine-tuned model
saveModel(fineTuned, "phrases-fine-tuned-model.json");
console.log("\nmodel saved to phrases-fine-tuned-model.json");
phrases-train.ts
/**
* MicroGPT — Train on Phrases
*
* Trains a word-level GPT on grade 1 English sentences and saves the model.
* Run inference afterwards with: npx tsx phrases-generate.ts
*
* npx tsx phrases-train.ts
*/
import { readFileSync } from "node:fs";
import { seed, shuffle } from "./rng.js";
import { createWordTokenizer } from "./tokenizer.js";
import { createModel, saveModel } from "./model.js";
import { train } from "./train.js";
// 1. Seed the RNG for reproducibility
seed(42);
// 2. Load dataset: 30K grade 1 sentences
const sentences = readFileSync("data/grade1_sentences.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
shuffle(sentences);
console.log(`num sentences: ${sentences.length}`);
// 3. Build word-level tokenizer from the corpus
const tokenizer = createWordTokenizer(sentences);
console.log(`vocab size: ${tokenizer.vocabSize} (${tokenizer.vocabSize - 1} words + BOS)`);
// 4. Create the model — bigger than the names model to handle the larger vocabulary
const untrained = createModel({
nLayer: 2,
nEmbd: 32,
blockSize: 16,
nHead: 4,
headDim: 8,
vocabSize: tokenizer.vocabSize,
});
console.log(`num params: ${untrained.params.length}`);
// 5. Train the model
const model = train(
untrained,
{ numSteps: 5000, learningRate: 0.01, beta1: 0.85, beta2: 0.99, epsAdam: 1e-8 },
sentences,
tokenizer,
);
// 6. Save the trained model to disk
saveModel(model, "phrases-model.json");
console.log("\nmodel saved to phrases-model.json");
phrases-generate.ts
/**
* MicroGPT — Generate Phrases
*
* Loads a trained model and generates new sentences.
* Train first with: npx tsx phrases-train.ts
*
* npx tsx phrases-generate.ts
* npx tsx phrases-generate.ts 50 # generate 50 sentences
* npx tsx phrases-generate.ts 20 --temp=0.3 # low temperature
* npx tsx phrases-generate.ts 20 --top-k=10 # top-k filtering
* npx tsx phrases-generate.ts 20 --top-p=0.9 # nucleus sampling
* npx tsx phrases-generate.ts 20 --temp=0.7 --top-p=0.9 --top-k=50
* npx tsx phrases-generate.ts 20 --model=phrases-fine-tuned-model.json
*/
import { readFileSync } from "node:fs";
import { seed } from "./rng.js";
import { createWordTokenizer } from "./tokenizer.js";
import { loadModel } from "./model.js";
import { generate, type GenerateOptions } from "./generate.js";
// Parse CLI arguments
const args = process.argv.slice(2);
const positional = args.filter((a) => !a.startsWith("--"));
const flags = Object.fromEntries(
args.filter((a) => a.startsWith("--")).map((a) => {
const [k, v] = a.slice(2).split("=");
return [k, v];
})
);
const numSamples = parseInt(positional[0] ?? "20", 10);
const options: GenerateOptions = {
temperature: parseFloat(flags["temp"] ?? "0.8"),
topK: parseInt(flags["top-k"] ?? "0", 10),
topP: parseFloat(flags["top-p"] ?? "1.0"),
};
// 1. Seed the RNG
seed(Date.now());
// 2. Rebuild the tokenizer from the same corpus (needed for encode/decode)
const corpus = readFileSync("data/grade1_sentences.txt", "utf-8")
.split("\n")
.filter((l) => l.trim())
.map((l) => l.trim().toLowerCase());
const tokenizer = createWordTokenizer(corpus);
// 3. Load the trained model
const modelPath = flags["model"] ?? "phrases-model.json";
const model = loadModel(modelPath);
console.log(`loaded ${modelPath}: ${model.params.length} params, vocab ${model.config.vocabSize}`);
// 4. Generate
const parts = [`temperature=${options.temperature}`];
if (options.topK! > 0) parts.push(`top-k=${options.topK}`);
if (options.topP! < 1.0) parts.push(`top-p=${options.topP}`);
console.log(`generating ${numSamples} sentences (${parts.join(", ")}):\n`);
const sentences = generate(model, tokenizer, numSamples, options);
sentences.forEach((s, i) =>
console.log(` ${String(i + 1).padStart(2)}. ${s}`)
);
generate.ts
/**
* Inference / Text Generation
*
* Starting from the BOS token, the model predicts one token at a time:
* 1. Forward the current token through the model to get logits
* 2. Apply temperature scaling (lower = more conservative, higher = more creative)
* 3. Apply top-k filtering (keep only the k most likely tokens)
* 4. Apply top-p / nucleus filtering (keep smallest set summing to p)
* 5. Convert to probabilities via softmax
* 6. Randomly sample the next token from that distribution
* 7. Stop when BOS is produced (end of sequence) or max length is reached
*/
import { type Model, gpt, createKVCache } from "./model.js";
import { weightedChoice } from "./rng.js";
import type { Tokenizer } from "./tokenizer.js";
export interface GenerateOptions {
temperature?: number;
topK?: number;
topP?: number;
}
/** Generate new text samples from a trained model. */
export function generate(
model: Model,
tokenizer: Tokenizer,
numSamples: number,
options: GenerateOptions = {},
): string[] {
const { temperature = 0.8, topK = 0, topP = 1.0 } = options;
const samples: string[] = [];
for (let i = 0; i < numSamples; i++) {
const { keys, values } = createKVCache(model);
let tokenId = tokenizer.BOS;
const tokens: number[] = [];
for (let posId = 0; posId < model.config.blockSize; posId++) {
const logits = gpt(model, tokenId, posId, keys, values);
// Temperature scaling
let scores = logits.map((l) => l.data / temperature);
// Top-k: keep only the k highest scores
if (topK > 0 && topK < scores.length) {
const sorted = [...scores].sort((a, b) => b - a);
const cutoff = sorted[topK - 1];
scores = scores.map((s) => (s >= cutoff ? s : -Infinity));
}
// Top-p (nucleus): keep smallest set whose probabilities sum to p
if (topP < 1.0) {
const indices = scores.map((s, idx) => idx);
indices.sort((a, b) => scores[b] - scores[a]);
// Compute softmax on current scores to get probabilities for filtering
const maxS = Math.max(...scores.filter((s) => s !== -Infinity));
const exps = scores.map((s) => (s === -Infinity ? 0 : Math.exp(s - maxS)));
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((e) => e / total);
let cumSum = 0;
let exceeded = false;
for (const idx of indices) {
if (exceeded) {
scores[idx] = -Infinity;
}
cumSum += probs[idx];
if (cumSum > topP) {
exceeded = true;
}
}
}
// Softmax and sample (using plain numbers, not Value, for efficiency)
const maxS = Math.max(...scores.filter((s) => s !== -Infinity));
const exps = scores.map((s) => (s === -Infinity ? 0 : Math.exp(s - maxS)));
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((e) => e / total);
tokenId = weightedChoice(probs);
if (tokenId === tokenizer.BOS) break;
tokens.push(tokenId);
}
samples.push(tokenizer.decode(tokens));
}
return samples;
}
train.ts
/**
* Training Loop
*
* Each step:
* 1. Pick a sentence and tokenize it
* 2. Forward each token through the model, predicting the next token
* 3. Compute cross-entropy loss (how surprised the model is at the correct answer)
* 4. Backward pass: compute gradients via the chain rule
* 5. Adam optimizer: update parameters to reduce loss
*
* Loss starts at -log(1/vocabSize) (random guessing) and decreases as
* the model learns to predict the next word.
*/
import { Value, vsum } from "./autograd.js";
import { softmax } from "./nn.js";
import { type Model, gpt, createKVCache } from "./model.js";
import type { Tokenizer } from "./tokenizer.js";
export interface TrainConfig {
numSteps: number;
learningRate: number;
beta1: number;
beta2: number;
epsAdam: number;
}
/** Train the model on the dataset and return the trained model. */
export function train(
model: Model,
trainConfig: TrainConfig,
sentences: string[],
tokenizer: Tokenizer,
): Model {
const { numSteps, learningRate, beta1, beta2, epsAdam } = trainConfig;
const { params } = model;
// Adam optimizer moment buffers
const mBuf = new Float64Array(params.length);
const vBuf = new Float64Array(params.length);
for (let step = 0; step < numSteps; step++) {
// Tokenize a single sentence, surrounded by BOS on both sides
const sentence = sentences[step % sentences.length];
const tokens = tokenizer.encode(sentence);
const n = Math.min(model.config.blockSize, tokens.length - 1);
// Forward: build the computation graph from tokens to loss
const { keys, values } = createKVCache(model);
const losses: Value[] = [];
for (let posId = 0; posId < n; posId++) {
const tokenId = tokens[posId];
const targetId = tokens[posId + 1];
const logits = gpt(model, tokenId, posId, keys, values);
const probs = softmax(logits);
losses.push(probs[targetId].log().neg());
}
// Average cross-entropy loss over the sequence
const loss = vsum(losses).div(n);
// Backward: walk the computation graph and fill in .grad on every Value node.
// Returns void — gradients are stored as a side effect on the same param objects
// that the Adam loop below reads from.
loss.backward();
// Adam update with linear learning rate decay
const lrT = learningRate * (1 - step / numSteps);
for (let i = 0; i < params.length; i++) {
mBuf[i] = beta1 * mBuf[i] + (1 - beta1) * params[i].grad;
vBuf[i] = beta2 * vBuf[i] + (1 - beta2) * params[i].grad ** 2;
const mHat = mBuf[i] / (1 - beta1 ** (step + 1));
const vHat = vBuf[i] / (1 - beta2 ** (step + 1));
params[i].data -= lrT * mHat / (Math.sqrt(vHat) + epsAdam);
params[i].grad = 0;
}
process.stdout.write(
`\rstep ${String(step + 1).padStart(4)} / ${numSteps} | loss ${loss.data.toFixed(4)}`
);
}
return model;
}
model.ts
/**
* GPT Model
*
* The transformer architecture: a function that maps input tokens to a
* probability distribution over what comes next.
*
* Follows GPT-2 with minor simplifications:
* - RMSNorm instead of LayerNorm
* - No biases
* - ReLU instead of GeLU
*
* Config: 32 embedding dims, 4 attention heads, 2 layers, 16 max context
* → 63,296 parameters total
*/
import { readFileSync, writeFileSync } from "node:fs";
import { Value, vsum } from "./autograd.js";
import { type Matrix, linear, softmax, rmsnorm } from "./nn.js";
import { gauss } from "./rng.js";
// --- Configuration ---
export interface GPTConfig {
nLayer: number;
nEmbd: number;
blockSize: number;
nHead: number;
headDim: number;
vocabSize: number;
}
interface Attention {
query: Matrix;
key: Matrix;
value: Matrix;
output: Matrix;
}
interface MLP {
hidden: Matrix;
output: Matrix;
}
interface Layer {
attention: Attention;
mlp: MLP;
}
export interface Weights {
tokenEmbedding: Matrix;
positionEmbedding: Matrix;
output: Matrix;
layers: Layer[];
}
/** The trained (or untrained) model: config + weights + flattened parameter list. */
export interface Model {
config: GPTConfig;
weights: Weights;
params: Value[];
}
// --- Model Creation ---
function matrix(nout: number, nin: number, std = 0.08): Matrix {
return Array.from({ length: nout }, () =>
Array.from({ length: nin }, () => new Value(gauss(0, std)))
);
}
/** Create a new model with randomly initialized weights. */
export function createModel(config: GPTConfig): Model {
const { nEmbd, nLayer, blockSize, vocabSize } = config;
const weights: Weights = {
tokenEmbedding: matrix(vocabSize, nEmbd),
positionEmbedding: matrix(blockSize, nEmbd),
output: matrix(vocabSize, nEmbd),
layers: Array.from({ length: nLayer }, () => ({
attention: {
query: matrix(nEmbd, nEmbd),
key: matrix(nEmbd, nEmbd),
value: matrix(nEmbd, nEmbd),
output: matrix(nEmbd, nEmbd),
},
mlp: {
hidden: matrix(4 * nEmbd, nEmbd),
output: matrix(nEmbd, 4 * nEmbd),
},
})),
};
const allMatrices: Matrix[] = [
weights.tokenEmbedding,
weights.positionEmbedding,
weights.output,
...weights.layers.flatMap((layer) => [
layer.attention.query, layer.attention.key,
layer.attention.value, layer.attention.output,
layer.mlp.hidden, layer.mlp.output,
]),
];
const params = allMatrices.flatMap((mat) => mat.flatMap((row) => row));
return { config, weights, params };
}
// --- Save / Load ---
/** Save a trained model to a JSON file (config + parameter values). */
export function saveModel(model: Model, path: string): void {
const data = {
config: model.config,
weights: model.params.map((p) => p.data),
};
writeFileSync(path, JSON.stringify(data));
}
/** Load a model from a JSON file. Recreates the structure, then fills in the learned weights. */
export function loadModel(path: string): Model {
const data = JSON.parse(readFileSync(path, "utf-8"));
const model = createModel(data.config);
for (let i = 0; i < model.params.length; i++) {
model.params[i].data = data.weights[i];
}
return model;
}
// --- KV Cache ---
/** Create fresh key/value caches for a new sequence. Must be called per-sequence. */
export function createKVCache(model: Model): {
keys: Value[][][];
values: Value[][][];
} {
return {
keys: Array.from({ length: model.config.nLayer }, () => []),
values: Array.from({ length: model.config.nLayer }, () => []),
};
}
// --- Forward Pass ---
/**
* Run one step of the GPT: given a token at a position, return logits
* over the vocabulary for the next token.
*
* The keys/values caches are mutated (appended to) on each call —
* this is the KV cache that avoids recomputing attention for past positions.
*/
export function gpt(
model: Model,
tokenId: number,
posId: number,
keys: Value[][][],
values: Value[][][],
): Value[] {
const { nLayer, nHead, headDim } = model.config;
const { weights } = model;
// Step 1: Embedding lookup
// Combine "what word is this?" with "where does it appear?" into a single
// vector. This is the hidden state that flows through the rest of the network.
const tokenVector: Value[] = weights.tokenEmbedding[tokenId];
const positionVector: Value[] = weights.positionEmbedding[posId];
let hidden: Value[] = tokenVector.map((t, i) => t.add(positionVector[i]));
// Normalize before the first layer to keep values at a stable scale
hidden = rmsnorm(hidden);
// Step 2: Transformer layers
// Each layer has two blocks: attention (gather context from other tokens)
// followed by MLP (process the gathered information). Both use residual
// connections so the input is added back to the output of each block.
for (let li = 0; li < nLayer; li++) {
const layer = weights.layers[li];
// --- Attention block: look at previous tokens to gather context ---
// Save the hidden state so we can add it back after the block (residual)
const beforeAttention: Value[] = hidden;
hidden = rmsnorm(hidden);
// Project the hidden state into query, key, and value vectors.
// These are structurally identical projections — training teaches them
// to play different roles: query asks "what am I looking for?",
// key advertises "what do I contain?", value carries "what to retrieve".
const query: Value[] = linear(hidden, layer.attention.query);
const key: Value[] = linear(hidden, layer.attention.key);
const value: Value[] = linear(hidden, layer.attention.value);
// Cache the key and value so future tokens can attend to this position
keys[li].push(key);
values[li].push(value);
// Each head independently attends to a different slice of the vectors,
// allowing the model to track multiple relationships at once
const attentionOutput: Value[] = [];
for (let h = 0; h < nHead; h++) {
const headStart = h * headDim;
const headQuery = query.slice(headStart, headStart + headDim);
const headKeys = keys[li].map((ki) => ki.slice(headStart, headStart + headDim));
const headValues = values[li].map((vi) => vi.slice(headStart, headStart + headDim));
// Scaled dot-product attention: score = (query · key) / √headDim
const attnLogits = headKeys.map((cachedKey) =>
vsum(headQuery.map((q, j) => q.mul(cachedKey[j]))).div(Math.sqrt(headDim))
);
// Softmax converts scores into weights that sum to 1
const attnWeights = softmax(attnLogits);
// Weighted sum of value vectors — high-scoring positions contribute more
for (let j = 0; j < headDim; j++) {
attentionOutput.push(vsum(attnWeights.map((w, t) => w.mul(headValues[t][j]))));
}
}
// Project concatenated head outputs back to the hidden dimension
hidden = linear(attentionOutput, layer.attention.output);
// Residual connection: add back what we had before attention
hidden = hidden.map((h, i) => h.add(beforeAttention[i]));
// --- MLP block: process each token's representation independently ---
const beforeMLP: Value[] = hidden;
hidden = rmsnorm(hidden);
hidden = linear(hidden, layer.mlp.hidden); // expand to 4x the width
hidden = hidden.map((h) => h.relu()); // nonlinearity
hidden = linear(hidden, layer.mlp.output); // compress back
// Residual connection: add back what we had before the MLP
hidden = hidden.map((h, i) => h.add(beforeMLP[i]));
}
// Step 3: Output projection
// Project the final hidden state to vocabulary size — one score per word.
// These raw scores (logits) will be passed through softmax later to get
// a probability distribution over the next token.
return linear(hidden, weights.output);
}
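Before moving on, it helps to run the attention arithmetic with concrete numbers. The sketch below mirrors one head's computation (scaled dot product, softmax, weighted sum of values) using plain numbers instead of `Value` nodes, with a made-up `headDim` of 2 and a cache of two positions:

```typescript
// Plain-number sketch of one head's attention step (no autograd).
// score_t = (q · k_t) / sqrt(headDim); softmax over cached positions;
// output_j = sum_t weight_t * value_t[j].
const q = [1, 0];
const cachedKeys = [[1, 0], [0, 1]];
const cachedValues = [[10, 0], [0, 10]];
const headDim = q.length;

const logits = cachedKeys.map(
  (k) => (q[0] * k[0] + q[1] * k[1]) / Math.sqrt(headDim)
);
const maxVal = Math.max(...logits);
const exps = logits.map((x) => Math.exp(x - maxVal));
const total = exps.reduce((a, b) => a + b, 0);
const attnWeights = exps.map((e) => e / total);

const out = [0, 1].map((j) =>
  attnWeights.reduce((acc, w, t) => acc + w * cachedValues[t][j], 0)
);
console.log(attnWeights, out); // ≈ [0.67, 0.33], [6.70, 3.30]
```

The query points along the first key's direction, so position 0 receives roughly twice the weight of position 1, and the output leans toward the first value vector. That is the whole mechanism: alignment between query and key decides how much of each value flows through.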
rng.ts
/**
* Seeded Random Number Generator
*
* JavaScript has no built-in seeded RNG, so we roll our own:
* - mulberry32 for uniform random numbers
* - Box-Muller transform for Gaussian (normal) distribution
* - Fisher-Yates for array shuffling
* - Weighted random choice for sampling from probability distributions
*/
let _rngState = 0;
export function seed(s: number): void {
_rngState = s | 0;
}
/** Mulberry32: a fast, seedable 32-bit PRNG. Returns a uniform float in [0, 1). */
export function random(): number {
_rngState = (_rngState + 0x6d2b79f5) | 0;
let t = _rngState;
t = Math.imul(t ^ (t >>> 15), t | 1);
t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
}
/** Box-Muller transform: convert two uniform samples into a Gaussian sample. */
export function gauss(mean: number, std: number): number {
let u1 = random();
while (u1 === 0) u1 = random();
const u2 = random();
return mean + std * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}
/** Fisher-Yates shuffle: randomly reorder an array in-place. */
export function shuffle<T>(arr: T[]): void {
for (let i = arr.length - 1; i > 0; i--) {
const j = Math.floor(random() * (i + 1));
[arr[i], arr[j]] = [arr[j], arr[i]];
}
}
/** Sample an index from a weighted distribution. Used for token sampling during inference. */
export function weightedChoice(weights: number[]): number {
const total = weights.reduce((a, b) => a + b, 0);
let r = random() * total;
for (let i = 0; i < weights.length; i++) {
r -= weights[i];
if (r <= 0) return i;
}
return weights.length - 1;
}
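To see how `weightedChoice` lands on an index, here is the same subtraction loop with the RNG draw passed in explicitly. (`weightedChoiceWith` is a sketch for illustration, not part of the repository.) With weights `[1, 2, 1]` the total is 4, so a draw of 0.6 scales to r = 2.4: subtracting weight 1 leaves 1.4, subtracting weight 2 leaves -0.6, and index 1 is returned.

```typescript
// Same algorithm as weightedChoice, but with the uniform draw supplied
// directly so each outcome can be traced by hand.
function weightedChoiceWith(r01: number, weights: number[]): number {
  const total = weights.reduce((a, b) => a + b, 0);
  let r = r01 * total;
  for (let i = 0; i < weights.length; i++) {
    r -= weights[i];
    if (r <= 0) return i;
  }
  return weights.length - 1;
}

console.log(weightedChoiceWith(0.6, [1, 2, 1])); // 1
```

Draws near 0 select the first index and draws near 1 select the last, with each index winning in proportion to its weight, which is exactly what token sampling needs.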
nn.ts
/**
* Neural Network Primitives
*
* The building blocks that the GPT architecture assembles:
* - linear: matrix-vector multiply (the fundamental neural net operation)
* - softmax: convert raw scores into a probability distribution
* - rmsnorm: normalize activations to stabilize training
*
* These mirror PyTorch's torch.nn.functional — general-purpose operations
* used by the model, training loop, and inference.
*/
import { Value, vsum } from "./autograd.js";
export type Matrix = Value[][];
/** y = Wx: multiply a weight matrix by an input vector. */
export function linear(input: Value[], weights: Matrix): Value[] {
return weights.map((row) => vsum(row.map((w, i) => w.mul(input[i]))));
}
/** Convert raw logits to probabilities. Subtracts max for numerical stability. */
export function softmax(logits: Value[]): Value[] {
const maxVal = Math.max(...logits.map((v) => v.data));
const exps = logits.map((v) => v.sub(maxVal).exp());
const total = vsum(exps);
return exps.map((e) => e.div(total));
}
/** Root Mean Square normalization: scale activations so their root mean square is 1. */
export function rmsnorm(input: Value[]): Value[] {
const ms = vsum(input.map((xi) => xi.mul(xi))).div(input.length);
const scale = ms.add(1e-5).pow(-0.5);
return input.map((xi) => xi.mul(scale));
}
autograd.ts
/**
* Autograd Engine
*
* A scalar-valued automatic differentiation engine. Each Value node records:
* - its forward-pass result (data)
* - its gradient w.r.t. the loss (grad), filled in by backward()
* - its children in the computation graph and the local derivatives
*
* The backward() method applies the chain rule via reverse-mode autodiff:
* topologically sort the graph, then propagate gradients from output to inputs.
*
* "If I nudge this parameter slightly, how does the loss change?"
*/
export class Value {
data: number;
grad: number;
children: Value[];
localGrads: number[];
constructor(data: number, children: Value[] = [], localGrads: number[] = []) {
this.data = data;
this.grad = 0;
this.children = children;
this.localGrads = localGrads;
}
add(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data + o.data, [this, o], [1, 1]);
}
mul(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return new Value(this.data * o.data, [this, o], [o.data, this.data]);
}
pow(n: number): Value {
return new Value(this.data ** n, [this], [n * this.data ** (n - 1)]);
}
log(): Value {
return new Value(Math.log(this.data), [this], [1 / this.data]);
}
exp(): Value {
const e = Math.exp(this.data); // compute once: eˣ is both the result and the local gradient
return new Value(e, [this], [e]);
}
relu(): Value {
return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
}
neg(): Value {
return this.mul(-1);
}
sub(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.add(o.neg());
}
div(other: Value | number): Value {
const o = typeof other === "number" ? new Value(other) : other;
return this.mul(o.pow(-1));
}
backward(): void {
const topo: Value[] = [];
const visited = new Set<Value>();
const buildTopo = (v: Value): void => {
if (!visited.has(v)) {
visited.add(v);
for (const child of v.children) buildTopo(child);
topo.push(v);
}
};
buildTopo(this);
this.grad = 1;
for (const v of topo.reverse()) {
for (let i = 0; i < v.children.length; i++) {
v.children[i].grad += v.localGrads[i] * v.grad;
}
}
}
}
/** Sum a list of Values through the computation graph. */
export function vsum(values: Value[]): Value {
return values.reduce((acc, v) => acc.add(v), new Value(0));
}
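To convince yourself the chain-rule machinery works, here is a hand-checkable example. It uses a stripped-down copy of the `Value` class above (only `add`, `mul`, and `backward`) so it runs standalone: for f(x) = x·x + 3x, calculus gives df/dx = 2x + 3, which is 7 at x = 2. Note that x appears in three places in the graph, and its gradient correctly accumulates contributions from all of them.

```typescript
// Minimal copy of the autograd Value class, enough to verify backward().
class V {
  grad = 0;
  constructor(public data: number, public children: V[] = [], public localGrads: number[] = []) {}
  add(o: V): V { return new V(this.data + o.data, [this, o], [1, 1]); }
  mul(o: V): V { return new V(this.data * o.data, [this, o], [o.data, this.data]); }
  backward(): void {
    const topo: V[] = [];
    const visited = new Set<V>();
    const build = (v: V): void => {
      if (visited.has(v)) return;
      visited.add(v);
      for (const c of v.children) build(c);
      topo.push(v);
    };
    build(this);
    this.grad = 1;
    for (const v of topo.reverse()) {
      v.children.forEach((c, i) => (c.grad += v.localGrads[i] * v.grad));
    }
  }
}

const x = new V(2);
const f = x.mul(x).add(x.mul(new V(3))); // f(x) = x² + 3x
f.backward();
console.log(f.data, x.grad); // 10 7
```

The x·x term contributes 2 + 2 to the gradient (x is both children of that node) and the 3x term contributes 3, summing to 7. This accumulation via `+=` is why shared nodes work without any special casing.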
tokenizer.ts
/**
* Tokenizer
*
* Translates strings to sequences of integers ("tokens") and back.
* Builds a word-level vocabulary from the training corpus. Encoding wraps
* each sentence in BOS (Beginning of Sequence) delimiter tokens.
*/
export interface Tokenizer {
vocabSize: number;
BOS: number;
encode(sentence: string): number[];
decode(tokens: number[]): string;
}
/** Word-level tokenizer. Discovers the word vocabulary from the corpus. */
export function createWordTokenizer(sentences: string[]): Tokenizer {
const words = [...new Set(sentences.flatMap((d) => d.split(" ")))].sort();
const BOS = words.length;
const vocabSize = words.length + 1;
return {
vocabSize,
BOS,
encode(sentence: string): number[] {
return [BOS, ...sentence.split(" ").map((w) => words.indexOf(w)), BOS];
},
decode(tokens: number[]): string {
return tokens.map((t) => words[t]).join(" ");
},
};
}
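A round trip on a toy corpus shows the scheme in action. The snippet inlines the same logic as `createWordTokenizer` so it runs standalone; the two sentences are made up for illustration. The vocabulary is the sorted set of unique words, and the BOS id is one past the last word index:

```typescript
// Inline version of the word-level tokenizer on a two-sentence corpus.
const sentences = ["the cat sat", "the dog ran"];
const words = [...new Set(sentences.flatMap((s) => s.split(" ")))].sort();
const BOS = words.length; // first id after the word vocabulary

const encode = (s: string): number[] =>
  [BOS, ...s.split(" ").map((w) => words.indexOf(w)), BOS];
const decode = (tokens: number[]): string =>
  tokens.map((t) => words[t]).join(" ");

console.log(words);                 // ["cat", "dog", "ran", "sat", "the"]
console.log(encode("the cat sat")); // [5, 4, 0, 3, 5]
```

Decoding the inner tokens (dropping the BOS markers at each end) recovers the original sentence exactly. One caveat worth knowing: a word that never appeared in the corpus has no index, so `indexOf` returns -1; the tutorial only ever encodes corpus words, so this case does not arise here.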