
Configuration

The configuration defines the shape of the model:

const model = createModel({
  nLayer: 2,       // number of transformer layers
  nEmbd: 32,       // embedding dimension (size of internal vectors)
  blockSize: 16,   // maximum sequence length (longest sentence we can process)
  nHead: 4,        // number of attention heads
  headDim: 8,      // dimension per attention head (nEmbd / nHead)
  vocabSize: 597,  // our tokenizer's vocabulary size
});

These are small numbers. Production models use nEmbd in the thousands and stack dozens of layers, but the architecture is the same; ours just fits in memory and trains in minutes instead of months.
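To make "small" concrete, here is a back-of-the-envelope parameter count for this configuration. This is a sketch assuming a standard GPT-style layout (roughly 4·nEmbd² for the attention projections and 8·nEmbd² for a 4× MLP per layer, plus token and positional embeddings); the actual model may differ in biases and output-layer weight tying.

```javascript
// Rough parameter count for the config above (GPT-style assumptions, not exact).
const nLayer = 2, nEmbd = 32, blockSize = 16, vocabSize = 597;

const tokenEmbed = vocabSize * nEmbd;   // 597 * 32 = 19104
const posEmbed   = blockSize * nEmbd;   // 16 * 32 = 512
const perLayer   = 12 * nEmbd * nEmbd;  // 4*nEmbd^2 (attention) + 8*nEmbd^2 (MLP)

const approxParams = tokenEmbed + posEmbed + nLayer * perLayer;
console.log(approxParams); // on the order of 44k parameters
```

Compare that to production models, which run into the billions of parameters.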

A note on nHead: with 32 embedding dimensions, 4 heads is a good balance. Each head gets 32 / 4 = 8 dimensions to work with. Two heads would give 16 dims each (fewer distinct attention patterns), and 8 heads would give 4 dims each (very little room per head at this small scale).
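The divisibility constraint implied here (headDim = nEmbd / nHead) is worth enforcing explicitly. A minimal sketch; the `headDims` helper is hypothetical, not part of the source's API:

```javascript
// Hypothetical helper: derive the per-head dimension and reject
// configurations where nEmbd does not split evenly across heads.
function headDims(nEmbd, nHead) {
  if (nEmbd % nHead !== 0) {
    throw new Error(`nEmbd (${nEmbd}) must be divisible by nHead (${nHead})`);
  }
  return nEmbd / nHead;
}

console.log(headDims(32, 4)); // 8 dims per head, as in the config above
console.log(headDims(32, 2)); // 16 dims per head: fewer, wider heads
console.log(headDims(32, 8)); // 4 dims per head: many, narrow heads
```

Failing fast on an invalid split is cheaper than debugging shape mismatches deep inside the attention computation.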