Parameters: The Model’s Memory

When we create a model, we allocate a set of weight matrices filled with small random numbers:

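// Build an nout x nin matrix. Each entry is a trainable Value
// wrapping a random sample from gauss(0, std).
function matrix(nout: number, nin: number, std = 0.08): Matrix {
  return Array.from({ length: nout }, () =>
    Array.from({ length: nin }, () => new Value(gauss(0, std)))
  );
}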
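To try the initializer on its own, here is a minimal self-contained sketch. The `Value` class and the `gauss` function below are hypothetical stand-ins (a bare number wrapper and a Box-Muller sampler), written only so the snippet runs in isolation; the book's own definitions may differ. It builds one 32 x 32 matrix and checks that the entries are centered near zero with a spread close to `std`.

```typescript
// Hypothetical stand-in for the book's Value class: just wraps a number.
class Value {
  data: number;
  constructor(data: number) {
    this.data = data;
  }
}

type Matrix = Value[][];

// Assumed implementation of gauss: a normal sample via the Box-Muller transform.
function gauss(mean: number, std: number): number {
  const u1 = 1 - Math.random(); // in (0, 1], so Math.log never sees 0
  const u2 = Math.random();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return mean + std * z;
}

function matrix(nout: number, nin: number, std = 0.08): Matrix {
  return Array.from({ length: nout }, () =>
    Array.from({ length: nin }, () => new Value(gauss(0, std)))
  );
}

// Build one attention-sized matrix and summarize its entries.
const query = matrix(32, 32);
const rowCount = query.length;
const allRowsHave32Cols = query.every((row) => row.length === 32);

let sum = 0;
let count = 0;
for (const row of query) {
  for (const v of row) {
    sum += v.data;
    count += 1;
  }
}
const sampleMean = sum / count;

let squaredDiffs = 0;
for (const row of query) {
  for (const v of row) {
    squaredDiffs += (v.data - sampleMean) * (v.data - sampleMean);
  }
}
const sampleStd = Math.sqrt(squaredDiffs / count);

console.log(rowCount, allRowsHave32Cols, sampleMean.toFixed(3), sampleStd.toFixed(3));
```

The printed mean and standard deviation vary from run to run because the weights are random, but with 1,024 samples they should land close to 0 and 0.08 respectively.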

Each weight matrix serves a specific role. Here is every matrix in our model and what it does:

| Matrix | Shape | Purpose |
| --- | --- | --- |
| weights.tokenEmbedding | 597 x 32 | Token embeddings: one 32-dim vector per word |
| weights.positionEmbedding | 16 x 32 | Position embeddings: one 32-dim vector per position |
| weights.output | 597 x 32 | Output projection: maps back to vocabulary |
| layers[0].attention.query | 32 x 32 | Attention query weights (layer 0) |
| layers[0].attention.key | 32 x 32 | Attention key weights (layer 0) |
| layers[0].attention.value | 32 x 32 | Attention value weights (layer 0) |
| layers[0].attention.output | 32 x 32 | Attention output weights (layer 0) |
| layers[0].mlp.hidden | 128 x 32 | MLP hidden layer (layer 0) |
| layers[0].mlp.output | 32 x 128 | MLP output layer (layer 0) |
| layers[1].attention.query | 32 x 32 | Attention query weights (layer 1) |
| layers[1].attention.key | 32 x 32 | Attention key weights (layer 1) |
| layers[1].attention.value | 32 x 32 | Attention value weights (layer 1) |
| layers[1].attention.output | 32 x 32 | Attention output weights (layer 1) |
| layers[1].mlp.hidden | 128 x 32 | MLP hidden layer (layer 1) |
| layers[1].mlp.output | 32 x 128 | MLP output layer (layer 1) |

Total: 63,296 parameters. Every one of these numbers will be adjusted during training.
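The total can be checked by multiplying rows by columns for each matrix in the table above and summing the results. The sketch below does exactly that; the `Entry` type and the `entries` list are illustrative helpers, not part of the model code, and the shapes are copied from the table.

```typescript
// Illustrative helper type: one row of the shape table.
type Entry = { name: string; rows: number; cols: number };

// Matrices that exist once per model.
const entries: Entry[] = [
  { name: "weights.tokenEmbedding", rows: 597, cols: 32 },
  { name: "weights.positionEmbedding", rows: 16, cols: 32 },
  { name: "weights.output", rows: 597, cols: 32 },
];

// Matrices that repeat in each of the two layers.
for (let layer = 0; layer < 2; layer++) {
  const prefix = `layers[${layer}]`;
  entries.push(
    { name: `${prefix}.attention.query`, rows: 32, cols: 32 },
    { name: `${prefix}.attention.key`, rows: 32, cols: 32 },
    { name: `${prefix}.attention.value`, rows: 32, cols: 32 },
    { name: `${prefix}.attention.output`, rows: 32, cols: 32 },
    { name: `${prefix}.mlp.hidden`, rows: 128, cols: 32 },
    { name: `${prefix}.mlp.output`, rows: 32, cols: 128 }
  );
}

// Each matrix contributes rows * cols parameters.
const totalParams = entries.reduce((sum, e) => sum + e.rows * e.cols, 0);

console.log(totalParams); // 63296
```

Broken down, the two embedding tables and the output projection account for 19,104 + 512 + 19,104 = 38,720 parameters, and each layer adds 4 x 1,024 + 2 x 4,096 = 12,288, giving 38,720 + 2 x 12,288 = 63,296.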