Parameters: The Model’s Memory

When we create a model, we allocate a set of weight matrices filled with small random numbers:

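// Allocate an nout x nin matrix of Value parameters.
// Each entry is sampled independently via gauss(0, std).
function matrix(nout: number, nin: number, std = 0.08): Matrix {
  return Array.from({ length: nout }, () =>
    Array.from({ length: nin }, () => new Value(gauss(0, std)))
  );
}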
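
To try `matrix` in isolation, the sketch below supplies minimal stand-ins for the pieces it depends on. The `Value` class, the `Matrix` type, and the Box-Muller `gauss` shown here are assumptions made for illustration, not necessarily the definitions used elsewhere in this series; the real `Value` presumably carries more than a number and a gradient slot.

```typescript
// Hypothetical stand-in: just enough of Value for the allocation to run.
class Value {
  grad = 0;
  constructor(public data: number) {}
}

// Assumed alias: a matrix is an array of rows of Values.
type Matrix = Value[][];

// Assumed implementation: Box-Muller transform, one normal sample per call.
function gauss(mean: number, std: number): number {
  const u1 = 1 - Math.random(); // in (0, 1], so Math.log(u1) stays finite
  const u2 = Math.random();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return mean + std * z;
}

function matrix(nout: number, nin: number, std = 0.08): Matrix {
  return Array.from({ length: nout }, () =>
    Array.from({ length: nin }, () => new Value(gauss(0, std)))
  );
}

// A 32 x 32 matrix, the shape used for the attention weights.
const query = matrix(32, 32);
console.log(query.length, query[0].length); // 32 32
```

Each call to the inner `Array.from` callback constructs a fresh `Value`, so no two entries share the same object and each can later be updated independently.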

Each weight matrix serves a specific role. Here is every matrix in our model and what it does:

| Matrix | Shape | Purpose |
| --- | --- | --- |
| `weights.tokenEmbedding` | 597 x 32 | Token embeddings: one 32-dim vector per word |
| `weights.positionEmbedding` | 16 x 32 | Position embeddings: one 32-dim vector per position |
| `weights.output` | 597 x 32 | Output projection: maps back to vocabulary |
| `layers[0].attention.query` | 32 x 32 | Attention query weights (layer 0) |
| `layers[0].attention.key` | 32 x 32 | Attention key weights (layer 0) |
| `layers[0].attention.value` | 32 x 32 | Attention value weights (layer 0) |
| `layers[0].attention.output` | 32 x 32 | Attention output weights (layer 0) |
| `layers[0].mlp.hidden` | 128 x 32 | MLP hidden layer (layer 0) |
| `layers[0].mlp.output` | 32 x 128 | MLP output layer (layer 0) |
| `layers[1].attention.query` | 32 x 32 | Attention query weights (layer 1) |
| `layers[1].attention.key` | 32 x 32 | Attention key weights (layer 1) |
| `layers[1].attention.value` | 32 x 32 | Attention value weights (layer 1) |
| `layers[1].attention.output` | 32 x 32 | Attention output weights (layer 1) |
| `layers[1].mlp.hidden` | 128 x 32 | MLP hidden layer (layer 1) |
| `layers[1].mlp.output` | 32 x 128 | MLP output layer (layer 1) |

Total: 63,296 parameters. Every one of these numbers will be adjusted during training.
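
The total can be checked by multiplying out the shapes in the table above. This is a small sketch of that arithmetic; the constant names (`vocabSize`, `embedDim`, `contextLen`, `mlpDim`, `numLayers`) are chosen here for illustration and may not match the names used in the model code.

```typescript
// Dimensions read off the table; the names are illustrative assumptions.
const vocabSize = 597;
const embedDim = 32;
const contextLen = 16;
const mlpDim = 128;
const numLayers = 2;

type Shape = [rows: number, cols: number];

// Matrices that exist once for the whole model.
const globalShapes: Shape[] = [
  [vocabSize, embedDim],  // weights.tokenEmbedding
  [contextLen, embedDim], // weights.positionEmbedding
  [vocabSize, embedDim],  // weights.output
];

// Matrices that are repeated in every layer.
const layerShapes: Shape[] = [
  [embedDim, embedDim], // attention.query
  [embedDim, embedDim], // attention.key
  [embedDim, embedDim], // attention.value
  [embedDim, embedDim], // attention.output
  [mlpDim, embedDim],   // mlp.hidden
  [embedDim, mlpDim],   // mlp.output
];

// Sum of rows * cols over a list of shapes.
const countParams = (shapes: Shape[]): number =>
  shapes.reduce((sum, [rows, cols]) => sum + rows * cols, 0);

const globalParams = countParams(globalShapes);  // 38,720
const perLayerParams = countParams(layerShapes); // 12,288
const totalParams = globalParams + numLayers * perLayerParams;

console.log(totalParams); // 63296
```

The two vocabulary-sized matrices account for 38,208 of the 63,296 parameters, so most of this small model's capacity sits in the embedding and output tables rather than in the layers.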
