- Introduction
- Building the Model
- 1. Prerequisites
- 1.1. Software
- 1.2. The Training Data
- 1.3. The Pipeline
- 2. The Tokenizer
- 2.1. Building the Vocabulary
- 2.2. Encoding and Decoding
- 2.3. Why Word-Level Tokens?
- 2.4. Complete Code
- 3. The Autograd Engine
- 3.1. The Math You Need
- 3.2. The Value Class
- 3.3. Derived Operations
- 3.4. The Computation Graph
- 3.5. Backward Pass
- 3.6. Complete Code
- 4. Neural Network Primitives
- 4.1. Linear
- 4.2. Softmax
- 4.3. RMSNorm
- 4.4. Complete Code
- 5. The Model
- 5.1. Configuration
- 5.2. Parameters
- 5.3. Embeddings
- 5.4. Attention
- 5.5. MLP
- 5.6. Residual Connections
- 5.7. The KV Cache
- 5.8. Creating the Model
- 5.9. Running the Model
- 5.10. Complete Code
- Training and Inference
- 6. Training
- 6.1. The Training Configuration
- 6.2. The Training Loop
- 6.3. Watching It Learn
- 6.4. Complete Code
- 7. Saving the Model
- 7.1. What Gets Saved
- 7.2. Loading the Model
- 8. Generation
- 8.1. The Generation Loop
- 8.2. Sampling Strategies
- 8.3. The KV Cache
- 8.4. Example Output
- 8.5. Complete Code
- Putting It to Work
- 9. Smoke Test
- 9.1. Train the Model
- 9.2. Generate Sentences
- 9.3. What You Have Built
- 9.4. Complete Code
- 10. Fine-Tuning
- 10.1. The Question Dataset
- 10.2. Run the Fine-Tuning
- 10.3. Generate Questions
- 10.4. Catastrophic Forgetting
- 10.5. Complete Code