Appendix A — microGPT in Python — Complete Runnable Code
This appendix assembles every Python snippet from the book into three files under src/python/, with all dependencies resolved. The implementation uses only the Python standard library.
A.1 Files
- common.py: forward-pass building blocks (embeddings, attention, FFN, transformer, GPT forward)
- train.py: loss functions and gradient descent (cross-entropy, backprop, SGD)
- inference.py: generation and end-to-end demo
A.2 Installation & Running
# Run the full demo
python3 src/python/inference.py
# Run every chapter's end-to-end code check
python3 src/python/run_book_code.py
A.3 End-to-End Demo
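The demo in this section passes temperature=0.8 and top_k=10 to generate. As a standalone illustration of what those two settings conventionally mean, here is a minimal sketch using only the standard library. The helper name sample_top_k and its signature are hypothetical and are not taken from inference.py; the book's generate may differ in detail.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=None, rng=None):
    """Sample one token id from a list of logits.

    Hypothetical helper for illustration; not the book's generate().
    """
    rng = rng or random.Random()
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Rank token ids by logit, highest first, and keep the top_k best
    # (all of them when top_k is None).
    ids = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    if top_k is not None:
        ids = ids[:top_k]
    # Temperature-scaled softmax weights, shifted by the max for stability.
    scaled = [logits[i] / temperature for i in ids]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Random.choices draws proportionally to the unnormalized weights.
    return rng.choices(ids, weights=weights, k=1)[0]
```

With top_k=1 the sketch reduces to greedy decoding, since only the highest-scoring token survives the cut. Lower temperatures sharpen the distribution over the kept tokens and higher temperatures flatten it.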
import random

# GPTConfig, make_gpt_params, gpt_forward, softmax, count_parameters, and
# generate are defined in the files listed in A.1.


def demo(seed: int = 7) -> dict[str, object]:
    config = GPTConfig(vocab_size=100, d_model=32, num_heads=4, num_layers=2, max_seq_len=64)
    # Separate RNGs so that sampling does not disturb parameter initialization.
    model_rng = random.Random(seed)
    sample_rng = random.Random(seed + 1)
    params = make_gpt_params(config, model_rng)
    # One forward pass over a 5-token prompt, then the most probable token id.
    logits = gpt_forward([1, 2, 3, 4, 5], params, config)
    probabilities = softmax(logits)
    top_token = max(range(len(probabilities)), key=probabilities.__getitem__)
    # Generate 10 tokens from a 3-token prompt.
    generated = generate(params, config, [1, 2, 3], 10, temperature=0.8, top_k=10, rng=sample_rng)
    return {
        "config": config,
        "parameters": count_parameters(config),
        "logits_first_10": logits[:10],
        "top_token": top_token,
        "top_probability": probabilities[top_token],
        "generated": generated,
        "full_sequence": [1, 2, 3] + generated,
    }

A.4 microGPT Architecture Summary (Reference)
| Component | Input | Output | Parameters |
|---|---|---|---|
| Token Embedding | [T] ints | \([T \times d]\) | \(\lvert V \rvert \times d\) |
| Positional Embedding | [T] positions | \([T \times d]\) | \(T_{max} \times d\) |
| × N Transformer Blocks | | | |
| LayerNorm 1 | \([T \times d]\) | \([T \times d]\) | 2d |
| Multi-Head Attn | \([T \times d]\) | \([T \times d]\) | \(4d^2\) |
| Residual | \([T \times d]\) | \([T \times d]\) | 0 |
| LayerNorm 2 | \([T \times d]\) | \([T \times d]\) | 2d |
| FFN | \([T \times d]\) | \([T \times d]\) | \(8d^2\) |
| Residual | \([T \times d]\) | \([T \times d]\) | 0 |
| Final LayerNorm | \([T \times d]\) | \([T \times d]\) | 2d |
| Unembedding | [d] | \([\lvert V \rvert]\) | \(d \times \lvert V \rvert\) (tied) |
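Total parameters, counting the tied \(|V| \times d\) matrix once: \(|V|d + T_{max}\cdot d + N(12d^2 + 4d) + 2d\)
For GPT-2 small: d=768, N=12, |V|=50257, T_max=1024 → ~124M params by this formula (the figure often quoted for this model is 117M).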
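The total can be checked numerically. The sketch below adds up the rows of the summary table; count_params_formula and its tied flag are illustrative names, not part of the book's code. As in the table, bias terms outside LayerNorm are ignored. The comments on the attention and FFN terms assume the usual four d×d projections and a 4d-wide hidden layer, which is what \(4d^2\) and \(8d^2\) imply.

```python
def count_params_formula(vocab_size, d_model, num_layers, max_seq_len, tied=True):
    """Evaluate the closed-form parameter count from the summary table.

    Illustrative helper; the book's count_parameters may count differently.
    """
    d = d_model
    token_emb = vocab_size * d       # |V| x d
    pos_emb = max_seq_len * d        # T_max x d
    attn = 4 * d * d                 # assumed: query, key, value, output projections
    ffn = 8 * d * d                  # assumed: d -> 4d -> d
    layer_norms = 2 * (2 * d)        # two LayerNorms per block, 2d each
    per_block = attn + ffn + layer_norms
    final_ln = 2 * d
    # A tied unembedding reuses the token embedding matrix and adds nothing.
    unembed = 0 if tied else vocab_size * d
    return token_emb + pos_emb + num_layers * per_block + final_ln + unembed


# GPT-2 small dimensions
print(count_params_formula(50257, 768, 12, 1024))              # 124356864
print(count_params_formula(50257, 768, 12, 1024, tied=False))  # 162954240
# Dimensions of the demo config in A.3
print(count_params_formula(100, 32, 2, 64))                    # 30144
```

The tied flag makes the effect of weight sharing explicit: for GPT-2 small dimensions, counting the unembedding separately adds another 38,597,376 parameters.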