Appendix A — microGPT in Python — Complete Runnable Code

This appendix assembles every Python snippet from the book into three files under src/python/, with all dependencies resolved. The implementation uses only the Python standard library.

A.1 Files

common.py — forward-pass building blocks (embeddings, attention, FFN, transformer, GPT forward)
train.py — loss functions and gradient descent (cross-entropy, backprop, SGD)
inference.py — generation and end-to-end demo

A.2 Installation & Running

# Run the full demo
python3 src/python/inference.py

# Run every chapter's end-to-end code check
python3 src/python/run_book_code.py

A.3 End-to-End Demo

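import random

# GPTConfig, make_gpt_params, gpt_forward, softmax, count_parameters, and
# generate come from the files listed in A.1.

def demo(seed: int = 7) -> dict[str, object]:
    config = GPTConfig(vocab_size=100, d_model=32, num_heads=4, num_layers=2, max_seq_len=64)
    # Separate RNGs keep weight initialization and sampling independently reproducible.
    model_rng = random.Random(seed)
    sample_rng = random.Random(seed + 1)
    params = make_gpt_params(config, model_rng)

    # One forward pass: next-token logits for a 5-token prompt.
    logits = gpt_forward([1, 2, 3, 4, 5], params, config)
    probabilities = softmax(logits)
    # Greedy choice: index of the highest-probability token.
    top_token = max(range(len(probabilities)), key=probabilities.__getitem__)

    # Sample 10 new tokens from a 3-token prompt.
    prompt = [1, 2, 3]
    generated = generate(params, config, prompt, 10, temperature=0.8, top_k=10, rng=sample_rng)
    return {
        "config": config,
        "parameters": count_parameters(config),
        "logits_first_10": logits[:10],
        "top_token": top_token,
        "top_probability": probabilities[top_token],
        "generated": generated,
        "full_sequence": prompt + generated,
    }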
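
The `generate` call in the demo passes `temperature=0.8` and `top_k=10`. As a rough illustration of what those two arguments usually mean, here is a minimal standalone sketch of one sampling step. The function name `sample_top_k` and its signature are hypothetical and are not taken from the book's `inference.py`; the book's `generate` may differ in detail.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=10, rng=None):
    """Sample one token id from the top_k highest logits (hypothetical helper).

    Assumes temperature > 0.
    """
    if rng is None:
        rng = random.Random()
    # Keep only the indices of the top_k largest logits.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Numerically stable softmax numerators over the surviving candidates.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 3.5, 0.7]
rng = random.Random(8)
samples = [sample_top_k(toy_logits, temperature=0.8, top_k=2, rng=rng) for _ in range(5)]
print(samples)  # every entry is 1 or 3, the two highest-logit tokens
```

With `top_k=1` the sketch reduces to greedy decoding, which is the same choice the demo makes when it computes `top_token`.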

A.4 microGPT Architecture Summary (Reference)

| Component | Input | Output | Parameters |
|---|---|---|---|
| Token Embedding | `[T]` ints | \([T \times d]\) | \(\|V\| \times d\) |
| Positional Embedding | `[T]` positions | \([T \times d]\) | \(T_{max} \times d\) |
| × N Transformer Blocks | | | |
| LayerNorm 1 | \([T \times d]\) | \([T \times d]\) | \(2d\) |
| Multi-Head Attn | \([T \times d]\) | \([T \times d]\) | \(4d^2\) |
| Residual | \([T \times d]\) | \([T \times d]\) | 0 |
| LayerNorm 2 | \([T \times d]\) | \([T \times d]\) | \(2d\) |
| FFN | \([T \times d]\) | \([T \times d]\) | \(8d^2\) |
| Residual | \([T \times d]\) | \([T \times d]\) | 0 |
| Final LayerNorm | \([T \times d]\) | \([T \times d]\) | \(2d\) |
| Unembedding | `[d]` | \([\|V\|]\) | \(d \times \|V\|\) (tied) |
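
The row order inside the repeated block (LayerNorm, sublayer, residual) suggests a pre-LN arrangement. The sketch below shows that wiring on plain Python lists, assuming the residual adds the sublayer output back to the un-normalized input. The names `layer_norm` and `block` and the placeholder sublayers are illustrative and are not the book's `common.py` code; attention internals and causal masking are left out.

```python
import math


def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize one d-dimensional vector; gain and bias are the 2d parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]


def block(xs, attn, ffn, ln1, ln2):
    """Pre-LN block: xs is a list of T vectors of size d; returns the same shape."""
    # LayerNorm 1 -> attention (sees all T positions) -> residual add
    normed = [layer_norm(x, *ln1) for x in xs]
    xs = [[a + b for a, b in zip(x, y)] for x, y in zip(xs, attn(normed))]
    # LayerNorm 2 -> FFN (applied per position) -> residual add
    normed = [layer_norm(x, *ln2) for x in xs]
    return [[a + b for a, b in zip(x, ffn(n))] for x, n in zip(xs, normed)]


T, d = 3, 4
xs = [[float(t + i) for i in range(d)] for t in range(T)]
ln = ([1.0] * d, [0.0] * d)  # gain = 1, bias = 0


def zero_attn(seq):
    # Placeholder sublayer: contributes nothing, so only the residual path remains.
    return [[0.0] * d for _ in seq]


def zero_ffn(vec):
    # Placeholder sublayer, applied to one position at a time.
    return [0.0] * d


out = block(xs, zero_attn, zero_ffn, ln, ln)
print(len(out), len(out[0]))  # 3 4
```

Because both placeholder sublayers return zeros, the residual path carries the input through unchanged, which makes the shape-preserving `[T × d]` to `[T × d]` behavior in the table easy to check.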

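Total parameters (the unembedding is tied to the token embedding, so \(|V| \times d\) is counted once): \(|V|d + T_{max}\cdot d + N(12d^2 + 4d) + 2d\)

For GPT-2 small (d=768, N=12, |V|=50257, T_max=1024) this gives 124,356,864, or about 124M parameters. The count omits linear-layer bias vectors, which the table does not list.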
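
The table rows can also be summed directly in code, which makes it easy to see how much the tied unembedding matters. This is a sketch under the table's assumptions (no linear-layer biases, LayerNorm gain and bias only). The name `estimate_parameters` is hypothetical and is separate from the book's `count_parameters`, which may count differently.

```python
def estimate_parameters(vocab_size, d_model, num_layers, max_seq_len, tied=True):
    """Sum the Parameters column of the architecture table (hypothetical helper)."""
    d = d_model
    token_embedding = vocab_size * d
    positional_embedding = max_seq_len * d
    # LayerNorm 1 + attention + LayerNorm 2 + FFN = 12d^2 + 4d per block
    per_block = 2 * d + 4 * d * d + 2 * d + 8 * d * d
    final_layer_norm = 2 * d
    # A tied unembedding reuses the token embedding matrix, so it adds nothing.
    unembedding = 0 if tied else d * vocab_size
    return (token_embedding + positional_embedding
            + num_layers * per_block + final_layer_norm + unembedding)


gpt2_small = dict(vocab_size=50257, d_model=768, num_layers=12, max_seq_len=1024)
print(estimate_parameters(**gpt2_small))              # 124356864
print(estimate_parameters(**gpt2_small, tied=False))  # 162954240
```

The two results differ by exactly \(|V| \times d\) = 38,597,376, the size of one embedding matrix.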