Appendix A — microGPT in Python — Complete Runnable Code

This appendix assembles every Python snippet from the book into three files under src/python/, with all dependencies resolved. The implementation uses only the Python standard library.

A.1 Files

  • common.py — forward-pass building blocks (embeddings, attention, FFN, transformer, GPT forward)
  • train.py — loss functions and gradient descent (cross-entropy, backprop, SGD)
  • inference.py — generation and end-to-end demo

A.2 Installation & Running

# Run the full demo
python3 src/python/inference.py

# Run every chapter's end-to-end code check
python3 src/python/run_book_code.py

A.3 End-to-End Demo

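import random

# Assumption: these names are defined in common.py (see A.1); adjust the import if they live elsewhere.
from common import GPTConfig, count_parameters, gpt_forward, make_gpt_params, softmax


def demo(seed: int = 7) -> dict[str, object]:
    """Build a small random model, run one forward pass, and sample a continuation."""
    config = GPTConfig(vocab_size=100, d_model=32, num_heads=4, num_layers=2, max_seq_len=64)
    # Separate RNGs so that sampling does not disturb parameter initialization.
    model_rng = random.Random(seed)
    sample_rng = random.Random(seed + 1)
    params = make_gpt_params(config, model_rng)
    # Logits over the vocabulary for the prompt [1, 2, 3, 4, 5].
    logits = gpt_forward([1, 2, 3, 4, 5], params, config)
    probabilities = softmax(logits)
    # Index of the most probable token (the greedy choice).
    top_token = max(range(len(probabilities)), key=probabilities.__getitem__)
    # Sample a continuation of the prompt [1, 2, 3]; generate lives in inference.py next to demo.
    generated = generate(params, config, [1, 2, 3], 10, temperature=0.8, top_k=10, rng=sample_rng)
    return {
        "config": config,
        "parameters": count_parameters(config),
        "logits_first_10": logits[:10],
        "top_token": top_token,
        "top_probability": probabilities[top_token],
        "generated": generated,
        "full_sequence": [1, 2, 3] + generated,
    }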
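
The demo passes temperature=0.8 and top_k=10 to generate, whose body is not reproduced in this appendix section. As a rough illustration of what those two arguments conventionally control, the standalone sketch below samples one token from a logit vector. The helper name sample_top_k and its signature are hypothetical; this is not the book's generate, only a minimal version of the usual technique (keep the k largest logits, divide by the temperature, softmax, sample).

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=10, rng=None):
    """Sample one token index from `logits` using top-k filtering and temperature.

    Hypothetical helper for illustration; assumes temperature > 0 and top_k >= 1.
    """
    rng = rng or random.Random()
    # Indices of the top_k largest logits (fewer if the vocabulary is smaller).
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:top_k]
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Numerically stable softmax weights; random.choices normalizes them itself.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


example_logits = [0.1, 2.0, -1.0, 0.5]
# With top_k=1 only the largest logit survives, so sampling reduces to arg-max.
print(sample_top_k(example_logits, top_k=1))  # always index 1
```

Passing a seeded random.Random instance as rng, as demo does with sample_rng, makes the sampled sequence reproducible.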

A.4 microGPT Architecture Summary (Reference)

Component               Input               Output              Parameters
Token Embedding         [T] ints            \([T \times d]\)    \(|V| \times d\)
Positional Embedding    [T] positions       \([T \times d]\)    \(T_{max} \times d\)
× N Transformer Blocks:
  LayerNorm 1           \([T \times d]\)    \([T \times d]\)    \(2d\)
  Multi-Head Attn       \([T \times d]\)    \([T \times d]\)    \(4d^2\)
  Residual              \([T \times d]\)    \([T \times d]\)    0
  LayerNorm 2           \([T \times d]\)    \([T \times d]\)    \(2d\)
  FFN                   \([T \times d]\)    \([T \times d]\)    \(8d^2\)
  Residual              \([T \times d]\)    \([T \times d]\)    0
Final LayerNorm         \([T \times d]\)    \([T \times d]\)    \(2d\)
Unembedding             \([d]\)             \([|V|]\)           \(d \times |V|\) (tied)

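Each transformer block contributes \(2d + 4d^2 + 2d + 8d^2 = 12d^2 + 4d\) parameters (weight matrices plus LayerNorm scale and shift; linear-layer biases are not counted).

Total parameters, with the unembedding tied to the token embedding: \(|V|d + T_{max}\cdot d + N(12d^2 + 4d) + 2d\). Without tying, the unembedding adds a second \(|V|d\) term.

For GPT-2 small: d=768, N=12, |V|=50257, T_max=1024 → ~124M params with tied weights (~163M untied). The commonly quoted 117M figure for GPT-2 small is lower than this count.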
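
The total can be checked numerically. The sketch below adds up the per-component counts from the table (weights and LayerNorm parameters only, no linear-layer biases), with the unembedding either sharing the token embedding matrix or stored separately. The function name closed_form_param_count and the tied flag are assumptions made for this example; it is separate from the book's count_parameters(config).

```python
def closed_form_param_count(vocab_size: int, d_model: int, num_layers: int,
                            max_seq_len: int, tied: bool = True) -> int:
    """Evaluate the architecture table's parameter counts (hypothetical helper)."""
    d = d_model
    token_emb = vocab_size * d            # |V| x d
    pos_emb = max_seq_len * d             # T_max x d
    # Per block: LayerNorm 1 (2d) + attention (4d^2) + LayerNorm 2 (2d) + FFN (8d^2).
    per_block = 2 * d + 4 * d * d + 2 * d + 8 * d * d
    final_ln = 2 * d
    # A tied unembedding reuses the token embedding matrix and adds nothing.
    unembed = 0 if tied else vocab_size * d
    return token_emb + pos_emb + num_layers * per_block + final_ln + unembed


tied_total = closed_form_param_count(50257, 768, 12, 1024, tied=True)
untied_total = closed_form_param_count(50257, 768, 12, 1024, tied=False)
print(f"{tied_total:,}")    # 124,356,864
print(f"{untied_total:,}")  # 162,954,240
```

With GPT-2 small's dimensions, the tied count is about 124M and the untied count about 163M; the difference is exactly one extra \(|V| \times d\) matrix.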