Preface

Who This Book Is For

This book is for programmers who want to understand how GPT-style language models actually work — not just use them.

If you have written Python before and can follow basic algebra, you have enough background to read every page. You do not need prior experience with machine learning, deep learning, or neural networks. The book introduces every concept from scratch, explains the math in plain language before writing it as a formula, and implements each idea in code you can run and inspect.

If you are already a machine-learning practitioner, this book gives you a clean, self-contained reference for the transformer architecture. The implementations are deliberately simple — no frameworks, no abstractions beyond plain Python — so nothing hides what is actually happening.

How to Use This Book

Read the chapters in order on the first pass. Each chapter builds on the previous one: tokens before embeddings, embeddings before attention, attention before the full transformer block. Skipping ahead is possible, but you may find yourself returning to fill gaps.

Every code snippet is part of a complete, runnable program. The source lives in src/python/ and can be run with python3 src/python/run_book_code.py. Every diagram is generated from code in src/matplotlib/, src/figures/, or directly inside the chapter files. If a diagram surprises you, open its source and change the numbers.

Work through the exercises at the end of each chapter. The exercises are not optional decoration — they are the fastest way to verify that you understood the material rather than just read it.

What the Book Covers

The book has fourteen chapters and one appendix:

Foundations. 1  Introduction gives an overview of GPT and its history. 2  Notation and Definitions introduces the math notation used throughout: vectors, matrices, dot products, softmax, logarithms, mean, and variance.

Representations. 3  Tokens — Text to Numbers explains tokenization and byte-pair encoding. 4  Embeddings — Numbers to Meaning covers word embeddings and the embedding lookup. 5  Positional Encoding — Giving Order to Meaning introduces sinusoidal positional encoding.

Transformer Core. 6  Attention — Tokens Talking to Each Other derives scaled dot-product attention step by step. 7  RoPE: Position Inside Attention shows how modern GPT-style models put position directly inside attention. 8  Multi-Head Attention — Many Conversations at Once extends single-head attention to multiple heads that run in parallel. 9  Feed-Forward Network — The Model’s Memory covers the feed-forward sub-layer. 10  The Transformer Block — Putting It Together assembles these pieces into a full transformer block with residual connections and layer normalization.

Prediction and Learning. 11  Vocabulary Projection — From Vectors to Words shows how the model's output vectors are projected onto the vocabulary to predict words. 12  Loss — How the Model Learns derives cross-entropy loss and perplexity. 13  Training — Teaching the Model traces backpropagation and the Adam optimizer through the full model.

Modern GPT. 14  Modern GPT describes the changes in current models: RoPE positional encoding, grouped-query attention, SwiGLU activations, and RMSNorm.

Appendix A presents the complete micro-GPT implementation — the same code from across the chapters assembled into one readable file.

Should I Buy This Book?

The full text is free to read online at the book’s website. Nothing is paywalled.

If you find the book useful and want to support the work, you can purchase the EPUB or PDF edition from the releases page. Buying a copy is a way of saying thank you — but it is entirely optional. The content is identical to what you are reading now.