Transformer Core

This part builds the transformer block from its major components. The chapters move from one attention head to a full block that can be stacked into a GPT model.

Chapter 6, Attention — Tokens Talking to Each Other: each token learns which other tokens to read from.
Chapter 7, RoPE: Position Inside Attention: position information moves inside the attention score itself, because rotary position encoding rotates the queries and keys before they are compared.
Chapter 8, Multi-Head Attention — Many Conversations at Once: several attention patterns run in parallel.
Chapter 9, Feed-Forward Network — The Model’s Memory: each token representation is transformed independently by the same position-wise neural network.
Chapter 10, The Transformer Block — Putting It Together: attention, feed-forward layers, residual connections, and normalization become one reusable block.
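
As a rough preview of where these chapters end up, the sketch below wires the pieces into one block in NumPy. It is an illustration under stated assumptions, not the code the chapters develop: it uses a single attention head, a causal mask (assumed because the block is meant for a GPT-style model), pre-normalization (layer norm applied before each sub-layer), a ReLU feed-forward network, and no learned scale or bias in the normalization. Rotary position encoding and multiple heads are left out to keep it short. All names (`layer_norm`, `causal_self_attention`, `feed_forward`, `transformer_block`, `params`) are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learned scale and bias omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def causal_self_attention(x, w_q, w_k, w_v, w_o):
    # Single-head attention: each token reads only from itself and earlier tokens.
    seq_len = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)            # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ w_o

def feed_forward(x, w_1, w_2):
    # Position-wise network: the same two-layer MLP applied to every token.
    return np.maximum(x @ w_1, 0.0) @ w_2

def transformer_block(x, params):
    # Pre-norm residual layout (an assumption): x + sublayer(norm(x)).
    x = x + causal_self_attention(layer_norm(x), *params["attn"])
    x = x + feed_forward(layer_norm(x), *params["ffn"])
    return x

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32
params = {
    "attn": [rng.normal(0.0, 0.02, (d_model, d_model)) for _ in range(4)],
    "ffn": [rng.normal(0.0, 0.02, (d_model, d_ff)),
            rng.normal(0.0, 0.02, (d_ff, d_model))],
}
x = rng.normal(size=(seq_len, d_model))

# The block maps (seq_len, d_model) to the same shape, so it can be stacked.
# A real model would give each layer its own parameters; they are reused here.
out = transformer_block(transformer_block(x, params), params)
print(out.shape)
```

Because the input and output shapes match, stacking is just repeated application, which is what makes the block reusable.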