Prediction and Learning

This part follows the model from its final hidden states to its training updates. It explains how GPT scores candidate next tokens, measures its mistakes, and adjusts its weights.

In 11 Vocabulary Projection — From Vectors to Words, the final vector becomes logits and next-token probabilities.
In 12 Loss — How the Model Learns, predictions become a scalar training signal through cross-entropy; perplexity, the exponential of that loss, restates it on a more interpretable scale.
In 13 Training — Teaching the Model, gradients and weight updates connect the loss back to the model parameters.
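
As a rough preview of how these three steps fit together, the sketch below runs one prediction and one update on a toy model. Everything in it is an illustrative assumption rather than the code used in the chapters: a three-token vocabulary, a two-dimensional hidden vector, the names `hidden`, `weights`, and `sgd_step`, and a plain gradient-descent update with a hand-picked learning rate.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct token."""
    return -math.log(probs[target])

def sgd_step(hidden, weights, target, lr=0.5):
    """One forward pass and one plain gradient-descent update (hypothetical helper)."""
    # Vocabulary projection: one logit per vocabulary entry.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    probs = softmax(logits)
    loss = cross_entropy(probs, target)
    # For softmax followed by cross-entropy, dLoss/dLogit_i = p_i - 1[i == target].
    grad_logits = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Chain rule: dLoss/dW[i][j] = grad_logits[i] * hidden[j].
    new_weights = [
        [w - lr * g * h for w, h in zip(row, hidden)]
        for row, g in zip(weights, grad_logits)
    ]
    return loss, new_weights

# Toy setup (assumed values): 2-dimensional hidden state, 3-token vocabulary.
hidden = [1.0, -0.5]
weights = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1]]
target = 1  # index of the correct next token

loss_before, weights = sgd_step(hidden, weights, target)
loss_after, _ = sgd_step(hidden, weights, target)
perplexity = math.exp(loss_before)  # perplexity is the exponential of the cross-entropy

print(f"loss before update: {loss_before:.4f}")
print(f"loss after update:  {loss_after:.4f}")
print(f"perplexity before:  {perplexity:.4f}")
```

On this toy input the second loss comes out lower than the first, which is the whole point of the update: the weights move so that the correct token receives more probability. A real model repeats the same loop over many tokens and parameters, usually with automatic differentiation and a more elaborate optimizer.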