7 RoPE: Position Inside Attention

So far, position has been added to the token representation. Sinusoidal and learned positional embeddings both create a position vector and add it to the token embedding before attention begins.

Many modern GPT-style models do something more subtle: they put position into attention itself. The technique is called Rotary Position Embedding, or RoPE.

7.1 The Idea

Standard positional encoding changes the input stream:

\[ \tilde{X} = X + PE \]

Attention then projects \(\tilde{X}\) into \(Q\), \(K\), and \(V\).

RoPE changes a different point in the pipeline. It first computes the usual query, key, and value vectors:

\[ Q = XW_q,\quad K = XW_k,\quad V = XW_v \]

Then it rotates each query and key according to that token’s position. Values are not rotated.

The attention score is still a dot product, but now it compares rotated vectors:

\[ S[i,j] = \frac{(R_i q_i) \cdot (R_j k_j)}{\sqrt{d_k}} \]

This lets position affect which tokens attend to which other tokens. Instead of saying “position is another feature in the token vector,” RoPE says “position changes the geometry of query-key matching.”
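A minimal sketch of this effect, assuming a 2D head, a single rotation frequency of 1 radian per position, and identical query and key content for every token. The helper names `rotate_2d` and `score` are illustrative and are not part of the chapter's code. Without rotation every score would be identical, so attention could not tell positions apart; with rotation the score changes with the offset between the two positions.

```python
import math

def rotate_2d(vec, position):
    """Rotate a 2D vector by `position` radians (assumed frequency of 1)."""
    x, y = vec
    c, s = math.cos(position), math.sin(position)
    return [x * c - y * s, x * s + y * c]

def score(q, k, i, j):
    """Scaled dot product between a query at position i and a key at position j."""
    rq, rk = rotate_2d(q, i), rotate_2d(k, j)
    d_k = len(q)
    return (rq[0] * rk[0] + rq[1] * rk[1]) / math.sqrt(d_k)

# Every token has the same content, so only position can change the score.
q = k = [1.0, 0.0]
scores = [[score(q, k, i, j) for j in range(3)] for i in range(3)]
for row in scores:
    print([round(s, 3) for s in row])
```

Entries on the same diagonal of the printed matrix match, because they share the same offset between query and key positions.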

7.2 The Math

RoPE treats a vector as pairs of dimensions. For each pair, it applies a 2D rotation.

For a vector pair \((x, y)\) and angle \(\theta\):

\[ \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \]

For token position \(m\) and dimension pair \(r\), RoPE uses:

\[ \theta_{m,r} = \frac{m}{10000^{2r/d}} \]

Here \(r = 0, 1, \dots, d/2 - 1\) indexes the dimension pairs, so each pair rotates at a different frequency. Early pairs rotate quickly; later pairs rotate slowly.
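To make the frequency schedule concrete, the sketch below evaluates the angle formula for a vector of dimension 8. The function name `pair_angles` is a hypothetical helper introduced only for this illustration.

```python
def pair_angles(position, d):
    """Angle theta_{m,r} = m / 10000^(2r/d) for each dimension pair r."""
    return [position / (10000.0 ** (2.0 * r / d)) for r in range(d // 2)]

# For d = 8 the four pairs advance by roughly 1, 0.1, 0.01 and 0.001 radians
# per position, so the first pair turns fastest and the last pair slowest.
for m in (1, 2, 100):
    print(m, [round(a, 4) for a in pair_angles(m, 8)])
```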

The important property is what happens inside the attention dot product:

\[ (R_m q) \cdot (R_n k) = q \cdot (R_{n-m} k) \]

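The identity holds because the transpose of a rotation is the inverse rotation, so \(R_m^{\top} R_n = R_{n-m}\). Position therefore enters the score only through the relative distance \(n-m\) between the query position and the key position. That is how RoPE encodes relative position without adding a separate position vector to \(X\).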
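The relative-position property can be checked numerically. The following is a sketch under the assumption that pairs are formed from neighboring components and that angles follow the formula above; `rope` and `dot` are illustrative helpers, and the vectors are arbitrary.

```python
import math

def rope(vec, position):
    """Rotate each consecutive pair of `vec` by position / 10000^(2r/d)."""
    d = len(vec)
    out = []
    for r in range(d // 2):
        theta = position / (10000.0 ** (2.0 * r / d))
        x, y = vec[2 * r], vec[2 * r + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]
m, n = 5, 9

lhs = dot(rope(q, m), rope(k, n))                  # (R_m q) . (R_n k)
rhs = dot(q, rope(k, n - m))                       # q . (R_{n-m} k)
shifted = dot(rope(q, m + 100), rope(k, n + 100))  # same offset, new absolute positions

print(math.isclose(lhs, rhs, abs_tol=1e-9))
print(math.isclose(lhs, shifted, abs_tol=1e-9))
```

Both comparisons should agree up to floating-point rounding: moving the query and key together leaves the score unchanged.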

The full attention formula becomes:

\[ \operatorname{Attention}_{\text{RoPE}}(Q,K,V) = \operatorname{softmax}\!\left( \frac{(R Q)(R K)^{\top}}{\sqrt{d_k}} + M \right)V \]

where \(RQ\) means “rotate every query row by its position,” \(RK\) means the same for keys, and \(M\) is the additive attention mask (for example, a causal mask), which RoPE leaves unchanged.

7.3 A Small Example

Take a two-dimensional query and key:

\[ q = [1, 0], \quad k = [1, 0] \]

Before rotation, their dot product is:

\[ q \cdot k = 1 \]

Now place the query at position 1 and the key at position 3. With \(d = 2\) there is only one dimension pair (\(r = 0\)), whose frequency is \(10000^{0} = 1\), so the rotation angles are 1 and 3 radians:

\[ R_1 q = [\cos 1, \sin 1] \]

\[ R_3 k = [\cos 3, \sin 3] \]

Their dot product is:

\[ \cos 1 \cos 3 + \sin 1 \sin 3 = \cos(3 - 1) \]

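The unrotated score was 1; after rotation it is \(\cos 2 \approx -0.42\), and it would be the same for any two positions that are 2 apart. The result depends on the distance between positions, not just on the original vector contents. For real model dimensions, many dimension pairs rotate at many frequencies, so attention gets a detailed relative-position signal.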
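The sketch below reproduces the worked example and then extends it to several pairs. It assumes a query and key that both equal [1, 0, 1, 0, ...], so each pair contributes the cosine of its own angle difference; `relative_score` is a hypothetical helper, not part of the chapter's code.

```python
import math

def relative_score(distance, d):
    """Unscaled score for q = k = [1, 0, 1, 0, ...] at a given position offset.

    Each pair r contributes cos(distance / 10000^(2r/d)).
    """
    return sum(math.cos(distance / (10000.0 ** (2.0 * r / d))) for r in range(d // 2))

# d = 2 reproduces the worked example: positions 1 and 3 give cos(2).
print(round(relative_score(2, 2), 4))   # -0.4161

# With d = 8, the slower pairs still vary smoothly at distances where the
# single fast pair has already wrapped around several times.
for distance in (0, 2, 50, 1000):
    print(distance, round(relative_score(distance, 2), 3), round(relative_score(distance, 8), 3))
```

A single pair is periodic in the distance, so it cannot distinguish offsets that differ by a full turn; the slower pairs help separate them.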

7.4 The Code Structure

def rope_angle(position: int, pair_index: int, d_model: int) -> float:
    return position / (10000.0 ** (2.0 * pair_index / d_model))

rope_angle computes the rotation angle for one position and one pair of dimensions. The formula matches the sinusoidal frequency schedule: early dimension pairs rotate faster, later pairs rotate more slowly.

import math

def rotate_pair(x: float, y: float, angle: float) -> tuple[float, float]:
    # Standard counterclockwise 2D rotation of the point (x, y) by `angle` radians.
    cos_theta = math.cos(angle)
    sin_theta = math.sin(angle)
    return x * cos_theta - y * sin_theta, x * sin_theta + y * cos_theta

rotate_pair is the 2D rotation from the math section. It takes two neighboring vector components and returns the rotated pair.

def apply_rope_to_vector(vector: Vector, position: int) -> Vector:
    if len(vector) % 2 != 0:
        raise ValueError("RoPE requires an even vector dimension")

    rotated: Vector = []
    d_model = len(vector)
    for i in range(0, d_model, 2):
        angle = rope_angle(position, i // 2, d_model)
        x, y = rotate_pair(vector[i], vector[i + 1], angle)
        rotated.extend([x, y])
    return rotated

apply_rope_to_vector walks through a query or key vector two dimensions at a time. Each pair gets its own angle, determined by the token position and the pair index. RoPE requires an even vector dimension because every dimension must belong to a pair.

def apply_rope(matrix: Matrix) -> Matrix:
    return [
        apply_rope_to_vector(row, position)
        for position, row in enumerate(matrix)
    ]

apply_rope applies that vector rotation to every row in a matrix. The row index is the token position, so row 0 gets position 0, row 1 gets position 1, and so on.
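Two consequences of rotating row by row can be checked directly: row 0 is left unchanged, because every angle is 0 at position 0, and every row keeps its length, because rotations preserve norms. The sketch below uses `rope_rotate`, a standalone stand-in for apply_rope_to_vector repeated here so the block runs on its own; the matrix values are arbitrary.

```python
import math

def rope_rotate(vector, position):
    """Rotate neighboring pairs of `vector` by position / 10000^(2r/d)."""
    d = len(vector)
    out = []
    for i in range(0, d, 2):
        angle = position / (10000.0 ** (2.0 * (i // 2) / d))
        c, s = math.cos(angle), math.sin(angle)
        out.extend([vector[i] * c - vector[i + 1] * s,
                    vector[i] * s + vector[i + 1] * c])
    return out

def norm(vector):
    return math.sqrt(sum(x * x for x in vector))

matrix = [[0.5, -1.0, 2.0, 0.25],
          [1.5, 0.3, -0.4, 0.8],
          [-0.9, 0.7, 0.1, -1.1]]
rotated = [rope_rotate(row, position) for position, row in enumerate(matrix)]

# Row 0 sits at position 0, so every angle is 0 and the row is unchanged.
print(rotated[0] == matrix[0])
# Rotations preserve length, so each row keeps its norm.
print(all(math.isclose(norm(a), norm(b)) for a, b in zip(matrix, rotated)))
```

RoPE therefore changes the direction of each query and key but not its magnitude.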

def scaled_dot_product_attention_with_rope(
    query: Matrix,
    key: Matrix,
    value: Matrix,
) -> tuple[Matrix, Matrix]:
    return scaled_dot_product_attention(
        apply_rope(query),
        apply_rope(key),
        value,
    )

scaled_dot_product_attention_with_rope keeps the attention formula from Chapter 6. The only difference is that queries and keys are rotated before the dot products are computed. Values are not rotated; they are still mixed by the final attention weights.
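The following self-contained sketch shows the same structure end to end. It is an assumption-laden stand-in rather than the chapter's scaled_dot_product_attention: it is unmasked, it uses hypothetical helpers `rope_rotate`, `softmax`, and `rope_attention`, and it adds an illustrative `offset` parameter so the whole sequence can be shifted by a constant.

```python
import math

def rope_rotate(vector, position):
    """Rotate neighboring pairs of `vector` by position / 10000^(2r/d)."""
    d = len(vector)
    out = []
    for i in range(0, d, 2):
        angle = position / (10000.0 ** (2.0 * (i // 2) / d))
        c, s = math.cos(angle), math.sin(angle)
        out.extend([vector[i] * c - vector[i + 1] * s,
                    vector[i] * s + vector[i + 1] * c])
    return out

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def rope_attention(query, key, value, offset=0):
    """Unmasked attention with RoPE; `offset` shifts every position by a constant."""
    d_k = len(query[0])
    rq = [rope_rotate(row, pos + offset) for pos, row in enumerate(query)]
    rk = [rope_rotate(row, pos + offset) for pos, row in enumerate(key)]
    weights = [
        softmax([sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in rk])
        for q_row in rq
    ]
    # Values are used as-is: only the mixing weights depend on position.
    output = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, value)) for c in range(len(value[0]))]
        for w_row in weights
    ]
    return output, weights

query = [[0.2, 1.0, -0.5, 0.3], [0.9, -0.1, 0.4, 0.7], [-0.6, 0.5, 0.8, -0.2]]
key = [[1.0, 0.0, 0.3, -0.4], [0.1, 0.6, -0.9, 0.2], [-0.3, 0.8, 0.5, 0.5]]
value = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

output, weights = rope_attention(query, key, value)
_, shifted_weights = rope_attention(query, key, value, offset=7)

print([round(sum(row), 6) for row in weights])
print(all(math.isclose(a, b, abs_tol=1e-9)
          for r1, r2 in zip(weights, shifted_weights) for a, b in zip(r1, r2)))
```

Each row of weights still sums to 1, and shifting every position by the same constant leaves the weights unchanged up to rounding, which is the relative-position property seen at the level of a whole attention matrix.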

import random

def chapter_07(seed: int = 607) -> dict[str, object]:
    rng = random.Random(seed)
    query = random_matrix(4, 8, rng)
    key = random_matrix(4, 8, rng)
    value = random_matrix(4, 8, rng)
    output, weights = scaled_dot_product_attention_with_rope(query, key, value)
    return {
        "rotated_first": apply_rope_to_vector([1.0, 0.0], 1),
        "rotated_shape": (len(apply_rope(query)), len(query[0])),
        "output_shape": (len(output), len(output[0])),
        "weight_rows": [sum(row) for row in weights],
    }

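chapter_07 is the runnable demo used by python3 src/python/chapter_demos.py. It does not assert anything itself; it returns values that let you confirm that RoPE preserves the query/key matrix shape and that each row of attention weights still sums to 1 after softmax. The rotated_first entry should equal \([\cos 1, \sin 1] \approx [0.540, 0.841]\), matching the rotated query in Section 7.3.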

7.5 Takeaways

Earlier positional encodings add position to the token representation before attention.
RoPE (Su et al., 2021) puts position inside attention by rotating \(Q\) and \(K\).
Values are not rotated; they are still blended by the attention weights.
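The query-key dot product depends on position only through the relative distance between the two tokens, not through their absolute indices.
RoPE adds no learned parameters: the rotation angles are a fixed function of position and pair index.
Because the angle formula is defined for any position, RoPE can be applied to sequences longer than those seen during training, although quality at unseen lengths is not guaranteed.
RoPE is compatible with linear attention variants.
RoPE is used in LLaMA, Mistral, Qwen, and many other open-weight models.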

What’s next? RoPE changes how one attention head sees position. The next chapter asks how a model can track several kinds of relationships at once: multi-head attention. See Chapter 8.