7  RoPE: Position Inside Attention

So far, position has been added to the token representation. Sinusoidal and learned positional embeddings both create a position vector and add it to the token embedding before attention begins.

Modern GPT-style models often do something more subtle: they put position into attention itself. That idea is called Rotary Position Embedding, or RoPE.

7.1 The Idea

Standard positional encoding changes the input stream:

\[ \tilde{X} = X + PE \]

Attention then projects \(\tilde{X}\) into \(Q\), \(K\), and \(V\).

RoPE changes a different point in the pipeline. It first computes the usual query, key, and value vectors:

\[ Q = XW_q,\quad K = XW_k,\quad V = XW_v \]

Then it rotates each query and key according to that token’s position. Values are not rotated.

The attention score is still a dot product, but now it compares rotated vectors:

\[ S[i,j] = \frac{(R_i q_i) \cdot (R_j k_j)}{\sqrt{d_k}} \]

This lets position affect which tokens attend to which other tokens. Instead of saying “position is another feature in the token vector,” RoPE says “position changes the geometry of query-key matching.”

7.2 The Math

RoPE treats a vector as pairs of dimensions. For each pair, it applies a 2D rotation.

For a vector pair \((x, y)\) and angle \(\theta\):

\[ \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \]
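The rotation matrix can be tried out directly. The sketch below is a minimal illustration, and the helper name rotate_2d is hypothetical rather than part of the chapter's code. It shows a quarter turn and the fact that two rotations in a row act like one rotation by the summed angle, which is the property RoPE relies on later.

```python
import math

def rotate_2d(x: float, y: float, theta: float) -> tuple[float, float]:
    """Rotate the point (x, y) counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return x * c - y * s, x * s + y * c

# A quarter turn sends (1, 0) to (0, 1), up to floating-point error.
quarter = rotate_2d(1.0, 0.0, math.pi / 2)

# Rotating by 0.4 and then by 0.9 lands on the same point
# as rotating once by 1.3.
two_steps = rotate_2d(*rotate_2d(2.0, -1.0, 0.4), 0.9)
one_step = rotate_2d(2.0, -1.0, 1.3)
```

Because angles add under composition, rotating a query by one position angle and a key by another leaves only their difference visible to the dot product.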

For token position \(m\) and dimension pair \(r\), RoPE uses:

\[ \theta_{m,r} = \frac{m}{10000^{2r/d}} \]

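Here \(d\) is the vector dimension and \(r\) runs from 0 to \(d/2 - 1\). So each pair rotates at a different frequency. Pairs with a small index rotate quickly as the position grows; pairs with a large index rotate slowly.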
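To see how quickly the frequencies fall off, the angle formula can be evaluated for every pair at one position. This is a small sketch; pair_angles is a hypothetical helper, and the choice of position 5 with an 8-dimensional vector is only for illustration.

```python
def pair_angles(position: int, d: int) -> list[float]:
    """Angle theta_{m,r} = m / 10000**(2r/d) for every pair index r."""
    return [position / (10000.0 ** (2.0 * r / d)) for r in range(d // 2)]

# Position 5 in an 8-dimensional vector has four pairs.
angles = pair_angles(5, 8)
for r, theta in enumerate(angles):
    print(f"pair {r}: {theta:.4f} rad")
# pair 0: 5.0000 rad
# pair 1: 0.5000 rad
# pair 2: 0.0500 rad
# pair 3: 0.0050 rad
```

With these settings each pair turns ten times more slowly than the one before it, so the first pair distinguishes neighboring tokens while the last pair changes noticeably only over long distances.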

The important property is what happens inside the attention dot product:

\[ (R_m q) \cdot (R_n k) = q \cdot (R_{n-m} k) \]

This holds because rotations compose: \(R_m^{\top} R_n = R_{n-m}\). The positions therefore enter the score only through the offset \(n-m\) between the query position and key position. That is how RoPE encodes relative position without adding a separate position vector to \(X\).
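The identity is easy to check numerically. The sketch below uses hypothetical helpers rope and dot and arbitrary example vectors; it compares both sides of the equation and then shifts both positions by 100 to confirm that the score does not change.

```python
import math

def rope(vec: list[float], pos: int) -> list[float]:
    """Rotate each consecutive pair of vec by pos / 10000**(2r/d)."""
    d = len(vec)
    out: list[float] = []
    for r in range(d // 2):
        theta = pos / (10000.0 ** (2.0 * r / d))
        x, y = vec[2 * r], vec[2 * r + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

lhs = dot(rope(q, 2), rope(k, 7))          # (R_2 q) . (R_7 k)
rhs = dot(q, rope(k, 7 - 2))               # q . (R_5 k)
shifted = dot(rope(q, 102), rope(k, 107))  # same offset, both moved by 100
```

All three values agree up to floating-point error, since only the offset of 5 between the two positions matters.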

The full attention formula becomes:

\[ \operatorname{Attention}_{\text{RoPE}}(Q,K,V) = \operatorname{softmax}\!\left( \frac{(R Q)(R K)^{\top}}{\sqrt{d_k}} + M \right)V \]

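where \(RQ\) means “rotate every query row by its position” and \(RK\) means the same for keys. Here \(R\) is shorthand rather than a single matrix, because each row gets its own rotation. \(M\) is the additive attention mask, as in standard scaled dot-product attention.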
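The formula can be sketched end to end in plain Python. Everything here is illustrative: rope_rows and rope_attention are hypothetical names, and taking \(M\) to be a causal mask is an assumption, since the formula only requires some additive mask.

```python
import math

Matrix = list[list[float]]

def rope_rows(matrix: Matrix) -> Matrix:
    """Rotate row m of the matrix by its position m, pair by pair."""
    rotated: Matrix = []
    for m, row in enumerate(matrix):
        d = len(row)
        new_row: list[float] = []
        for r in range(d // 2):
            theta = m / (10000.0 ** (2.0 * r / d))
            c, s = math.cos(theta), math.sin(theta)
            x, y = row[2 * r], row[2 * r + 1]
            new_row += [x * c - y * s, x * s + y * c]
        rotated.append(new_row)
    return rotated

def rope_attention(q: Matrix, k: Matrix, v: Matrix) -> tuple[Matrix, Matrix]:
    """softmax((RQ)(RK)^T / sqrt(d_k) + M) V, with M assumed to be causal."""
    rq, rk = rope_rows(q), rope_rows(k)
    scale = math.sqrt(len(q[0]))
    weights: Matrix = []
    for i, q_row in enumerate(rq):
        # Causal mask: position i may only attend to positions j <= i.
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / scale
            if j <= i else -math.inf
            for j, k_row in enumerate(rk)
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    # Values are mixed by the weights but never rotated.
    output = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, v))
         for c in range(len(v[0]))]
        for w_row in weights
    ]
    return output, weights

q = [[1.0, 0.0, 0.5, 0.5], [0.2, 0.9, -0.3, 0.1], [0.7, -0.4, 0.0, 1.0]]
k = [[0.5, 0.5, 1.0, 0.0], [-0.6, 0.3, 0.2, 0.8], [0.1, 1.0, -0.5, 0.4]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = rope_attention(q, k, v)
```

With the causal mask, the first token can only attend to itself, so the first output row is just the first value row. Every row of weights still sums to 1.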

7.3 A Small Example

Take a two-dimensional query and key:

\[ q = [1, 0], \quad k = [1, 0] \]

Before rotation, their dot product is:

\[ q \cdot k = 1 \]

Now place the query at position 1 and the key at position 3. With \(d = 2\) there is a single pair (\(r = 0\)), so the denominator \(10000^{0}\) is 1 and each angle equals the position itself: 1 and 3 radians.

\[ R_1 q = [\cos 1, \sin 1] \]

\[ R_3 k = [\cos 3, \sin 3] \]

Their dot product is:

\[ \cos 1 \cos 3 + \sin 1 \sin 3 = \cos(3 - 1) \]

The result depends on the distance between positions, not just on the original vector contents. For real model dimensions, many dimension pairs rotate at many frequencies, so attention gets a detailed relative-position signal.
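The same arithmetic can be run as a quick check. This is a minimal sketch with a hypothetical rotate helper; it reproduces the example above and compares the score with \(\cos 2\).

```python
import math

def rotate(vec: list[float], theta: float) -> list[float]:
    """Apply the 2D rotation by theta radians to a two-element vector."""
    x, y = vec
    return [x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta)]

q = [1.0, 0.0]
k = [1.0, 0.0]

r1q = rotate(q, 1.0)  # query at position 1
r3k = rotate(k, 3.0)  # key at position 3
score = r1q[0] * r3k[0] + r1q[1] * r3k[1]

print(round(score, 4))          # -0.4161
print(round(math.cos(2.0), 4))  # -0.4161
```

Two identical vectors that scored 1 without rotation now score about -0.416, purely because they sit two positions apart.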

7.4 The Code Structure

def rope_angle(position: int, pair_index: int, d_model: int) -> float:
    return position / (10000.0 ** (2.0 * pair_index / d_model))

rope_angle computes the rotation angle for one position and one pair of dimensions. The formula matches the sinusoidal frequency schedule: early dimension pairs rotate faster, later pairs rotate more slowly.

import math

def rotate_pair(x: float, y: float, angle: float) -> tuple[float, float]:
    # Standard 2D rotation: [x', y'] = [[cos, -sin], [sin, cos]] @ [x, y].
    cos_theta = math.cos(angle)
    sin_theta = math.sin(angle)
    return x * cos_theta - y * sin_theta, x * sin_theta + y * cos_theta

rotate_pair is the 2D rotation from the math section. It takes two neighboring vector components and returns the rotated pair.

def apply_rope_to_vector(vector: Vector, position: int) -> Vector:
    if len(vector) % 2 != 0:
        raise ValueError("RoPE requires an even vector dimension")

    rotated: Vector = []
    d_model = len(vector)
    for i in range(0, d_model, 2):
        angle = rope_angle(position, i // 2, d_model)
        x, y = rotate_pair(vector[i], vector[i + 1], angle)
        rotated.extend([x, y])
    return rotated

apply_rope_to_vector walks through a query or key vector two dimensions at a time. Each pair gets its own angle, determined by the token position and the pair index. RoPE requires an even vector dimension because every dimension must belong to a pair.

def apply_rope(matrix: Matrix) -> Matrix:
    return [
        apply_rope_to_vector(row, position)
        for position, row in enumerate(matrix)
    ]

apply_rope applies that vector rotation to every row in a matrix. The row index is the token position, so row 0 gets position 0, row 1 gets position 1, and so on.
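Two consequences of this row-by-row rotation are worth seeing once. The sketch below restates the rotation in a compact, self-contained form (rope_rows is a hypothetical stand-in for apply_rope, written here only so the snippet runs on its own) and feeds it three identical rows.

```python
import math

def rope_rows(matrix: list[list[float]]) -> list[list[float]]:
    """Rotate row m by its position m, using pairs of neighboring columns."""
    rotated = []
    for m, row in enumerate(matrix):
        d = len(row)
        new_row = []
        for r in range(d // 2):
            theta = m / (10000.0 ** (2.0 * r / d))
            x, y = row[2 * r], row[2 * r + 1]
            new_row += [x * math.cos(theta) - y * math.sin(theta),
                        x * math.sin(theta) + y * math.cos(theta)]
        rotated.append(new_row)
    return rotated

# Three identical token vectors at positions 0, 1, and 2.
x = [[1.0, 2.0, 3.0, 4.0] for _ in range(3)]
rotated = rope_rows(x)

# Rotation changes direction, never length.
norms = [math.sqrt(sum(c * c for c in row)) for row in rotated]
```

Row 0 is rotated by angle 0 and comes back unchanged, while rows 1 and 2 become different vectors even though they started identical. All three keep the same length, so RoPE changes direction without rescaling queries or keys.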

def scaled_dot_product_attention_with_rope(
    query: Matrix,
    key: Matrix,
    value: Matrix,
) -> tuple[Matrix, Matrix]:
    return scaled_dot_product_attention(
        apply_rope(query),
        apply_rope(key),
        value,
    )

scaled_dot_product_attention_with_rope keeps the attention formula from Chapter 6. The only difference is that queries and keys are rotated before the dot products are computed. Values are not rotated; they are still mixed by the final attention weights.

import random

def chapter_07(seed: int = 607) -> dict[str, object]:
    rng = random.Random(seed)
    query = random_matrix(4, 8, rng)
    key = random_matrix(4, 8, rng)
    value = random_matrix(4, 8, rng)
    output, weights = scaled_dot_product_attention_with_rope(query, key, value)
    # Measure the rotated matrix itself so the shape covers rows and columns.
    rotated_query = apply_rope(query)
    return {
        "rotated_first": apply_rope_to_vector([1.0, 0.0], 1),
        "rotated_shape": (len(rotated_query), len(rotated_query[0])),
        "output_shape": (len(output), len(output[0])),
        "weight_rows": [sum(row) for row in weights],
    }

chapter_07 is the runnable demo used by python3 src/python/chapter_demos.py. It returns the rotated matrix shape and the attention row sums, so you can confirm that RoPE preserves the query/key matrix shape and that each row of attention weights still sums to 1 after softmax.

7.5 Takeaways

  • Earlier positional encodings add position to the token representation before attention.
  • RoPE (Su et al., 2021) puts position inside attention by rotating \(Q\) and \(K\).
  • Values are not rotated; they are still blended by the attention weights.
  • For given query and key vectors, position enters the query-key dot product only through the relative offset between the two positions, not their absolute indices.
  • No extra parameters: the rotation is a fixed function of position.
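  • The rotation is defined for any position, but models usually need extra techniques (such as position interpolation) to work well on sequences much longer than those seen during training.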
  • Compatible with linear attention variants.
  • Used in LLaMA, Mistral, Qwen, and many other open-weight models.

What’s next? RoPE changes how one attention head sees position. The next chapter asks how a model can track several kinds of relationships at once: multi-head attention. See Chapter 8.