7 RoPE: Position Inside Attention
So far, position has been added to the token representation. Sinusoidal and learned positional embeddings both create a position vector and add it to the token embedding before attention begins.
Modern GPT-style models often do something more subtle: they put position into attention itself. That idea is called Rotary Positional Encoding, or RoPE.
7.1 The Idea
Standard positional encoding changes the input stream:
\[ \tilde{X} = X + PE \]
Attention then projects \(\tilde{X}\) into \(Q\), \(K\), and \(V\).
RoPE changes a different point in the pipeline. It first computes the usual query, key, and value vectors:
\[ Q = XW_q,\quad K = XW_k,\quad V = XW_v \]
Then it rotates each query and key according to that token’s position. Values are not rotated.
The attention score is still a dot product, but now it compares rotated vectors:
\[ S[i,j] = \frac{(R_i q_i) \cdot (R_j k_j)}{\sqrt{d_k}} \]
This lets position affect which tokens attend to which other tokens. Instead of saying “position is another feature in the token vector,” RoPE says “position changes the geometry of query-key matching.”
7.2 The Math
RoPE treats a vector as pairs of dimensions. For each pair, it applies a 2D rotation.
For a vector pair \((x, y)\) and angle \(\theta\):
\[ \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \]
For token position \(m\) and dimension pair \(r\), RoPE uses:
\[ \theta_{m,r} = \frac{m}{10000^{2r/d}} \]
So each pair rotates at a different frequency. Early dimensions rotate quickly; later dimensions rotate slowly.
The important property is what happens inside the attention dot product:
\[ (R_m q) \cdot (R_n k) = q \cdot (R_{n-m} k) \]
The score depends on the relative distance \(n-m\) between the query position and key position. That is how RoPE encodes relative position without adding a separate position vector to \(X\).
The full attention formula becomes:
\[ \operatorname{Attention}_{\text{RoPE}}(Q,K,V) = \operatorname{softmax}\!\left( \frac{(R Q)(R K)^{\top}}{\sqrt{d_k}} + M \right)V \]
where \(RQ\) means “rotate every query row by its position” and \(RK\) means the same for keys.
7.3 A Small Example
Take a two-dimensional query and key:
\[ q = [1, 0], \quad k = [1, 0] \]
Before rotation, their dot product is:
\[ q \cdot k = 1 \]
Now place the query at position 1 and the key at position 3. In two dimensions the frequency is 1, so the rotations use angles 1 and 3 radians:
\[ R_1 q = [\cos 1, \sin 1] \]
\[ R_3 k = [\cos 3, \sin 3] \]
Their dot product is:
\[ \cos 1 \cos 3 + \sin 1 \sin 3 = \cos(3 - 1) \]
The result depends on the distance between positions, not just on the original vector contents. For real model dimensions, many dimension pairs rotate at many frequencies, so attention gets a detailed relative-position signal.
7.4 The Code Structure
def rope_angle(position: int, pair_index: int, d_model: int) -> float:
return position / (10000.0 ** (2.0 * pair_index / d_model))rope_angle computes the rotation angle for one position and one pair of dimensions. The formula matches the sinusoidal frequency schedule: early dimension pairs rotate faster, later pairs rotate more slowly.
def rotate_pair(x: float, y: float, angle: float) -> tuple[float, float]:
cos_theta = math.cos(angle)
sin_theta = math.sin(angle)
return x * cos_theta - y * sin_theta, x * sin_theta + y * cos_thetarotate_pair is the 2D rotation from the math section. It takes two neighboring vector components and returns the rotated pair.
def apply_rope_to_vector(vector: Vector, position: int) -> Vector:
if len(vector) % 2 != 0:
raise ValueError("RoPE requires an even vector dimension")
rotated: Vector = []
d_model = len(vector)
for i in range(0, d_model, 2):
angle = rope_angle(position, i // 2, d_model)
x, y = rotate_pair(vector[i], vector[i + 1], angle)
rotated.extend([x, y])
return rotatedapply_rope_to_vector walks through a query or key vector two dimensions at a time. Each pair gets its own angle, determined by the token position and the pair index. RoPE requires an even vector dimension because every dimension must belong to a pair.
def apply_rope(matrix: Matrix) -> Matrix:
return [
apply_rope_to_vector(row, position)
for position, row in enumerate(matrix)
]apply_rope applies that vector rotation to every row in a matrix. The row index is the token position, so row 0 gets position 0, row 1 gets position 1, and so on.
def scaled_dot_product_attention_with_rope(
query: Matrix,
key: Matrix,
value: Matrix,
) -> tuple[Matrix, Matrix]:
return scaled_dot_product_attention(
apply_rope(query),
apply_rope(key),
value,
)scaled_dot_product_attention_with_rope keeps the attention formula from Chapter 6. The only difference is that queries and keys are rotated before the dot products are computed. Values are not rotated; they are still mixed by the final attention weights.
def chapter_07(seed: int = 607) -> dict[str, object]:
rng = random.Random(seed)
query = random_matrix(4, 8, rng)
key = random_matrix(4, 8, rng)
value = random_matrix(4, 8, rng)
output, weights = scaled_dot_product_attention_with_rope(query, key, value)
return {
"rotated_first": apply_rope_to_vector([1.0, 0.0], 1),
"rotated_shape": (len(apply_rope(query)), len(query[0])),
"output_shape": (len(output), len(output[0])),
"weight_rows": [sum(row) for row in weights],
}chapter_07 is the runnable demo used by python3 src/python/chapter_demos.py. It checks that RoPE preserves the query/key matrix shape and that the resulting attention rows still sum to 1 after softmax.
7.5 Takeaways
- Earlier positional encodings add position to the token representation before attention.
- RoPE (Su et al., 2021) puts position inside attention by rotating \(Q\) and \(K\).
- Values are not rotated; they are still blended by the attention weights.
- The query-key dot product depends only on the relative distance between positions, not absolute indices.
- No extra parameters — the rotation is deterministic.
- Generalises to sequences longer than those seen during training.
- Compatible with linear attention variants.
- Used in LLaMA, Mistral, Qwen, and most open-weight models.
What’s next? RoPE changes how one attention head sees position. The next chapter asks how a model can track several kinds of relationships at once: multi-head attention. See Chapter 8.