Jan 22, 2025

Foundations of Transformer Models - Absolute Positional Embeddings

Imagine a simple sentence - “The cat sat on the mat.”

Transformers process sequences using self-attention, where every token interacts with every other token. But here’s the catch: Transformers don’t inherently understand the order of tokens.

Without extra information, “The cat sat on the mat” is no different from “The mat sat on the cat.”

To fix this, we introduce positional embeddings—additional information that tells the model the order of tokens. Without them, transformers would lose crucial context. Absolute positional embeddings were the first solution to this problem.

Absolute positional embeddings assign a unique positional vector to each token in a sequence. These vectors encode a token’s position and are added to the word embeddings before being fed into the transformer.

Rather than learning these vectors during training, some models, such as the original Transformer, compute them with fixed sinusoidal functions. Here’s the formula:

p_i(2k) = sin( i / 10000^(2k/d) ),    p_i(2k+1) = cos( i / 10000^(2k/d) )

Let’s break it down:

• i: The position of the token in the sequence (e.g., 0 for the first token, 1 for the second).

• k: The index of the sine/cosine pair; dimension 2k holds the sine and dimension 2k + 1 the cosine, so k runs from 0 to d/2 − 1 (e.g., 0, 1, 2, …, 255 for d = 512).

• d: The size of the embedding vector (typically 512).
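
To make the formula concrete, here is a minimal Python sketch. The function name and the use of NumPy are my own choices for illustration, not something prescribed by the formula:

```python
import numpy as np

def sinusoidal_positional_encoding(i, d):
    """Return the d-dimensional sinusoidal encoding of position i (d must be even)."""
    pe = np.zeros(d)
    for k in range(d // 2):
        angle = i / 10000 ** (2 * k / d)   # frequency shrinks as k grows
        pe[2 * k] = np.sin(angle)          # even dimension 2k: sine
        pe[2 * k + 1] = np.cos(angle)      # odd dimension 2k + 1: cosine
    return pe
```

Real implementations usually build the whole (sequence length × d) matrix at once with vectorized operations; the explicit loop here just mirrors the formula.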

To understand this better, let’s calculate positional encodings for two tokens: “The” at position i = 0 and “cat” at position i = 1. For simplicity, we’ll use d = 4 instead of 512.

For i = 0 (Position of “The”):

p_0(0) = sin(0 / 10000^(0/4)) = sin(0) = 0

p_0(1) = cos(0 / 10000^(0/4)) = cos(0) = 1

p_0(2) = sin(0 / 10000^(2/4)) = sin(0) = 0

p_0(3) = cos(0 / 10000^(2/4)) = cos(0) = 1

For i = 1 (Position of “cat”):

p_1(0) = sin(1 / 10000^(0/4)) = sin(1) ≈ 0.841

p_1(1) = cos(1 / 10000^(0/4)) = cos(1) ≈ 0.540

p_1(2) = sin(1 / 10000^(2/4)) = sin(0.01) ≈ 0.010

p_1(3) = cos(1 / 10000^(2/4)) = cos(0.01) ≈ 0.999
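
You can reproduce these hand calculations with the sketch above (values rounded to three decimals; cos(0.01) is actually about 0.99995):

```python
print(np.round(sinusoidal_positional_encoding(0, d=4), 3))  # [0. 1. 0. 1.]
print(np.round(sinusoidal_positional_encoding(1, d=4), 3))  # [0.841 0.54  0.01  1.   ]
```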

Adding Positional Encodings to Word Embeddings

Now that we’ve calculated the positional encodings, let’s add them to the word embeddings of “The” and “cat.”

Let’s assume:

• The word embedding of “The” is [0.5, 0.1, 0.8, 0.3].

• The word embedding of “cat” is [0.2, 0.9, 0.4, 0.6].

We add the positional encodings to these word embeddings:

For “The”:

[0.5, 0.1, 0.8, 0.3] + [0, 1, 0, 1] = [0.5, 1.1, 0.8, 1.3]

For “cat”:

[0.2, 0.9, 0.4, 0.6] + [0.841, 0.540, 0.01, 0.999] = [1.041, 1.44, 0.41, 1.599]

These combined embeddings are what the transformer processes. Now the model knows both the meaning and position of each token.
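
In code, this step is a single element-wise addition. Here is a rough sketch reusing the sinusoidal_positional_encoding function from above and the made-up word embeddings for “The” and “cat”:

```python
word_embeddings = np.array([
    [0.5, 0.1, 0.8, 0.3],   # "The"
    [0.2, 0.9, 0.4, 0.6],   # "cat"
])

# One positional vector per token, stacked into a (sequence_length, d) matrix.
positional = np.stack([sinusoidal_positional_encoding(i, d=4)
                       for i in range(len(word_embeddings))])

model_input = word_embeddings + positional   # what the transformer layers actually see
print(np.round(model_input, 3))
# [[0.5   1.1   0.8   1.3 ]
#  [1.041 1.44  0.41  1.6 ]]
```

The last value comes out as 1.6 rather than 1.599 only because cos(0.01) ≈ 0.99995 rounds to 1.0 here.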

You might wonder: why use sinusoidal functions? Here’s why:

1. Smooth Variation Across Positions: The sin and cos values change smoothly with i, providing a natural progression of positions. This allows the model to identify token order easily.

2. Unique Positional Patterns: Each position produces a unique combination of sin and cos values, ensuring no two positions are encoded the same.

3. Relative Distance: Encodings of nearby positions are more similar than those of distant positions; for example, the encodings at i = 0 and i = 1 are close, while those at i = 0 and i = 10 differ much more. The original Transformer paper also notes that for any fixed offset, the encoding at position i + offset is a linear function of the encoding at position i, which makes relative positions easy for the model to pick up (a quick numeric check follows below).
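
An informal way to see the distance property is to compare positional vectors numerically, for example with a dot product. This check is my own illustration, again reusing the sketch above:

```python
d = 128                       # a larger toy dimension makes the pattern clearer
p0  = sinusoidal_positional_encoding(0, d)
p1  = sinusoidal_positional_encoding(1, d)
p10 = sinusoidal_positional_encoding(10, d)

# Encodings of nearby positions are more similar than encodings of distant ones.
print(np.dot(p0, p1))    # larger value: positions 0 and 1 are close
print(np.dot(p0, p10))   # noticeably smaller: positions 0 and 10 are farther apart
```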

In practice, the embedding size d is much larger—typically 512. This means:

• Each token has a 512-dimensional positional vector.

• You calculate p_i(0) to p_i(511) using the same formulas.

• Lower dimensions (e.g., k = 0, 1) oscillate quickly and capture fine-grained, local position information, while higher dimensions oscillate slowly and capture broader, coarser position information.

This high dimensionality allows transformers to model complex relationships between tokens across sequences of varying lengths.
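
Each sine/cosine pair k traces a wave with wavelength 2π · 10000^(2k/d) positions, which is where the split between fine-grained and broad position information comes from. A small standalone check (approximate values in the comments):

```python
import math

d = 512
for k in (0, 1, 127, 255):
    wavelength = 2 * math.pi * 10000 ** (2 * k / d)
    print(k, round(wavelength, 1))
# 0   -> ~6.3 positions      (fast oscillation: fine-grained position)
# 1   -> ~6.5 positions
# 127 -> ~606 positions
# 255 -> ~60,600 positions   (slow oscillation: broad, coarse position)
```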

Absolute positional embeddings are essential for:

• Understanding token order: They prevent transformers from treating sequences as a bag of words.

• Capturing long-range dependencies: They allow models to relate words that are far apart in the sequence.

• Generalizing to unseen lengths: Sinusoidal functions can be evaluated at any position, so encodings are available for sequences longer than those seen during training.

Absolute positional embeddings solve the critical problem of sequence ordering in transformers. By leveraging sinusoidal functions, the original Transformer’s encodings provide smooth, unique, and distance-aware positional information, and the same idea, with learned rather than fixed vectors, underpins models like BERT. Understanding this concept is key to appreciating the innovations of Rotary Position Embeddings (RoPE), which take positional encodings to the next level. Ready to dive deeper? Stay tuned for RoPE!

Arnav Jaitly

Hi, I am Arnav! If you liked this article, do consider leaving a comment below as it motivates me to publish more helpful content like this!
