Injecting information about word order into the model.
A critical characteristic of the self-attention mechanism is that it is 'permutation-equivariant,' meaning it has no built-in sense of the order or position of tokens in a sequence. If you shuffle the words in a sentence, the per-token self-attention outputs are simply shuffled in the same way; nothing in any token's representation reflects the new word order. This is a problem because word order is fundamental to the meaning of language ('The dog chased the cat' is very different from 'The cat chased the dog').

To solve this, the Transformer architecture introduces a technique called Positional Encoding. The idea is to create a unique vector for each position in the sequence and add this vector to the corresponding token's input embedding. This injects information about the token's absolute or relative position directly into its representation.

The original Transformer paper proposed building these positional vectors from sine and cosine functions of different frequencies, with the formula designed so that every position receives a unique encoding. This scheme has two desirable properties: it can generalize to sequence lengths longer than those seen during training, and the model can easily learn to attend to relative positions, because for any fixed offset k, the encoding of position pos + k can be expressed as a linear function of the encoding of position pos.

An alternative approach, used in models like BERT, is to use learnable positional embeddings, where a separate embedding vector is learned for each position, just as token embeddings are learned. Regardless of the method, positional encodings are a crucial addition that gives the Transformer the sense of sequence order that self-attention lacks.
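To make the sinusoidal scheme concrete, here is a minimal NumPy sketch of the encoding described in the original Transformer paper, where even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. The function name and the max_len and d_model parameters are illustrative choices for this sketch, not names from any particular library.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings.

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    assert d_model % 2 == 0, "this sketch assumes an even model dimension"

    positions = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # (1, d_model/2)
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)    # one frequency per dimension pair
    angles = positions * angle_rates                          # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

# Usage: the encoding is simply added to the token embeddings before the first layer,
# e.g. x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```

Because each dimension pair oscillates at a different frequency, every position gets a distinct pattern of values, and the same formula can be evaluated for positions beyond those seen during training.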
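For the learnable alternative mentioned above, a short PyTorch sketch is shown below. The class name and its parameters are hypothetical; the point is simply that position vectors sit in an ordinary embedding table trained alongside the token embeddings, rather than being computed from a fixed formula.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Illustrative BERT-style setup: one trainable vector per position,
    added to the token embeddings."""

    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # trained like any other weight

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)       # 0, 1, ..., seq_len-1
        return self.token_emb(token_ids) + self.pos_emb(positions)       # broadcasts over the batch
```

The trade-off is that a learned table is limited to the maximum position it was trained on, whereas the sinusoidal formula can be evaluated for any position.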