The core mechanism allowing the model to weigh token importance.
Self-attention is the heart of the Transformer: a mechanism that lets the model dynamically weigh the importance of different words in an input sequence when building a representation for each word. For every word it processes, the model looks at all the words in the sequence and determines which are most relevant to the current word's meaning in this specific context.

This process is often explained with the analogy of 'Query,' 'Key,' and 'Value' vectors. For each input token, the model learns three separate vectors: a Query (Q) vector, representing the current word's request for information; a Key (K) vector, representing what kind of information that word can provide; and a Value (V) vector, representing the actual content of the word.

To compute attention for a given word, its Q vector is compared (via dot product) with the K vector of every token in the sequence, including itself. The resulting scores are scaled by the square root of the key dimension and passed through a softmax function to produce attention weights. These weights are then used to form a weighted sum of all the Value vectors in the sequence. The result is a new representation of the word that blends information from the whole sequence, with more 'attention' paid to the most relevant tokens.

To make this process even more powerful, the Transformer uses Multi-Head Attention. Instead of performing a single attention calculation, it runs several self-attention operations in parallel, each with its own set of learned Q, K, and V weight matrices. Each 'head' can learn to focus on a different type of relationship (e.g., one head might track syntactic dependencies while another tracks semantic similarity). The outputs of all the heads are then concatenated and linearly transformed to produce the final output of the multi-head attention layer.
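To make the Q/K/V computation concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function and variable names are illustrative assumptions for this article, not the API of any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v).

    Returns the attended output (seq_len, d_v) and the
    (seq_len, seq_len) matrix of attention weights.
    """
    d_k = Q.shape[-1]
    # Compare each query with every key via dot products,
    # then scale by sqrt(d_k) so the logits stay well-behaved.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys turns scores into attention weights.
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted sum of the Value vectors.
    return weights @ V, weights
```

Row i of `weights` sums to 1 and tells you how much token i attends to every token in the sequence, which is exactly the 'blend' described above.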
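Multi-head attention can then be sketched on top of the `scaled_dot_product_attention` function above. This version uses one projection matrix per role and slices it into heads, which is equivalent to giving each head its own smaller matrices; the shapes and names are again illustrative assumptions:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model).

    d_model must be divisible by num_heads.
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Project the inputs once into Query, Key, and Value spaces.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(num_heads):
        # Each head attends over its own slice of the projections.
        s = slice(h * d_head, (h + 1) * d_head)
        out, _ = scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s])
        heads.append(out)
    # Concatenate the heads and apply the final linear transform.
    return np.concatenate(heads, axis=-1) @ W_o

# Tiny usage example with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1
                      for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads).shape)  # (5, 16)
```

Because each head operates on an independent slice, the heads are free to learn different attention patterns, and the final projection `W_o` mixes their outputs back into a single representation.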