How Transformers create dynamic representations, unlike Word2Vec
A fundamental difference between modern Transformer-based models and older NLP models like Word2Vec lies in how they represent words.

Word2Vec produces static embeddings. In this paradigm, each word in the vocabulary is assigned a single, fixed embedding vector. The vector for the word 'bank' is the same whether it appears in 'river bank' or 'investment bank.' While static embeddings successfully capture general semantic relationships (such as 'king' is to 'man' as 'queen' is to 'woman'), they cannot account for polysemy (words having multiple meanings) or the nuances of context.

Transformers, on the other hand, produce contextual embeddings (more precisely, contextual representations). The process starts with a static embedding from the embedding matrix, just as in older models. However, this is only the initial input to the first Transformer layer. As this embedding passes through the stack of layers, the self-attention mechanism repeatedly refines it: at each layer, the representation of 'bank' is updated based on the other words present in the specific sentence. Self-attention lets 'bank' gather information from 'river' in one sentence and from 'investment' in another.

As a result, the final vector for 'bank' that emerges from the last Transformer layer depends heavily on its context: the representation of 'bank' in 'river bank' will be very different from that in 'investment bank.' This ability to generate dynamic, context-aware representations for each token is a primary reason LLMs achieve superior performance and deeper language understanding compared to their predecessors.
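The contrast can be made concrete with a toy sketch. The code below is not Word2Vec or a real Transformer: the embedding matrix and the attention projection matrices are random stand-ins for learned weights, and a single attention head stands in for a full layer stack. It shows the key mechanism, though: the static lookup returns the identical vector for 'bank' in both sentences, while one pass of self-attention already produces different vectors for 'bank' depending on whether its neighbor is 'river' or 'investment'.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and a static embedding matrix (stand-in for Word2Vec).
vocab = {"river": 0, "investment": 1, "bank": 2}
d = 8
E = rng.normal(size=(len(vocab), d))

def embed(tokens):
    """Static lookup: each word always maps to the same row of E."""
    return E[[vocab[t] for t in tokens]]

# Random projections for one self-attention head
# (illustrative only; a real Transformer learns these).
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(X):
    """One self-attention pass: each row of X attends over all rows."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    # Row-wise softmax over the attention scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

X1 = embed(["river", "bank"])
X2 = embed(["investment", "bank"])

# Static embeddings: 'bank' is identical in both sentences.
print(np.allclose(X1[1], X2[1]))   # True

H1, H2 = self_attention(X1), self_attention(X2)

# Contextual representations: 'bank' now differs between sentences,
# because attention mixed in 'river' vs. 'investment'.
print(np.allclose(H1[1], H2[1]))   # False
```

A real model like BERT repeats this refinement across many layers and heads with learned weights, so the divergence between the two 'bank' vectors reflects meaning rather than random projections, but the structural reason the vectors differ is exactly the one shown here.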