Representing words as dense vectors that capture semantic meaning.
A major breakthrough in NLP was the development of word embeddings, a technique for representing words as dense, low-dimensional vectors of real numbers. This is a significant improvement over sparse representations such as one-hot encoding, where each word is a vector as long as the vocabulary, containing a single '1' and zeros everywhere else. Sparse vectors suffer from the curse of dimensionality and capture no relationship between words: the vectors for 'cat' and 'dog' are exactly as dissimilar as the vectors for 'cat' and 'car'.

Word embeddings, such as those produced by the Word2Vec model, solve this problem by learning a distributed representation. Word2Vec is a predictive model trained on a large corpus of text. It has two main architectures: Continuous Bag-of-Words (CBOW), which predicts a target word from its surrounding context words, and Skip-gram, which does the opposite, predicting the surrounding context words given a target word. The key insight is that during training the model learns a vector representation (the embedding) for each word in its vocabulary, and the training objective forces words that appear in similar contexts to sit close to each other in the vector space.

The resulting embeddings capture semantic relationships: the vector for 'king' lies close to the vector for 'queen', and the vector for 'France' lies close to the vector for 'Italy'. Remarkably, these embeddings also capture analogies through simple vector arithmetic. The famous example is that `vector('king') - vector('man') + vector('woman')` yields a vector very close to `vector('queen')`. Pre-trained embeddings can then be used as the input layer of downstream NLP models, giving them a much richer understanding of language from the start.
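To make this concrete, here is a minimal sketch that trains a small Word2Vec model with the gensim library. The parameter names follow the gensim 4.x API, and the toy corpus is purely illustrative: meaningful embeddings require millions of sentences. The sketch shows how the `sg` flag selects between CBOW and Skip-gram, how each word maps to a dense vector, and how the 'king'/'queen' analogy is expressed through an additive `most_similar` query.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences. Real embeddings need a far
# larger corpus than this handful of sentences.
sentences = [
    ["the", "king", "rules", "his", "kingdom"],
    ["the", "queen", "rules", "her", "kingdom"],
    ["the", "man", "walked", "through", "the", "city"],
    ["the", "woman", "walked", "through", "the", "city"],
]

# sg=1 selects the Skip-gram architecture (predict context from target);
# sg=0 would select CBOW (predict target from context).
# vector_size is the dimensionality of the learned embeddings.
model = Word2Vec(
    sentences,
    vector_size=50,
    window=2,
    min_count=1,
    sg=1,
    epochs=100,
)

# Every vocabulary word now maps to a dense real-valued vector.
king_vec = model.wv["king"]   # numpy array of shape (50,)

# Words that occur in similar contexts end up close together,
# measured here by cosine similarity.
print(model.wv.similarity("king", "queen"))

# The analogy vector('king') - vector('man') + vector('woman') ~ vector('queen'),
# expressed via gensim's most_similar query. On this tiny toy corpus the
# result is not meaningful; with large pre-trained vectors, 'queen' ranks first.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```

In practice one rarely trains embeddings from scratch; the more common workflow the section alludes to is loading pre-trained vectors (for example through gensim's downloader module) and using them to initialize the embedding layer of a downstream model.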