From n-grams and RNNs to the Transformer revolution.
The journey to today's LLMs is a story of increasing model complexity and a growing ability to capture long-range context. The earliest language models were statistical, based on n-grams. An n-gram model predicts the next word from the probability of it following the previous 'n-1' words. For example, a trigram model (n=3) would predict the word following 'the cat sat' based on how often different words appear after that exact phrase in a large text corpus. While simple and effective for some tasks, n-gram models are severely limited by their short context window and their inability to generalize to sequences they have never seen.

The first major leap came with the application of Recurrent Neural Networks (RNNs) to language. RNNs, with their internal memory (hidden state), could in principle handle context of arbitrary length, processing a sentence word by word and folding each new word into a running hidden state. This was a significant improvement over the fixed context of n-grams. Advanced RNN architectures like LSTMs and GRUs were developed to better handle long-term dependencies, becoming the state of the art for many years and powering services like Google Translate. However, RNNs process text strictly sequentially, which prevents parallelization across the sequence, and in practice they still struggle to relate words that are far apart, because information must survive many intermediate state updates.

The true revolution began in 2017 with the paper 'Attention Is All You Need,' which introduced the Transformer architecture. The Transformer discarded recurrence entirely and instead relied on a mechanism called 'self-attention.' This allows the model to weigh the importance of all other words in the input when processing any given word, capturing complex relationships regardless of their distance. Crucially, the architecture is highly parallelizable, which finally unlocked the ability to train truly massive models on unprecedented amounts of data, leading directly to the era of LLMs.
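To make the n-gram idea from the first paragraph concrete, here is a minimal count-based trigram model. It is an illustrative sketch, not a reference implementation: the toy corpus is invented, and real n-gram models add smoothing and backoff to cope with unseen contexts.

```python
# Minimal count-based trigram model: a sketch under toy assumptions (no smoothing).
from collections import Counter, defaultdict

def train_trigram(tokens):
    """Count how often each word follows a given two-word context."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def predict_next(counts, w1, w2):
    """Return the most frequent word observed after the context (w1, w2)."""
    following = counts.get((w1, w2))
    if not following:
        return None  # unseen context: the classic n-gram failure mode
    return following.most_common(1)[0][0]

corpus = "the cat sat on the mat because the cat sat on the rug".split()
model = train_trigram(corpus)
print(predict_next(model, "cat", "sat"))  # -> "on"
print(predict_next(model, "the", "dog"))  # -> None (context never seen)
```

The second call shows exactly the limitation described above: any context not present in the training counts yields no useful prediction.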
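The RNN paragraph can be grounded with one "vanilla" recurrent step. This is a sketch with made-up dimensions and random weights; production systems of that era used LSTM or GRU cells, but the sequential hidden-state update shown here is the part that matters for the argument.

```python
# Vanilla RNN step: a sketch with toy sizes and random weights (assumptions).
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 16                      # toy embedding and hidden sizes
W_xh = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b_h = np.zeros(d_hidden)

def rnn_step(x_t, h_prev):
    """One time step: fold the current word vector into the running hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a 5-word "sentence" strictly one token at a time; this sequential
# dependency is why RNNs are hard to parallelize across the sequence.
sentence = rng.normal(size=(5, d_in))       # stand-ins for word embeddings
h = np.zeros(d_hidden)
for x_t in sentence:
    h = rnn_step(x_t, h)
print(h.shape)  # (16,) -- a single vector summarizing the whole sentence
```

Because every step depends on the previous one, the loop cannot be run in parallel, and distant words influence the final state only through a long chain of updates.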
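Finally, a minimal single-head self-attention sketch shows what "weighing the importance of all other words" means in practice. The dimensions and random inputs are placeholders, and real Transformers use multiple heads, layer stacking, and positional information; only the scaled dot-product attention pattern from 'Attention Is All You Need' is shown here.

```python
# Single-head scaled dot-product self-attention: a sketch with toy sizes (assumptions).
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Every position attends to every other position in one matrix multiply."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise relevance, any distance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ V                              # context-mixed representations

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 32, 16                   # toy sizes
X = rng.normal(size=(seq_len, d_model))             # stand-ins for token embeddings
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (6, 16) -- each token now mixes information from all tokens
```

Unlike the RNN loop, every pairwise interaction is computed in one batch of matrix multiplications, which is the parallelism that made training on massive corpora practical.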