Overview of the encoder-decoder stacks that define the model.
The Transformer architecture, introduced by Google researchers in the 2017 paper 'Attention Is All You Need', represented a paradigm shift in sequence modeling. It was designed to overcome the limitations of recurrent architectures such as RNNs and LSTMs: their sequential nature, which hindered parallelization, and their difficulty capturing long-range dependencies. The original Transformer was built for sequence-to-sequence tasks like machine translation and consists of two main parts: an encoder stack and a decoder stack.

The encoder's role is to process the input sequence (e.g., an English sentence) and build a rich, context-aware representation of it. It is composed of a stack of identical encoder layers, each with two sub-components: a multi-head self-attention mechanism and a position-wise feed-forward neural network.

The decoder's role is to take the encoder's output representations and generate the output sequence (e.g., the French translation) one token at a time. The decoder is also a stack of identical layers, but each decoder layer has three sub-components: a 'masked' multi-head self-attention mechanism (which prevents it from looking ahead at future tokens in the output it is generating), a cross-attention mechanism that attends to the encoder's output, and a position-wise feed-forward network.

Many modern generative LLMs, such as the GPT family, are 'decoder-only' Transformers. They discard the encoder and the cross-attention mechanism, functioning solely as powerful text generators. They are pre-trained to predict the next token in a sequence, and this simple objective, combined with the Transformer's architectural strengths and massive scale, is sufficient to learn the vast array of capabilities we observe in LLMs.
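To make the layer structure concrete, the following is a minimal sketch of one encoder layer and one decoder layer in PyTorch. It is an illustrative implementation under stated assumptions, not the paper's original code: the default sizes (d_model=512, n_heads=8, d_ff=2048) follow the base configuration of the 2017 paper, layer normalization is applied post-residual as in that paper, and embeddings, positional encodings, dropout, and the final output projection are omitted for brevity.

```python
# Minimal sketch of Transformer encoder/decoder layers (illustrative, not the original code).
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: multi-head self-attention over the full input sequence.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)            # residual connection + layer norm
        # Sub-layer 2: position-wise feed-forward network.
        return self.norm2(x + self.ffn(x))

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, y, enc_out):
        # Sub-layer 1: masked self-attention; a causal mask (True = blocked)
        # hides future target tokens from each position.
        t = y.size(1)
        causal_mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.self_attn(y, y, y, attn_mask=causal_mask)
        y = self.norm1(y + attn_out)
        # Sub-layer 2: cross-attention; queries come from the decoder,
        # keys and values from the encoder's output.
        attn_out, _ = self.cross_attn(y, enc_out, enc_out)
        y = self.norm2(y + attn_out)
        # Sub-layer 3: position-wise feed-forward network.
        return self.norm3(y + self.ffn(y))

# Toy usage: batch of 2 sequences, source length 7, target length 5.
src = torch.randn(2, 7, 512)
tgt = torch.randn(2, 5, 512)
enc_out = EncoderLayer()(src)
dec_out = DecoderLayer()(tgt, enc_out)
print(dec_out.shape)  # torch.Size([2, 5, 512])
```

Stacking several such layers (six in the original paper) yields the full encoder and decoder stacks. A decoder-only model, as described above, keeps only the masked self-attention and feed-forward sub-layers and drops the cross-attention entirely.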