The encoder-decoder architecture for tasks like machine translation.
Sequence-to-Sequence (Seq2Seq) models are a class of neural network architectures designed to handle problems where the input and output are both sequences of arbitrary length. This makes them particularly well-suited for tasks like machine translation (translating a sentence from one language to another), text summarization (converting a long document into a short summary), and question answering. A Seq2Seq model is composed of two main components, typically implemented using Recurrent Neural Networks (RNNs) or their variants like LSTMs or GRUs.

The first component is the Encoder. The encoder's job is to process the entire input sequence, one element at a time. As it does so, it compresses the information from the sequence into a single, fixed-size vector representation called the 'context vector' or 'thought vector.' This vector is intended to be a semantic summary of the entire input sequence; the final hidden state of the encoder RNN is often used as this context vector.

The second component is the Decoder. The decoder is another RNN that takes the context vector from the encoder as its initial hidden state. Its task is to generate the output sequence, one element at a time. At each step, it produces an output and updates its own hidden state, which is then used to generate the next element in the sequence. This process continues until a special end-of-sequence token is generated.

While revolutionary, this classic Seq2Seq architecture has a significant bottleneck: the fixed-size context vector. It is difficult for the model to compress all the information from a very long input sentence into this single vector, which leads to poor performance on long sequences. This limitation directly motivated the development of the attention mechanism.
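To make the encoder-decoder flow concrete, here is a minimal sketch in PyTorch. The framework choice, the use of GRUs, the hidden sizes, and the `SOS_IDX`/`EOS_IDX` token indices are all illustrative assumptions rather than details from the text; the point is that the encoder's final hidden state acts as the context vector, and the decoder generates one token per step until it emits the end-of-sequence token.

```python
import torch
import torch.nn as nn

SOS_IDX, EOS_IDX = 1, 2  # assumed special-token indices (start/end of sequence)

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len) of token ids
        _, hidden = self.rnn(self.embed(src))
        return hidden                        # (1, batch, hidden_dim): the context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, hidden):        # token: (batch, 1), hidden: (1, batch, hidden_dim)
        output, hidden = self.rnn(self.embed(token), hidden)
        return self.out(output), hidden      # logits over the target vocabulary, updated state

def greedy_decode(encoder, decoder, src, max_len=50):
    """Generate an output sequence one token at a time until EOS (or max_len)."""
    hidden = encoder(src)                              # context vector initialises the decoder
    token = torch.full((src.size(0), 1), SOS_IDX)      # start-of-sequence token for each example
    result = []
    for _ in range(max_len):
        logits, hidden = decoder(token, hidden)
        token = logits.argmax(dim=-1)                  # pick the most likely next token (greedy)
        result.append(token)
        if (token == EOS_IDX).all():
            break
    return torch.cat(result, dim=1)                    # (batch, generated_len)

# Illustrative usage with random token ids and made-up vocabulary sizes:
enc, dec = Encoder(vocab_size=8000), Decoder(vocab_size=6000)
src = torch.randint(3, 8000, (4, 12))                  # a batch of 4 "sentences" of length 12
print(greedy_decode(enc, dec, src).shape)
```

Note how everything the decoder knows about the input must pass through the single `hidden` tensor returned by the encoder; this is exactly the fixed-size bottleneck described above.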