A crucial optimization for speeding up text generation.
The KV Cache is a fundamental optimization technique that dramatically speeds up text generation from an autoregressive Transformer model (such as GPT). Understanding how it works requires looking at the self-attention mechanism during generation. A Transformer generates text one token at a time: to generate token T, the model performs a forward pass using tokens 1 through T-1 as context. A key part of this forward pass is the self-attention calculation, where each token's Query (Q) vector is compared against the Key (K) and Value (V) vectors of all previous tokens.

When the model then generates the next token, T+1, it needs the context from tokens 1 through T. A naive implementation would recalculate the Key and Value vectors for all the previous tokens (1 through T-1) from scratch, which is wasteful and redundant.

The KV Cache solves this. As the model processes each token, it computes that token's Key and Value vectors and stores them in a cache in GPU memory. When generating the next token, the model only needs to compute the K and V vectors for the newest token and append them to the cached K and V vectors from all previous steps, so the expensive matrix multiplications for past tokens are performed only once. The model then carries out the attention calculation against the full, up-to-date set of cached Keys and Values.

This simple caching strategy avoids an enormous amount of recomputation, making the generation of long sequences computationally feasible and far faster. The size of the cache, however, grows with the length of the generated sequence and can become a memory bottleneck for very long context windows.
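To make the idea concrete, here is a minimal sketch of single-head attention with a KV cache. It is written in PyTorch with toy, randomly initialized weights; the dimensions, variable names, and `attend_with_cache` helper are illustrative assumptions, not the API of any particular library. The point is simply that each decoding step computes Q, K, and V for the newest token only, appends the new K and V to the cache, and attends over everything cached so far.

```python
import torch
import torch.nn.functional as F

# Toy single-head attention with a KV cache (illustrative sizes and weights).
d_model = 64
W_q = torch.randn(d_model, d_model) / d_model**0.5
W_k = torch.randn(d_model, d_model) / d_model**0.5
W_v = torch.randn(d_model, d_model) / d_model**0.5

k_cache = []  # cached Key vectors, one per token processed so far
v_cache = []  # cached Value vectors, one per token processed so far

def attend_with_cache(x_new):
    """Run attention for one new token embedding x_new of shape (d_model,)."""
    q = x_new @ W_q                  # Query for the newest token only
    k_cache.append(x_new @ W_k)      # K and V are computed once, then cached
    v_cache.append(x_new @ W_v)
    K = torch.stack(k_cache)         # (T, d_model): all cached Keys
    V = torch.stack(v_cache)         # (T, d_model): all cached Values
    scores = K @ q / d_model**0.5    # compare q against every cached Key
    weights = F.softmax(scores, dim=0)
    return weights @ V               # attention output for the newest token

# Simulate autoregressive decoding: each step touches only one new token,
# while the cache supplies the K/V for every earlier position.
for step in range(5):
    x_new = torch.randn(d_model)     # stand-in for the newest token's embedding
    out = attend_with_cache(x_new)
    print(f"step {step}: cache holds {len(k_cache)} K/V pairs")
```

In a real model the cache holds separate K and V tensors for every layer and attention head, so its memory footprint is roughly 2 × num_layers × num_heads × head_dim × sequence_length × bytes per element, which is why very long contexts can exhaust GPU memory.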