The architecture of serving LLMs at scale
Serving a Large Language Model to thousands or millions of users requires a robust and scalable system architecture. The most common approach is to expose the model's functionality through a web-based Application Programming Interface (API): a user's application sends a request (containing the prompt and other parameters such as temperature) to an API endpoint, the server processes the request, generates a response from the LLM, and sends it back.

The core of the serving infrastructure is a cluster of powerful GPUs. A single request is typically handled by one or more GPUs, depending on the model parallelism strategy used. A critical component in front of these GPUs is a request scheduler or batching engine. Generating text token by token is an iterative process, and GPUs are most efficient when they are performing large matrix multiplications; if each request is processed individually, the GPU can be severely underutilized. A batching engine therefore groups multiple incoming requests and processes them simultaneously as a single 'batch', which significantly increases throughput (the number of requests served per second).

The system also needs to handle the complexities of generative inference. Unlike a classification model that produces a single output, an LLM generates a sequence of tokens. Each next token can be chosen via simple 'greedy' decoding (always picking the most likely token) or via more advanced sampling methods such as nucleus sampling, which produce more diverse outputs. For interactive applications like chatbots, responses are often 'streamed' back to the user token by token as they are generated, which improves the perceived latency of the system.
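As a concrete illustration of that request/response exchange, the JSON bodies often look roughly like the sketch below. The field names (`prompt`, `temperature`, `max_tokens`, `stream`) and the response shape are assumptions made for this example, not any particular provider's schema.

```python
import json

# Hypothetical request body a client application might POST to a completion endpoint.
# Field names are illustrative; real APIs differ in naming and defaults.
request_body = {
    "prompt": "Explain what a batching engine does.",
    "temperature": 0.7,   # sampling temperature; higher values give more diverse output
    "max_tokens": 128,    # upper bound on the number of generated tokens
    "stream": False,      # if True, the server streams tokens back as they are produced
}

# A server-side handler would parse the request, run generation, and reply with
# something shaped like this:
response_body = {
    "text": "A batching engine groups concurrent requests so the GPU ...",
    "finish_reason": "length",   # e.g. "length" (hit max_tokens) or "stop" (hit a stop sequence)
    "usage": {"prompt_tokens": 9, "completion_tokens": 128},
}

print(json.dumps(request_body, indent=2))
print(json.dumps(response_body, indent=2))
```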
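When the model is too large for a single device, its weight matrices can be sharded across GPUs. The toy sketch below splits one matrix multiply column-wise across two pretend devices (plain Python lists standing in for GPUs) to illustrate tensor parallelism, one common model parallelism strategy; it is a didactic sketch, not how any real framework shards weights.

```python
def matmul(x, w):
    """Naive matrix multiply: x is (m x k), w is (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

# A small activation matrix and weight matrix, readable by eye.
x = [[1.0, 2.0, 3.0, 4.0],
     [0.5, 0.5, 0.5, 0.5]]
w = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 2.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]

# Column-parallel split: "GPU 0" holds the first two output columns of w,
# "GPU 1" holds the last two. Each device multiplies against its shard only.
w_gpu0 = [row[:2] for row in w]
w_gpu1 = [row[2:] for row in w]

partial0 = matmul(x, w_gpu0)   # computed on device 0
partial1 = matmul(x, w_gpu1)   # computed on device 1

# Concatenating the partial outputs column-wise recovers the full result.
combined = [r0 + r1 for r0, r1 in zip(partial0, partial1)]
assert combined == matmul(x, w)
print(combined)
```

Real implementations do the same arithmetic on GPU tensors and use collective communication to combine the partial results across devices.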
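A minimal sketch of the batching-engine idea, assuming a hypothetical `generate_batch` function in place of the real model and ignoring details such as padding and per-request sampling parameters: requests are queued, collected for a short window, and dispatched to the GPU together.

```python
import queue
import threading
import time

MAX_BATCH_SIZE = 8        # assumed cap; tuned in practice for GPU memory and latency
MAX_WAIT_SECONDS = 0.02   # how long to wait for more requests before dispatching

request_queue: "queue.Queue[dict]" = queue.Queue()

def generate_batch(prompts):
    """Placeholder for the real model call, which would run each decoding step
    over the whole batch on the GPU."""
    return [f"<completion for: {p!r}>" for p in prompts]

def batching_loop():
    while True:
        batch = [request_queue.get()]                  # block until one request arrives
        deadline = time.monotonic() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        outputs = generate_batch([r["prompt"] for r in batch])
        for req, out in zip(batch, outputs):
            req["result"].put(out)                     # hand the result back to the caller

def submit(prompt):
    """Client-side helper: enqueue a request and block until its completion is ready."""
    result: "queue.Queue[str]" = queue.Queue(maxsize=1)
    request_queue.put({"prompt": prompt, "result": result})
    return result.get()

threading.Thread(target=batching_loop, daemon=True).start()
print(submit("Hello, world"))
```

The `MAX_WAIT_SECONDS` window captures the core trade-off: waiting longer yields fuller batches and higher throughput, at the cost of extra latency for the first request in the batch.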
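The difference between greedy decoding and nucleus (top-p) sampling can be shown on a single decoding step; the next-token probabilities below are made-up numbers used purely for illustration.

```python
import random

def greedy_pick(probs):
    """Greedy decoding: always take the single most likely next token."""
    return max(probs, key=probs.get)

def nucleus_pick(probs, top_p=0.9):
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose cumulative
    probability reaches top_p, then sample from that set proportionally."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens, weights = zip(*nucleus)
    return random.choices(tokens, weights=weights, k=1)[0]

# Toy next-token distribution for one decoding step (illustration only).
next_token_probs = {"the": 0.45, "a": 0.25, "this": 0.15, "banana": 0.10, "zebra": 0.05}

print(greedy_pick(next_token_probs))                        # always "the"
print([nucleus_pick(next_token_probs) for _ in range(5)])   # varies run to run
```

Lower `top_p` values shrink the nucleus toward greedy behaviour, while the temperature parameter mentioned above rescales the probabilities before the cut-off is applied.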
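Streaming amounts to flushing each token to the client as soon as it is decoded instead of waiting for the full completion. The framework-free sketch below uses a Python generator, with `decode_step` standing in for one iteration of the model.

```python
import time

def decode_step(step):
    """Stand-in for one decoding iteration of the model; returns the next token,
    or None when generation is finished."""
    tokens = ["Stream", "ing", " keeps", " perceived", " latency", " low", "."]
    return tokens[step] if step < len(tokens) else None

def stream_completion():
    """Yield tokens one at a time so the server can forward each to the client
    as soon as it exists, rather than after the whole sequence is done."""
    step = 0
    while (token := decode_step(step)) is not None:
        yield token
        step += 1

for token in stream_completion():
    print(token, end="", flush=True)   # the client sees partial output immediately
    time.sleep(0.05)                   # simulate per-token generation time
print()
```

Over HTTP this per-token output is typically delivered via chunked responses or server-sent events, which is why the first words of a chatbot reply appear almost immediately even when the full answer takes seconds to generate.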