Key performance metrics for an LLM serving system
When evaluating the performance of an LLM serving system, two key metrics are often in tension: latency and throughput. Understanding the difference between them, and the trade-off that connects them, is crucial for system design.

Latency is the time it takes to serve a single request: the duration from when a user sends a prompt to when they receive the complete response. It is typically measured in milliseconds or seconds. Low latency is critical for real-time, interactive applications such as chatbots, where a response that takes several seconds to arrive makes for a poor user experience.

Throughput, on the other hand, is the total number of requests the system can handle in a given period, a measure of the system's overall capacity. It is typically measured in requests per second or tokens per second. High throughput is crucial for applications serving a large number of concurrent users.

The trade-off arises because the primary technique for maximizing throughput is batching: grouping multiple requests so the GPU processes them together in a single forward pass. Batching, however, inherently increases latency. To form a batch, the system must wait either for a certain number of requests to arrive or for a short timeout to expire, and that waiting time is added directly to the latency of every request in the batch. A system optimized purely for throughput would use very large batches, leading to high latency; a system optimized purely for latency would process each request individually (a batch size of 1), leading to low throughput and inefficient GPU utilization.

The goal of a well-designed LLM serving system is to find the right balance for its specific application. Techniques such as continuous batching mitigate the trade-off by adding new requests to a running batch dynamically, improving both throughput and latency compared to static batching.
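To make the trade-off concrete, the sketch below models it with purely illustrative numbers: the fixed per-batch overhead, the marginal per-request compute cost, and the request arrival rate are all assumptions, not measurements of any real system. It shows how increasing the batch size raises throughput while also raising average per-request latency, because requests spend time waiting for the batch to fill before the forward pass even starts.

```python
"""
Illustrative (not benchmarked) model of the batching trade-off.
All numbers below are hypothetical assumptions:
  - each forward pass has a fixed overhead of 40 ms,
  - plus 2 ms of compute per request in the batch (GPU parallelism
    keeps the marginal per-request cost small),
  - requests arrive at a steady rate, so a batch of size B waits on
    average half the time it takes B requests to accumulate.
"""

ARRIVAL_RATE = 50.0       # requests per second (assumed)
FIXED_OVERHEAD_MS = 40.0  # fixed cost per forward pass (assumed)
PER_REQUEST_MS = 2.0      # marginal cost per batched request (assumed)


def batch_latency_ms(batch_size: int) -> float:
    """Average end-to-end latency for one request in a batch of the given size."""
    # Waiting for the batch to fill: B requests accumulate over (B - 1)
    # inter-arrival times, so the average request waits (B - 1) / 2 of them.
    avg_wait_ms = (batch_size - 1) / 2 / ARRIVAL_RATE * 1000.0
    compute_ms = FIXED_OVERHEAD_MS + PER_REQUEST_MS * batch_size
    return avg_wait_ms + compute_ms


def batch_throughput_rps(batch_size: int) -> float:
    """Requests per second if the GPU runs back-to-back batches of this size."""
    compute_s = (FIXED_OVERHEAD_MS + PER_REQUEST_MS * batch_size) / 1000.0
    return batch_size / compute_s


if __name__ == "__main__":
    print(f"{'batch':>5} {'latency (ms)':>14} {'throughput (req/s)':>20}")
    for b in (1, 4, 16, 64):
        print(f"{b:>5} {batch_latency_ms(b):>14.1f} {batch_throughput_rps(b):>20.1f}")
```

With these assumed numbers, moving from a batch size of 1 to 64 raises throughput by more than an order of magnitude while pushing average latency from a few tens of milliseconds to several hundred, which is exactly the tension described above. A real serving system replaces the static "wait for the batch to fill" step with continuous batching, admitting new requests into the running batch as slots free up.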