Key performance metrics for an LLM serving system
When evaluating the performance of an LLM serving system, two key metrics are often in tension: latency and throughput. Understanding the difference between them, and the trade-off that connects them, is crucial for system design.

Latency is the time it takes to serve a single request: the duration from when a user sends a prompt to when they receive the complete response. It is typically measured in milliseconds or seconds. Low latency is critical for real-time, interactive applications such as chatbots, where a response that takes several seconds to arrive makes for a poor user experience.

Throughput, on the other hand, is the total number of requests the system can handle in a given period, a measure of the system's overall capacity. It is typically measured in requests per second or tokens per second. High throughput is crucial for applications serving a large number of concurrent users.

The trade-off arises because the primary technique for maximizing throughput is batching: grouping multiple requests so the GPU processes them together in a single forward pass. Batching, however, inherently increases latency. To form a batch, the system must wait either for a certain number of requests to arrive or for a short timeout to expire, and that waiting time is added directly to the latency of every request in the batch. A system optimized purely for throughput would use very large batches, leading to high latency; a system optimized purely for latency would process each request individually (a batch size of 1), leading to low throughput and inefficient GPU utilization.

The goal of a well-designed LLM serving system is to find the right balance for its specific application. Techniques such as continuous batching mitigate the trade-off by adding new requests to a running batch dynamically, improving both throughput and latency compared to static batching.
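To make the trade-off concrete, the sketch below models it with purely illustrative numbers: the fixed per-batch overhead, the marginal per-request compute cost, and the request arrival rate are all assumptions, not measurements of any real system. It shows how increasing the batch size raises throughput while also raising average per-request latency, because requests spend time waiting for the batch to fill before the forward pass even starts.

```python
"""
Illustrative (not benchmarked) model of the batching trade-off.
All numbers below are hypothetical assumptions:
  - each forward pass has a fixed overhead of 40 ms,
  - plus 2 ms of compute per request in the batch (GPU parallelism
    keeps the marginal per-request cost small),
  - requests arrive at a steady rate, so a batch of size B waits on
    average half the time it takes B requests to accumulate.
"""

ARRIVAL_RATE = 50.0       # requests per second (assumed)
FIXED_OVERHEAD_MS = 40.0  # fixed cost per forward pass (assumed)
PER_REQUEST_MS = 2.0      # marginal cost per batched request (assumed)


def batch_latency_ms(batch_size: int) -> float:
    """Average end-to-end latency for one request in a batch of the given size."""
    # Waiting for the batch to fill: B requests accumulate over (B - 1)
    # inter-arrival times, so the average request waits (B - 1) / 2 of them.
    avg_wait_ms = (batch_size - 1) / 2 / ARRIVAL_RATE * 1000.0
    compute_ms = FIXED_OVERHEAD_MS + PER_REQUEST_MS * batch_size
    return avg_wait_ms + compute_ms


def batch_throughput_rps(batch_size: int) -> float:
    """Requests per second if the GPU runs back-to-back batches of this size."""
    compute_s = (FIXED_OVERHEAD_MS + PER_REQUEST_MS * batch_size) / 1000.0
    return batch_size / compute_s


if __name__ == "__main__":
    print(f"{'batch':>5} {'latency (ms)':>14} {'throughput (req/s)':>20}")
    for b in (1, 4, 16, 64):
        print(f"{b:>5} {batch_latency_ms(b):>14.1f} {batch_throughput_rps(b):>20.1f}")
```

With these assumed numbers, moving from a batch size of 1 to 64 raises throughput by more than an order of magnitude while pushing average latency from a few tens of milliseconds to several hundred, which is exactly the tension described above. A real serving system replaces the static "wait for the batch to fill" step with continuous batching, admitting new requests into the running batch as slots free up.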