Techniques like quantization to make models faster and smaller.
Inference optimization is the set of techniques used to make a trained LLM run faster and more efficiently. One of the most effective and widely used techniques is quantization. Most deep learning models are trained using 32-bit floating-point numbers (FP32) for their weights and activations, because this high precision helps the training process converge stably. During inference, however, that level of precision is often unnecessary.

Quantization is the process of reducing the number of bits required to represent these numbers. For example, we can convert the model's FP32 weights to 16-bit floating point (FP16 or bfloat16), 8-bit integers (INT8), or even 4-bit integers (INT4).

This has two major benefits. First, it significantly reduces the size of the model in memory: an FP32 model converted to INT8 is roughly 4 times smaller, making it easier to fit on a single GPU and reducing memory bandwidth requirements. Second, modern GPUs and specialized AI hardware have dedicated processing units that can perform integer arithmetic much faster than floating-point arithmetic, so running an INT8 quantized model can substantially increase inference throughput.

Of course, there is a trade-off. Reducing precision can cause a slight degradation in the model's accuracy. The art of quantization lies in using techniques such as asymmetric quantization or mixed precision across layers to minimize this accuracy loss while maximizing the gains in speed and size. For many applications, a small, acceptable drop in quality is a worthwhile price for a 2-4x speedup in inference.
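
To make the mapping concrete, here is a minimal NumPy sketch of asymmetric INT8 quantization applied to a single weight matrix. The function names and tensor shape are illustrative assumptions, not part of any particular framework; production systems use dedicated toolkits (e.g., bitsandbytes, GPTQ, or TensorRT) rather than hand-rolled code like this.

```python
# Minimal sketch: asymmetric INT8 quantization of one FP32 weight tensor.
# Names and shapes are illustrative only.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map FP32 weights onto the INT8 range [-128, 127] using a scale and zero-point."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 255.0                     # FP32 value covered by one INT8 step
    zero_point = int(np.round(-128 - w_min / scale))    # INT8 code that represents 0.0
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an FP32 approximation of the original weights."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)  # one LLM-sized weight matrix

q, scale, zp = quantize_int8(w)
w_hat = dequantize_int8(q, scale, zp)

print(f"FP32 size: {w.nbytes / 1e6:.1f} MB, INT8 size: {q.nbytes / 1e6:.1f} MB")   # ~4x smaller
print(f"Mean absolute reconstruction error: {np.abs(w - w_hat).mean():.6f}")       # the accuracy cost
```

The scale and zero-point are stored alongside the INT8 weights so the original values can be approximately reconstructed at compute time; the printed reconstruction error is exactly the small accuracy cost the trade-off discussion above refers to, and real quantizers shrink it further by using a separate scale per channel or per group rather than one per tensor.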