Methods such as LoRA that fine-tune LLMs by training only a small fraction of parameters, sharply reducing memory and compute costs.
As Large Language Models grew to hundreds of billions of parameters, full supervised fine-tuning became increasingly impractical. It requires enough GPU memory to hold the model weights, their gradients, and the optimizer states, and the result is a full-sized copy of the model, which makes it expensive to store and deploy many different specialized variants. Parameter-Efficient Fine-Tuning (PEFT) methods address this challenge: the vast majority of the pre-trained LLM's parameters are frozen, and only a very small number of new or existing parameters are trained, dramatically reducing the memory and computational requirements of fine-tuning.

Low-Rank Adaptation (LoRA) is one of the most popular and effective PEFT techniques. It is based on the hypothesis that the change in the weight matrices during fine-tuning has a low 'intrinsic rank': the large matrix of weight updates can be approximated by the product of two much smaller, low-rank matrices (roughly, ΔW ≈ B·A, where B is d×r, A is r×k, and the rank r is far smaller than d or k). In practice, LoRA injects pairs of these small, trainable rank-decomposition matrices into each Transformer layer, alongside the frozen pre-trained weights, and only these new matrices are updated during training. For inference, the learned update can be merged (added) back into the original weights, so no extra latency is introduced. This allows a massive model to be fine-tuned with only a tiny fraction of its parameters trainable (often less than 0.1%), yielding a large reduction in memory usage and faster training, while achieving performance very close to that of full fine-tuning.
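To make the mechanism concrete, here is a minimal PyTorch-style sketch of a single linear layer wrapped with a LoRA adapter. This is not the reference implementation from the LoRA paper or the Hugging Face peft library; the class name LoRALinear, the rank r = 8, and the scaling factor alpha are illustrative assumptions. The sketch only shows the core idea: frozen base weights, two small trainable factors A and B, and a merge step for inference.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights

        d_out, d_in = base.weight.shape
        # The weight update is approximated by B @ A: (d_out x r) times (r x d_in).
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank path; only A and B receive gradients.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # Fold the learned update back into the original weights for inference.
        merged = nn.Linear(self.base.in_features, self.base.out_features,
                           bias=self.base.bias is not None)
        merged.weight.copy_(self.base.weight + (self.B @ self.A) * self.scaling)
        if self.base.bias is not None:
            merged.bias.copy_(self.base.bias)
        return merged

# Example: wrap a 4096x4096 projection; only the two small factors are trainable.
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```

In this single-layer example roughly 0.4% of the wrapped layer's parameters are trainable; across a full model, where only selected projection matrices receive adapters, the overall trainable fraction is typically even smaller, which is where figures like "less than 0.1%" come from.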