High-level overview of how training is scaled across many GPUs.
Training a state-of-the-art LLM is not feasible on a single GPU: the model is too large to fit in one device's memory, and the training data is far too vast to process in a reasonable amount of time. LLM training therefore relies on distributed training, a set of techniques for splitting the workload across a cluster of hundreds or thousands of GPUs. There are two main strategies: data parallelism and model parallelism.

Data parallelism is the most common approach. The entire model is replicated on each GPU, the training data is split into mini-batches, and each GPU receives a different mini-batch to process. Each GPU computes the forward and backward passes for its batch independently, producing gradients for its copy of the model. A communication step then averages the gradients across all GPUs, and each GPU applies this averaged gradient to update its local weights, keeping every replica synchronized. Data parallelism is effective, but it requires that a full copy of the model (together with its gradients and optimizer state) fit in the memory of a single GPU.

When a model becomes too large for one GPU, model parallelism is required: the model itself is split across multiple GPUs. There are two common flavors. In 'tensor parallelism,' individual layers or even single large weight matrices (like the embedding table or the FFN) are split across GPUs, which requires significant communication between the GPUs during the forward and backward passes. In 'pipeline parallelism,' contiguous groups of layers are placed on different GPUs: one GPU might handle layers 1-8, the next GPU layers 9-16, and so on, with data flowing through this pipeline of stages. To keep every stage busy, each batch is typically split into micro-batches that move through the pipeline in a staggered schedule.

Efficiently combining these parallelism strategies is a complex engineering challenge and is essential for training the largest models in existence today. The sketches below illustrate each of the three strategies in isolation.
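To make the data-parallel flow concrete, here is a minimal sketch using PyTorch's DistributedDataParallel (DDP) wrapper, which performs the gradient averaging (an all-reduce) automatically during the backward pass. The tiny linear model, the random dataset, and the assumption that the script is launched with `torchrun` are illustrative choices, not details from the text above.

```python
# Minimal data-parallel sketch with PyTorch DistributedDataParallel (DDP).
# Assumes launch via `torchrun --nproc_per_node=<N> train.py`, which sets
# LOCAL_RANK / RANK / WORLD_SIZE. The model and dataset are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Every rank holds a full replica of the model.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # DistributedSampler gives each rank a disjoint shard of the data,
    # i.e. a different mini-batch per GPU.
    dataset = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
    loader = DataLoader(dataset, batch_size=32, sampler=DistributedSampler(dataset))

    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()      # DDP all-reduces (averages) gradients here
        optimizer.step()     # every replica applies the same averaged update
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```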
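Tensor parallelism can be shown with a framework-free toy example: a single FFN weight matrix is split column-wise across two GPUs, each GPU computes its slice of the output, and the partial results are gathered back together. Real systems do this with collective operations (all-gather / all-reduce) fused into the forward and backward passes; the shapes and the two-GPU setup here are assumptions for illustration.

```python
# Toy tensor-parallel sketch: one weight matrix split column-wise across two GPUs.
# Requires at least two CUDA devices.
import torch

d_model, d_ff = 1024, 4096
x = torch.randn(32, d_model)

# Split the full weight into two column shards, one per GPU.
w = torch.randn(d_model, d_ff)
w0 = w[:, : d_ff // 2].to("cuda:0")
w1 = w[:, d_ff // 2 :].to("cuda:1")

# Each GPU multiplies the (replicated) input by its shard of the weights.
y0 = x.to("cuda:0") @ w0
y1 = x.to("cuda:1") @ w1

# "Gather" step: concatenate the partial outputs to recover the full activation.
y = torch.cat([y0, y1.to("cuda:0")], dim=-1)
assert y.shape == (32, d_ff)
```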
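Finally, a toy pipeline-parallel sketch: the first eight layers live on one GPU, the next eight on another, and activations are handed from stage to stage during the forward pass. Production schedules (e.g. GPipe-style micro-batching) interleave micro-batches to keep all stages busy; that detail is omitted here, and the layer sizes and two-stage split are illustrative.

```python
# Toy pipeline-parallel sketch: two stages on two GPUs, activations passed between them.
import torch
import torch.nn as nn

class TwoStagePipeline(nn.Module):
    def __init__(self, d_model: int = 1024, layers_per_stage: int = 8):
        super().__init__()
        def block():
            return nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
        # Stage 0: layers 1-8 on GPU 0; stage 1: layers 9-16 on GPU 1.
        self.stage0 = nn.Sequential(*[block() for _ in range(layers_per_stage)]).to("cuda:0")
        self.stage1 = nn.Sequential(*[block() for _ in range(layers_per_stage)]).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stage0(x.to("cuda:0"))   # compute the first stage on GPU 0
        x = self.stage1(x.to("cuda:1"))   # ship activations to GPU 1, continue
        return x

model = TwoStagePipeline()
out = model(torch.randn(32, 1024))
print(out.device)  # cuda:1
```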