The optimizers used to train models with billions of parameters.
Training a model with billions of parameters requires an optimization algorithm that is both efficient and stable. Stochastic Gradient Descent (SGD) provides the theoretical foundation, but on its own it converges too slowly and adapts too little to train LLMs effectively. The de facto standard for large-scale deep learning is Adam, or more precisely its variant AdamW.

Adam, which stands for Adaptive Moment Estimation, improves on plain SGD in two key ways. First, it maintains an exponentially decaying average of past gradients (the 'first moment,' akin to momentum), which accelerates progress in consistent directions and dampens oscillations. Second, it maintains an exponentially decaying average of past squared gradients (the 'second moment,' in the spirit of AdaGrad and RMSProp) and uses it to adapt the learning rate of each parameter individually: parameters that receive large or frequent gradients get a smaller effective step, while parameters with small or infrequent gradients get a larger one. This per-parameter adaptivity is crucial for complex models in which different parameters require very different step sizes.

AdamW (Adam with decoupled Weight Decay) is a small but important modification of the original algorithm. Standard L2 regularization discourages large weights, and thus overfitting, by adding a penalty term to the loss function. In Adam, however, that penalty flows through the adaptive learning rates, so the effective strength of the regularization varies from parameter to parameter and the two mechanisms interact poorly. AdamW decouples the weight decay from the gradient update step: it performs the standard Adam update and then shrinks the model's weights directly by a small factor. This seemingly minor change leads to better generalization performance and has made AdamW the preferred optimizer for training Transformers and other LLMs.
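To make the two moment estimates concrete, here is a minimal NumPy sketch of a single Adam update step as described above. The function name, the explicit state passing, and the hyperparameter defaults are illustrative choices for exposition, not a reference implementation.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter vector theta given its gradient grad.

    m and v are the running first- and second-moment estimates; t is the
    1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad      # first moment: decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad**2   # second moment: decaying average of squared gradients
    m_hat = m / (1 - beta1**t)              # bias correction (both moments start at zero)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return theta, m, v
```

Because the step is divided by the square root of the second moment, a parameter with consistently large gradients sees its effective learning rate shrink, while one with small gradients sees it grow, which is exactly the adaptive behavior described above.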
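The difference between L2 regularization folded into the gradient and AdamW's decoupled weight decay can be sketched as a small change to the same update. The name `adamw_step` and the `weight_decay` default are again illustrative assumptions rather than prescribed values.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: a standard Adam step followed by decoupled weight decay."""
    # Adam with classic L2 regularization would instead fold the penalty into
    # the gradient before the adaptive scaling:
    #   grad = grad + weight_decay * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    # AdamW decays the weights directly, after the adaptive step, so the
    # regularization strength is not rescaled per parameter by the second moment.
    theta = theta - lr * weight_decay * theta
    return theta, m, v
```

The only substantive difference is where the shrinkage happens: inside the gradient (where Adam's adaptive scaling distorts it) versus applied directly to the weights after the update.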