The predictable relationship between size, data, and performance.
Scaling laws are a set of empirical observations that have become a guiding principle in the development of Large Language Models. They describe a predictable power-law relationship between a model's performance (measured by its loss on a held-out test set) and three quantities: the number of parameters in the model (N), the size of the training dataset (D), and the amount of compute used for training (C). The key finding, first systematically explored by researchers at OpenAI, is that model performance improves smoothly and predictably as these three factors are scaled up: the test loss decreases as a power law in each of N, D, and C, provided the other factors are not the bottleneck. The approximate functional forms are sketched below.

This was a monumental discovery because it transformed the process of building better AI models from a series of ad-hoc architectural innovations into a more predictable engineering challenge. It implied that, for a given computational budget, there is an optimal allocation between model size and dataset size. If the model is too large for the dataset, it will overfit; if the model is too small for the dataset, it will underfit and leave performance on the table. The scaling laws provide a recipe for increasing both in tandem to achieve the best possible performance for a given amount of compute (see the code sketch further below).

This principle has driven the trend of building ever-larger models. It provides the confidence to invest the immense resources required to train a model with hundreds of billions of parameters, knowing that the resulting performance will likely be state-of-the-art. Scaling also underlies the phenomenon of 'emergent abilities,' where certain capabilities, such as multi-step reasoning, only appear once a model surpasses a certain size threshold: the smooth improvement in loss eventually crosses a critical point where these new, qualitative behaviors emerge.
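As a rough sketch of the power-law form, the single-factor fits reported by Kaplan et al. (2020) look like the following; the scale constants (N_c, D_c, C_c) and exponents depend on architecture, tokenizer, and data distribution, so treat the notation as illustrative rather than definitive:

```latex
% Approximate single-factor scaling forms (Kaplan et al., 2020 style);
% constants and exponents are dataset- and architecture-dependent.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}

% A combined form capturing the overfit/underfit trade-off between N and D:
L(N, D) \approx \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}
```

Reported exponents are small (on the order of 0.05 to 0.1), which is why each constant-factor reduction in loss requires a multiplicative increase in scale.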
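To make the compute-optimal allocation concrete, here is a minimal, hypothetical Python sketch. It grid-searches the parameter/token split that minimizes the combined L(N, D) form above at a fixed compute budget, using the common approximation that training cost is roughly C ≈ 6·N·D FLOPs. The constants below are assumptions chosen for illustration, not fitted values from any particular paper.

```python
import numpy as np

# Illustrative constants in the spirit of Kaplan et al. (2020);
# these are assumptions for demonstration, not measured fits.
N_C, ALPHA_N = 8.8e13, 0.076   # parameter-count scale and exponent
D_C, ALPHA_D = 5.4e13, 0.095   # dataset-size scale and exponent


def loss(n_params, n_tokens):
    """Combined scaling form: L(N, D) = [(N_c/N)^(a_N/a_D) + D_c/D]^a_D."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D


def compute_optimal_split(compute_flops):
    """Grid-search the N/D split that minimizes loss at a fixed compute budget.

    Uses the rough approximation C ~= 6 * N * D training FLOPs.
    """
    n_grid = np.logspace(6, 13, 2000)          # candidate parameter counts
    d_grid = compute_flops / (6.0 * n_grid)    # tokens implied by the budget
    losses = loss(n_grid, d_grid)
    best = int(np.argmin(losses))
    return n_grid[best], d_grid[best], losses[best]


if __name__ == "__main__":
    for budget in (1e21, 1e23, 1e25):          # FLOP budgets spanning four orders
        n_opt, d_opt, l_opt = compute_optimal_split(budget)
        print(f"C={budget:.0e} FLOPs -> N~{n_opt:.2e} params, "
              f"D~{d_opt:.2e} tokens, predicted loss~{l_opt:.3f}")
```

The grid search is used purely for clarity; because both terms in the bracket are power laws, the same optimum can be derived in closed form. The qualitative takeaway matches the prose above: as the compute budget grows, the best allocation increases both model size and data size together rather than scaling only one.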