The self-supervised tasks used to train LLMs from scratch.
Pre-training is the initial, computationally intensive phase in which an LLM learns general-purpose knowledge from a massive, unlabeled text corpus. It is a form of 'self-supervised learning' because the labels are generated automatically from the input data itself, with no human annotators. The specific task the model performs during this phase is called the pre-training objective, and the two most influential objectives are Causal Language Modeling (CLM) and Masked Language Modeling (MLM).

Causal Language Modeling (CLM), also known as autoregressive or next-token prediction, is the objective used by generative models such as the GPT family. The model is trained to predict the next token in a sequence given all preceding tokens: for the sentence 'The quick brown fox', it learns to predict 'quick' given 'The', then 'brown' given 'The quick', and so on. In Transformers this is enforced with a 'look-ahead' (causal) mask in the self-attention mechanism, which prevents a token at a given position from attending to any subsequent tokens. This inherent left-to-right directionality makes CLM models excellent at text generation.

Masked Language Modeling (MLM) is the objective used by models like BERT. Instead of predicting the next token, MLM takes an input sentence, randomly 'masks' (hides) about 15% of its tokens, and trains the model to predict the original identity of the masked tokens. To do this well, the model must draw on both the left and the right context surrounding each mask. This bidirectional context lets MLM-based models build a deep, rich understanding of language, making them exceptionally good at discriminative tasks like text classification and question answering, which require a holistic understanding of the input text. Minimal code sketches of both objectives follow below.
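A minimal sketch of the CLM objective in plain PyTorch, assuming a toy vocabulary, a single Transformer encoder layer, and random token ids standing in for real text (none of these specifics come from the text above); the point is the label shifting and the look-ahead mask, not a production setup:

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 100, 32, 8

# Toy "corpus": one batch of random token ids standing in for real text.
tokens = torch.randint(0, vocab_size, (1, seq_len))

# CLM labels: each position must predict the token that follows it,
# so targets are simply the inputs shifted left by one.
inputs, targets = tokens[:, :-1], tokens[:, 1:]

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
lm_head = nn.Linear(d_model, vocab_size)

# The 'look-ahead' (causal) mask: position i may only attend to positions <= i.
causal_mask = nn.Transformer.generate_square_subsequent_mask(inputs.size(1))

hidden = layer(embed(inputs), src_mask=causal_mask)
logits = lm_head(hidden)  # (batch, seq_len - 1, vocab_size)

# Cross-entropy between each position's prediction and the actual next token.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
print(loss.item())
```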
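A minimal sketch of the MLM objective under the same toy assumptions; it masks roughly 15% of positions and computes the loss only at those positions. Real BERT-style pre-training uses a more elaborate replacement scheme, which this sketch omits for brevity:

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 100, 32, 8
mask_token_id = vocab_size  # reserve one extra id to act as the [MASK] token

tokens = torch.randint(0, vocab_size, (1, seq_len))

# Choose ~15% of positions to hide; those positions become the targets.
mask = torch.rand(tokens.shape) < 0.15
mask[0, 0] = True  # guarantee at least one masked position in this tiny example
inputs = tokens.clone()
inputs[mask] = mask_token_id

# Loss is computed only where tokens were masked (-100 positions are ignored).
targets = tokens.clone()
targets[~mask] = -100

embed = nn.Embedding(vocab_size + 1, d_model)  # +1 for the [MASK] token
# No causal mask here: every position attends to both left and right context.
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
mlm_head = nn.Linear(d_model, vocab_size)

logits = mlm_head(layer(embed(inputs)))  # (batch, seq_len, vocab_size)

loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1), ignore_index=-100
)
print(loss.item())
```

The only structural difference from the CLM sketch is the absence of the look-ahead mask and the use of masked positions, rather than shifted labels, as the prediction targets.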