From raw text to a list of integer IDs.
The process of converting raw text into a format suitable for an LLM is a multi-step pipeline. While we often think of it as a single 'tokenization' step, it's more accurately a sequence of transformations.

The first stage is Normalization. This involves applying a series of standardizations to the raw text to clean it up and reduce variations that don't affect meaning. Common normalization steps include converting text to a uniform case (e.g., lowercase), handling Unicode compatibility (NFKC normalization), and sometimes stripping accents or removing extra whitespace. The goal is to ensure that different but semantically equivalent strings are treated the same way.

The second stage is Pre-tokenization. This is the initial step of splitting the text into smaller, preliminary chunks. A common pre-tokenizer splits the text based on whitespace and punctuation. For example, 'Hello, world!' might be pre-tokenized into ['Hello', ',', 'world', '!']. This step creates the initial 'words' that the main tokenization model will work with.

The third stage is the core Model itself, which applies the subword tokenization algorithm (like BPE, WordPiece, or Unigram) to the pre-tokenized words. This is where a word like 'tokenization' might be broken down into subwords like ['token', '##ization']. The model uses the vocabulary learned during its training to perform this split.

The final stage is Post-Processing. This step involves adding any special tokens required by the LLM architecture. For example, models like BERT require a '[CLS]' token at the beginning of a sequence and a '[SEP]' token to separate sentences. This stage assembles the final sequence of tokens and then converts them into their corresponding integer IDs from the vocabulary, ready to be fed into the model's embedding layer.
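
To make these stages concrete, the sketches below use the Hugging Face 'tokenizers' and 'transformers' libraries; the specific normalizers, pre-tokenizers, and the bert-base-uncased checkpoint are illustrative choices, not the only way to build such a pipeline. First, a minimal normalization sketch:

```python
from tokenizers import normalizers
from tokenizers.normalizers import NFKC, NFD, StripAccents, Lowercase

# Chain several normalization steps: NFKC unifies Unicode compatibility
# characters (e.g., fullwidth letters), NFD + StripAccents removes
# diacritics, and Lowercase folds everything to a uniform case.
normalizer = normalizers.Sequence([NFKC(), NFD(), StripAccents(), Lowercase()])

print(normalizer.normalize_str("Héllo, ＷＯＲＬＤ!"))
# 'hello, world!'
```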
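
Next, pre-tokenization. The Whitespace pre-tokenizer from the same library splits on whitespace and punctuation, which reproduces the 'Hello, world!' example above (it also records the character offsets of each chunk):

```python
from tokenizers.pre_tokenizers import Whitespace

pre_tokenizer = Whitespace()

# Splits on whitespace and punctuation, keeping each preliminary chunk's
# character offsets in the original string.
print(pre_tokenizer.pre_tokenize_str("Hello, world!"))
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```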
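
For the model stage, a pretrained WordPiece tokenizer shows the subword split described above. The bert-base-uncased checkpoint is used purely as an example; the exact split depends on the vocabulary the tokenizer was trained with.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# WordPiece breaks a word into subwords from its learned vocabulary;
# continuation pieces are marked with the '##' prefix.
print(tokenizer.tokenize("tokenization"))
# ['token', '##ization']
```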
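
Finally, post-processing. Calling the same tokenizer on a full string runs the whole pipeline end to end: it adds the '[CLS]' and '[SEP]' special tokens and maps every token to its integer ID, which is the list the embedding layer consumes.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Full pipeline: normalize, pre-tokenize, apply WordPiece,
# add special tokens, and look up integer IDs in the vocabulary.
encoding = tokenizer("Hello, world!")

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'hello', ',', 'world', '!', '[SEP]']

print(encoding["input_ids"])
# a list of integer IDs; in this vocabulary [CLS] maps to 101 and [SEP] to 102
```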