Solving the out-of-vocabulary problem with subword units.
While simple word-based tokenization is intuitive, it suffers from major drawbacks, especially for large-scale models. The vocabulary can become enormous, and the model has no way of handling words it never saw during training (out-of-vocabulary, or OOV, words). Subword tokenization algorithms provide an elegant solution to both problems. The core idea is to break rare or complex words into smaller, more frequent subword units while keeping common words as single tokens. The model can then represent any word as a sequence of these subwords, effectively eliminating the OOV problem.

Byte-Pair Encoding (BPE) is a popular subword algorithm. It starts with a vocabulary of the individual characters present in the training corpus. It then iteratively finds the most frequent pair of adjacent tokens and merges them into a single new token, adding that token to the vocabulary. This process repeats for a predefined number of merges, producing a vocabulary that contains common words as single tokens and the components of rare words as subword tokens. For example, a common word like 'embedding' might remain a single token, while a rarer word like 'tokenization' might be split into 'token' and 'ization' (a minimal sketch of the merge loop appears at the end of this section).

WordPiece, used by models like BERT, is a similar algorithm. It also builds its vocabulary by iterative merging, but instead of selecting the most frequent pair, it selects the pair whose merge most increases the likelihood of the training data under the current vocabulary; subwords that continue a word are marked with a '##' prefix, so 'tokenization' might become 'token', '##ization'. The result is conceptually the same: a fixed-size vocabulary that can represent any input text by combining whole-word and subword units. This approach strikes a balance between the expressiveness of character-level tokenization and the efficiency of word-level tokenization.
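To make the BPE merge loop concrete, here is a minimal training sketch in plain Python. The toy corpus, the '</w>' end-of-word marker, the merge count, and the helper names (get_pair_counts, merge_pair, segment) are illustrative assumptions rather than any particular library's implementation; real tokenizers run tens of thousands of merges over far larger corpora.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single symbol in every word."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def segment(word, merges):
    """Split a word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                symbols[i:i + 2] = ["".join(pair)]
            else:
                i += 1
    return symbols

# Toy corpus: each word is split into characters plus an end-of-word
# marker so that merges never cross word boundaries.
corpus = Counter("low lower lowest newer newest wider".split())
vocab = {" ".join(word) + " </w>": freq for word, freq in corpus.items()}

merges = []
for _ in range(10):                      # real vocabularies use tens of thousands of merges
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(segment("lowest", merges))  # a training word collapses to a few learned subwords
print(segment("slower", merges))  # an unseen word still decomposes into known subwords
```

At inference time an unseen word is handled exactly as in the 'slower' example above: the learned merges are replayed in order, so any word built from known characters reduces to known subwords, and in the worst case to individual characters, which is why the OOV problem disappears.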