What Is Tokenization?
Tokenization is the process of converting raw text into a sequence of tokens (discrete units such as words, subwords, or characters) that an AI model can process. Modern tokenizers typically rely on subword algorithms such as Byte Pair Encoding (BPE) or SentencePiece.
How Tokenization Works
Before a language model can process text, the text must be broken into tokens by a tokenizer. Modern tokenizers use subword algorithms such as Byte Pair Encoding, which balance vocabulary size against the ability to handle arbitrary text: common words become single tokens, while rare words are split into smaller pieces. For example, 'unhappiness' might become 'un' + 'happiness', or 'un' + 'happ' + 'iness'. Each token is then mapped to a numerical ID, and it is these IDs that the model actually processes.

The tokenizer's vocabulary and algorithm significantly affect model performance. Different models use different tokenizers, which is why the same text can have different token counts across models. Tokenization also handles special tokens that mark conversation structure, such as end-of-sequence markers.
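The merge-based idea behind Byte Pair Encoding can be sketched in plain Python. This is a toy illustration, not a production tokenizer: real systems such as tiktoken and SentencePiece work at the byte level and train on vastly larger corpora, and the tiny corpus and merge count below are invented for the example.

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merge rules from a toy whitespace-split corpus."""
    # Represent each word as a tuple of symbols, weighted by frequency.
    vocab = Counter()
    for word in corpus.split():
        vocab[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace the most frequent pair with a single merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def bpe_encode(word, merges):
    """Segment a new word by replaying the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols
```

Trained on a corpus like "low low low lower lowest", the frequent word 'low' collapses into a single token after a couple of merges, while an unseen word such as 'lowish' is split into the known piece 'low' plus leftover characters, mirroring the 'unhappiness' example above.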
Real-World Examples
OpenAI's tiktoken tokenizer splitting 'ChatGPT is great' into ['Chat', 'G', 'PT', ' is', ' great'] — 5 tokens
A multilingual tokenizer handling Japanese, Arabic, and English text efficiently with different subword splitting strategies
A developer using a tokenizer library to count tokens before sending a prompt to ensure it fits within the context window
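The third example can be sketched with a toy greedy longest-match tokenizer. The five-entry vocabulary, the `fits_context` helper, and the reserved-output budget are all invented for illustration; in practice you would count tokens with the model's own tokenizer library (for example, tiktoken for OpenAI models), since counts differ across models.

```python
def tokenize_greedy(text, vocab):
    """Toy tokenizer: at each position, take the longest vocabulary
    entry that matches, falling back to a single character."""
    tokens, i = [], 0
    max_len = max(len(v) for v in vocab)
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character becomes its own token
            i += 1
    return tokens

def fits_context(prompt, vocab, context_window, reserved_for_output=64):
    """Check that the prompt's token count leaves room for the reply."""
    return len(tokenize_greedy(prompt, vocab)) + reserved_for_output <= context_window
```

With the vocabulary {'Chat', 'G', 'PT', ' is', ' great'}, the sketch reproduces the five-token split of 'ChatGPT is great' shown above, and `fits_context` is the kind of pre-flight check a developer runs before sending a prompt.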