Tokenizer & Tokenization

Now that you know what large language models are, you must be wondering: “How does a model use text as input and output?”

The answer is tokenization.

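Any language can be broken down into basic pieces (in our case, tokens). Each of those pieces is translated into its own vector representation, which is eventually fed into the model. For example, the sentence "I want to break free" can be broken into the pieces "I", "want", "to", "break", and "free".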

Each model has its own dictionary of tokens, which determines the language it "speaks". Each text in the input will be decomposed into these tokens, and every text generated by the model will be composed of them.
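To make the idea of a token dictionary concrete, here is a minimal sketch. The vocabulary and its IDs are made up for illustration and are not taken from any real model; a real dictionary is far larger.

```python
# Toy token dictionary: hypothetical tokens and IDs, for illustration only.
vocab = {"I": 0, "want": 1, "to": 2, "break": 3, "free": 4}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def encode(tokens):
    """Map each token to its ID in the dictionary."""
    return [vocab[token] for token in tokens]


def decode(ids):
    """Map each ID back to its token."""
    return [id_to_token[token_id] for token_id in ids]


ids = encode(["I", "want", "to", "break", "free"])
print(ids)          # [0, 1, 2, 3, 4]
print(decode(ids))  # ['I', 'want', 'to', 'break', 'free']
```

The model only ever sees the IDs (and the vectors looked up from them), which is why anything outside the dictionary cannot be read or written directly.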

But how do we break down a language? Which pieces are we choosing as our tokens? There are several approaches to solving this:


Character-level tokenization

As a simple solution, each character can be treated as its own token. By doing so, we can represent the entire English language with just 26 letters (okay, double it for capital letters and add some punctuation). This would give us a small token dictionary, which keeps the table of token vectors small and saves us some valuable memory. However, those tokens don’t have any inherent meaning: we all know what the meaning of “Cat” is, but what is the meaning of “C”? The key to understanding language is context. Although it is clear to us readers that a "Cat" and a "Cradle" have different meanings, for a language model with this tokenizer, the "C" token is exactly the same in both.
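The trade-off can be seen in a few lines. This is a minimal sketch of character-level tokenization; the function name is chosen here for illustration.

```python
def char_tokenize(text):
    """Treat every character as its own token."""
    return list(text)


cat = char_tokenize("Cat")
cradle = char_tokenize("Cradle")
print(cat)     # ['C', 'a', 't']
print(cradle)  # ['C', 'r', 'a', 'd', 'l', 'e']

# Both words start with the very same token, even though they mean different things.
print(cat[0] == cradle[0])  # True

# The dictionary needed to cover both words stays tiny.
char_vocab = sorted(set(cat + cradle))
print(len(char_vocab))  # 7
```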


Word-level tokenization

Another approach we can try is breaking our text into words, just like in the example above ("I want to break free").

Now, every token has a meaning that the model can learn and use. We are gaining meaning, but that requires a much larger dictionary. Also, it raises another question: what about words stemming from the same root word, like “helped”, “helping”, and “helpful”? In this approach, each of these words will get a different token with no inherent relation between them, whereas for us readers, it's clear that they all have a similar meaning.
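As a rough sketch of this approach, the snippet below splits text into words and punctuation with a simple regular expression and assigns each new token the next free ID. The splitting rule and the ID scheme are assumptions made for this example, not a description of any particular tokenizer.

```python
import re


def word_tokenize(text):
    """Split text into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(tokens):
    """Assign each distinct token the next unused integer ID."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


print(word_tokenize("I want to break free"))
# ['I', 'want', 'to', 'break', 'free']

tokens = word_tokenize("She helped, he is helping, and they are helpful.")
vocab = build_vocab(tokens)

# Three related words, three unrelated IDs: nothing ties them to "help".
print(vocab["helped"], vocab["helping"], vocab["helpful"])  # 1 5 9
```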

Furthermore, words may have fundamentally different meanings when strung together. For instance, in "my run-down car isn't running anywhere", the phrase "run-down" has little to do with running. What if we went a step further?


Sentence-level tokenization

In this approach, we break our text into sentences. This will capture meaningful phrases! However, this would result in an absurdly large dictionary, with some tokens being so rare that we would require an enormous amount of data to teach the model the meaning of each token.
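The sparsity problem shows up even in a tiny text. The sketch below uses a naive sentence splitter (an assumption for this example; real sentence splitting is harder) and compares how often sentence tokens and word tokens repeat.

```python
import re
from collections import Counter


def sentence_tokenize(text):
    """Naively split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


text = "The cat sat. The cat slept. The cat sat on the mat. The dog sat."

sentence_counts = Counter(sentence_tokenize(text))
word_counts = Counter(re.findall(r"\w+", text))

# Every sentence token is seen exactly once, so there is little to learn from.
print(len(sentence_counts), max(sentence_counts.values()))  # 4 1

# Word tokens repeat, giving the model several examples of each.
print(word_counts["cat"], word_counts["sat"])  # 3 3
```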


Which is best?

Each method has pros and cons, and like any real-life problem, the best solution involves a number of compromises. AI21 Studio uses a large token dictionary (250K tokens), which contains some elements from every method: separate characters, words, word parts (such as prefixes and suffixes), and many multi-word tokens. For example, in our current tokenizer, the phrase "I want to break free." is split into the following tokens:

['▁I▁want▁to', '▁break', '▁free', '.']

The "▁" characters, which look like underscores, are used to replace whitespace in the string. Note how the first few words are combined as a semantic unit into a single token, but the last two words are not. Also note that the final punctuation mark is its own token, and a whitespace mark is added to the start of the sentence, because the tokenizer adds a dummy space at the start of each line for consistency.
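To illustrate how a mixed dictionary can produce a split like this, here is a toy sketch that uses greedy longest-match lookup. Both the tiny vocabulary and the greedy matching rule are assumptions made for this example; they are not the actual AI21 Studio dictionary or algorithm.

```python
MARK = "\u2581"  # the "▁" whitespace marker

# Hypothetical vocabulary mixing single words, punctuation and a multi-word token.
toy_vocab = {"▁I▁want▁to", "▁I", "▁want", "▁to", "▁break", "▁free", "."}


def tokenize(text, vocab):
    """Greedy longest-match tokenization over a mixed vocabulary (toy sketch)."""
    # Add the dummy leading space, then replace every space with the marker.
    s = MARK + text.replace(" ", MARK)
    max_len = max(len(token) for token in vocab)
    tokens, i = [], 0
    while i < len(s):
        # Try the longest candidate first and shrink until something matches.
        for j in range(min(len(s), i + max_len), i, -1):
            if s[i:j] in vocab:
                tokens.append(s[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            tokens.append(s[i])
            i += 1
    return tokens


def detokenize(tokens):
    """Join tokens, restore spaces and drop the dummy leading space."""
    return "".join(tokens).replace(MARK, " ")[1:]


tokens = tokenize("I want to break free.", toy_vocab)
print(tokens)              # ['▁I▁want▁to', '▁break', '▁free', '.']
print(detokenize(tokens))  # I want to break free.
```

Because the longest entry wins, "▁I▁want▁to" is chosen over the shorter "▁I", "▁want" and "▁to", while "break" and "free" stay separate simply because this toy dictionary has no multi-word entry covering them.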