Large Language Models

Introduction to the core of our product

This page describes autoregressive LLMs (large language models).

An autoregressive LLM is a neural network model with billions of parameters. LLMs are trained on a vast collection of texts, with the goal of predicting the next word based on the input text. During training, the model predicts each word of a given text in turn, and its success is measured one word at a time. The training data includes sentences, paragraphs, articles, books, and more.
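To make "measured one word at a time" concrete, here is a minimal sketch of the next-token prediction objective. The toy probability table and the token sequences are made up for illustration; a real LLM computes these probabilities with billions of learned parameters and averages the loss over enormous amounts of text.

```python
import math

# Toy "model": given the tokens so far, return a probability for each candidate
# next token. A real LLM computes this with billions of learned parameters;
# here it is a hard-coded lookup purely for illustration.
def toy_next_token_probs(context):
    table = {
        ("I",): {"will": 0.6, "am": 0.3, "banana": 0.1},
        ("I", "will"): {"survive": 0.55, "always": 0.25, "be": 0.2},
    }
    return table.get(tuple(context), {})

def next_token_loss(tokens):
    """Average cross-entropy: how surprised the model is by each actual next token."""
    total = 0.0
    for i in range(1, len(tokens)):
        probs = toy_next_token_probs(tokens[:i])
        p = probs.get(tokens[i], 1e-9)  # tiny floor so log() is always defined
        total += -math.log(p)           # low probability -> high loss
    return total / (len(tokens) - 1)

print(next_token_loss(["I", "will", "survive"]))  # low loss: an expected continuation
print(next_token_loss(["I", "will", "banana"]))   # high loss: a surprising continuation
```

Training nudges the model's parameters to lower this loss, which is another way of saying the model gets better at predicting what comes next.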

Breaking down the language into tokens

About tokens

For various reasons, input and output text is broken into pieces called tokens; any language can be broken down this way. Each token has an associated list of weights, called a vector, which is what is ultimately fed into the model.

Each value (weight) in the vector represents the strength of a specific aspect of the token: for example, "blueness," "age-ness," "familial relationship-ness," or some other aspect that doesn't have a simple name but has meaning to the model. Each token has over a thousand weights, and the same position in the vector represents the same aspect for every token. Taken as a whole, a token's vector represents a kind of understanding of that token. The model can then compare the vectors of different tokens for relatedness, which helps it determine which tokens typically follow a given token. (See the Learn more section at the end of this page for resources on how this works.)
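As a hedged illustration of "comparing tokens for relatedness," the sketch below uses made-up four-dimensional vectors (real models use over a thousand dimensions, as noted above) and cosine similarity, one common way to measure how closely two vectors point in the same direction.

```python
import math

# Toy embeddings: each token maps to a vector of weights. The tokens, the
# four dimensions, and the values are all invented for this example.
embeddings = {
    "king":  [0.9, 0.8, 0.1, 0.7],
    "queen": [0.9, 0.8, 0.3, 0.7],
    "apple": [0.1, 0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Close to 1.0 when two vectors point the same way (closely related tokens)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: related tokens
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # lower: unrelated tokens
```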

Each model has its own dictionary of tokens. Each text in the input will be decomposed into these tokens, and every text generated by the model will be composed of them.

Token dictionary

AI21 Studio uses a large token dictionary (250K entries), which contains tokens for individual characters, whole words, word parts such as prefixes and suffixes, and multi-word phrases. For example, in our current tokenizer, the phrase "I want to break free." is split into the following tokens:

['▁I▁want▁to', '▁break', '▁free', '.']

The underscores (▁) stand in for whitespace in the dictionary, and are stripped out when the response is generated. Note how the first few words are combined into a single token as a semantic unit, while the last two words are not. Also note that the final punctuation mark is its own token, and that a whitespace mark appears at the start of the sentence, because the tokenizer adds a dummy space at the beginning of each line for consistency.
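The sketch below shows one simple way a tokenizer could produce this split: greedy longest-match against a tiny, made-up vocabulary. It is not the actual AI21 tokenizer (real tokenizers are trained on large corpora and use more sophisticated algorithms), but the principle of preferring longer dictionary entries is the same.

```python
# Tiny, made-up vocabulary containing the tokens used in the example above.
vocab = {"▁I▁want▁to", "▁break", "▁free", "▁I", "▁want", "▁to", "."}

def tokenize(text):
    # Add the dummy leading space and replace whitespace with ▁, as described above.
    s = "▁" + text.replace(" ", "▁")
    tokens, i = [], 0
    while i < len(s):
        # Prefer the longest dictionary entry that matches the remaining text.
        for j in range(len(s), i, -1):
            if s[i:j] in vocab:
                tokens.append(s[i:j])
                i = j
                break
        else:
            tokens.append(s[i])  # unknown character: emit it as its own token
            i += 1
    return tokens

print(tokenize("I want to break free."))
# ['▁I▁want▁to', '▁break', '▁free', '.']
```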

Choosing the next word (token)

How does an LLM choose the next token? Will it always choose the same token for a given sequence, or will it choose something different every time?

Given a prompt (the input to the model), the model considers every token in its dictionary as a possible next token, calculating the probability that each one appears next. It then narrows this down to a pool of the most likely candidates and chooses one.

Let's understand this with a simple example. Suppose the prompt is the phrase "I will", and consider the model's distribution over all possible next tokens (assuming a very small dictionary, just for the demonstration).

In this example, there is a 55% chance that the model will choose the completion "survive", a 25% chance of the completion "always love you", and so on. In other words, if we gave the model this prompt 100 times, we would expect to get "survive" about 55 times, "always love you" about 25 times, and so on.
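The sketch below samples from a distribution like the one just described. Only the first two probabilities come from the example above; the remaining completions and their probabilities are invented so that the distribution sums to 1.

```python
import random

# Illustrative next-token distribution for the prompt "I will". The last two
# entries are made up for this sketch.
distribution = {
    "survive": 0.55,
    "always love you": 0.25,
    "be back": 0.15,
    "sleep": 0.05,
}

completions = list(distribution)
weights = list(distribution.values())

# Sample five completions; "survive" should come up a little over half the time.
for _ in range(5):
    print(random.choices(completions, weights=weights, k=1)[0])
```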

You can give the model some guidance on choosing the next token by adjusting the temperature (which controls how much likelihood lower-ranked tokens receive, and therefore how variable the output is) or by adjusting TopP (which cuts off the tail of less likely tokens, reducing the pool size).

Using Temperature to influence the next token choice

The most popular sampling parameter to play with is the temperature. The name is inspired by thermodynamics: low temperature means stagnation, high temperature means movement. The temperature value reshapes the distribution: at a low temperature, the most probable token gets even more weight. At a temperature of 0, the distribution collapses entirely onto the single most probable token.

Namely, the model will always sample the same token at every call.

A high temperature does the exact opposite: the distribution flattens toward uniform, meaning all tokens end up with roughly the same probability.

So if you care about consistency and always want the most likely completion (for example, in classification tasks), go with temperature=0. If you want more creativity (usually useful in text generation tasks), try a higher temperature (we recommend starting with 0.7).
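Here is a minimal sketch of how temperature reshapes the next-token distribution, reusing the made-up distribution from the sampling sketch above. Exact formulas and accepted parameter ranges vary between implementations; this version is equivalent to applying softmax(logits / temperature) when the logits are log-probabilities.

```python
def apply_temperature(probs, temperature):
    """Re-weight a next-token distribution. Low temperature sharpens it toward
    the most likely token; high temperature flattens it toward uniform."""
    if temperature == 0:
        # Degenerate case: all probability mass on the single most likely token.
        best = max(probs, key=probs.get)
        return {tok: (1.0 if tok == best else 0.0) for tok in probs}
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: v / total for tok, v in scaled.items()}

probs = {"survive": 0.55, "always love you": 0.25, "be back": 0.15, "sleep": 0.05}
print(apply_temperature(probs, 0))    # only "survive" remains possible
print(apply_temperature(probs, 0.3))  # "survive" dominates even more strongly
print(apply_temperature(probs, 2.0))  # probabilities move toward uniform
```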

Using TopP to influence the next token choice

TopP limits the number of tokens in the pool of candidate next tokens. It works like this: take the most likely token, add it to the pool, and add its probability to a running total; keep adding tokens in order of likelihood until the total reaches TopP. This removes the "tail" of tokens with the lowest probabilities. In our example, with TopP=0.95, the lowest-probability completions are dropped from the pool and we sample only from the remaining tokens.
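Below is a sketch of the TopP cutoff (often called nucleus sampling), again reusing the made-up distribution from the earlier sketches. Real implementations may differ in details such as tie-breaking, but the idea of trimming the low-probability tail is the same.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches
    top_p, then renormalize; the low-probability "tail" is discarded."""
    pool, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        pool[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(pool.values())
    return {tok: p / total for tok, p in pool.items()}

probs = {"survive": 0.55, "always love you": 0.25, "be back": 0.15, "sleep": 0.05}
print(top_p_filter(probs, 0.95))
# "sleep" falls outside the top 95% and is removed; the rest are rescaled to sum to 1.
```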

📘

Temperature or TopP?

Here's some guidance on when to adjust temperature, TopP, or both.

Knowing when to stop generating tokens

Language models are trained to generate text token by token until they reach an answer that seems to fit the prompt. Just as they learn how to generate text, they are also trained to know when to stop. The tokenizer includes a special end-of-message token ([eom]) that signals that a reasonable generation should end. During pre-training, the model sees text items that end with this token, so it learns to end sequences appropriately.

You can provide additional guidance to the model about when to stop generating text:

  • Specify length guidelines in the prompt. (Best practice) For example: "Write a sentence or two about...", "Write a paragraph about...", "Do not exceed twenty words." These are soft guidelines: the LLM will respect them within reason while still trying to generate a sensible, good-quality answer. That means if you specify a 10-word limit, the model might give you anywhere from five to fifteen (or so) words.
  • Specify a max_tokens parameter. (Good practice) Set max_tokens as a backstop against the rare cases where the model goes astray and generates a very long answer, which is more likely at very high temperatures. Set this value well above your required maximum response length, purely as a precaution.
  • Set a stop sequence. (Special use cases) You can provide a list of stop sequences that tell the model to stop generating as soon as its output contains any of them. For example, [".", "###", "\n"] means the model should stop as soon as it generates a period, a ### mark, or a newline. Stop sequences are needed only in special circumstances, but unlike prompt guidelines they are respected exactly (see the sketch after this list).
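To see how these three mechanisms fit together, here is a hedged sketch of a decoding loop. The function generate_next_token is a hypothetical stand-in for the model, and the parameter names mirror the descriptions above rather than any specific API.

```python
def generate(prompt, generate_next_token, max_tokens=200, stop_sequences=()):
    """Illustrative decoding loop: generation ends when the model emits its
    end-of-message token, when max_tokens is reached, or when the output ends
    with any user-provided stop sequence."""
    text = ""
    for _ in range(max_tokens):                     # max_tokens acts as a hard safety cap
        token = generate_next_token(prompt + text)  # hypothetical model call
        if token == "[eom]":                        # the model decides it is done
            break
        text += token
        if any(text.endswith(stop) for stop in stop_sequences):
            break                                   # a stop sequence was generated
    return text

# Example: stop at the first period, ### marker, or newline.
# generate("Write a tagline for ...", model_fn, max_tokens=50, stop_sequences=[".", "###", "\n"])
```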

📘

Added value: knowledge acquisition

If you read all of Shakespeare's works over and over as a way to learn a language, over time you wouldn't just be able to recall his plays and poems; you would also be able to imitate his distinctive writing style.

Similarly, because our LLMs have been fed a multitude of textual sources, they have developed a comprehensive "understanding" of English and a broad base of general "knowledge." This is called an emergent property: a capability that appears as a side effect of training rather than as an explicit goal.

Interacting with Large Language Models

LLMs are queried using natural language. Rather than writing lines of code and loading a model, you write a natural language prompt and pass it to the model as the input. For example:

>Tell me a joke about a large language model that thinks it is a computer.

What do you get when you cross a dinosaur with a large language model? A really smart thesaurus.

Maybe not ready for Vegas, but still pretty impressive, no?
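In practice, "passing the prompt to the model" usually means sending it to an API endpoint over HTTP. The sketch below uses a placeholder URL, credential, and payload shape purely for illustration; consult the AI21 Studio API reference for the actual endpoint and parameter names.

```python
import requests  # third-party HTTP client library

API_URL = "https://example.com/v1/complete"  # placeholder URL, not the real endpoint
API_KEY = "YOUR_API_KEY"                     # placeholder credential

prompt = "Tell me a joke about a large language model that thinks it is a computer."

# Hypothetical payload: the prompt plus the sampling parameters discussed above.
response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "prompt": prompt,
        "max_tokens": 100,     # safety cap on completion length
        "temperature": 0.7,    # a bit of creativity, as recommended above
    },
)
print(response.json())
```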

Learn more

There are many places to learn more about machine learning and large language models on the internet. Here are a few:

  • Intro to large language models video by Andrej Karpathy. This is a 1 hour general-audience introduction to Large Language Models. Karpathy is a legend in the LLM community for his work in the field and for his excellent presentations on machine learning.
  • Neural networks video series on YouTube by 3Blue1Brown is an excellent explanation of how neural networks, including LLMs, work. It can be fairly technical in parts, but watching it a few times is a very good way to understand how an LLM is trained and how it generates results. No coding at all.
  • Hugging Face self-paced NLP course. Hugging Face is a community and online repository for machine learning projects. The free course describes some basics about how machine learning works, and describes how to use Hugging Face architecture to train your own models or use user-contributed models.