Throughput guide
Throughput is the number of requests a model can process within a given timeframe. In the AI21 Studio environment, we take care of all backend considerations, adjusting to fluctuations in traffic from all our users in real time. In Amazon SageMaker, however, you, the customer, are in charge of managing backend operations. The choice of which instances to deploy, and in what quantity, hinges on both the model's requirements and the specific characteristics of your use-case. In SageMaker, you pay per instance uptime, so the number of instances required to support your usage determines your bill. Keep in mind that higher throughput means fewer instances are needed.
With an estimate of the throughput for your use of AI21 models, you can easily predict your operational costs. Throughput directly influences cost calculations by determining the number of instances needed to manage your traffic during normal and peak times. For those who'd rather not manually adjust these parameters, SageMaker supports Auto Scaling to automatically match resources to traffic needs.
Think of the instance as a workhorse: knowing its throughput is like predicting how much cargo this workhorse can handle reliably. For example, consider the Jurassic-2 Mid model on a p4d.24xlarge instance, specifically for the use-case of 'Answering a question based on a help center article' which has a throughput of 448 RPM (Requests Per Minute). For peak traffic of approximately 1,200 requests per minute, you'd need to deploy 3 instances to handle this volume. Conversely, during quieter times with around 300 requests per minute, just 1 instance would suffice.
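The sizing arithmetic above can be sketched in a few lines: divide the expected request rate by the per-instance throughput and round up. The throughput figure below is the Jurassic-2 Mid / p4d.24xlarge number quoted in this guide.

```python
import math

# Per-instance throughput from this guide: Jurassic-2 Mid on p4d.24xlarge,
# for the 512-token prompt / 64-token completion use-case.
THROUGHPUT_RPM = 448

def instances_needed(requests_per_minute: int,
                     throughput_rpm: int = THROUGHPUT_RPM) -> int:
    # Each instance handles `throughput_rpm` requests per minute,
    # so round the ratio up to cover the full load.
    return math.ceil(requests_per_minute / throughput_rpm)

print(instances_needed(1200))  # peak traffic -> 3 instances
print(instances_needed(300))   # quiet period -> 1 instance
```

The same calculation applies to any model/instance pair: substitute the RPM value from the matching row in the tables below.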
To reach an instance's optimal throughput and maximize its productivity, send a sufficient number of concurrent requests to a real-time endpoint. AI21 plans to introduce Batch Transform endpoints; until then, you can approximate batching by sending a high volume of parallel requests to real-time endpoints.
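One way to keep many requests in flight is a simple thread pool. The sketch below is illustrative: the endpoint name and prompts are placeholders, and the `invoke` stub stands in for the real boto3 `sagemaker-runtime` `invoke_endpoint` call shown in its comments.

```python
from concurrent.futures import ThreadPoolExecutor

ENDPOINT_NAME = "j2-mid-endpoint"  # hypothetical endpoint name

def invoke(prompt: str) -> dict:
    # In a real deployment this would call the endpoint, e.g.:
    #   import boto3, json
    #   client = boto3.client("sagemaker-runtime")
    #   response = client.invoke_endpoint(
    #       EndpointName=ENDPOINT_NAME,
    #       ContentType="application/json",
    #       Body=json.dumps({"prompt": prompt, "maxTokens": 64}),
    #   )
    #   return json.loads(response["Body"].read())
    return {"prompt": prompt, "completion": "..."}  # stub for illustration

def run_batch(prompts: list[str], concurrency: int = 32) -> list[dict]:
    # Keep `concurrency` requests in flight at once so the instance
    # stays busy and approaches its optimal throughput.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(invoke, prompts))

results = run_batch([f"prompt {i}" for i in range(100)])
```

Tune `concurrency` empirically: raise it until latency degrades or throughput plateaus near the figures in the tables below.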
To determine the expected throughput for your specific scenario, assess the average length of your prompt and generated output in tokens. Identifying where these lengths fit in the following tables provides a clear indication of the ballpark you're in.
The detailed throughput data for the latest AI21 models available in SageMaker is presented next. Throughput is determined by both the model and the instance type, and the choice of instance type further hinges on the selected model and particular requirements related to context window length and latency. Note that the data in these tables corresponds to scenarios where the model produces a single output for each request (numResults=1).
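To locate the ballpark row for your workload programmatically, you can map your average token lengths to the nearest table entry. The dictionary below copies the Jurassic-2 Mid / p4d.24xlarge rows from this guide; the "<20" rows are represented by 20 as an upper bound, which is an assumption for illustration.

```python
# (prompt tokens, completion tokens) -> RPM, from the Jurassic-2 Mid
# p4d.24xlarge table in this guide. Rows listed as "<20" use 20 as an
# upper-bound placeholder.
J2_MID_P4D_RPM = {
    (7168, 512): 20,   # upper bound; table lists "<20"
    (1024, 128): 237,
    (512, 64): 448,
    (128, 1024): 20,   # upper bound; table lists "<20"
    (128, 128): 385,
}

def closest_row(prompt_tokens: int, completion_tokens: int) -> tuple:
    # Pick the table row whose token lengths best match the workload.
    return min(
        J2_MID_P4D_RPM,
        key=lambda row: abs(row[0] - prompt_tokens)
                        + abs(row[1] - completion_tokens),
    )

row = closest_row(600, 80)
print(row, J2_MID_P4D_RPM[row])  # -> (512, 64) 448
```

For a different model or instance type, build the same mapping from the corresponding table.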
Foundation models
Jurassic-2 Ultra
p4de.24xlarge
| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
|---|---|---|---|
| 7168 | 512 | Summarize several pages into a one-pager | <20 |
| 1024 | 128 | Summarize a news article into a long paragraph | 143 |
| 512 | 64 | Answer a question based on a help center article | 290 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 272 |
p4d.24xlarge
| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
|---|---|---|---|
| 7168 | 512 | Summarize several pages into a one-pager | <20 |
| 2048 | 256 | Summarize several pages into a half-pager | 41 |
| 1024 | 128 | Summarize a news article into a long paragraph | 133 |
| 512 | 64 | Answer a question based on a help center article | 291 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 220 |
g5.48xlarge
| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
|---|---|---|---|
| 1024 | 128 | Summarize a news article into a long paragraph | <20 |
| 512 | 64 | Answer a question based on a help center article | 27 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | <20 |
Jurassic-2 Mid
p4d.24xlarge
| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
|---|---|---|---|
| 7168 | 512 | Summarize several pages into a one-pager | <20 |
| 1024 | 128 | Summarize a news article into a long paragraph | 237 |
| 512 | 64 | Answer a question based on a help center article | 448 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 385 |
g5.48xlarge
| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
|---|---|---|---|
| 2048 | 256 | Summarize several pages into a half-pager | <20 |
| 1024 | 128 | Summarize a news article into a long paragraph | 34 |
| 512 | 64 | Answer a question based on a help center article | 76 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 107 |
g5.12xlarge
| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
|---|---|---|---|
| 1024 | 128 | Summarize a news article into a long paragraph | 50 |
| 512 | 64 | Answer a question based on a help center article | 113 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 200 |
g4dn.12xlarge
| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
|---|---|---|---|
| 1024 | 128 | Summarize a news article into a long paragraph | 33 |
| 512 | 64 | Answer a question based on a help center article | 76 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 127 |