Throughput guide

Throughput is the number of requests a model can process within a given timeframe. In AI21 Studio, we handle all backend considerations and adjust to fluctuations in traffic from all our users in real time. In Amazon SageMaker, however, you, the customer, manage backend operations. The choice of which instances to deploy, and how many, depends on both the model's requirements and the characteristics of your use-case. In SageMaker you pay for instance uptime, so the number of instances needed to support your usage determines your bill. Keep in mind that higher throughput means fewer instances are needed.

With an estimate of the throughput for your use of AI21 models, you can predict your operational costs. Throughput drives the cost calculation because it determines the number of instances needed to handle your traffic during normal and peak times. If you'd rather not adjust the instance count manually, SageMaker supports Auto Scaling, which automatically matches resources to traffic.
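As a rough illustration of how throughput feeds into cost, the sketch below estimates a daily bill from a traffic profile. The hourly price and the split of the day into peak and quiet hours are hypothetical placeholders, not real SageMaker prices or measured traffic; the function name `daily_cost` is also an assumption made for this example.

```python
import math

def daily_cost(windows, throughput_rpm, hourly_price_usd):
    """Estimate one day's instance cost.

    windows: list of (traffic_rpm, hours) pairs that together cover the day.
    throughput_rpm: requests per minute one instance handles for your use-case.
    hourly_price_usd: price of one instance-hour (placeholder value below).
    """
    if throughput_rpm <= 0:
        raise ValueError("throughput_rpm must be positive")
    total = 0.0
    for traffic_rpm, hours in windows:
        # Round up: a fraction of an instance cannot be deployed.
        instances = max(1, math.ceil(traffic_rpm / throughput_rpm))
        total += instances * hours * hourly_price_usd
    return total

# Hypothetical profile: 4 peak hours at 1,200 RPM, 20 quiet hours at 300 RPM.
HOURLY_PRICE_USD = 10.0  # placeholder, not an actual instance price
print(daily_cost([(1200, 4), (300, 20)], 448, HOURLY_PRICE_USD))
```

Substitute the throughput from the table that matches your use-case and the current price of your instance type to get a usable estimate.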

Think of the instance as a workhorse: knowing its throughput is like knowing how much cargo the workhorse can reliably carry. For example, consider the Jurassic-2 Mid model on a p4d.24xlarge instance for the use-case 'Answer a question based on a help center article', which has a throughput of 448 RPM (requests per minute). For peak traffic of approximately 1,200 requests per minute, you'd need to deploy 3 instances (1,200 / 448 ≈ 2.7, rounded up). During quieter times with around 300 requests per minute, just 1 instance would suffice.
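The sizing arithmetic in this example can be sketched as a small helper. The `target_utilization` parameter is an assumption added for illustration: it lets you leave headroom below the measured throughput, and it defaults to 1.0 so that the numbers match the example above.

```python
import math

def instances_needed(traffic_rpm, throughput_rpm, target_utilization=1.0):
    """Smallest instance count whose combined throughput covers the traffic.

    target_utilization is a hypothetical headroom knob: 0.8 means each
    instance is planned at 80% of its measured throughput.
    """
    if throughput_rpm <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("throughput must be positive and utilization in (0, 1]")
    usable_rpm = throughput_rpm * target_utilization
    return max(1, math.ceil(traffic_rpm / usable_rpm))

# Jurassic-2 Mid on p4d.24xlarge, help-center Q&A use-case: 448 RPM
print(instances_needed(1200, 448))  # peak traffic
print(instances_needed(300, 448))   # quiet period
```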

To reach an instance's optimal throughput and maximize its productivity, send a sufficient number of concurrent requests to a real-time endpoint. AI21 plans to introduce Batch Transform endpoints. Until then, you can simulate batching by sending a high volume of parallel requests to real-time endpoints.
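One way to keep many requests in flight is a thread pool. The sketch below is a minimal illustration, not AI21's client: `send_concurrently` and `fake_send` are hypothetical names, and `fake_send` stands in for a real call to your endpoint (for example through the `invoke_endpoint` operation of boto3's `sagemaker-runtime` client, with an endpoint name and request body that depend on your deployment). The worker count of 16 is an arbitrary starting point to tune.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

def send_concurrently(send: Callable[[str], str],
                      prompts: Iterable[str],
                      max_workers: int = 16) -> List[str]:
    """Call send(prompt) for every prompt from a pool of worker threads.

    Results are returned in the same order as the input prompts.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))

def fake_send(prompt: str) -> str:
    # Stand-in for a real endpoint call; replace with your own request code.
    return prompt.upper()

results = send_concurrently(fake_send, ["a", "b", "c"])
print(results)
```

Threads suit this workload because each request spends most of its time waiting on the network rather than using the CPU.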

To determine the expected throughput for your specific scenario, assess the average length of your prompt and generated output in tokens. Identifying where these lengths fit in the following tables provides a clear indication of the ballpark you're in.
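A simple way to place your averages in a table is to pick the smallest row whose prompt and completion lengths both cover them, which gives a conservative estimate. The sketch below does this for the Jurassic-2 Mid figures on p4d.24xlarge copied from this guide; the function name `expected_rpm` and the "smallest covering row" rule are assumptions for illustration, and RPM values stay as strings so that "<20" is preserved.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, int, str]  # (prompt tokens, completion tokens, RPM)

# Jurassic-2 Mid on p4d.24xlarge, as listed in this guide.
J2_MID_P4D: List[Row] = [
    (7168, 512, "<20"),
    (1024, 128, "237"),
    (512, 64, "448"),
    (128, 1024, "<20"),
    (128, 128, "385"),
]

def expected_rpm(avg_prompt: float, avg_completion: float,
                 table: List[Row] = J2_MID_P4D) -> Optional[str]:
    """Return the RPM of the smallest row covering both average lengths.

    Returns None when no row in the table is large enough.
    """
    covering = [row for row in table
                if row[0] >= avg_prompt and row[1] >= avg_completion]
    if not covering:
        return None
    # "Smallest" here means the lowest total token count.
    best = min(covering, key=lambda row: row[0] + row[1])
    return best[2]

print(expected_rpm(400, 50))   # fits the 512/64 row
print(expected_rpm(100, 200))  # long completions fall in the 128/1024 row
```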

The detailed throughput data for the latest AI21 models available in SageMaker is presented next. Throughput is determined by both the model and the instance type, and the choice of instance type further hinges on the selected model and particular requirements related to context window length and latency. Note that the data in these tables corresponds to scenarios where the model produces a single output for each request (numResults=1).

Foundation models

Jurassic-2 Ultra

p4de.24xlarge

| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
| --- | --- | --- | --- |
| 7168 | 512 | Summarize several pages into a one-pager | <20 |
| 1024 | 128 | Summarize a news article into a long paragraph | 143 |
| 512 | 64 | Answer a question based on a help center article | 290 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 272 |

p4d.24xlarge

| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
| --- | --- | --- | --- |
| 7168 | 512 | Summarize a few pages into a one-pager | <20 |
| 2048 | 256 | Summarize a few pages into a half-pager | 41 |
| 1024 | 128 | Summarize a news article into a long paragraph | 133 |
| 512 | 64 | Answer a question based on a help center article | 291 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 220 |

g5.48xlarge

| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
| --- | --- | --- | --- |
| 1024 | 128 | Summarize a news article into a long paragraph | <20 |
| 512 | 64 | Answer a question based on a help center article | 27 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | <20 |

Jurassic-2 Mid

p4d.24xlarge

| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
| --- | --- | --- | --- |
| 7168 | 512 | Summarize several pages into a one-pager | <20 |
| 1024 | 128 | Summarize a news article into a long paragraph | 237 |
| 512 | 64 | Answer a question based on a help center article | 448 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 385 |

g5.48xlarge

| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
| --- | --- | --- | --- |
| 2048 | 256 | Summarize a few pages into a half-pager | <20 |
| 1024 | 128 | Summarize a news article into a long paragraph | 34 |
| 512 | 64 | Answer a question based on a help center article | 76 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 107 |

g5.12xlarge

| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
| --- | --- | --- | --- |
| 1024 | 128 | Summarize a news article into a long paragraph | 50 |
| 512 | 64 | Answer a question based on a help center article | 113 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 200 |

g4dn.12xlarge

| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
| --- | --- | --- | --- |
| 1024 | 128 | Summarize a news article into a long paragraph | 33 |
| 512 | 64 | Answer a question based on a help center article | 76 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 127 |