Throughput guide
Throughput is the number of requests a model can process within a given timeframe. In the AI21 Studio environment, we take care of all backend considerations, adjusting to fluctuations in traffic from all our users in real time. In Amazon SageMaker, however, you, the customer, are in charge of managing backend operations. The choice of which instances to deploy, and in what quantity, hinges on both the model's requirements and the specific characteristics of your use-case. In SageMaker, you pay per instance uptime, so the number of instances required to support your usage determines your bill. Keep in mind that higher throughput means fewer instances are needed.
With an estimate of the throughput for your use of AI21 models, you can easily predict your operational costs. Throughput directly influences cost calculations by determining the number of instances needed to manage your traffic during normal and peak times. For those who'd rather not manually adjust these parameters, SageMaker supports Auto Scaling to automatically match resources to traffic needs.
Think of the instance as a workhorse: knowing its throughput is like predicting how much cargo this workhorse can handle reliably. For example, consider the Jurassic-2 Mid model on a p4d.24xlarge instance, specifically for the use-case of 'Answering a question based on a help center article' which has a throughput of 448 RPM (Requests Per Minute). For peak traffic of approximately 1,200 requests per minute, you'd need to deploy 3 instances to handle this volume. Conversely, during quieter times with around 300 requests per minute, just 1 instance would suffice.
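The sizing arithmetic above can be sketched in a few lines: divide the expected request rate by the per-instance throughput and round up. The throughput figure below is the Jurassic-2 Mid / p4d.24xlarge number quoted in this guide.

```python
import math

# Per-instance throughput from this guide: Jurassic-2 Mid on p4d.24xlarge,
# for the 512-token prompt / 64-token completion use-case.
THROUGHPUT_RPM = 448

def instances_needed(requests_per_minute: int,
                     throughput_rpm: int = THROUGHPUT_RPM) -> int:
    # Each instance handles `throughput_rpm` requests per minute,
    # so round the ratio up to cover the full load.
    return math.ceil(requests_per_minute / throughput_rpm)

print(instances_needed(1200))  # peak traffic -> 3 instances
print(instances_needed(300))   # quiet period -> 1 instance
```

The same calculation applies to any model/instance pair: substitute the RPM value from the matching row in the tables below.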
To reach an instance's optimal throughput and maximize its productivity, send a sufficient number of concurrent requests to a real-time endpoint. AI21 plans to introduce Batch Transform endpoints; until then, you can approximate batching by sending a high volume of parallel requests to real-time endpoints.
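One way to keep many requests in flight is a simple thread pool. The sketch below is illustrative: the endpoint name and prompts are placeholders, and the `invoke` stub stands in for the real boto3 `sagemaker-runtime` `invoke_endpoint` call shown in its comments.

```python
from concurrent.futures import ThreadPoolExecutor

ENDPOINT_NAME = "j2-mid-endpoint"  # hypothetical endpoint name

def invoke(prompt: str) -> dict:
    # In a real deployment this would call the endpoint, e.g.:
    #   import boto3, json
    #   client = boto3.client("sagemaker-runtime")
    #   response = client.invoke_endpoint(
    #       EndpointName=ENDPOINT_NAME,
    #       ContentType="application/json",
    #       Body=json.dumps({"prompt": prompt, "maxTokens": 64}),
    #   )
    #   return json.loads(response["Body"].read())
    return {"prompt": prompt, "completion": "..."}  # stub for illustration

def run_batch(prompts: list[str], concurrency: int = 32) -> list[dict]:
    # Keep `concurrency` requests in flight at once so the instance
    # stays busy and approaches its optimal throughput.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(invoke, prompts))

results = run_batch([f"prompt {i}" for i in range(100)])
```

Tune `concurrency` empirically: raise it until latency degrades or throughput plateaus near the figures in the tables below.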
To determine the expected throughput for your specific scenario, assess the average length of your prompt and generated output in tokens. Identifying where these lengths fit in the following tables provides a clear indication of the ballpark you're in.
The detailed throughput data for the latest AI21 models available in SageMaker is presented next. Throughput is determined by both the model and the instance type, and the choice of instance type further hinges on the selected model and particular requirements related to context window length and latency. Note that the data in these tables corresponds to scenarios where the model produces a single output for each request (numResults=1).
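To locate the ballpark row for your workload programmatically, you can map your average token lengths to the nearest table entry. The dictionary below copies the Jurassic-2 Mid / p4d.24xlarge rows from this guide; the "<20" rows are represented by 20 as an upper bound, which is an assumption for illustration.

```python
# (prompt tokens, completion tokens) -> RPM, from the Jurassic-2 Mid
# p4d.24xlarge table in this guide. Rows listed as "<20" use 20 as an
# upper-bound placeholder.
J2_MID_P4D_RPM = {
    (7168, 512): 20,   # upper bound; table lists "<20"
    (1024, 128): 237,
    (512, 64): 448,
    (128, 1024): 20,   # upper bound; table lists "<20"
    (128, 128): 385,
}

def closest_row(prompt_tokens: int, completion_tokens: int) -> tuple:
    # Pick the table row whose token lengths best match the workload.
    return min(
        J2_MID_P4D_RPM,
        key=lambda row: abs(row[0] - prompt_tokens)
                        + abs(row[1] - completion_tokens),
    )

row = closest_row(600, 80)
print(row, J2_MID_P4D_RPM[row])  # -> (512, 64) 448
```

For a different model or instance type, build the same mapping from the corresponding table.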
Foundation models
Jurassic-2 Ultra
p4de.24xlarge
| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
|---|---|---|---|
| 7168 | 512 | Summarize several pages into a one-pager | <20 |
| 1024 | 128 | Summarize a news article into a long paragraph | 143 |
| 512 | 64 | Answer a question based on a help center article | 290 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 272 |
p4d.24xlarge
| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
|---|---|---|---|
| 7168 | 512 | Summarize several pages into a one-pager | <20 |
| 2048 | 256 | Summarize several pages into a half-pager | 41 |
| 1024 | 128 | Summarize a news article into a long paragraph | 133 |
| 512 | 64 | Answer a question based on a help center article | 291 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 220 |
g5.48xlarge
| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
|---|---|---|---|
| 1024 | 128 | Summarize a news article into a long paragraph | <20 |
| 512 | 64 | Answer a question based on a help center article | 27 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | <20 |
Jurassic-2 Mid
p4d.24xlarge
| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
|---|---|---|---|
| 7168 | 512 | Summarize several pages into a one-pager | <20 |
| 1024 | 128 | Summarize a news article into a long paragraph | 237 |
| 512 | 64 | Answer a question based on a help center article | 448 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 385 |
g5.48xlarge
| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
|---|---|---|---|
| 2048 | 256 | Summarize several pages into a half-pager | <20 |
| 1024 | 128 | Summarize a news article into a long paragraph | 34 |
| 512 | 64 | Answer a question based on a help center article | 76 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 107 |
g5.12xlarge
| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
|---|---|---|---|
| 1024 | 128 | Summarize a news article into a long paragraph | 50 |
| 512 | 64 | Answer a question based on a help center article | 113 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 200 |
g4dn.12xlarge
| Prompt (tokens) | Completion (tokens) | Example use-case | RPM |
|---|---|---|---|
| 1024 | 128 | Summarize a news article into a long paragraph | 33 |
| 512 | 64 | Answer a question based on a help center article | 76 |
| 128 | 1024 | Write a blog post | <20 |
| 128 | 128 | Paraphrase a paragraph | 127 |