A majority of companies I have worked with have never run into AWS service limits for Step Functions. However, when a Fortune 1000 company came to Serverless Guru asking us to build a product that would be used for millions of customer transactions per day, I ran the numbers to see what the service limits were like in September 2020 since AWS sometimes improves limits.
AWS service limits, officially known as service quotas since June 2019, can throttle a component of your app. Planning ahead and then monitoring your usage against those limits, with a solution like the Limit Monitor and the Service Quotas dashboard in the AWS console, is so critical for reliability that it’s the first question on reliability in the AWS Well-Architected Tool.
TLDR. I go through a few numbers and conclude AWS Lambda (without Step Functions) is a safer bet than Step Functions or Fargate from a service limit perspective when faced with millions of unevenly distributed events per day.
Step Functions Limits, September 2020
In a scenario where nearly a dozen external systems interoperate on an event-driven basis and reliability is essential, AWS Step Functions is appealing for its ability to kick off both Lambda functions and containerized tasks in Fargate and its built-in error-handling tools (such as the Catch field).
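To make the Catch field concrete, here is a minimal sketch of a state machine definition in Amazon States Language, built as a Python dictionary and serialized to JSON. The Lambda ARN, the state names, and the error name are hypothetical placeholders, not taken from the project described here.

```python
import json

# Hypothetical ARN, for illustration only.
PROCESS_EVENT_ARN = "arn:aws:lambda:us-east-1:123456789012:function:process-event"

definition = {
    "Comment": "Sketch: one Lambda task with a catch-all fallback state",
    "StartAt": "ProcessEvent",
    "States": {
        "ProcessEvent": {
            "Type": "Task",
            "Resource": PROCESS_EVENT_ARN,
            # Catch routes any error raised by the task to a fallback state
            # and stores the error details under $.error in the state input.
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "ResultPath": "$.error",
                    "Next": "HandleFailure",
                }
            ],
            "End": True,
        },
        # Hypothetical fallback: ends the execution with a named failure.
        "HandleFailure": {
            "Type": "Fail",
            "Error": "ProcessEventFailed",
            "Cause": "The ProcessEvent task raised an error",
        },
    },
}

definition_json = json.dumps(definition, indent=2)
print(definition_json)
```

The resulting JSON string is the kind of document you would supply as the state machine definition; a real workflow would likely send failures to a queue or notification step rather than a bare Fail state.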
But wait, are those choices really all at my disposal when working at the scale of 1 million events per day that are not evenly distributed over time? Time to check AWS’s service quotas.
Step Functions allows 300 new executions per second in N. Virginia, Oregon, and Ireland and 150 per second in all other regions. If we scale to 1 million events per day, that averages out to 11.57 events per second, well under either limit. But we can’t bank on the events being evenly distributed. A hotspot during the day that clusters events at a little over an order of magnitude above the average (about 13 times in most regions, about 26 times in the three regions with the higher limit) would exceed the limit, and new executions would be throttled.
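The arithmetic behind that comparison can be sketched in a few lines of Python. The function names are mine; the limits are the September 2020 figures quoted in this section.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(events_per_day):
    """Average events per second if the load were perfectly even."""
    return events_per_day / SECONDS_PER_DAY


def burst_headroom(events_per_day, limit_per_second):
    """How many times above the daily average a burst can go before hitting the limit."""
    return limit_per_second / average_rate(events_per_day)


EVENTS_PER_DAY = 1_000_000
print(f"average: {average_rate(EVENTS_PER_DAY):.2f} events/s")

for region_group, limit in [
    ("N. Virginia, Oregon, Ireland", 300),
    ("all other regions", 150),
]:
    headroom = burst_headroom(EVENTS_PER_DAY, limit)
    print(f"{region_group}: limit reached at {headroom:.1f}x the average rate")
```

Swapping in your own daily volume gives a quick feel for how much clustering the start-execution limit tolerates.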
Fargate Limits, September 2020
The following Fargate quotas are all soft limits, and therefore adjustable upon request, though the documentation gives no hint of how far they can be raised: at any one time, you can have 10K clusters per region, 2K services per cluster, and 2K tasks per service. Assuming every event takes a Fargate task 5 seconds to process, and you’ve got 300 events per second during a daily surge, you’d have about 1,500 tasks running at once, which is close to the tasks-per-service quota. You would have to request a quota increase and maybe even eventually create a workaround with multiple services or clusters for the same purpose. Still, Fargate seems safer than Step Functions from a limits perspective.
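The "close to the quota" estimate follows from multiplying the arrival rate by the processing time (Little's law). Below is a rough sketch with hypothetical function names; it assumes one task per event and ignores task start-up time.

```python
import math

TASKS_PER_SERVICE_QUOTA = 2_000  # soft limit quoted above, September 2020


def concurrent_tasks(arrivals_per_second, seconds_per_task):
    """Steady-state number of tasks in flight (arrival rate x duration)."""
    return arrivals_per_second * seconds_per_task


def services_needed(arrivals_per_second, seconds_per_task,
                    quota=TASKS_PER_SERVICE_QUOTA):
    """Minimum number of services if the per-service quota is not raised."""
    return math.ceil(concurrent_tasks(arrivals_per_second, seconds_per_task) / quota)


needed = concurrent_tasks(300, 5)
utilization = needed / TASKS_PER_SERVICE_QUOTA
print(f"{needed} concurrent tasks, {utilization:.0%} of the per-service quota")
print(f"services needed at 500 events/s: {services_needed(500, 5)}")
```

The second call shows the point at which the multi-service workaround (or a quota increase) becomes necessary.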
Lambda Quotas, September 2020
Lambda defaults to a quota of 1,000 concurrent executions, but this can be increased up to “hundreds of thousands,” a figure explicitly called out in the docs. Note that the quota is shared by all functions in an account within a region rather than granted per function. If we’ve got an unlikely spike of 400 requests per second, and each invocation has the function running 5 seconds, then we’d have 2,000 concurrent executions of the first function in the chain of functions. That is above the default quota, so an increase would be needed, but it is well within what an increase can provide.
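Turning the question around, the sketch below estimates what request rate a given concurrency quota can sustain. The function names are hypothetical, and the model assumes every invocation takes the same time and that no other function is using the same concurrency.

```python
import math

DEFAULT_CONCURRENCY_QUOTA = 1_000  # default quoted above, September 2020


def max_sustained_rate(concurrency_quota, seconds_per_invocation):
    """Requests per second that keep concurrency at or below the quota."""
    return concurrency_quota / seconds_per_invocation


def required_concurrency(requests_per_second, seconds_per_invocation):
    """Concurrent executions needed to absorb a sustained request rate."""
    return math.ceil(requests_per_second * seconds_per_invocation)


rate_at_default = max_sustained_rate(DEFAULT_CONCURRENCY_QUOTA, 5)
spike_concurrency = required_concurrency(400, 5)

print(f"default quota sustains {rate_at_default:.0f} requests/s at 5 s each")
print(f"a 400 requests/s spike needs {spike_concurrency} concurrent executions")
```

With 5-second invocations, the default quota covers 200 requests per second, so the 400-per-second spike is the kind of case where a quota increase should be requested ahead of time.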
Fun Fact №1: You can request service quota increases from the Service Quotas dashboard. Just type in “Service Quotas” in the Find Services box to get there.
Fun Fact №2: Consider buying a Compute Savings Plan, especially when running a high volume of Lambda and/or Fargate workloads, to save up to 66%.
Fun Fact №3: The limit on invocation requests per second for a Lambda function invoked by AWS services such as SQS and Kinesis is… “unlimited.”
The most conservative bet in terms of massive scale is Lambda. Whatever option you choose, however, just make sure to monitor the service limits by implementing alerts from something like the Service Limit Monitor Solution so that you can request a service limit increase on time. And for reliability, consider all the best practices of the Well-Architected Framework, including queuing and dead-letter queues.
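As an illustration of the alerting idea, here is a minimal sketch that flags any quota whose usage crosses a warning threshold. It is not how the Limit Monitor solution works internally; the 80% threshold, the quota names, and the usage snapshot are made-up assumptions, and in practice the numbers would come from your own metrics.

```python
def quotas_near_limit(usage, quotas, threshold=0.8):
    """Return (name, utilization) pairs at or above the threshold, highest first."""
    alerts = []
    for name, used in usage.items():
        utilization = used / quotas[name]
        if utilization >= threshold:
            alerts.append((name, utilization))
    return sorted(alerts, key=lambda item: item[1], reverse=True)


# Hypothetical snapshot taken during a daily surge.
quotas = {
    "lambda_concurrent_executions": 1_000,
    "fargate_tasks_per_service": 2_000,
    "sfn_new_executions_per_second": 150,
}
usage = {
    "lambda_concurrent_executions": 900,
    "fargate_tasks_per_service": 1_500,
    "sfn_new_executions_per_second": 140,
}

for name, utilization in quotas_near_limit(usage, quotas):
    print(f"WARNING: {name} at {utilization:.0%} of quota")
```

Anything this check flags is a candidate for a quota increase request before the next surge.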