
Troubleshooting Slow SQS FIFO Queues

A Real Production Serverless Scenario


By Marcelo Andrade
December 22, 2020

Introduction

Running a huge number of operations against a database can be tough, especially when you need to guarantee the order in which those operations are performed. One of our clients was struggling with very slow processing of SQS FIFO queue messages, so we ran some tests to understand what performance improvements could be made.

This post walks through a simple implementation built around Lambda and RDS, created to gather metrics and information about this common pattern on AWS.

Assets and Resource Dependencies

We used an RDS Aurora MySQL cluster (1 reader / 1 writer) with a db.t3.small instance size, running in a private VPC.

RDS Cluster Screenshot

We also configured Secrets Manager to store the database credentials and hostname.

Secrets Manager Screenshot

If you are just here for the code, check out the template, <>.

Project Resources

Database Table

We used the MySQL Sakila sample database as a starting point for our operations.

For this use case we simply insert randomly generated first and last name values into the actor table.
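As a sketch, the payload for each message is nothing more than a pair of random names destined for the actor table (the name pools below are made up for illustration; the real project may generate names differently):

```python
import random

# Hypothetical name pools, for illustration only
FIRST_NAMES = ["ANA", "BRUNO", "CARLA", "DIEGO", "ELENA"]
LAST_NAMES = ["SILVA", "SOUZA", "PEREIRA", "LIMA", "COSTA"]

def random_actor():
    """Build the payload for one INSERT into the actor table."""
    return {
        "first_name": random.choice(FIRST_NAMES),
        "last_name": random.choice(LAST_NAMES),
    }

payload = random_actor()
```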

Consumer

  • Since the consumer's only responsibility is to iterate over a batch of messages and run a simple INSERT query for each of them (a single batch query would be faster), we are using a Lambda function with a 128 MB memory allocation (the minimum allowed), running inside the same VPC as the database.
  • To communicate with Secrets Manager from inside that VPC, we create a VPC endpoint.
  • Since we are using Artillery for the load test and it only works with HTTP requests, we use API Gateway to expose an HTTP endpoint that receives each message and forwards it to an SQS queue in FIFO (First In, First Out) mode.
Consumer diagram
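Putting the bullets above together, the consumer's handler can be sketched like this. This is a hypothetical version: sqlite3 in memory stands in for the Aurora MySQL cluster, and the real function would first fetch credentials from Secrets Manager and connect with a MySQL driver:

```python
import json
import sqlite3

# sqlite3 in memory stands in for the Aurora MySQL cluster here;
# the real handler reads credentials from Secrets Manager and
# connects with a MySQL driver instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (first_name TEXT, last_name TEXT)")

def handler(event, cursor):
    # SQS hands the function up to `batchSize` records per invocation;
    # we simply run one INSERT per message.
    for record in event["Records"]:
        payload = json.loads(record["body"])
        cursor.execute(
            "INSERT INTO actor (first_name, last_name) VALUES (?, ?)",
            (payload["first_name"], payload["last_name"]),
        )
    return {"processed": len(event["Records"])}

# Simulated SQS event with two messages
event = {"Records": [
    {"body": json.dumps({"first_name": "PENELOPE", "last_name": "GUINESS"})},
    {"body": json.dumps({"first_name": "NICK", "last_name": "WAHLBERG"})},
]}
result = handler(event, conn)
```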

Important Considerations

Dead Letter Queue Required

If the consumer hits an error, the failed message(s) return to the queue and block it until they are moved to a Dead Letter Queue. If you don't handle this, your entire pipeline can get stuck. That's why we also deploy a DLQ, configured through the RedrivePolicy of the main queue.

Screenshot from serverless.yml file
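For reference, the RedrivePolicy wiring looks roughly like this in serverless.yml (queue names and maxReceiveCount are illustrative; note that a FIFO queue's DLQ must itself be a FIFO queue):

```yaml
resources:
  Resources:
    MainQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: actors-queue.fifo
        FifoQueue: true
        ContentBasedDeduplication: true
        RedrivePolicy:
          deadLetterTargetArn: !GetAtt DeadLetterQueue.Arn
          maxReceiveCount: 3
    DeadLetterQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: actors-dlq.fifo
        FifoQueue: true
```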

Another important point: if you process a batch of 10 messages and any one of them fails, the entire batch is reported as failed. You need to be careful when reprocessing from the Dead Letter Queue, because the failed batch may contain messages that were already processed successfully.
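One way to make DLQ reprocessing safe is to make each insert idempotent, keyed on the message's unique ID, so that replaying an already-processed message is a no-op. A minimal sketch, with sqlite3 standing in for MySQL (where INSERT IGNORE plays the same role as INSERT OR IGNORE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actor ("
    "message_id TEXT PRIMARY KEY, first_name TEXT, last_name TEXT)"
)

def insert_idempotent(message_id, first, last):
    # Replaying a message that already landed is a no-op,
    # because message_id is the primary key.
    conn.execute(
        "INSERT OR IGNORE INTO actor VALUES (?, ?, ?)",
        (message_id, first, last),
    )

insert_idempotent("msg-1", "PENELOPE", "GUINESS")
insert_idempotent("msg-1", "PENELOPE", "GUINESS")  # replayed from the DLQ
count = conn.execute("SELECT COUNT(*) FROM actor").fetchone()[0]  # -> 1
```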

Scaling SQS FIFO

Since we are using a FIFO queue, we cannot scale the consumer horizontally: concurrent executions would break the ordering of the messages (SQS FIFO only allows parallelism across distinct message group IDs). What we can do is process up to 10 messages per Lambda execution using batchSize.
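The batchSize setting lives on the function's SQS event; a sketch of how it might look in serverless.yml (the handler path and ARN reference are illustrative):

```yaml
functions:
  consumer:
    handler: src/consumer.handler
    memorySize: 128
    events:
      - sqs:
          arn: !GetAtt MainQueue.Arn
          batchSize: 10
```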

VPC Endpoint to access Secrets Manager

As we are running the Lambda function inside a private VPC, we need a VPC endpoint in order to access Secrets Manager.
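Such an endpoint can be declared in the resources section of serverless.yml; a sketch with placeholder VPC, subnet, and security group IDs:

```yaml
resources:
  Resources:
    SecretsManagerEndpoint:
      Type: AWS::EC2::VPCEndpoint
      Properties:
        ServiceName: !Sub "com.amazonaws.${AWS::Region}.secretsmanager"
        VpcEndpointType: Interface
        PrivateDnsEnabled: true
        VpcId: vpc-xxxxxxxx            # placeholder
        SubnetIds:
          - subnet-xxxxxxxx            # placeholder
        SecurityGroupIds:
          - sg-xxxxxxxx                # placeholder
```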

Producer (Load Testing)

To generate a good volume of traffic to SQS we used the serverless-artillery solution. If you don't know it yet, we have a post on our blog with more information right here.

For our test case, we created a script.yml file that generates messages with random IDs and POSTs them to the API Gateway endpoint, which makes them available in the SQS queue. You can find the config file for Artillery here. We use randomNumber for the ID because our queue does not accept duplicate entries.

script.yml
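A script.yml along these lines would do the job (the target, path, and arrival rate are illustrative; $randomNumber is a built-in Artillery template helper):

```yaml
config:
  target: "https://{BASE_PATH}"   # taken from the consumer deploy output
  phases:
    - duration: 30
      arrivalRate: 100
scenarios:
  - flow:
      - post:
          url: "/message"
          json:
            id: "{{ $randomNumber(1, 99999999) }}"
            first_name: "FIRST"
            last_name: "LAST"
```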

For this scenario, you can get BASE_PATH from the output of the consumer deploy, like this:

Output example

Running Load Test—Test Scenarios

#1 Without MySQL interaction

SQS FIFO—BatchSize 1

  • Number of messages received in 30 seconds: 9625
  • Total Time to process: ~7 minutes

Lambda Metrics

SQS Metrics

SQS FIFO—BatchSize 10

  • Number of messages received in 5 minutes: 90814
  • Total Time to process: ~7 minutes

Lambda Metrics

SQS Metrics

#2 With MySQL Queries

In the previous load tests we didn't run any MySQL queries; we were merely establishing a baseline for how many messages could be processed without any database operations.

Now we are going to add the MySQL queries in and see how the performance changes.

SQS FIFO—BatchSize 1 - (1 INSERT per message)

  • Number of messages received in 30 seconds: 2,288
  • Total Time to process: ~6 minutes

SQS FIFO—BatchSize 10 - (1 INSERT per message / 10 per lambda execution)

  • Number of messages received in 30 seconds: 2,274
  • Total Time to process: ~1min 15s

SQS FIFO—BatchSize 10 - (1 INSERT per message / 10 per lambda execution)

  • Number of messages received in 5 minutes: 65,530
  • Total Time to process: ~23 minutes

Lambda Metrics

Now that we are running queries against our RDS database, the execution duration has increased compared to the scenario without the database. Remember, this is a 128 MB Lambda running 10 separate INSERT queries per invocation.

Database Metrics

DB Connections

Since we are dealing with a FIFO queue, there should never be more than one execution running in parallel, so you should not see a high volume of connections on your RDS instance.

CPU Utilization

Since we are not running concurrent queries against the database, CPU usage at the RDS instance should stay low. In this screenshot, we wrote more than 65k new records with a peak of only 17% CPU on the smallest instance available to us.

Write Latency

This is a very useful metric to watch. Running a huge number of queries simultaneously increases write latency, and that in turn makes your Lambda function take longer to finish processing.

Conclusion

At the end of the day, we concluded that you can handle a very good amount of data with an SQS FIFO queue using the batchSize: 10 strategy. It improves throughput almost 10x compared to batchSize: 1, and for high production traffic this is one way to relieve this kind of bottleneck.

While we were studying and running the test cases, we discovered that some developers use Kinesis Data Streams instead of SQS FIFO, which lets them handle a much larger volume of data in a smaller window of time. We are going to test this scenario as well and write another post about it.

Thanks for reading and let us know if you have any questions.