CloudFront can be simply defined as a CDN (Content Delivery Network), caching your static assets in a datacenter closer to your viewers. But CloudFront is a lot more complex and versatile than this simple definition suggests. CloudFront is a “pull” CDN, which means that you don’t push your content to the CDN: the content is pulled into the CDN edge from the origin the first time each piece of content is requested.
In addition to the traditional pull-and-cache usage, CloudFront can also be used as:
A Networking Router
A Firewall
A Web Server
An Application Server
Why is using a CDN relevant?
The main reason is to improve the speed of delivery of static content. By caching the content on the CDN edge, you not only reduce the download time from a few seconds to a few milliseconds, but you also reduce the load and the number of requests on your backend (network, IO, CPU, memory, …).
Static content can be defined as content that does not change between two identical requests made within the same time frame.
“Identical” can be as coarse as the same URI, or as fine-grained as matching down to the authentication header. The time frame can range from 1 second to 1 year. The most common case is caching resources like JavaScript or CSS and serving the same file to all users forever. But caching a JSON response tailored to a user (keyed on the authentication header) for a few seconds reduces backend calls when the user has the well-known “frenetic browser reload syndrome”.
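To make this concrete, here is a minimal sketch of two CloudFront cache policies written as CloudFormation YAML; the resource names and TTL values are illustrative, not taken from any real setup. The first caches static assets for a year and serves the same object to everyone; the second caches a per-user JSON response for a few seconds by including the authorization header in the cache key.

```yaml
Resources:
  # Static assets (JS/CSS): cache for a year, same object for every viewer
  StaticAssetsCachePolicy:
    Type: AWS::CloudFront::CachePolicy
    Properties:
      CachePolicyConfig:
        Name: static-assets-one-year
        MinTTL: 31536000        # 1 year, in seconds
        DefaultTTL: 31536000
        MaxTTL: 31536000
        ParametersInCacheKeyAndForwardedToOrigin:
          EnableAcceptEncodingGzip: true
          EnableAcceptEncodingBrotli: true
          CookiesConfig: { CookieBehavior: none }
          HeadersConfig: { HeaderBehavior: none }
          QueryStringsConfig: { QueryStringBehavior: none }

  # Per-user JSON: cache for a few seconds, keyed on the Authorization header
  PerUserJsonCachePolicy:
    Type: AWS::CloudFront::CachePolicy
    Properties:
      CachePolicyConfig:
        Name: per-user-json-five-seconds
        MinTTL: 1
        DefaultTTL: 5
        MaxTTL: 5
        ParametersInCacheKeyAndForwardedToOrigin:
          EnableAcceptEncodingGzip: true
          EnableAcceptEncodingBrotli: true
          CookiesConfig: { CookieBehavior: none }
          HeadersConfig:
            HeaderBehavior: whitelist
            Headers: [Authorization]
          QueryStringsConfig: { QueryStringBehavior: all }
```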
Edges, Mid-Tier Caches, and Origins
CloudFront isn’t “just” some servers in datacenters around the world. The service is a layered network of Edge Locations and Regional Edge Caches (or Mid-Tier Caches).
Edge Locations are distributed around the globe with more than 400 points of presence in over 90 cities across 48 countries. Each Edge Location is connected to one of the 13 Regional Edge Caches.
Regional Edge Caches are transparent to you and your visitors: you can’t configure them or access them directly. Your visitors interact with the nearest Edge Location, which connects to its attached Regional Edge Cache and finally to your origin. Therefore, in this article, we will refer to CloudFront as the combination of Edge Locations and Regional Edge Caches.
What Have We Learned?
CloudFront is more than just a simple “pull-cache-serve” service
You improve delivery speed to your visitors
You can increase resilience by always using a healthy backend
You improve overall speed to your backend by leveraging AWS’s backbone
You can modify any request to tailor the response to your visitor’s device or region (see the sketch after this list)
You don’t always need a backend
You protect your backend by reducing the number of calls reaching it
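As a taste of the “modify any request” point, here is a minimal sketch of a CloudFront Function attached to the viewer request event. It is only an illustration: the cloudfront-viewer-country header is the one CloudFront can provide, while the country code and target URL are placeholders.

```javascript
// Minimal CloudFront Function sketch (viewer-request event).
// Sends German visitors to a hypothetical country-specific site;
// everyone else gets the request passed through unchanged.
function handler(event) {
    var request = event.request;
    var country = request.headers['cloudfront-viewer-country'];

    if (country && country.value === 'DE') {
        return {
            statusCode: 302,
            statusDescription: 'Found',
            headers: {
                location: { value: 'https://de.example.com' + request.uri }
            }
        };
    }

    // No match: forward the original request to the cache/origin
    return request;
}
```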
Nowadays we don't just want data, we want information, which is quite different. Data is like rough stones ⛏️ extracted from a mine, while information is the polished diamond 💎✨.
This article aims to help you prepare a simple and effective stack where you can store data (rough stones), and then carry out analyses (extracting diamonds), in a fully scalable low-cost way 💸.
The Pattern
The “AWS Lego”
When it comes to data analysis, AWS offers many sorts of tools; like a “Lego” with infinite possibilities. Today, I’m showing a pattern created and recommended by a friend of mine, Marcelo Andrade. In it, we will have a very simple API that feeds a columnar data store (a very performant way of doing data analysis), all of it saved into an S3 bucket.
Which “Lego” Pieces Will We Be Using?
API Gateway (to create our API)
S3 (where we will store the data)
Glue Database (to organize the data in S3 in a structured way)
Kinesis Firehose (the delivery stream that connects API Gateway to the Glue database + S3)
Lambda (a small JavaScript function that guarantees the partitioning)
Athena (for data analysis)
Finally, The Pattern!
How about understanding this pattern with a real-world example? Below we have an API to store logs, from which you will be able to extract metrics later (without using CloudWatch 😀).
Any application can send logs to the API (rough stones ⛏️🪨)
A POST endpoint on API Gateway receives the JSON with logs
API Gateway acts as a service proxy; in other words, it communicates directly with Kinesis Firehose without the need for a Lambda
Kinesis Firehose converts this “raw” JSON to a columnar format (Apache Parquet, a performant format for data analysis)
Kinesis Firehose then uses the Glue Database schema and saves the converted Parquet files in directories (a.k.a. partitions) inside an S3 bucket
A user can consume this partitioned information saved in S3 through SQL queries in Athena (polished diamonds 💎✨)
This information can also be consumed later by other processes, not only by users on the Athena console.
Real-world applications
Storage of metrics, logs, and traces for later analysis, which can replace CloudWatch and thus reduce costs
For example, you can use the API to feed sales data from your e-commerce in real time and make statistical predictions of profit & churn, and identify anomalies.
Sales, financial, stock market, supply chain, health, performance, telecommunications data, etc.
Any data that requires complex analytical queries or aggregations across multiple dimensions, which is common in large-scale data analysis environments.
Show me the code!
You can check out the pattern in this Repository; here are some highlights of the most important points of the pattern:
1. Our bucket to save logs - 'resources/s3.yml':
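The repository isn’t reproduced here, so the snippet below is only a hedged sketch of what such a 'resources/s3.yml' could declare; the bucket name and settings are illustrative.

```yaml
Resources:
  LogsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: mycompany-logs-dev   # illustrative name, reused in the later sketches
      PublicAccessBlockConfiguration:  # logs are private, so block all public access
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
```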
2. Glue Database - 'resources/glue.yml':
in 'InputFormat' and 'OutputFormat' we specify that we want the Parquet format (that columnar format mentioned earlier);
in 'TableInput > StorageDescriptor > Columns' we specify the table schema;
in 'TableInput > PartitionKeys' we specify that the table is partitioned by year, month, and day.
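Based on the notes above, here is a hedged sketch of what 'resources/glue.yml' could look like; the database, table, and column names are illustrative, not the repo’s exact contents.

```yaml
Resources:
  LogsGlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId:
        Ref: AWS::AccountId
      DatabaseInput:
        Name: logs_database

  LogsGlueTable:
    Type: AWS::Glue::Table
    Properties:
      CatalogId:
        Ref: AWS::AccountId
      DatabaseName:
        Ref: LogsGlueDatabase
      TableInput:
        Name: logs
        TableType: EXTERNAL_TABLE
        PartitionKeys:                 # partitioned by year, month, and day
          - { Name: year, Type: string }
          - { Name: month, Type: string }
          - { Name: day, Type: string }
        StorageDescriptor:
          Columns:                     # the table schema (illustrative)
            - { Name: logtime, Type: timestamp }
            - { Name: logtype, Type: string }
            - { Name: message, Type: string }
          Location: s3://mycompany-logs-dev/logs/   # the bucket from resources/s3.yml
          InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
          SerdeInfo:
            SerializationLibrary: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
```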
3. Create a delivery stream with Kinesis Firehose and connect it to the Glue Database and S3:
in 'KinesisDeliveryStream > Properties' there is a property called 'ExtendedS3DestinationConfiguration', which means that the result of this stream will be stored in an S3 bucket (we reference this bucket in 'BucketARN');
in 'DataFormatConversionConfiguration > InputFormatConfiguration' we are specifying that we receive data in JSON (OpenXJsonSerDe);
in 'DataFormatConversionConfiguration > OutputFormatConfiguration' we are specifying that the Stream converts the data into Parquet (ParquetSerDe);
in 'DataFormatConversionConfiguration > SchemaConfiguration' we reference the Glue Database that we created earlier.
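Again as a sketch only, the delivery stream described above could look roughly like this. 'FirehoseRole' is a placeholder for an IAM role (not shown), and the buffer sizes, prefix, and names are illustrative.

```yaml
Resources:
  KinesisDeliveryStream:
    Type: AWS::KinesisFirehose::DeliveryStream
    Properties:
      ExtendedS3DestinationConfiguration:
        BucketARN:
          Fn::GetAtt: [LogsBucket, Arn]
        RoleARN:
          Fn::GetAtt: [FirehoseRole, Arn]   # placeholder IAM role
        Prefix: "logs/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"
        ErrorOutputPrefix: "errors/!{firehose:error-output-type}/"
        BufferingHints:
          IntervalInSeconds: 60
          SizeInMBs: 64                     # format conversion requires at least 64 MB
        DataFormatConversionConfiguration:
          Enabled: true
          InputFormatConfiguration:
            Deserializer:
              OpenXJsonSerDe: {}            # we receive JSON
          OutputFormatConfiguration:
            Serializer:
              ParquetSerDe: {}              # and write Parquet
          SchemaConfiguration:
            DatabaseName: logs_database     # the Glue database sketched earlier
            TableName: logs
            RoleARN:
              Fn::GetAtt: [FirehoseRole, Arn]
```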
4. Now we want an endpoint on API Gateway acting as a service proxy for Kinesis. The trick here is to use the 'serverless-apigateway-service-proxy' plugin; just include it in 'serverless.yml':
in 'streamName' we reference the stream that we created previously; the plugin creates the endpoint in API Gateway on its own.
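A hedged sketch of that 'serverless.yml' wiring, following the plugin’s documented kinesis proxy options; the path and CORS setting are illustrative and the exact setup in the repo may differ.

```yaml
plugins:
  - serverless-apigateway-service-proxy

custom:
  apiGatewayServiceProxies:
    - kinesis:
        path: /logs                      # illustrative endpoint path
        method: post
        streamName:
          Ref: KinesisDeliveryStream     # the stream created previously
        cors: true
```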
5. Let's configure Athena so we will be able to run queries on this mass of data stored in the S3 bucket. The trick here is to include the 'serverless-athena' plugin in 'serverless.yml':
in 'databases > name' we must be careful not to use the same name as the Glue Database;
in 'databases > ddl' we specify a SQL file* with a 'CREATE DATABASE' statement for Athena;
in 'databases > tables > ddl' we specify a SQL file* with a 'CREATE TABLE' statement with the same specifications that we used earlier in Glue Database.
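As a rough sketch of that configuration, built from the keys described above (the database name and the SQL file paths are illustrative; check the plugin’s documentation for the exact options):

```yaml
plugins:
  - serverless-athena

custom:
  athena:
    databases:
      - name: logs_analysis                            # NOT the same name as the Glue Database
        ddl: ${file(athena/ddl/create-database.sql)}   # CREATE DATABASE statement
        tables:
          - name: logs
            ddl: ${file(athena/ddl/create-table.sql)}  # CREATE TABLE matching the Glue schema
```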
6. Finally, we guarantee that the Athena partitions are updated each time a file is submitted to the API and saved in Parquet format in the S3 bucket. For this, we create a Lambda that is triggered on every 's3:ObjectCreated' event. Add to 'serverless.yml':
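A minimal sketch of that function definition, assuming the bucket name from the earlier sketch and a handler path matching the next step:

```yaml
functions:
  partitionUpdate:
    handler: athena/handlers/partitionUpdate.handler
    events:
      - s3:
          bucket: mycompany-logs-dev     # the bucket declared in resources/s3.yml
          event: s3:ObjectCreated:*
          rules:
            - prefix: logs/              # only react to the converted log files
          existing: true                 # the bucket is created elsewhere in the stack
```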
7. Here is the Lambda handler - 'athena/handlers/partitionUpdate.js':
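The repo’s handler isn’t shown here, so the following is only a hedged sketch of the idea: parse the year/month/day out of the new object’s key and register that partition in Athena. It assumes a Node.js runtime with AWS SDK v2 available, the illustrative names from the earlier sketches, and an 'ATHENA_OUTPUT_LOCATION' environment variable for query results.

```javascript
// athena/handlers/partitionUpdate.js - illustrative sketch, not the repo's actual code
const AWS = require('aws-sdk');

const athena = new AWS.Athena();

module.exports.handler = async (event) => {
  for (const record of event.Records) {
    // Expected key layout (assumption): logs/year=2024/month=01/day=15/<file>
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    const match = key.match(/year=(\d+)\/month=(\d+)\/day=(\d+)/);
    if (!match) continue;

    const [, year, month, day] = match;

    // Register the partition so Athena can see the new data
    await athena
      .startQueryExecution({
        QueryString: `ALTER TABLE logs ADD IF NOT EXISTS PARTITION (year='${year}', month='${month}', day='${day}')`,
        QueryExecutionContext: { Database: 'logs_analysis' }, // illustrative database name
        ResultConfiguration: { OutputLocation: process.env.ATHENA_OUTPUT_LOCATION },
      })
      .promise();
  }
};
```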
Let's play with our Logs API 😀!
1. 'npm run deploy' or 'npx serverless deploy --stage dev --region us-east-1'
2. Wait for deployment, get the output URL from API Gateway, and do a POST request with a Log in JSON format:
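For example (the URL is the one printed by the deploy, and the exact body shape depends on the schema and request mapping you configured; this is just an illustration):

```bash
curl -X POST \
  "https://<api-id>.execute-api.us-east-1.amazonaws.com/dev/logs" \
  -H "Content-Type: application/json" \
  -d '{"logtime": "2024-01-15T10:22:05Z", "logtype": "error", "message": "payment service timeout"}'
```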
3. Query in Athena searching for the number of results by logtype (error, warning, info, etc.)
It is important to note that we benefit from filtering by the attributes that were used for partitioning the data (year, month, and day). With massive amounts of data this has a big impact on both performance and cost, since Athena charges based on the amount of data it reads.
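A query along those lines might look like this (table and column names follow the earlier sketches, so adjust them to your schema); note that the WHERE clause touches only the partition columns, so Athena scans a single day of data:

```sql
SELECT logtype,
       COUNT(*) AS total
FROM logs
WHERE year = '2024'
  AND month = '01'
  AND day = '15'
GROUP BY logtype
ORDER BY total DESC;
```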
Now just plug the API into your application and use it at will. All the resources scale automatically (yeah, it’s serverless power! 😀), you get very high storage capacity at low cost (S3), and you also get very performant data analysis at low cost (Athena).
Conclusion
There are many ways to do data storage and analysis, but it is very convenient to do it through an API. This one saves a lot of money by storing the data in S3 and by paying only for the data Athena reads, thanks to partitioning.
With the AWS Pricing Calculator, I estimated 5 GB of logs read and 3 GB of data analyzed. Using CloudWatch it would cost 5.06 USD/month, while with this pattern it costs 1.78 USD/month, almost 3 times cheaper.
This pattern supports large amounts of data without performance or scalability gaps, you could ingest data from other sources using Glue, and there are many possibilities for triggering further processing from Athena query results, such as ETLs.
Sometimes you are in a low-budget situation, so every coin matters. As the Logs API example shows, many times you don’t need everything CloudWatch offers, only a few things for a simple solution. Patterns like this solve many real-world cases with less code.
If, like me, you are not a data analysis expert, don’t worry: I think this pattern can be a good starting point, so feel free to contact me. My special thanks to Marcelo Andrade for this pattern; it was an eye-opener for me.