Rust Lambdas for AWS CloudFront
I mostly wanted to do this to gain some experience with Lambdas and Rust; as a bonus, the end state does give me a database of website requests.
The goal here was to create a ~~CloudFront@Edge Lambda~~ normal Lambda (AWS currently only allows NodeJS Lambdas on their edge servers, which would likely mean I can’t use Rust) to parse website access logs and put the results in ~~DynamoDB~~ S3 (I opted away from Dynamo when I learned about Amazon Athena).
In sticking with a few themes for this site, I really want a pay per use implementation of this (no running EC2 instance or database), because I expect no traffic to this site and I don’t want to overpay for my disappointment.
Which is “generally” fastest: Java, Python, Node, or C#?
Ultimately the core code will be in Rust; however, Lambda can’t natively execute Rust code, so it has to be invoked from a language they do support. I wasn’t overly particular about which language to use for this, so I set out to research which would be the fastest (read: cheapest).
Cold Start Times
Java/C# unsurprisingly have the longest cold start.
Java is the fastest once running (this personally didn’t surprise me as the JIT is likely able to optimize the shit out of the very simple benchmark code)
Python starts fast, but the fastest start does not come with the smallest code size/memory; actually the opposite, cold start times seem to decrease with increased memory/code size… (I should note that this was not the experience of a friend of mine who uses Lambdas; more investigation might be necessary here)
C#’s runtime performance and slow start rule it out for me (though I’m curious whether this has improved; I have no idea how Amazon implements their C# Lambdas, but with the .NET Core project I’m curious if this performance is better)
I really wanted to avoid Java for this. Don’t get me wrong, I love Java, but I wanted the lowest memory overhead and lowest cold start times (again, going cheap and assuming nobody visits my site)
Fastest way to hand data to Rust
The examples I saw basically converted the payload to json, and then in Rust I would have to convert it back. This is very flexible and makes the C interface pretty simple:
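The original snippet isn’t preserved here, but the shape of that JSON-based interface, from the Python side, is roughly this (the library name `liblambda.so` and function name `handle_event` are illustrative, not the project’s actual names):

```python
import ctypes
import json

def encode_event(event: dict) -> bytes:
    """Serialize the whole Lambda event into a single JSON C string.

    This is where the serialization penalty lives: Python dumps the dict
    to JSON here, and the Rust side has to parse it right back.
    """
    return json.dumps(event).encode("utf-8")

def invoke(event: dict) -> None:
    # Hypothetical cdylib built from the Rust code; the Rust side would
    # expose: #[no_mangle] pub extern "C" fn handle_event(s: *const c_char)
    lib = ctypes.CDLL("./liblambda.so")
    lib.handle_event.argtypes = [ctypes.c_char_p]
    lib.handle_event(encode_event(event))
```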
However, I kind of hate the idea of this, because you incur a serialization and de-serialization penalty to execute your native code…
A good future project I would like to do here would be some black box testing of their lambda environment, playing around with serialization/deserialization to see which performs the best.
If I just use simple primitives in the interface there are no serialization/de-serialization penalties at all. This is great for performance but would certainly be a nightmare if you needed to communicate more than a few values to your Rust code.
Lucky for me, all I need from the Lambda input in this project is the S3 Bucket, S3 Object Key, and S3 Region. I will map what I need directly to primitives and call a Rust function which has parameters for exactly the data I need; this avoids any serialization penalties and works well for small payloads:
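As a sketch of what that looks like on the Python side (the event paths follow the standard S3 notification format; the library and function names are illustrative):

```python
import ctypes

def extract_s3_info(event: dict) -> tuple:
    """Pull just the three primitives we need out of the S3 notification event."""
    record = event["Records"][0]
    return (
        record["awsRegion"],
        record["s3"]["bucket"]["name"],
        record["s3"]["object"]["key"],
    )

def handler(event, context):
    region, bucket, key = extract_s3_info(event)
    # Hypothetical cdylib; the Rust side would expose something like:
    #   #[no_mangle] pub extern "C" fn process_log(region: *const c_char,
    #       bucket: *const c_char, key: *const c_char)
    lib = ctypes.CDLL("./liblambda.so")
    lib.process_log.argtypes = [ctypes.c_char_p] * 3
    lib.process_log(region.encode(), bucket.encode(), key.encode())
```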
NOTE: It does look possible to hand structs between the wrapper and Rust; however, at the time I did this there wasn’t a lot of great documentation or examples on how this would be done, and it also opens you up to a world of possible issues in unsafe blocks in your Rust code. If I get to a point where I want to send a lot of data into Rust from the Lambda request, this would probably be worth revisiting, or at least considering alongside the probably much safer (but slower) serialization to json/xml idea discussed above.
I ended up choosing Python over Node for the wrapper, for a few reasons:
- The thought of installing NodeJS into this project and creating another massive node_modules directory on my computer makes me dry-heave a little
- I like python a lot better as a language
- The mechanism for calling native code via python was really simple
Because Rust compiles to native code, we need to build it in an environment compatible with the one the Lambdas run in; I figured the best way to accomplish this would be to build using the exact image the Lambda is executed in.
Setup an EC2 Build Server for Rust
I scripted all of this with Ansible to make my life easier; the playbooks can be found in the ansible directory of the git project.
The playbook performs the following steps:
- Create a new VPC
- Create a subnet in the VPC
- Add a gateway to the VPC
- Add a routing entry for the subnet which adds a default entry pointing to our new gateway
- Create a new Security Group which allows access on port 22 incoming
- Provision the EC2 instance in the new VPC with the new Security Group
- Update all packages on the instance
- Install Git
- Install some necessary build tools (C Linker, OpenSSL Dev libraries)
- Setup AWS CLI credentials/config file
- Download Rust, extract it, install it
- Reboot the instance
This gives us an exact copy of the environment our Lambda will run in, created in its own isolated network.
I also wanted to automate the build process, this was my first implementation:
- Commit/Push all local changes
- Ansible: Checkout the tag on the ec2 instance
- Ansible: Run the build script which runs the cargo build and creates the lambda distribution zip
- Ansible: Invoke the AWS CLI to deploy the updated lambda
This worked; however, I ended up simplifying things and removing Ansible:
- Shell Script: rsync the current project directory to the build server
- Shell Script: execute commands via ssh on the build server to build/package the lambda
- Shell Script: execute AWS CLI command via ssh on the build server to deploy the lambda
I made these changes for a few reasons:
- Ansible was slow at performing this task and has no streaming stdout, making it hard to see why builds were failing and/or what the output of each step was
- It was very cumbersome to do a git commit/push for every change
The problem with this setup is that there is no link between what gets deployed and any commit in source control (since uncommitted files are being used for the build)
This is a major no-no… and I haven’t decided yet how I’m going to circle back around to fix this.
As this project matures and I’m not doing so much ad-hoc debugging I will need to revisit this process and make sure I get some traceability between the lambda deployed to amazon and a commit or tag in source control.
I spent 5 hours sorting the next problem out. Locally I was able to build and run my Lambda with my Python test wrapper just fine.
I was also able to build just fine on my EC2 build server; however, when I deployed the Lambda I was getting this error:
What I discovered is that rust-openssl by default dynamically links the OpenSSL libcrypto library. Ok, no problem, I just need to check the version on my build server vs the stock image and see what changed. I have a step in my Ansible playbook to update all packages, so I must be updating openssl to a newer version, right? Just fix that and we’re back in business, right? Fucking think again…
I made my build server from the exact image Amazon told me they are using to run my lambda: Lambda Execution Environment
The problem comes about when I look at that default image: it has OpenSSL version 1.0.1 installed. HOWEVER, when I install openssl-devel on my build server it updates OpenSSL to 1.0.2, and the nice folks over at Amazon have completely removed anything OpenSSL 1.0.1 from their repos…
Hrm, ok. Looking at rust-openssl, it looks like you can pass in some environment variables and statically link OpenSSL. I know there could be some argument here over whether static vs dynamic linking is a good or bad idea, but one thing I couldn’t find any documentation on is the process Amazon goes through when they update the Lambda runtime image. I decided that static linking would potentially protect me from them hot-swapping the AMI they run my Lambda in for one that has a different version of OpenSSL, causing a sudden and immediate failure of my Lambda… Perfect, let’s statically link and move on…
From the rust-openssl README:
Great, what do I set `OPENSSL_STATIC` to? I decided to go with `1`, which seems to work; it seems like the value is probably not important.
First thing I learned: setting `OPENSSL_STATIC` on its own doesn’t do jack shit. Some more googling and looking at a few GitHub issues, and I determined that you must also specify either `OPENSSL_DIR`, or `OPENSSL_LIB_DIR` and `OPENSSL_INCLUDE_DIR`.
Ok, so I set up `OPENSSL_LIB_DIR` and `OPENSSL_INCLUDE_DIR`, and the build failed with `could not find native library ssl`. I’ll admit, this one was me being a little boneheaded, and half an hour later I realized it was looking for the .a static openssl binary and not the .so shared object.
This was easily fixed with `sudo yum install openssl-static`.
Should be smooth sailing now, right? HA!
Project builds, then throws a fantastic error in the linker stage:
Ok great… all my SSL libs are 1.0.2 now, so why is something referencing 1.0.1? I have no idea, and I still have no idea… I googled this for a while and eventually gave up. Time to change tack again.
I decided I’ll just compile openssl on my build machine and use the compiled output to build my app.
Success!!! Well… almost. I changed my cargo build to use `OPENSSL_DIR` instead of the other two options and pointed it at the prefix directory I used in the openssl config script.
The app built, no errors! My lambda deployed and ran! No errors! But then I look at the log:
Kill me… At this point I was about ready to just delete all this shit and do it in Java… It would have been way easier, probably run faster, work better… why am I doing this in Rust? Only assholes write natively compiled code…
Well, lucky for me, martinmroz is a hero and I found an issue he filed on GitHub. Not only was he having the exact same problem, he put a massive amount of effort into solving it and posted a detailed solution of how he fixed it.
Turns out Amazon’s decision to use a RedHat-based distro screws me again… (actually this may be the first problem, but I really really really prefer Debian-based distros to RedHat-based ones, though I also really really like RedHat as a company and what they do for open source… It’s very complicated!)
The Amazon Linux image keeps CA certs in `/etc/pki/tls`, which is not the default location OpenSSL looks in when compiled; the fix is to add `--openssldir=/etc/pki/tls` to the config script params.
That was it! It finally worked. In the end my openssl config line looks like this:
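The original line isn’t preserved here, but putting the pieces above together it takes roughly this shape (the `--prefix` path is illustrative; `--openssldir` is the important part):

```shell
# --prefix is wherever you want the compiled OpenSSL output to land (illustrative path);
# --openssldir must match where Amazon Linux keeps its CA certs
./config --prefix=/home/ec2-user/openssl --openssldir=/etc/pki/tls
make
sudo make install
```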
Note you will need to run `sudo make install` because you now need access to the /etc dir.
And my cargo build command looks like this:
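Again the exact command isn’t preserved, but with the rust-openssl environment variables discussed above it looks something like this (the `OPENSSL_DIR` path is illustrative and should match the `--prefix` used when compiling OpenSSL):

```shell
# OPENSSL_DIR points rust-openssl at the freshly compiled OpenSSL;
# OPENSSL_STATIC=1 asks it to link the static .a libs
OPENSSL_DIR=/home/ec2-user/openssl OPENSSL_STATIC=1 cargo build --release
```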
All of this is wrapped up into the Ansible playbook start_build_server.yml and my deploy.sh script if you want to see how it all comes together.
A Solution Looking For A Problem
One of my favorite phrases I’ve adopted for working in a large company. I see this stuff all the time:
“Let’s use MongoDB! I see that in a lot of tech blogs!”
Yes, I know it’s 2018, and yes, our company adopted MongoDB last year only so they could store JSON files, something that could easily have been accomplished in SQL Server or Oracle, both of which have first-class data types for JSON documents, and both of which we already have infrastructure and teams in place to support (we had nothing for Mongo).
“We need to send this data through BizTalk or Cisco Information Server or Oracle Service Bus.”
Yes, our company has all of these, and also Fuse/ServiceMix. I think maybe the only Enterprise Service Bus we are missing is MuleSoft. ESBs are stupid.
I digress. I now have a working Lambda which can retrieve an S3 object when it is added to a bucket, so I can… Profit???
Jackpot! I have a solution looking for a problem! Upper mgmt here I come!
In all seriousness I do actually have one problem I’m solving thus far, I wanted experience using Rust and also using AWS Lambda, I’ve at least started accomplishing that.
But what should I do with these access logs?
Put it in a database? What kind of database? Really I don’t think I can answer those questions until I figure out what I want to do with the data.
What DO I want to do with the data? What am I looking for? What statistics?
Here are a few things I’ve come up with:
- Requests per day (total and per page, non bot requests)
- Unique visitors (IPs, non bot requests)
- Top referrer per page (figure out how people are getting to my content)
How would I access this data? Query by time range? Group by IP?
I wouldn’t think twice about putting this data in a relational database; I know exactly how to write SQL queries for the above statistics.
However, there are some better technologies out there for time-series data; Elasticsearch and InfluxDB are two that I have used and they work very well.
The problem is, none of these options so far meets my requirement of pay per use; they all have a non-trivial monthly fee for the service even if it sees no data.
DynamoDB is an option, and a bonus is that I likely qualify for the free tier for this app; in fact, as mentioned above, this was going to be my original solution.
That was until I discovered Amazon Athena. In a nutshell, Amazon took Apache Hive and combined it with Presto to let you run SQL queries against objects in S3: Hive is used to project a table structure onto your objects, and Presto is the engine that executes the queries.
This is kind of a perfect scenario for me: my costs would be based only on use (putting data in/out of S3, plus storage), and I’m able to write familiar SQL queries to access the data.
So now my app is basically an ETL process for Athena, and really it doesn’t even need to do much; Athena could be set up to directly access the log files CloudFront ships to S3. But what kind of good corporate person would I be if I didn’t inject some piece of middleware that only I understand and can maintain to increase my job security??
Job security aside, there are some other good reasons to massage the raw CloudFront data:
- Verify file version
- Remove unnecessary comment lines from the file
- Convert the files into another format???
- Store the files in a directory structure which can be automatically partitioned by Athena
1. One downside of Athena appears to be very little tolerance for schema changes. If you wanted to add or remove a column from your datasource, it appears you only have a couple options: retroactively apply the change to all your data, or create a new table for the changed data moving forward. You might also be able to handle changes in the configuration of the table (the row formatter), but this would be complex for sure.
Amazon puts a version number at the top of all their access logs, so the first thing we should do is verify the version number is what we expect and error if it doesn’t match our schema, preventing any schema-related load issues.
2. This is a simple one, but we don’t need the version number and header info in our data files, so we can just strip them out.
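A sketch of steps 1 and 2 together (assuming the standard CloudFront access log header, where the first line is `#Version: 1.0` and comment lines start with `#`):

```python
EXPECTED_VERSION = "1.0"

def clean_log_lines(lines):
    """Verify the log file version, then strip the comment/header lines.

    Raises if the version doesn't match what our Athena schema expects,
    so a format change fails loudly instead of loading garbage.
    """
    lines = iter(lines)
    first = next(lines).strip()
    if first != f"#Version: {EXPECTED_VERSION}":
        raise ValueError(f"unexpected log version line: {first!r}")
    # Everything else that starts with '#' is header info we don't need
    return [line for line in lines if not line.startswith("#")]
```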
3. I spent some time on this one. In the Athena docs, they recommend you store your data in a columnar format:
Your Amazon Athena query performance improves if you convert your data into open source columnar formats, such as Apache Parquet or ORC
Basically these formats were built specifically for use with technologies like Hive and distributed stores like S3.
The problem for me was a lack of mature libraries in Rust which can create these formats, so this was an area that I decided to skip for now to limit the time spent on this project.
I will be using the tab-separated values format the logs are provided in, and the regex row formatter provided by Amazon.
4. This is where my ETL process will add the most value.
Athena Table and S3 Design
Amazon already did a lot of the work for me by providing the schema and regex for parsing the CloudFront logs; I just need a partition design for S3.
Partitioning by date seemed like good sense to me; with time-series data there really isn’t a query I should be doing which isn’t bounded by a date.
With my modifications the table definition looks more like this:
FIXME: Regex needs to be shortened to account for moving some of the date fields into the partition paths
I removed the date since it’s now in the partition data.
I changed the `time` column to an int instead of a string, and will store the time packed into decimal digits (BCD-style), e.g. 10:03:35 -> 100335. This way I can still query time ranges in my where clause in an essentially human-readable way.
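A quick sketch of that conversion (packed decimal digits rather than true BCD, but it yields the queryable integer form described above):

```python
def time_to_int(hhmmss: str) -> int:
    """'10:03:35' -> 100335, so a clause like
    WHERE time BETWEEN 100000 AND 110000 still reads like a human time range."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 10000 + m * 100 + s
```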
Then for auto partition repair to work, my data will need to go into s3 like this:
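I don’t have the exact layout reproduced here, but for Athena’s automatic partition repair (`MSCK REPAIR TABLE`) to pick partitions up, objects need Hive-style `key=value` prefixes. A sketch of building such a key (the `access-logs/` prefix is illustrative):

```python
from datetime import date

def s3_partition_key(day: date, filename: str) -> str:
    """Build an object key under Hive-style year=/month=/day= partition dirs."""
    return (
        f"access-logs/year={day.year}"
        f"/month={day.month:02d}/day={day.day:02d}/{filename}"
    )
```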
There is some tradeoff between the number of partitions and the amount of data in each. In my case it doesn’t seem to make sense to partition beyond one day. Even if there were a million hits to the site, how big would that end up being? I had one file with 5 rows in it that was almost exactly 1000 bytes, or about 200 bytes per row; a million hits would crudely extrapolate out to a 200MB file.
If I start getting more than a million hits a day I think I might need to reconsider my partitioning strategy
If I can find the time, it would be good to write a cron process that comes in every day and combines all of the previous day’s files into one file
The ETL Application
I started implementing the code for parsing the file and got through the first two items: verify the file version and remove the comments. Then I hit my first problem when I got to the part where I needed to start actually writing output. Turns out I didn’t know how to do this yet.
I started looking at the API for putting an S3 object; it looks like it takes a `Vec<u8>`… hrmmmm, I smell a problem. What happens if I’m reading a very very large file? Reading the file is handled in a streaming fashion, and I did some empirical testing to verify this: I took a ~300MB compressed, 1GB uncompressed file with about 13 million lines in it and fed it through my app in a unit test:
The good news is that, according to the memory usage reported by `/usr/bin/time -v`, it’s clearly not reading the entire file into memory.
The bad news:
Elapsed (wall clock) time (h:mm:ss or m:ss): 10:01.37
Given that I wasn’t doing anything with the rows, I’m not sure why it took 10 minutes to read the file. I thought this was a byproduct of running my test through the cargo test harness, so I built a simple binary within my project to run with `cargo run`.
This didn’t seem to have any effect on the performance.
I then took a look at the BufReader and realized the default buffer size was only 8KB; I increased this to 1MB, got bored waiting for it, and killed it after 7 minutes.
I increased the buffer to 100MB… and gave up after 9 minutes… Buffer size isn’t making any difference here. Let’s try a release build and see if that makes any difference!
5 minutes… FML
I like pouring salt on wounds; I wondered how long linux tools like zcat and wc would take:
Well at least we are close /s
I looked things over a little more, specifically how I was using the gzip library, libflate. I decided to try a different library, flate2-rs:
Huh… Well, at least we are back on track again: 13 million rows in 10 seconds in 41MB of RAM (it actually reported much less in top). I put the BufReader back to the default 8KB and removed my little test binary.
cargo test was now running in about a minute on a debug build, and around 10 seconds when running `cargo test --release`.
I did save a tag of the repo which you can run with `/usr/bin/time -v cargo run --release --bin benchmark`; however, to build it you’ll need a version of rusoto which includes the pull request I made (>= 0.31.0), and you’ll need to update `Cargo.toml` to use that version rather than the local path I have specified.
Ok, where were we…
Right: we can stream the read, but we can’t stream the write, because the S3 put_object method does not take a stream, meaning all the bytes we want in the file would need to be read into memory first.
For a Lambda capped at a 128MB execution environment this would certainly break for large files (probably anything over 100MB once you take into account the Python overhead and whatever memory our app is using).
I thought about this for a bit and noticed that S3 also has Multipart Upload support.
It also turns out that GZIP files can be concatenated…
The little wheels in my head are turning now: what if I chunk up the incoming file, gzip each chunk, and send it to S3 as part of a multipart upload? Then when I’m done, I tell Amazon I’ve completed the upload and it concatenates all the parts for me?? I wonder if this will work?
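A quick sanity check of the gzip half of that idea: each chunk compressed on its own is a complete gzip member, and a byte-wise concatenation of members is itself a valid gzip stream (one caveat on the S3 half: multipart upload requires every part except the last to be at least 5MB):

```python
import gzip

# Compress two chunks of log lines independently...
part1 = gzip.compress(b"2018-05-12\t100335\t/index.html\n")
part2 = gzip.compress(b"2018-05-12\t100336\t/about.html\n")

# ...concatenate the raw compressed bytes, which is effectively what S3
# multipart upload does with the parts you send via
# create_multipart_upload / upload_part / complete_multipart_upload...
combined = part1 + part2

# ...and the result decompresses as one file, just like `zcat` would show.
assert gzip.decompress(combined) == (
    b"2018-05-12\t100335\t/index.html\n"
    b"2018-05-12\t100336\t/about.html\n"
)
```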
I’m in the middle of moving into a new house and selling the old one. I was going to finish this post before publishing, but recent events have created a situation where publishing it in its current state will have some value.
Check out the source for this project on gitlab: https://gitlab.com/slim-bean/access-log-lambda