Rust Lambdas for AWS CloudFront

10 Jan 2018

Background

I mostly wanted to do this to gain some experience using Lambdas and Rust; the end state does, however, give me a database of website requests.

The goal here was to create a ~~CloudFront@Edge Lambda~~ normal Lambda (AWS currently only allows NodeJS Lambdas on their edge servers, which would likely mean I can’t use Rust) to parse website access logs and put the results in ~~DynamoDB~~ S3 (I opted away from Dynamo when I learned about Amazon Athena).

In sticking with a few themes for this site, I really want a pay per use implementation of this (no running EC2 instance or database), because I expect no traffic to this site and I don’t want to overpay for my disappointment.

Performance

Which is “generally” fastest: Java, Python, Node, or C#?

Ultimately the core code will be in Rust; however, Lambda can’t natively execute Rust code, so it has to be invoked from a language they do support. I wasn’t overly particular about which language to use for this, so I set out to research which would be the fastest (read: cheapest).

Cold Start Times

http://theburningmonk.com/2017/06/aws-lambda-compare-coldstart-time-with-different-languages-memory-and-code-sizes/

Running Performance

https://read.acloud.guru/comparing-aws-lambda-performance-when-using-node-js-java-c-or-python-281bef2c740f

Conclusions?

Java/C# unsurprisingly have the longest cold start.

Java is the fastest once running (this personally didn’t surprise me as the JIT is likely able to optimize the shit out of the very simple benchmark code)

Python starts fast, but the fastest start is not with the smallest code size/memory. Actually it’s the opposite: cold start times seem to decrease with increased memory/code size… (I should note that this was not the experience seen by a friend of mine who uses Lambdas; more investigation might be necessary here)

C#’s runtime performance and slow start rule it out for me (though I’m curious if this has improved; I have no idea how Amazon is implementing their C# Lambdas, but with the .NET Core project I am curious if this performance is better)

I really wanted to avoid Java for this. Don’t get me wrong, I love Java, but I wanted the lowest memory overhead and lowest cold start times (again, going cheap and assuming nobody visits my site)

Picking between javascript and python is a fair bit harder. Both seem to offer similar performance, and if we are really nitpicking performance it would likely come down to how we pass data between the outside wrapper code and our Rust implementation. Both can take advantage of the Rust Foreign Function Interface to avoid the need for spawning a new process, so now it’s an issue of defining that interface:

Fastest way to hand data to Rust.

The examples I saw basically converted the payload to json, and then in Rust I would have to convert it back. This is very flexible and makes the C interface pretty simple:

pub extern "C" fn handle(event_json: *const c_char, context_json: *const c_char) -> i32 {

However, I kind of hate the idea of this, because you incur a serialization and de-serialization penalty to execute your native code…

Amazon doesn’t really give you any information about how they implement Lambda behind the scenes, so there is no way for me to know how they create the python dicts (or javascript objects) used to contain the Lambda request information. This made it really hard for me to determine analytically whether python or javascript would be faster in this respect; you would have to measure it empirically.

A good future project I would like to do here would be some black box testing of their lambda environment, playing around with serialization/deserialization to see which performs the best.

If I just use simple primitives in the interface there is no serialization/de-serialization penalties at all. This is great for performance but would certainly be a nightmare if you needed to communicate more than a few values to your Rust code.

Lucky for me, in this project all I need from the Lambda input is the S3 Bucket, S3 Object Key, and S3 Region. I will be mapping what I need directly to primitives and calling a Rust function which has parameters for exactly the data I need. This avoids any serialization penalties and works well for small payloads:

pub extern "C" fn handle(bucket_name: *const c_char, object_name: *const c_char, region: *const c_char) -> i32 {
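
To make the primitive-only interface a bit more concrete, here is a minimal sketch of what the body of such a function could look like. This is not the code from the repo: the helper `c_str_arg` and the return codes (0, 1, 2) are made up for illustration, and a real shared library build would also need the function exported with an unmangled name.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical helper: borrow a C string argument as &str.
/// Returns None for a null pointer or for bytes that are not valid UTF-8.
fn c_str_arg<'a>(ptr: *const c_char) -> Option<&'a str> {
    if ptr.is_null() {
        return None;
    }
    // Safety: the caller (the python wrapper) must pass a NUL-terminated
    // string that stays alive for the duration of the call.
    let c_str = unsafe { CStr::from_ptr(ptr) };
    c_str.to_str().ok()
}

/// Same signature as the function above. The return codes are an
/// assumption for this sketch: 0 = ok, 1 = bad pointer/encoding, 2 = empty value.
/// A real cdylib would also need this symbol exported unmangled.
pub extern "C" fn handle(
    bucket_name: *const c_char,
    object_name: *const c_char,
    region: *const c_char,
) -> i32 {
    let args = (
        c_str_arg(bucket_name),
        c_str_arg(object_name),
        c_str_arg(region),
    );
    let (bucket, key, region) = match args {
        (Some(b), Some(k), Some(r)) => (b, k, r),
        _ => return 1,
    };
    if bucket.is_empty() || key.is_empty() || region.is_empty() {
        return 2;
    }
    // The real work (fetching and parsing the S3 object) would go here.
    0
}
```

The only unsafe part is turning the raw pointers into string slices; everything after that is ordinary safe Rust, which is a big part of why this interface is attractive for a handful of values.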

NOTE: It does look possible to hand structs between the wrapper and Rust. However, at the time I did this there wasn’t a lot of great documentation or examples on how this would be done, and it also opens you up to a world of possible issues in unsafe blocks in your Rust code. If I do get to a point where I want to send a lot of data into Rust from the Lambda request, this would probably be worth re-visiting, or at least considering alongside the probably much safer (but slower) serialization to json/xml idea discussed above.

Final Conclusion?

Python!

I really dislike javascript as a language
The thought of installing NodeJS into this project and creating another massive node_modules directory on my computer makes me dry-heave a little
I like python a lot better as a language
The mechanism for calling native code via python was really simple

And while I didn’t test this directly, I HIGHLY doubt there would be any significant performance difference of executing my simple primitive API between python or javascript

Builds

Because Rust compiles to native code, we need to build it in an environment compatible with the one the Lambdas run in. I figured the best way to accomplish this would be to build using the exact image the Lambda is executed in.

Setup an EC2 Build Server for Rust

I scripted all this with Ansible to make my life easier, the playbooks can be found in the git project in the ansible directory

The playbook performs the following steps:

Create a new VPC
Create a subnet in the VPC
Add a gateway to the VPC
Add a routing entry for the subnet which adds a default entry pointing to our new gateway
Create a new Security Group which allows access on port 22 incoming
Provision the EC2 instance in the new VPC with the new Security Group
Update all packages on the instance
Install Git
Install some necessary build tools (C Linker, OpenSSL Dev libraries)
Setup AWS CLI credentials/config file
Download Rust, extract it, install it
Reboot the instance

This gives us an exact copy of the environment our lambda will run in created in its own isolated network.

Build Process

I also wanted to automate the build process, this was my first implementation:

Commit/Push all local changes
Ansible: Checkout the tag on the ec2 instance
Ansible: Run the build script which runs the cargo build and creates the lambda distribution zip
Ansible: Invoke the AWS CLI to deploy the updated lambda

This worked however I ended up simplifying things and removing Ansible:

Shell Script: rsync the current project directory to the build server
Shell Script: execute commands via ssh on the build server to build/package the lambda
Shell Script: execute AWS CLI command via ssh on the build server to deploy the lambda

I made these changes for a few reasons:

Ansible was slow for performing this task and has no streaming stdout, making it hard to see why builds were failing and/or what the output of each process step was
It was very cumbersome to do a git commit/push for every change

The problem with this setup is that there is no link between what gets deployed and any commit in source control (since uncommitted files are being used for the build)

This is a major no-no… and I haven’t decided yet how I’m going to circle back around to fix this.

As this project matures and I’m not doing so much ad-hoc debugging I will need to revisit this process and make sure I get some traceability between the lambda deployed to amazon and a commit or tag in source control.

OpenSSL Fiasco

I spent 5 hours sorting this out. Locally I was able to build and run my lambda with my python test wrapper just fine.

I was also able to build just fine on my EC2 build server, however, when I deployed the Lambda I was getting this error:

module initialization error: /lib64/libcrypto.so.10: version `OPENSSL_1.0.2' not found (required by ./libaccessloglambda.so) 

What I discovered is that rust-openssl by default dynamically links the openssl libcrypto library. Ok no problem, I just need to check the version on my build server vs the stock image and see what changed. I have a step in my ansible playbook to update all packages so I must be updating openssl to a newer version right? Just fix that and we’re back in business right? Fucking think again…

I made my build server from the exact image Amazon told me they are using to run my lambda: Lambda Execution Environment

The problem comes about when I look at that default image, it has OpenSSL version 1.0.1 installed. HOWEVER when I install openssl-devel on my build server it updates OpenSSL to 1.0.2, and the nice folks over at Amazon have completely removed anything OpenSSL 1.0.1 from their repos…

Hrm, ok, looking at rust-openssl, it looks like you can pass in some environment variables and statically link openssl. I know there could be some argument here about whether static vs dynamic is a good or bad idea. But one thing I couldn’t find any documentation on is the process Amazon goes through when they update the Lambda runtime image. I decided that static linking would potentially help protect me from them doing a hotswap of the AMI they run my Lambda in, to one that has a different version of OpenSSL, causing a sudden and immediate failure of my Lambda… Perfect, let’s statically link and move on…

From the rust-openssl README:

OPENSSL_STATIC - If specified, OpenSSL libraries will be statically rather than dynamically linked.

Great, what do I set OPENSSL_STATIC to? I decided to go with 1, which seems to work; the value is probably not important.

First thing I learned: setting OPENSSL_STATIC on its own doesn’t do jack shit. After some more googling and looking at a few GitHub issues I determined that you must also specify either OPENSSL_DIR, or both OPENSSL_LIB_DIR and OPENSSL_INCLUDE_DIR

Ok, so I set up OPENSSL_LIB_DIR and OPENSSL_INCLUDE_DIR:

RUST_BACKTRACE=yes OPENSSL_STATIC=1 OPENSSL_LIB_DIR=/usr/lib64/ OPENSSL_INCLUDE_DIR=/usr/include/openssl/ cargo build --release

Problem: could not find native library ssl. I’ll admit, this one was me being a little boneheaded and half an hour later I realized it was looking for the .a static openssl binary and not the .so shared object.

This was easily fixed with sudo yum install openssl-static

Should be smooth sailing now, right? HA!

Project builds, then throws a fantastic error in the linker stage:

version node not found for symbol SSLeay_version@OPENSSL_1.0.1

Ok great… all my ssl libs are 1.0.2 now, so why is something referencing 1.0.1… I have no idea, and I still have no idea… I googled this for a while and eventually gave up, time to change tack again.

I decided I’ll just compile openssl on my build machine and use the compiled output to build my app.

Success!!! well… almost. I changed my cargo build to use OPENSSL_DIR instead of the other two options and pointed it at the prefix directory I used in the openssl config script.

The app built, no errors! My lambda deployed and ran! No errors! But then I look at the log:

The OpenSSL library reported an error

Kill me… At this point I was about ready to just delete all this shit and do this in Java… It would have been way easier, probably run faster, work better… why am I doing this in Rust? Only assholes write natively compiled code…

Well, lucky for me martinmroz is a hero and I found an issue he filed on github. Not only was he having the exact same problem, he put a massive amount of effort into solving it and posted a detailed solution of how he fixed it.

Turns out Amazon’s decision to use a RedHat based distro screws me again… (actually this may be the first problem, but I really really really prefer debian based distros to RedHat based, though I also really really like RedHat as a company and what they do for open source… It’s very complicated!)

The Amazon linux image keeps CA certs in /etc/pki/tls which is not the default location OpenSSL looks when compiled, the fix is to add --openssldir=/etc/pki/tls to the config script params.

That was it! It finally worked. In the end my openssl config line looks like this:

./Configure --prefix=/home/ec2-user/openssl/out linux-x86_64 -fPIC --openssldir=/etc/pki/tls

Note you will need to run sudo make install because you now need access to /etc dir

And my cargo build command looks like this:

RUST_BACKTRACE=yes OPENSSL_STATIC=1 OPENSSL_DIR=/home/ec2-user/openssl/out/ cargo build --release

All of this is wrapped up into the Ansible playbook start_build_server.yml and my deploy.sh script if you want to see how it all comes together.

A Solution Looking For A Problem

One of my favorite phrases I’ve adopted for working in a large company. I see this stuff all the time:

Let’s use MongoDB! I see that in a lot of tech blogs

Yes I know it’s 2018, and yes our company just adopted MongoDB last year only so they could store JSON files, something that easily could have been accomplished in SQL Server or Oracle who both have first class data types for JSON documents, and both of which we have infrastructure and teams in place to support (which we had nothing for Mongo)

We need to send this data through BizTalk or Cisco Information Server or Oracle Service Bus

Yes our company has all of these, and also Fuse/Servicemix. I think maybe the only Enterprise Service Bus we are missing is Mulesoft. ESB’s are stupid.

I digress. I now have a working Lambda which can retrieve an S3 object when it is added to a bucket so I can… Profit???

Jackpot! I have a solution looking for a problem! Upper mgmt here I come!

In all seriousness I do actually have one problem I’m solving thus far, I wanted experience using Rust and also using AWS Lambda, I’ve at least started accomplishing that.

But what should I do with these access logs?

Put it in a database? What kind of database? Really I don’t think I can answer those questions until I figure out what I want to do with the data.

What DO I want to do with the data? What am I looking for? What statistics?

Here are a few things I’ve come up with:

Requests per day (total and per page, non bot requests)
Unique visitors (IP’s, non bot requests)
Top referrer per page (figure out how people are getting to my content)

How would I access this data? Query by time range? Group by IP?

I wouldn’t think twice about putting this data in a relational database. I know exactly how to write sql queries for the above statistics

However, there are some better technologies out there for time-series data, Elasticsearch and InfluxDB are two that I have used and work very well.

The problem is, none of these options so far meet my requirement of pay per use; they all have a non-trivial monthly fee for just the service even if it sees no data.

DynamoDB is an option, a bonus is that I likely qualify for the free tier for this app, in fact, as mentioned above this was going to be my original solution.

That was until I discovered Amazon Athena. In a nutshell, Amazon took Apache Hive and combined it with Presto to allow you to run SQL queries against objects in S3.

Hive is used to project a table structure on your objects in S3 and Presto is used to wire it up.

This is kind of a perfect scenario for me, my costs would be only based on use, putting data in/out of s3 (and storage), and I’m able to write familiar SQL queries to access the data.

So now my app is basically an ETL process for Athena, and really it doesn’t even need to do much; Athena could be set up to directly access the log files shipped from CloudFront. But what kind of good corporate person would I be if I didn’t inject some piece of middleware that only I understand and can maintain to increase my job security??

Job security aside, there are a few other good reasons to massage the raw CloudFront data:

1. Verify file version
2. Remove unnecessary comment lines from the file
3. Convert the files into another format???
4. Store the files in a directory structure which can be automatically partitioned by Athena

1. One downside of Athena appears to be very very little tolerance for schema changes. If you wanted to add/remove a column from your datasource it appears you only have a couple options, retroactively apply this change to all your data, or create a new table for the changed data moving forward. You might be able to also handle changes in the configuration of the table (the row formatter) but this would be complex for sure.

Amazon puts a version number at the top of all their access logs, so the first thing we should do is verify the version number is what we expect and error out if it doesn’t match our schema, thus preventing any schema related load issues.

2. This is a simple one, but we don’t need the version number and header info in our data files, so we can just strip them out.
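
Items 1 and 2 can be sketched together in a few lines. This is a minimal sketch rather than the code from the repo: it assumes the file starts with a `#Version: 1.0` line (which is what CloudFront access logs look like at the time of writing), and the function name `copy_data_rows` is made up for illustration.

```rust
use std::io::{self, BufRead, Write};

/// The header line this sketch expects (an assumption; adjust it to
/// whatever version your table schema was written against).
const EXPECTED_VERSION: &str = "#Version: 1.0";

/// Check the version header, then copy every non-comment line to `out`.
/// Returns the number of data rows written.
fn copy_data_rows<R: BufRead, W: Write>(reader: R, mut out: W) -> io::Result<u64> {
    let mut lines = reader.lines();

    // Item 1: the first line must be the version we built the schema for.
    let first = match lines.next() {
        Some(line) => line?,
        None => return Err(io::Error::new(io::ErrorKind::InvalidData, "empty log file")),
    };
    if first.trim_end() != EXPECTED_VERSION {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("unexpected version header: {}", first),
        ));
    }

    // Item 2: drop the remaining '#' comment lines (such as #Fields).
    let mut rows = 0u64;
    for line in lines {
        let line = line?;
        if line.starts_with('#') {
            continue;
        }
        out.write_all(line.as_bytes())?;
        out.write_all(b"\n")?;
        rows += 1;
    }
    Ok(rows)
}
```

Because it only ever holds one line at a time, this works on any `BufRead` source without pulling the whole file into memory.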

3. I spent some time on this one. In the Athena docs, they do recommend you store your data in a columnar format:

Your Amazon Athena query performance improves if you convert your data into open source columnar formats, such as Apache Parquet or ORC

Basically these formats were built specifically for use with technologies like Hive and distributed stores like S3.

The problem for me was a lack of mature libraries in Rust which can create these formats, so this was an area that I decided to skip for now to limit the time spent on this project.

I will be using the tab separated values format the logs are provided in, and the Regex row formatter provided by Amazon

4. This is where my ETL process will add the most value

Athena Table and S3 Design

Amazon already did a lot of the work for me by providing the schema and regex for parsing the CloudFront logs; I just need a partition design for S3

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  `date` date,
  `time` string,
  `location` string,
  bytes bigint,
  requestip string,
  method string,
  host string,
  uri string,
  status int,
  referrer string,
  useragent string,
  querystring string,
  cookie string,
  resulttype string,
  requestid string,
  hostheader string,
  requestprotocol int,
  requestbytes bigint,
  timetaken double,
  xforwardedfor string,
  sslprotocol string,
  sslcipher string,
  responseresulttype string,
  httpversion string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES
 (
 "input.regex" = "^(?!#)([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)$"
 )
LOCATION 's3://your_log_bucket/prefix/';

Adding partitions by the date seemed like good sense to me, with time-series data there really isn’t a query I should be doing which isn’t bounded by a date.

With my modifications the table definition looks more like this:

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  `time` int,
  `location` string,
  bytes bigint,
  requestip string,
  method string,
  host string,
  uri string,
  status int,
  referrer string,
  useragent string,
  querystring string,
  cookie string,
  resulttype string,
  requestid string,
  hostheader string,
  requestprotocol int,
  requestbytes bigint,
  timetaken double,
  xforwardedfor string,
  sslprotocol string,
  sslcipher string,
  responseresulttype string,
  httpversion string
)
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES
 (
 "input.regex" = "^(?!#)([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)$"
 )
LOCATION 's3://your_log_bucket/prefix/';

FIXME: Regex needs to be shortened to account for moving some of the date fields into the partition paths

I removed the date since it’s now in the partition data.

I changed the time column to an int instead of a string, and will store the time as its digits packed into one decimal number (HHMMSS), e.g. 10:03:35 -> 100335. This way I can still query in my where clause for time ranges in an essentially human readable way.
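
The conversion is simple enough to sketch. The function name `time_to_int` is made up for illustration and is not taken from the repo:

```rust
/// Convert a log time such as "10:03:35" into the integer 100335
/// (hours * 10000 + minutes * 100 + seconds).
/// Returns None if the input is not a valid HH:MM:SS time.
fn time_to_int(time: &str) -> Option<u32> {
    let mut parts = time.split(':');
    let hours: u32 = parts.next()?.parse().ok()?;
    let minutes: u32 = parts.next()?.parse().ok()?;
    let seconds: u32 = parts.next()?.parse().ok()?;
    // Reject trailing fields and out-of-range values.
    if parts.next().is_some() || hours > 23 || minutes > 59 || seconds > 59 {
        return None;
    }
    Some(hours * 10_000 + minutes * 100 + seconds)
}
```

Since the digits keep their positions, integer comparison orders the values the same way the clock does, so a where clause like `time BETWEEN 90000 AND 170000` reads as 09:00:00 to 17:00:00.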

Then for auto partition repair to work, my data will need to go into s3 like this:

s3://your_log_bucket/prefix/year=2018/month=1/day=7/filename.gz
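
Building that object key from the `date` field of a log row could look something like the sketch below. The function name `partition_key` and its parameters are assumptions for illustration; the only thing taken from the layout above is the `year=/month=/day=` path shape with unpadded numbers.

```rust
/// Build the S3 object key (without the bucket) for a log file, given the
/// date of its rows in "YYYY-MM-DD" form.
/// Returns None if the date cannot be parsed.
fn partition_key(prefix: &str, date: &str, filename: &str) -> Option<String> {
    let mut parts = date.split('-');
    let year: u32 = parts.next()?.parse().ok()?;
    let month: u32 = parts.next()?.parse().ok()?;
    let day: u32 = parts.next()?.parse().ok()?;
    // Only a coarse sanity check; it does not know about month lengths.
    if parts.next().is_some() || !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    Some(format!(
        "{}/year={}/month={}/day={}/{}",
        prefix.trim_end_matches('/'),
        year,
        month,
        day,
        filename
    ))
}
```

Parsing the parts as integers drops the leading zeros, which keeps the path values in line with the int partition columns in the table definition.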

There is some tradeoff between the number of partitions and the amount of data. In my case it doesn’t seem to make sense to partition beyond one day. Even if there were a million hits to a site, how big would that end up being? A few megabytes? I had one file which had 5 rows in it and it was almost exactly 1000 bytes, or about 200 bytes a row. A million hits would crudely extrapolate out to a 200MB file (uncompressed).

If I start getting more than a million hits a day I think I might need to reconsider my partitioning strategy

If I can find the time, it would be good to write a cron process that comes in every day and combines all the previous day’s files into one file

The ETL Application

I started implementing the code for parsing the file. Got through the first two items, verify the file version and remove the comments. Then hit my first problem when I got to the part where I needed to start actually writing output. Turns out I don’t know how to do this yet.

I start looking at the API for putting an S3 object, looks like it takes a Vec<u8>… hrmmmm I smell a problem. What happens if I’m reading a very very large file? Reading the file is handled in a streaming fashion. I did some empirical testing to verify this, I took a ~300MB compressed, 1GB uncompressed file with about 13 million lines in it and fed it through my app in a unit test:

user@develop:~/projects/access-log-lambda$ /usr/bin/time -v cargo test
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running target/debug/deps/accessloglambda-9101a5ca8b4e29e5

running 3 tests
test tests::it_works ... ok
test tests::test_small_file ... ok
test tests::test_large_file has been running for over 60 seconds
test tests::test_large_file ... FAILED

failures:

---- tests::test_large_file stdout ----
        Retrieving and parsing took 600561ms
thread 'tests::test_large_file' panicked at 'assertion failed: `(left == right)`
  left: `27`,
 right: `13147028`', src/lib.rs:182:8
note: Run with `RUST_BACKTRACE=1` for a backtrace.


failures:
    tests::test_large_file

test result: FAILED. 2 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out

error: test failed, to rerun pass '--lib'
Command exited with non-zero status 101
        Command being timed: "cargo test"
        User time (seconds): 367.70
        System time (seconds): 222.09
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 10:01.37
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 39980
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 584
        Minor (reclaiming a frame) page faults: 12110
        Voluntary context switches: 1165
        Involuntary context switches: 71291
        Swaps: 0
        File system inputs: 146968
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 101  

The good news is that according to /usr/bin/time

Maximum resident set size (kbytes): 39980

Clearly it’s not reading the entire file into memory.

The bad news:

Elapsed (wall clock) time (h:mm:ss or m:ss): 10:01.37

Given that I wasn’t doing anything with the rows, I’m not sure why it took 10 minutes to read the file. I thought this was a byproduct of running my test through the cargo test harness, so I built a simple binary within my project to run with cargo run

This didn’t seem to have any effect on the performance.

I then took a look at the BufReader and realized the default buffer size was only 8kb. I increased this to 1MB, got bored waiting for it, and killed it after 7 mins.

I increased the buffer to 100MB… and I gave up after 9 mins… Buffer size isn’t making any difference here. Let’s try a release build and see if that makes any difference!
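
For reference, changing the buffer size is a one-line change with `BufReader::with_capacity`. Below is a minimal sketch of a streaming line counter with a configurable buffer; the name `count_lines` is made up for illustration, and the gzip decoding step is left out so the example only needs the standard library.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Count the lines in `source`, reading through a buffer of `buffer_bytes`.
/// Only one line is held in memory at a time.
fn count_lines<R: Read>(source: R, buffer_bytes: usize) -> io::Result<u64> {
    let mut reader = BufReader::with_capacity(buffer_bytes, source);
    let mut line = Vec::new();
    let mut count = 0u64;
    loop {
        line.clear();
        // read_until returns 0 only at end of input.
        if reader.read_until(b'\n', &mut line)? == 0 {
            break;
        }
        count += 1;
    }
    Ok(count)
}
```

Note that this counts a final line with no trailing newline as a line, which `wc -l` does not, so the two can disagree by one on the same input.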

5 mins… FML

I like pouring salt on wounds, so I wonder how long it takes linux tools like zcat and wc

user@develop:~/projects/access-log-lambda$ /usr/bin/time -v zcat test/test_large.gz | wc -l
        Command being timed: "zcat test/test_large.gz"
        User time (seconds): 7.25
        System time (seconds): 0.26
        Percent of CPU this job got: 97%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.73
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1484
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 145
        Voluntary context switches: 126
        Involuntary context switches: 3991
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
13147027

Well at least we are close /s

I looked over things a little more, specifically how I was using the gzip library libflate. I decided to try a different library flate2-rs

Retrieving and parsing took 10159ms
Processed 13147028 lines
        Command being timed: "cargo run --release --bin benchmark"
        User time (seconds): 10.27
        System time (seconds): 0.15
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.46
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 41164
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 11315
        Voluntary context switches: 72
        Involuntary context switches: 750
        Swaps: 0
        File system inputs: 0
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0 

Huh… Well at least we are back on track again: 13 million rows in 10 seconds in 41MB of RAM (it actually reported much less in top). I put the BufReader back to the default 8kb and removed my little test binary.

cargo test was now running in about a minute on a debug build, and around 10 seconds when running cargo test --release

I did save a tag of the repo which you can run with /usr/bin/time -v cargo run --release --bin benchmark. To build it, however, you’ll need a version of rusoto which includes the pull request I made (>= 0.31.0), and you’ll need to update Cargo.toml to use that version rather than the local path I have specified.

Ok where were we….

Right, we can stream read, but we can’t stream write because the S3 put_object method does not take a stream, meaning we would need all the bytes we want in the file to be read into memory first.

For a Lambda capped in a 128MB execution environment this would certainly break for large files (probably anything over 100MB when you take into account the python overhead and whatever memory our app is using)

I thought about this for a bit and noticed that the S3 object also has Multipart Upload support.

It also turns out that GZIP files can be concatenated…

The little wheels in my head are turning now, what if I chunk up the incoming file, gzip it, and send it to s3 as part of a multipart upload? Then when I’m done, tell Amazon I’m completed and it will concatenate all the files for me?? I wonder if this will work?
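
The chunking half of that idea can be sketched without touching S3 or gzip at all. One constraint I do know about: S3 requires every part of a multipart upload except the last one to be at least 5 MiB. The `PartBuffer` type below is made up for illustration and is untested against the real service; it assumes the bytes pushed into it are whatever will actually be uploaded (so, for this plan, the size that matters is the gzipped size of each chunk).

```rust
/// S3's minimum size for every multipart part except the last one.
const S3_MIN_PART_BYTES: usize = 5 * 1024 * 1024;

/// Hypothetical helper: accumulates bytes and hands back a complete part
/// once at least `min_part_bytes` have been collected.
struct PartBuffer {
    min_part_bytes: usize,
    current: Vec<u8>,
}

impl PartBuffer {
    fn new(min_part_bytes: usize) -> Self {
        PartBuffer { min_part_bytes, current: Vec::new() }
    }

    /// A buffer sized for S3's multipart minimum.
    fn for_s3() -> Self {
        Self::new(S3_MIN_PART_BYTES)
    }

    /// Add bytes; returns Some(part) when the part is big enough to upload.
    fn push(&mut self, bytes: &[u8]) -> Option<Vec<u8>> {
        self.current.extend_from_slice(bytes);
        if self.current.len() >= self.min_part_bytes {
            // Hand the full part to the caller and start a fresh one.
            Some(std::mem::take(&mut self.current))
        } else {
            None
        }
    }

    /// Whatever is left over becomes the final (possibly small) part.
    fn finish(self) -> Option<Vec<u8>> {
        if self.current.is_empty() { None } else { Some(self.current) }
    }
}
```

With this shape, memory use is bounded by roughly one part plus one pushed chunk, no matter how large the input file is, which is the whole point for a 128MB Lambda.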

2018/03/06 Status

I’m in the middle of moving into a new house and selling the old, I was going to finish this blog before publishing but recent events have created a situation where publishing this in its current state will have some value.

Check out the source for this project on gitlab: https://gitlab.com/slim-bean/access-log-lambda