AWS Lambda: Parallelizing Invocations with S3 Batch Operations
I have a simple Lambda function that downloads historical equity trades for a given symbol (“AAPL”) and date (“2020-01-01”) and saves the output as a Parquet file to S3 for analysis. But how can I parallelize the invocations when my input list is very large? If I use Boto3 and Python on my local machine, the symbol+date downloads are invoked sequentially, which results in O(N) performance. Searching online for a better solution, I found a post by Ivan Klishch in which he used a separate controller Lambda that accepted a range of inputs and invoked the base Lambda function in a “fan-out” pattern. This works pretty well, but I think I figured out a better solution using S3 Batch Operations.
What is S3 Batch Operations?
AWS S3 Batch Operations is designed to perform large-scale batch operations on S3 objects using a manifest, which can be generated automatically from an S3 Inventory report. Among the operations that can be performed, you can specify a Lambda function to invoke; for each invocation, the bucket and key name of an object are provided.
Batch Operations seems useful for existing S3 objects but what about objects that don’t exist yet? Can we tell Batch Operations our desired objects and let it handle the Lambda invocations? The answer is yes, yes we can.
Generating a Custom Manifest with Lambda Arguments
According to the Batch Operations console, the manifest file can be any CSV with the columns “BucketName”, “S3Key”, and an optional “Version” column.
So to parallelize my historical equity trade downloads, I would just need to generate a CSV containing the arguments for my Lambda, such as:
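A manifest along these lines (the bucket name is hypothetical), where each row becomes one Lambda invocation and the “S3Key” column packs the symbol and date together:

```
my-trades-bucket,AAPL/2020-01-01
my-trades-bucket,AAPL/2020-01-02
my-trades-bucket,MSFT/2020-01-01
my-trades-bucket,MSFT/2020-01-02
```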
Here I am using the “S3Key” column value as my Lambda argument list, delimited by a forward slash (“/”), but really all three columns could be used as Lambda arguments since Batch Operations binds them to the event message passed to each invocation.
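On the receiving side, the Lambda can unpack those arguments from the Batch Operations event and return a per-task result in the shape the service expects. A minimal sketch of such a handler — the function name and the commented-out `download_trades` call are my own placeholders, not the author's code:

```python
def handler(event, context=None):
    # S3 Batch Operations delivers one task per invocation; the manifest's
    # "S3Key" column arrives as tasks[0]["s3Key"].
    task = event["tasks"][0]

    # Split the S3Key column back into the Lambda's real arguments.
    symbol, date = task["s3Key"].split("/")

    # download_trades(symbol, date)  # the actual download work, omitted here

    # Batch Operations expects this response shape; resultCode may also be
    # "TemporaryFailure" (retried) or "PermanentFailure".
    return {
        "invocationSchemaVersion": event["invocationSchemaVersion"],
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": [{
            "taskId": task["taskId"],
            "resultCode": "Succeeded",
            "resultString": f"downloaded {symbol} for {date}",
        }],
    }
```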
I use a simple Python script to loop over the desired symbols and dates to generate the headerless CSV “manifest.csv” and upload it to my S3 bucket so Batch Operations can find it:
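A sketch of that script, assuming a hypothetical bucket name and a small symbol/date list; the upload call is left commented out since it requires AWS credentials:

```python
import csv
import itertools

def build_manifest_rows(bucket, symbols, dates):
    # Cross every symbol with every date: one CSV row per Lambda invocation.
    return [(bucket, f"{symbol}/{date}")
            for symbol, date in itertools.product(symbols, dates)]

rows = build_manifest_rows("my-trades-bucket",
                           ["AAPL", "MSFT"],
                           ["2020-01-01", "2020-01-02"])

# Write with no header row, as Batch Operations expects.
with open("manifest.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Upload so Batch Operations can find it (requires AWS credentials):
# import boto3
# boto3.client("s3").upload_file("manifest.csv", "my-trades-bucket", "manifest.csv")
```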
Then I set up an S3 Batch Operations job and specify my custom manifest as the input:
And specify my downloader Lambda:
Configure an output directory for the batch results, including any download errors:
Then create the job config, activate it, and wait for results:
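The same job configuration can be expressed in a single Boto3 `create_job` call covering all the steps above: the custom manifest, the downloader Lambda, and the report location. All account IDs, ARNs, and names below are hypothetical placeholders, and the actual API call is left commented out since it needs AWS credentials; the ETag comes from the uploaded manifest object:

```python
job_params = {
    "AccountId": "123456789012",
    "ConfirmationRequired": False,  # activate the job immediately
    "Priority": 10,
    "RoleArn": "arn:aws:iam::123456789012:role/batch-ops-role",
    # The custom two-column manifest uploaded earlier:
    "Manifest": {
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],  # matches the headerless CSV
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::my-trades-bucket/manifest.csv",
            "ETag": "<etag-of-manifest.csv>",
        },
    },
    # The downloader Lambda, invoked once per manifest row:
    "Operation": {
        "LambdaInvoke": {
            "FunctionArn": "arn:aws:lambda:us-east-1:123456789012:"
                           "function:trade-downloader",
        },
    },
    # Where Batch Operations writes the completion report, including failures:
    "Report": {
        "Bucket": "arn:aws:s3:::my-trades-bucket",
        "Prefix": "batch-reports",
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "ReportScope": "AllTasks",
    },
}

# import boto3
# job = boto3.client("s3control").create_job(**job_params)
# print(job["JobId"])
```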
And quicker than you can say “mitochondria is the powerhouse of the cell”, the job completes and my data is available:
This method of using a custom input CSV of Lambda arguments lets AWS S3 Batch Operations handle all of the parallelization and is cost-effective (only $0.25 per batch job, compared to doing all the scheduling and invocation manually). There are of course multiple ways of ingesting data like this, and I welcome any feedback on better methods of doing so.