This took longer than I’d be comfortable admitting. The idea was to extract portions of a data analysis pipeline from an IPython Notebook and have them be invokable on Lambda. The dependencies involved are common enough (scipy/numpy & pandas) that I’d imagine at least one other person will have to go through this.
My Lambda experience has been confined to Clojurescript/Java, and I haven’t written more than a couple of lines of Python in a few years — shield the eyes, steady the stomach, etc.
We want to end up with a repeatable process for producing a substantial (~50MB) zip file containing all of the dependencies of our handler — including any shared libraries.
Since the first (and only) opportunity we’re given to adjust the execution context of our Lambda-deployed code is inside a Python function — with no way to set environment variables upfront — our handler will spawn a Python subprocess with a modified load path before executing any application-specific code.
Setup / Once-off
- Generate a “template” zip file containing third-party deps, etc. on an Amazon Linux EC2 instance (deps.zip, let’s say)
- Upload the zip to an S3 bucket (similarly a once-off)
Deploy Application Code
- Add application-specific code to zip file (incl. subprocess harness)
- Deploy to S3
- Deploy to Lambda
There are other ways this could be done: zipping the environment on a development/build machine and adding pre-built shared libraries prior to Lambda deployment, static compilation from source, etc. Adjust as needed.
Accumulating Runtime Dependencies
virtualenv is probably the right way to do this.
```bash
#!/usr/bin/env bash
set -e -o pipefail

sudo yum -y upgrade
sudo yum -y groupinstall "Development Tools"
sudo yum -y install blas blas-devel lapack \
     lapack-devel Cython --enablerepo=epel

virtualenv ~/env && cd ~/env && source bin/activate

pip install numpy
pip install scipy
pip install pandas

for dir in lib64/python2.7/site-packages \
           lib/python2.7/site-packages
do
  if [ -d $dir ] ; then
    pushd $dir; zip -r ~/deps.zip .; popd
  fi
done

mkdir -p local/lib
cp /usr/lib64/liblapack.so.3 \
   /usr/lib64/libblas.so.3 \
   /usr/lib64/libgfortran.so.3 \
   /usr/lib64/libquadmath.so.0 \
   local/lib/
zip -r ~/deps.zip local/lib
```
To upload deps.zip, I’m imagining something like:

```console
$ aws s3 cp ~/deps.zip s3://my-bucket/
```
The idea is to end up with all of the Python package dependencies crammed at the top level (i.e. accessible via naive imports in our application code), and the shared libraries in a local/lib directory which we’ll take responsibility for loading once our entrypoint is invoked.
I’m not claiming this is an exhaustive list of the shared objects you’ll need, only the minimal set to do any work.
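As a quick sanity check that the archive has the intended shape (assuming deps.zip was built by the script above):

```console
$ unzip -l ~/deps.zip | head             # package directories at the archive root
$ unzip -l ~/deps.zip | grep local/lib   # the hand-copied shared objects
```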
Adjusting Runtime Environment
Our Lambda function’s handler is going to be running in a Python process which doesn’t have access to libblas, liblapack and friends, and that’s going to be a problem.
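To make the problem concrete: with the contents of deps.zip unpacked on a machine that lacks the system BLAS/LAPACK libraries (which is exactly Lambda’s situation), imports of the linked extension modules die at dynamic-link time with something along these lines (the exact library and wording may vary):

```console
$ python -c 'import scipy.linalg'
ImportError: libblas.so.3: cannot open shared object file: No such file or directory
```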
Here’s one approach:
```python
# handlers.py
import os, sys, subprocess, json

# Directory holding the shared libraries we copied into the zip
LIBS = os.path.join(os.getcwd(), 'local', 'lib')

def handler(filename):
    def handle(event, context):
        # Spawn a child interpreter with LD_LIBRARY_PATH pointing at our
        # bundled shared objects, handing it the event as JSON on stdin
        env = os.environ.copy()
        env.update(LD_LIBRARY_PATH=LIBS)
        proc = subprocess.Popen(
            ('python', filename),
            env=env,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT)
        (stdout, _) = proc.communicate(input=json.dumps(event))
        try:
            return json.loads(stdout)
        except ValueError:
            # Anything unparseable (tracebacks, stray prints) gets surfaced
            raise ValueError(stdout)
    return handle

def invoking(f):
    # Child-side: read the event from stdin, write f's result to stdout
    output = f(json.load(sys.stdin))
    json.dump(output, sys.stdout)

my_function = handler('my_function.py')
other_function = handler('other_function.py')
```
And a trivial application module, my_function.py:

```python
import handlers, pandas

def my_function(n):
    return (n * 2, pandas.__version__)

if __name__ == '__main__':
    handlers.invoking(my_function)
```
When handlers.my_function is specified as the handler for a Lambda function and then invoked, the parent process effectively runs:

```console
$ LD_LIBRARY_PATH=local/lib python my_function.py
```
- Some JSON representing the Lambda input (a number, in this case) is written to the child’s stdin
- In the child process, handlers.invoking reads the JSON from stdin, and passes its data representation to my_function
- The result is serialized to JSON and written to stdout by handlers.invoking
- handler (in the parent process) parses the child’s stdout and conveys the result back to the Lambda caller
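Before going anywhere near Lambda, the whole round-trip can be exercised locally. In a directory containing handlers.py, my_function.py and the unpacked local/lib (the build instance, say):

```console
$ echo 2 | LD_LIBRARY_PATH=local/lib python my_function.py
```

This ought to print a two-element JSON array: the doubled input, and whichever pandas version got baked into deps.zip.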
This obviously isn’t the be-all of inter-process communication — there are so many fancier ways this could be done. Logging, better error handling, package structure etc. can come later.
Deploying looks something like this (deps.zip being the output of our first step):
```console
$ aws s3 cp s3://my-bucket/deps.zip latest.zip
$ zip latest.zip handlers.py my_function.py
$ aws s3 cp latest.zip s3://my-bucket/

$ aws lambda create-function \
    --function-name my-function \
    --runtime python2.7 \
    --handler handlers.my_function \
    --code S3Bucket=my-bucket,S3Key=latest.zip \
    --role exquisite-role

$ aws lambda update-function-code \
    --function-name my-function \
    --s3-bucket my-bucket \
    --s3-key latest.zip
```
Invoking my-function with an input of, say, 2 ought to yield something like [4, "&lt;pandas version&gt;"]: a trivial example hopefully demonstrating that pandas can be successfully imported, and that multiplication remains possible.
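One way of invoking it from the CLI, assuming suitable credentials (newer versions of the AWS CLI may additionally want the payload explicitly marked as raw JSON, e.g. via --cli-binary-format):

```console
$ aws lambda invoke --function-name my-function --payload '2' output.json
$ cat output.json
```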