24. July 2018

Run automated MongoDB backups to S3 using AWS Lambda and Zappa

MongoDB ships with a backup tool called mongodump, which is what you should use if you have the resources to set up a dedicated backup job that can install and run the required CLI tools.

If your application is limited in scope, short on resources and money, or you simply want backups without too much hassle, you can instead use a Python script running on AWS Lambda, deployed with Zappa, to automate backups to AWS S3 as described in this article.


We will use the zappa command for all interaction with AWS, so the first thing to do is install it using pip:

pip install zappa

Next, you can run zappa init to create a zappa_settings.json file. After it has been created, you will need to make the following adjustments:

  • since the automated backup function does not need to expose an HTTP interface, apigateway_enabled can be set to false.
  • as the backup is not time critical, it does not need a “keep warm” callback; disable it by setting keep_warm to false.
  • Zappa can automatically configure the CloudWatch triggers needed to periodically run the backup function. These are specified in the events key of your environment’s configuration. The simplest way of configuring an event is passing an object containing a function key that points to the handler you want to run and an expression key that determines when to run it. Most of the time you should be able to schedule what you want using Cron and Rate Expressions.

In the end, the settings file should look something like this:

{
    "backup": {
        "aws_region": "eu-central-1",
        "project_name": "mongo-backup",
        "runtime": "python3.6",
        "s3_bucket": "zappa-xxxxxxxxx",
        "events": [{
           "function": "main.handler",
           "expression": "cron(0 18 * * ? *)"
        }],
        "apigateway_enabled": false,
        "keep_warm": false
    }
}
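The cron expression above runs the backup daily at 18:00 UTC. If you don’t need a fixed time of day, a rate expression works just as well; a hypothetical alternative events entry could look like this:

```json
"events": [{
    "function": "main.handler",
    "expression": "rate(1 day)"
}]
```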

Next, the handler running the backup will need to be defined. We’ll use pymongo for interacting with MongoDB and boto3 for uploading the files to S3, so you will need to install these:

pip install pymongo boto3

The backup script

I created a GitHub repository containing the code I am using for running the backup (it contains a few more options), but this is what main.py should roughly look like:

from os import environ
from urllib.parse import urlparse

from pymongo import MongoClient
import boto3 as boto
from bson.json_util import dumps, JSONOptions, DatetimeRepresentation

s3 = boto.resource("s3")


def handler(event, context):
    # the dumps will be stored in a temp file
    # which also limits the size of a backup to slightly
    # below 512MB (Lambda's limit on file sizes) in this setup
    temp_filepath = "/tmp/mongodump.json"
    bucket_folder = environ.get("BUCKET_FOLDER", "backups")
    bucket_name = environ["BUCKET_NAME"]
    db_uri = environ["MONGO_URI"]
    db_name = environ["MONGO_DATABASE"]

    client = MongoClient(db_uri)
    database = client.get_database(db_name)

    json_options = JSONOptions(datetime_representation=DatetimeRepresentation.ISO8601)
    for collection_name in database.list_collection_names():
        with open(temp_filepath, "w") as f:
            for doc in database.get_collection(collection_name).find():
                f.write(dumps(doc, json_options=json_options) + "\n")

        s3.Bucket(bucket_name).upload_file(
            temp_filepath, "{}/{}.json".format(bucket_folder, collection_name)
        )
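Collections dumped as JSON Lines tend to compress well, so you could gzip the temp file before uploading it. The helper below is a sketch using only the standard library; the compress_dump name and the .gz path convention are my own, not part of the original script:

```python
import gzip
import shutil


def compress_dump(src_path, dest_path):
    """Gzip-compress the dump file so the uploaded S3 object is smaller."""
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb") as dest:
        # stream the file in chunks instead of loading it into memory
        shutil.copyfileobj(src, dest)
    return dest_path
```

In the handler you would then upload dest_path (e.g. /tmp/mongodump.json.gz) instead of the raw dump and give the S3 key a .gz suffix.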

For the function to run properly, the following environment variables need to be set:

  • MONGO_URI points to your MongoDB host and contains credentials if needed.
  • MONGO_DATABASE is the name of the database you want to back up.
  • BUCKET_NAME is the identifier of the bucket you want to store the backup files in. The script assumes this bucket already exists.
  • By default, the files will be stored in a folder called backups, but this can be overridden by setting BUCKET_FOLDER.

You can set these values in your zappa_settings.json using the aws_environment_variables key, or set them in the AWS console in case you don’t want them in your code.
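With aws_environment_variables, the relevant part of zappa_settings.json could look like this (all values below are placeholders, not working credentials):

```json
"backup": {
    "aws_environment_variables": {
        "MONGO_URI": "mongodb://user:password@example-host:27017",
        "MONGO_DATABASE": "mydb",
        "BUCKET_NAME": "my-backup-bucket",
        "BUCKET_FOLDER": "backups"
    }
}
```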

Install from pip

If you want to, you can also install the backup handler above using pip:

pip install mongo_lambda_backup

and have a single line in your main.py that passes the handler through to lambda:

from mongo_lambda_backup.handler import handler

Create the Lambda function

Next, you should be able to create your Lambda function. Before doing so, you might want to check that your local AWS credentials are set correctly and that you have the permissions required to create Lambda functions, CloudWatch Events, and S3 buckets.

zappa deploy backup

Zappa creates everything needed to execute your backup function, packages the code, and uploads it to Lambda. Whenever you make changes you want deployed, run:

zappa update backup

In case you don’t need your backup anymore, run undeploy:

zappa undeploy backup

As your backup probably doesn’t run very frequently and Lambda’s free tier is generous, this gives us automated MongoDB backups at no cost. Nice.

Keeping multiple backup versions

You may have noticed that the backup script overwrites the previous files on each run. If you want to keep multiple versions of your backups, I’d advise you to use S3 Versioning and Lifecycle Rules to manage how many versions you retain.
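As a sketch, versioning and a lifecycle rule can also be enabled from Python with boto3. The rule ID, the backups/ prefix, and the 30-day retention window below are assumptions for illustration, not values from this article:

```python
# Keep overwritten backups as "noncurrent" versions and expire them
# after 30 days (assumed retention window).
LIFECYCLE_RULES = {
    "Rules": [
        {
            "ID": "expire-old-backup-versions",  # assumed rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "backups/"},  # matches the default BUCKET_FOLDER
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }
    ]
}


def enable_backup_versioning(bucket_name):
    # boto3 is imported lazily so the rule above can be inspected without it
    import boto3

    s3 = boto3.client("s3")
    # turn on versioning so overwritten objects become noncurrent versions
    s3.put_bucket_versioning(
        Bucket=bucket_name, VersioningConfiguration={"Status": "Enabled"}
    )
    # attach the lifecycle rule that cleans up old versions
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name, LifecycleConfiguration=LIFECYCLE_RULES
    )
```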