Using Amazon Glacier and Python for Personal Backups

Typically the focus of this blog is on gaming, but both EB and I are technically inclined. Occasionally I plan to diverge from gaming and cover topics of a technical nature. Today I'm going to go into some research and coding I did to solve my backup situation through the use of Amazon Glacier.

A couple of weeks ago EB, one of our mutual associates (we'll call him DM), and I were having lunch when the subject of backups came up. I've got a limited backup solution here at the house, but my offsite backup solution was practically nonexistent. DM and EB both suggested Carbonite, which runs around $60/year for unlimited backups from one computer. I said I was looking into Amazon's new Glacier service, which promised prices around $0.01 per GB per month.

The Situation

My backup needs are relatively modest - there's not a lot of data that I generate that I consider worthy of archiving. Mostly it's family photos that can't be replaced if lost. A lot of my other data is already cloud-based, primarily via Google's Drive service. Prior to this experiment, my backup strategy was:

  1. Keep the important files (mostly those family photos) on the internal hard drive in my wife's machine.
  2. Back that machine up to an external Time Machine drive.
  3. Roughly every 6 months, burn the accumulated files to DVD and store the discs in our safe.

This was fine for the most part, except that my wife's hard drive was rapidly approaching full, and I was apprehensive about what would happen if there was catastrophic damage to the house - would the DVDs survive in the safe? What if the safe had gotten hot enough to melt or warp the plastic? The DVDs were only burned around once every 6 months, and what would happen if both of the drives (the internal one and the external Time Machine drive) failed?

What I felt was missing was:

  1. A true offsite copy that would survive something catastrophic happening to the house (and the safe).
  2. Archival backups that happen more often than a DVD burn every 6 months.

Putting Data on Ice

The first problem is where I thought Glacier might come in. Amazon Glacier is a service that's part of Amazon's Web Services offerings. It's intended for long-term, low-cost, low-availability backups. Inserting data into Glacier and leaving it there is cheap, but listing what's there or retrieving it is slow and somewhat more costly.

This was my first time using AWS, so I had something of a learning curve. It surprised me that Amazon made no effort to provide a client. I tried out a couple of third-party clients, which worked fine for the most part. I tried both Fast Glacier and Cloudberry Backup. Both were competent clients, and if I hadn't been going for a fully-automated Linux-based solution, I probably would have been happy with either.

The main things I learned as a result of experimenting with a couple of different clients were that Glacier only supports a single metadata field per archive - the archive description - and you can't change it once you've set it. Further, Glacier assigns each archive a random string called an "Archive ID" which you are responsible for keeping track of. You can list a vault's inventory if you wish, but this is a slow process and Amazon intends it only to be used in the case of a catastrophe where you lose your copy of the IDs.

Both prepackaged clients I tried used their own description field syntax - Fast Glacier stored some sort of checksum, hash, random number, etc. in the field, while Cloudberry stored the full path to the archive file. They also weren't terribly transparent with the archive IDs. This made reading the inventory later a real pain. I had to go back and check the size field to correlate the archives I'd already uploaded with their archive IDs. I could have removed the files and reinserted them - my OCD almost wouldn't let me leave well enough alone - but I finally decided it wasn't worth the trouble. Finally, both clients limited the number of concurrent upload threads they'd allow before I had to pay. This capped my bandwidth and made the uploads take longer.

After messing around a bit with Windows clients (and royally screwing up my vault inventory, sigh...) I went looking for a better solution. I'm a big fan of Python, so finding an AWS Python library seemed like a natural solution. I found boto, which is a fairly robust AWS library that is easy to install and contains support for Glacier. There are API references available for its Glacier support.

One thing I learned from using boto is that its abstracted APIs are less feature-rich and harder to use than the lower-level ones. With the abstracted APIs, the library assumes that you want to first inventory your vaults, and then pass all of this information to functions that give you little control over how the underlying implementation handles files. For instance, there's a concurrent file uploader, and it's possible to use the "Vault" object to initiate a concurrent upload. However, you can't specify the archive description field, so it is left blank and can't be changed afterwards. You also can't specify a chunk size or the number of upload threads.

Having realized this, I switched to just using the underlying implementation directly. It actually meant less work and less code!

In the following scripts, you'll need to know your Access Key ID and Secret Key, which you can generate via the AWS control panel.

Script: Archive File Upload

Here's my script for uploading an archive file:



[python]
from boto.glacier.layer1 import Layer1
from boto.glacier.concurrent import ConcurrentUploader
import sys
import os.path

access_key_id = "..."
secret_key = "..."
target_vault_name = '...'
fname = sys.argv[1]

if not os.path.isfile(fname):
    print("Can't find the file to upload!")
    sys.exit(-1)

glacier_layer1 = Layer1(aws_access_key_id=access_key_id, aws_secret_access_key=secret_key)

# 32 MB chunks; the uploader defaults to 10 upload threads
uploader = ConcurrentUploader(glacier_layer1, target_vault_name, 32*1024*1024)

print("operation starting...")

# The second argument is the archive description - I just reuse the file name
archive_id = uploader.upload(fname, fname)

print("Success! archive id: '%s'" % (archive_id,))
[/python]


This uses a 32MB chunk size, and sets the description to be the same as the file name. By default, the concurrent uploader will use 10 upload threads, which was plenty to saturate my upload bandwidth.

The chunk size can be important, since in addition to storage, Amazon bills you per 1,000 requests. If you upload a large file in tiny chunks, you'll pay for far more upload requests than if you'd done it in larger chunks.
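
For example, here's some back-of-the-envelope math (the 8GB archive size is just a made-up example) showing how the part size changes the number of upload requests for a single file:

[python]
import math

# Hypothetical archive size: 8 GB
archive_size = 8 * 1024 * 1024 * 1024

# Number of upload-part requests needed at 1 MB vs 32 MB part sizes
for part_size in (1 * 1024 * 1024, 32 * 1024 * 1024):
    parts = int(math.ceil(archive_size / float(part_size)))
    print("part size %2d MB -> %4d upload requests" % (part_size / (1024 * 1024), parts))
[/python]

With 1 MB parts that's 8192 requests for the one file; with 32 MB parts it's 256.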

Script: Initiate Job to Get Vault Inventory

If you want to grab a vault's inventory, you can initiate the job with:



[python]
from boto.glacier.layer1 import Layer1

access_key_id = "..."
secret_key = "..."
target_vault_name = '...'

glacier_layer1 = Layer1(aws_access_key_id=access_key_id, aws_secret_access_key=secret_key)

print("operation starting...")

# Kick off an inventory-retrieval job; the results won't be ready for several hours
job_id = glacier_layer1.initiate_job(target_vault_name, {"Description": "inventory-job", "Type": "inventory-retrieval", "Format": "JSON"})

print("inventory job id: %s" % (job_id,))

print("Operation complete.")
[/python]


Once the job is initiated, you'll have to wait a few hours to get the results. You can use the AWS control panel to subscribe to job updates for a particular vault, which means you'll get an email when the job is ready, and you can download the results.
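
If you'd rather poll than wait for the email, boto's layer1 also exposes Glacier's DescribeJob call. Here's a minimal sketch - untested on my end, and it assumes the response contains the Completed and StatusCode fields described in the Glacier docs:

[python]
from boto.glacier.layer1 import Layer1
import sys
import time

access_key_id = "..."
secret_key = "..."
target_vault_name = '...'
job_id = sys.argv[1]

glacier_layer1 = Layer1(aws_access_key_id=access_key_id, aws_secret_access_key=secret_key)

# Check on the job once an hour until Glacier reports it as finished
while True:
    job = glacier_layer1.describe_job(target_vault_name, job_id)
    print("job status: %s" % (job['StatusCode'],))
    if job['Completed']:
        break
    time.sleep(60 * 60)

print("Job finished - grab the results with get_job_output.")
[/python]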

Script: Read Vault Inventory Job Result

Once the inventory job is ready, you can grab the results with:



[python]
from boto.glacier.layer1 import Layer1
import sys

access_key_id = "..."
secret_key = "..."
target_vault_name = '...'

# Optional first argument: the job id returned when the inventory job was initiated
if len(sys.argv) < 2:
    jobid = None
else:
    jobid = sys.argv[1]

glacier_layer1 = Layer1(aws_access_key_id=access_key_id, aws_secret_access_key=secret_key)

print("operation starting...")

if jobid is not None:
    # Fetch the completed job's output (the vault inventory, as JSON)
    print(glacier_layer1.get_job_output(target_vault_name, jobid))
else:
    # No job id given - list the jobs still in progress on the vault
    print(glacier_layer1.list_jobs(target_vault_name, completed=False))

print("Operation complete.")
[/python]


You can also call this script with no job-id specified, in which case it will list all active jobs on the vault.

Hopefully from these examples you can see that there's not much to working with Glacier using boto. The actual boto involvement is only a few lines, honestly. There's not even that much in the way of complex Python magic here. As I mentioned, I'm using boto's "layer1" directly, since the upper abstraction layers don't do much to make the library easier to use.
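
I haven't actually had to restore anything yet, but as far as I can tell from the Glacier documentation, pulling an archive back out is just one more job, keyed on the archive ID you saved at upload time. Something like this (untested) should kick it off, and then you grab the output the same way you grab the inventory:

[python]
from boto.glacier.layer1 import Layer1

access_key_id = "..."
secret_key = "..."
target_vault_name = '...'

# The archive ID that was printed when the archive was uploaded
archive_id = "..."

glacier_layer1 = Layer1(aws_access_key_id=access_key_id, aws_secret_access_key=secret_key)

# Retrieval is another asynchronous job, like the inventory request
job_id = glacier_layer1.initiate_job(target_vault_name, {"Description": "retrieval-job", "Type": "archive-retrieval", "ArchiveId": archive_id})

print("retrieval job id: %s" % (job_id,))
[/python]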

There are a few more things I'd like to add to this - one in particular is the ability to see the status of an in-progress concurrent upload. Currently, I just check my outgoing bandwidth to confirm that the job is still in progress. I'm not 100% sure what happens if an archive upload is aborted - do I need to clean up in some way?
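
My best guess - and I haven't tested this against a real aborted upload - is that an interrupted concurrent upload leaves an incomplete multipart upload sitting in the vault, which layer1 can list and abort:

[python]
from boto.glacier.layer1 import Layer1

access_key_id = "..."
secret_key = "..."
target_vault_name = '...'

glacier_layer1 = Layer1(aws_access_key_id=access_key_id, aws_secret_access_key=secret_key)

# List multipart uploads that were started but never completed
response = glacier_layer1.list_multipart_uploads(target_vault_name)
for upload in response['UploadsList']:
    print("incomplete upload: %s (%s)" % (upload['MultipartUploadId'], upload['ArchiveDescription']))
    # Uncomment to throw away the incomplete upload:
    # glacier_layer1.abort_multipart_upload(target_vault_name, upload['MultipartUploadId'])
[/python]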

The Plan Comes Together

To integrate Glacier into our backup situation, I set up the following procedure:

  1. We copy data to a flash drive, which is attached to a server on our internal house network. This drive has been partitioned into two 4GB partitions for backup use. We use them one at a time.
  2. A nightly script rsyncs these partitions to a server that I rent in another state (solving the short-term backup issue).
  3. When one of the partitions is full, we switch to the other and initiate the "archival" backup:
    1. The data in the full partition is consolidated and compressed, then burned to DVD and inserted into Glacier.
    2. Once the backups are verified, the full partition is cleared for its next use.

Step 3 is currently a manual operation, although if I got savvy I could probably just make sure there's always a blank DVD in the Linux machine and script the whole thing from start to finish.
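
If I do get around to automating it, the Glacier half would probably look something like this rough sketch (the paths, the date-stamped file name, and the vault name are all placeholders, and the DVD burning isn't wired in):

[python]
import subprocess
import time
from boto.glacier.layer1 import Layer1
from boto.glacier.concurrent import ConcurrentUploader

access_key_id = "..."
secret_key = "..."
target_vault_name = '...'
full_partition = "/mnt/backup1"  # the partition that just filled up
archive_path = "/tmp/backup-%s.tar.gz" % time.strftime("%Y%m%d")

# Consolidate and compress the full partition into a single archive file
subprocess.check_call(["tar", "-czf", archive_path, "-C", full_partition, "."])

# Upload it with the same settings as the upload script above
glacier_layer1 = Layer1(aws_access_key_id=access_key_id, aws_secret_access_key=secret_key)
uploader = ConcurrentUploader(glacier_layer1, target_vault_name, 32 * 1024 * 1024)
archive_id = uploader.upload(archive_path, archive_path)

print("archive id: '%s'" % (archive_id,))
[/python]

That still leaves recording the archive ID and clearing the partition, but it would take care of the tedious parts.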

I'm also storing the archive IDs of each file in a Google Doc which is shared with my wife, so that we can both tell whether a particular batch of files has been backed up. It helps that she's also tech savvy and can understand the process without getting confused. I'm not sure everyone is willing to deal with this level of complexity, even if it seems pretty simple to me.

Conclusion

I know if I tried to convince EB or DM that my solution was superior, they'd probably balk at the setup I had to go through. They'd probably be right! For them, something like Carbonite, which is very user friendly and hands-off, is probably better. Usability, and their level of familiarity with things like rsync, Python, and Linux, is a major factor. However, my solution is cheaper and uses bandwidth on my schedule, which I like.

I learned a lot about Glacier in the process, and I think overall I'm happy with how this solution turned out. I feel comfortable knowing that my offsite backups are secure in Amazon's cloud.