Syncing files from Linux to S3

Amazon S3 makes for a great off-site backup of your important files. However, automating the upload can be slightly tricky, particularly if you have a set of files that changes on a regular basis (like an iPhoto Library). My Thecus NAS is meant to be capable of syncing files into S3, but I’ve found the implementation to be limited and buggy, so I wanted to look at alternatives.

Given that Amazon charge for each PUT/GET/COPY/POST/LIST request, I wanted an equivalent of rsync for S3. That way, only files that had been added or changed locally would be uploaded, and files deleted locally would be removed from S3. I came across s3cmd, part of the S3 tools Project, which seemed to meet my criteria.

The rest of this post outlines the steps I took to get s3cmd installed and working on my CentOS virtual server, and set up to regularly sync my iPhoto Library into S3. In a future post, I may look at regularly pushing a copy of my iPhoto Library into Glacier for point-in-time longer-term storage.

Installation of s3cmd

Installation was pretty simple. The S3 tools Project provide a number of package repositories for the common Linux distros. All I had to do was:

  • Download the relevant .repo file (CentOS 5 in my case) into the /etc/yum.repos.d directory
  • Ensure that the repository is enabled by setting enabled=1 inside the s3tools.repo file (this was done by default)
  • Run yum install s3cmd as root
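
The three steps above can be sketched as a small shell function. This is only a sketch: the repo URL is passed in as an argument rather than hard-coded (copy it from the S3 tools site for your distro), and the names install_s3cmd, run and DRY_RUN are my own invention. With DRY_RUN=1 it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the install steps; install_s3cmd, run and DRY_RUN are illustrative names.
# Usage (as root): install_s3cmd <url-of-s3tools.repo-for-your-distro>

run() {
    # Print the command instead of running it when DRY_RUN=1.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

install_s3cmd() {
    repo_url="$1"
    repo_file="/etc/yum.repos.d/s3tools.repo"

    if [ -z "$repo_url" ]; then
        echo "usage: install_s3cmd <repo-url>" >&2
        return 2
    fi

    # 1. Download the .repo file into the yum repository directory.
    run curl -fsSL -o "$repo_file" "$repo_url" || return 1

    # 2. Check the repository is enabled (it was by default for me).
    if [ "${DRY_RUN:-0}" != "1" ] && ! grep -q '^enabled=1' "$repo_file"; then
        echo "warning: set enabled=1 in $repo_file" >&2
    fi

    # 3. Install the package.
    run yum install -y s3cmd
}
```

Running it once with DRY_RUN=1 is a cheap way to check the URL and paths before letting it touch /etc/yum.repos.d.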

Configuration of s3cmd

Before you can run s3cmd, you’ll need to configure it, so it knows what credentials to use, whether to use HTTPS, etc. I provided the master credentials for my AWS account, which is not too much of a problem for a single-user account. As far as I can tell, though, s3cmd simply signs requests with whatever Access Key and Secret Key you give it, so keys belonging to an IAM user with a suitably restricted policy should work too, and that is the better choice if you want to use this in the Enterprise.

To start the configuration, simply run s3cmd --configure and provide the following information:

  • Access Key
  • Secret Key
  • An encryption password (I elected not to provide one)
  • Whether to use HTTPS (I decided to use HTTPS)

You’ll then get a chance to test your settings, and presuming they work OK, you’re good to go! s3cmd will save your configuration to ~/.s3cfg.
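
Since the saved configuration holds your Secret Key in plain text, it’s worth a quick check once the wizard finishes. A minimal sketch, assuming the file uses s3cmd’s usual INI layout with access_key and secret_key entries. check_s3cfg is a name I made up, and the path is an argument so you can point it at wherever your config lives (s3cmd’s default is ~/.s3cfg).

```shell
#!/bin/sh
# check_s3cfg is an illustrative helper, not part of s3cmd.
# It confirms the config file has both keys set, then restricts it to its owner.

check_s3cfg() {
    cfg="${1:-$HOME/.s3cfg}"

    if [ ! -f "$cfg" ]; then
        echo "no config at $cfg; run: s3cmd --configure" >&2
        return 1
    fi

    # Both keys must be present with a non-empty value.
    for key in access_key secret_key; do
        if ! grep -Eq "^${key}[[:space:]]*=[[:space:]]*[^[:space:]]" "$cfg"; then
            echo "$key is missing or empty in $cfg" >&2
            return 1
        fi
    done

    # The secret key is stored in plain text, so make the file owner-only.
    chmod 600 "$cfg"
}
```

Calling check_s3cfg with no argument checks the default location.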

Using s3cmd

Once you’re installed and configured, you can start playing around with s3cmd to get a feel for how it works. It largely imitates other *nix commands, so you can do something like s3cmd ls to show all S3 buckets:

2011-01-12 21:26  s3://s3-bucket-1
2012-08-21 13:34  s3://s3-bucket-2

If you don’t already have any buckets, or you want to create a new bucket to sync your files into, s3cmd can help with that: s3cmd mb s3://s3-bucket-3 creates one (note the s3:// prefix on the bucket name). Listing all buckets should now show your new bucket too:

2011-01-12 21:26  s3://s3-bucket-1
2012-08-21 13:34  s3://s3-bucket-2
2012-10-03 21:51  s3://s3-bucket-3
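
If you script this, it’s handy to create the bucket only when it’s missing. A rough sketch: bucket_exists and ensure_bucket are hypothetical helper names, and the parsing assumes s3cmd ls prints one bucket per line in the date, time, s3://name layout shown above.

```shell
#!/bin/sh
# bucket_exists and ensure_bucket are illustrative helpers, not s3cmd features.

# Reads `s3cmd ls` output on stdin; succeeds if the named bucket is listed.
bucket_exists() {
    awk -v want="s3://$1" '$3 == want { found = 1 } END { exit !found }'
}

# Creates the bucket only if the listing does not already contain it.
ensure_bucket() {
    if s3cmd ls | bucket_exists "$1"; then
        echo "s3://$1 already exists"
    else
        s3cmd mb "s3://$1"
    fi
}
```

For example, ensure_bucket s3-bucket-3 would be a no-op on the second listing above. The comparison is against the whole name, so s3-bucket does not match s3-bucket-1.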

The S3 tools Howto page gives a good overview of what the tool can do, along with commands to run to get familiar with it.

As my ultimate goal was to be able to upload my iPhoto Library into S3, and then keep it in sync by regularly uploading the deltas, I wanted to make use of the ‘sync’ functionality of s3cmd. The command I have decided to use is as follows:

s3cmd sync --dry-run --recursive --delete-removed --human-readable-sizes \
--progress /mnt/smbserver/Pictures/iPhoto\ Library/* \
s3://s3-bucket-2/Pictures/iPhoto\ Library/ > /var/log/s3cmd-sync.log 2>&1

Note the use of the --dry-run option on the first run, so I get an idea of what will change. I can then run it again without that option for the changes to actually take effect. In all likelihood, I will set up a cron job to run this without the --dry-run option on a regular basis (otherwise I’ll forget, or simply be too busy to do it manually!).
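
For the cron job, one thing to guard against is a new run starting while the previous one is still going, since a sync can run for a long time. Here is a minimal sketch of a wrapper that skips the run if a lock is already held. The lock location, the with_lock name, the exit code 75 and the script path in the crontab comment are all placeholders of mine, not anything s3cmd provides.

```shell
#!/bin/sh
# Illustrative cron wrapper; with_lock, LOCK_DIR and the paths are placeholders.
# Example crontab entry (02:00 every night), assuming this is saved as a script:
#   0 2 * * * /usr/local/bin/s3-sync.sh

LOCK_DIR="${LOCK_DIR:-/tmp/s3cmd-sync.lock}"

# Runs the given command only if no other run holds the lock.
# mkdir is atomic, so only one process can create the lock directory.
with_lock() {
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "previous sync still running, skipping" >&2
        return 75   # arbitrary non-zero code meaning "try again later"
    fi
    "$@"
    status=$?
    rmdir "$LOCK_DIR"
    return "$status"
}

# In the script itself, the sync command above (minus --dry-run) would go here:
# with_lock s3cmd sync --recursive --delete-removed --human-readable-sizes \
#     /mnt/smbserver/Pictures/iPhoto\ Library/* \
#     s3://s3-bucket-2/Pictures/iPhoto\ Library/ >> /var/log/s3cmd-sync.log 2>&1
```

One caveat with this approach: if the machine reboots mid-sync, the lock directory is left behind and has to be removed by hand before the next run will start.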

It’s also worth noting that this isn’t necessarily a quick process. My iPhoto Library isn’t small (it currently weighs in at 56GB), and the s3cmd sync dry-run alone took about 15 minutes. The actual sync then took over 90 hours(!), although your mileage may vary depending on total volume of data and upload speed. A subsequent run of the same sync command took 30 minutes and transferred no data (as there were no local changes).
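
To put those numbers in context, 56GB in 90 hours works out to an average of roughly 1.4 Mbit/s (a little less, as the sync took over 90 hours). The arithmetic is easy to redo for your own library and connection. A small sketch, where avg_mbit and upload_hours are helper names of my own, sizes are decimal gigabytes, and protocol overhead is ignored:

```shell
#!/bin/sh
# avg_mbit and upload_hours are illustrative helpers.
# Sizes are decimal GB (10^9 bytes), so 1 GB = 8000 Mbit.

# Average rate in Mbit/s when SIZE_GB is moved in HOURS.
avg_mbit() {
    LC_ALL=C awk -v gb="$1" -v h="$2" 'BEGIN { printf "%.2f\n", gb * 8000 / (h * 3600) }'
}

# Hours needed to upload SIZE_GB at a sustained MBIT Mbit/s.
upload_hours() {
    LC_ALL=C awk -v gb="$1" -v mbit="$2" 'BEGIN { printf "%.1f\n", gb * 8000 / (mbit * 3600) }'
}

avg_mbit 56 90      # my initial sync: prints 1.38
upload_hours 56 10  # the same library over a 10 Mbit/s uplink: prints 12.4
```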

So far, my new solution is looking good! Hope this is helpful to someone else out there.
