Copy a large number of files from a server to AWS S3

TL;DR: Use s3-parallel-put, a Python script that performs parallel uploads to S3.

I recently worked on an EC2 instance that needed a large number of files, 200 GB+ in total, copied over to AWS S3.

The overall goal was to move all of the data to S3 to support future growth and allow multiple instances to run in different AWS regions.

After a little browsing around I came across several tutorials on mounting an S3 bucket as a local folder using FUSE and s3fs. At the time, this seemed like the right thing to do.

http://sharadchhetri.com/2013/03/08/how-to-mount-s3-bucket-in-linux-ec2-instance

To avoid bothering the live server, I created a snapshot of the EBS volume attached to the EC2 instance and attached a volume made from it to a new instance.
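
For reference, the snapshot-and-attach step looks roughly like this with the AWS CLI; the volume, snapshot, and instance IDs, the availability zone, and the mount point are placeholders, and the device name will vary by instance type:

    # Snapshot the live volume, create a new volume from it, and attach it to the work instance
    aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "copy of live data volume"
    aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone eu-west-1a
    aws ec2 attach-volume --volume-id vol-0fedcba9876543210 --instance-id i-0123456789abcdef0 --device /dev/sdf
    # Mount the attached volume on the new instance (the device may show up as /dev/xvdf)
    sudo mkdir -p /mnt/data
    sudo mount /dev/xvdf /mnt/data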

I installed s3fs, mounted the bucket as a folder, and used rsync to copy all the files and folders up to S3. It worked, but it took ages, over 50 hours in total.
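
The s3fs-and-rsync approach was roughly the following sketch; the bucket name and paths are placeholders, and the exact s3fs options may differ between versions:

    # Store the S3 credentials where s3fs expects them
    echo 'ACCESS_KEY_ID:SECRET_ACCESS_KEY' > ~/.passwd-s3fs
    chmod 600 ~/.passwd-s3fs
    # Mount the bucket as a local folder and rsync the data into it
    mkdir -p /mnt/s3
    s3fs my-bucket /mnt/s3 -o passwd_file=~/.passwd-s3fs
    rsync -av /mnt/data/ /mnt/s3/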

While it was copying I read up on alternatives and found s3cmd's 'sync' option.

I switched to this instead and it performed much better, but it still took 24 hours or more to finish the copy to S3.
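
The s3cmd version is a one-liner once s3cmd has been set up with your keys (via s3cmd --configure); again, the bucket name and paths here are placeholders:

    # Sync the local data directory up to the bucket; re-runs only upload changed or missing files
    s3cmd sync /mnt/data/ s3://my-bucket/data/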

If I were to do this again, I would use s3-parallel-put, a Python script that uses Boto to perform parallel uploads. I've seen it mentioned in several places as the most useful tool for uploading a large number of files; the source code is on GitHub:

https://github.com/twpayne/s3-parallel-put
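
Based on the project's README, an invocation would look something like the sketch below; the bucket name, prefix, and process count are placeholders, so check the README for the options available in the current version:

    # Credentials are read from the environment
    export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
    export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
    # Upload /mnt/data to s3://my-bucket/data/ using multiple parallel worker processes
    s3-parallel-put --bucket=my-bucket --prefix=data/ --processes=16 /mnt/data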

Some further reading:

http://serverfault.com/questions/386910/which-is-the-fastest-way-to-copy-400g-of-files-from-an-ec2-elastic-block-store-v

http://stackoverflow.com/questions/11910509/need-help-deciding-between-ebs-vs-s3-on-amazon-web-services
