Creating and Uploading Huge Archive Files Swimmingly with Ruby

When we have to process a large number of files, we should be aware of the resources the process consumes. Here's the scenario I faced in my work recently: we have to create a single archive file from lots of files stored in an S3 bucket, and the resulting archive must be placed in another S3 bucket. An archive could contain up to 700,000 files, each of which is up to 10MiB, and the archive file itself will be less than 140GiB.

If we downloaded all 700K files and turned them into a single ZIP file at once, we would need to prepare a fair amount of disk space to process them. But how much? What if we have to run multiple tasks like that at the same time? Is it even possible to upload such a huge file in one go in the first place? There are a lot of things to consider. A pipeline is probably a good way to deal with this situation.

AWS S3 Multipart Upload

Fortunately, AWS S3 provides a way to upload a large file little by little. It's called “multipart upload.” With this feature, a file is split and uploaded as multiple parts, one by one. The upload can even start while the file is still being created, which means we don't have to prepare disk space for the whole archive.

Of course, the AWS SDK for Ruby has an API for this feature. Here is an example of a multipart upload.

s3_client = Aws::S3::Client.new(region: 'ap-northeast-1')
archive_object = Aws::S3::Object.new(bucket, 'archive.zip', client: s3_client)
archive_object.upload_stream do |upload_stream|
  upload_stream.binmode

  # write to upload_stream (one side of a pipe, IO object for writing)
end

When we tell the library to initiate a multipart upload by calling Aws::S3::Object#upload_stream, it gives us an IO object, which is the write end of a pipe. To upload a file, just write the content to that IO object. The AWS SDK will upload a part whenever the buffered data reaches a certain size (5MiB by default). That's it.
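
For instance, here is a minimal sketch of streaming a local file into upload_stream in small chunks; the file name source.bin and the 1MiB chunk size are just placeholder assumptions, not part of the original scenario.

# A minimal sketch, assuming 'source.bin' is a local file we want to upload.
# Each write goes into the SDK's buffer, and a part is uploaded every time
# the buffered data reaches the part size (5MiB by default).
archive_object.upload_stream do |upload_stream|
  upload_stream.binmode

  File.open('source.bin', 'rb') do |source|
    while (chunk = source.read(1024 * 1024)) # copy 1MiB at a time
      upload_stream.write chunk
    end
  end
end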

ZIP Archiving without Consuming Disk Space

It's great if we can avoid occupying disk space for a whole ZIP archive file. Actually, ZIP is well suited to this kind of situation.

ZIP Format Structure

Like other archive formats, ZIP includes metadata for each file, such as the filename, the position of each entry's content within the archive, its size, and so on. This metadata is called the “central directory header.” Although it is called a “header,” it isn't placed at the top of the archive file; it is placed at the tail.
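
To see that the directory really lives at the tail, here is a minimal sketch that looks for the End of Central Directory record (signature "PK\x05\x06"), which closes the central directory, within the last ~64KiB of an existing archive; the file name archive.zip is just a placeholder.

# A minimal sketch that locates the End of Central Directory (EOCD) record,
# which closes the central directory and sits near the end of a ZIP file.
EOCD_SIGNATURE = "PK\x05\x06".b
MAX_TAIL = 22 + 65_535 # fixed EOCD size plus the maximum comment length

File.open('archive.zip', 'rb') do |zip_file|
  tail_size = [zip_file.size, MAX_TAIL].min
  zip_file.seek(-tail_size, IO::SEEK_END)
  tail = zip_file.read
  offset = tail.rindex(EOCD_SIGNATURE)
  puts "EOCD record found #{tail_size - offset} bytes from the end" if offset
end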

Long story short, when we begin generating a ZIP file, there's no need to collect information on all the files in advance, because that information is only needed after all the archive entries have been written. We can simply compress and write each file entry to the archive one by one, remembering each file's metadata as we go. Once a file entry is added to the archive, the original file is no longer needed. So, to create a ZIP archive, it's not necessary to prepare disk space to store all the original files at once.

Pipes and zip_tricks gem

This time, I chose the zip_tricks gem to create ZIP files because it lets us use a pipe as its output. ZipTricks::Streamer.open takes an IO object as the destination of the ZIP archive. Since zip_tricks doesn't rewind or seek the IO, we can pass a pipe to it.

for_read, for_write = IO.pipe # for_read can be used to upload

ZipTricks::Streamer.open for_write do |zip|
  zip.write_deflated_file 'a-file.txt' do |input_stream|
    input_stream.write 'the content of a-file.txt'
  end
end

Why is it so important to be able to use a pipe here? Because if we can use a pipe as the sink of the ZIP archive generation, we don't even have to create an actual archive file on disk.

As we saw in the earlier section, Aws::S3::Object#upload_stream gives us the write end of a pipe as an IO object. So we can just pass it in as is.

archive_object.upload_stream do |upload_stream|
  upload_stream.binmode

  # upload_stream is a pipe
  ZipTricks::Streamer.open upload_stream do |zip|
    zip.write_deflated_file 'a-file.txt' do |input_stream|
      input_stream.write 'the content of a-file.txt'
    end
  end
end

The Complete Example Code

Finally, we can archive many files and upload a huge archive without writing any files to local disk.

require 'aws-sdk-s3'
require 'zip_tricks'

bucket = 'hibariya-sandbox'
s3_client = Aws::S3::Client.new(region: 'ap-northeast-1')

files_to_archive = %w[alpha bravo charlie delta] # whatever
archive_object = Aws::S3::Object.new(bucket, 'archive.zip', client: s3_client)

archive_object.upload_stream tempfile: false, part_size: 20 * 1024 * 1024, thread_count: 3 do |upload_stream|
  upload_stream.binmode

  ZipTricks::Streamer.open upload_stream do |zip|
    files_to_archive.each do |file_path|
      zip.write_deflated_file file_path do |input_stream|
        # stream each source object from S3 directly into the ZIP entry
        s3_client.get_object bucket: bucket, key: file_path, response_target: input_stream
      end
    end
  end
end

Note that the example uses the same bucket for both downloading and uploading to keep it simple. The keyword arguments for Aws::S3::Object#upload_stream are optional; see the documentation for more details.

Caveats

There are some things you should know when you use multipart uploads, such as the limit on the number of parts and what happens to incomplete uploads. You will probably want to add a lifecycle policy to the S3 bucket to clean up incomplete uploads regularly. For the details, check this document:

https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html
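
As one possible approach, here is a hedged sketch of a lifecycle rule that aborts incomplete multipart uploads after a few days, reusing the s3_client and bucket from the complete example above; the rule id and the seven-day period are arbitrary assumptions.

# A minimal sketch: add a bucket lifecycle rule that aborts incomplete
# multipart uploads after seven days. The rule id and the number of days
# are arbitrary choices.
s3_client.put_bucket_lifecycle_configuration(
  bucket: bucket,
  lifecycle_configuration: {
    rules: [
      {
        id: 'abort-incomplete-multipart-uploads',
        status: 'Enabled',
        filter: { prefix: '' }, # apply to the whole bucket
        abort_incomplete_multipart_upload: { days_after_initiation: 7 }
      }
    ]
  }
)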
