Efficient uploading and persistent storage of neetoRecord videos using AWS S3


By Unnikrishnan KP

on March 16, 2024

This is part 2 of our blog on how we are building neetoRecord, a Loom alternative. Here are part 1 and part 3.

In the previous blog, we learned how to use the Browser APIs to record the screen and generate a WebM file. We now need to upload this file to persistent storage so that we have a URL for sharing our recording with our audience.

Uploading a large file all at once is time-consuming and prone to failure due to network errors. The recording is generated in parts: each chunk is pushed to an array, and the chunks are joined together at the end. So it would be ideal if we could upload these smaller parts as they are generated and then join them together in the backend once the recording is complete. AWS's Simple Storage Service (S3) is a perfect fit, as it provides cheap persistent storage along with a Multipart Upload feature.

S3 Multipart Upload allows us to upload large objects in parts. Rather than uploading the entire object in a single operation, a multipart upload breaks it down into smaller parts, each ranging from 5 MB to 5 GB. Once uploaded, these parts are aggregated to form the complete object.

Initiating the Upload

The process begins with an initiation request to S3, where a unique upload ID is generated. This upload ID is used to identify and manage the individual parts of the upload.

require "aws-sdk-s3"

s3 = Aws::S3::Client.new

resp = s3.create_multipart_upload({
  bucket: bucket_name,
  key: object_key
})

upload_id = resp.upload_id

Upload Parts

Once the upload is initiated, we can upload the parts to S3 independently. Each part is associated with a sequence number and an ETag (Entity Tag), a checksum of the part's data.

Note that the minimum size for a part is 5 MB (there is no minimum size limit on the last part of a multipart upload). So we buffer the recording chunks locally until they add up to more than 5 MB. Once we have a part larger than 5 MB, we upload it to S3.

part_number = 1
content = recorded_chunks

resp = s3.upload_part({
  body: content,
  bucket: bucket_name,
  key: object_key,
  upload_id: upload_id,
  part_number: part_number
})

puts "ETag for Part #{part_number}: #{resp.etag}"
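The buffering step described above can be sketched with a small, hypothetical `ChunkBuffer` (an illustration, not the actual neetoRecord code) that accumulates recorded chunks and hands back a part only once it crosses the 5 MB minimum:

```ruby
MIN_PART_SIZE = 5 * 1024 * 1024 # S3's minimum part size, except for the last part

class ChunkBuffer
  def initialize
    @buffer = +"" # mutable binary string accumulating recorded chunks
  end

  # Append a chunk; returns a part ready for upload once we cross 5 MB,
  # or nil while the buffer is still too small.
  def push(chunk)
    @buffer << chunk
    return nil if @buffer.bytesize < MIN_PART_SIZE
    flush
  end

  # The last part may be smaller than 5 MB, so flush whatever remains
  # when the recording ends.
  def flush
    return nil if @buffer.empty?
    part = @buffer
    @buffer = +""
    part
  end
end
```

Each part returned by `push` or the final `flush` would then go through the `upload_part` call shown above.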

Completing the Upload

Once all parts are uploaded, a complete multipart upload request is sent to S3, specifying the upload ID and the list of uploaded parts along with their ETags and sequence numbers. S3 then assembles the parts into a single object and finalizes the upload.

completed_parts = [
  { part_number: 1, etag: 'etag_of_part_1' },
  { part_number: 2, etag: 'etag_of_part_2' },
  # ...
  { part_number: N, etag: 'etag_of_part_N' }
]

resp = s3.complete_multipart_upload({
  bucket: bucket_name,
  key: object_key,
  upload_id: upload_id,
  multipart_upload: {
    parts: completed_parts
  }
})

Aborting and Cancelling

At any point during the multipart upload process, you can abort or cancel the upload, which deletes any uploaded parts associated with the upload ID.

s3.abort_multipart_upload({
  bucket: bucket_name,
  key: object_key,
  upload_id: upload_id
})

The uploaded file will finally be available at s3://bucket_name/object_key.

S3 Multipart Upload offers several advantages:

Fault tolerance

We can resume uploads from where they left off in case of network failures or interruptions. Also, uploading large objects in smaller parts reduces the likelihood of timeouts and connection failures, especially in high-latency or unreliable network environments.
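Because each part is uploaded independently, only the failed part needs to be retried. As an illustration, here is a minimal retry helper (an assumed wrapper, not part of the AWS SDK) that re-attempts a block with exponential backoff:

```ruby
# Retry a block up to max_attempts times, sleeping with exponential
# backoff between attempts. Re-raises the error once attempts run out.
def with_retries(max_attempts: 3, base_delay: 1)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError
    raise if attempts >= max_attempts
    sleep(base_delay * (2**(attempts - 1)))
    retry
  end
end

# Hypothetical usage: only the failed part is re-uploaded, not the whole file.
# with_retries { s3.upload_part(body: content, bucket: bucket_name, key: object_key, upload_id: upload_id, part_number: part_number) }
```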

Upload speed optimization

With multipart uploads, you can parallelize the process by uploading multiple parts concurrently, optimizing transfer speeds and reducing overall upload time.
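Here is a hedged sketch of that parallelization using plain Ruby threads. `upload_parts_concurrently` is a hypothetical helper, and the block it yields to stands in for the `s3.upload_part` call, returning each part's ETag:

```ruby
# Upload parts concurrently with a small pool of worker threads.
# Yields each part to the caller's block (assumed to perform the actual
# upload_part call) and collects { part_number:, etag: } hashes in order,
# ready for complete_multipart_upload.
def upload_parts_concurrently(parts, max_threads: 4)
  results = Array.new(parts.size)
  queue = Queue.new
  parts.each_with_index { |part, i| queue << [part, i] }

  workers = Array.new(max_threads) do
    Thread.new do
      loop do
        begin
          part, i = queue.pop(true) # non-blocking pop
        rescue ThreadError
          break # queue drained; this worker is done
        end
        results[i] = { part_number: i + 1, etag: yield(part) }
      end
    end
  end
  workers.each(&:join)
  results
end
```

The resulting array can be passed directly as `completed_parts` to the completion request shown earlier.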
