Download YouTube videos with AWS Lambda and store them on S3
Recently, I faced the challenge of downloading videos from YouTube and storing them on S3 using AWS Lambda.
Sounds easy? Remember that Lambda comes with a few limitations:
- 512 MB of disk space available at /tmp
- 3008 MB of memory
- 15 minutes maximum execution time
While working on a solution, I tried three approaches and ran into problems with the first two:
- Download the video from YouTube to /tmp and then upload it to S3: Does not work with videos larger than 512 MB.
- Download the video from YouTube into memory and then upload it to S3: Does not work with videos larger than ~3 GB.
- Download the video from YouTube and stream it to S3 while downloading: Works for all videos that can be processed within 15 minutes. I have not found a video that took longer than a few minutes to process.
Let’s look at how I finally solved the problem with a streaming approach in Node.js. I use the youtube-dl library to get easy access to YouTube videos.
First, we create a PassThrough stream in Node.js. A pass-through stream is a duplex stream where you can write on one side and read on the other side.
const stream = require('stream');
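A minimal sketch of how a pass-through stream behaves, using only the built-in `stream` module of Node.js. The `roundTrip` helper is a hypothetical name for illustration and not part of the original code; it writes text into one side of the stream and reads the same bytes back from the other side.

```javascript
const stream = require('stream');

// Hypothetical helper for illustration: write text into a PassThrough
// and read the same bytes back from its readable side.
function roundTrip(text) {
  const passthrough = new stream.PassThrough();
  passthrough.write(text);            // writable side
  const chunk = passthrough.read();   // readable side: a Buffer, or null if nothing is buffered
  return chunk === null ? '' : chunk.toString();
}

console.log(roundTrip('video bytes'));
```

In the actual function, nothing reads the data back by hand: the downloader writes into the stream and the S3 upload consumes it.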
Next, we need to write data to the stream. This is done by the youtube-dl library.
const youtubedl = require('youtube-dl');
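The wiring for this step might look roughly like the sketch below. It assumes the `youtube-dl` package is called as `youtubedl(url, args, options)` and returns a readable stream. The format selector, the `maxBuffer` option, and the `startDownload` helper name are assumptions for illustration. The library is passed in as the `downloader` parameter so the sketch can be tried without network access.

```javascript
// Sketch only: `downloader` stands in for require('youtube-dl').
function startDownload(downloader, videoUrl, target, onError) {
  // Assumed arguments: request the best MP4 format and lift the output
  // buffer limit of the underlying child process.
  const dl = downloader(videoUrl, ['--format=best[ext=mp4]'], { maxBuffer: Infinity });
  dl.once('error', onError); // surface download failures to the caller
  dl.pipe(target);           // write the video into the pass-through stream
  return dl;
}
```

Inside the Lambda handler this could be called as `startDownload(require('youtube-dl'), event.videoUrl, passthrough, callback)`, where `event.videoUrl` is an assumed field of the invocation event.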
And finally, we need to upload the stream to S3. We make use of the Multipart Upload feature of S3, which allows us to upload a big file in smaller chunks. This way, we only have to buffer a small chunk (64 MB in this case) in memory and not the whole file.
const AWS = require('aws-sdk');
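A sketch of the upload step, assuming the AWS SDK for JavaScript v2 and its `AWS.S3.ManagedUpload` class, which performs the multipart upload for a stream body. The helper names and the bucket and key parameters are hypothetical. The SDK is passed in as a parameter so the block has no hard dependency on it.

```javascript
const PART_SIZE = 64 * 1024 * 1024; // 64 MB per part, in bytes
const MAX_PARTS = 10000;            // S3 limit on parts per multipart upload

// Build the options for AWS.S3.ManagedUpload.
function buildUploadOptions(bucket, key, body) {
  return {
    params: { Bucket: bucket, Key: key, Body: body },
    partSize: PART_SIZE,
  };
}

// `AWS` is injected; in Lambda it would be require('aws-sdk').
// `body` is the readable side of the pass-through stream.
function uploadStream(AWS, bucket, key, body) {
  const upload = new AWS.S3.ManagedUpload(buildUploadOptions(bucket, key, body));
  return upload.promise(); // resolves once all parts are uploaded
}

// Largest object this part size allows before hitting the part limit (in bytes).
console.log(PART_SIZE * MAX_PARTS);
```

With 64 MB parts, the 10,000-part limit caps a single upload at roughly 625 GiB, far more than Lambda can transfer in 15 minutes. Note that `ManagedUpload` can send several parts concurrently (its `queueSize` option), so peak memory use may be a small multiple of the part size rather than exactly one part.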
That’s it. Now you can download YouTube videos with Lambda and upload them to S3 regardless of their size, as long as the transfer finishes within the 15-minute execution limit. I recommend running the code in a “big” Lambda function with 3008 MB of memory for better network performance.
You can find the full source code on GitHub including a SAM template to provision the AWS resources. Have fun!