ETL with a Glue Python Shell Job: Load data from S3 to Redshift
Gaining valuable insights from data is a challenge. After collecting data, the next step is to extract, transform, and load (ETL) the data into an analytics platform like Amazon Redshift. Luckily, there is a platform to build ETL pipelines: AWS Glue.
In short, AWS Glue provides three things: managed infrastructure to run ETL jobs, a data catalog to organize data stored in data lakes, and crawlers to discover and categorize data.
In the following, I would like to present a simple but exemplary ETL pipeline to load data from S3 to Redshift. It works as follows:
- Someone uploads data to S3.
- An S3 event triggers a Lambda function.
- The Lambda function starts a Glue job.
- The Glue job executes an SQL query to load the data from S3 to Redshift.
AWS Glue offers two different job types:
- Apache Spark
- Python Shell
An Apache Spark job allows you to do complex ETL tasks on vast amounts of data. However, the learning curve is quite steep. Luckily, there is an alternative: Python Shell. A Python Shell job is a perfect fit for ETL tasks with low to medium complexity and data volume.
Therefore, I recommend a Glue job of type Python Shell to load data from S3 to Redshift with little or no transformation.
All you need to configure a Glue job is a Python script. The code example executes the following steps:
- Import modules that are bundled by AWS Glue by default.
- Define some configuration parameters (e.g., the Redshift hostname).
- Read the S3 bucket and object from the arguments (see getResolvedOptions) handed over when starting the job.
- Establish a connection to Redshift.
- Increase the statement timeout (see statement_timeout) to one hour.
- Execute the COPY query to tell Redshift to load the object from S3.
from pgdb import connect
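The steps listed above could look roughly like the following sketch. It is not the original script: all configuration values, the argument names s3_bucket and s3_object, the target table, and the CSV format are assumptions for illustration. The Glue-only imports (awsglue and pgdb) live inside main() so that the helper functions can be tried out locally.

```python
import sys

# Hypothetical configuration values; replace them with your own.
RS_HOST = "my-cluster.example.eu-west-1.redshift.amazonaws.com"
RS_PORT = "5439"
RS_DATABASE = "dev"
RS_USER = "etl"
RS_PASSWORD = "change-me"  # placeholder only
RS_SCHEMA = "public"
RS_TABLE = "events"
RS_IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-copy"
ONE_HOUR_MS = 60 * 60 * 1000


def build_copy_statement(schema, table, bucket, key, iam_role):
    """Return a COPY statement that loads one S3 object into a table."""
    # Bucket and key arrive from the upload event, so refuse values that
    # would break out of the quoted S3 URL.
    if "'" in bucket or "'" in key:
        raise ValueError("bucket and key must not contain single quotes")
    return (
        f"COPY {schema}.{table} "
        f"FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV;"  # assumed file format
    )


def load_object(connection, copy_statement, timeout_ms=ONE_HOUR_MS):
    """Raise the statement timeout, run the COPY statement, and commit."""
    cursor = connection.cursor()
    try:
        cursor.execute(f"SET statement_timeout TO {int(timeout_ms)};")
        cursor.execute(copy_statement)
        connection.commit()
    finally:
        cursor.close()


def main():
    # These modules are only available inside the Glue Python Shell runtime.
    from awsglue.utils import getResolvedOptions
    from pgdb import connect

    # The job is started with --s3_bucket and --s3_object (assumed names).
    args = getResolvedOptions(sys.argv, ["s3_bucket", "s3_object"])
    connection = connect(
        host=f"{RS_HOST}:{RS_PORT}",
        database=RS_DATABASE,
        user=RS_USER,
        password=RS_PASSWORD,
    )
    try:
        statement = build_copy_statement(
            RS_SCHEMA, RS_TABLE, args["s3_bucket"], args["s3_object"], RS_IAM_ROLE
        )
        load_object(connection, statement)
    finally:
        connection.close()
```

In the actual job script, you would call main() as the last line. Keeping the SQL construction in its own function makes it easy to check the generated COPY statement without a Redshift cluster.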
To trigger the ETL pipeline each time someone uploads a new object to an S3 bucket, you need to configure the following resources:
- Create a Lambda function (Node.js) and use the code example below to start the Glue job.
- Attach an IAM role to the Lambda function that grants permission to start the Glue job.
- Create an S3 Event Notification that invokes the Lambda function each time someone uploads an object to your S3 bucket.
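If you prefer to script the last step, the S3 Event Notification could be configured with boto3 along these lines. This is a minimal sketch: the function names are made up for illustration, and it assumes that S3 has already been allowed to invoke the Lambda function. Note that put_bucket_notification_configuration replaces the bucket's existing notification configuration.

```python
def notification_configuration(lambda_arn):
    """Build a configuration that invokes the function on every new object."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    }


def configure_bucket(s3_client, bucket, lambda_arn):
    """Apply the notification configuration to the bucket."""
    # s3_client is expected to be boto3.client("s3").
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=notification_configuration(lambda_arn),
    )
```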
The following example shows how to start a Glue job and pass the S3 bucket and object as arguments.
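The Lambda function described here uses Node.js. To stay in the same language as the Glue script, here is a minimal Python sketch of the same idea using boto3. The job name and the argument names --s3_bucket and --s3_object are assumptions; they have to match whatever your Glue script reads with getResolvedOptions.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "s3-to-redshift"  # hypothetical job name


def job_arguments(record):
    """Translate one S3 event record into Glue job arguments."""
    s3 = record["s3"]
    return {
        "--s3_bucket": s3["bucket"]["name"],
        # Object keys arrive URL-encoded in S3 event notifications.
        "--s3_object": unquote_plus(s3["object"]["key"]),
    }


def handler(event, context, glue_client=None):
    """Start one Glue job run per uploaded object and return the run IDs."""
    if glue_client is None:
        import boto3  # available in the Lambda Python runtime

        glue_client = boto3.client("glue")
    run_ids = []
    for record in event["Records"]:
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments=job_arguments(record),
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

The optional glue_client parameter exists only to make the handler easy to test with a stub; Lambda itself calls the handler with the event and context.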
There is only one thing left. You might want to set up monitoring for your simple ETL pipeline.
- Create an SNS topic and add your e-mail address as a subscriber.
- Create a CloudWatch Rule with the following event pattern and configure the SNS topic as a target.
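An event pattern for this purpose could look like the one built in the following sketch, which also shows how the rule and its target might be created with boto3. The rule name, target ID, and function names are made up for illustration, and the sketch assumes that the SNS topic's access policy already allows CloudWatch Events to publish to it.

```python
import json


def glue_failure_pattern(job_name=None):
    """Build an event pattern matching failed Glue job runs."""
    detail = {"state": ["FAILED"]}
    if job_name is not None:
        # Optionally restrict the rule to a single job.
        detail["jobName"] = [job_name]
    return {
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": detail,
    }


def create_failure_alert(events_client, rule_name, topic_arn, job_name=None):
    """Create the rule and point it at the SNS topic."""
    # events_client is expected to be boto3.client("events").
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(glue_failure_pattern(job_name)),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "sns-topic", "Arn": topic_arn}],
    )
```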
By doing so, you will receive an e-mail whenever your Glue job fails.
AWS Glue offers tools for solving ETL challenges. A Glue Python Shell job is a perfect fit for ETL tasks with low to medium complexity and data volume. For example, loading data from S3 to Redshift can be accomplished with a Glue Python Shell job immediately after someone uploads data to S3.
And by the way: the whole solution is Serverless! No need to manage any EC2 instances.