ETL with a Glue Python Shell Job: Load data from S3 to Redshift

Andreas Wittig – 21 Nov 2019

Gaining valuable insights from data is a challenge. After collecting data, the next step is to extract, transform, and load (ETL) the data into an analytics platform like Amazon Redshift. Luckily, there is a platform to build ETL pipelines: AWS Glue.

ETL Pipeline

In short, AWS Glue offers the following: a managed infrastructure to run ETL jobs, a data catalog to organize data stored in data lakes, and crawlers to discover and categorize data.

In the following, I present a simple example of an ETL pipeline that loads data from S3 into Redshift.

  1. Someone uploads data to S3.
  2. An S3 event triggers a Lambda function.
  3. The Lambda function starts a Glue job.
  4. The Glue job executes an SQL query to load the data from S3 to Redshift.

ETL Pipeline to load data from S3 to Redshift

AWS Glue offers two different job types:

  • Apache Spark
  • Python Shell

An Apache Spark job allows you to do complex ETL tasks on vast amounts of data. However, the learning curve is quite steep. Luckily, there is an alternative: Python Shell. A Python Shell job is a perfect fit for ETL tasks with low to medium complexity and data volume.

Therefore, I recommend a Glue job of type Python Shell to load data from S3 to Redshift with little or no transformation.
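If you prefer to script the setup instead of clicking through the console, the following is a minimal sketch of how such a Python Shell job might be registered with boto3. The script location, role name, and connection name are placeholders, not values from this article; the ScriptLocation points to the Python script discussed next.

import boto3

glue = boto3.client('glue')

glue.create_job(
    Name='LoadFromS3ToRedshift',
    Role='LoadFromS3ToRedshiftJob',  # IAM role assumed by the job (placeholder)
    Command={
        'Name': 'pythonshell',  # job type: Python Shell
        'ScriptLocation': 's3://my-glue-scripts/load_from_s3_to_redshift.py',  # placeholder
        'PythonVersion': '3'
    },
    MaxCapacity=0.0625,  # smallest DPU setting available for Python Shell jobs
    Connections={'Connections': ['redshift-connection']},  # Glue connection reaching your Redshift cluster (placeholder)
    Timeout=60  # minutes
)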

All you need to configure a Glue job is a Python script. The code example executes the following steps:

  1. Import modules that are bundled by AWS Glue by default.
  2. Define some configuration parameters (e.g., the Redshift hostname RS_HOST).
  3. Read the S3 bucket and object from the arguments (see getResolvedOptions) handed over when starting the job.
  4. Establish a connection to Redshift: connect(...).
  5. Increase the statement timeout (see statement_timeout) to one hour.
  6. Execute the COPY query to tell Redshift to load the object from S3.
from pgdb import connect
import os
import sys
from awsglue.utils import getResolvedOptions

# CONFIGURATION
RS_HOST = "xyz.redshift.amazonaws.com"
RS_PORT = "5439"
RS_DATABASE = "mydatabase"
RS_USER = "myadmin"
RS_PASSWORD = "XYZ"
RS_SCHEMA = "myschema"
RS_TABLE = "mytable"
RS_COLUMNS = "timestamp,value_a,value_b,value_c"
DELIMITER = "\t"
DATEFORMAT = "YYYY-MM-DD"

# ARGUMENTS
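# getResolvedOptions turns --s3-bucket and --s3-object into dictionary keys with underscores (s3_bucket, s3_object)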
args = getResolvedOptions(sys.argv, ["s3-bucket", "s3-object"])
S3_BUCKET = args["s3_bucket"]
S3_OBJECT = args["s3_object"]

# Connect to Redshift; pgdb accepts "host:port" as the host string.
con = connect(host=RS_HOST + ':' + RS_PORT, database=RS_DATABASE, user=RS_USER, password=RS_PASSWORD)
cursor = con.cursor()

# Increase the statement timeout to one hour (value in milliseconds).
cursor.execute("set statement_timeout = 3600000")

# Tell Redshift to copy the object from S3 into the target table.
copy_query = "COPY %s.%s(%s) from 's3://%s/%s' iam_role 'arn:aws:iam::111111111111:role/LoadFromS3ToRedshiftJob' delimiter '%s' DATEFORMAT AS '%s' ROUNDEC TRUNCATECOLUMNS ESCAPE MAXERROR AS 500;" % (
    RS_SCHEMA, RS_TABLE, RS_COLUMNS, S3_BUCKET, S3_OBJECT, DELIMITER, DATEFORMAT)

cursor.execute(copy_query)
con.commit()
cursor.close()
con.close()
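By the way, you can also start the Glue job manually to test the script, for example with boto3. The bucket and object names below are placeholders.

import boto3

glue = boto3.client('glue')

response = glue.start_job_run(
    JobName='LoadFromS3ToRedshift',
    Arguments={
        '--s3-bucket': 'my-data-bucket',  # placeholder
        '--s3-object': 'uploads/2019-11-21.tsv'  # placeholder
    }
)
print(response['JobRunId'])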

To trigger the ETL pipeline each time someone uploads a new object to an S3 bucket, you need to configure the following resources:

  1. Create a Lambda function (Node.js) and use the code example below to start the Glue job LoadFromS3ToRedshift.
  2. Attach an IAM role to the Lambda function that grants permission to call glue:StartJobRun.
  3. Create an S3 Event Notification that invokes the Lambda function each time someone uploads an object to your S3 bucket (steps 2 and 3 are sketched after the Lambda code below).

The following example shows how to start a Glue job and pass the S3 bucket and object as arguments.

'use strict';
const AWS = require('aws-sdk');
const glue = new AWS.Glue({apiVersion: '2017-03-31'});

exports.handler = async (event, context) => {
  console.log(`Handling S3 event: ${JSON.stringify(event)}`);
  await glue.startJobRun({
    JobName: 'LoadFromS3ToRedshift',
    Arguments: {
      '--s3-object': decodeURIComponent(event.Records[0].s3.object.key),
      '--s3-bucket': event.Records[0].s3.bucket.name
    }
  }).promise();
};
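Steps 2 and 3 can be wired up with a few API calls as well. The following boto3 sketch is one way to do it; the bucket name, function name, role name, and ARNs are placeholders, and in a real setup you would probably manage these resources with CloudFormation or Terraform.

import json
import boto3

iam = boto3.client('iam')
lambda_client = boto3.client('lambda')
s3 = boto3.client('s3')

# Step 2: allow the Lambda function's execution role to start the Glue job.
iam.put_role_policy(
    RoleName='StartLoadFromS3ToRedshiftRole',  # placeholder execution role
    PolicyName='StartGlueJob',
    PolicyDocument=json.dumps({
        'Version': '2012-10-17',
        'Statement': [{'Effect': 'Allow', 'Action': 'glue:StartJobRun', 'Resource': '*'}]  # scope down to the job ARN if possible
    })
)

# Step 3: allow S3 to invoke the Lambda function ...
lambda_client.add_permission(
    FunctionName='StartLoadFromS3ToRedshift',  # placeholder function name
    StatementId='AllowS3Invoke',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::my-data-bucket'  # placeholder bucket
)

# ... and invoke it for every newly created object.
s3.put_bucket_notification_configuration(
    Bucket='my-data-bucket',  # placeholder bucket
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:eu-west-1:111111111111:function:StartLoadFromS3ToRedshift',  # placeholder ARN
            'Events': ['s3:ObjectCreated:*']
        }]
    }
)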

There is only one thing left. You might want to set up monitoring for your simple ETL pipeline.

  1. Create an SNS topic and add your e-mail address as a subscriber.
  2. Create a CloudWatch Rule with the following event pattern and configure the SNS topic as a target.
{
  "detail-type": [
    "Glue Job State Change"
  ],
  "source": [
    "aws.glue"
  ],
  "detail": {
    "state": [
      "FAILED",
      "TIMEOUT"
    ]
  }
}

By doing so, you will receive an e-mail whenever your Glue job fails or times out.
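If you want to script this part as well, here is a rough boto3 sketch. The topic name, rule name, and e-mail address are placeholders; note that the SNS topic also needs an access policy that allows events.amazonaws.com to publish to it, which is omitted here.

import json
import boto3

sns = boto3.client('sns')
events = boto3.client('events')

# SNS topic with an e-mail subscription (confirm the subscription via the e-mail you receive).
topic_arn = sns.create_topic(Name='glue-job-alerts')['TopicArn']  # placeholder topic name
sns.subscribe(TopicArn=topic_arn, Protocol='email', Endpoint='you@example.com')  # placeholder address

# CloudWatch Events rule matching failed or timed-out Glue jobs.
events.put_rule(
    Name='glue-job-failed',  # placeholder rule name
    EventPattern=json.dumps({
        'detail-type': ['Glue Job State Change'],
        'source': ['aws.glue'],
        'detail': {'state': ['FAILED', 'TIMEOUT']}
    }),
    State='ENABLED'
)
events.put_targets(
    Rule='glue-job-failed',
    Targets=[{'Id': 'sns', 'Arn': topic_arn}]
)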

Summary

AWS Glue offers tools for solving ETL challenges. A Glue Python Shell job is a perfect fit for ETL tasks with low to medium complexity and data volume. For example, loading data from S3 to Redshift can be accomplished with a Glue Python Shell job immediately after someone uploads data to S3.

And by the way: the whole solution is Serverless! No need to manage any EC2 instances.

Andreas Wittig

I'm the author of Amazon Web Services in Action. I work as a software engineer and independent consultant focused on AWS and DevOps.

You can contact me via Email, Twitter, and LinkedIn.
