ETL with a Glue Python Shell Job: Load data from S3 to Redshift

Andreas Wittig – 21 Nov 2019

Gaining valuable insights from data is a challenge. After collecting data, the next step is to extract, transform, and load (ETL) the data into an analytics platform like Amazon Redshift. Luckily, there is a platform to build ETL pipelines: AWS Glue.

ETL Pipeline

In short, AWS Glue solves the following problems: a managed-infrastructure to run ETL jobs, a data catalog to organize data stored in data lakes, and crawlers to discover and categorize data.

In the following, I would like to present a simple but exemplary ETL pipeline to load data from S3 to Redshift.

  1. Someone uploads data to S3.
  2. An S3 event triggers a Lambda function.
  3. The Lambda function starts a Glue job.
  4. The Glue job executes an SQL query to load the data from S3 to Redshift.

ETL Pipeline to load data from S3 to Redshift

AWS Glue offers two different job types:

  • Apache Spark
  • Python Shell

An Apache Spark job allows you to do complex ETL tasks on vast amounts of data. However, the learning curve is quite steep. Luckily, there is an alternative: Python Shell. A Python Shell job is a perfect fit for ETL tasks with low to medium complexity and data volume.

Therefore, I recommend a Glue job of type Python Shell to load data from S3 to Redshift without or with minimal transformation.

S3 Security Guide

Protect your S3 buckets!

Follow four simple rules to avoid data leaking from S3. Download our S3 Security Guide!

All you need to configure a Glue job is a Python script. The code example executes the following steps:

  1. import modules that are bundled by AWS Glue by default.
  2. Define some configuration parameters (e.g., the Redshift hostname RS_HOST).
  3. Read the S3 bucket and object from the arguments (see getResolvedOptions) handed over when starting the job.
  4. Establish a connection to Redshift: connect(...).
  5. Increase the statement timeout (see statement_timeout) to one hour.
  6. Execute the COPY query to tell Redshift to the object from S3.
from pgdb import connect
import os
import sys
from awsglue.utils import getResolvedOptions

RS_HOST = ""
RS_PORT = "5439"
RS_DATABASE = "mydatabase"
RS_USER = "myadmin"
RS_SCHEMA = "myschema"
RS_TABLE = "mytable"
RS_COLUMNS = "timestamp,value_a,value_b,value_c"

args = getResolvedOptions(sys.argv, ["s3-bucket", "s3-object"])
S3_BUCKET = args["s3_bucket"]
S3_OBJECT = args["s3_object"]

con = connect(host=RS_HOST + ':' + RS_PORT, database=RS_DATABASE, user=RS_USER, password=RS_PASSWORD)
cursor = con.cursor()

cursor.execute("set statement_timeout = 360000")

copy_query = "COPY %s.%s(%s) from 's3://%s/%s' iam_role 'arn:aws:iam::111111111111:role/LoadFromS3ToRedshiftJob' delimiter '%s' DATEFORMAT AS '%s' ROUNDEC TRUNCATECOLUMNS ESCAPE MAXERROR AS 500;" % (


To trigger the ETL pipeline each time someone uploads a new object to an S3 bucket, you need to configure the following resources:

  1. Create a Lambda function (Node.js) and use the code example from below to start the Glue job LoadFromS3ToRedshift.
  2. Attach an IAM role to the Lambda function, which grants access to glue:StartJobRun.
  3. Create a S3 Event Notification that invokes the Lambda function each time someone uploads an object to your S3 bucket.

The following example shows how to start a Glue job and pass the S3 bucket and object as arguments.

'use strict';
const AWS = require('aws-sdk');
const glue = new AWS.Glue({apiVersion: '2017-03-31'});

exports.handler = async (event, context) => {
console.log(`Handling S3 event: ${JSON.stringify(event)}`);
await glue.startJobRun({
JobName: 'LoadFromS3ToRedshift',
Arguments: {
'--s3-object': decodeURIComponent(event.Records[0].s3.object.key),
'--s3-bucket': event.Records[0]

There is only one thing left. You might want to set up monitoring for your simple ETL pipeline.

  1. Create an SNS topic and add your e-mail address as a subscriber.
  2. Create a CloudWatch Rule with the following event pattern and configure the SNS topic as a target.
"detail-type": [
"Glue Job State Change"
"source": [
"detail": {
"state": [

By doing so, you will receive an e-mail whenever your Glue job fails.


AWS Glue offers tools for solving ETL challenges. A Glue Python Shell job is a perfect fit for ETL tasks with low to medium complexity and data volume. For example, loading data from S3 to Redshift can be accomplished with a Glue Python Shell job immediately after someone uploads data to S3.

And by the way: the whole solution is Serverless! No need to manage any EC2 instances.

Andreas Wittig

Andreas Wittig

I'm an independent consultant, technical writer, and programming founder. All these activities have to do with AWS. I'm writing this blog and all other projects together with my brother Michael.

In 2009, we joined the same company as software developers. Three years later, we were looking for a way to deploy our software—an online banking platform—in an agile way. We got excited about the possibilities in the cloud and the DevOps movement. It’s no wonder we ended up migrating the whole infrastructure of Tullius Walden Bank to AWS. This was a first in the finance industry, at least in Germany! Since 2015, we have accelerated the cloud journeys of startups, mid-sized companies, and enterprises. We have penned books like Amazon Web Services in Action and Rapid Docker on AWS, we regularly update our blog, and we are contributing to the Open Source community. Besides running a 2-headed consultancy, we are entrepreneurs building Software-as-a-Service products.

We are available for projects.

Feedback? Questions? Drop me a line: Email, Twitter, LinkedIn.

Briefcase icon
Hire me