ETL with a Glue Python Shell Job: Load data from S3 to Redshift

Andreas Wittig – 21 Nov 2019

Gaining valuable insights from data is a challenge. After collecting data, the next step is to extract, transform, and load (ETL) the data into an analytics platform like Amazon Redshift. Luckily, there is a platform to build ETL pipelines: AWS Glue.

ETL Pipeline

In short, AWS Glue solves the following problems: a managed-infrastructure to run ETL jobs, a data catalog to organize data stored in data lakes, and crawlers to discover and categorize data.

In the following, I would like to present a simple but exemplary ETL pipeline to load data from S3 to Redshift.

  1. Someone uploads data to S3.
  2. An S3 event triggers a Lambda function.
  3. The Lambda function starts a Glue job.
  4. The Glue job executes an SQL query to load the data from S3 to Redshift.

ETL Pipeline to load data from S3 to Redshift

AWS Glue offers two different job types:

  • Apache Spark
  • Python Shell

An Apache Spark job allows you to do complex ETL tasks on vast amounts of data. However, the learning curve is quite steep. Luckily, there is an alternative: Python Shell. A Python Shell job is a perfect fit for ETL tasks with low to medium complexity and data volume.

Therefore, I recommend a Glue job of type Python Shell to load data from S3 to Redshift without or with minimal transformation.

Looking for a new challenge?

  • tecRacer

    Cloud Consultant

    tecRacer • Premier AWS Consulting Partner • Germany, Austria, Portugal, and Switzerland
    AWS only Infrastructure as Code EC2 Containers Serverless
  • tecRacer

    Cloud Migration Specialist

    tecRacer • Premier AWS Consulting Partner • Germany, Austria, Portugal, and Switzerland
    Lift&Shift Transformation EC2 RDS VPC

All you need to configure a Glue job is a Python script. The code example executes the following steps:

  1. import modules that are bundled by AWS Glue by default.
  2. Define some configuration parameters (e.g., the Redshift hostname RS_HOST).
  3. Read the S3 bucket and object from the arguments (see getResolvedOptions) handed over when starting the job.
  4. Establish a connection to Redshift: connect(...).
  5. Increase the statement timeout (see statement_timeout) to one hour.
  6. Execute the COPY query to tell Redshift to the object from S3.
from pgdb import connect
import os
import sys
from awsglue.utils import getResolvedOptions

RS_HOST = ""
RS_PORT = "5439"
RS_DATABASE = "mydatabase"
RS_USER = "myadmin"
RS_SCHEMA = "myschema"
RS_TABLE = "mytable"
RS_COLUMNS = "timestamp,value_a,value_b,value_c"

args = getResolvedOptions(sys.argv, ["s3-bucket", "s3-object"])
S3_BUCKET = args["s3_bucket"]
S3_OBJECT = args["s3_object"]

con = connect(host=RS_HOST + ':' + RS_PORT, database=RS_DATABASE, user=RS_USER, password=RS_PASSWORD)
cursor = con.cursor()

cursor.execute("set statement_timeout = 360000")

copy_query = "COPY %s.%s(%s) from 's3://%s/%s' iam_role 'arn:aws:iam::111111111111:role/LoadFromS3ToRedshiftJob' delimiter '%s' DATEFORMAT AS '%s' ROUNDEC TRUNCATECOLUMNS ESCAPE MAXERROR AS 500;" % (


To trigger the ETL pipeline each time someone uploads a new object to an S3 bucket, you need to configure the following resources:

  1. Create a Lambda function (Node.js) and use the code example from below to start the Glue job LoadFromS3ToRedshift.
  2. Attach an IAM role to the Lambda function, which grants access to glue:StartJobRun.
  3. Create a S3 Event Notification that invokes the Lambda function each time someone uploads an object to your S3 bucket.

The following example shows how to start a Glue job and pass the S3 bucket and object as arguments.

'use strict';
const AWS = require('aws-sdk');
const glue = new AWS.Glue({apiVersion: '2017-03-31'});

exports.handler = async (event, context) => {
console.log(`Handling S3 event: ${JSON.stringify(event)}`);
await glue.startJobRun({
JobName: 'LoadFromS3ToRedshift',
Arguments: {
'--s3-object': decodeURIComponent(event.Records[0].s3.object.key),
'--s3-bucket': event.Records[0]

There is only one thing left. You might want to set up monitoring for your simple ETL pipeline.

  1. Create an SNS topic and add your e-mail address as a subscriber.
  2. Create a CloudWatch Rule with the following event pattern and configure the SNS topic as a target.
"detail-type": [
"Glue Job State Change"
"source": [
"detail": {
"state": [

By doing so, you will receive an e-mail whenever your Glue job fails.


AWS Glue offers tools for solving ETL challenges. A Glue Python Shell job is a perfect fit for ETL tasks with low to medium complexity and data volume. For example, loading data from S3 to Redshift can be accomplished with a Glue Python Shell job immediately after someone uploads data to S3.

And by the way: the whole solution is Serverless! No need to manage any EC2 instances.

Become a cloudonaut supporter

Andreas Wittig

Andreas Wittig ( Email, Twitter, or LinkedIn )

We launched the cloudonaut blog in 2015. Since then, we have published 350 articles, 45 podcast episodes, and 38 videos. It's all free and means a lot of work in our spare time. We enjoy sharing our AWS knowledge with you.

Please support us

Have you learned something new by reading, listening, or watching our content? With your help, we can spend enough time to keep publishing great content in the future. Learn more

Amount must be a multriply of 5. E.g, 5, 10, 15.

Thanks to Alan Leech, Alex DeBrie, ANTHONY RAITI, Christopher Hipwell, Jaap-Jan Frans, Jason Yorty, Jeff Finley, Jens Gehring, jhoadley, Johannes Grumböck, John Culkin, Jonas Mellquist, Juraj Martinka, Kamil Oboril, Ken Snyder, Ross Mohan, Ross Mohan, sam onaga, Satyendra Sharma, Shawn Tolidano, Thorsten Hoeger, Todd Valentine, Victor Grenu, and all anonymous supporters for your help! We also want to thank all supporters who purchased a cloudonaut t-shirt.