ETL with a Glue Python Shell Job: Load data from S3 to Redshift

ETL Pipeline

Gaining valuable insights from data is a challenge. After collecting data, the next step is to extract, transform, and load (ETL) the data into an analytics platform like Amazon Redshift. Luckily, there is a platform to build ETL pipelines: AWS Glue.

In short, AWS Glue provides managed infrastructure to run ETL jobs, a data catalog to organize data stored in data lakes, and crawlers to discover and categorize data.

In the following, I present a simple but representative ETL pipeline to load data from S3 to Redshift:

  1. Someone uploads data to S3.
  2. An S3 event triggers a Lambda function.
  3. The Lambda function starts a Glue job.
  4. The Glue job executes an SQL query to load the data from S3 to Redshift.

AWS Glue offers two different job types:

  • Apache Spark
  • Python Shell

An Apache Spark job allows you to do complex ETL tasks on vast amounts of data. However, the learning curve is quite steep. Luckily, there is an alternative: Python Shell. A Python Shell job is a perfect fit for ETL tasks with low to medium complexity and data volume.

Therefore, I recommend a Glue job of type Python Shell to load data from S3 to Redshift with little or no transformation.

All you need to configure a Glue job is a Python script. The code example executes the following steps:

  1. Import modules that AWS Glue bundles by default.
  2. Define some configuration parameters (e.g., the Redshift hostname RS_HOST).
  3. Read the S3 bucket and object from the arguments (see getResolvedOptions) handed over when starting the job.
  4. Establish a connection to Redshift: connect(...).
  5. Increase the statement timeout (see statement_timeout) to one hour.
  6. Execute the COPY query to tell Redshift to load the object from S3.

from pgdb import connect
import os
import sys
from awsglue.utils import getResolvedOptions

# CONFIGURATION
RS_HOST = "xyz.redshift.amazonaws.com"
RS_PORT = "5439"
RS_DATABASE = "mydatabase"
RS_USER = "myadmin"
RS_PASSWORD = "XYZ"
RS_SCHEMA = "myschema"
RS_TABLE = "mytable"
RS_COLUMNS = "timestamp,value_a,value_b,value_c"
DELIMITER = "\t"  # tab-separated input files
DATEFORMAT = "YYYY-MM-DD"

# ARGUMENTS
# getResolvedOptions exposes --s3-bucket and --s3-object as s3_bucket and s3_object
args = getResolvedOptions(sys.argv, ["s3-bucket", "s3-object"])
S3_BUCKET = args["s3_bucket"]
S3_OBJECT = args["s3_object"]

con = connect(host=RS_HOST + ':' + RS_PORT, database=RS_DATABASE, user=RS_USER, password=RS_PASSWORD)
cursor = con.cursor()

cursor.execute("set statement_timeout = 360000")

copy_query = "COPY %s.%s(%s) from 's3://%s/%s' iam_role 'arn:aws:iam::111111111111:role/LoadFromS3ToRedshiftJob' delimiter '%s' DATEFORMAT AS '%s' ROUNDEC TRUNCATECOLUMNS ESCAPE MAXERROR AS 500;" % (
    RS_SCHEMA, RS_TABLE, RS_COLUMNS, S3_BUCKET, S3_OBJECT, DELIMITER, DATEFORMAT)

cursor.execute(copy_query)
con.commit()
cursor.close()
con.close()
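
The Glue job itself can be created with a few lines of code as well. The following boto3 sketch registers the script above as a Python Shell job; the script location and the IAM role for the Glue job are placeholders you need to replace with your own values.

import boto3

glue = boto3.client("glue")

# Sketch: register the script above as a Glue Python Shell job.
glue.create_job(
    Name="LoadFromS3ToRedshift",
    Role="GlueLoadFromS3ToRedshiftRole",  # assumed IAM role for the Glue job
    Command={
        "Name": "pythonshell",  # job type: Python Shell
        "ScriptLocation": "s3://my-glue-scripts/load_from_s3_to_redshift.py",  # placeholder
        "PythonVersion": "3",
    },
    MaxCapacity=0.0625,  # smallest capacity available for a Python Shell job
)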

To trigger the ETL pipeline each time someone uploads a new object to an S3 bucket, you need to configure the following resources:

  1. Create a Lambda function (Node.js) and use the code example below to start the Glue job LoadFromS3ToRedshift.
  2. Attach an IAM role to the Lambda function that grants permission to call glue:StartJobRun (see the example policy after this list).
  3. Create an S3 Event Notification that invokes the Lambda function each time someone uploads an object to your S3 bucket (see the sketch after the Lambda code below).
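
For step 2, a minimal identity-based policy could look like the following sketch. The account ID matches the example COPY query above; adjust it and the job name to your setup. In addition, the role needs the usual permissions to write logs to CloudWatch.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "glue:StartJobRun",
      "Resource": "arn:aws:glue:*:111111111111:job/LoadFromS3ToRedshift"
    }
  ]
}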

The following example shows how to start a Glue job and pass the S3 bucket and object as arguments.

'use strict';
const AWS = require('aws-sdk');
const glue = new AWS.Glue({apiVersion: '2017-03-31'});

exports.handler = async (event, context) => {
  console.log(`Handling S3 event: ${JSON.stringify(event)}`);
  // S3 URL-encodes object keys in event notifications (e.g., spaces become '+').
  const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
  await glue.startJobRun({
    JobName: 'LoadFromS3ToRedshift',
    Arguments: {
      '--s3-object': key,
      '--s3-bucket': event.Records[0].s3.bucket.name
    }
  }).promise();
};
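
For step 3, the S3 Event Notification can be configured with code as well. The following boto3 sketch assumes the Lambda function from above has already been deployed; the bucket name and the function ARN are placeholders.

import boto3

s3 = boto3.client("s3")
awslambda = boto3.client("lambda")

BUCKET = "my-upload-bucket"  # placeholder
FUNCTION_ARN = "arn:aws:lambda:eu-west-1:111111111111:function:StartLoadFromS3ToRedshift"  # placeholder

# Allow S3 to invoke the Lambda function.
awslambda.add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="AllowS3Invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn="arn:aws:s3:::" + BUCKET,
)

# Invoke the Lambda function whenever someone creates (uploads) an object.
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {"LambdaFunctionArn": FUNCTION_ARN, "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)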

There is only one thing left: you might want to set up monitoring for your ETL pipeline.

  1. Create an SNS topic and add your e-mail address as a subscriber.
  2. Create a CloudWatch Rule with the following event pattern and configure the SNS topic as a target (see the sketch after the event pattern).
{
  "detail-type": [
    "Glue Job State Change"
  ],
  "source": [
    "aws.glue"
  ],
  "detail": {
    "state": [
      "FAILED",
      "TIMEOUT"
    ]
  }
}
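
If you prefer code over the console for this step as well, the following boto3 sketch creates the topic, the subscription, and the rule; the topic name, rule name, and e-mail address are placeholders. Depending on your setup, the access policy of the SNS topic may also need a statement that allows events.amazonaws.com to publish to the topic.

import json
import boto3

sns = boto3.client("sns")
events = boto3.client("events")

# Create the SNS topic and subscribe an e-mail address (placeholders).
topic_arn = sns.create_topic(Name="glue-job-failures")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="me@example.com")

# Create the CloudWatch rule with the event pattern from above.
events.put_rule(
    Name="GlueJobFailed",
    EventPattern=json.dumps({
        "detail-type": ["Glue Job State Change"],
        "source": ["aws.glue"],
        "detail": {"state": ["FAILED", "TIMEOUT"]},
    }),
)

# Send matching events to the SNS topic.
events.put_targets(
    Rule="GlueJobFailed",
    Targets=[{"Id": "sns", "Arn": topic_arn}],
)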

By doing so, you will receive an e-mail whenever your Glue job fails or times out.

Summary

AWS Glue offers tools for solving ETL challenges. A Glue Python Shell job is a perfect fit for ETL tasks with low to medium complexity and data volume. For example, loading data from S3 to Redshift can be accomplished with a Glue Python Shell job immediately after someone uploads data to S3.

And by the way: the whole solution is Serverless! No need to manage any EC2 instances.