Automating Parquet File Imports from Amazon S3 to Amazon Redshift
Effortlessly Keep Your Redshift Data Updated with Parquet Files
In today’s data-driven world, organizations often need to keep data updated across multiple platforms for analytics and reporting. One efficient way to manage large datasets is through automated ETL (Extract, Transform, Load) processes. In this guide, I’ll walk you through automating data ingestion from Parquet files stored in Amazon S3 to Amazon Redshift, using stored procedures and Redshift's scheduled queries.
Why Parquet and Amazon Redshift?
Parquet is an efficient columnar storage format commonly used in big data applications. It offers better compression and faster query performance than row-based formats, making it ideal for analytics. Amazon Redshift, a fully managed data warehouse, integrates seamlessly with S3, which makes it a natural fit for loading and querying large datasets.
With this setup, we’ll create a stored procedure in Redshift to load data from S3, automate it with scheduled queries, and ensure that Redshift stays updated daily.
Prerequisites
To follow along, you’ll need:
An Amazon Redshift cluster and an S3 bucket with Parquet files.
An IAM role that allows Redshift to access your S3 bucket.
Basic understanding of SQL and Redshift's stored procedures.
1. Setting Up the IAM Role for S3 Access
The first step is to create an IAM role that grants Redshift permission to read from the S3 bucket containing your Parquet files.
In the IAM Console, create a new role with Amazon S3 access.
Attach a policy like the one below to allow access to your specific bucket:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::your-bucket-name", "arn:aws:s3:::your-bucket-name/*" ] } ] }
Attach this role to your Redshift cluster.
2. Writing the Stored Procedure to Import Parquet Files
Now, let’s create a stored procedure in Amazon Redshift that will import data from Parquet files in your S3 bucket.
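The procedure loads into a table named data_store, so that table has to exist before the first run. The DDL below is only a hypothetical placeholder; your column names and types must line up with the schema of your Parquet files, since Redshift maps Parquet columns to table columns by position.
-- Hypothetical target table; replace the columns with your actual Parquet schema
CREATE TABLE IF NOT EXISTS data_store (
    id          BIGINT,
    event_name  VARCHAR(256),
    event_time  TIMESTAMP,
    amount      DOUBLE PRECISION
);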
Here's the stored procedure we’ll use:
CREATE OR REPLACE PROCEDURE public.sp_copyattrparquettoredshift()
LANGUAGE plpgsql
AS $$
BEGIN
    -- Clear existing data
    TRUNCATE TABLE data_store;

    -- Import data from S3 bucket to Redshift table
    COPY data_store
    FROM 's3://your-s3-bucket/path/to/parquet/'
    IAM_ROLE 'arn:aws:iam::your-account-id:role/YourRedshiftRole'
    FORMAT AS PARQUET;
END;
$$;
Explanation:
TRUNCATE TABLE data_store; clears any existing data in the target table before loading new data. Note that in Redshift, TRUNCATE commits immediately, so the table is briefly empty while the COPY runs.
The COPY command reads from the specified S3 path and loads the data into data_store, with FORMAT AS PARQUET telling Redshift how to parse the files.
IAM_ROLE specifies the IAM role we created so that Redshift can access S3.
Run this stored procedure manually first to confirm the data is loading as expected.
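A minimal way to do that from the query editor is to call the procedure and run a quick sanity check afterward (the row-count and preview queries below are just illustrative checks, not a formal validation):
-- Run the load once by hand
CALL public.sp_copyattrparquettoredshift();

-- Sanity checks: confirm rows landed and spot-check a few records
SELECT COUNT(*) AS row_count FROM data_store;
SELECT * FROM data_store LIMIT 10;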
3. Automating with Redshift Scheduled Queries
Redshift’s scheduled query feature allows you to automate this procedure. Here’s how to set it up to run daily:
Go to the Redshift Console.
Navigate to Queries and create a new scheduled query.
Use the following query to run the stored procedure:
CALL public.sp_copyattrparquettoredshift();
Set the schedule to run once a day (or at whatever frequency you need). For daily scheduling, you can use a cron expression like cron(0 0 * * ? *), which runs at midnight UTC every day.
Save the schedule, and Redshift will now call the stored procedure automatically each day, keeping your data up to date.
4. Testing and Monitoring
To ensure your setup is working:
Run the stored procedure manually to confirm it imports data as expected.
After setting up the scheduled query, monitor its status in the Redshift Console. You can also set up alerts in Amazon CloudWatch to notify you if there are any issues with the scheduled query.
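If you prefer checking from SQL instead of the console, Redshift logs stored procedure calls in the SVL_STORED_PROC_CALL system view; a query along these lines (a sketch, with the 7-day window chosen arbitrarily) shows recent runs of the procedure and whether any were aborted:
-- Recent calls of the load procedure and their outcome
SELECT querytxt, starttime, endtime, aborted
FROM svl_stored_proc_call
WHERE querytxt ILIKE '%sp_copyattrparquettoredshift%'
  AND starttime > DATEADD(day, -7, GETDATE())
ORDER BY starttime DESC;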
Conclusion
By using stored procedures and scheduled queries in Redshift, you've now set up an automated process for loading data from Parquet files in S3 into Redshift. This approach not only simplifies data management but also ensures your Redshift data is consistently up to date for analytics.
With this setup, you’re ready to automate ETL processes efficiently. Give it a try and take advantage of Parquet’s data compression and Redshift’s powerful querying capabilities!