SQL-Based ETL with Apache Spark on Amazon EKS

Create meaningful insights using codeless ETL capabilities and workflow orchestration automation

Publication date: July 2021

The SQL-Based ETL with Apache Spark on Amazon EKS solution provides declarative data processing support, codeless extract-transform-load (ETL) capabilities, and workflow orchestration automation to help your business users (such as analysts and data scientists) access their data and create meaningful insights without the need for manual IT processes.

This solution abstracts common ETL activities, including formatting, partitioning, and transforming datasets, into configurable and productive data processes. This abstraction helps you derive actionable insights more quickly and accelerate your data-driven business decisions. Additionally, this solution uses the open-source Arc data processing framework, run by Amazon Elastic Kubernetes Service (Amazon EKS) and powered by Apache Spark and container technologies, to simplify Spark application development and deployment.
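
To make the abstraction concrete, the following is a minimal PySpark sketch (not the solution’s own code) of the kind of formatting, transforming, and partitioning work that an ETL stage performs; the S3 paths, view name, and column names are hypothetical.

    from pyspark.sql import SparkSession

    # Minimal sketch only: S3 paths, view names, and columns are hypothetical.
    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    # Format: read a raw CSV dataset and expose it to Spark SQL as a view.
    raw = spark.read.option("header", "true").csv("s3://example-raw-bucket/sales/")
    raw.createOrReplaceTempView("raw_sales")

    # Transform: apply business logic declaratively with Spark SQL.
    curated = spark.sql("""
        SELECT CAST(order_id AS BIGINT)       AS order_id,
               TRIM(customer_name)            AS customer_name,
               CAST(amount AS DECIMAL(12, 2)) AS amount,
               TO_DATE(order_ts)              AS order_date
        FROM raw_sales
        WHERE order_id IS NOT NULL
    """)

    # Partition: write the curated dataset as Parquet, partitioned by date.
    curated.write.mode("overwrite").partitionBy("order_date").parquet(
        "s3://example-curated-bucket/sales/")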

This solution uses GitHub as the source repository to track ETL asset changes, such as Jupyter notebook file and SQL script updates, allowing for application version control and standardized continuous integration and continuous delivery (CI/CD) deployments. This solution unifies analytical workloads and IT operations using standardized and automated processes, providing a simplified ETL deployment management capability for your organization’s DevOps team. These automated processes help you avoid unintentional human mistakes caused by manual, repetitive tasks.
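
As a rough illustration (not the solution’s actual pipeline code) of the automated deployment step this enables, a CI/CD job might mirror changed ETL assets from the Git checkout to the solution’s artifact S3 bucket. The bucket name, workspace path, and file suffixes below are hypothetical.

    import pathlib

    import boto3

    # Hypothetical values for illustration only.
    ARTIFACT_BUCKET = "example-etl-artifact-bucket"
    REPO_CHECKOUT = pathlib.Path("./source")   # CI/CD workspace holding the Git checkout
    ASSET_SUFFIXES = {".ipynb", ".sql"}        # Jupyter notebook and SQL script assets

    s3 = boto3.client("s3")

    # Upload every tracked ETL asset so that Argo Workflows can reference it in Amazon S3.
    for path in REPO_CHECKOUT.rglob("*"):
        if path.suffix in ASSET_SUFFIXES:
            key = path.relative_to(REPO_CHECKOUT).as_posix()
            s3.upload_file(str(path), ARTIFACT_BUCKET, key)
            print(f"Synced {key} to s3://{ARTIFACT_BUCKET}/{key}")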

This solution offers the following features:

  • Build, test, and debug ETL jobs in Jupyter: Uses a web-based JupyterHub as an interactive integrated development environment (IDE). It contains a custom Arc kernel to simplify your ETL application development by allowing you to define each ETL task or stage in its own block. The output of each block displays data results and a task log. This log captures real-time data processing status and exception messages that can be used for debugging.

  • Codeless job orchestration: Schedules jobs and manages complex job dependencies without the need to write code. Argo Workflows declaratively defines the job implementation target state, execution order, and relationships. It provides a user-friendly graphical dashboard to help you track workflow status and resource usage patterns. The job orchestrator in this solution is a switchable plug-in and can be replaced by another tool of your choice, for example Apache Airflow or Volcano.

  • Docker image auto-deployment: Sets up an AWS continuous integration and continuous delivery (CI/CD) pipeline to securely store the Arc Docker image in Amazon Elastic Container Registry (Amazon ECR).

  • ETL CI/CD deployment: Builds a file synchronization process. In this solution, the automated ETL deployment process integrates the Jupyter IDE with the solution’s GitHub repository to detect changes to the sample ETL jobs. As a one-off setup with your choice of CI/CD tool, a GitHub change activates a file sync-up process between your Git repository and the artifact S3 bucket. As a result, Argo Workflows can refer to the Amazon S3 file assets, such as a Jupyter notebook (job specification file), and automatically orchestrate ETL jobs either on demand or based on a time or an event.

  • SQL-first approach: Uses Spark SQL to implement business logic and data quality checks in ETL pipeline development. You can process and manipulate data in Spark using your existing SQL expertise, as shown in the sketch after this list.
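
To illustrate the SQL-first approach referenced in the last item above, the following is a minimal sketch of a data quality check written with Spark SQL; the view and column names are hypothetical, and the solution’s own notebooks express comparable logic through Arc’s declarative stages.

    from pyspark.sql import SparkSession

    # Minimal sketch of a SQL-based data quality check; view and columns are hypothetical.
    spark = SparkSession.builder.getOrCreate()

    violations = spark.sql("""
        SELECT COUNT(*) AS bad_rows
        FROM curated_sales
        WHERE amount < 0 OR order_date IS NULL
    """).first()["bad_rows"]

    if violations > 0:
        # Failing fast lets the job orchestrator mark the ETL run as failed.
        raise ValueError(f"Data quality check failed: {violations} invalid rows")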

This implementation guide describes architectural considerations and configuration steps for deploying SQL-Based ETL with Apache Spark on Amazon EKS in the Amazon Web Services (AWS) Cloud. It includes links to an AWS CloudFormation template that launches and configures the AWS services required to deploy this solution using AWS best practices for security and availability.
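
Although the console-based launch linked in this guide is the primary deployment path, a CloudFormation stack can also be created programmatically. The following boto3 sketch uses a placeholder stack name and template URL rather than the solution’s published values; substitute the template link provided in this guide.

    import boto3

    cloudformation = boto3.client("cloudformation", region_name="us-east-1")

    # Placeholder stack name and template URL; use the CloudFormation template linked in this guide.
    response = cloudformation.create_stack(
        StackName="sql-based-etl-with-apache-spark-on-amazon-eks",
        TemplateURL="https://example.com/path/to/solution-template.yaml",
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    print("Stack ID:", response["StackId"])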

The guide is intended for IT architects, developers, DevOps engineers, data analysts, data scientists, and SQL authors in an organization who have practical experience in the AWS Cloud.