This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.
Extract, transform and load (ETL) using custom connectors with Apache Spark
Architecture overview
The following diagram depicts the architecture of the solution, where data from any SaaS application can be ingested into Amazon S3 using AWS Glue. Once the data is ingested into Amazon S3, you can catalog it using the AWS Glue Data Catalog and query it with SQL in Amazon Athena.

AWS Glue-based data ingestion pattern
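
The following is a minimal sketch of what such an ingestion job can look like as an AWS Glue PySpark script. The connection name and S3 path are hypothetical placeholders, and the connection_type depends on the connector you attach to the job.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the SaaS source through a connector attached to the job.
# "my-saas-connection" is a placeholder for a Glue connection you create.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",   # or "custom.spark" for your own connector
    connection_options={"connectionName": "my-saas-connection"},
)

# Land the data in Amazon S3; a Glue crawler (or Athena DDL) can then
# register it in the AWS Glue Data Catalog for querying with Amazon Athena.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/raw/saas/"},
    format="parquet",
)

job.commit()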
Usage patterns
Because AWS Glue ETL gives data engineers flexibility, scalability, and the distributed computing power of the open-source Apache Spark framework, it is an excellent choice for ingesting SaaS data into the AWS Cloud ecosystem (including Amazon S3 for data lake operations).
Some use cases are as follows:
- Enrich internal system data with data from SaaS applications during the ETL process.
- Perform lookup transformations with data from SaaS applications during the ETL process.
This flexibility applies at multiple points in the ETL process when using AWS Glue:
- You can use the existing connectors available in the AWS Marketplace, or, if there is no pre-built connector for the SaaS application you are trying to connect to, build your own custom connector and use it in AWS Glue. The process for creating your own connector is documented in Developing custom connectors.
- A connector is just the initial piece of the puzzle: it lets you connect to the source application. Often, the data you want from the source must be transformed, normalized, or aligned to a specific format before it can be stored in Amazon S3. This is where AWS Glue shines: it gives data engineers the freedom to customize their ETL process to the complexities of their use cases. Those complexities may arise from heavy customizations to the SaaS application, from the limitations of other low-code/no-code tools, or from complex business or technical requirements around how the data should be persisted in the target storage. A transformation step of this kind is sketched after this list.
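
As an illustration of that transformation freedom, the following hypothetical snippet normalizes the output of a connector read (the source DynamicFrame from the earlier skeleton) before it is written to Amazon S3. The field names are illustrative Salesforce-style fields, not taken from this whitepaper.

from awsglue.transforms import ApplyMapping, DropNullFields

# Normalize the raw connector output: rename SaaS-specific field names
# and cast types so the data lands in S3 in a consistent shape.
mapped = ApplyMapping.apply(
    frame=source,  # DynamicFrame read through the connector, as in the skeleton above
    mappings=[
        ("Account_Name__c", "string", "account_name", "string"),
        ("Annual_Revenue__c", "string", "annual_revenue", "double"),
        ("LastModifiedDate", "string", "last_modified", "timestamp"),
    ],
)

# Drop fields whose values are all null before writing.
cleaned = DropNullFields.apply(frame=mapped)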
AWS Glue Studio is a new graphical interface that makes it easy to create, run, and monitor ETL jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on AWS Glue's Apache Spark-based serverless ETL engine. Using AWS Glue Studio and AWS Marketplace connectors, you can create an ETL pipeline with ease. For example, refer to Migrating data from Google BigQuery to Amazon S3 using AWS Glue custom connectors.
Considerations
Using custom connectors in AWS Glue Studio has some limitations. However, data engineers have the option to author jobs in the AWS Glue console and write the code themselves. They can use custom connectors, or other open-source or paid connector code, to access any SaaS application. For example, refer to Extract Salesforce.com data using AWS Glue and analyzing with Amazon Athena.
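
As a hedged illustration of hand-authored job code, the following sketch uses the open-source springml spark-salesforce package (attached to the job as an extra JAR) to pull Salesforce records with a SOQL query. The credentials and bucket are placeholders; in practice, credentials would come from AWS Secrets Manager rather than being hardcoded.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read Salesforce records through the third-party connector; the password
# option typically combines the account password with the security token.
df = (
    spark.read.format("com.springml.spark.salesforce")
    .option("username", "sf_user@example.com")            # placeholder
    .option("password", "password_plus_security_token")   # placeholder
    .option("soql", "SELECT Id, Name, Industry FROM Account")
    .load()
)

df.write.mode("overwrite").parquet("s3://my-data-lake/raw/salesforce/account/")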
Also, keep in mind that when using the AWS Glue ETL pattern for ingesting SaaS data, the source SaaS applications may have built-in limitations on bulk data extraction, as well as API limits. For example, Salesforce documents these limits for bulk data extraction in Bulk API and Bulk API 2.0 Limits and Allocations.
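
When a source enforces such throttling, extraction code typically needs to back off and retry. The following generic sketch (not tied to any specific SaaS API, and using an illustrative HTTP 429 response for throttling) shows the pattern:

import time
import requests

def fetch_with_backoff(url, headers, max_retries=5):
    """Fetch a page of results, retrying with exponential backoff when throttled."""
    delay = 1.0
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code == 429:   # throttled by the API: wait, then retry
            time.sleep(delay)
            delay *= 2                # exponential backoff between attempts
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f"Gave up after {max_retries} throttled attempts: {url}")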