Processing layer
The processing layer in our architecture is composed of two types of components:
- Components used to create multi-step data processing pipelines.
- Components to orchestrate data processing pipelines on schedule or in response to event triggers (such as ingestion of new data into the landing zone).
AWS Glue and AWS Step Functions
AWS Glue is a serverless, pay-per-use ETL service for building and running Python shell or Spark jobs (the latter written in Scala or Python) without requiring you to deploy or manage clusters.
AWS Glue can automatically generate code to run your data transformations and loading processes.
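As a minimal sketch of such a job, the following PySpark script reads a table from the Glue Data Catalog, remaps columns, and writes Parquet output to the curated zone. The database name, table name, and S3 path are hypothetical placeholders, not part of the architecture described here.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job bootstrap: resolve arguments and initialize contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog.
# "sales_db" and "raw_orders" are hypothetical names for illustration.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
)

# Rename and retype columns as the transformation step.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Write the result to the curated zone in Parquet format.
# The bucket path is a placeholder.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/orders/"},
    format="parquet",
)
job.commit()
```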
Additionally, you can use AWS Glue to define and run crawlers that can crawl folders in the data lake, discover datasets and their partitions, infer schema, and define tables in the Lake Formation catalog. AWS Glue provides more than a dozen built-in classifiers that can parse a variety of data structures stored in open-source formats. AWS Glue also provides triggers and workflow capabilities that you can use to build multi-step end-to-end data processing pipelines that include job dependencies and running parallel steps. You can schedule AWS Glue jobs and workflows or run them on demand. AWS Glue natively integrates with AWS services in storage, catalog, and security layers.
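For example, a crawler or a workflow can be started on demand through the AWS SDK for Python; the crawler and workflow names below are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Run a crawler that discovers new partitions and updates catalog tables.
# "raw-orders-crawler" is a hypothetical crawler name.
glue.start_crawler(Name="raw-orders-crawler")

# Start a multi-step Glue workflow on demand.
# "orders-etl-workflow" is a hypothetical workflow name.
run = glue.start_workflow_run(Name="orders-etl-workflow")
print(run["RunId"])
```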
To make it easy to clean and normalize data, AWS Glue also provides a visual data preparation tool, AWS Glue DataBrew, which offers an interactive, point-and-click interface that requires no code.
Step Functions is a serverless engine that you can use to build and orchestrate scheduled or event-driven data processing workflows. You can use Step Functions to build complex data processing pipelines that involve orchestrating steps implemented by using multiple AWS services, such as AWS Glue and AWS Lambda.
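As a sketch of such a pipeline, the following Python snippet registers a two-step state machine that runs a Glue job and then invokes a Lambda function. The job name, function name, and IAM role ARN are hypothetical placeholders.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# A two-step pipeline: run a Glue job, then invoke a Lambda function.
definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # The .sync integration waits for the Glue job run to complete.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "orders-etl-job"},  # hypothetical job
            "Next": "NotifyLambda",
        },
        "NotifyLambda": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "pipeline-notifier"},  # hypothetical function
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="orders-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",  # placeholder role
)
```

The `.sync` suffix on the Glue service integration tells Step Functions to wait for the job run to finish before moving to the next state, which is what makes step dependencies in the pipeline possible without polling code.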