AWS Glue product family - AWS Glue Best Practices: Building an Operationally Efficient Data Pipeline

AWS Glue product family

AWS Glue is a serverless, fully managed data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (such as table definition and schema) in the AWS Glue Data Catalog.

Once cataloged, your data is immediately searchable, queryable, and available for extract, transform, load (ETL). One of the most difficult tasks in building a data pipeline is integrating data from various sources, which can be structured, semi-structured, or even unstructured, and that is where AWS Glue shines. AWS Glue provides both visual and code-based interfaces to help build ETL jobs and data pipelines faster.

The AWS Glue product family includes several services that cater to varying user personas and allow them to catalog, transform, clean, enrich, and deliver data in a consistent and reliable way.

The AWS Glue product family consists of AWS Glue for cataloging and ETL transformation, and AWS Glue DataBrew for self-service, no-code data preparation.

Figure: The AWS Glue product family, consisting of AWS Glue and AWS Glue DataBrew.

When should I use AWS Glue?

AWS Glue provides both visual and code-based interfaces to make data integration easier. Users can easily find and access data using the AWS Glue Data Catalog. Data engineers and ETL developers can visually create, run, and monitor ETL workflows with a few clicks in AWS Glue Studio.

You can use AWS Glue to organize, cleanse, validate, and format data for storage in a data warehouse or data lake. You can transform and move AWS Cloud data into your data store. You can also load data from disparate static or streaming data sources into your data warehouse or data lake for regular reporting and analysis. By storing data in a data warehouse or data lake, you integrate information from different parts of your business and provide a common source of data for decision making.

AWS Glue simplifies many tasks when you are building a data warehouse or data lake:

  • Discovers and catalogs metadata about your data stores into a central catalog. You can process semi-structured data, such as clickstream or process logs.

  • Populates the AWS Glue Data Catalog with table definitions from scheduled crawler programs. Crawlers call classifier logic to infer the schema, format, and data types of your data. This metadata is stored as tables in the AWS Glue Data Catalog, and used in the authoring process of your ETL jobs.

  • Generates ETL scripts to transform, flatten, and enrich your data from source to target.

  • Detects schema changes and adapts based on your preferences.

  • Triggers your ETL jobs based on a schedule or event. You can initiate jobs automatically to move your data into your data warehouse or data lake. Triggers can be used to create a dependency flow between jobs.

  • Gathers runtime metrics to monitor the activities of your data warehouse or data lake.

  • Handles errors and retries automatically.

  • Scales resources, as needed, to run your jobs.
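The crawler, schema-change, and trigger items in the list above can be sketched as request parameters for the Glue `CreateCrawler` and `CreateTrigger` APIs. This is a minimal sketch, not a definitive setup: the helper names, resource names, role ARN, S3 path, and cron expression are all hypothetical, and the schema change policy shown is only one of the available preferences.

```python
def build_crawler_request(name, role_arn, database, s3_path, cron):
    """Build keyword arguments for the Glue CreateCrawler API (hypothetical helper)."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": cron,  # e.g. "cron(0 2 * * ? *)" for a daily run at 02:00 UTC
        # One possible preference for handling detected schema changes.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }


def build_conditional_trigger(name, upstream_job, downstream_job):
    """Build keyword arguments for a CreateTrigger call that chains two jobs."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        # Fire only after the upstream job has succeeded.
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
    }


# Usage with boto3 (assumes credentials and existing jobs; not run here):
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**build_crawler_request(
#       "sales-crawler", "arn:aws:iam::123456789012:role/GlueRole",
#       "sales_db", "s3://example-bucket/sales/", "cron(0 2 * * ? *)"))
#   glue.create_trigger(**build_conditional_trigger(
#       "after-ingest", "ingest-job", "aggregate-job"))
```

Keeping the parameters in plain functions makes the dependency flow between jobs easy to review and test before anything is created in an account.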

You can use AWS Glue when you run serverless queries against your Amazon S3 data lake. AWS Glue can catalog your Amazon S3 data, making it available for querying with Amazon Athena and Amazon Redshift Spectrum. With crawlers, your metadata stays in sync with the underlying data. Athena and Redshift Spectrum can directly query your S3 data lake using the AWS Glue Data Catalog. With AWS Glue, you access and analyze data through one unified interface without loading it into multiple data silos.
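As an illustration of querying cataloged S3 data, the sketch below submits a SQL statement to Athena against a Data Catalog database and polls until the query reaches a final state. The function name, database name, and output location are hypothetical. The client is passed in as a parameter so the logic can be exercised without an AWS account.

```python
import time


def run_athena_query(athena, sql, database, output_location,
                     poll_seconds=1.0, sleep=time.sleep):
    """Submit a query and wait for a terminal state.

    `athena` is expected to behave like boto3.client("athena").
    Returns a (query_execution_id, final_state) tuple.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        # The database is resolved through the AWS Glue Data Catalog.
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        sleep(poll_seconds)  # still QUEUED or RUNNING


# Usage with boto3 (names are placeholders; not run here):
#   import boto3
#   run_athena_query(boto3.client("athena"),
#                    "SELECT * FROM orders LIMIT 10",
#                    "sales_db", "s3://example-bucket/athena-results/")
```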

You can create event-driven ETL pipelines with AWS Glue. You can run your ETL jobs as soon as new data becomes available in S3 by invoking your AWS Glue ETL jobs from an AWS Lambda function or from event-driven workflows. You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs.
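One way to wire this up is a Lambda function subscribed to S3 object-created notifications that starts a Glue job run for each new object. The following is a sketch under stated assumptions: the job name `example-etl-job` and the `--source_path` job argument are hypothetical, and the Glue job itself is assumed to exist and to read that argument.

```python
from urllib.parse import unquote_plus

JOB_NAME = "example-etl-job"  # hypothetical Glue job name


def start_jobs_for_event(event, glue_client, job_name=JOB_NAME):
    """Start one Glue job run per S3 object in the notification event.

    `glue_client` is expected to behave like boto3.client("glue").
    Returns the list of started job run IDs.
    """
    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=job_name,
            # "--source_path" is an assumed argument name read by the job script.
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids


def lambda_handler(event, context):
    """Lambda entry point; boto3 is provided by the Lambda Python runtime."""
    import boto3

    return {"job_run_ids": start_jobs_for_event(event, boto3.client("glue"))}
```

Separating the event parsing from the handler keeps the logic testable with a stub client, independent of the Lambda runtime.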

You can use AWS Glue to understand your data assets. You can store your data using various AWS services and still maintain a unified view of your data using the AWS Glue Data Catalog. View the Data Catalog to quickly search and discover the datasets that you own, and maintain the relevant metadata in one central repository. The Data Catalog also serves as a drop-in replacement for your external Apache Hive Metastore.
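To make the idea of a central metadata repository concrete, the sketch below lists the tables and column types registered in one Data Catalog database by paging through the Glue `GetTables` API. The function name and the database name are hypothetical. The client is injected so the paging logic can be checked without AWS access.

```python
def list_catalog_tables(glue, database):
    """Return [(table_name, [(column_name, column_type), ...]), ...].

    `glue` is expected to behave like boto3.client("glue").
    """
    tables = []
    token = None
    while True:
        kwargs = {"DatabaseName": database}
        if token:
            kwargs["NextToken"] = token
        page = glue.get_tables(**kwargs)
        for table in page["TableList"]:
            # Some tables may carry no storage descriptor or columns.
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            tables.append(
                (table["Name"], [(c["Name"], c.get("Type")) for c in columns])
            )
        token = page.get("NextToken")
        if not token:
            return tables


# Usage with boto3 (database name is a placeholder; not run here):
#   import boto3
#   for name, columns in list_catalog_tables(boto3.client("glue"), "sales_db"):
#       print(name, columns)
```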

What is AWS Glue Studio?

AWS Glue Studio is a new graphical interface that makes it easy to create, run, and monitor ETL jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless ETL engine. You can inspect the schema and data results in each step of the job. Use AWS Glue Studio for a simple visual interface to create ETL workflows for data cleaning and transformation, and run them on AWS Glue. AWS Glue Studio makes it easy for ETL developers to create repeatable processes to move and transform large-scale, semi-structured datasets, and load them into data lakes and data warehouses. It provides a boxes-and-arrows style visual interface for developing and managing AWS Glue ETL workflows that you can optionally customize with code. AWS Glue Studio combines the ease of use of traditional ETL tools with the power and flexibility of the AWS Glue big data processing engine.

AWS Glue Studio provides multiple ways to customize your ETL scripts, including adding nodes that represent code snippets in the visual editor.

Use AWS Glue Studio for easier job management. AWS Glue Studio provides you with job and job run management interfaces that make it clear how jobs relate to each other, and give an overall picture of your job runs. The job management page makes it easy to do bulk operations on jobs (previously difficult to do in the AWS Glue console). All job runs are available in a single interface where you can search and filter. This gives you a constantly updated view of your ETL operations and the resources you use. You can use the near real-time dashboard in AWS Glue Studio to monitor your job runs and validate that they are operating as intended.

When should I use AWS Glue DataBrew?

Data analysts and data scientists can use AWS Glue DataBrew to visually enrich, clean, and normalize data without writing code. Using DataBrew helps reduce the time it takes to prepare data for analytics and ML by up to 80 percent, compared to custom-developed data preparation. You can choose from over 250 ready-made transformations to automate data preparation tasks, such as filtering anomalies, converting data to standard formats, and correcting invalid values.

Use AWS Glue DataBrew to interactively discover, visualize, clean, and transform raw data through its intuitive interface. DataBrew makes smart suggestions to help you identify data quality issues that can be difficult to find and time-consuming to fix. With DataBrew preparing your data, you can use your time to act on the results and iterate more quickly. You can save transformations as steps in a recipe, which you can update or reuse later with other datasets, and deploy on a continuing basis.
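Although DataBrew is designed for no-code use, a recipe is stored as an ordered list of steps, which is what makes it reusable across datasets. The sketch below shows that shape as request parameters for the DataBrew `CreateRecipe` and `CreateRecipeJob` APIs. Treat it as an assumption-laden illustration: the helper names, column names, dataset names, bucket, and role ARN are hypothetical, and the `RENAME` and `UPPER_CASE` operation names and their parameters should be checked against the DataBrew recipe action reference before use.

```python
def recipe_step(operation, **parameters):
    """Build one recipe step in the shape used by the DataBrew API."""
    return {"Action": {"Operation": operation, "Parameters": parameters}}


# Hypothetical cleanup recipe; operation names are assumed, see the lead-in.
CLEANUP_STEPS = [
    recipe_step("RENAME", sourceColumn="cust_nm", targetColumn="customer_name"),
    recipe_step("UPPER_CASE", sourceColumn="country_code"),
]


def build_recipe_request(name, steps):
    """Keyword arguments for a CreateRecipe call."""
    return {"Name": name, "Steps": list(steps)}


def build_recipe_job_request(recipe_name, recipe_version, dataset_name,
                             role_arn, bucket):
    """Keyword arguments for a CreateRecipeJob call.

    The same recipe reference can be paired with any dataset name.
    """
    return {
        "Name": f"{recipe_name}-{dataset_name}",
        "DatasetName": dataset_name,
        "RecipeReference": {"Name": recipe_name, "RecipeVersion": recipe_version},
        "RoleArn": role_arn,
        "Outputs": [
            {"Location": {"Bucket": bucket, "Key": f"prepared/{dataset_name}/"}}
        ],
    }


# Usage with boto3 (assumes the datasets exist and the recipe is published):
#   import boto3
#   databrew = boto3.client("databrew")
#   databrew.create_recipe(**build_recipe_request("cleanup", CLEANUP_STEPS))
#   for dataset in ("orders", "returns"):
#       databrew.create_recipe_job(**build_recipe_job_request(
#           "cleanup", "1.0", dataset,
#           "arn:aws:iam::123456789012:role/DataBrewRole", "example-bucket"))
```

The loop in the usage comment reflects the reuse described above: one saved recipe applied to several datasets, each through its own job.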