
Benefits of using AWS Glue for data integration

AWS Glue is a fully managed ETL service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (such as table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL.
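
Programmatically, the same create-and-run workflow is available through the AWS SDKs. The following is a minimal sketch using boto3 (the AWS SDK for Python); the job name, IAM role, script location, and worker settings are illustrative placeholders:

```python
# Minimal sketch: define and start a Glue ETL job with boto3.
# All names, ARNs, and S3 paths below are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="orders-etl",                                  # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder IAM role
    Command={
        "Name": "glueetl",                              # Spark ETL job type
        "ScriptLocation": "s3://example-bucket/scripts/orders_etl.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)

run = glue.start_job_run(JobName="orders-etl")
print(run["JobRunId"])  # track the run in the console or via get_job_run
```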

Following are some benefits of using AWS Glue:

  • Less hassle — AWS Glue is integrated across a wide range of AWS services, meaning less hassle for you when onboarding. AWS Glue natively supports data stored in Amazon Aurora and all other Amazon Relational Database Service (RDS) engines, Amazon Redshift, and Amazon S3, as well as common database engines and databases in your Virtual Private Cloud (Amazon VPC) running on Amazon Elastic Compute Cloud (Amazon EC2) or your on-premises environment.
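
For example, a database that is reachable only inside your VPC can be registered as a Glue connection so that crawlers and jobs can use it. The boto3 sketch below is illustrative only; the connection name, JDBC URL, credentials, and network identifiers are placeholders:

```python
# Minimal sketch: register a JDBC connection for a database running in a VPC.
# All values are placeholders; store real credentials in AWS Secrets Manager.
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "orders-postgres",          # hypothetical connection name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://db.internal.example.com:5432/orders",
            "USERNAME": "glue_user",
            "PASSWORD": "replace-with-a-secret",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0abc1234",
            "SecurityGroupIdList": ["sg-0abc1234"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```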

  • Cost effective — AWS Glue is serverless. Because there is no infrastructure to provision or manage, total cost of ownership is lower. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running.

  • More power — AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to run your data transformations and loading processes.
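
The generated scripts follow a common PySpark pattern: read a cataloged source into a DynamicFrame, apply a column mapping, and write to a target. The sketch below is representative rather than actual generated output; the database, table, column, and bucket names are assumptions:

```python
# Sketch of a Glue-style PySpark ETL script (illustrative names throughout).
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: a table the crawler registered in the Data Catalog
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Transform: rename and cast columns, as a generated mapping would
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "string", "amount", "double")],
)

# Target: write the result to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet",
)
job.commit()
```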

AWS Glue also offers a number of important features that provide additional benefits to your enterprise.

  • Discover and search across all your AWS datasets — The AWS Glue Data Catalog is your persistent metadata store for all your data assets, regardless of where the data assets are located. The Data Catalog contains table definitions, job definitions, schemas, and other control information to help you manage your AWS Glue environment. It automatically computes statistics and registers partitions to make queries against your data efficient and cost-effective. It also maintains a comprehensive schema version history so you can understand how your data has changed over time.
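
The catalog can also be explored programmatically. The following is a minimal boto3 sketch, assuming a hypothetical database and table:

```python
# Minimal sketch: search the Data Catalog and inspect one table's schema.
# Database and table names are illustrative.
import boto3

glue = boto3.client("glue")

# Free-text search across cataloged tables
results = glue.search_tables(SearchText="orders")
for table in results["TableList"]:
    print(table["DatabaseName"], table["Name"])

# Fetch the full definition (columns, location) of one table
table = glue.get_table(DatabaseName="sales_db", Name="raw_orders")["Table"]
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```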

  • Automatic schema discovery — AWS Glue crawlers connect to your source or target data store, progress through a prioritized list of classifiers to determine the schema for your data, and then create metadata in your AWS Glue Data Catalog. The metadata is stored in tables in your Data Catalog and used in the authoring process of your ETL jobs. You can run crawlers on a schedule or on demand, or trigger them based on an event, to ensure that your metadata is up to date.
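
A scheduled crawler might be created as in the following boto3 sketch; the crawler name, IAM role, database, and S3 path are placeholders:

```python
# Minimal sketch: create a crawler that runs nightly so catalog metadata stays
# current, then start an on-demand run. All names and ARNs are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="nightly-logs-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder IAM role
    DatabaseName="logs_db",                                 # catalog database to populate
    Targets={"S3Targets": [{"Path": "s3://example-bucket/logs/"}]},
    Schedule="cron(0 2 * * ? *)",                           # every day at 02:00 UTC
)

# An on-demand run, for example after a one-off backfill
glue.start_crawler(Name="nightly-logs-crawler")
print(glue.get_crawler(Name="nightly-logs-crawler")["Crawler"]["State"])
```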

  • Manage and enforce schemas for data streams — AWS Glue Schema Registry, a feature of AWS Glue, enables you to validate and control the evolution of streaming data using registered Apache Avro schemas, at no additional charge. Through Apache-licensed serializers and de-serializers, the Schema Registry integrates with Java applications developed for Apache Kafka, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda. When data streaming applications are integrated with the Schema Registry, you can improve data quality and safeguard against unexpected changes using compatibility checks that govern schema evolution. Additionally, you can create or update AWS Glue tables and partitions using schemas stored within the registry.
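
Schemas can also be registered and evolved through the AWS Glue Schema Registry APIs. The boto3 sketch below uses an illustrative registry name, schema name, and Avro definition:

```python
# Minimal sketch: register an Avro schema and evolve it under a compatibility
# rule. Registry, schema, and field names are illustrative.
import boto3

glue = boto3.client("glue")

glue.create_registry(RegistryName="streaming-registry")

glue.create_schema(
    RegistryId={"RegistryName": "streaming-registry"},
    SchemaName="click-events",
    DataFormat="AVRO",
    Compatibility="BACKWARD",  # reject producer changes that would break consumers
    SchemaDefinition='{"type":"record","name":"Click","fields":'
                     '[{"name":"user_id","type":"string"}]}',
)

# Later, register a new version; it is checked against the compatibility mode
glue.register_schema_version(
    SchemaId={"RegistryName": "streaming-registry", "SchemaName": "click-events"},
    SchemaDefinition='{"type":"record","name":"Click","fields":['
                     '{"name":"user_id","type":"string"},'
                     '{"name":"page","type":["null","string"],"default":null}]}',
)
```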

  • Visually transform data with a drag-and-drop interface — AWS Glue Studio allows you to author highly scalable ETL jobs for distributed processing without becoming an Apache Spark expert. Define your ETL process in the drag-and-drop job editor, and AWS Glue automatically generates the code to extract, transform, and load your data. The code is generated in Scala or Python and written for Apache Spark.

  • Build complex ETL pipelines with simple job scheduling — AWS Glue jobs can be invoked on a schedule, on-demand, or based on an event. You can start multiple jobs in parallel or specify dependencies across jobs to build complex ETL pipelines. AWS Glue handles all inter-job dependencies, filters bad data, and retries jobs if they fail. All logs and notifications are pushed to Amazon CloudWatch so you can monitor and get alerts from a central service.
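
For example, a simple two-step dependency can be expressed with Glue triggers, as in the following boto3 sketch; the job and trigger names are hypothetical:

```python
# Minimal sketch: schedule one job and start a second job only after the first
# succeeds, using Glue triggers. Job and trigger names are placeholders.
import boto3

glue = boto3.client("glue")

# Run the extract job every hour
glue.create_trigger(
    Name="hourly-extract",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[{"JobName": "extract-orders"}],
    StartOnCreation=True,
)

# Run the load job only when the extract job finishes successfully
glue.create_trigger(
    Name="load-after-extract",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [
            {"LogicalOperator": "EQUALS", "JobName": "extract-orders", "State": "SUCCEEDED"}
        ]
    },
    Actions=[{"JobName": "load-orders"}],
    StartOnCreation=True,
)
```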

  • Clean and transform streaming data in transit — Serverless streaming ETL jobs in AWS Glue continuously consume data from streaming sources, including Amazon Kinesis and Amazon MSK, clean and transform it in transit, and make it available for analysis in seconds in your target data store. Use this feature to process event data like Internet of Things (IoT) event streams, clickstreams, and network logs. AWS Glue streaming ETL jobs can enrich and aggregate data, join batch and streaming sources, and run a variety of complex analytics and ML operations.
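
A streaming ETL job typically follows the pattern sketched below: read a Kinesis-backed Data Catalog table in micro-batches and process each batch before writing it out. The database, table, field names, and S3 paths are assumptions:

```python
# Sketch of a Glue streaming ETL pattern (illustrative names throughout):
# consume a Kinesis-backed catalog table in micro-batches and write Parquet.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# The crawler-registered table points at a Kinesis data stream
stream = glue_context.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="clickstream",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

def process_batch(data_frame, batch_id):
    # Clean each micro-batch (assumed "event_id" field), then persist it
    cleaned = data_frame.dropDuplicates(["event_id"])
    dynamic_frame = DynamicFrame.fromDF(cleaned, glue_context, "cleaned")
    glue_context.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/clickstream-clean/"},
        format="parquet",
    )

glue_context.forEachBatch(
    frame=stream,
    batch_function=process_batch,
    options={"windowSize": "60 seconds",
             "checkpointLocation": "s3://example-bucket/checkpoints/clickstream/"},
)
```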

  • Deduplicate and cleanse data with built-in ML — AWS Glue helps clean and prepare your data for analysis without becoming an ML expert. Its FindMatches feature deduplicates and finds records that are imperfect matches of each other. For example, use FindMatches to find duplicate records in your database of restaurants, when one record lists “Joe's Pizza” at “121 Main St.” and another shows a “Joseph's Pizzeria” at “121 Main”. FindMatches will ask you to label sets of records as either “matching” or “not matching.” The system will then learn your criteria for calling a pair of records a match, and will build an ETL job that you can use to find duplicate records within a database, or matching records across two databases.
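
A FindMatches transform can be created over a cataloged table as in this boto3 sketch; the table, IAM role, and primary-key column are assumptions, and the transform still needs to be trained with labeled examples before it is run:

```python
# Minimal sketch: create a FindMatches ML transform over a cataloged table.
# Names, ARNs, and the primary-key column are placeholders.
import boto3

glue = boto3.client("glue")

response = glue.create_ml_transform(
    Name="dedupe-restaurants",
    Role="arn:aws:iam::123456789012:role/GlueMLRole",  # placeholder IAM role
    InputRecordTables=[{"DatabaseName": "sales_db", "TableName": "restaurants"}],
    Parameters={
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "restaurant_id",
            "PrecisionRecallTradeoff": 0.9,  # favor precision over recall
        },
    },
    MaxCapacity=10.0,
)
print(response["TransformId"])
```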

  • Edit, debug, and test ETL code using AWS Glue interactive sessions — AWS Glue supports interactive application development, helping data engineers rapidly build, test, and run data preparation and analytics applications. This is achieved using AWS Glue interactive sessions, which provide on-demand access to a remote Spark runtime environment.

    The flexibility of interactive sessions lets you interact with them in many ways – the AWS Command Line Interface (AWS CLI), APIs, AWS Glue Studio notebooks, or local Jupyter-compatible notebooks. AWS Glue provides an open-source Jupyter kernel that integrates almost anywhere Jupyter does, including with integrated development environments (IDEs) such as PyCharm, IntelliJ, and VS Code. This enables you to author code in your local environment and run it seamlessly on the interactive session backend.

    Interactive sessions provide a faster, cheaper, and more flexible way to build and run data preparation and analytics applications.
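
A typical notebook session might begin as sketched below, using the interactive sessions Jupyter kernel and its configuration magics; the settings, database, and table names are illustrative:

```python
# Sketch of a Jupyter notebook cell on the Glue interactive sessions kernel.
# The %-magics configure the remote Spark session before any code runs.
%idle_timeout 30
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
df = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"   # hypothetical catalog table
).toDF()
df.show(5)  # runs on the remote, on-demand Spark backend
```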

  • Normalize data without code using a visual interface — AWS Glue DataBrew provides an interactive, point-and-click visual interface for users such as data analysts and data scientists to clean and normalize data without writing code. You can easily visualize, clean, and normalize data directly from your data lake, data warehouses, and databases, including Amazon S3, Amazon Redshift, Amazon Aurora, and Amazon RDS. You can choose from over 250 built-in transformations to combine, pivot, and transpose the data, and automate data preparation tasks by applying saved transformations directly to the new incoming data.