
Important features and concepts

Logging and monitoring

AWS Glue has several logging and monitoring options. By default, AWS Glue sends logs to the aws-glue log group in Amazon CloudWatch. These logs include information such as start and end time, configuration settings, and any errors or warnings that might have occurred.

Additionally, AWS Glue Spark ETL jobs provide the following options, which must be enabled for advanced monitoring:

Job metrics report job-specific metrics to the AWS Glue namespace in CloudWatch every 30 seconds. These job-specific metrics, such as processed records, total input/output data size, and runtime, provide insights into a job’s performance. They can help identify bottlenecks or opportunities to optimize configurations.
Continuous logging streams real-time Apache Spark job logs to the /aws-glue/jobs/logs-v2 log group in CloudWatch. By using real-time logs, you can dynamically monitor AWS Glue jobs while they are running.
Spark UI provides a Spark history server web interface for viewing information about the Spark job, such as the event timeline of each stage, a directed acyclic graph, and job environment variables. The Spark UI event logs are persisted in Amazon S3, and you can view them while the job is running or after it is complete.
Job run insights simplifies job debugging and optimization by listening for common Spark exceptions, performing root cause analysis, and providing recommended actions to fix issues. The insights are stored in CloudWatch.
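
As a rough illustration, the four options above are commonly switched on through special job parameters. The sketch below only builds the argument dictionary. The helper name `monitoring_arguments` and the bucket path are hypothetical, and the AWS documentation describes `--enable-metrics` as a key-only parameter, so the value given here is a placeholder.

```python
from typing import Dict, Optional


def monitoring_arguments(spark_event_logs_path: Optional[str] = None) -> Dict[str, str]:
    """Build Glue job arguments that turn on the advanced monitoring options.

    Hypothetical helper: it only assembles a dictionary and calls no AWS API.
    """
    args = {
        "--enable-metrics": "true",                    # job metrics (value is a placeholder)
        "--enable-continuous-cloudwatch-log": "true",  # continuous logging
        "--enable-job-insights": "true",               # job run insights
    }
    if spark_event_logs_path is not None:
        # Spark UI event logs are persisted in Amazon S3, so require an S3 path.
        if not spark_event_logs_path.startswith("s3://"):
            raise ValueError("Spark UI event logs must be written to an S3 path")
        args["--enable-spark-ui"] = "true"
        args["--spark-event-logs-path"] = spark_event_logs_path
    return args


if __name__ == "__main__":
    # "example-bucket" is an assumed bucket name used only for illustration.
    for key, value in sorted(monitoring_arguments("s3://example-bucket/spark-ui/").items()):
        print(f"{key}={value}")
```

The resulting dictionary could then be supplied as the job's default arguments when the job is defined, or as per-run arguments when a run is started.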

Automation

AWS Glue provides two main ways for you to automate ETL jobs: triggers and workflows.

AWS Glue triggers

When fired, AWS Glue triggers start specified jobs and crawlers. A trigger can be fired on demand, based on a predefined schedule, or based on specific events. You can use triggers to design a chain of dependent jobs and crawlers. For more information, see AWS Glue triggers.
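
To make the idea of chained triggers concrete, here is a minimal sketch that builds request payloads for a scheduled trigger and for a conditional trigger that fires after an upstream job succeeds. The field names follow the Glue `CreateTrigger` API as I understand it. The helper names, trigger names, and job names are hypothetical.

```python
from typing import Any, Dict, List


def scheduled_trigger(name: str, cron: str, job_names: List[str]) -> Dict[str, Any]:
    """Payload for a trigger that starts the given jobs on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name} for job_name in job_names],
        "StartOnCreation": True,
    }


def on_success_trigger(name: str, upstream_job: str, downstream_job: str) -> Dict[str, Any]:
    """Payload for a trigger that starts one job after another job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


if __name__ == "__main__":
    # Assumed names: "extract-job" runs nightly, "transform-job" follows on success.
    nightly = scheduled_trigger("nightly-extract", "0 2 * * ? *", ["extract-job"])
    chained = on_success_trigger("after-extract", "extract-job", "transform-job")
    print(nightly["Schedule"])
    print(chained["Predicate"]["Conditions"][0]["State"])
```

Each dictionary is meant to be passed as keyword arguments to the `create_trigger` call of an AWS SDK Glue client. The sketch stops short of making that call so that it runs without credentials.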

AWS Glue workflows

For more complex workloads, you can use AWS Glue workflows to create directed acyclic graphs and to build dependencies between separate AWS Glue entities (triggers, crawlers, and jobs). Workflows also provide a unified interface where you can share parameters, monitor progress, and troubleshoot issues across associated entities.
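
A small workflow might, for example, run a crawler and then start a job once the crawl succeeds, with both sharing run properties. The sketch below assembles the payloads such a workflow would need, assuming the field names of the Glue `CreateWorkflow` and `CreateTrigger` APIs. The helper `workflow_plan` and all resource names are hypothetical.

```python
from typing import Any, Dict


def workflow_plan(workflow: str, crawler: str, job: str,
                  run_properties: Dict[str, str]) -> Dict[str, Any]:
    """Assemble payloads for a two-step workflow: crawler first, then job."""
    create_workflow = {
        "Name": workflow,
        # Parameters shared by every entity in a workflow run.
        "DefaultRunProperties": dict(run_properties),
    }
    start = {
        "Name": f"{workflow}-start",
        "WorkflowName": workflow,
        "Type": "ON_DEMAND",
        "Actions": [{"CrawlerName": crawler}],
    }
    after_crawl = {
        "Name": f"{workflow}-after-crawl",
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": crawler,
                    "CrawlState": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job}],
    }
    return {"workflow": create_workflow, "triggers": [start, after_crawl]}


if __name__ == "__main__":
    # All names below are assumed for illustration.
    plan = workflow_plan("sales-pipeline", "sales-crawler", "sales-etl",
                         {"target_date": "2024-01-01"})
    for trigger in plan["triggers"]:
        print(trigger["Name"], trigger["Type"])
```

Associating both triggers with the same `WorkflowName` is what links the crawler and the job into one graph that can be monitored as a unit.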

Setting up many associated entities within AWS Glue workflows can grow increasingly complex. Developers can create AWS Glue blueprints for sharing complex data pipelines with data scientists and business analysts. These templates allow for the consistent and repeatable creation of AWS Glue workflows, abstracting away the technical details.

To learn more about AWS Glue blueprints and workflows, see Performing complex ETL activities using blueprints and workflows in AWS Glue.

Orchestrating AWS Glue jobs with other AWS services

For more automation options, AWS Glue integrates with other AWS services, such as AWS Lambda, AWS Step Functions, and Amazon Managed Workflows for Apache Airflow (Amazon MWAA).
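
As one illustration of this kind of orchestration, AWS Step Functions can run AWS Glue jobs in sequence through its `glue:startJobRun.sync` service integration, which waits for each run to finish. The sketch below generates such a state machine definition as JSON. The helper name, state names, and job names are hypothetical.

```python
import json
from typing import Any, Dict, List


def glue_pipeline_definition(job_names: List[str]) -> Dict[str, Any]:
    """Build a state machine definition that runs Glue jobs one after another."""
    if not job_names:
        raise ValueError("at least one job name is required")
    states: Dict[str, Any] = {}
    for index, job_name in enumerate(job_names):
        state: Dict[str, Any] = {
            "Type": "Task",
            # The ".sync" suffix makes the state wait until the job run completes.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": job_name},
        }
        if index + 1 < len(job_names):
            state["Next"] = f"Step{index + 2}"
        else:
            state["End"] = True
        states[f"Step{index + 1}"] = state
    return {"StartAt": "Step1", "States": states}


if __name__ == "__main__":
    # "extract-job" and "transform-job" are assumed job names.
    definition = glue_pipeline_definition(["extract-job", "transform-job"])
    print(json.dumps(definition, indent=2))
```

Index-based state names keep the definition valid even if the same job appears twice in the list.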

To compare the different orchestration methods for AWS Glue ETL jobs, see Building an operationally excellent data pipeline.

Job bookmarks

Job bookmarks in AWS Glue track the progress of ETL jobs so that subsequent job runs do not reprocess data. When job bookmarks are enabled, AWS Glue maintains a record of data that has already been processed. With each run, it then processes only the new data in the data source. For more information, see Tracking processed data using job bookmarks.
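
The behavior can be pictured with a toy model. The class below is not how AWS Glue implements bookmarks; it is a hypothetical, in-memory sketch that shows only the idea of remembering processed inputs and skipping them on later runs.

```python
from typing import Iterable, List, Set


class ToyBookmark:
    """Remembers which input keys earlier runs have already processed."""

    def __init__(self) -> None:
        self._processed: Set[str] = set()

    def run(self, available_keys: Iterable[str]) -> List[str]:
        """Return only the keys not seen in earlier runs, then record them."""
        new_keys = sorted(key for key in set(available_keys)
                          if key not in self._processed)
        self._processed.update(new_keys)
        return new_keys


if __name__ == "__main__":
    bookmark = ToyBookmark()
    # First run: nothing has been processed yet, so both files are returned.
    print(bookmark.run(["a.csv", "b.csv"]))           # ['a.csv', 'b.csv']
    # Second run: only the file added since the first run is returned.
    print(bookmark.run(["a.csv", "b.csv", "c.csv"]))  # ['c.csv']
```

In an actual job, bookmarks are enabled by setting the `--job-bookmark-option` job parameter to `job-bookmark-enable`, and AWS Glue persists the bookmark state between runs.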
