Important features and concepts - AWS Prescriptive Guidance

Important features and concepts

With AWS Glue, you can debug jobs, monitor their performance, automate job runs, and run a job repeatedly against the same S3 bucket without reprocessing data that an earlier run already handled.

Logging and monitoring

By default, AWS Glue sends logs to Amazon CloudWatch, in log groups whose names begin with /aws-glue.


[Figure: The CloudWatch, CloudWatch Logs, Log groups screen.]

Monitoring options

In addition to the default logs, AWS Glue provides the following options for more advanced monitoring. Choose the options that fit your use case when you create a job, or afterward by editing it. Monitoring options are not available for Python shell jobs.


[Figure: Check boxes for choosing monitoring options.]

  • Job metrics – After you turn on Job metrics, AWS Glue reports metrics to CloudWatch every 30 seconds under the Glue namespace.

    
    [Figure: Metrics screen showing All metrics tab with Glue listed under Custom Namespace.]

    For an overview of the metrics that are sent to CloudWatch, see AWS Glue CloudWatch metrics.

  • Continuous logging – AWS Glue provides real-time, continuous logging for AWS Glue jobs. You can view real-time Apache Spark job logs in CloudWatch, including driver logs, executor logs, and an Apache Spark job progress bar. After you turn on Continuous logging, logs appear under the /aws-glue/jobs/logs-v2 log group.

    
    [Figure: The CloudWatch, CloudWatch Logs, Log groups screen.]

  • Spark UI – When your AWS Glue ETL job runs using Spark, you can store Spark UI logs. To view them, you can deploy the Spark history server on an Amazon Elastic Compute Cloud (Amazon EC2) instance, or run it locally by using Docker. For more information, see Launching the Spark history server.
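The three monitoring options above can also be set as job parameters instead of console check boxes. The following is a minimal sketch, assuming the AWS Glue special parameters `--enable-metrics`, `--enable-continuous-cloudwatch-log`, `--enable-spark-ui`, and `--spark-event-logs-path`. The function name and the S3 path are hypothetical and used only for illustration.

```python
def monitoring_arguments(job_metrics=True, continuous_logging=True,
                         spark_ui_logs_path=None):
    """Build AWS Glue job parameters for the monitoring options.

    spark_ui_logs_path is an S3 prefix (hypothetical in this sketch)
    where Spark event logs are stored for the Spark history server.
    """
    args = {}
    if job_metrics:
        # Job metrics: reported to CloudWatch under the Glue namespace.
        args["--enable-metrics"] = "true"
    if continuous_logging:
        # Continuous logging: real-time driver and executor logs.
        args["--enable-continuous-cloudwatch-log"] = "true"
    if spark_ui_logs_path:
        # Spark UI: persist Spark event logs to the given S3 prefix.
        args["--enable-spark-ui"] = "true"
        args["--spark-event-logs-path"] = spark_ui_logs_path
    return args


if __name__ == "__main__":
    # Hypothetical bucket name, for illustration only.
    print(monitoring_arguments(
        spark_ui_logs_path="s3://amzn-s3-demo-bucket/spark-ui-logs/"))
```

With boto3, a dictionary like this could be passed as `DefaultArguments` when creating a job or as `Arguments` when starting a job run.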

Automation

AWS Glue provides two ways for you to automate ETL jobs: triggers and workflows.

AWS Glue triggers

You can design a basic chain of dependent jobs and crawlers based on a specific schedule or condition by using AWS Glue triggers. For more information, see AWS Glue triggers.
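As an illustration of a schedule-based and a condition-based trigger, the following sketch builds the keyword arguments that the boto3 Glue client's `create_trigger` call accepts. The helper names, trigger names, and job names are hypothetical; the example assumes the jobs already exist.

```python
def scheduled_trigger(name, job_name, cron_expression):
    """Arguments for a trigger that starts a job on a schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        # AWS Glue schedules use the cron(...) syntax with six fields.
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger(name, upstream_job, downstream_job):
    """Arguments for a trigger that starts downstream_job
    after upstream_job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": upstream_job,
                "State": "SUCCEEDED",
            }]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


if __name__ == "__main__":
    # Hypothetical names: run "extract-job" daily at 02:00 UTC,
    # then "transform-job" once it succeeds.
    nightly = scheduled_trigger("nightly", "extract-job", "0 2 * * ? *")
    chained = conditional_trigger("after-extract", "extract-job", "transform-job")
    print(nightly)
    print(chained)
```

Each dictionary could then be passed to the API, for example `boto3.client("glue").create_trigger(**nightly)`.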

AWS Glue workflows

For more complex workloads, you can use workflows to create and visualize ETL activities that involve multiple crawlers, jobs, and triggers. Each workflow manages the running and monitoring of all its components.

The following diagram shows a sample AWS Glue workflow.


[Figure: Diagram showing workflow triggers, jobs, and crawlers.]
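A small workflow of this kind could be defined programmatically. The sketch below assumes the boto3 Glue client's `create_workflow` and `create_trigger` calls, and wires an on-demand start trigger to a crawler and a conditional trigger to a job. All workflow, crawler, and job names are hypothetical, and the crawler and job are assumed to exist already.

```python
def workflow_triggers(workflow_name, crawler_name, job_name):
    """Trigger definitions for: start -> crawler -> job."""
    start = {
        "Name": f"{workflow_name}-start",
        "WorkflowName": workflow_name,
        "Type": "ON_DEMAND",
        "Actions": [{"CrawlerName": crawler_name}],
    }
    after_crawl = {
        "Name": f"{workflow_name}-after-crawl",
        "WorkflowName": workflow_name,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        # Fire only when the crawler finishes successfully.
        "Predicate": {
            "Conditions": [{
                "LogicalOperator": "EQUALS",
                "CrawlerName": crawler_name,
                "CrawlState": "SUCCEEDED",
            }]
        },
        "Actions": [{"JobName": job_name}],
    }
    return [start, after_crawl]


def create_workflow(glue_client, workflow_name, crawler_name, job_name):
    """Create the workflow, then attach its triggers.

    glue_client is expected to behave like boto3.client("glue").
    """
    glue_client.create_workflow(Name=workflow_name)
    for trigger in workflow_triggers(workflow_name, crawler_name, job_name):
        glue_client.create_trigger(**trigger)
```

After the workflow is created, it could be started with `glue_client.start_workflow_run(Name=workflow_name)`.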

Integration with other AWS services

AWS Glue also integrates with other AWS services, such as AWS Lambda and AWS Step Functions, for more automation options.
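One possible form of the Lambda integration is a function that starts a job run. The following is a minimal sketch, assuming the boto3 Glue client's `start_job_run` call. The `GLUE_JOB_NAME` environment variable and the helper name are hypothetical, and the Lambda execution role is assumed to have permission to start the job.

```python
import os


def start_glue_job(glue_client, job_name, arguments=None):
    """Start a Glue job run and return its run ID.

    glue_client is expected to behave like boto3.client("glue").
    """
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments=arguments or {},
    )
    return response["JobRunId"]


def lambda_handler(event, context):
    # boto3 is imported here so that the module can be loaded
    # (and start_glue_job tested) without the AWS SDK installed.
    import boto3

    # GLUE_JOB_NAME is a hypothetical environment variable
    # configured on the Lambda function.
    job_name = os.environ["GLUE_JOB_NAME"]
    run_id = start_glue_job(boto3.client("glue"), job_name)
    return {"JobRunId": run_id}
```

Passing the client as a parameter keeps the job-start logic testable with a stub in place of the real service.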

Job bookmarks

To avoid reprocessing data when a scheduled ETL job runs, AWS Glue uses job bookmarks. AWS Glue tracks data that was already processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is the job bookmark. On each subsequent run, AWS Glue uses the bookmark to skip old data and process only new data. For more information, see Tracking processed data using job bookmarks.
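The idea of persisting state between runs can be illustrated with a toy model. The sketch below is not how AWS Glue implements job bookmarks. It is a simplified, hypothetical example that stores the newest timestamp seen in a local JSON file and skips anything at or before it on the next run. In AWS Glue itself, bookmarks are turned on with the `--job-bookmark-option` job parameter set to `job-bookmark-enable`.

```python
import json
from pathlib import Path


def run_with_bookmark(objects, state_file):
    """Return the keys not yet processed, then advance the bookmark.

    objects maps an object key to its last-modified timestamp.
    state_file is a local path that stands in for persisted job state.
    """
    state_path = Path(state_file)
    if state_path.exists():
        bookmark = json.loads(state_path.read_text())["last_processed"]
    else:
        bookmark = 0  # first run: nothing has been processed yet

    # Only objects newer than the bookmark count as new data.
    new_keys = sorted(k for k, ts in objects.items() if ts > bookmark)

    if new_keys:
        # Persist the newest timestamp so the next run skips these keys.
        bookmark = max(objects[k] for k in new_keys)
        state_path.write_text(json.dumps({"last_processed": bookmark}))
    return new_keys
```

Running the function twice over the same input returns every key the first time and an empty list the second time. Only objects added later, with newer timestamps, are picked up by the following run.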