Overview of using AWS Glue - AWS Glue

Overview of using AWS Glue

With AWS Glue, you store metadata in the AWS Glue Data Catalog. You use this metadata to orchestrate ETL jobs that transform data sources and load your data warehouse or data lake. The following steps describe the general workflow and some of the choices that you make when working with AWS Glue.

Note

You can use the following steps, or you can create a workflow that automatically performs steps 1 through 3. For more information, see Performing complex ETL activities using blueprints and workflows in AWS Glue.

  1. Populate the AWS Glue Data Catalog with table definitions.

    In the console, for persistent data stores, you can add a crawler to populate the AWS Glue Data Catalog. You can start the Add crawler wizard from the list of tables or the list of crawlers. You choose one or more data stores for your crawler to access. You can also create a schedule to determine the frequency of running your crawler. For data streams, you can manually create the table definition, and define stream properties.

    Optionally, you can provide a custom classifier that infers the schema of your data. You can create custom classifiers using a grok pattern. However, AWS Glue provides built-in classifiers that are automatically used by crawlers if a custom classifier does not recognize your data. When you define a crawler, you don't have to select a classifier. For more information about classifiers in AWS Glue, see Defining and managing classifiers.

    Crawling some types of data stores requires a connection that provides authentication and location information. If needed, you can create a connection that provides this required information in the AWS Glue console.

    The crawler reads your data store and creates data definitions and named tables in the AWS Glue Data Catalog. These tables are organized into a database of your choosing. You can also populate the Data Catalog with manually created tables. With this method, you provide the schema and other metadata to create table definitions in the Data Catalog. Because this method can be a bit tedious and error prone, it's often better to have a crawler create the table definitions.

    For more information about populating the AWS Glue Data Catalog with table definitions, see Creating tables.

  2. Define a job that describes the transformation of data from source to target.

    Generally, to create a job, you have to make the following choices:

    • Choose a table from the AWS Glue Data Catalog to be the source of the job. Your job uses this table definition to access your data source and interpret the format of your data.

    • Choose a table or location from the AWS Glue Data Catalog to be the target of the job. Your job uses this information to access your data store.

    • Tell AWS Glue to generate a script to transform your source to target. AWS Glue generates the code to call built-in transforms to convert data from its source schema to target schema format. These transforms perform operations such as copy data, rename columns, and filter data to transform data as necessary. You can modify this script in the AWS Glue console.

    For more information about defining jobs in AWS Glue, see Building visual ETL jobs with AWS Glue Studio.

  3. Run your job to transform your data.

    You can run your job on demand, or start it based on a one of these trigger types:

    • A trigger that is based on a cron schedule.

    • A trigger that is event-based; for example, the successful completion of another job can start an AWS Glue job.

    • A trigger that starts a job on demand.

    For more information about triggers in AWS Glue, see Starting jobs and crawlers using triggers.

  4. Monitor your scheduled crawlers and triggered jobs.

    Use the AWS Glue console to view the following:

    • Job run details and errors.

    • Crawler run details and errors.

    • Any notifications about AWS Glue activities

    For more information about monitoring your crawlers and jobs in AWS Glue, see Monitoring AWS Glue.