AWS Glue Data Catalog - AWS Prescriptive Guidance

AWS Glue Data Catalog

The AWS Glue Data Catalog consists of the following components:

  • Databases and tables

  • Crawlers and classifiers

  • Connections

  • AWS Glue Schema Registry

AWS Glue databases and tables

The Data Catalog consists of databases and tables. A table can belong to only one database. A database can contain tables from any of the data sources that AWS Glue supports.

The following image shows a sample view of a Data Catalog database and corresponding tables.

[Image: Data Catalog database view listing tables with their classification and last updated date.]

You can create the database and tables in the following ways:

  • The AWS Glue crawler

  • An AWS Glue ETL job

  • The AWS Glue console

  • The CreateTable operation in the AWS Glue API

  • AWS CloudFormation templates

  • A migrated Apache Hive metastore

Currently, AWS Glue supports Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, and Java Database Connectivity (JDBC) data sources.
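Of the options listed above, the CreateTable operation is the most direct to illustrate in code. The following is a minimal sketch, assuming the boto3 SDK for Python and a CSV dataset in Amazon S3. The database name, table name, bucket path, and column list are hypothetical placeholders, and the helper functions are illustrative rather than part of AWS Glue.

```python
import json


def build_table_input(name, s3_location, columns):
    """Return a TableInput dict describing a CSV table stored in Amazon S3.

    `columns` is a list of (column_name, hive_type) pairs.
    """
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delimiter": ","},
            },
        },
    }


def create_database_and_table(database_name, table_input):
    """Create the database and the table. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_database(DatabaseInput={"Name": database_name})
    glue.create_table(DatabaseName=database_name, TableInput=table_input)


# Hypothetical example values.
table_input = build_table_input(
    "orders",
    "s3://example-bucket/orders/",
    [("order_id", "bigint"), ("status", "string")],
)
print(json.dumps(table_input, indent=2))
```

Calling create_database_and_table("sales_db", table_input) would send the requests. Keeping the request builder separate from the API call makes the table definition easy to review before anything is created.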

AWS Glue crawlers and classifiers

A crawler helps create and update the Data Catalog tables. It can crawl both file-based and table-based data stores.

Crawlers can crawl the following data stores through their respective native interfaces:

  • Amazon S3

  • DynamoDB

Crawlers can crawl the following data stores through a JDBC connection:

  • Amazon Redshift

  • Amazon Relational Database Service (Amazon RDS)

    • Amazon Aurora

    • Microsoft SQL Server

    • MySQL

    • Oracle

    • PostgreSQL

  • Publicly accessible databases (on-premises or on another cloud provider environment)

    • Aurora

    • Microsoft SQL Server

    • MySQL

    • Oracle

    • PostgreSQL
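The split between native and JDBC data stores shows up in how a crawler's targets are declared. The following is a minimal sketch, assuming boto3, that builds a CreateCrawler request with Amazon S3 and DynamoDB targets (native interfaces) and a JDBC target that refers to an existing AWS Glue connection. The crawler name, role ARN, paths, and connection name are hypothetical.

```python
def build_crawler_request(name, role_arn, database_name,
                          s3_paths=(), dynamodb_tables=(), jdbc_targets=()):
    """Return keyword arguments for the Glue CreateCrawler operation.

    `jdbc_targets` is a sequence of (connection_name, include_path) pairs.
    """
    targets = {}
    if s3_paths:
        targets["S3Targets"] = [{"Path": p} for p in s3_paths]
    if dynamodb_tables:
        targets["DynamoDBTargets"] = [{"Path": t} for t in dynamodb_tables]
    if jdbc_targets:
        targets["JdbcTargets"] = [
            {"ConnectionName": conn, "Path": path} for conn, path in jdbc_targets
        ]
    if not targets:
        raise ValueError("A crawler needs at least one target")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database_name,
        "Targets": targets,
    }


def create_and_start_crawler(request):
    """Create the crawler and start it. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical example values.
request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database_name="sales_db",
    s3_paths=["s3://example-bucket/orders/"],
    dynamodb_tables=["ExampleOrdersTable"],
    jdbc_targets=[("example-postgres-connection", "salesdb/public/%")],
)
print(sorted(request["Targets"]))
```

The JDBC target carries only a connection name and an include path. The endpoint and credentials are expected to live in the connection that it references.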

In an AWS Glue crawler, a classifier recognizes the format of the data and generates the schema. AWS Glue comes with a set of built-in classifiers, but you can also create custom classifiers.

The built-in classifiers for various formats include JavaScript Object Notation (JSON), comma-separated values (CSV), web logs, and many database systems.

If AWS Glue doesn't find a custom classifier that fits the input data format with 100 percent certainty, it invokes the built-in classifiers. The built-in classifiers are invoked in the order shown in the table in the AWS Glue documentation topic Built-in classifiers in AWS Glue. Each built-in classifier returns a result that indicates whether the format matches (certainty=1.0) or does not match (certainty=0.0). The first classifier that returns certainty=1.0 provides the classification string and schema for a metadata table in your Data Catalog.
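The selection rule described above can be modeled in a few lines. The following is a simplified simulation, not AWS Glue code: the toy classifiers, their matching rules, and the classify function are all hypothetical stand-ins that only illustrate "custom classifiers first, then built-in classifiers in order, first certainty of 1.0 wins". The sketch returns None when nothing matches, which is a simplification of this example.

```python
import json


def json_classifier(sample):
    """Toy built-in classifier: certainty 1.0 if the sample parses as JSON."""
    try:
        json.loads(sample)
    except ValueError:
        return 0.0, None
    return 1.0, "json"


def csv_classifier(sample):
    """Toy built-in classifier: 1.0 if every line has the same number of commas."""
    lines = [line for line in sample.splitlines() if line.strip()]
    if len(lines) < 2:
        return 0.0, None
    comma_counts = {line.count(",") for line in lines}
    if len(comma_counts) == 1 and comma_counts.pop() > 0:
        return 1.0, "csv"
    return 0.0, None


def app_log_classifier(sample):
    """Hypothetical custom classifier for lines that start with 'LOG|'."""
    lines = [line for line in sample.splitlines() if line.strip()]
    if lines and all(line.startswith("LOG|") for line in lines):
        return 1.0, "applog"
    return 0.0, None


def classify(sample, custom_classifiers, builtin_classifiers):
    """Try custom classifiers first, then built-in ones, in the given order."""
    for classifier in list(custom_classifiers) + list(builtin_classifiers):
        certainty, classification = classifier(sample)
        if certainty == 1.0:
            return classification
    return None  # simplification: no classifier matched with full certainty


BUILTIN = [json_classifier, csv_classifier]
print(classify('{"id": 1}', [app_log_classifier], BUILTIN))
print(classify("id,name\n1,alice\n2,bob", [app_log_classifier], BUILTIN))
print(classify("LOG|start\nLOG|stop", [app_log_classifier], BUILTIN))
```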

AWS Glue connections

You can use connections to define connection information, such as login credentials and virtual private cloud (VPC) IDs, in one place. This saves time, because you don’t need to provide connection information every time you create a crawler or job.

The following connection types are available:

  • JDBC

  • Amazon RDS

  • Amazon Redshift

  • MongoDB, including Amazon DocumentDB (with MongoDB compatibility)

  • Network (designates a connection to a data source within a VPC environment on AWS)
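As an illustration of keeping connection details in one place, the following minimal sketch builds the input for a JDBC connection with boto3. It assumes that the credentials are stored in AWS Secrets Manager and referenced through the SECRET_ID connection property. The connection name, JDBC URL, secret name, subnet, security group, and Availability Zone are hypothetical.

```python
def build_jdbc_connection_input(name, jdbc_url, secret_id,
                                subnet_id, security_group_ids, availability_zone):
    """Return a ConnectionInput dict for the Glue CreateConnection operation."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            # Assumption: credentials live in Secrets Manager, not in the code.
            "SECRET_ID": secret_id,
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_connection(connection_input):
    """Create the connection. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("glue").create_connection(ConnectionInput=connection_input)


# Hypothetical example values.
connection_input = build_jdbc_connection_input(
    name="example-postgres-connection",
    jdbc_url="jdbc:postgresql://db.example.internal:5432/salesdb",
    secret_id="example/glue/postgres",
    subnet_id="subnet-0123456789abcdef0",
    security_group_ids=["sg-0123456789abcdef0"],
    availability_zone="us-east-1a",
)
print(connection_input["ConnectionType"], connection_input["Name"])
```

After the connection exists, crawlers and jobs can refer to it by name instead of repeating the URL and network settings.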

AWS Glue Schema Registry

The AWS Glue Schema Registry enables disparate systems to share a schema for serialization and deserialization. For example, assume that you have a producer and a consumer of data. The producer knows the schema when it publishes the serialized data. The consumer uses the Schema Registry deserializer library, which parses the schema version ID from the record payload. The consumer then uses that schema to deserialize the data. For example use cases that involve producer and consumer resources, see Integrating with AWS Glue Schema Registry.
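The producer and consumer flow can be illustrated with a toy model. The following sketch is not the Schema Registry client library and does not reproduce its actual wire format: it uses an in-memory dictionary as a stand-in registry, a 16-byte UUID prefix as the schema version ID, and JSON as the record encoding. All names in it are hypothetical.

```python
import json
import uuid

# In-memory stand-in for the registry: schema version ID -> schema.
SCHEMA_VERSIONS = {}


def register_schema(schema):
    """Store a schema and return its (toy) schema version ID."""
    version_id = uuid.uuid4()
    SCHEMA_VERSIONS[version_id] = schema
    return version_id


def serialize(record, version_id):
    """Producer side: prefix the encoded record with the schema version ID."""
    return version_id.bytes + json.dumps(record).encode("utf-8")


def deserialize(payload):
    """Consumer side: read the version ID, look up the schema, decode the record."""
    version_id = uuid.UUID(bytes=payload[:16])
    schema = SCHEMA_VERSIONS[version_id]
    record = json.loads(payload[16:].decode("utf-8"))
    missing = [field for field in schema["fields"] if field not in record]
    if missing:
        raise ValueError(f"Record is missing fields: {missing}")
    return record, schema


order_schema = {"name": "Order", "fields": ["order_id", "status"]}
version_id = register_schema(order_schema)

payload = serialize({"order_id": 42, "status": "shipped"}, version_id)
record, schema = deserialize(payload)
print(schema["name"], record)
```

The consumer never receives the schema from the producer directly. It only needs the ID carried in the payload and access to the shared registry.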

With the AWS Glue Schema Registry, you can manage and enforce schemas on your data streaming applications using convenient integrations with the following data input sources:

  • Apache Kafka

  • Amazon Managed Streaming for Apache Kafka

  • Amazon Kinesis Data Streams

  • Amazon Kinesis Data Analytics for Apache Flink

  • AWS Lambda

The Schema Registry consists of the following components:

  • Schemas – A schema is the abstraction to represent the structure and format of a data record.

  • Registry – A registry is a logical container of schemas. Registries allow you to organize your schemas and manage access control for your applications.
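The relationship between the two components can be sketched with boto3: a registry is created first, and a schema is then created inside it. The following minimal example assumes an Avro schema with BACKWARD compatibility. The registry name, schema name, and record definition are hypothetical.

```python
import json


def build_schema_request(registry_name, schema_name, avro_schema,
                         compatibility="BACKWARD"):
    """Return keyword arguments for the Glue CreateSchema operation."""
    return {
        "RegistryId": {"RegistryName": registry_name},
        "SchemaName": schema_name,
        "DataFormat": "AVRO",
        "Compatibility": compatibility,
        "SchemaDefinition": json.dumps(avro_schema),
    }


def create_registry_and_schema(request):
    """Create the registry, then the schema. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_registry(RegistryName=request["RegistryId"]["RegistryName"])
    return glue.create_schema(**request)


# Hypothetical example values.
order_avro_schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "status", "type": "string"},
    ],
}
schema_request = build_schema_request("example-registry", "orders", order_avro_schema)
print(schema_request["RegistryId"], schema_request["SchemaName"])
```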