
AWS CloudFormation for AWS Glue

AWS CloudFormation is a service that can create many AWS resources. AWS Glue provides API operations to create objects in the AWS Glue Data Catalog. However, it can be more convenient to define and create AWS Glue objects and other related AWS resource objects in an AWS CloudFormation template file. You can then automate the process of creating the objects.

AWS CloudFormation provides a simplified syntax, either JSON (JavaScript Object Notation) or YAML (YAML Ain't Markup Language), to express the creation of AWS resources. You can use AWS CloudFormation templates to define Data Catalog objects such as databases, tables, partitions, crawlers, classifiers, and connections. You can also define ETL objects such as jobs, triggers, and development endpoints. You create a template that describes all the AWS resources you want, and AWS CloudFormation provisions and configures those resources for you.

For more information, see "What Is AWS CloudFormation?" and "Working with AWS CloudFormation Templates" in the AWS CloudFormation User Guide.

If you plan to use AWS CloudFormation templates that are compatible with AWS Glue, as an administrator you must grant access to AWS CloudFormation and to the AWS services and actions on which it depends. To grant permissions to create AWS CloudFormation resources, attach the following policy to the users who work with AWS CloudFormation:


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [                
        "cloudformation:*"        
      ],
      "Resource": "*"
    }
  ]
}

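The following table lists the actions that an AWS CloudFormation template can perform on your behalf. It includes links to information about the AWS resource types and property types that you can add to an AWS CloudFormation template.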

AWS Glue resource	AWS CloudFormation template	AWS Glue samples
Classifier	AWS::Glue::Classifier	Grok classifier, JSON classifier, XML classifier
Connection	AWS::Glue::Connection	MySQL connection
Crawler	AWS::Glue::Crawler	Amazon S3 crawler, MySQL crawler
Database	AWS::Glue::Database	Empty database, database with tables
Development endpoint	AWS::Glue::DevEndpoint	Development endpoint
Job	AWS::Glue::Job	Amazon S3 job, JDBC job
Machine learning transform	AWS::Glue::MLTransform	Machine learning transform
Data quality ruleset	AWS::Glue::DataQualityRuleset	Data quality ruleset, data quality ruleset with EventBridge scheduler
Partition	AWS::Glue::Partition	Partitions of a table
Table	AWS::Glue::Table	Table in a database
Trigger	AWS::Glue::Trigger	On-demand trigger, scheduled trigger, conditional trigger

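To get started, customize the following sample templates with your own metadata. Then use the AWS CloudFormation console to create an AWS CloudFormation stack that adds objects to AWS Glue and any associated services. Many fields in an AWS Glue object are optional. These templates illustrate the fields that are required or are necessary for a working and functional AWS Glue object.

An AWS CloudFormation template can be in either JSON or YAML format. The examples here use YAML for easier readability. Each example contains comments (#) that describe the values defined in the template.

AWS CloudFormation templates can include a Parameters section. You can change the values in this section by editing the sample text, or when you submit the YAML file to the AWS CloudFormation console to create a stack. The Resources section of the template contains the definitions of AWS Glue and related objects. AWS CloudFormation template syntax definitions might contain properties that include more detailed property syntax. Not all properties are required to create an AWS Glue object. These samples show example values for common properties used to create an AWS Glue object.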
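
As a hedged illustration of how the Parameters and Resources sections relate, the following fragment declares one parameter and references it from a resource. The parameter name CFNDatabaseName matches the samples in this topic; the logical ID, the Description text, and the AllowedPattern constraint are assumptions added for illustration and are not required by AWS Glue.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  # The value can be overridden on the console when the stack is created
  CFNDatabaseName:
    Type: String
    Default: cfn-mysampledatabase
    # Illustrative description text
    Description: Name of the AWS Glue database to create
    # Assumed constraint for illustration only: lowercase letters, digits, underscores, hyphens
    AllowedPattern: '[a-z0-9_-]+'
Resources:
  # Hypothetical logical ID
  CFNDatabaseSketch:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        # !Ref substitutes the parameter value chosen at stack creation
        Name: !Ref CFNDatabaseName
```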

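Sample AWS CloudFormation template for an AWS Glue database

An AWS Glue database in the Data Catalog contains metadata tables. The database has very few properties to configure and can be created in the Data Catalog with an AWS CloudFormation template. The following sample template shows how to get started and illustrates the use of AWS CloudFormation stacks with AWS Glue. The only resource that this sample template creates is a database named cfn-mysampledatabase. You can change the name by editing the sample text or by changing the value on the AWS CloudFormation console when you submit the YAML.

The following shows example values for common properties used to create an AWS Glue database. For more information about the AWS CloudFormation database template for AWS Glue, see "AWS::Glue::Database".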



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CloudFormation template in YAML to demonstrate creating a database named cfn-mysampledatabase
# The metadata created in the Data Catalog points to the flights public S3 bucket
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  CFNDatabaseName:
    Type: String
    Default: cfn-mysampledatabase

# Resources section defines metadata for the Data Catalog
Resources:
# Create an AWS Glue database
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      # The database is created in the Data Catalog for your account
      CatalogId: !Ref AWS::AccountId   
      DatabaseInput:
        # The name of the database is defined in the Parameters section above
        Name: !Ref CFNDatabaseName	
        Description: Database to hold tables for flights data
        LocationUri: s3://crawler-public-us-east-1/flight/2016/csv/
        #Parameters: Leave AWS database parameters blank

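Sample AWS CloudFormation template for an AWS Glue database, table, and partition

An AWS Glue table contains the metadata that defines the structure and location of the data that you want to process with your ETL scripts. Within a table, you can define partitions to parallelize the processing of your data. A partition is a chunk of data that you define with a key. For example, if you use month as a key, all the data for January is contained in the same partition. In AWS Glue, databases can contain tables, and tables can contain partitions.

The following sample shows how to populate a database, a table, and partitions using an AWS CloudFormation template. The base data format is csv, delimited by a comma (,). Because a database must exist before it can contain a table, and a table must exist before partitions can be created, the template uses the DependsOn statement to define the dependencies among these objects when they are created.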
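
As an alternative sketch, a dependency can also be expressed implicitly: when one resource references another with !Ref, AWS CloudFormation orders their creation without an explicit DependsOn. The fragment below assumes that !Ref on an AWS::Glue::Database resource returns the database name; check the AWS::Glue::Database reference before relying on this. The logical IDs are hypothetical.

```yaml
Resources:
  # Hypothetical logical IDs; the names reuse the defaults from this topic
  SketchDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: cfn-database-flights-1
  SketchTable:
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      # Referencing the database resource (assumed to return its name)
      # makes CloudFormation create the database before the table
      DatabaseName: !Ref SketchDatabase
      TableInput:
        Name: cfn-manual-table-flights-1
        TableType: EXTERNAL_TABLE
```

An explicit DependsOn remains necessary when the name is passed through a parameter, because a parameter reference alone does not link the two resources.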

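The values in this sample define a table that contains flight data from a publicly available Amazon S3 bucket. For illustration, only a few columns of the data and one partitioning key are defined. Four partitions are also defined in the Data Catalog. Some fields that describe the storage of the base data are also shown in the StorageDescriptor fields.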



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CloudFormation template in YAML to demonstrate creating a database, a table, and partitions
# The metadata created in the Data Catalog points to the flights public S3 bucket
#
# Parameters substituted in the Resources section
# These parameters are names of the resources created in the Data Catalog
Parameters:
  CFNDatabaseName:
    Type: String
    Default: cfn-database-flights-1
  CFNTableName1:
    Type: String
    Default: cfn-manual-table-flights-1
# Resources to create metadata in the Data Catalog
Resources:
###
# Create an AWS Glue database
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName	
        Description: Database to hold tables for flights data
###
# Create an AWS Glue table
  CFNTableFlights:
    # Creating the table waits for the database to be created
    DependsOn: CFNDatabaseFlights
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableInput:
        Name: !Ref CFNTableName1
        Description: Define the first few columns of the flights table
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: csv
#       ViewExpandedText: String
        PartitionKeys:
        # Data is partitioned by month
        - Name: mon
          Type: bigint
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: year
            Type: bigint
          - Name: quarter
            Type: bigint
          - Name: month
            Type: bigint
          - Name: day_of_month
            Type: bigint			
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
# Partition 1
# Create an AWS Glue partition  
  CFNPartitionMon1:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
        - 1
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: mon
            Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=1/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
# Partition 2
# Create an AWS Glue partition 
  CFNPartitionMon2:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
        - 2
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: mon
            Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=2/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
# Partition 3
# Create an AWS Glue partition 
  CFNPartitionMon3:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
        - 3
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: mon
            Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=3/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
# Partition 4
# Create an AWS Glue partition 
  CFNPartitionMon4:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
        - 4
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: mon
            Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=4/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

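Sample AWS CloudFormation template for an AWS Glue grok classifier

An AWS Glue classifier determines the schema of your data. One type of custom classifier uses a grok pattern to match your data. If the pattern matches, the custom classifier creates the table's schema and sets the classification to the value set in the classifier definition.

This sample creates a classifier that creates a schema with one column named message and sets the classification to greedy.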
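
A grok pattern can also split each line into several named columns. The following fragment is a minimal sketch, assuming log lines of the form "timestamp level message"; the logical ID, the classifier name, the classification value app-log, and the custom pattern APPLEVEL are all hypothetical.

```yaml
Resources:
  SketchGrokClassifier:
    Type: AWS::Glue::Classifier
    Properties:
      GrokClassifier:
        Name: cfn-classifier-grok-sketch
        Classification: app-log
        # Each %{PATTERN:name} pair becomes one column in the resulting schema
        GrokPattern: "%{TIMESTAMP_ISO8601:timestamp} %{APPLEVEL:level} %{GREEDYDATA:message}"
        # A custom pattern is written as a name followed by a regular expression
        CustomPatterns: "APPLEVEL (DEBUG|INFO|WARN|ERROR)"
```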



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a classifier
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the classifier to be created
  CFNClassifierName:  
    Type: String
    Default: cfn-classifier-grok-one-column-1                                                               	
#
#
# Resources section defines metadata for the Data Catalog
Resources:
# Create classifier that uses grok pattern to put all data in one column and classifies it as "greedy".	
  CFNClassifierFlights:
    Type: AWS::Glue::Classifier   
    Properties:
      GrokClassifier:
        #Grok classifier that puts all data in one column		
        Name: !Ref CFNClassifierName
        Classification: greedy                                                        	   
        GrokPattern: "%{GREEDYDATA:message}"
        #CustomPatterns: none

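Sample AWS CloudFormation template for an AWS Glue JSON classifier

An AWS Glue classifier determines the schema of your data. One type of custom classifier uses a JsonPath string that defines the JSON data for the classifier to classify. AWS Glue supports a subset of the JsonPath operators, as described in "Writing JsonPath Custom Classifiers".

If the pattern matches, the custom classifier is used to create the table's schema.

This sample creates a classifier that creates a schema from each record in the Records3 array of an object.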



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a JSON classifier
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the classifier to be created
  CFNClassifierName:  
    Type: String
    Default: cfn-classifier-json-one-column-1                                                               	
#
#
# Resources section defines metadata for the Data Catalog
Resources:
# Create classifier that uses a JSON pattern.	
  CFNClassifierFlights:
    Type: AWS::Glue::Classifier   
    Properties:
      JsonClassifier:
        #JSON classifier		
        Name: !Ref CFNClassifierName
        JsonPath: $.Records3[*]

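Sample AWS CloudFormation template for an AWS Glue XML classifier

An AWS Glue classifier determines the schema of your data. One type of custom classifier specifies an XML tag that designates the element containing each record in the XML document being parsed. If the pattern matches, the custom classifier creates the table's schema and sets the classification to the value set in the classifier definition.

This sample creates a classifier that creates a schema from each record in the Records tag (the RowTag value in the template) and sets the classification to XML.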



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating an XML classifier
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the classifier to be created
  CFNClassifierName:  
    Type: String
    Default: cfn-classifier-xml-one-column-1                                                               	
#
#
# Resources section defines metadata for the Data Catalog
Resources:
# Create classifier that uses the XML pattern and classifies it as "XML".	
  CFNClassifierFlights:
    Type: AWS::Glue::Classifier   
    Properties:
      XMLClassifier:
        #XML classifier		
        Name: !Ref CFNClassifierName
        Classification: XML   
        RowTag: <Records>

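Sample AWS CloudFormation template for an AWS Glue crawler for Amazon S3

An AWS Glue crawler creates metadata tables in your Data Catalog that correspond to your data. You can then use these table definitions as sources and targets in your ETL jobs.

This sample creates a crawler, the required IAM role, and an AWS Glue database in the Data Catalog. When this crawler is run, it assumes the IAM role and creates a table in the database for the public flights data. The table is created with the prefix cfn_sample_1_. The IAM role created by this template grants global permissions, so you might want to create a custom role instead. No custom classifiers are defined for this crawler. AWS Glue built-in classifiers are used by default.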
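
Because the role in the sample template allows all actions on all resources, a narrower role may be preferable outside of a demonstration. The following fragment is one possible sketch, assuming that the AWS managed policy AWSGlueServiceRole plus read-only access to the public flights bucket is sufficient for this crawler. The logical ID and policy name are hypothetical; adjust the statements to your own data locations.

```yaml
Resources:
  # Hypothetical logical ID for a scoped-down crawler role
  CFNRoleFlightsScoped:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      # AWS managed policy with the baseline permissions for AWS Glue
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
      Policies:
        - PolicyName: "read-flights-data"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              # Read the objects under the crawled prefix only
              - Effect: "Allow"
                Action:
                  - "s3:GetObject"
                Resource: "arn:aws:s3:::crawler-public-us-east-1/flight/2016/csv/*"
              # List the bucket so the crawler can discover objects
              - Effect: "Allow"
                Action:
                  - "s3:ListBucket"
                Resource: "arn:aws:s3:::crawler-public-us-east-1"
```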

When you submit this sample to the AWS CloudFormation console, you must confirm that you want to create the IAM role.



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a crawler
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the crawler to be created
  CFNCrawlerName:  
    Type: String
    Default: cfn-crawler-flights-1
  CFNDatabaseName:
    Type: String
    Default: cfn-database-flights-1
  CFNTablePrefixName:
    Type: String
    Default: cfn_sample_1_	
#
#
# Resources section defines metadata for the Data Catalog
Resources:
#Create IAM Role assumed by the crawler. For demonstration, this role is given all permissions.
  CFNRoleFlights:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        -
          PolicyName: "root"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              -
                Effect: "Allow"
                Action: "*"
                Resource: "*"
 # Create a database to contain tables created by the crawler
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
        Description: "AWS Glue container to hold metadata tables for the flights crawler"
 #Create a crawler to crawl the flights data on a public S3 bucket
  CFNCrawlerFlights:
    Type: AWS::Glue::Crawler
    Properties:
      Name: !Ref CFNCrawlerName
      Role: !GetAtt CFNRoleFlights.Arn
      #Classifiers: none, use the default classifier
      Description: AWS Glue crawler to crawl flights data
      #Schedule: none, use default run-on-demand
      DatabaseName: !Ref CFNDatabaseName
      Targets:
        S3Targets:
          # Public S3 bucket with the flights data
          - Path: "s3://crawler-public-us-east-1/flight/2016/csv"
      TablePrefix: !Ref CFNTablePrefixName
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"

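Sample AWS CloudFormation template for an AWS Glue connection

An AWS Glue connection in the Data Catalog contains the JDBC and network information that is required to connect to a JDBC database. This information is used when you connect to a JDBC database to crawl or run ETL jobs.

This sample creates a connection to an Amazon RDS MySQL database named devdb. When this connection is used, an IAM role, database credentials, and network connection values must also be supplied. See the details of the necessary fields in the template.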



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a connection
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the connection to be created
  CFNConnectionName:  
    Type: String
    Default: cfn-connection-mysql-flights-1
  CFNJDBCString:  
    Type: String
    Default: "jdbc:mysql://xxx-mysql.yyyyyyyyyyyyyy.us-east-1.rds.amazonaws.com:3306/devdb"
  CFNJDBCUser:  
    Type: String
    Default: "master"
  CFNJDBCPassword:  
    Type: String
    Default: "12345678"
    NoEcho: true
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  CFNConnectionMySQL:
    Type: AWS::Glue::Connection
    Properties:
      CatalogId: !Ref AWS::AccountId
      ConnectionInput: 
        Description: "Connect to MySQL database."
        ConnectionType: "JDBC"
        #MatchCriteria: none		
        PhysicalConnectionRequirements:
          AvailabilityZone: "us-east-1d"
          SecurityGroupIdList: 
           - "sg-7d52b812"
          SubnetId: "subnet-84f326ee" 
        ConnectionProperties: {
          "JDBC_CONNECTION_URL": !Ref CFNJDBCString,
          "USERNAME": !Ref CFNJDBCUser,
          "PASSWORD": !Ref CFNJDBCPassword
        }
        Name: !Ref CFNConnectionName

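Sample AWS CloudFormation template for an AWS Glue crawler for JDBC

This sample creates a crawler, the required IAM role, and an AWS Glue database in the Data Catalog. When this crawler is run, it assumes the IAM role and creates a table in the database for the public flights data that has been stored in a MySQL database. The table is created with the prefix cfn_jdbc_1_. The IAM role created by this template grants global permissions, so you might want to create a custom role instead. No custom classifiers can be defined for JDBC data. AWS Glue built-in classifiers are used by default.

When you submit this sample to the AWS CloudFormation console, you must confirm that you want to create the IAM role.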



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a crawler
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the crawler to be created
  CFNCrawlerName:  
    Type: String
    Default: cfn-crawler-jdbc-flights-1
# The name of the database to be created to contain tables	
  CFNDatabaseName:
    Type: String
    Default: cfn-database-jdbc-flights-1
# The prefix for all tables crawled and created	
  CFNTablePrefixName:
    Type: String
    Default: cfn_jdbc_1_
# The name of the existing connection to the MySQL database
  CFNConnectionName:  
    Type: String
    Default: cfn-connection-mysql-flights-1
# The name of the JDBC path (database/schema/table) with wildcard (%) to crawl	
  CFNJDBCPath:  
    Type: String
    Default: saldev/%		
#
#
# Resources section defines metadata for the Data Catalog
Resources:
#Create IAM Role assumed by the crawler. For demonstration, this role is given all permissions.
  CFNRoleFlights:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        -
          PolicyName: "root"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              -
                Effect: "Allow"
                Action: "*"
                Resource: "*"
 # Create a database to contain tables created by the crawler
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
        Description: "AWS Glue container to hold metadata tables for the flights crawler"
 #Create a crawler to crawl the flights data in MySQL database
  CFNCrawlerFlights:
    Type: AWS::Glue::Crawler
    Properties:
      Name: !Ref CFNCrawlerName
      Role: !GetAtt CFNRoleFlights.Arn
      #Classifiers: none, use the default classifier
      Description: AWS Glue crawler to crawl flights data
      #Schedule: none, use default run-on-demand
      DatabaseName: !Ref CFNDatabaseName
      Targets:
        JdbcTargets:
          # JDBC MySQL database with the flights data
          - ConnectionName: !Ref CFNConnectionName
            Path: !Ref CFNJDBCPath
          #Exclusions: none
      TablePrefix: !Ref CFNTablePrefixName
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"

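Sample AWS CloudFormation template for an AWS Glue job for Amazon S3 to Amazon S3

An AWS Glue job in the Data Catalog contains the parameter values that are required to run a script in AWS Glue.

This sample creates a job that reads flight data from an Amazon S3 bucket in csv format and writes it to an Amazon S3 Parquet file. The script that is run by this job must already exist. You can generate an ETL script for your environment with the AWS Glue console. When this job is run, an IAM role with the correct permissions must also be supplied.

Common parameter values are shown in the template. For example, AllocatedCapacity (DPUs) defaults to 5.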
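
The DefaultArguments property, which appears only as a comment in the sample template, takes a map of argument names to values. As a sketch only, a job whose script is written in Scala might set it as follows. The logical ID, the class name, and the temporary directory path are placeholders; the parameter names and defaults are taken from the sample template.

```yaml
Parameters:
  CFNIAMRoleName:
    Type: String
    Default: AWSGlueServiceRoleGA
  CFNScriptLocation:
    Type: String
    Default: s3://aws-glue-scripts-123456789012-us-east-1/myid/sal-job-test2
Resources:
  # Hypothetical logical ID
  CFNJobSketch:
    Type: AWS::Glue::Job
    Properties:
      Role: !Ref CFNIAMRoleName
      Command:
        Name: glueetl
        ScriptLocation: !Ref CFNScriptLocation
      DefaultArguments:
        # Argument names begin with two hyphens; they are quoted here for clarity
        "--job-language": "scala"
        # Placeholder: fully qualified name of your Scala class
        "--class": "your.scala.ClassName"
        # Placeholder path: temporary directory, if the script requires one
        "--TempDir": "s3://aws-glue-temporary-xyc/sal"
```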



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a job using the public flights S3 table in a public bucket
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the job to be created
  CFNJobName:  
    Type: String
    Default: cfn-job-S3-to-S3-2
# The name of the IAM role that the job assumes. It must have access to data, script, temporary directory
  CFNIAMRoleName:  
    Type: String
    Default: AWSGlueServiceRoleGA
# The S3 path where the script for this job is located
  CFNScriptLocation:  
    Type: String
    Default: s3://aws-glue-scripts-123456789012-us-east-1/myid/sal-job-test2	
#
#
# Resources section defines metadata for the Data Catalog
Resources:                                      
# Create a job to run a script that reads the flightscsv table and writes the output to S3 as Parquet.
# The script already exists and is called by this job	
  CFNJobFlights:
    Type: AWS::Glue::Job   
    Properties:
      Role: !Ref CFNIAMRoleName  
      #DefaultArguments: JSON object 
      # If the script is written in Scala, set DefaultArguments to {'--job-language': 'scala', '--class': 'your scala class'}
      #Connections:  No connection needed for S3 to S3 job 
      #  ConnectionsList  
      #MaxRetries: Double  
      Description: Job created with CloudFormation  
      #LogUri: String  
      Command:   
        Name: glueetl  
        ScriptLocation: !Ref CFNScriptLocation
             # for access to directories use proper IAM role with permission to buckets and folders that begin with "aws-glue-"					 
             # script uses temp directory from job definition if required (temp directory not used S3 to S3)
             # script defines target for output as s3://aws-glue-target/sal    			 
      AllocatedCapacity: 5  
      ExecutionProperty:   
        MaxConcurrentRuns: 1  
      Name: !Ref CFNJobName

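Sample AWS CloudFormation template for an AWS Glue job for JDBC to Amazon S3

An AWS Glue job in the Data Catalog contains the parameter values that are required to run a script in AWS Glue.

This sample creates a job that reads flight data from a MySQL JDBC database, as defined by the connection named cfn-connection-mysql-flights-1, and writes it to an Amazon S3 Parquet file. The script that is run by this job must already exist. You can generate an ETL script for your environment with the AWS Glue console. When this job is run, an IAM role with the correct permissions must also be supplied.

Common parameter values are shown in the template. For example, AllocatedCapacity (DPUs) defaults to 5.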



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a job using a MySQL JDBC DB with the flights data to an S3 file
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the job to be created
  CFNJobName:  
    Type: String
    Default: cfn-job-JDBC-to-S3-1
# The name of the IAM role that the job assumes. It must have access to data, script, temporary directory
  CFNIAMRoleName:  
    Type: String
    Default: AWSGlueServiceRoleGA
# The S3 path where the script for this job is located
  CFNScriptLocation:  
    Type: String
    Default: s3://aws-glue-scripts-123456789012-us-east-1/myid/sal-job-dec4a	
# The name of the connection used for JDBC data source
  CFNConnectionName:  
    Type: String
    Default: cfn-connection-mysql-flights-1
#
#
# Resources section defines metadata for the Data Catalog
Resources:                                      
# Create a job to run a script that reads the JDBC flights table through a connection and writes the output to S3 as Parquet.
# The script already exists and is called by this job	
  CFNJobFlights:
    Type: AWS::Glue::Job   
    Properties:
      Role: !Ref CFNIAMRoleName  
      #DefaultArguments: JSON object  
      # For example, if required by the script, set the temporary directory as DefaultArguments={'--TempDir': 's3://aws-glue-temporary-xyc/sal'}
      Connections:
        Connections:
        - !Ref CFNConnectionName 
      #MaxRetries: Double  
      Description: Job created with CloudFormation using existing script
      #LogUri: String  
      Command:   
        Name: glueetl  
        ScriptLocation: !Ref CFNScriptLocation
             # for access to directories use proper IAM role with permission to buckets and folders that begin with "aws-glue-"					 
             # if required, script defines temp directory as argument TempDir and used in script like redshift_tmp_dir = args["TempDir"] 
             # script defines target for output as s3://aws-glue-target/sal    			 
      AllocatedCapacity: 5  
      ExecutionProperty:   
        MaxConcurrentRuns: 1  
      Name: !Ref CFNJobName

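Sample AWS CloudFormation template for an AWS Glue on-demand trigger

An AWS Glue trigger in the Data Catalog contains the parameter values that are required to start a job run when the trigger fires. An on-demand trigger fires when you enable it.

This sample creates an on-demand trigger that starts one job named cfn-job-S3-to-S3-1.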



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating an on-demand trigger
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  # The existing job to be started by this trigger 
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-1
  # The name of the trigger to be created
  CFNTriggerName:
    Type: String
    Default: cfn-trigger-ondemand-flights-1	
#
# Resources section defines metadata for the Data Catalog
# Sample CFN YAML to demonstrate creating an on-demand trigger for a job	
Resources:                                      
# Create trigger to run an existing job (CFNJobName) on an on-demand schedule.	
  CFNTriggerSample:
    Type: AWS::Glue::Trigger   
    Properties:
      Name:
        Ref: CFNTriggerName		
      Description: Trigger created with CloudFormation
      Type: ON_DEMAND                                                        	   
      Actions:
        - JobName: !Ref CFNJobName                	  
        # Arguments: JSON object
      #Schedule: 
      #Predicate:

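Sample AWS CloudFormation template for an AWS Glue scheduled trigger

An AWS Glue trigger in the Data Catalog contains the parameter values that are required to start a job run when the trigger fires. A scheduled trigger fires when it is enabled and its cron schedule comes due.

This sample creates a scheduled trigger that starts one job named cfn-job-S3-to-S3-1. The schedule is a cron expression that runs the job every 10 minutes on weekdays.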
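
The Schedule value uses a six-field cron syntax (minutes, hours, day of month, month, day of week, year). As a sketch of a different schedule, the following fragment runs the job once a day at 12:00 UTC on weekdays. The logical ID and trigger name are hypothetical, and StartOnCreation is included on the assumption that you want the trigger activated as soon as the stack creates it.

```yaml
Resources:
  # Hypothetical logical ID and trigger name
  SketchDailyTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: cfn-trigger-daily-sketch
      Type: SCHEDULED
      # Activate the trigger when it is created
      StartOnCreation: true
      Actions:
        - JobName: cfn-job-S3-to-S3-1
      # minute 0, hour 12, any day of month, every month, Monday through Friday, any year
      Schedule: cron(0 12 ? * MON-FRI *)
```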



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a scheduled trigger
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  # The existing job to be started by this trigger 
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-1
  # The name of the trigger to be created
  CFNTriggerName:
    Type: String
    Default: cfn-trigger-scheduled-flights-1	
#
# Resources section defines metadata for the Data Catalog
# Sample CFN YAML to demonstrate creating a scheduled trigger for a job
#	
Resources:                                      
# Create trigger to run an existing job (CFNJobName) on a cron schedule.	
  TriggerSample1CFN:
    Type: AWS::Glue::Trigger   
    Properties:
      Name:
        Ref: CFNTriggerName		
      Description: Trigger created with CloudFormation
      Type: SCHEDULED                                                        	   
      Actions:
        - JobName: !Ref CFNJobName                	  
        # Arguments: JSON object
      # Run the trigger every 10 minutes, Monday through Friday
      Schedule: cron(0/10 * ? * MON-FRI *) 
      #Predicate:

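Sample AWS CloudFormation template for an AWS Glue conditional trigger

An AWS Glue trigger in the Data Catalog contains the parameter values that are required to start a job run when the trigger fires. A conditional trigger fires when it is enabled and its conditions are met, such as a job completing successfully.

This sample creates a conditional trigger that starts one job named cfn-job-S3-to-S3-1. This job starts when the job named cfn-job-S3-to-S3-2 completes successfully.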
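
A predicate can list more than one condition, in which case the Logical value determines how they combine. The following fragment is a sketch, assuming the two jobs named in the conditions already exist; the logical ID and trigger name are hypothetical, and the job names reuse defaults from the job samples in this topic.

```yaml
Resources:
  # Hypothetical logical ID and trigger name
  SketchConditionalTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: cfn-trigger-conditional-sketch
      Type: CONDITIONAL
      Actions:
        - JobName: cfn-job-S3-to-S3-1
      Predicate:
        # AND: the trigger fires only after every listed condition is met
        Logical: AND
        Conditions:
          - LogicalOperator: EQUALS
            JobName: cfn-job-S3-to-S3-2
            State: SUCCEEDED
          - LogicalOperator: EQUALS
            JobName: cfn-job-JDBC-to-S3-1
            State: SUCCEEDED
```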



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a conditional trigger for a job, which starts when another job completes
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  # The existing job to be started by this trigger 
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-1
  # The existing job that when it finishes causes trigger to fire
  CFNJobName2:
    Type: String
    Default: cfn-job-S3-to-S3-2	
  # The name of the trigger to be created
  CFNTriggerName:
    Type: String
    Default: cfn-trigger-conditional-1	
#	
Resources:                                      
# Create trigger to run an existing job (CFNJobName) when another job completes (CFNJobName2).	
  CFNTriggerSample:
    Type: AWS::Glue::Trigger   
    Properties:
      Name:
        Ref: CFNTriggerName		
      Description: Trigger created with CloudFormation
      Type: CONDITIONAL                                                        	   
      Actions:
        - JobName: !Ref CFNJobName                	  
        # Arguments: JSON object
      #Schedule: none 
      Predicate:
        #Value for Logical is required if more than 1 job listed in Conditions	  
        Logical: AND
        Conditions:
          - LogicalOperator: EQUALS	
            JobName: !Ref CFNJobName2
            State: SUCCEEDED
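
The comment in the template above notes that `Logical` is required when more than one job is listed in `Conditions`. A minimal sketch of that case follows, assuming a third job exists. The `CFNJobName3` parameter, its default, the trigger name, and the logical ID `CFNTriggerTwoConditions` are all hypothetical. `StartOnCreation` is set so that the trigger is activated when the stack creates it.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
# Hypothetical variation: fire only after two upstream jobs have both succeeded.
Parameters:
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-1
  CFNJobName2:
    Type: String
    Default: cfn-job-S3-to-S3-2
  # Assumed extra job; not part of the original sample
  CFNJobName3:
    Type: String
    Default: cfn-job-S3-to-S3-3
Resources:
  CFNTriggerTwoConditions:
    Type: AWS::Glue::Trigger
    Properties:
      Name: cfn-trigger-conditional-2
      Description: Trigger with two conditions created with CloudFormation
      Type: CONDITIONAL
      # Activate the trigger on creation
      StartOnCreation: true
      Actions:
        - JobName: !Ref CFNJobName
      Predicate:
        # AND: every condition below must hold before the trigger fires
        Logical: AND
        Conditions:
          - LogicalOperator: EQUALS
            JobName: !Ref CFNJobName2
            State: SUCCEEDED
          - LogicalOperator: EQUALS
            JobName: !Ref CFNJobName3
            State: SUCCEEDED
```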

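Sample AWS CloudFormation template for an AWS Glue machine learning transform

An AWS Glue machine learning transform is a custom transform for cleansing your data. Currently, one transform is available, named FindMatches. The FindMatches transform lets you identify duplicate or matching records in your dataset, even when the records have no common unique identifier and no fields match exactly.

This sample creates a machine learning transform. For more information about the parameters that you need to create a machine learning transform, see "Matching records with AWS Lake Formation FindMatches".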



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a machine learning transform
#
# Resources section defines metadata for the machine learning transform
Resources:
  MyMLTransform:
    Type: "AWS::Glue::MLTransform"
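    # The isGlueMLGARegion condition must be defined in a Conditions section
    # of this template; the definition is not shown in this sample
    Condition: "isGlueMLGARegion"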
    Properties:
      Name: !Sub "MyTransform"
      Description: "The bestest transform ever"
      Role: !ImportValue MyMLTransformUserRole
      GlueVersion: "1.0"
      WorkerType: "Standard"
      NumberOfWorkers: 5
      Timeout: 120
      MaxRetries: 1
      InputRecordTables:
        GlueTables:
          - DatabaseName: !ImportValue MyMLTransformDatabase
            TableName: !ImportValue MyMLTransformTable
      TransformParameters:
        TransformType: "FIND_MATCHES"
        FindMatchesParameters:
          PrimaryKeyColumnName: "testcolumn"
          PrecisionRecallTradeoff: 0.5
          AccuracyCostTradeoff: 0.5
          EnforceProvidedLabels: True
      Tags:
        key1: "value1"
        key2: "value2"
      TransformEncryption:
        TaskRunSecurityConfigurationName: !ImportValue MyMLTransformSecurityConfiguration
        MLUserDataEncryption:
          MLUserDataEncryptionMode: "SSE-KMS"
          KmsKeyId: !ImportValue MyMLTransformEncryptionKey
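
The template above attaches a condition named `isGlueMLGARegion` to the resource but does not show where that condition is defined, and CloudFormation rejects a template that references an undefined condition. One possible sketch of a matching `Conditions` section is shown below. The Region list is purely illustrative and is an assumption, not taken from the original sample.

```yaml
# Hypothetical Conditions section to add at the top level of the template.
# Assumption: the two Regions listed here are placeholders only.
Conditions:
  isGlueMLGARegion: !Or
    - !Equals [!Ref "AWS::Region", us-east-1]
    - !Equals [!Ref "AWS::Region", us-west-2]
```

With a section like this in place, the transform is created only when the stack is deployed in one of the listed Regions. The `!ImportValue` references in the sample likewise assume that another stack already exports those values.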

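Sample AWS CloudFormation template for an AWS Glue Data Quality ruleset

An AWS Glue Data Quality ruleset contains rules that can be evaluated against a table in the Data Catalog. After the ruleset is attached to your target table, you can go to the Data Catalog and run an evaluation that checks your data against the rules in the ruleset. These rules range from evaluating row counts to evaluating the referential integrity of your data.

The following sample is a CloudFormation template that creates a ruleset containing several rules on the specified target table.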


AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a DataQualityRuleset
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
  # The name of the ruleset to be created
  RulesetName:  
    Type: String
    Default: "CFNRulesetName"
  RulesetDescription:  
    Type: String
    Default: "CFN DataQualityRuleset"
  # Rules that will be associated with this ruleset
  Rules:  
    Type: String
    Default: 'Rules = [
        RowCount > 100,
        IsUnique "id",
        IsComplete "nametype"
        ]'
  # Name of the database and table within the Data Catalog that the ruleset
  # will be applied to
  DatabaseName:  
    Type: String
    Default: "ExampleDatabaseName"
  TableName:  
    Type: String
    Default: "ExampleTableName"

# Resources section defines metadata for the Data Catalog
Resources:
  # Creates a Data Quality ruleset under specified rules 
  DQRuleset:
    Type: AWS::Glue::DataQualityRuleset
    Properties:
      Name: !Ref RulesetName
      Description: !Ref RulesetDescription
      # The String within rules must be formatted in DQDL, a language 
      # used specifically to make rules
      Ruleset: !Ref Rules
      # The targeted table must exist within Data Catalog alongside 
      # the correct database
      TargetTable:
        DatabaseName: !Ref DatabaseName
        TableName: !Ref TableName
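
As an alternative to passing the DQDL text through a parameter, the rules could be written inline. The sketch below assumes the same example database, table, and column names as the template above. The logical ID `DQRulesetInline`, the ruleset name, and the `Completeness` threshold are hypothetical.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
# Hypothetical variation: the ruleset is written inline instead of via a parameter.
Resources:
  DQRulesetInline:
    Type: AWS::Glue::DataQualityRuleset
    Properties:
      Name: CFNInlineRulesetName
      Description: CFN DataQualityRuleset with inline DQDL
      # The literal block scalar (|) keeps the DQDL line breaks as written
      Ruleset: |
        Rules = [
            RowCount > 100,
            IsUnique "id",
            Completeness "nametype" > 0.95
        ]
      # The target table must already exist in the Data Catalog
      TargetTable:
        DatabaseName: ExampleDatabaseName
        TableName: ExampleTableName
```

Keeping the rules inline makes the template easier to read, while the parameter-based form in the sample makes it easier to reuse one template with different rulesets.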

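Sample AWS CloudFormation template for an AWS Glue Data Quality ruleset with EventBridge scheduler

An AWS Glue Data Quality ruleset contains rules that can be evaluated against a table in the Data Catalog. After the ruleset is attached to your target table, you can go to the Data Catalog and run an evaluation that checks your data against the rules in the ruleset. Instead of going to the Data Catalog manually to evaluate the ruleset, you can also add an EventBridge scheduler to the CloudFormation template to schedule these ruleset evaluations at a fixed interval.

The following sample is a CloudFormation template that creates a Data Quality ruleset and an EventBridge scheduler that evaluates that ruleset every five minutes.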


AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a DataQualityRuleset and an
# EventBridge Scheduler schedule that evaluates it
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:                                                                                                       
  # The name of the ruleset to be created
  RulesetName:  
    Type: String
    Default: "CFNRulesetName"
  # Rules that will be associated with this Ruleset
  Rules:  
    Type: String
    Default: 'Rules = [
        RowCount > 100,
        IsUnique "id",
        IsComplete "nametype"
        ]'
  # The name of the Schedule to be created  
  ScheduleName:  
    Type: String
    Default: "ScheduleDQRulesetEvaluation"
  # This expression determines the rate at which the Schedule will evaluate
  # your data using the above ruleset
  ScheduleRate:
    Type: String
    Default: "rate(5 minutes)"
  # The request that is sent must match the details of the Data Quality ruleset
  ScheduleRequest:
    Type: String
    Default: '
        { "DataSource": { "GlueTable": { "DatabaseName": "ExampleDatabaseName",
         "TableName": "ExampleTableName" } },
         "Role": "role/AWSGlueServiceRoleDefault",
          "RulesetNames": [ "CFNRulesetName" ] }
        '

# Resources section defines metadata for the Data Catalog
Resources:
  # Creates a Data Quality ruleset under specified rules
  DQRuleset:
    Type: AWS::Glue::DataQualityRuleset
    Properties:
      Name: !Ref RulesetName
      Description: "CFN DataQualityRuleset"
      # The string within Ruleset must be formatted in DQDL, a language
      # used specifically to define rules
      Ruleset: !Ref Rules
      # The targeted table must exist within the Data Catalog alongside
      # the correct database
      TargetTable:
        DatabaseName: "ExampleDatabaseName"
        TableName: "ExampleTableName"
  # Create a schedule that runs evaluations on the above ruleset
  ScheduleDQEval:
    Type: AWS::Scheduler::Schedule
    Properties:
      Name: !Ref ScheduleName
      Description: "Schedule DataQualityRuleset Evaluations"
      FlexibleTimeWindow:
        Mode: "OFF"
      ScheduleExpression: !Ref ScheduleRate
      ScheduleExpressionTimezone: "America/New_York"
      State: "ENABLED"
      Target:
        # The ARN identifies the API operation to run; to evaluate the ruleset,
        # use this specific ARN
        Arn: "arn:aws:scheduler:::aws-sdk:glue:startDataQualityRulesetEvaluationRun"
        # The role in RoleArn must have permission to run the scheduled operation
        RoleArn: "arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault"
        # This is the request that is sent to the Arn; it references the
        # ScheduleRequest parameter so that it matches the ruleset created above
        Input: !Ref ScheduleRequest
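
The schedule above needs a role that EventBridge Scheduler can assume and that is allowed to start the evaluation run. The following is only a sketch of what such a role might look like. The logical ID `SchedulerDQRole`, the policy name, and the policy scope are assumptions, and the `iam:PassRole` statement is included on the assumption that it is needed because the request hands a role name to AWS Glue.

```yaml
# Hypothetical execution role for the schedule; names and policy scope are assumptions.
Resources:
  SchedulerDQRole:
    Type: AWS::IAM::Role
    Properties:
      # Let EventBridge Scheduler assume this role
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: scheduler.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: StartDQEvaluation
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              # Allow the API operation named in the schedule target ARN
              - Effect: Allow
                Action: glue:StartDataQualityRulesetEvaluationRun
                Resource: '*'
              # Assumed to be needed because the request passes a role to AWS Glue
              - Effect: Allow
                Action: iam:PassRole
                Resource: !Sub 'arn:aws:iam::${AWS::AccountId}:role/AWSGlueServiceRoleDefault'
```

If a role like this were added to the same template, the schedule's `RoleArn` could be set to `!GetAtt SchedulerDQRole.Arn` instead of a hardcoded ARN.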

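Sample AWS CloudFormation template for an AWS Glue development endpoint

An AWS Glue development endpoint is an environment that you can use to develop and test your AWS Glue scripts.

This sample creates a development endpoint with the minimal network parameter values required to create it successfully. For more information about the parameters that you need to set up a development endpoint, see "Setting up networking for development for AWS Glue".

To create a development endpoint, you provide an existing IAM role ARN (Amazon Resource Name). If you plan to create a notebook server on the development endpoint, you also supply a valid RSA public key and keep the corresponding private key available.

Note

You manage any notebook servers that you create and associate with a development endpoint. Therefore, if you delete the development endpoint, you must delete the AWS CloudFormation stack on the AWS CloudFormation console to delete the notebook server.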



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a development endpoint
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  # The name of the development endpoint to be created
  CFNEndpointName:  
    Type: String
    Default: cfn-devendpoint-1
  CFNIAMRoleArn:
    Type: String
    Default: arn:aws:iam::123456789012:role/AWSGlueServiceRoleGA
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  CFNDevEndpoint:
    Type: AWS::Glue::DevEndpoint
    Properties:
      EndpointName: !Ref CFNEndpointName
      #ExtraJarsS3Path: String
      #ExtraPythonLibsS3Path: String
      NumberOfNodes: 5
      PublicKey: ssh-rsa public.....key myuserid-key
      RoleArn: !Ref CFNIAMRoleArn
      SecurityGroupIds: 
        - sg-64986c0b
      SubnetId: subnet-c67cccac
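
The sample above hardcodes the public key, security group, and subnet. A possible variation, sketched here, takes them as parameters and uses AWS-specific parameter types so that CloudFormation checks the IDs against the account when the stack is created. The parameter names `CFNSubnetId`, `CFNSecurityGroupIds`, and `CFNPublicKey` and the logical ID are hypothetical.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
# Hypothetical variation of the development endpoint sample; parameter names are assumptions.
Parameters:
  CFNEndpointName:
    Type: String
    Default: cfn-devendpoint-1
  CFNIAMRoleArn:
    Type: String
    Description: ARN of an existing IAM role for the development endpoint
  # AWS-specific parameter types validate that the IDs exist in the account
  CFNSubnetId:
    Type: AWS::EC2::Subnet::Id
  CFNSecurityGroupIds:
    Type: List<AWS::EC2::SecurityGroup::Id>
  CFNPublicKey:
    Type: String
    Description: RSA public key used to connect to the development endpoint
Resources:
  CFNDevEndpointParameterized:
    Type: AWS::Glue::DevEndpoint
    Properties:
      EndpointName: !Ref CFNEndpointName
      NumberOfNodes: 5
      PublicKey: !Ref CFNPublicKey
      RoleArn: !Ref CFNIAMRoleArn
      SecurityGroupIds: !Ref CFNSecurityGroupIds
      SubnetId: !Ref CFNSubnetId
```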
