AWS CloudFormation for AWS Glue

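AWS CloudFormation is a service that can create many AWS resources. AWS Glue provides API operations to create objects in the AWS Glue Data Catalog. However, it might be more convenient to define and create AWS Glue objects and other related AWS resource objects in an AWS CloudFormation template file. Then you can automate the process of creating the objects.

AWS CloudFormation provides a simplified syntax, either JSON (JavaScript Object Notation) or YAML (YAML Ain't Markup Language), to express the creation of AWS resources. You can use AWS CloudFormation templates to define Data Catalog objects such as databases, tables, partitions, crawlers, classifiers, and connections. You can also define ETL objects such as jobs, triggers, and development endpoints. You create a template that describes all the AWS resources you want, and AWS CloudFormation takes care of provisioning and configuring those resources for you.

For more information, see What Is AWS CloudFormation? and Working with AWS CloudFormation Templates in the AWS CloudFormation User Guide.

If you plan to use AWS CloudFormation templates that are compatible with AWS Glue, as an administrator, you must grant access to AWS CloudFormation and to the AWS services and actions on which it depends. To grant permissions to create AWS CloudFormation resources, attach the following policy to the users that work with AWS CloudFormation: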


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [                
        "cloudformation:*"        
      ],
      "Resource": "*"
    }
  ]
}

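The following table contains the actions that an AWS CloudFormation template can perform on your behalf. It includes links to information about the AWS resource types and their property types that you can add to an AWS CloudFormation template.

AWS Glue resource	AWS CloudFormation template	AWS Glue samples
Classifier	AWS::Glue::Classifier	Grok classifier, JSON classifier, XML classifier
Connection	AWS::Glue::Connection	MySQL connection
Crawler	AWS::Glue::Crawler	Amazon S3 crawler, MySQL crawler
Database	AWS::Glue::Database	Empty database, Database with tables
Development endpoint	AWS::Glue::DevEndpoint	Development endpoint
Job	AWS::Glue::Job	Amazon S3 job, JDBC job
Machine learning transform	AWS::Glue::MLTransform	Machine learning transform
Data Quality ruleset	AWS::Glue::DataQualityRuleset	Data quality ruleset, Data quality ruleset with EventBridge scheduler
Partition	AWS::Glue::Partition	Partitions of a table
Table	AWS::Glue::Table	Table in a database
Trigger	AWS::Glue::Trigger	On-demand trigger, Scheduled trigger, Conditional trigger

To get started, use the following sample templates and customize them with your own metadata. Then use the AWS CloudFormation console to create an AWS CloudFormation stack that adds the objects to AWS Glue and any associated services. Many fields in an AWS Glue object are optional. These templates illustrate the fields that are required or that are necessary for a working and functional AWS Glue object.

An AWS CloudFormation template can be in either JSON or YAML format. These examples use YAML for easier readability. The examples contain comments (#) that describe the values defined in the templates.

AWS CloudFormation templates can include a Parameters section. You can change this section in the sample text or when you submit the YAML file to the AWS CloudFormation console to create a stack. The Resources section of the template contains the definitions of AWS Glue and related objects. AWS CloudFormation template syntax definitions might contain properties that include more detailed property syntax. Not all properties are required to create an AWS Glue object. These samples show example values for common properties used to create an AWS Glue object.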

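Sample AWS CloudFormation template for an AWS Glue database

An AWS Glue database in the Data Catalog contains metadata tables. The database consists of very few properties and can be created in the Data Catalog with an AWS CloudFormation template. The following sample template is provided to get you started and to illustrate the use of AWS CloudFormation stacks with AWS Glue. The only resource created by the sample template is a database named cfn-mysampledatabase. You can change it by editing the text of the sample or by changing the value on the AWS CloudFormation console when you submit the YAML.

The following shows example values for common properties used to create an AWS Glue database. For more information about the AWS CloudFormation database template for AWS Glue, see AWS::Glue::Database.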



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CloudFormation template in YAML to demonstrate creating a database named cfn-mysampledatabase
# The metadata created in the Data Catalog points to the flights public S3 bucket
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  CFNDatabaseName:
    Type: String
    Default: cfn-mysampledatabase

# Resources section defines metadata for the Data Catalog
Resources:
# Create an AWS Glue database
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      # The database is created in the Data Catalog for your account
      CatalogId: !Ref AWS::AccountId   
      DatabaseInput:
        # The name of the database is defined in the Parameters section above
        Name: !Ref CFNDatabaseName	
        Description: Database to hold tables for flights data
        LocationUri: s3://crawler-public-us-east-1/flight/2016/csv/
        #Parameters: Leave AWS database parameters blank

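Sample AWS CloudFormation template for an AWS Glue database, table, and partition

An AWS Glue table contains the metadata that defines the structure and location of the data that you want to process with your ETL scripts. Within a table, you can define partitions to parallelize the processing of your data. A partition is a chunk of data that you define with a key. For example, if you use month as the key, all the data for January is contained in the same partition. In AWS Glue, databases can contain tables, and tables can contain partitions.

The following sample shows how to populate a database, a table, and partitions using an AWS CloudFormation template. The base data format is csv, delimited by a comma (,). Because a database must exist before it can contain a table, and a table must exist before partitions can be created, the template uses the DependsOn statement to define the dependency of these objects when they are created.

The values in this sample define a table that contains flight data from a publicly available Amazon S3 bucket. For illustration, only a few columns of the data and one partitioning key are defined. Four partitions are also defined in the Data Catalog. Some fields that describe the storage of the base data are also shown in the StorageDescriptor fields.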



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CloudFormation template in YAML to demonstrate creating a database, a table, and partitions
# The metadata created in the Data Catalog points to the flights public S3 bucket
#
# Parameters substituted in the Resources section
# These parameters are names of the resources created in the Data Catalog
Parameters:
  CFNDatabaseName:
    Type: String
    Default: cfn-database-flights-1
  CFNTableName1:
    Type: String
    Default: cfn-manual-table-flights-1
# Resources to create metadata in the Data Catalog
Resources:
###
# Create an AWS Glue database
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName	
        Description: Database to hold tables for flights data
###
# Create an AWS Glue table
  CFNTableFlights:
    # Creating the table waits for the database to be created
    DependsOn: CFNDatabaseFlights
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableInput:
        Name: !Ref CFNTableName1
        Description: Define the first few columns of the flights table
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: csv
#       ViewExpandedText: String
        PartitionKeys:
        # Data is partitioned by month
        - Name: mon
          Type: bigint
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: year
            Type: bigint
          - Name: quarter
            Type: bigint
          - Name: month
            Type: bigint
          - Name: day_of_month
            Type: bigint			
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
# Partition 1
# Create an AWS Glue partition  
  CFNPartitionMon1:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
        - 1
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: mon
            Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=1/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
# Partition 2
# Create an AWS Glue partition 
  CFNPartitionMon2:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
        - 2
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: mon
            Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=2/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
# Partition 3
# Create an AWS Glue partition 
  CFNPartitionMon3:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
        - 3
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: mon
            Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=3/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
# Partition 4
# Create an AWS Glue partition 
  CFNPartitionMon4:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        Values:
        - 4
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: mon
            Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=4/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
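
As a possible alternative to the explicit DependsOn statements, the table can take its database name from the database resource itself, so that AWS CloudFormation infers the creation order. The following is a minimal sketch and not part of the sample above. It assumes that Ref on an AWS::Glue::Database resource returns the database name, and it reuses the sample's default names as literal values.

```yaml
---
AWSTemplateFormatVersion: '2010-09-09'
# Sketch: implicit dependency between a table and its database
Resources:
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: cfn-database-flights-1
  CFNTableFlights:
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      # Assumption: Ref on the database resource returns the database name.
      # Referencing the resource makes the table wait for the database.
      DatabaseName: !Ref CFNDatabaseFlights
      TableInput:
        Name: cfn-manual-table-flights-1
        TableType: EXTERNAL_TABLE
```

With this form, renaming the database changes only one place in the template. DependsOn remains necessary when a resource refers to another one only through a parameter value, as the sample above does.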

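Sample AWS CloudFormation template for an AWS Glue grok classifier

An AWS Glue classifier determines the schema of your data. One type of custom classifier uses a grok pattern to match your data. If the pattern matches, the custom classifier is used to create your table's schema and to set the classification to the value set in the classifier definition.

This sample creates a classifier that creates a schema with one column named message and sets the classification to greedy.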



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a classifier
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the classifier to be created
  CFNClassifierName:  
    Type: String
    Default: cfn-classifier-grok-one-column-1                                                               	
#
#
# Resources section defines metadata for the Data Catalog
Resources:
# Create classifier that uses grok pattern to put all data in one column and classifies it as "greedy".	
  CFNClassifierFlights:
    Type: AWS::Glue::Classifier   
    Properties:
      GrokClassifier:
        #Grok classifier that puts all data in one column		
        Name: !Ref CFNClassifierName
        Classification: greedy                                                        	   
        GrokPattern: "%{GREEDYDATA:message}"
        #CustomPatterns: none
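
The sample above leaves CustomPatterns unset. The following sketch shows how a grok classifier might combine a built-in pattern with a custom one. The classifier name, the classification value applog, and the pattern name LOGLEVEL_CUSTOM are hypothetical and chosen only for illustration.

```yaml
---
AWSTemplateFormatVersion: '2010-09-09'
# Sketch: grok classifier that uses one custom pattern
Resources:
  CFNClassifierLogs:
    Type: AWS::Glue::Classifier
    Properties:
      GrokClassifier:
        Name: cfn-classifier-grok-loglevel-1   # hypothetical name
        Classification: applog                 # hypothetical classification
        # LOGLEVEL_CUSTOM is not built in; it is defined in CustomPatterns below
        GrokPattern: "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL_CUSTOM:loglevel} %{GREEDYDATA:message}"
        # Each custom pattern is a name followed by a regular expression
        CustomPatterns: "LOGLEVEL_CUSTOM (ERROR|WARN|INFO|DEBUG)"
```

A table created through this classifier would be expected to have three columns (timestamp, loglevel, and message) instead of the single message column.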

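Sample AWS CloudFormation template for an AWS Glue JSON classifier

An AWS Glue classifier determines the schema of your data. One type of custom classifier uses a JsonPath string that defines the JSON data for the classifier to classify. AWS Glue supports a subset of the operators for JsonPath, as described in Writing JsonPath Custom Classifiers.

If the pattern matches, the custom classifier is used to create your table's schema.

This sample creates a classifier that creates a schema with each record in the Records3 array in an object.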



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a JSON classifier
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the classifier to be created
  CFNClassifierName:  
    Type: String
    Default: cfn-classifier-json-one-column-1                                                               	
#
#
# Resources section defines metadata for the Data Catalog
Resources:
# Create classifier that uses a JSON pattern.	
  CFNClassifierFlights:
    Type: AWS::Glue::Classifier   
    Properties:
      JsonClassifier:
        #JSON classifier		
        Name: !Ref CFNClassifierName
        JsonPath: $.Records3[*]

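Sample AWS CloudFormation template for an AWS Glue XML classifier

An AWS Glue classifier determines the schema of your data. One type of custom classifier specifies an XML tag to designate the element that contains each record in an XML document that is being parsed. If the pattern matches, the custom classifier is used to create your table's schema and to set the classification to the value set in the classifier definition.

This sample creates a classifier that creates a schema with each record in the Record tag and sets the classification to XML.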



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating an XML classifier
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the classifier to be created
  CFNClassifierName:  
    Type: String
    Default: cfn-classifier-xml-one-column-1                                                               	
#
#
# Resources section defines metadata for the Data Catalog
Resources:
# Create classifier that uses the XML pattern and classifies it as "XML".	
  CFNClassifierFlights:
    Type: AWS::Glue::Classifier   
    Properties:
      XMLClassifier:
        #XML classifier		
        Name: !Ref CFNClassifierName
        Classification: XML   
        RowTag: <Records>

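Sample AWS CloudFormation template for an AWS Glue crawler for Amazon S3

An AWS Glue crawler creates metadata tables in your Data Catalog that correspond to your data. You can then use these table definitions as sources and targets in your ETL jobs.

This sample creates a crawler, the required IAM role, and an AWS Glue database in the Data Catalog. When this crawler is run, it assumes the IAM role and creates a table in the database for the public flights data. The table is created with the prefix "cfn_sample_1_". The IAM role created by this template allows global permissions; you might want to create a custom role. This crawler defines no custom classifiers. AWS Glue built-in classifiers are used by default.

When you submit this sample to the AWS CloudFormation console, you must confirm that you want to create the IAM role.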



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a crawler
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the crawler to be created
  CFNCrawlerName:  
    Type: String
    Default: cfn-crawler-flights-1
  CFNDatabaseName:
    Type: String
    Default: cfn-database-flights-1
  CFNTablePrefixName:
    Type: String
    Default: cfn_sample_1_	
#
#
# Resources section defines metadata for the Data Catalog
Resources:
#Create IAM Role assumed by the crawler. For demonstration, this role is given all permissions.
  CFNRoleFlights:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        -
          PolicyName: "root"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              -
                Effect: "Allow"
                Action: "*"
                Resource: "*"
 # Create a database to contain tables created by the crawler
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
        Description: "AWS Glue container to hold metadata tables for the flights crawler"
 #Create a crawler to crawl the flights data on a public S3 bucket
  CFNCrawlerFlights:
    Type: AWS::Glue::Crawler
    Properties:
      Name: !Ref CFNCrawlerName
      Role: !GetAtt CFNRoleFlights.Arn
      #Classifiers: none, use the default classifier
      Description: AWS Glue crawler to crawl flights data
      #Schedule: none, use default run-on-demand
      DatabaseName: !Ref CFNDatabaseName
      Targets:
        S3Targets:
          # Public S3 bucket with the flights data
          - Path: "s3://crawler-public-us-east-1/flight/2016/csv"
      TablePrefix: !Ref CFNTablePrefixName
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
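
Because the role in the sample above allows every action on every resource, a narrower role is usually preferable outside of a demonstration. The following is a minimal sketch, not a verified least-privilege policy. It assumes that the AWS managed policy AWSGlueServiceRole plus read access to the sample's public bucket is sufficient for this crawler. The logical ID CFNRoleFlightsScoped and the policy name ReadFlightsData are hypothetical.

```yaml
---
AWSTemplateFormatVersion: '2010-09-09'
# Sketch: a crawler role with narrower permissions than Action "*"
Resources:
  CFNRoleFlightsScoped:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: "/"
      # AWS managed policy intended for AWS Glue service roles
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
      Policies:
        - PolicyName: ReadFlightsData
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              # Assumption: the crawler only needs to list and read the sample data
              - Effect: Allow
                Action:
                  - s3:ListBucket
                Resource: arn:aws:s3:::crawler-public-us-east-1
              - Effect: Allow
                Action:
                  - s3:GetObject
                Resource: arn:aws:s3:::crawler-public-us-east-1/flight/2016/csv/*
```

To try it, the crawler's Role property would point at this resource with !GetAtt CFNRoleFlightsScoped.Arn in place of the demonstration role.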

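Sample AWS CloudFormation template for an AWS Glue connection

An AWS Glue connection in the Data Catalog contains the JDBC and network information that is required to connect to a JDBC database. This information is used when you connect to a JDBC database to crawl it or to run ETL jobs.

This sample creates a connection to an Amazon RDS MySQL database named devdb. When this connection is used, an IAM role, database credentials, and network connection values must also be supplied. See the details of the necessary fields in the template.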



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a connection
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the connection to be created
  CFNConnectionName:  
    Type: String
    Default: cfn-connection-mysql-flights-1
  CFNJDBCString:  
    Type: String
    Default: "jdbc:mysql://xxx-mysql.yyyyyyyyyyyyyy.us-east-1.rds.amazonaws.com:3306/devdb"
  CFNJDBCUser:  
    Type: String
    Default: "master"
  CFNJDBCPassword:  
    Type: String
    Default: "12345678"
    NoEcho: true
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  CFNConnectionMySQL:
    Type: AWS::Glue::Connection
    Properties:
      CatalogId: !Ref AWS::AccountId
      ConnectionInput: 
        Description: "Connect to MySQL database."
        ConnectionType: "JDBC"
        #MatchCriteria: none		
        PhysicalConnectionRequirements:
          AvailabilityZone: "us-east-1d"
          SecurityGroupIdList: 
           - "sg-7d52b812"
          SubnetId: "subnet-84f326ee" 
        ConnectionProperties: {
          "JDBC_CONNECTION_URL": !Ref CFNJDBCString,
          "USERNAME": !Ref CFNJDBCUser,
          "PASSWORD": !Ref CFNJDBCPassword
        }
        Name: !Ref CFNConnectionName

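Sample AWS CloudFormation template for an AWS Glue crawler for JDBC

An AWS Glue crawler creates metadata tables in your Data Catalog that correspond to your data. You can then use these table definitions as sources and targets in your ETL jobs.

This sample creates a crawler, the required IAM role, and an AWS Glue database in the Data Catalog. When this crawler is run, it assumes the IAM role and creates a table in the database for the public flights data that has been stored in a MySQL database. The table is created with the prefix "cfn_jdbc_1_". The IAM role created by this template allows global permissions; you might want to create a custom role. No custom classifiers can be defined for JDBC data. AWS Glue built-in classifiers are used by default.

When you submit this sample to the AWS CloudFormation console, you must confirm that you want to create the IAM role.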



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a crawler
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the crawler to be created
  CFNCrawlerName:  
    Type: String
    Default: cfn-crawler-jdbc-flights-1
# The name of the database to be created to contain tables	
  CFNDatabaseName:
    Type: String
    Default: cfn-database-jdbc-flights-1
# The prefix for all tables crawled and created	
  CFNTablePrefixName:
    Type: String
    Default: cfn_jdbc_1_
# The name of the existing connection to the MySQL database
  CFNConnectionName:  
    Type: String
    Default: cfn-connection-mysql-flights-1
# The name of the JDBC path (database/schema/table) with wildcard (%) to crawl	
  CFNJDBCPath:  
    Type: String
    Default: saldev/%		
#
#
# Resources section defines metadata for the Data Catalog
Resources:
#Create IAM Role assumed by the crawler. For demonstration, this role is given all permissions.
  CFNRoleFlights:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        -
          PolicyName: "root"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              -
                Effect: "Allow"
                Action: "*"
                Resource: "*"
 # Create a database to contain tables created by the crawler
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
        Description: "AWS Glue container to hold metadata tables for the flights crawler"
 #Create a crawler to crawl the flights data in MySQL database
  CFNCrawlerFlights:
    Type: AWS::Glue::Crawler
    Properties:
      Name: !Ref CFNCrawlerName
      Role: !GetAtt CFNRoleFlights.Arn
      #Classifiers: none, use the default classifier
      Description: AWS Glue crawler to crawl flights data
      #Schedule: none, use default run-on-demand
      DatabaseName: !Ref CFNDatabaseName
      Targets:
        JdbcTargets:
          # JDBC MySQL database with the flights data
          - ConnectionName: !Ref CFNConnectionName
            Path: !Ref CFNJDBCPath
          #Exclusions: none
      TablePrefix: !Ref CFNTablePrefixName
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"

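Sample AWS CloudFormation template for an AWS Glue job for Amazon S3 to Amazon S3

An AWS Glue job in the Data Catalog contains the parameter values that are required to run a script in AWS Glue.

This sample creates a job that reads flight data from an Amazon S3 bucket in csv format and writes it to an Amazon S3 Parquet file. The script that is run by this job must already exist. You can generate an ETL script for your environment with the AWS Glue console. When this job is run, an IAM role with the correct permissions must also be supplied.

Common parameter values are shown in the template. For example, AllocatedCapacity (DPUs) is set to 5.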



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a job using the public flights S3 table in a public bucket
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the job to be created
  CFNJobName:  
    Type: String
    Default: cfn-job-S3-to-S3-2
# The name of the IAM role that the job assumes. It must have access to data, script, temporary directory
  CFNIAMRoleName:  
    Type: String
    Default: AWSGlueServiceRoleGA
# The S3 path where the script for this job is located
  CFNScriptLocation:  
    Type: String
    Default: s3://aws-glue-scripts-123456789012-us-east-1/myid/sal-job-test2	
#
#
# Resources section defines metadata for the Data Catalog
Resources:                                      
# Create job to run script which accesses flightscsv table and write to S3 file as parquet.
# The script already exists and is called by this job	
  CFNJobFlights:
    Type: AWS::Glue::Job   
    Properties:
      Role: !Ref CFNIAMRoleName  
      #DefaultArguments: JSON object 
      # If script written in Scala, then set DefaultArguments={'--job-language': 'scala', '--class': 'your scala class'}
      #Connections:  No connection needed for S3 to S3 job 
      #  ConnectionsList  
      #MaxRetries: Double  
      Description: Job created with CloudFormation  
      #LogUri: String  
      Command:   
        Name: glueetl  
        ScriptLocation: !Ref CFNScriptLocation
             # for access to directories use proper IAM role with permission to buckets and folders that begin with "aws-glue-"					 
             # script uses temp directory from job definition if required (temp directory not used S3 to S3)
             # script defines target for output as s3://aws-glue-target/sal    			 
      AllocatedCapacity: 5  
      ExecutionProperty:   
        MaxConcurrentRuns: 1  
      Name: !Ref CFNJobName

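Sample AWS CloudFormation template for an AWS Glue job for JDBC to Amazon S3

An AWS Glue job in the Data Catalog contains the parameter values that are required to run a script in AWS Glue.

This sample creates a job that reads flight data from a MySQL JDBC database, as defined by the connection named cfn-connection-mysql-flights-1, and writes it to an Amazon S3 Parquet file. The script that is run by this job must already exist. You can generate an ETL script for your environment with the AWS Glue console. When this job is run, an IAM role with the correct permissions must also be supplied.

Common parameter values are shown in the template. For example, AllocatedCapacity (DPUs) is set to 5.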



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a job using a MySQL JDBC DB with the flights data to an S3 file
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:                                                                                                       
# The name of the job to be created
  CFNJobName:  
    Type: String
    Default: cfn-job-JDBC-to-S3-1
# The name of the IAM role that the job assumes. It must have access to data, script, temporary directory
  CFNIAMRoleName:  
    Type: String
    Default: AWSGlueServiceRoleGA
# The S3 path where the script for this job is located
  CFNScriptLocation:  
    Type: String
    Default: s3://aws-glue-scripts-123456789012-us-east-1/myid/sal-job-dec4a	
# The name of the connection used for JDBC data source
  CFNConnectionName:  
    Type: String
    Default: cfn-connection-mysql-flights-1
#
#
# Resources section defines metadata for the Data Catalog
Resources:                                      
# Create job to run script which accesses JDBC flights table via a connection and write to S3 file as parquet.
# The script already exists and is called by this job	
  CFNJobFlights:
    Type: AWS::Glue::Job   
    Properties:
      Role: !Ref CFNIAMRoleName  
      #DefaultArguments: JSON object  
      # For example, if required by script, set temporary directory as DefaultArguments={'--TempDir': 's3://aws-glue-temporary-xyc/sal'}
      Connections:
        Connections:
        - !Ref CFNConnectionName 
      #MaxRetries: Double  
      Description: Job created with CloudFormation using existing script
      #LogUri: String  
      Command:   
        Name: glueetl  
        ScriptLocation: !Ref CFNScriptLocation
             # for access to directories use proper IAM role with permission to buckets and folders that begin with "aws-glue-"					 
             # if required, script defines temp directory as argument TempDir and used in script like redshift_tmp_dir = args["TempDir"] 
             # script defines target for output as s3://aws-glue-target/sal    			 
      AllocatedCapacity: 5  
      ExecutionProperty:   
        MaxConcurrentRuns: 1  
      Name: !Ref CFNJobName

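Sample AWS CloudFormation template for an AWS Glue on-demand trigger

An AWS Glue trigger in the Data Catalog contains the parameter values that are required to start a job run when the trigger fires. An on-demand trigger fires when you enable it.

This sample creates an on-demand trigger that starts one job named cfn-job-S3-to-S3-1.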



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating an on-demand trigger
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  # The existing job to be started by this trigger 
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-1
  # The name of the trigger to be created
  CFNTriggerName:
    Type: String
    Default: cfn-trigger-ondemand-flights-1	
#
# Resources section defines metadata for the Data Catalog
# Sample CFN YAML to demonstrate creating an on-demand trigger for a job	
Resources:                                      
# Create trigger to run an existing job (CFNJobName) on an on-demand schedule.	
  CFNTriggerSample:
    Type: AWS::Glue::Trigger   
    Properties:
      Name:
        Ref: CFNTriggerName		
      Description: Trigger created with CloudFormation
      Type: ON_DEMAND                                                        	   
      Actions:
        - JobName: !Ref CFNJobName                	  
        # Arguments: JSON object
      #Schedule: 
      #Predicate:

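Sample AWS CloudFormation template for an AWS Glue scheduled trigger

An AWS Glue trigger in the Data Catalog contains the parameter values that are required to start a job run when the trigger fires. A scheduled trigger fires when it is enabled and the cron timer pops.

This sample creates a scheduled trigger that starts one job named cfn-job-S3-to-S3-1. The timer is a cron expression that runs the job every 10 minutes on weekdays.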



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a scheduled trigger
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  # The existing job to be started by this trigger 
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-1
  # The name of the trigger to be created
  CFNTriggerName:
    Type: String
    Default: cfn-trigger-scheduled-flights-1	
#
# Resources section defines metadata for the Data Catalog
# Sample CFN YAML to demonstrate creating a scheduled trigger for a job
#	
Resources:                                      
# Create trigger to run an existing job (CFNJobName) on a cron schedule.	
  TriggerSample1CFN:
    Type: AWS::Glue::Trigger   
    Properties:
      Name:
        Ref: CFNTriggerName		
      Description: Trigger created with CloudFormation
      Type: SCHEDULED                                                        	   
      Actions:
        - JobName: !Ref CFNJobName                	  
        # Arguments: JSON object
      # Run the trigger every 10 minutes on Monday to Friday
      Schedule: cron(0/10 * ? * MON-FRI *) 
      #Predicate:
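
The cron expression in the sample above has six fields: minutes, hours, day of month, month, day of week, and year. As a sketch of a different schedule, the following trigger would run the same job once a day at 12:15 UTC. The trigger name is hypothetical, and StartOnCreation is included on the assumption that the trigger should be activated as soon as the stack creates it.

```yaml
---
AWSTemplateFormatVersion: '2010-09-09'
# Sketch: scheduled trigger that fires once a day
Resources:
  CFNTriggerDaily:
    Type: AWS::Glue::Trigger
    Properties:
      Name: cfn-trigger-scheduled-daily-1   # hypothetical name
      Description: Trigger created with CloudFormation
      Type: SCHEDULED
      # Assumption: activate the trigger when it is created
      StartOnCreation: true
      Actions:
        - JobName: cfn-job-S3-to-S3-1
      # Fields: minutes hours day-of-month month day-of-week year
      Schedule: cron(15 12 * * ? *)
```

The question mark in the day-of-week field means that no specific weekday is required, because the day-of-month field already matches every day.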

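Sample AWS CloudFormation template for an AWS Glue conditional trigger

An AWS Glue trigger in the Data Catalog contains the parameter values that are required to start a job run when the trigger fires. A conditional trigger fires when it is enabled and its conditions are met, such as a job completing successfully.

This sample creates a conditional trigger that starts one job named cfn-job-S3-to-S3-1. This job starts when the job named cfn-job-S3-to-S3-2 completes successfully.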



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a conditional trigger for a job, which starts when another job completes
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names the resources created in the Data Catalog
Parameters:
  # The existing job to be started by this trigger 
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-1
  # The existing job that when it finishes causes trigger to fire
  CFNJobName2:
    Type: String
    Default: cfn-job-S3-to-S3-2	
  # The name of the trigger to be created
  CFNTriggerName:
    Type: String
    Default: cfn-trigger-conditional-1	
#	
Resources:                                      
# Create trigger to run an existing job (CFNJobName) when another job completes (CFNJobName2).	
  CFNTriggerSample:
    Type: AWS::Glue::Trigger   
    Properties:
      Name:
        Ref: CFNTriggerName		
      Description: Trigger created with CloudFormation
      Type: CONDITIONAL                                                        	   
      Actions:
        - JobName: !Ref CFNJobName                	  
        # Arguments: JSON object
      #Schedule: none 
      Predicate:
        #Value for Logical is required if more than 1 job listed in Conditions	  
        Logical: AND
        Conditions:
          - LogicalOperator: EQUALS	
            JobName: !Ref CFNJobName2
            State: SUCCEEDED
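
The comment in the sample above notes that Logical is required when more than one job is listed in Conditions. The following sketch shows that case: the trigger would fire only after both watched jobs succeed. The trigger name is hypothetical, and the job names are reused from other samples in this topic and assumed to exist already.

```yaml
---
AWSTemplateFormatVersion: '2010-09-09'
# Sketch: conditional trigger that waits for two jobs
Resources:
  CFNTriggerTwoJobs:
    Type: AWS::Glue::Trigger
    Properties:
      Name: cfn-trigger-conditional-2       # hypothetical name
      Description: Trigger created with CloudFormation
      Type: CONDITIONAL
      Actions:
        - JobName: cfn-job-S3-to-S3-1
      Predicate:
        # AND: every condition must hold before the trigger fires
        Logical: AND
        Conditions:
          - LogicalOperator: EQUALS
            JobName: cfn-job-S3-to-S3-2
            State: SUCCEEDED
          - LogicalOperator: EQUALS
            JobName: cfn-job-JDBC-to-S3-1
            State: SUCCEEDED
```

Changing Logical to ANY would be expected to fire the trigger as soon as either job succeeds.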

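Sample AWS CloudFormation template for an AWS Glue machine learning transform

An AWS Glue machine learning transform is a custom transform to cleanse your data. There is currently one available transform, named FindMatches. The FindMatches transform enables you to identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.

This sample creates a machine learning transform. For more information about the parameters that you need to create a machine learning transform, see Matching records with AWS Lake Formation FindMatches.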



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a machine learning transform
#
# Resources section defines metadata for the machine learning transform
Resources:
  MyMLTransform:
    Type: "AWS::Glue::MLTransform"
    # isGlueMLGARegion must be defined in a Conditions section of the template
    Condition: "isGlueMLGARegion"
    Properties:
      Name: !Sub "MyTransform"
      Description: "The bestest transform ever"
      Role: !ImportValue MyMLTransformUserRole
      GlueVersion: "1.0"
      WorkerType: "Standard"
      NumberOfWorkers: 5
      Timeout: 120
      MaxRetries: 1
      InputRecordTables:
        GlueTables:
          - DatabaseName: !ImportValue MyMLTransformDatabase
            TableName: !ImportValue MyMLTransformTable
      TransformParameters:
        TransformType: "FIND_MATCHES"
        FindMatchesParameters:
          PrimaryKeyColumnName: "testcolumn"
          PrecisionRecallTradeoff: 0.5
          AccuracyCostTradeoff: 0.5
          EnforceProvidedLabels: True
      Tags:
        key1: "value1"
        key2: "value2"
      TransformEncryption:
        TaskRunSecurityConfigurationName: !ImportValue MyMLTransformSecurityConfiguration
        MLUserDataEncryption:
          MLUserDataEncryptionMode: "SSE-KMS"
          KmsKeyId: !ImportValue MyMLTransformEncryptionKey

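Sample AWS CloudFormation template for an AWS Glue Data Quality ruleset

An AWS Glue Data Quality ruleset contains rules that can be evaluated on a table in the Data Catalog. Once the ruleset is placed on your target table, you can go into the Data Catalog and run an evaluation that checks the data against the rules in the ruleset. These rules can range from evaluating row counts to evaluating the referential integrity of your data.

The following sample is a CloudFormation template that creates a ruleset with a variety of rules on the specified target table.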


AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a DataQualityRuleset
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  # The name of the ruleset to be created
  RulesetName:  
    Type: String
    Default: "CFNRulesetName"
  RulesetDescription:  
    Type: String
    Default: "CFN DataQualityRuleset"
  # Rules that will be associated with this ruleset
  Rules:  
    Type: String
    Default: 'Rules = [
        RowCount > 100,
        IsUnique "id",
        IsComplete "nametype"
        ]'
  # Name of the database and table within the Data Catalog to which the
  # ruleset will be applied
  DatabaseName:  
    Type: String
    Default: "ExampleDatabaseName"
  TableName:  
    Type: String
    Default: "ExampleTableName"

# Resources section defines metadata for the Data Catalog
Resources:
  # Creates a Data Quality ruleset under specified rules 
  DQRuleset:
    Type: AWS::Glue::DataQualityRuleset
    Properties:
      Name: !Ref RulesetName
      Description: !Ref RulesetDescription
      # The String within rules must be formatted in DQDL, a language 
      # used specifically to make rules
      Ruleset: !Ref Rules
      # The targeted table must exist within Data Catalog alongside 
      # the correct database
      TargetTable:
        DatabaseName: !Ref DatabaseName
        TableName: !Ref TableName

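Sample AWS CloudFormation template for an AWS Glue Data Quality ruleset with an EventBridge scheduler

An AWS Glue Data Quality ruleset contains rules that can be evaluated on a table in the Data Catalog. After the ruleset is placed on the target table, you can go into the Data Catalog and run an evaluation, which checks your data against the rules in the ruleset. You can also add an EventBridge scheduler to the CloudFormation template to schedule these ruleset evaluations at a timed interval, instead of going into the Data Catalog to evaluate the ruleset manually.

The following sample is a CloudFormation template that creates a Data Quality ruleset and an EventBridge scheduler that evaluates that ruleset every five minutes.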


AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a DataQualityRuleset
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  # The name of the ruleset to be created
  RulesetName:  
    Type: String
    Default: "CFNRulesetName"
  # Rules that will be associated with this Ruleset
  Rules:  
    Type: String
    Default: 'Rules = [
        RowCount > 100,
        IsUnique "id",
        IsComplete "nametype"
        ]'
  # The name of the Schedule to be created  
  ScheduleName:  
    Type: String
    Default: "ScheduleDQRulsetEvaluation"
  # This expression determines the rate at which the Schedule will evaluate
  # your data using the above ruleset
  ScheduleRate:
    Type: String
    Default: "rate(5 minutes)"
  # The request that is sent must match the details of the Data Quality ruleset
  ScheduleRequest:
    Type: String
    Default: '
        { "DataSource": { "GlueTable": { "DatabaseName": "ExampleDatabaseName",
         "TableName": "ExampleTableName" } },
         "Role": "role/AWSGlueServiceRoleDefault",
          "RulesetNames": [ "CFNRulesetName" ] }
        '

# Resources section defines metadata for the Data Catalog
Resources:
  # Creates a Data Quality ruleset under specified rules 
  DQRuleset:
    Type: AWS::Glue::DataQualityRuleset
    Properties:
      Name: !Ref RulesetName
      Description: "CFN DataQualityRuleset"
      # The String within rules must be formatted in DQDL, a language 
      # used specifically to make rules
      Ruleset: !Ref Rules
      # The targeted table must exist within Data Catalog alongside 
      # the correct database
      TargetTable:
        DatabaseName: "ExampleDatabaseName"
        TableName: "ExampleTableName"
  # Create a Scheduler to schedule evaluation runs on the above ruleset
  ScheduleDQEval:
    Type: AWS::Scheduler::Schedule
    Properties: 
      Name: !Ref ScheduleName
      Description: "Schedule DataQualityRuleset Evaluations"
      FlexibleTimeWindow: 
        Mode: "OFF"
      ScheduleExpression: !Ref ScheduleRate
      ScheduleExpressionTimezone: "America/New_York"
      State: "ENABLED"
      Target: 
        # The Arn identifies the API operation that the schedule invokes. To evaluate
        # the ruleset, target the AWS Glue StartDataQualityRulesetEvaluationRun operation
        Arn: "arn:aws:scheduler:::aws-sdk:glue:startDataQualityRulesetEvaluationRun"
        # The role in RoleArn must allow EventBridge Scheduler to invoke the target
        RoleArn: "arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault"
        # This is the request that is sent to the Arn. It must name the table and
        # the ruleset that this template defines
        Input: '
          { "DataSource": { "GlueTable": { "DatabaseName": "ExampleDatabaseName", "TableName": "ExampleTableName" } },
            "Role": "role/AWSGlueServiceRoleDefault",
            "RulesetNames": [ "CFNRulesetName" ] }
          '
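
The request in Input has to name the same database, table, and ruleset that the template creates, and hand-written JSON can easily drift out of step with them. As a sketch of one alternative, the Fn::Sub function can assemble the request from parameters so that each name is defined only once. The DatabaseName, TableName, EvaluationRole, and SchedulerRoleArn parameters below are assumptions introduced for illustration; the template above does not declare them. The DQRuleset resource is omitted for brevity.

```yaml
# Sketch only: builds the scheduler request from parameters with !Sub.
# DatabaseName, TableName, EvaluationRole, and SchedulerRoleArn are
# hypothetical parameters; RulesetName matches the template above.
Parameters:
  RulesetName:
    Type: String
    Default: "CFNRulesetName"
  DatabaseName:
    Type: String
    Default: "ExampleDatabaseName"
  TableName:
    Type: String
    Default: "ExampleTableName"
  EvaluationRole:
    Type: String
    Default: "role/AWSGlueServiceRoleDefault"
  SchedulerRoleArn:
    Type: String
    Default: "arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault"

Resources:
  ScheduleDQEval:
    Type: AWS::Scheduler::Schedule
    Properties:
      FlexibleTimeWindow:
        Mode: "OFF"
      ScheduleExpression: "rate(5 minutes)"
      Target:
        Arn: "arn:aws:scheduler:::aws-sdk:glue:startDataQualityRulesetEvaluationRun"
        RoleArn: !Ref SchedulerRoleArn
        # !Sub replaces each ${Name} with the parameter value; the other
        # braces are left untouched as literal JSON
        Input: !Sub |
          {
            "DataSource": {
              "GlueTable": {
                "DatabaseName": "${DatabaseName}",
                "TableName": "${TableName}"
              }
            },
            "Role": "${EvaluationRole}",
            "RulesetNames": ["${RulesetName}"]
          }
```

Because the request is generated from the same RulesetName parameter that would name the ruleset, overriding the parameter when you create the stack keeps the schedule and the ruleset in agreement.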

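Sample AWS CloudFormation template for an AWS Glue development endpoint

An AWS Glue development endpoint is an environment that you can use to develop and test your AWS Glue scripts.

This sample creates a development endpoint with the minimal property values required to create it successfully. For more information about the parameters that you need to set up a development endpoint, see Setting up networking for development for AWS Glue.

You provide an existing IAM role ARN (Amazon Resource Name) to create the development endpoint. If you plan to create a notebook server on the development endpoint, supply a valid RSA public key and keep the corresponding private key available.

Note

You manage any notebook server that you create and associate with a development endpoint. Therefore, if you delete the development endpoint, you must also delete the AWS CloudFormation stack on the AWS CloudFormation console to delete the notebook server.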



---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a development endpoint
#
# Parameters section contains names that are substituted in the Resources section
# These parameters identify the endpoint to create and the IAM role it uses
Parameters:
  # The name of the development endpoint to be created
  CFNEndpointName:
    Type: String
    Default: cfn-devendpoint-1
  # The ARN of an existing IAM role for the development endpoint
  CFNIAMRoleArn:
    Type: String
    Default: arn:aws:iam::123456789012:role/AWSGlueServiceRoleGA
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  CFNDevEndpoint:
    Type: AWS::Glue::DevEndpoint
    Properties:
      EndpointName: !Ref CFNEndpointName
      #ExtraJarsS3Path: String
      #ExtraPythonLibsS3Path: String
      NumberOfNodes: 5
      PublicKey: ssh-rsa public.....key myuserid-key
      RoleArn: !Ref CFNIAMRoleArn
      SecurityGroupIds: 
        - sg-64986c0b
      SubnetId: subnet-c67cccac
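
The sample above hard-codes the public key, security group, and subnet. As a sketch of how these could be supplied when the stack is created instead, the fragment below declares them as parameters. The parameter names CFNPublicKey, CFNSubnetId, and CFNSecurityGroupIds are assumptions introduced here. AWS::EC2::Subnet::Id and List<AWS::EC2::SecurityGroup::Id> are AWS-specific parameter types that CloudFormation checks against the resources that exist in your account.

```yaml
# Sketch only: the development endpoint sample with its network settings
# and public key moved into parameters. CFNPublicKey, CFNSubnetId, and
# CFNSecurityGroupIds are hypothetical names.
Parameters:
  CFNEndpointName:
    Type: String
    Default: cfn-devendpoint-1
  CFNIAMRoleArn:
    Type: String
    Default: arn:aws:iam::123456789012:role/AWSGlueServiceRoleGA
  # RSA public key; keep the matching private key available
  CFNPublicKey:
    Type: String
  # An existing subnet ID in the account and Region
  CFNSubnetId:
    Type: AWS::EC2::Subnet::Id
  # One or more existing security group IDs
  CFNSecurityGroupIds:
    Type: List<AWS::EC2::SecurityGroup::Id>

Resources:
  CFNDevEndpoint:
    Type: AWS::Glue::DevEndpoint
    Properties:
      EndpointName: !Ref CFNEndpointName
      NumberOfNodes: 5
      PublicKey: !Ref CFNPublicKey
      RoleArn: !Ref CFNIAMRoleArn
      # !Ref on a List parameter returns the list of IDs
      SecurityGroupIds: !Ref CFNSecurityGroupIds
      SubnetId: !Ref CFNSubnetId
```

Parameters without a Default, such as CFNPublicKey here, must be given a value when the stack is created.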