Visual job API

The Visual Job API lets you create data integration jobs through the AWS Glue API from a JSON object that represents the visual configuration of an AWS Glue job.

A list of CodeGenConfigurationNodes is provided to a create or update job API to register a DAG in AWS Glue Studio for the created job and generate the associated code.

Data types

CodeGenConfigurationNode structure

CodeGenConfigurationNode enumerates all valid Node types. One and only one of its member variables can be populated.

Fields

  • AthenaConnectorSource – An AthenaConnectorSource object.

    Specifies a connector to an Amazon Athena data source.

  • JDBCConnectorSource – A JDBCConnectorSource object.

    Specifies a connector to a JDBC data source.

  • SparkConnectorSource – A SparkConnectorSource object.

    Specifies a connector to an Apache Spark data source.

  • CatalogSource – A CatalogSource object.

    Specifies a data store in the AWS Glue Data Catalog.

  • RedshiftSource – A RedshiftSource object.

    Specifies an Amazon Redshift data store.

  • S3CatalogSource – An S3CatalogSource object.

    Specifies an Amazon S3 data store in the AWS Glue Data Catalog.

  • S3CsvSource – An S3CsvSource object.

    Specifies a comma-separated values (CSV) data store stored in Amazon S3.

  • S3JsonSource – An S3JsonSource object.

    Specifies a JSON data store stored in Amazon S3.

  • S3ParquetSource – An S3ParquetSource object.

    Specifies an Apache Parquet data store stored in Amazon S3.

  • RelationalCatalogSource – A RelationalCatalogSource object.

    Specifies a relational data source in the AWS Glue Data Catalog.

  • DynamoDBCatalogSource – A DynamoDBCatalogSource object.

    Specifies a DynamoDB data source in the AWS Glue Data Catalog.

  • JDBCConnectorTarget – A JDBCConnectorTarget object.

    Specifies a target that uses a JDBC connector.

  • SparkConnectorTarget – A SparkConnectorTarget object.

    Specifies a target that uses an Apache Spark connector.

  • CatalogTarget – A BasicCatalogTarget object.

    Specifies a target that uses an AWS Glue Data Catalog table.

  • RedshiftTarget – A RedshiftTarget object.

    Specifies a target that uses Amazon Redshift.

  • S3CatalogTarget – An S3CatalogTarget object.

    Specifies a data target that writes to Amazon S3 using the AWS Glue Data Catalog.

  • S3GlueParquetTarget – An S3GlueParquetTarget object.

    Specifies a data target that writes to Amazon S3 in Apache Parquet columnar storage.

  • S3DirectTarget – An S3DirectTarget object.

    Specifies a data target that writes to Amazon S3.

  • ApplyMapping – An ApplyMapping object.

    Specifies a transform that maps data property keys in the data source to data property keys in the data target. You can rename keys, modify the data types for keys, and choose which keys to drop from the dataset.

  • SelectFields – A SelectFields object.

    Specifies a transform that chooses the data property keys that you want to keep.

  • DropFields – A DropFields object.

    Specifies a transform that chooses the data property keys that you want to drop.

  • RenameField – A RenameField object.

    Specifies a transform that renames a single data property key.

  • Spigot – A Spigot object.

    Specifies a transform that writes samples of the data to an Amazon S3 bucket.

  • Join – A Join object.

    Specifies a transform that joins two datasets into one dataset using a comparison phrase on the specified data property keys. You can use inner, outer, left, right, left semi, and left anti joins.

  • SplitFields – A SplitFields object.

    Specifies a transform that splits data property keys into two DynamicFrames. The output is a collection of DynamicFrames: one with selected data property keys, and one with the remaining data property keys.

  • SelectFromCollection – A SelectFromCollection object.

    Specifies a transform that chooses one DynamicFrame from a collection of DynamicFrames. The output is the selected DynamicFrame.

  • FillMissingValues – A FillMissingValues object.

    Specifies a transform that locates records in the dataset that have missing values and adds a new field with a value determined by imputation. The input data set is used to train the machine learning model that determines what the missing value should be.

  • Filter – A Filter object.

    Specifies a transform that splits a dataset into two, based on a filter condition.

  • CustomCode – A CustomCode object.

    Specifies a transform that uses custom code you provide to perform the data transformation. The output is a collection of DynamicFrames.

  • SparkSQL – A SparkSQL object.

    Specifies a transform where you enter a SQL query using Spark SQL syntax to transform the data. The output is a single DynamicFrame.

  • DirectKinesisSource – A DirectKinesisSource object.

    Specifies a direct Amazon Kinesis data source.

  • DirectKafkaSource – A DirectKafkaSource object.

    Specifies an Apache Kafka data store.

  • CatalogKinesisSource – A CatalogKinesisSource object.

    Specifies a Kinesis data source in the AWS Glue Data Catalog.

  • CatalogKafkaSource – A CatalogKafkaSource object.

    Specifies an Apache Kafka data store in the Data Catalog.

  • DropNullFields – A DropNullFields object.

    Specifies a transform that removes columns from the dataset if all values in the column are 'null'. By default, AWS Glue Studio will recognize null objects, but some values such as empty strings, strings that are "null", -1 integers or other placeholders such as zeros, are not automatically recognized as nulls.

  • Merge – A Merge object.

    Specifies a transform that merges a DynamicFrame with a staging DynamicFrame based on the specified primary keys to identify records. Duplicate records (records with the same primary keys) are not de-duplicated.

  • Union – A Union object.

    Specifies a transform that combines the rows from two or more datasets into a single result.

  • PIIDetection – A PIIDetection object.

    Specifies a transform that identifies, removes or masks PII data.

  • Aggregate – An Aggregate object.

    Specifies a transform that groups rows by chosen fields and computes an aggregated value using the specified function.

  • DropDuplicates – A DropDuplicates object.

    Specifies a transform that removes rows of repeating data from a data set.

  • GovernedCatalogTarget – A GovernedCatalogTarget object.

    Specifies a data target that writes to a governed catalog.

  • GovernedCatalogSource – A GovernedCatalogSource object.

    Specifies a data source in a governed Data Catalog.

  • MicrosoftSQLServerCatalogSource – A MicrosoftSQLServerCatalogSource object.

    Specifies a Microsoft SQL Server data source in the AWS Glue Data Catalog.

  • MySQLCatalogSource – A MySQLCatalogSource object.

    Specifies a MySQL data source in the AWS Glue Data Catalog.

  • OracleSQLCatalogSource – An OracleSQLCatalogSource object.

    Specifies an Oracle data source in the AWS Glue Data Catalog.

  • PostgreSQLCatalogSource – A PostgreSQLCatalogSource object.

    Specifies a PostgreSQL data source in the AWS Glue Data Catalog.

  • MicrosoftSQLServerCatalogTarget – A MicrosoftSQLServerCatalogTarget object.

    Specifies a target that uses Microsoft SQL Server.

  • MySQLCatalogTarget – A MySQLCatalogTarget object.

    Specifies a target that uses MySQL.

  • OracleSQLCatalogTarget – An OracleSQLCatalogTarget object.

    Specifies a target that uses Oracle SQL.

  • PostgreSQLCatalogTarget – A PostgreSQLCatalogTarget object.

    Specifies a target that uses PostgreSQL.
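The "one and only one" rule for CodeGenConfigurationNode can be checked on the client side before a request is sent. The following Python sketch is illustrative only: the helper name check_single_member, the sample name, and the bucket path are hypothetical, and the lowercase enum spellings ("comma", "quote") are an assumption about how the valid values listed for S3CsvSource are written in a request.

```python
def check_single_member(node: dict) -> str:
    """Return the name of the single populated member of a node.

    Raises ValueError unless exactly one member is populated (not None).
    """
    populated = [key for key, value in node.items() if value is not None]
    if len(populated) != 1:
        raise ValueError(
            f"expected exactly one populated member, found {len(populated)}: {populated}"
        )
    return populated[0]


# Hypothetical node that populates only the S3CsvSource member.
node = {
    "S3CsvSource": {
        "Name": "orders-csv",
        "Paths": ["s3://amzn-s3-demo-bucket/orders/"],
        "Separator": "comma",
        "QuoteChar": "quote",
    }
}

node_type = check_single_member(node)
```

A node that populates two members, or none, is rejected by this check before any API call is attempted.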

JDBCConnectorOptions structure

Additional connection options for the connector.

Fields

  • FilterPredicate – UTF-8 string, matching the Custom string pattern #30.

    Extra condition clause to filter data from source. For example:

    BillingCity='Mountain View'

    When using a query instead of a table name, you should validate that the query works with the specified filterPredicate.

  • PartitionColumn – UTF-8 string, matching the Custom string pattern #30.

    The name of an integer column that is used for partitioning. This option works only when it's included with lowerBound, upperBound, and numPartitions. This option works the same way as in the Spark SQL JDBC reader.

  • LowerBound – Number (long).

    The minimum value of partitionColumn that is used to decide partition stride.

  • UpperBound – Number (long).

    The maximum value of partitionColumn that is used to decide partition stride.

  • NumPartitions – Number (long).

    The number of partitions. This value, along with lowerBound (inclusive) and upperBound (exclusive), forms partition strides for generated WHERE clause expressions that are used to split the partitionColumn.

  • JobBookmarkKeys – An array of UTF-8 strings.

    The names of the job bookmark keys on which to sort.

  • JobBookmarkKeysSortOrder – UTF-8 string, matching the Custom string pattern #30.

    Specifies an ascending or descending sort order.

  • DataTypeMapping – A map array of key-value pairs.

    Each key is a UTF-8 string (valid values: ARRAY | BIGINT | BINARY | BIT | BLOB | BOOLEAN | CHAR | CLOB | DATALINK | DATE | DECIMAL | DISTINCT | DOUBLE | FLOAT | INTEGER | JAVA_OBJECT | LONGNVARCHAR | LONGVARBINARY | LONGVARCHAR | NCHAR | NCLOB | NULL | NUMERIC | NVARCHAR | OTHER | REAL | REF | REF_CURSOR | ROWID | SMALLINT | SQLXML | STRUCT | TIME | TIME_WITH_TIMEZONE | TIMESTAMP | TIMESTAMP_WITH_TIMEZONE | TINYINT | VARBINARY | VARCHAR).

    Each value is a UTF-8 string (valid values: DATE | STRING | TIMESTAMP | INT | FLOAT | LONG | BIGDECIMAL | BYTE | SHORT | DOUBLE).

    Custom data type mapping that builds a mapping from a JDBC data type to an AWS Glue data type. For example, the option "dataTypeMapping":{"FLOAT":"STRING"} maps data fields of JDBC type FLOAT into the Java String type by calling the ResultSet.getString() method of the driver, and uses it to build the AWS Glue record. The ResultSet object is implemented by each driver, so the behavior is specific to the driver you use. Refer to the documentation for your JDBC driver to understand how the driver performs the conversions.
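As a rough illustration of how the partitioning fields fit together, the Python sketch below checks that PartitionColumn is accompanied by LowerBound, UpperBound, and NumPartitions, and then splits the range into even strides. The helper names and the column name "id" are hypothetical, and the even split is a simplification: it is not the exact set of WHERE clause expressions that the JDBC reader generates.

```python
def validate_jdbc_options(options: dict) -> dict:
    """Check that the partitioning options are supplied together."""
    companions = ("LowerBound", "UpperBound", "NumPartitions")
    if "PartitionColumn" in options:
        missing = [key for key in companions if key not in options]
        if missing:
            raise ValueError(f"PartitionColumn also requires: {', '.join(missing)}")
    return options


def partition_ranges(lower: int, upper: int, num_partitions: int) -> list:
    """Split [lower, upper) into num_partitions half-open ranges (simplified)."""
    stride = (upper - lower) // num_partitions
    bounds = [lower + i * stride for i in range(num_partitions)] + [upper]
    return list(zip(bounds[:-1], bounds[1:]))


options = validate_jdbc_options({
    "FilterPredicate": "BillingCity='Mountain View'",
    "PartitionColumn": "id",  # hypothetical integer column
    "LowerBound": 0,
    "UpperBound": 1000,
    "NumPartitions": 4,
    "DataTypeMapping": {"FLOAT": "STRING"},
})

ranges = partition_ranges(
    options["LowerBound"], options["UpperBound"], options["NumPartitions"]
)
```

With these sample values each range covers 250 values of the partition column, with the lower bound inclusive and the upper bound exclusive.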

StreamingDataPreviewOptions structure

Specifies options related to data preview for viewing a sample of your data.

Fields

  • PollingTime – Number (long), at least 10.

    The polling time in milliseconds.

  • RecordPollingLimit – Number (long), at least 1.

    The limit to the number of records polled.
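The documented minimums for these two fields can be enforced with a small check. This is a minimal sketch; the function name validate_preview_options and the sample values are hypothetical.

```python
def validate_preview_options(options: dict) -> dict:
    """Enforce the documented minimums for StreamingDataPreviewOptions."""
    minimums = {"PollingTime": 10, "RecordPollingLimit": 1}
    for field, minimum in minimums.items():
        value = options.get(field)
        if value is not None and value < minimum:
            raise ValueError(f"{field} must be at least {minimum}, got {value}")
    return options


preview = validate_preview_options({"PollingTime": 500, "RecordPollingLimit": 100})
```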

AthenaConnectorSource structure

Specifies a connector to an Amazon Athena data source.

Fields

  • Name – Required: UTF-8 string, matching the Custom string pattern #33.

    The name of the data source.

  • ConnectionName – Required: UTF-8 string, matching the Custom string pattern #30.

    The name of the connection that is associated with the connector.

  • ConnectorName – Required: UTF-8 string, matching the Custom string pattern #30.

    The name of a connector that assists with accessing the data store in AWS Glue Studio.

  • ConnectionType – Required: UTF-8 string, matching the Custom string pattern #30.

    The type of connection, such as marketplace.athena or custom.athena, designating a connection to an Amazon Athena data store.

  • ConnectionTable – UTF-8 string, matching the Custom string pattern #31.

    The name of the table in the data source.

  • SchemaName – Required: UTF-8 string, matching the Custom string pattern #30.

    The name of the CloudWatch log group to read from. For example, /aws-glue/jobs/output.

  • OutputSchemas – An array of GlueSchema objects.

    Specifies the data schema for the custom Athena source.

JDBCConnectorSource structure

Specifies a connector to a JDBC data source.

Fields

  • Name – Required: UTF-8 string, matching the Custom string pattern #33.

    The name of the data source.

  • ConnectionName – Required: UTF-8 string, matching the Custom string pattern #30.

    The name of the connection that is associated with the connector.

  • ConnectorName – Required: UTF-8 string, matching the Custom string pattern #30.

    The name of a connector that assists with accessing the data store in AWS Glue Studio.

  • ConnectionType – Required: UTF-8 string, matching the Custom string pattern #30.

    The type of connection, such as marketplace.jdbc or custom.jdbc, designating a connection to a JDBC data store.

  • AdditionalOptions – A JDBCConnectorOptions object.

    Additional connection options for the connector.

  • ConnectionTable – UTF-8 string, matching the Custom string pattern #31.

    The name of the table in the data source.

  • Query – UTF-8 string, matching the Custom string pattern #32.

    The table or SQL query to get the data from. You can specify either ConnectionTable or Query, but not both.

  • OutputSchemas – An array of GlueSchema objects.

    Specifies the data schema for the custom JDBC source.
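The rule that ConnectionTable and Query are mutually exclusive can be captured in a small builder. The sketch below is illustrative: the function name build_jdbc_source and all sample values (connection, connector, and table names) are hypothetical, and it does not decide whether omitting both fields is allowed, because this reference does not say.

```python
def build_jdbc_source(name, connection_name, connector_name, connection_type,
                      connection_table=None, query=None):
    """Assemble a JDBCConnectorSource dict, rejecting table and query together."""
    if connection_table is not None and query is not None:
        raise ValueError("specify either ConnectionTable or Query, but not both")
    source = {
        "Name": name,
        "ConnectionName": connection_name,
        "ConnectorName": connector_name,
        "ConnectionType": connection_type,
    }
    # Only include the optional field that was actually supplied.
    if connection_table is not None:
        source["ConnectionTable"] = connection_table
    if query is not None:
        source["Query"] = query
    return source


jdbc_source = build_jdbc_source(
    name="accounts",
    connection_name="example-jdbc-connection",
    connector_name="example-jdbc-connector",
    connection_type="custom.jdbc",
    connection_table="account",
)
```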

SparkConnectorSource structure

Specifies a connector to an Apache Spark data source.

Fields

  • Name – Required: UTF-8 string, matching the Custom string pattern #33.

    The name of the data source.

  • ConnectionName – Required: UTF-8 string, matching the Custom string pattern #30.

    The name of the connection that is associated with the connector.

  • ConnectorName – Required: UTF-8 string, matching the Custom string pattern #30.

    The name of a connector that assists with accessing the data store in AWS Glue Studio.

  • ConnectionType – Required: UTF-8 string, matching the Custom string pattern #30.

    The type of connection, such as marketplace.spark or custom.spark, designating a connection to an Apache Spark data store.

  • AdditionalOptions – A map array of key-value pairs.

    Each key is a UTF-8 string, matching the Custom string pattern #30.

    Each value is a UTF-8 string, matching the Custom string pattern #30.

    Additional connection options for the connector.

  • OutputSchemas – An array of GlueSchema objects.

    Specifies the data schema for the custom Spark source.

CatalogSource structure

Specifies a data store in the AWS Glue Data Catalog.

Fields

MySQLCatalogSource structure

Specifies a MySQL data source in the AWS Glue Data Catalog.

Fields

PostgreSQLCatalogSource structure

Specifies a PostgreSQL data source in the AWS Glue Data Catalog.

Fields

OracleSQLCatalogSource structure

Specifies an Oracle data source in the AWS Glue Data Catalog.

Fields

MicrosoftSQLServerCatalogSource structure

Specifies a Microsoft SQL Server data source in the AWS Glue Data Catalog.

Fields

CatalogKinesisSource structure

Specifies a Kinesis data source in the AWS Glue Data Catalog.

Fields

  • Name – Required: UTF-8 string, matching the Custom string pattern #33.

    The name of the data source.

  • WindowSize – Number (integer).

    The amount of time to spend processing each micro batch.

  • DetectSchema – Boolean.

    Whether to automatically determine the schema from the incoming data.

  • Table – Required: UTF-8 string, matching the Custom string pattern #30.

    The name of the table in the database to read from.

  • Database – Required: UTF-8 string, matching the Custom string pattern #30.

    The name of the database to read from.

  • StreamingOptions – A KinesisStreamingSourceOptions object.

    Additional options for the Kinesis streaming data source.

  • DataPreviewOptions – A StreamingDataPreviewOptions object.

    Additional options for data preview.

DirectKinesisSource structure

Specifies a direct Amazon Kinesis data source.

Fields

  • Name – Required: UTF-8 string, matching the Custom string pattern #33.

    The name of the data source.

  • WindowSize – Number (integer).

    The amount of time to spend processing each micro batch.

  • DetectSchema – Boolean.

    Whether to automatically determine the schema from the incoming data.

  • StreamingOptions – A KinesisStreamingSourceOptions object.

    Additional options for the Kinesis streaming data source.

  • DataPreviewOptions – A StreamingDataPreviewOptions object.

    Additional options for data preview.

KinesisStreamingSourceOptions structure

Additional options for the Amazon Kinesis streaming data source.

Fields

  • EndpointUrl – UTF-8 string, matching the Custom string pattern #30.

    The URL of the Kinesis endpoint.

  • StreamName – UTF-8 string, matching the Custom string pattern #30.

    The name of the Kinesis data stream.

  • Classification – UTF-8 string, matching the Custom string pattern #30.

    An optional classification.

  • Delimiter – UTF-8 string, matching the Custom string pattern #30.

    Specifies the delimiter character.

  • StartingPosition – UTF-8 string (valid values: latest="LATEST" | trim_horizon="TRIM_HORIZON" | earliest="EARLIEST").

    The starting position in the Kinesis data stream to read data from. The possible values are "latest", "trim_horizon", or "earliest". The default value is "latest".

  • MaxFetchTimeInMs – Number (long).

    The maximum time spent in the job executor to fetch a record from the Kinesis data stream per shard, specified in milliseconds (ms). The default value is 1000.

  • MaxFetchRecordsPerShard – Number (long).

    The maximum number of records to fetch per shard in the Kinesis data stream. The default value is 100000.

  • MaxRecordPerRead – Number (long).

    The maximum number of records to fetch from the Kinesis data stream in each getRecords operation. The default value is 10000.

  • AddIdleTimeBetweenReads – Boolean.

    Adds a time delay between two consecutive getRecords operations. The default value is "False". This option is only configurable for Glue version 2.0 and above.

  • IdleTimeBetweenReadsInMs – Number (long).

    The minimum time delay between two consecutive getRecords operations, specified in ms. The default value is 1000. This option is only configurable for Glue version 2.0 and above.

  • DescribeShardInterval – Number (long).

    The minimum time interval between two ListShards API calls for your script to consider resharding. The default value is 1s.

  • NumRetries – Number (integer).

    The maximum number of retries for Kinesis Data Streams API requests. The default value is 3.

  • RetryIntervalMs – Number (long).

    The cool-off time period (specified in ms) before retrying the Kinesis Data Streams API call. The default value is 1000.

  • MaxRetryIntervalMs – Number (long).

    The maximum cool-off time period (specified in ms) between two retries of a Kinesis Data Streams API call. The default value is 10000.

  • AvoidEmptyBatches – Boolean.

    Avoids creating an empty microbatch job by checking for unread data in the Kinesis data stream before the batch is started. The default value is "False".

  • StreamArn – UTF-8 string, matching the Custom string pattern #30.

    The Amazon Resource Name (ARN) of the Kinesis data stream.

  • RoleArn – UTF-8 string, matching the Custom string pattern #30.

    The Amazon Resource Name (ARN) of the role to assume using AWS Security Token Service (AWS STS). This role must have permissions for describe or read record operations for the Kinesis data stream. You must use this parameter when accessing a data stream in a different account. Used in conjunction with "awsSTSSessionName".

  • RoleSessionName – UTF-8 string, matching the Custom string pattern #30.

    An identifier for the session assuming the role using AWS STS. You must use this parameter when accessing a data stream in a different account. Used in conjunction with "awsSTSRoleARN".
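Because RoleArn and RoleSessionName are used together for cross-account access, a client-side check can catch a request that sets only one of them. The Python sketch below is illustrative: the function name, the account ID, and the stream and role names are hypothetical, and the lowercase StartingPosition spellings are taken from the possible values listed above.

```python
VALID_STARTING_POSITIONS = {"latest", "trim_horizon", "earliest"}


def validate_kinesis_options(options: dict) -> dict:
    """Check StartingPosition and the RoleArn / RoleSessionName pairing."""
    position = options.get("StartingPosition", "latest")  # documented default
    if position not in VALID_STARTING_POSITIONS:
        raise ValueError(f"unsupported StartingPosition: {position!r}")
    # The two STS fields are described as being used in conjunction.
    if ("RoleArn" in options) != ("RoleSessionName" in options):
        raise ValueError("RoleArn and RoleSessionName must be set together")
    return options


kinesis_options = validate_kinesis_options({
    "StreamArn": "arn:aws:kinesis:us-east-1:111122223333:stream/example-stream",
    "StartingPosition": "trim_horizon",
    "RoleArn": "arn:aws:iam::111122223333:role/example-cross-account-role",
    "RoleSessionName": "example-glue-session",
    "NumRetries": 3,
    "RetryIntervalMs": 1000,
    "MaxRetryIntervalMs": 10000,
})
```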

CatalogKafkaSource structure

Specifies an Apache Kafka data store in the Data Catalog.

Fields

  • Name – Required: UTF-8 string, matching the Custom string pattern #33.

    The name of the data store.

  • WindowSize – Number (integer).

    The amount of time to spend processing each micro batch.

  • DetectSchema – Boolean.

    Whether to automatically determine the schema from the incoming data.

  • Table – Required: UTF-8 string, matching the Custom string pattern #30.

    The name of the table in the database to read from.

  • Database – Required: UTF-8 string, matching the Custom string pattern #30.

    The name of the database to read from.

  • StreamingOptions – A KafkaStreamingSourceOptions object.

    Specifies the streaming options.

  • DataPreviewOptions – A StreamingDataPreviewOptions object.

    Specifies options related to data preview for viewing a sample of your data.

DirectKafkaSource structure

Specifies an Apache Kafka data store.

Fields

  • Name – Required: UTF-8 string, matching the Custom string pattern #33.

    The name of the data store.

  • StreamingOptions – A KafkaStreamingSourceOptions object.

    Specifies the streaming options.

  • WindowSize – Number (integer).

    The amount of time to spend processing each micro batch.

  • DetectSchema – Boolean.

    Whether to automatically determine the schema from the incoming data.

  • DataPreviewOptions – A StreamingDataPreviewOptions object.

    Specifies options related to data preview for viewing a sample of your data.

KafkaStreamingSourceOptions structure

Additional options for streaming.

Fields

  • BootstrapServers – UTF-8 string, matching the Custom string pattern #30.

    A list of bootstrap server URLs, for example, b-1.vpc-test-2.o4q88o.c6.kafka.us-east-1.amazonaws.com:9094. This option must be specified in the API call or defined in the table metadata in the Data Catalog.

  • SecurityProtocol – UTF-8 string, matching the Custom string pattern #30.

    The protocol used to communicate with brokers. The possible values are "SSL" or "PLAINTEXT".

  • ConnectionName – UTF-8 string, matching the Custom string pattern #30.

    The name of the connection.

  • TopicName – UTF-8 string, matching the Custom string pattern #30.

    The topic name as specified in Apache Kafka. You must specify at least one of "topicName", "assign" or "subscribePattern".

  • Assign – UTF-8 string, matching the Custom string pattern #30.

    The specific TopicPartitions to consume. You must specify at least one of "topicName", "assign" or "subscribePattern".

  • SubscribePattern – UTF-8 string, matching the Custom string pattern #30.

    A Java regex string that identifies the topic list to subscribe to. You must specify at least one of "topicName", "assign" or "subscribePattern".

  • Classification – UTF-8 string, matching the Custom string pattern #30.

    An optional classification.

  • Delimiter – UTF-8 string, matching the Custom string pattern #30.

    Specifies the delimiter character.

  • StartingOffsets – UTF-8 string, matching the Custom string pattern #30.

    The starting position in the Kafka topic to read data from. The possible values are "earliest" or "latest". The default value is "latest".

  • EndingOffsets – UTF-8 string, matching the Custom string pattern #30.

    The end point when a batch query is ended. Possible values are either "latest" or a JSON string that specifies an ending offset for each TopicPartition.

  • PollTimeoutMs – Number (long).

    The timeout in milliseconds to poll data from Kafka in Spark job executors. The default value is 512.

  • NumRetries – Number (integer).

    The number of times to retry before failing to fetch Kafka offsets. The default value is 3.

  • RetryIntervalMs – Number (long).

    The time in milliseconds to wait before retrying to fetch Kafka offsets. The default value is 10.

  • MaxOffsetsPerTrigger – Number (long).

    The rate limit on the maximum number of offsets that are processed per trigger interval. The specified total number of offsets is proportionally split across topicPartitions of different volumes. The default value is null, which means that the consumer reads all offsets until the known latest offset.

  • MinPartitions – Number (integer).

    The desired minimum number of partitions to read from Kafka. The default value is null, which means that the number of Spark partitions is equal to the number of Kafka partitions.
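The requirement to specify at least one of TopicName, Assign, or SubscribePattern lends itself to a simple pre-flight check. This is a minimal sketch with a hypothetical function name and topic name; the bootstrap server value is the example given above.

```python
TOPIC_SELECTORS = ("TopicName", "Assign", "SubscribePattern")
SECURITY_PROTOCOLS = {"SSL", "PLAINTEXT"}


def validate_kafka_options(options: dict) -> dict:
    """Require a topic selector and a supported security protocol."""
    if not any(options.get(key) for key in TOPIC_SELECTORS):
        raise ValueError(f"specify at least one of: {', '.join(TOPIC_SELECTORS)}")
    protocol = options.get("SecurityProtocol")
    if protocol is not None and protocol not in SECURITY_PROTOCOLS:
        raise ValueError(f"unsupported SecurityProtocol: {protocol!r}")
    return options


kafka_options = validate_kafka_options({
    "BootstrapServers": "b-1.vpc-test-2.o4q88o.c6.kafka.us-east-1.amazonaws.com:9094",
    "SecurityProtocol": "SSL",
    "TopicName": "orders",  # hypothetical topic
    "StartingOffsets": "earliest",
})
```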

RedshiftSource structure

Specifies an Amazon Redshift data store.

Fields

S3CatalogSource structure

Specifies an Amazon S3 data store in the AWS Glue Data Catalog.

Fields

  • Name – Required: UTF-8 string, matching the Custom string pattern #33.

    The name of the data store.

  • Database – Required: UTF-8 string, matching the Custom string pattern #30.

    The database to read from.

  • Table – Required: UTF-8 string, matching the Custom string pattern #30.

    The database table to read from.

  • PartitionPredicate – UTF-8 string, matching the Custom string pattern #30.

    Partitions satisfying this predicate are deleted. Files within the retention period in these partitions are not deleted. Set to "" – empty by default.

  • AdditionalOptions – An S3SourceAdditionalOptions object.

    Specifies additional connection options.

GovernedCatalogSource structure

Specifies the data store in the governed AWS Glue Data Catalog.

Fields

  • Name – Required: UTF-8 string, matching the Custom string pattern #33.

    The name of the data store.

  • Database – Required: UTF-8 string, matching the Custom string pattern #30.

    The database to read from.

  • Table – Required: UTF-8 string, matching the Custom string pattern #30.

    The database table to read from.

  • PartitionPredicate – UTF-8 string, matching the Custom string pattern #30.

    Partitions satisfying this predicate are deleted. Files within the retention period in these partitions are not deleted. Set to "" – empty by default.

  • AdditionalOptions – An S3SourceAdditionalOptions object.

    Specifies additional connection options.

S3SourceAdditionalOptions structure

Specifies additional connection options for the Amazon S3 data store.

Fields

  • BoundedSize – Number (long).

    Sets the upper limit for the target size of the dataset in bytes that will be processed.

  • BoundedFiles – Number (long).

    Sets the upper limit for the target number of files that will be processed.

S3CsvSource structure

Specifies a comma-separated values (CSV) data store stored in Amazon S3.

Fields

  • Name – Required: UTF-8 string, matching the Custom string pattern #33.

    The name of the data store.

  • Paths – Required: An array of UTF-8 strings.

    A list of the Amazon S3 paths to read from.

  • CompressionType – UTF-8 string (valid values: gzip="GZIP" | bzip2="BZIP2").

    Specifies how the data is compressed. This is generally not necessary if the data has a standard file extension. Possible values are "gzip" and "bzip2".

  • Exclusions – An array of UTF-8 strings.

    A string containing a JSON list of Unix-style glob patterns to exclude. For example, "[\"**.pdf\"]" excludes all PDF files.

  • GroupSize – UTF-8 string, matching the Custom string pattern #30.

    The target group size in bytes. The default is computed based on the input data size and the size of your cluster. When there are fewer than 50,000 input files, "groupFiles" must be set to "inPartition" for this to take effect.

  • GroupFiles – UTF-8 string, matching the Custom string pattern #30.

    Grouping files is turned on by default when the input contains more than 50,000 files. To turn on grouping with fewer than 50,000 files, set this parameter to "inPartition". To disable grouping when there are more than 50,000 files, set this parameter to "none".

  • Recurse – Boolean.

    If set to true, recursively reads files in all subdirectories under the specified paths.

  • MaxBand – Number (integer).

    This option controls the duration in milliseconds after which the Amazon S3 listing is likely to be consistent. Files with modification timestamps falling within the last maxBand milliseconds are tracked specially when using job bookmarks to account for Amazon S3 eventual consistency. Most users don't need to set this option. The default is 900000 milliseconds, or 15 minutes.

  • MaxFilesInBand – Number (integer).

    This option specifies the maximum number of files to save from the last maxBand milliseconds. If this number is exceeded, extra files are skipped and only processed in the next job run.

  • AdditionalOptions – An S3DirectSourceAdditionalOptions object.

    Specifies additional connection options.

  • Separator – Required: UTF-8 string (valid values: comma="COMMA" | ctrla="CTRLA" | pipe="PIPE" | semicolon="SEMICOLON" | tab="TAB").

    Specifies the delimiter character. The default is a comma: ",", but any other character can be specified.

  • Escaper – UTF-8 string, matching the Custom string pattern #31.

    Specifies a character to use for escaping. This option is used only when reading CSV files. The default value is none. If enabled, the character which immediately follows is used as-is, except for a small set of well-known escapes (\n, \r, \t, and \0).

  • QuoteChar – Required: UTF-8 string (valid values: quote="QUOTE" | quillemet="QUILLEMET" | single_quote="SINGLE_QUOTE" | disabled="DISABLED").

    Specifies the character to use for quoting. The default is a double quote: '"'. Set this to -1 to turn off quoting entirely.

  • Multiline – Boolean.

    A Boolean value that specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to True if any record spans multiple lines. The default value is False, which allows for more aggressive file-splitting during parsing.

  • WithHeader – Boolean.

    A Boolean value that specifies whether to treat the first line as a header. The default value is False.

  • WriteHeader – Boolean.

    A Boolean value that specifies whether to write the header to output. The default value is True.

  • SkipFirst – Boolean.

    A Boolean value that specifies whether to skip the first data line. The default value is False.

  • OptimizePerformance – Boolean.

    A Boolean value that specifies whether to use the advanced SIMD CSV reader along with Apache Arrow based columnar memory formats. Only available in AWS Glue version 3.0.

  • OutputSchemas – An array of GlueSchema objects.

    Specifies the data schema for the S3 CSV source.
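The interaction between GroupFiles and the 50,000-file threshold, together with the required fields of S3CsvSource, can be sketched as follows. The helper names, the sample name, and the bucket path are hypothetical; the lowercase enum spellings are an assumption about how the listed valid values are written in a request.

```python
REQUIRED_CSV_FIELDS = ("Name", "Paths", "Separator", "QuoteChar")


def missing_required(source: dict) -> list:
    """List the required S3CsvSource fields that are absent."""
    return [field for field in REQUIRED_CSV_FIELDS if field not in source]


def grouping_enabled(file_count: int, group_files=None) -> bool:
    """Mirror the documented GroupFiles behavior for a given input size."""
    if group_files == "inPartition":
        return True   # explicitly turned on, even for small inputs
    if group_files == "none":
        return False  # explicitly turned off, even for large inputs
    return file_count > 50_000  # default: on only above the threshold


csv_source = {
    "Name": "orders-csv",
    "Paths": ["s3://amzn-s3-demo-bucket/orders/"],
    "Separator": "comma",
    "QuoteChar": "quote",
    "WithHeader": True,
    "GroupFiles": "inPartition",
}

problems = missing_required(csv_source)
grouped = grouping_enabled(file_count=1_200, group_files=csv_source["GroupFiles"])
```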

S3DirectSourceAdditionalOptions structure

Specifies additional connection options for the Amazon S3 data store.

Fields

  • BoundedSize – Number (long).

    Sets the upper limit for the target size of the dataset in bytes that will be processed.

  • BoundedFiles – Number (long).

    Sets the upper limit for the target number of files that will be processed.

  • EnableSamplePath – Boolean.

    Sets option to enable a sample path.

  • SamplePath – UTF-8 string, matching the Custom string pattern #30.

    If enabled, specifies the sample path.

S3JsonSource structure

Specifies a JSON data store stored in Amazon S3.

Fields

  • Name – Required: UTF-8 string, matching the Custom string pattern #33.

    The name of the data store.

  • Paths – Required: An array of UTF-8 strings.

    A list of the Amazon S3 paths to read from.

  • CompressionType – UTF-8 string (valid values: gzip="GZIP" | bzip2="BZIP2").

    Specifies how the data is compressed. This is generally not necessary if the data has a standard file extension. Possible values are "gzip" and "bzip2".

  • Exclusions – An array of UTF-8 strings.

    A string containing a JSON list of Unix-style glob patterns to exclude. For example, "[\"**.pdf\"]" excludes all PDF files.

  • GroupSize – UTF-8 string, matching the Custom string pattern #30.

    The target group size in bytes. The default is computed based on the input data size and the size of your cluster. When there are fewer than 50,000 input files, "groupFiles" must be set to "inPartition" for this to take effect.

  • GroupFiles – UTF-8 string, matching the Custom string pattern #30.

    Grouping files is turned on by default when the input contains more than 50,000 files. To turn on grouping with fewer than 50,000 files, set this parameter to "inPartition". To disable grouping when there are more than 50,000 files, set this parameter to "none".

  • Recurse – Boolean.

    If set to true, recursively reads files in all subdirectories under the specified paths.

  • MaxBand – Number (integer).

    This option controls the duration in milliseconds after which the Amazon S3 listing is likely to be consistent. Files with modification timestamps falling within the last maxBand milliseconds are tracked specially when using job bookmarks to account for Amazon S3 eventual consistency. Most users don't need to set this option. The default is 900000 milliseconds, or 15 minutes.

  • MaxFilesInBand – Number (integer).

    This option specifies the maximum number of files to save from the last maxBand milliseconds. If this number is exceeded, extra files are skipped and processed only in the next job run.

  • AdditionalOptions – A S3DirectSourceAdditionalOptions object.

    Specifies additional connection options.

  • JsonPath – UTF-8 string, matching the Custom string pattern #30.

    A JsonPath string defining the JSON data.

  • Multiline – Boolean.

    A Boolean value that specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to True if any record spans multiple lines. The default value is False, which allows for more aggressive file-splitting during parsing.

  • OutputSchemas – An array of GlueSchema objects.

    Specifies the data schema for the S3 JSON source.
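
As a rough illustration of the fields above, the following sketch builds an S3JsonSource node as a plain Python dictionary and checks that the two required fields are present. The bucket, prefix, and node name are placeholders, and the helper missing_required is hypothetical; it is not part of any AWS SDK.

```python
import json

# Hypothetical S3JsonSource node; bucket, prefix, and node name are placeholders.
s3_json_source = {
    "S3JsonSource": {
        "Name": "RawEvents",                       # required
        "Paths": ["s3://example-bucket/events/"],  # required
        "CompressionType": "gzip",
        "GroupFiles": "inPartition",  # turn on grouping below 50,000 files
        "GroupSize": "1048576",       # target group size in bytes, as a string
        "Recurse": True,
        "Multiline": False,
    }
}


def missing_required(node, required=("Name", "Paths")):
    """Return the required keys that the node's single member lacks."""
    (body,) = node.values()  # a node populates exactly one member
    return [key for key in required if key not in body]


print(json.dumps(s3_json_source, indent=2))
print(missing_required(s3_json_source))
```

GroupSize is written as a string because the field is documented as a UTF-8 string rather than a number.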

S3ParquetSource structure

Specifies an Apache Parquet data store stored in Amazon S3.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the data store.

  • PathsRequired: An array of UTF-8 strings.

    A list of the Amazon S3 paths to read from.

  • CompressionType – UTF-8 string (valid values: snappy="SNAPPY" | lzo="LZO" | gzip="GZIP" | uncompressed="UNCOMPRESSED" | none="NONE").

    Specifies how the data is compressed. This is generally not necessary if the data has a standard file extension. Possible values are "snappy", "lzo", "gzip", "uncompressed", and "none".

  • Exclusions – An array of UTF-8 strings.

    A string containing a JSON list of Unix-style glob patterns to exclude. For example, "[\"**.pdf\"]" excludes all PDF files.

  • GroupSize – UTF-8 string, matching the Custom string pattern #30.

    The target group size in bytes. The default is computed based on the input data size and the size of your cluster. When there are fewer than 50,000 input files, "groupFiles" must be set to "inPartition" for this to take effect.

  • GroupFiles – UTF-8 string, matching the Custom string pattern #30.

    Grouping files is turned on by default when the input contains more than 50,000 files. To turn on grouping with fewer than 50,000 files, set this parameter to "inPartition". To disable grouping when there are more than 50,000 files, set this parameter to "none".

  • Recurse – Boolean.

    If set to true, recursively reads files in all subdirectories under the specified paths.

  • MaxBand – Number (integer).

    This option controls the duration in milliseconds after which the Amazon S3 listing is likely to be consistent. Files with modification timestamps falling within the last maxBand milliseconds are tracked specially when using job bookmarks to account for Amazon S3 eventual consistency. Most users don't need to set this option. The default is 900000 milliseconds, or 15 minutes.

  • MaxFilesInBand – Number (integer).

    This option specifies the maximum number of files to save from the last maxBand milliseconds. If this number is exceeded, extra files are skipped and processed only in the next job run.

  • AdditionalOptions – A S3DirectSourceAdditionalOptions object.

    Specifies additional connection options.

  • OutputSchemas – An array of GlueSchema objects.

    Specifies the data schema for the S3 Parquet source.

DynamoDBCatalogSource structure

Specifies a DynamoDB data source in the AWS Glue Data Catalog.

Fields

RelationalCatalogSource structure

Specifies a relational database data source in the AWS Glue Data Catalog.

Fields

JDBCConnectorTarget structure

Specifies a data target that writes to a JDBC data store by using a connector.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the data target.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The nodes that are inputs to the data target.

  • ConnectionNameRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of the connection that is associated with the connector.

  • ConnectionTableRequired: UTF-8 string, matching the Custom string pattern #31.

    The name of the table in the data target.

  • ConnectorNameRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of a connector that will be used.

  • ConnectionTypeRequired: UTF-8 string, matching the Custom string pattern #30.

    The type of connection, such as marketplace.jdbc or custom.jdbc, designating a connection to a JDBC data target.

  • AdditionalOptions – A map array of key-value pairs.

    Each key is a UTF-8 string, matching the Custom string pattern #30.

    Each value is a UTF-8 string, matching the Custom string pattern #30.

    Additional connection options for the connector.

  • OutputSchemas – An array of GlueSchema objects.

    Specifies the data schema for the JDBC target.

SparkConnectorTarget structure

Specifies a target that uses an Apache Spark connector.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the data target.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The nodes that are inputs to the data target.

  • ConnectionNameRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of a connection for an Apache Spark connector.

  • ConnectorNameRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of an Apache Spark connector.

  • ConnectionTypeRequired: UTF-8 string, matching the Custom string pattern #30.

    The type of connection, such as marketplace.spark or custom.spark, designating a connection to an Apache Spark data store.

  • AdditionalOptions – A map array of key-value pairs.

    Each key is a UTF-8 string, matching the Custom string pattern #30.

    Each value is a UTF-8 string, matching the Custom string pattern #30.

    Additional connection options for the connector.

  • OutputSchemas – An array of GlueSchema objects.

    Specifies the data schema for the custom Spark target.

BasicCatalogTarget structure

Specifies a target that uses an AWS Glue Data Catalog table.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of your data target.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The nodes that are inputs to the data target.

  • DatabaseRequired: UTF-8 string, matching the Custom string pattern #30.

    The database that contains the table you want to use as the target. This database must already exist in the Data Catalog.

  • TableRequired: UTF-8 string, matching the Custom string pattern #30.

    The table that defines the schema of your output data. This table must already exist in the Data Catalog.

MySQLCatalogTarget structure

Specifies a target that uses MySQL.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the data target.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The nodes that are inputs to the data target.

  • DatabaseRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of the database to write to.

  • TableRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of the table in the database to write to.

PostgreSQLCatalogTarget structure

Specifies a target that uses PostgreSQL.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the data target.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The nodes that are inputs to the data target.

  • DatabaseRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of the database to write to.

  • TableRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of the table in the database to write to.

OracleSQLCatalogTarget structure

Specifies a target that uses Oracle SQL.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the data target.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The nodes that are inputs to the data target.

  • DatabaseRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of the database to write to.

  • TableRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of the table in the database to write to.

MicrosoftSQLServerCatalogTarget structure

Specifies a target that uses Microsoft SQL Server.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the data target.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The nodes that are inputs to the data target.

  • DatabaseRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of the database to write to.

  • TableRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of the table in the database to write to.

RedshiftTarget structure

Specifies a target that uses Amazon Redshift.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the data target.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The nodes that are inputs to the data target.

  • DatabaseRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of the database to write to.

  • TableRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of the table in the database to write to.

  • RedshiftTmpDir – UTF-8 string, matching the Custom string pattern #30.

    The Amazon S3 path where temporary data can be staged when copying out of the database.

  • TmpDirIAMRole – UTF-8 string, matching the Custom string pattern #30.

    The IAM role with permissions to access the temporary directory.

  • UpsertRedshiftOptions – An UpsertRedshiftTargetOptions object.

    The set of options to configure an upsert operation when writing to a Redshift target.

UpsertRedshiftTargetOptions structure

The options to configure an upsert operation when writing to a Redshift target.

Fields

  • TableLocation – UTF-8 string, matching the Custom string pattern #30.

    The physical location of the Redshift table.

  • ConnectionName – UTF-8 string, matching the Custom string pattern #30.

    The name of the connection to use to write to Redshift.

  • UpsertKeys – An array of UTF-8 strings.

    The keys used to determine whether to perform an update or insert.
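
The relationship between RedshiftTarget and UpsertRedshiftTargetOptions can be sketched as a nested dictionary. This is a minimal example under assumptions: the node name, input identifier, database, table, connection name, and key column are all placeholders, and upsert_keys is a hypothetical helper.

```python
# Hypothetical RedshiftTarget node with upsert options; every name is a placeholder.
redshift_target = {
    "RedshiftTarget": {
        "Name": "OrdersToRedshift",
        "Inputs": ["orders-node"],  # exactly one upstream node
        "Database": "analytics",
        "Table": "orders",
        "RedshiftTmpDir": "s3://example-bucket/tmp/",
        "UpsertRedshiftOptions": {
            "ConnectionName": "example-redshift-connection",
            "UpsertKeys": ["order_id"],  # decides update versus insert
        },
    }
}


def upsert_keys(node):
    """Return the upsert keys of a RedshiftTarget node, or [] if none are set."""
    options = node["RedshiftTarget"].get("UpsertRedshiftOptions", {})
    return options.get("UpsertKeys", [])


print(upsert_keys(redshift_target))
```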

S3CatalogTarget structure

Specifies a data target that writes to Amazon S3 using the AWS Glue Data Catalog.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the data target.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The nodes that are inputs to the data target.

  • PartitionKeys – An array of EnclosedInStringProperty members.

    An array of UTF-8 strings.

    Specifies native partitioning using a sequence of keys.

  • TableRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of the table in the database to write to.

  • DatabaseRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of the database to write to.

  • SchemaChangePolicy – A CatalogSchemaChangePolicy object.

    A policy that specifies update behavior for the crawler.

GovernedCatalogTarget structure

Specifies a data target that writes to Amazon S3 using the AWS Glue Data Catalog.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the data target.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The nodes that are inputs to the data target.

  • PartitionKeys – An array of EnclosedInStringProperty members.

    An array of UTF-8 strings.

    Specifies native partitioning using a sequence of keys.

  • TableRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of the table in the database to write to.

  • DatabaseRequired: UTF-8 string, matching the Custom string pattern #30.

    The name of the database to write to.

  • SchemaChangePolicy – A CatalogSchemaChangePolicy object.

    A policy that specifies update behavior for the governed catalog.

S3GlueParquetTarget structure

Specifies a data target that writes to Amazon S3 in Apache Parquet columnar storage.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the data target.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The nodes that are inputs to the data target.

  • PartitionKeys – An array of EnclosedInStringProperty members.

    An array of UTF-8 strings.

    Specifies native partitioning using a sequence of keys.

  • PathRequired: UTF-8 string, matching the Custom string pattern #30.

    A single Amazon S3 path to write to.

  • Compression – UTF-8 string (valid values: snappy="SNAPPY" | lzo="LZO" | gzip="GZIP" | uncompressed="UNCOMPRESSED" | none="NONE").

    Specifies how the data is compressed. This is generally not necessary if the data has a standard file extension. Possible values are "snappy", "lzo", "gzip", "uncompressed", and "none".

  • SchemaChangePolicy – A DirectSchemaChangePolicy object.

    A policy that specifies update behavior for the crawler.

CatalogSchemaChangePolicy structure

A policy that specifies update behavior for the crawler.

Fields

  • EnableUpdateCatalog – Boolean.

    Whether to use the specified update behavior when the crawler finds a changed schema.

  • UpdateBehavior – UTF-8 string (valid values: UPDATE_IN_DATABASE | LOG).

    The update behavior when the crawler finds a changed schema.

S3DirectTarget structure

Specifies a data target that writes to Amazon S3.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the data target.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The nodes that are inputs to the data target.

  • PartitionKeys – An array of EnclosedInStringProperty members.

    An array of UTF-8 strings.

    Specifies native partitioning using a sequence of keys.

  • PathRequired: UTF-8 string, matching the Custom string pattern #30.

    A single Amazon S3 path to write to.

  • Compression – UTF-8 string, matching the Custom string pattern #30.

    Specifies how the data is compressed. This is generally not necessary if the data has a standard file extension. Possible values are "gzip" and "bzip".

  • FormatRequired: UTF-8 string (valid values: json="JSON" | csv="CSV" | avro="AVRO" | orc="ORC" | parquet="PARQUET").

    Specifies the data output format for the target.

  • SchemaChangePolicy – A DirectSchemaChangePolicy object.

    A policy that specifies update behavior for the crawler.

DirectSchemaChangePolicy structure

A policy that specifies update behavior for the crawler.

Fields

  • EnableUpdateCatalog – Boolean.

    Whether to use the specified update behavior when the crawler finds a changed schema.

  • UpdateBehavior – UTF-8 string (valid values: UPDATE_IN_DATABASE | LOG).

    The update behavior when the crawler finds a changed schema.

  • Table – UTF-8 string, matching the Custom string pattern #30.

    Specifies the table in the database that the schema change policy applies to.

  • Database – UTF-8 string, matching the Custom string pattern #30.

    Specifies the database that the schema change policy applies to.
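
A minimal sketch of an S3DirectTarget node that carries a DirectSchemaChangePolicy, checked against the valid values listed above. The path, node names, database, and table are placeholders, and check_s3_direct_target is a hypothetical local helper, not an AWS API.

```python
VALID_FORMATS = {"json", "csv", "avro", "orc", "parquet"}
VALID_UPDATE_BEHAVIORS = {"UPDATE_IN_DATABASE", "LOG"}

# Hypothetical S3DirectTarget node; path and names are placeholders.
s3_direct_target = {
    "S3DirectTarget": {
        "Name": "WriteOrders",
        "Inputs": ["orders-node"],
        "Path": "s3://example-bucket/output/orders/",
        "Format": "parquet",
        "SchemaChangePolicy": {
            "EnableUpdateCatalog": True,
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "Database": "analytics",
            "Table": "orders",
        },
    }
}


def check_s3_direct_target(body):
    """Return a list of problems found in an S3DirectTarget body."""
    errors = []
    if body.get("Format") not in VALID_FORMATS:
        errors.append("Format must be one of " + ", ".join(sorted(VALID_FORMATS)))
    behavior = body.get("SchemaChangePolicy", {}).get("UpdateBehavior")
    if behavior is not None and behavior not in VALID_UPDATE_BEHAVIORS:
        errors.append("UpdateBehavior must be UPDATE_IN_DATABASE or LOG")
    return errors


print(check_s3_direct_target(s3_direct_target["S3DirectTarget"]))
```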

ApplyMapping structure

Specifies a transform that maps data property keys in the data source to data property keys in the data target. You can rename keys, modify the data types for keys, and choose which keys to drop from the dataset.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The data inputs identified by their node names.

  • MappingRequired: An array of Mapping objects.

    Specifies the mapping of data property keys in the data source to data property keys in the data target.

Mapping structure

Specifies the mapping of data property keys.

Fields

  • ToKey – UTF-8 string, matching the Custom string pattern #30.

    The name that the column should have after the mapping is applied. Can be the same as FromPath.

  • FromPath – An array of UTF-8 strings.

    The table or column to be modified.

  • FromType – UTF-8 string, matching the Custom string pattern #30.

    The type of the data to be modified.

  • ToType – UTF-8 string, matching the Custom string pattern #30.

    The data type that the data is to be modified to.

  • Dropped – Boolean.

    If true, then the column is removed.

  • Children – An array of Mapping objects.

    Only applicable to nested data structures. If you want to change the parent structure, but also one of its children, you can fill out this data structure. Each child is also a Mapping, but its FromPath will be the parent's FromPath plus the FromPath from this structure.

    For example, to change a child named inner of a parent named OuterStructure, you can specify a Mapping that looks like:

    { "FromPath": ["OuterStructure"], "ToKey": "OuterStructure", "ToType": "Struct", "Dropped": false, "Children": [{ "FromPath": ["inner"], "ToKey": "inner", "ToType": "Double", "Dropped": false }] }
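
The rule that a child's effective FromPath is the parent's FromPath plus its own can be sketched with a short recursive walk. This is an illustration only: full_paths is a hypothetical helper, and FromPath is written as a list of strings to match the field definition.

```python
def full_paths(mapping, parent=()):
    """Yield (full FromPath, ToKey) for a Mapping and each of its Children."""
    path = tuple(parent) + tuple(mapping["FromPath"])
    yield path, mapping["ToKey"]
    for child in mapping.get("Children", []):
        # A child's path is the parent's path plus the child's own FromPath.
        yield from full_paths(child, path)


mapping = {
    "FromPath": ["OuterStructure"],
    "ToKey": "OuterStructure",
    "ToType": "Struct",
    "Dropped": False,
    "Children": [
        {"FromPath": ["inner"], "ToKey": "inner", "ToType": "Double", "Dropped": False}
    ],
}

resolved = list(full_paths(mapping))
for path, to_key in resolved:
    print(".".join(path), "->", to_key)
```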

SelectFields structure

Specifies a transform that chooses the data property keys that you want to keep.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The data inputs identified by their node names.

  • PathsRequired: An array of EnclosedInStringProperty members.

    An array of UTF-8 strings.

    A JSON path to a variable in the data structure.

DropFields structure

Specifies a transform that chooses the data property keys that you want to drop.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The data inputs identified by their node names.

  • PathsRequired: An array of EnclosedInStringProperty members.

    An array of UTF-8 strings.

    A JSON path to a variable in the data structure.

RenameField structure

Specifies a transform that renames a single data property key.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The data inputs identified by their node names.

  • SourcePathRequired: An array of UTF-8 strings.

    A JSON path to a variable in the data structure for the source data.

  • TargetPathRequired: An array of UTF-8 strings.

    A JSON path to a variable in the data structure for the target data.

Spigot structure

Specifies a transform that writes samples of the data to an Amazon S3 bucket.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The data inputs identified by their node names.

  • PathRequired: UTF-8 string, matching the Custom string pattern #30.

    A path in Amazon S3 where the transform writes a subset of records from the dataset to a JSON file.

  • Topk – Number (integer), not more than 100.

    Specifies a number of records to write starting from the beginning of the dataset.

  • Prob – Number (double), not more than 1.

    The probability (a decimal value with a maximum value of 1) of picking any given record. A value of 1 indicates that each row read from the dataset should be included in the sample output.
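
The documented limits on Topk and Prob can be checked locally before a node is submitted. The sketch below assumes a hypothetical helper named check_spigot and placeholder node values; it only mirrors the constraints stated above.

```python
# Hypothetical Spigot node; the path and names are placeholders.
spigot = {
    "Spigot": {
        "Name": "SampleOrders",
        "Inputs": ["orders-node"],
        "Path": "s3://example-bucket/samples/",
        "Topk": 50,   # not more than 100
        "Prob": 0.2,  # not more than 1
    }
}


def check_spigot(body):
    """Return a list of violations of the documented Spigot constraints."""
    errors = []
    if "Path" not in body:
        errors.append("Path is required")
    if body.get("Topk", 0) > 100:
        errors.append("Topk must not be more than 100")
    if body.get("Prob", 0.0) > 1:
        errors.append("Prob must not be more than 1")
    return errors


print(check_spigot(spigot["Spigot"]))
```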

Join structure

Specifies a transform that joins two datasets into one dataset using a comparison phrase on the specified data property keys. You can use inner, outer, left, right, left semi, and left anti joins.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, not less than 2 or more than 2 strings.

    The data inputs identified by their node names.

  • JoinTypeRequired: UTF-8 string (valid values: equijoin="EQUIJOIN" | left="LEFT" | right="RIGHT" | outer="OUTER" | leftsemi="LEFT_SEMI" | leftanti="LEFT_ANTI").

    Specifies the type of join to be performed on the datasets.

  • ColumnsRequired: An array of JoinColumn objects, not less than 2 or more than 2 structures.

    A list of the two columns to be joined.

JoinColumn structure

Specifies a column to be joined.

Fields

  • FromRequired: UTF-8 string, matching the Custom string pattern #30.

    The column to be joined.

  • KeysRequired: An array of EnclosedInStringProperty members.

    An array of UTF-8 strings.

    The key of the column to be joined.
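
A Join node needs exactly two inputs and exactly two JoinColumn entries. The sketch below checks those counts and the join type. Several details are assumptions: the node identifiers and column names are placeholders, From is set to the input node identifier, Keys is written as a list of lists to reflect the nested array description, and check_join is a hypothetical helper.

```python
VALID_JOIN_TYPES = {"equijoin", "left", "right", "outer", "leftsemi", "leftanti"}

# Hypothetical Join node; identifiers and column names are placeholders.
join_node = {
    "Join": {
        "Name": "JoinOrdersCustomers",
        "Inputs": ["orders-node", "customers-node"],
        "JoinType": "left",
        "Columns": [
            {"From": "orders-node", "Keys": [["customer_id"]]},  # assumed shape
            {"From": "customers-node", "Keys": [["id"]]},        # assumed shape
        ],
    }
}


def check_join(body):
    """Return a list of violations of the documented Join constraints."""
    errors = []
    if len(body.get("Inputs", [])) != 2:
        errors.append("Inputs must contain exactly 2 strings")
    if len(body.get("Columns", [])) != 2:
        errors.append("Columns must contain exactly 2 JoinColumn objects")
    if body.get("JoinType") not in VALID_JOIN_TYPES:
        errors.append("JoinType is not a valid value")
    return errors


print(check_join(join_node["Join"]))
```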

SplitFields structure

Specifies a transform that splits data property keys into two DynamicFrames. The output is a collection of DynamicFrames: one with selected data property keys, and one with the remaining data property keys.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The data inputs identified by their node names.

  • PathsRequired: An array of EnclosedInStringProperty members.

    An array of UTF-8 strings.

    A JSON path to a variable in the data structure.

SelectFromCollection structure

Specifies a transform that chooses one DynamicFrame from a collection of DynamicFrames. The output is the selected DynamicFrame.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The data inputs identified by their node names.

  • IndexRequired: Number (integer).

    The index for the DynamicFrame to be selected.

FillMissingValues structure

Specifies a transform that locates records in the dataset that have missing values and adds a new field with a value determined by imputation. The input data set is used to train the machine learning model that determines what the missing value should be.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The data inputs identified by their node names.

  • ImputedPathRequired: UTF-8 string, matching the Custom string pattern #30.

    A JSON path to a variable in the data structure for the dataset that is imputed.

  • FilledPath – UTF-8 string, matching the Custom string pattern #30.

    A JSON path to a variable in the data structure for the dataset that is filled.

Filter structure

Specifies a transform that splits a dataset into two, based on a filter condition.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The data inputs identified by their node names.

  • LogicalOperatorRequired: UTF-8 string (valid values: AND | OR).

    The operator used to filter rows by comparing the key value to a specified value.

  • FiltersRequired: An array of FilterExpression objects.

    Specifies a filter expression.

FilterExpression structure

Specifies a filter expression.

Fields

  • OperationRequired: UTF-8 string (valid values: EQ | LT | GT | LTE | GTE | REGEX | ISNULL).

    The type of operation to perform in the expression.

  • Negated – Boolean.

    Whether the expression is to be negated.

  • ValuesRequired: An array of FilterValue objects.

    A list of filter values.

FilterValue structure

Represents a single entry in the list of values for a FilterExpression.

Fields

  • TypeRequired: UTF-8 string (valid values: COLUMNEXTRACTED | CONSTANT).

    The type of filter value.

  • ValueRequired: An array of UTF-8 strings.

    The value to be associated.
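
Filter, FilterExpression, and FilterValue nest inside one another. The following sketch shows one possible node that keeps rows where a column is greater than a constant, and validates the enumerated values. Column names, node names, and the helper check_filter are hypothetical.

```python
VALID_OPERATIONS = {"EQ", "LT", "GT", "LTE", "GTE", "REGEX", "ISNULL"}
VALID_VALUE_TYPES = {"COLUMNEXTRACTED", "CONSTANT"}

# Hypothetical Filter node; node and column names are placeholders.
filter_node = {
    "Filter": {
        "Name": "LargeOrders",
        "Inputs": ["orders-node"],
        "LogicalOperator": "AND",
        "Filters": [
            {
                "Operation": "GT",
                "Negated": False,
                "Values": [
                    {"Type": "COLUMNEXTRACTED", "Value": ["amount"]},
                    {"Type": "CONSTANT", "Value": ["100"]},
                ],
            }
        ],
    }
}


def check_filter(body):
    """Return a list of enumerated-value problems in a Filter body."""
    errors = []
    if body.get("LogicalOperator") not in {"AND", "OR"}:
        errors.append("LogicalOperator must be AND or OR")
    for expression in body.get("Filters", []):
        if expression.get("Operation") not in VALID_OPERATIONS:
            errors.append("invalid Operation")
        for value in expression.get("Values", []):
            if value.get("Type") not in VALID_VALUE_TYPES:
                errors.append("invalid FilterValue Type")
    return errors


print(check_filter(filter_node["Filter"]))
```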

CustomCode structure

Specifies a transform that uses custom code you provide to perform the data transformation. The output is a collection of DynamicFrames.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, at least 1 string.

    The data inputs identified by their node names.

  • CodeRequired: UTF-8 string, matching the Custom string pattern #26.

    The custom code that is used to perform the data transformation.

  • ClassNameRequired: UTF-8 string, matching the Custom string pattern #30.

    The name defined for the custom code node class.

  • OutputSchemas – An array of GlueSchema objects.

    Specifies the data schema for the custom code transform.

SparkSQL structure

Specifies a transform where you enter a SQL query using Spark SQL syntax to transform the data. The output is a single DynamicFrame.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, at least 1 string.

    The data inputs identified by their node names. You can associate a table name with each input node to use in the SQL query. The name you choose must meet the Spark SQL naming restrictions.

  • SqlQueryRequired: UTF-8 string, matching the Custom string pattern #32.

    A SQL query that must use Spark SQL syntax and return a single data set.

  • SqlAliasesRequired: An array of SqlAlias objects.

    A list of aliases. An alias allows you to specify what name to use in the SQL for a given input. For example, suppose you have a data source named "MyDataSource". If you specify From as MyDataSource, and Alias as SqlName, then you can write the following in your SQL query:

    select * from SqlName

    and that gets data from MyDataSource.

  • OutputSchemas – An array of GlueSchema objects.

    Specifies the data schema for the SparkSQL transform.
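
The alias example above can be written out as a SparkSQL node. This is a sketch under assumptions: the From and Alias keys follow the description of SqlAliases, the node name is a placeholder, and unaliased_inputs is a hypothetical helper that reports inputs with no alias.

```python
# Hypothetical SparkSQL node built from the MyDataSource / SqlName example.
spark_sql = {
    "SparkSQL": {
        "Name": "SelectAll",
        "Inputs": ["MyDataSource"],
        "SqlQuery": "select * from SqlName",
        "SqlAliases": [{"From": "MyDataSource", "Alias": "SqlName"}],
    }
}


def unaliased_inputs(body):
    """Return the inputs that have no entry in SqlAliases."""
    aliased = {alias["From"] for alias in body.get("SqlAliases", [])}
    return [name for name in body.get("Inputs", []) if name not in aliased]


print(unaliased_inputs(spark_sql["SparkSQL"]))
```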

SqlAlias structure

Represents a single entry in the list of values for SqlAliases.

Fields

DropNullFields structure

Specifies a transform that removes columns from the dataset if all values in the column are 'null'. By default, AWS Glue Studio will recognize null objects, but some values such as empty strings, strings that are "null", -1 integers or other placeholders such as zeros, are not automatically recognized as nulls.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The data inputs identified by their node names.

  • NullCheckBoxList – A NullCheckBoxList object.

    A structure that represents whether certain values are recognized as null values for removal.

  • NullTextList – An array of NullValueField objects, not more than 50 structures.

    A structure that specifies a list of NullValueField structures that represent a custom null value such as zero or other value being used as a null placeholder unique to the dataset.

    The DropNullFields transform removes custom null values only if both the value of the null placeholder and the datatype match the data.

NullCheckBoxList structure

Represents whether certain values are recognized as null values for removal.

Fields

  • IsEmpty – Boolean.

    Specifies that an empty string is considered as a null value.

  • IsNullString – Boolean.

    Specifies that a value spelling out the word 'null' is considered as a null value.

  • IsNegOne – Boolean.

    Specifies that an integer value of -1 is considered as a null value.
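
To make the three flags concrete, the sketch below mimics how they might widen the set of values treated as null. It is a local illustration, not the logic AWS Glue runs: is_recognized_null is a hypothetical function, and the exact, case-sensitive match on "null" is an assumption.

```python
def is_recognized_null(value, checks):
    """Return True if value counts as null under the given NullCheckBoxList flags."""
    if value is None:
        return True  # null objects are recognized by default
    if checks.get("IsEmpty") and value == "":
        return True
    if checks.get("IsNullString") and value == "null":  # assumed exact match
        return True
    is_int = isinstance(value, int) and not isinstance(value, bool)
    if checks.get("IsNegOne") and is_int and value == -1:
        return True
    return False


null_check_box_list = {"IsEmpty": True, "IsNullString": True, "IsNegOne": False}

for candidate in (None, "", "null", -1, 0):
    print(repr(candidate), is_recognized_null(candidate, null_check_box_list))
```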

NullValueField structure

Represents a custom null value such as zero or another value being used as a null placeholder unique to the dataset.

Fields

  • ValueRequired: UTF-8 string, matching the Custom string pattern #30.

    The value of the null placeholder.

  • DatatypeRequired: A Datatype object.

    The datatype of the value.

Datatype structure

A structure representing the datatype of the value.

Fields

Merge structure

Specifies a transform that merges a DynamicFrame with a staging DynamicFrame based on the specified primary keys to identify records. Duplicate records (records with the same primary keys) are not de-duplicated.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, not less than 2 or more than 2 strings.

    The data inputs identified by their node names.

  • SourceRequired: UTF-8 string, matching the Custom string pattern #29.

    The source DynamicFrame that will be merged with a staging DynamicFrame.

  • PrimaryKeysRequired: An array of EnclosedInStringProperty members.

    An array of UTF-8 strings.

    The list of primary key fields to match records from the source and staging dynamic frames.

DropDuplicates structure

Specifies a transform that removes rows of repeating data from a data set.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The data inputs identified by their node names.

  • Columns – An array of EnclosedInStringProperty members.

    An array of UTF-8 strings.

    The names of the columns to be merged or removed if repeating.

Union structure

Specifies a transform that combines the rows from two or more datasets into a single result.

Fields

  • NameRequired: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • InputsRequired: An array of UTF-8 strings, not less than 2 or more than 2 strings.

    The node ID inputs to the transform.

  • UnionTypeRequired: UTF-8 string (valid values: ALL | DISTINCT).

    Indicates the type of Union transform.

    Specify ALL to join all rows from data sources to the resulting DynamicFrame. The resulting union does not remove duplicate rows.

    Specify DISTINCT to remove duplicate rows in the resulting DynamicFrame.
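
The difference between ALL and DISTINCT can be illustrated on small in-memory rows. This sketch does not use AWS Glue or DynamicFrames. union_rows is a hypothetical function, rows are plain dictionaries with hashable values, and keeping the first occurrence of each duplicate is an assumption.

```python
def union_rows(left, right, union_type):
    """Combine two lists of row dictionaries the way ALL and DISTINCT are described."""
    rows = list(left) + list(right)
    if union_type == "ALL":
        return rows  # duplicates are kept
    if union_type == "DISTINCT":
        seen = set()
        unique = []
        for row in rows:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                unique.append(row)
        return unique
    raise ValueError("UnionType must be ALL or DISTINCT")


left = [{"id": 1}, {"id": 2}]
right = [{"id": 2}, {"id": 3}]
print(union_rows(left, right, "ALL"))
print(union_rows(left, right, "DISTINCT"))
```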

PIIDetection structure

Specifies a transform that identifies, removes or masks PII data.

Fields

  • Name – Required: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • Inputs – Required: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    The node ID inputs to the transform.

  • PiiType – Required: UTF-8 string (valid values: RowAudit | RowMasking | ColumnAudit | ColumnMasking).

    Indicates the type of PIIDetection transform.

  • EntityTypesToDetect – Required: An array of UTF-8 strings.

    Indicates the types of entities the PIIDetection transform will identify as PII data.

    PII type entities include: PERSON_NAME, DATE, USA_SSN, EMAIL, USA_ITIN, USA_PASSPORT_NUMBER, PHONE_NUMBER, BANK_ACCOUNT, IP_ADDRESS, MAC_ADDRESS, USA_CPT_CODE, USA_HCPCS_CODE, USA_NATIONAL_DRUG_CODE, USA_MEDICARE_BENEFICIARY_IDENTIFIER, USA_HEALTH_INSURANCE_CLAIM_NUMBER, CREDIT_CARD, USA_NATIONAL_PROVIDER_IDENTIFIER, USA_DEA_NUMBER, USA_DRIVING_LICENSE.

  • OutputColumnName – UTF-8 string, matching the Custom string pattern #30.

    Indicates the output column name that will contain any entity type detected in that row.

  • SampleFraction – Number (double), not more than 1.

    Indicates the fraction of the data to sample when scanning for PII entities.

  • ThresholdFraction – Number (double), not more than 1.

    Indicates the fraction of the data that must be met in order for a column to be identified as PII data.

  • MaskValue – UTF-8 string, not more than 256 bytes long, matching the Custom string pattern #28.

    Indicates the value that will replace the detected entity.
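
A PIIDetection node could look like the dictionary below, with a small local check of the constraints listed above. The node names, the output column name, and the helper `pii_problems` are hypothetical. Only the upper bound on the two fractions is checked because that is the only bound stated here, and the Custom string patterns are not checked.

```python
VALID_PII_TYPES = {"RowAudit", "RowMasking", "ColumnAudit", "ColumnMasking"}

# Hypothetical PIIDetection node; names and values are placeholders.
pii_node = {
    "PIIDetection": {
        "Name": "DetectContactInfo",
        "Inputs": ["customers-node"],            # exactly 1 input
        "PiiType": "RowAudit",
        "EntityTypesToDetect": ["EMAIL", "PHONE_NUMBER"],
        "OutputColumnName": "detected_entities",
        "SampleFraction": 0.5,
    }
}


def pii_problems(node):
    """Return a list of violations of the field constraints listed above."""
    body = node["PIIDetection"]
    problems = [
        f"missing {key}"
        for key in ("Name", "Inputs", "PiiType", "EntityTypesToDetect")
        if key not in body
    ]
    if "PiiType" in body and body["PiiType"] not in VALID_PII_TYPES:
        problems.append("PiiType is not one of the valid values")
    for key in ("SampleFraction", "ThresholdFraction"):
        if key in body and body[key] > 1:
            problems.append(f"{key} must not be more than 1")
    if "MaskValue" in body and len(body["MaskValue"].encode("utf-8")) > 256:
        problems.append("MaskValue must not be more than 256 bytes long")
    return problems


print(pii_problems(pii_node))
```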

Aggregate structure

Specifies a transform that groups rows by chosen fields and computes the aggregated value by specified function.

Fields

  • Name – Required: UTF-8 string, matching the Custom string pattern #33.

    The name of the transform node.

  • Inputs – Required: An array of UTF-8 strings, not less than 1 or more than 1 strings.

    Specifies the fields and rows to use as inputs for the aggregate transform.

  • Groups – Required: An array of EnclosedInStringProperty members.

    An array of UTF-8 strings.

    Specifies the fields to group by.

  • Aggs – Required: An array of AggregateOperation objects, not less than 1 or more than 30 structures.

    Specifies the aggregate functions to be performed on specified fields.
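
The grouping behavior described for this transform can be sketched in plain Python. This is only an illustration of grouping rows by chosen fields and applying a function to each group. The helper `aggregate`, its parameters, and the sample rows are hypothetical, and nothing here reflects how AWS Glue evaluates the node.

```python
from collections import defaultdict


def aggregate(rows, group_fields, value_field, func):
    """Group rows by `group_fields` and apply `func` to each group's values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in group_fields)
        groups[key].append(row[value_field])
    return {key: func(values) for key, values in groups.items()}


sales = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 20},
    {"region": "west", "amount": 5},
]

print(aggregate(sales, ["region"], "amount", sum))
print(aggregate(sales, ["region"], "amount", max))
```

Here `group_fields` plays the role of `Groups`, and each `(value_field, func)` pair corresponds loosely to one entry in `Aggs`.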

GlueSchema structure

Specifies a user-defined schema when a schema cannot be determined by AWS Glue.

Fields

  • Columns – An array of GlueStudioSchemaColumn objects.

    Specifies the column definitions that make up an AWS Glue schema.

GlueStudioSchemaColumn structure

Specifies a single column in an AWS Glue Studio schema definition.

Fields

  • Name – Required: UTF-8 string, not more than 1024 bytes long, matching the Single-line string pattern.

    The name of the column in the AWS Glue Studio schema.

  • Type – UTF-8 string, not more than 131072 bytes long, matching the Single-line string pattern.

    The Hive type for this column in the AWS Glue Studio schema.
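
A GlueSchema with a few GlueStudioSchemaColumn entries might be written as follows. The column names are hypothetical, the `Type` values are ordinary Hive type strings chosen for illustration, and the helper `schema_problems` is not part of any AWS SDK. It checks only the required `Name` field and the byte-length limits listed above, not the Single-line string pattern.

```python
# Hypothetical user-defined schema; column names are placeholders.
glue_schema = {
    "Columns": [
        {"Name": "order_id", "Type": "bigint"},
        {"Name": "placed_at", "Type": "timestamp"},
        {"Name": "tags", "Type": "array<string>"},
    ]
}


def schema_problems(schema):
    """Return a list of violations of the column constraints listed above."""
    problems = []
    for i, col in enumerate(schema.get("Columns", [])):
        name = col.get("Name")
        if name is None:
            problems.append(f"column {i}: Name is required")
        elif len(name.encode("utf-8")) > 1024:
            problems.append(f"column {i}: Name is longer than 1024 bytes")
        col_type = col.get("Type")
        if col_type is not None and len(col_type.encode("utf-8")) > 131072:
            problems.append(f"column {i}: Type is longer than 131072 bytes")
    return problems


print(schema_problems(glue_schema))
```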

GlueStudioColumn structure

Specifies a single column in AWS Glue Studio.

Fields

  • Key – Required: UTF-8 string, matching the Custom string pattern #31.

    The key of the column in AWS Glue Studio.

  • FullPath – Required: An array of UTF-8 strings.

    The full URL of the column in AWS Glue Studio.

  • Type – Required: UTF-8 string (valid values: array="ARRAY" | bigint="BIGINT" | bigint array="BIGINT_ARRAY" | binary="BINARY" | binary array="BINARY_ARRAY" | boolean="BOOLEAN" | boolean array="BOOLEAN_ARRAY" | byte="BYTE" | byte array="BYTE_ARRAY" | char="CHAR" | char array="CHAR_ARRAY" | choice="CHOICE" | choice array="CHOICE_ARRAY" | date="DATE" | date array="DATE_ARRAY" | decimal="DECIMAL" | decimal array="DECIMAL_ARRAY" | double="DOUBLE" | double array="DOUBLE_ARRAY" | enum="ENUM" | enum array="ENUM_ARRAY" | float="FLOAT" | float array="FLOAT_ARRAY" | int="INT" | int array="INT_ARRAY" | interval="INTERVAL" | interval array="INTERVAL_ARRAY" | long="LONG" | long array="LONG_ARRAY" | object="OBJECT" | short="SHORT" | short array="SHORT_ARRAY" | smallint="SMALLINT" | smallint array="SMALLINT_ARRAY" | string="STRING" | string array="STRING_ARRAY" | timestamp="TIMESTAMP" | timestamp array="TIMESTAMP_ARRAY" | tinyint="TINYINT" | tinyint array="TINYINT_ARRAY" | varchar="VARCHAR" | varchar array="VARCHAR_ARRAY" | null="NULL" | unknown="UNKNOWN" | unknown array="UNKNOWN_ARRAY").

    The type of column in AWS Glue Studio.

  • Children – An array of GlueStudioColumn objects.

    The children of the parent column in AWS Glue Studio.
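
Because a column can carry child columns, a GlueStudioColumn forms a tree. The sketch below models a nested `address` column and walks it to list the leaf paths. It assumes that each child is itself a GlueStudioColumn and that `FullPath` holds the path segments from the root column down. The column names and the helper `leaf_paths` are hypothetical, and the Custom string pattern #31 is not checked.

```python
# Hypothetical nested column; keys and paths are placeholders.
address_column = {
    "Key": "address",
    "FullPath": ["address"],
    "Type": "OBJECT",
    "Children": [
        {"Key": "city", "FullPath": ["address", "city"], "Type": "STRING"},
        {"Key": "zip", "FullPath": ["address", "zip"], "Type": "STRING"},
    ],
}


def leaf_paths(column):
    """Return dotted paths of all columns that have no children."""
    children = column.get("Children", [])
    if not children:
        return [".".join(column["FullPath"])]
    paths = []
    for child in children:
        paths.extend(leaf_paths(child))
    return paths


print(leaf_paths(address_column))
```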