Special Parameters Used by AWS Glue

AWS Glue recognizes several argument names that you can use to set up the script environment for your jobs and job runs:

  • --job-language  —  The script programming language. This value must be either scala or python. If this parameter is not present, the default is python.

  • --class  —  The Scala class that serves as the entry point for your Scala script. This applies only if your --job-language is set to scala.

  • --scriptLocation  —  The Amazon Simple Storage Service (Amazon S3) location where your ETL script is located (in the form s3://path/to/my/script.py). This parameter overrides a script location set in the JobCommand object.

  • --extra-py-files  —  The Amazon S3 paths to additional Python modules that AWS Glue adds to the Python path before executing your script. Multiple values must be complete paths separated by a comma (,). Only individual files are supported, not a directory path. Currently, only pure Python modules work. Extension modules written in C or other languages are not supported.

  • --extra-jars  —  The Amazon S3 paths to additional Java .jar files that AWS Glue adds to the Java classpath before executing your script. Multiple values must be complete paths separated by a comma (,).

  • --extra-files  —  The Amazon S3 paths to additional files, such as configuration files that AWS Glue copies to the working directory of your script before executing it. Multiple values must be complete paths separated by a comma (,). Only individual files are supported, not a directory path.

  • --job-bookmark-option  —  Controls the behavior of a job bookmark. The following option values can be set:

    • job-bookmark-enable  —  Keep track of previously processed data. When a job runs, process new data since the last checkpoint.

    • job-bookmark-disable  —  Always process the entire dataset. You are responsible for managing the output from previous job runs.

    • job-bookmark-pause  —  Process incremental data since the last successful run, or the data in the range identified by the following suboptions, without updating the state of the last bookmark. You are responsible for managing the output from previous job runs. The two suboptions are as follows:

      • job-bookmark-from <from-value> is the run ID that represents all the input that was processed until the last successful run before and including the specified run ID. The corresponding input is ignored.

      • job-bookmark-to <to-value> is the run ID that represents all the input that was processed until the last successful run before and including the specified run ID. The corresponding input, excluding the input identified by the <from-value>, is processed by the job. Any input later than this input is also excluded from processing.

      The job bookmark state is not updated when this option set is specified.

      The suboptions are optional. However, when used, both suboptions must be provided.

    For example, to enable a job bookmark, pass the following argument. For a sketch that sets this and other special parameters through the AWS SDK for Python, see the example following this list.

    '--job-bookmark-option': 'job-bookmark-enable'
  • --TempDir  —  Specifies an Amazon S3 path to a bucket that can be used as a temporary directory for the job.

    For example, to set a temporary directory, pass the following argument.

    '--TempDir': 's3-path-to-directory'
  • --enable-s3-parquet-optimized-committer  —  Enables the EMRFS S3-optimized committer for writing Parquet data into Amazon S3. You can supply the parameter/value pair through the AWS Glue console when creating or updating an AWS Glue job. Setting the value to true enables the committer; the flag is off by default.

    For more information, see Using the EMRFS S3-optimized Committer.

  • --enable-glue-datacatalog  —  Enables you to use the AWS Glue Data Catalog as an Apache Spark Hive metastore.

  • --enable-metrics  —  Enables the collection of metrics for job profiling for this job run. These metrics are available on the AWS Glue console and the Amazon CloudWatch console. To enable metrics, specify only the key; no value is needed.

  • --enable-continuous-cloudwatch-log  —  Enables real-time continuous logging for AWS Glue jobs. You can view real-time Apache Spark job logs in CloudWatch.

  • --enable-continuous-log-filter  —  Specifies a standard filter (true) or no filter (false) when you create or edit a job enabled for continuous logging. Choosing the standard filter prunes out non-useful Apache Spark driver/executor and Apache Hadoop YARN heartbeat log messages. Choosing no filter gives you all the log messages.

  • --continuous-log-logGroup  —  Specifies a custom Amazon CloudWatch log group name for a job enabled for continuous logging.

  • --continuous-log-logStreamPrefix  —  Specifies a custom CloudWatch log stream prefix for a job enabled for continuous logging.

  • --continuous-log-conversionPattern  —  Specifies a custom conversion log pattern for a job enabled for continuous logging. The conversion pattern applies only to driver logs and executor logs. It does not affect the AWS Glue progress bar.
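
Many of these parameters can also be supplied as job defaults when you create or update a job, as the console note for --enable-s3-parquet-optimized-committer suggests. The following is a minimal sketch using boto3 (the AWS SDK for Python); the job name, role, and bucket are placeholder values, and passing an empty string for the key-only --enable-metrics flag is an assumption based on the note above.

    import boto3

    glue = boto3.client("glue")

    # Create a job whose runs default to several of the special parameters above.
    glue.create_job(
        Name="my-etl-job",  # hypothetical job name
        Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder role ARN
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-example-bucket/scripts/script.py",
        },
        DefaultArguments={
            "--job-language": "python",
            "--job-bookmark-option": "job-bookmark-enable",
            "--TempDir": "s3://my-example-bucket/glue-temp/",
            "--enable-metrics": "",  # key-only flag; empty string stands in for no value
        },
    )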

For example, the following is the syntax for running a job with the --arguments option to set a special parameter.

$ aws glue start-job-run --job-name "CSV to CSV" --arguments='--scriptLocation="s3://my_glue/libraries/test_lib.py"'
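
The same run expressed through boto3 (the AWS SDK for Python) would look like the following sketch, with the job name and script path copied from the CLI example above.

    import boto3

    glue = boto3.client("glue")

    # Equivalent of the CLI call above: override the script location for this run.
    response = glue.start_job_run(
        JobName="CSV to CSV",
        Arguments={"--scriptLocation": "s3://my_glue/libraries/test_lib.py"},
    )
    print(response["JobRunId"])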

AWS Glue uses the following argument names internally; you should never set them:

  • --conf  —  Internal to AWS Glue. Do not set.

  • --debug  —  Internal to AWS Glue. Do not set.

  • --mode  —  Internal to AWS Glue. Do not set.

  • --JOB_NAME  —  Internal to AWS Glue. Do not set.
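
If you build the arguments map for a job run dynamically, one defensive habit is to strip these reserved names before calling the API. The following helper is our own illustration, not part of AWS Glue.

    # Hypothetical helper: drop argument names that AWS Glue reserves for internal
    # use so they never reach a StartJobRun call by accident.
    RESERVED_ARGS = {"--conf", "--debug", "--mode", "--JOB_NAME"}

    def safe_arguments(arguments):
        """Return a copy of the arguments map without reserved internal names."""
        return {key: value for key, value in arguments.items() if key not in RESERVED_ARGS}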