Calling AWS Glue APIs in Python

Note that Boto 3 resource APIs are not yet available for AWS Glue. Currently, only the Boto 3 client APIs can be used.

AWS Glue API names in Python

AWS Glue API names in Java and other programming languages are generally CamelCased. However, when called from Python, these generic names are changed to lowercase, with the parts of the name separated by underscore characters to make them more "Pythonic". In the AWS Glue API reference documentation, these Pythonic names are listed in parentheses after the generic CamelCased names.

However, although the AWS Glue API names themselves are transformed to lowercase, their parameter names remain capitalized. It is important to remember this, because parameters should be passed by name when calling AWS Glue APIs, as described in the following section.
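For example, the GetJob API operation is exposed in Python as get_job, while its JobName parameter keeps its CamelCase spelling. The following is a minimal sketch of this naming convention; the job name 'sample' is a placeholder:

import boto3

glue = boto3.client('glue')

# GetJob becomes get_job in Python; the JobName parameter stays CamelCased.
response = glue.get_job(JobName='sample')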

Passing and accessing Python parameters in AWS Glue

In Python calls to AWS Glue APIs, it's best to pass parameters explicitly by name. For example:

job = glue.create_job(Name='sample', Role='Glue_DefaultRole',
                      Command={'Name': 'glueetl',
                               'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'})

It is helpful to understand that Python creates a dictionary of the name/value tuples that you specify as arguments to an ETL script in a Job structure or JobRun structure. Boto 3 then passes them to AWS Glue in JSON format by way of a REST API call. This means that you cannot rely on the order of the arguments when you access them in your script.

For example, suppose that you're starting a JobRun in a Python Lambda handler function, and you want to specify several parameters. Your code might look something like the following:

import boto3
from datetime import datetime, timedelta

client = boto3.client('glue')

def lambda_handler(event, context):
    last_hour_date_time = datetime.now() - timedelta(hours=1)
    day_partition_value = last_hour_date_time.strftime("%Y-%m-%d")
    hour_partition_value = last_hour_date_time.strftime("%-H")

    response = client.start_job_run(
        JobName='my_test_Job',
        Arguments={
            '--day_partition_key': 'partition_0',
            '--hour_partition_key': 'partition_1',
            '--day_partition_value': day_partition_value,
            '--hour_partition_value': hour_partition_value
        }
    )

To access these parameters reliably in your ETL script, specify them by name using AWS Glue's getResolvedOptions function and then access them from the resulting dictionary:

import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['JOB_NAME',
                           'day_partition_key',
                           'hour_partition_key',
                           'day_partition_value',
                           'hour_partition_value'])
print("The day partition key is: ", args['day_partition_key'])
print("and the day partition value is: ", args['day_partition_value'])

If you want to pass an argument that is a nested JSON string, you must encode the parameter string before starting the job run, and then decode it in your job script before referencing it, so that the value is preserved as it is passed to your AWS Glue ETL job. For example, consider the following argument string:

glue_client.start_job_run(JobName="gluejobname",
                          Arguments={
                              "--my_curly_braces_string": '{"a": {"b": {"c": [{"d": {"e": 42}}]}}}'
                          })

To pass this parameter correctly, encode the argument as a Base64-encoded string.

import base64
...
# Encode the nested JSON string before starting the job run.
sample_string = '{"a": {"b": {"c": [{"d": {"e": 42}}]}}}'
sample_string_bytes = sample_string.encode("ascii")
base64_bytes = base64.b64encode(sample_string_bytes)
base64_string = base64_bytes.decode("ascii")
...
# Pass the encoded string (not the raw bytes) as the argument value.
glue_client.start_job_run(JobName="gluejobname",
                          Arguments={
                              "--my_curly_braces_string": base64_string
                          })
...
# In the job script, decode the string to recover the original value.
sample_string_bytes = base64.b64decode(base64_string)
sample_string = sample_string_bytes.decode("ascii")
print(f"Decoded string: {sample_string}")
...
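In the job script itself, you would typically retrieve the encoded value with getResolvedOptions before decoding it. The following is a minimal sketch, assuming the job was started with the --my_curly_braces_string argument shown above and that you want the decoded JSON back as a Python structure (json.loads is used here for that last step):

import sys
import base64
import json

from awsglue.utils import getResolvedOptions

# getResolvedOptions drops the leading '--' from the argument names.
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'my_curly_braces_string'])

# Decode the Base64 string back into the original JSON text ...
decoded_string = base64.b64decode(args['my_curly_braces_string']).decode("ascii")

# ... and parse it into a nested Python dictionary.
nested = json.loads(decoded_string)
print(nested["a"]["b"]["c"][0]["d"]["e"])  # prints 42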

Example: Create and run a job

The following example shows how to call the AWS Glue APIs using Python to create and run an ETL job.

To create and run a job
  1. Create an instance of the AWS Glue client:

    import boto3

    glue = boto3.client(service_name='glue', region_name='us-east-1',
                        endpoint_url='https://glue.us-east-1.amazonaws.com')
  2. Create a job. You must use glueetl as the name for the ETL command, as shown in the following code:

    myJob = glue.create_job(Name='sample', Role='Glue_DefaultRole',
                            Command={'Name': 'glueetl',
                                     'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'})
  3. Start a new run of the job that you created in the previous step:

    myNewJobRun = glue.start_job_run(JobName=myJob['Name'])
  4. Get the job status:

    status = glue.get_job_run(JobName=myJob['Name'], RunId=myNewJobRun['JobRunId'])
  5. Print the current state of the job run:

    print(status['JobRun']['JobRunState'])
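
If you want to wait for the run to finish rather than check its state once, you can poll get_job_run until the run reaches a terminal state. The following is an illustrative sketch rather than part of the original example; the 30-second polling interval and the set of terminal state names are assumptions:

import time

# Poll the job run until it reaches a terminal state (assumed set below).
terminal_states = {'SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT', 'ERROR'}
while True:
    status = glue.get_job_run(JobName=myJob['Name'],
                              RunId=myNewJobRun['JobRunId'])
    state = status['JobRun']['JobRunState']
    print(f"Job run state: {state}")
    if state in terminal_states:
        break
    time.sleep(30)  # polling interval is an assumption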