Running Spark SQL scripts through the StartJobRun API

Amazon EMR on EKS releases 6.7.0 and higher include a Spark SQL job driver, so you can run Spark SQL scripts through the StartJobRun API. You can supply SQL entry-point files to run Spark SQL queries directly on Amazon EMR on EKS with the StartJobRun API, without any modifications to your existing Spark SQL scripts.
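An entry-point file is a plain text file of Spark SQL statements that you upload to a location the job can read, such as Amazon S3, and reference as the entryPoint in the job driver. The following is a minimal sketch of such a script; the database, table, and bucket names are hypothetical:

-- Create a table over existing data in Amazon S3, then query it.
CREATE DATABASE IF NOT EXISTS demo;
CREATE TABLE IF NOT EXISTS demo.events (id INT, name STRING)
USING PARQUET LOCATION 's3://amzn-s3-demo-bucket/events/';
SELECT name, COUNT(*) AS event_count FROM demo.events GROUP BY name;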

The following table lists the Spark parameters that you can send to a Spark SQL job through the StartJobRun API. Use these parameters to override default Spark properties.

Option Description

--name NAME Application name.
--jars JARS Comma-separated list of jars to include on the driver and executor classpaths.
--packages Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths.
--exclude-packages Comma-separated list of groupId:artifactId pairs to exclude while resolving the dependencies provided in --packages, to avoid dependency conflicts.
--repositories Comma-separated list of additional remote repositories to search for the Maven coordinates given with --packages.
--files FILES Comma-separated list of files to be placed in the working directory of each executor.
--conf PROP=VALUE Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties.
--driver-memory MEM Memory for the driver. Default: 1024 MB.
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra classpath entries to pass to the driver.
--executor-memory MEM Memory per executor. Default: 1 GB.
--driver-cores NUM Number of cores used by the driver.
--total-executor-cores NUM Total cores for all executors.
--executor-cores NUM Number of cores used by each executor.
--num-executors NUM Number of executors to launch.
-hivevar <key=value> Variable substitution to apply to Hive commands, for example, -hivevar A=B (see the example after this table).
-hiveconf <property=value> Value to use for the given property.
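A value passed with -hivevar is available to the entry-point script through Hive-style variable substitution. As a hypothetical sketch, if sparkSqlParameters includes -hivevar tablename=demo.events, the script can reference the variable as follows:

-- ${tablename} is replaced with demo.events before the statement runs.
SELECT * FROM ${tablename} LIMIT 10;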

For a Spark SQL job, create a start-job-run-request.json file and specify the required parameters for your job run, as in the following example:

{ "name": "myjob", "virtualClusterId": "123456", "executionRoleArn": "iam_role_name_for_job_execution", "releaseLabel": "emr-6.7.0-latest", "jobDriver": { "sparkSqlJobDriver": { "entryPoint": "entryPoint_location", "sparkSqlParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1" } }, "configurationOverrides": { "applicationConfiguration": [ { "classification": "spark-defaults", "properties": { "spark.driver.memory":"2G" } } ], "monitoringConfiguration": { "persistentAppUI": "ENABLED", "cloudWatchMonitoringConfiguration": { "logGroupName": "my_log_group", "logStreamNamePrefix": "log_stream_prefix" }, "s3MonitoringConfiguration": { "logUri": "s3://my_s3_log_location" } } } }