Running Spark SQL scripts through the StartJobRun API

Amazon EMR on EKS releases 6.7.0 and higher include a Spark SQL job driver, so you can run Spark SQL scripts through the StartJobRun API. You can supply SQL entry-point files to run Spark SQL queries directly on Amazon EMR on EKS with the StartJobRun API, without any modifications to your existing Spark SQL scripts.
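An entry-point file is a plain text file of Spark SQL statements that you upload to a location the job can read, such as Amazon S3, and reference as the entryPoint in the job driver. The following is a minimal sketch of such a script; the database, table, and bucket names are hypothetical:

-- Create a table over existing data in Amazon S3, then query it.
CREATE DATABASE IF NOT EXISTS demo;
CREATE TABLE IF NOT EXISTS demo.events (id INT, name STRING)
USING PARQUET LOCATION 's3://amzn-s3-demo-bucket/events/';
SELECT name, COUNT(*) AS event_count FROM demo.events GROUP BY name;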

The following table lists the Spark parameters that you can send to a Spark SQL job through the StartJobRun API. Use these parameters to override default Spark properties.

Option Description

--name NAME Application name.
--jars JARS Comma-separated list of jars to include on the driver and executor classpaths.
--packages Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths.
--exclude-packages Comma-separated list of groupId:artifactId pairs to exclude while resolving the dependencies provided in --packages, to avoid dependency conflicts.
--repositories Comma-separated list of additional remote repositories to search for the Maven coordinates given with --packages.
--files FILES Comma-separated list of files to be placed in the working directory of each executor.
--conf PROP=VALUE Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties.
--driver-memory MEM Memory for the driver. Default: 1024 MB.
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra classpath entries to pass to the driver.
--executor-memory MEM Memory per executor. Default: 1 GB.
--driver-cores NUM Number of cores used by the driver.
--total-executor-cores NUM Total cores for all executors.
--executor-cores NUM Number of cores used by each executor.
--num-executors NUM Number of executors to launch.
-hivevar <key=value> Variable substitution to apply to Hive commands, for example, -hivevar A=B (see the example after this table).
-hiveconf <property=value> Value to use for the given property.
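A value passed with -hivevar is available to the entry-point script through Hive-style variable substitution. As a hypothetical sketch, if sparkSqlParameters includes -hivevar tablename=demo.events, the script can reference the variable as follows:

-- ${tablename} is replaced with demo.events before the statement runs.
SELECT * FROM ${tablename} LIMIT 10;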

For a Spark SQL job, create a start-job-run-request.json file and specify the required parameters for your job run, as in the following example:

{ "name": "myjob", "virtualClusterId": "123456", "executionRoleArn": "iam_role_name_for_job_execution", "releaseLabel": "emr-6.7.0-latest", "jobDriver": { "sparkSqlJobDriver": { "entryPoint": "entryPoint_location", "sparkSqlParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1" } }, "configurationOverrides": { "applicationConfiguration": [ { "classification": "spark-defaults", "properties": { "spark.driver.memory":"2G" } } ], "monitoringConfiguration": { "persistentAppUI": "ENABLED", "cloudWatchMonitoringConfiguration": { "logGroupName": "my_log_group", "logStreamNamePrefix": "log_stream_prefix" }, "s3MonitoringConfiguration": { "logUri": "s3://my_s3_log_location" } } } }