[ aws . emr ]

create-cluster

Description

Creates an Amazon EMR cluster with the specified configurations.

Quick start:

aws emr create-cluster --release-label <release-label> --instance-type <instance-type> --instance-count <instance-count>

Values for the following can be set in the AWS CLI config file using the "aws configure set" command: --service-role, --log-uri, and InstanceProfile and KeyName arguments under --ec2-attributes.
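As a rough sketch of how those defaults might be stored, the commands below save a service role, an instance profile, a log URI, and a key pair name in the CLI config file. The emr.* key names are an assumption about how the AWS CLI names these settings and should be verified against your CLI version. AWS_CMD is a hypothetical helper variable introduced only for this sketch: by default it prints each command instead of running it.

```shell
# Dry run by default: AWS_CMD is a hypothetical helper variable for this sketch.
# Run with AWS_CMD=aws to actually write to the AWS CLI config file.
AWS_CMD="${AWS_CMD:-echo aws}"

# Placeholder values; replace them with your own roles, bucket, and key pair.
SERVICE_ROLE="EMR_DefaultRole"
INSTANCE_PROFILE="EMR_EC2_DefaultRole"
LOG_URI="s3://myBucket/myLog"
KEY_NAME="myKey"

# Assumed key names under the "emr" section of the config file.
$AWS_CMD configure set emr.service_role "$SERVICE_ROLE"
$AWS_CMD configure set emr.instance_profile "$INSTANCE_PROFILE"
$AWS_CMD configure set emr.log_uri "$LOG_URI"
$AWS_CMD configure set emr.key_name "$KEY_NAME"
```

With these values saved, a create-cluster call can leave out --service-role, --log-uri, and the InstanceProfile and KeyName arguments.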

See 'aws help' for descriptions of global parameters.

Synopsis

 create-cluster
--release-label <value>   | --ami-version <value>
--instance-fleets <value> | --instance-groups <value> | --instance-type <value> --instance-count <value>
[--auto-terminate | --no-auto-terminate]
[--use-default-roles]
[--service-role <value>]
[--auto-scaling-role <value>]
[--configurations <value>]
[--name <value>]
[--log-uri <value>]
[--additional-info <value>]
[--ec2-attributes <value>]
[--termination-protected | --no-termination-protected]
[--scale-down-behavior <value>]
[--visible-to-all-users | --no-visible-to-all-users]
[--enable-debugging | --no-enable-debugging]
[--tags <value>]
[--applications <value>]
[--emrfs <value>]
[--bootstrap-actions <value>]
[--steps <value>]
[--restore-from-hbase-backup <value>]
[--security-configuration <value>]
[--custom-ami-id <value>]
[--ebs-root-volume-size <value>]
[--repo-upgrade-on-boot <value>]
[--kerberos-attributes <value>]

Options

--release-label (string)

Specifies the Amazon EMR release version, which determines the versions of application software that are installed on the cluster. For example, --release-label emr-5.15.0 installs the application versions and features available in that version. For details about application versions and features available in each release, see the Amazon EMR Release Guide:

https://docs.aws.amazon.com/emr/ReleaseGuide

Use --release-label only for Amazon EMR release version 4.0 and later. Use --ami-version for earlier versions. You cannot specify both a release label and AMI version.

--ami-version (string)

Applies only to Amazon EMR release versions earlier than 4.0. Use --release-label for 4.0 and later. Specifies the version of Amazon Linux Amazon Machine Image (AMI) to use when launching Amazon EC2 instances in the cluster. For example, --ami-version 3.1.0.

--instance-groups (list)

Specifies the number and type of Amazon EC2 instances to create for each node type in a cluster, using uniform instance groups. You can specify either --instance-groups or --instance-fleets but not both. For more information, see the following topic in the EMR Management Guide:

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-group-configuration.html

You can specify arguments individually using multiple InstanceGroupType argument blocks: one for the MASTER instance group, one for a CORE instance group, and, optionally, one or more TASK instance groups.

If you specify inline JSON structures, enclose the entire InstanceGroupType argument block in single quotation marks.

Each InstanceGroupType block takes the following inline arguments. Optional arguments are shown in [square brackets].

  • [Name] - An optional friendly name for the instance group.
  • InstanceGroupType - MASTER, CORE, or TASK.
  • InstanceType - The type of EC2 instance, for example m4.large, to use for all nodes in the instance group.
  • InstanceCount - The number of EC2 instances to provision in the instance group.
  • [BidPrice] - If specified, indicates that the instance group uses Spot Instances. This is the maximum price you are willing to pay for Spot Instances. Specify OnDemandPrice to set the amount equal to the On-Demand price, or specify an amount in USD.
  • [EbsConfiguration] - Specifies additional Amazon EBS storage volumes attached to EC2 instances using an inline JSON structure.
  • [AutoScalingPolicy] - Specifies an automatic scaling policy for the instance group using an inline JSON structure.

JSON Syntax:

[
  {
    "InstanceCount": integer,
    "Name": "string",
    "InstanceGroupType": "MASTER"|"CORE"|"TASK",
    "AutoScalingPolicy": {
      "Rules": [
        {
          "Action": {
            "SimpleScalingPolicyConfiguration": {
              "ScalingAdjustment": integer,
              "CoolDown": integer,
              "AdjustmentType": "CHANGE_IN_CAPACITY"|"PERCENT_CHANGE_IN_CAPACITY"|"EXACT_CAPACITY"
            },
            "Market": "ON_DEMAND"|"SPOT"
          },
          "Trigger": {
            "CloudWatchAlarmDefinition": {
              "EvaluationPeriods": integer,
              "Dimensions": [
                {
                  "Key": "string",
                  "Value": "string"
                }
                ...
              ],
              "Namespace": "string",
              "Period": integer,
              "ComparisonOperator": "string",
              "Statistic": "string",
              "Threshold": double,
              "Unit": "string",
              "MetricName": "string"
            }
          },
          "Name": "string",
          "Description": "string"
        }
        ...
      ],
      "Constraints": {
        "MinCapacity": integer,
        "MaxCapacity": integer
      }
    },
    "EbsConfiguration": {
      "EbsOptimized": true|false,
      "EbsBlockDeviceConfigs": [
        {
          "VolumeSpecification": {
            "Iops": integer,
            "VolumeType": "string",
            "SizeInGB": integer
          },
          "VolumesPerInstance": integer
        }
        ...
      ]
    },
    "BidPrice": "string",
    "InstanceType": "string"
  }
  ...
]
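The syntax above can be easier to manage in a file than inline. The following is a minimal sketch, not a definitive configuration: it writes a hypothetical instance-groups.json with MASTER, CORE, and TASK groups, attaches EBS volumes to the core group, and sets a Spot bid price on the task group. All values are placeholders. AWS_CMD is a hypothetical helper variable introduced for this sketch so that the command is only printed unless AWS_CMD=aws is set.

```shell
# Dry run by default: AWS_CMD is a hypothetical helper variable for this sketch.
# Run with AWS_CMD=aws to actually create the cluster (this incurs charges).
AWS_CMD="${AWS_CMD:-echo aws}"

# instance-groups.json is a hypothetical file name; all values are placeholders.
cat > instance-groups.json <<'EOF'
[
  {
    "Name": "Master",
    "InstanceGroupType": "MASTER",
    "InstanceType": "m4.large",
    "InstanceCount": 1
  },
  {
    "Name": "Core",
    "InstanceGroupType": "CORE",
    "InstanceType": "m4.large",
    "InstanceCount": 2,
    "EbsConfiguration": {
      "EbsOptimized": true,
      "EbsBlockDeviceConfigs": [
        {
          "VolumeSpecification": { "VolumeType": "gp2", "SizeInGB": 100 },
          "VolumesPerInstance": 2
        }
      ]
    }
  },
  {
    "Name": "Task",
    "InstanceGroupType": "TASK",
    "InstanceType": "m4.large",
    "InstanceCount": 2,
    "BidPrice": "0.10"
  }
]
EOF

# Reference the file with the file:// prefix instead of inline JSON.
$AWS_CMD emr create-cluster \
  --release-label emr-5.14.0 \
  --use-default-roles \
  --instance-groups file://instance-groups.json
```

Keeping the JSON in a file avoids the shell quoting that inline structures require.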

--instance-type (string)

Shortcut parameter as an alternative to --instance-groups. Specifies the type of Amazon EC2 instance to use in a cluster. If used without the --instance-count parameter, the cluster consists of a single master node running on the EC2 instance type specified. When used together with --instance-count, one instance is used for the master node, and the remainder are used for the core node type.

--instance-count (string)

Shortcut parameter as an alternative to --instance-groups when used together with --instance-type. Specifies the number of Amazon EC2 instances to create for a cluster. One instance is used for the master node, and the remainder are used for the core node type.

--auto-terminate | --no-auto-terminate (boolean)

Specifies whether the cluster should terminate after completing all the steps. Auto termination is off by default.

--instance-fleets (list)

Applies only to Amazon EMR release version 5.0 and later. Specifies the number and type of Amazon EC2 instances to create for each node type in a cluster, using instance fleets. You can specify either --instance-fleets or --instance-groups but not both. For more information and examples, see the following topic in the Amazon EMR Management Guide:

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html

You can specify arguments individually using multiple InstanceFleetType argument blocks, one for the MASTER instance fleet, one for a CORE instance fleet, and an optional TASK instance fleet.

The following arguments can be specified for each instance fleet. Optional arguments are shown in [square brackets].

  • [Name] - An optional friendly name for the instance fleet.
  • InstanceFleetType - MASTER , CORE , or TASK .
  • TargetOnDemandCapacity - The target capacity of On-Demand units for the instance fleet, which determines how many On-Demand Instances to provision. The WeightedCapacity specified for an instance type within InstanceTypeConfigs counts toward this total when an instance type with the On-Demand purchasing option launches.
  • TargetSpotCapacity - The target capacity of Spot units for the instance fleet, which determines how many Spot Instances to provision. The WeightedCapacity specified for an instance type within InstanceTypeConfigs counts toward this total when an instance type with the Spot purchasing option launches.
  • [LaunchSpecifications] - When TargetSpotCapacity is specified, specifies the block duration and timeout action for Spot Instances.
  • InstanceTypeConfigs - Specifies up to five EC2 instance types to use in the instance fleet, including details such as Spot price and Amazon EBS configuration.

JSON Syntax:

[
  {
    "Name": "string",
    "InstanceFleetType": "MASTER"|"CORE"|"TASK",
    "LaunchSpecifications": {
      "SpotSpecification": {
        "TimeoutDurationMinutes": integer,
        "BlockDurationMinutes": integer,
        "TimeoutAction": "TERMINATE_CLUSTER"|"SWITCH_TO_ON_DEMAND"
      }
    },
    "TargetSpotCapacity": integer,
    "InstanceTypeConfigs": [
      {
        "WeightedCapacity": integer,
        "EbsConfiguration": {
          "EbsOptimized": true|false,
          "EbsBlockDeviceConfigs": [
            {
              "VolumeSpecification": {
                "Iops": integer,
                "VolumeType": "string",
                "SizeInGB": integer
              },
              "VolumesPerInstance": integer
            }
            ...
          ]
        },
        "BidPrice": "string",
        "BidPriceAsPercentageOfOnDemandPrice": double,
        "InstanceType": "string",
        "Configurations": "string"
      }
      ...
    ],
    "TargetOnDemandCapacity": integer
  }
  ...
]
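As an illustration of the syntax above, the following sketch expresses a two-fleet configuration as a JSON file. The values mirror the instance-fleets command in the Examples section; the file name instance-fleets.json and the fleet names are hypothetical. AWS_CMD is a hypothetical helper variable introduced for this sketch so that the command is only printed unless AWS_CMD=aws is set.

```shell
# Dry run by default: AWS_CMD is a hypothetical helper variable for this sketch.
# Run with AWS_CMD=aws to actually create the cluster (this incurs charges).
AWS_CMD="${AWS_CMD:-echo aws}"

# instance-fleets.json is a hypothetical file name; all values are placeholders.
cat > instance-fleets.json <<'EOF'
[
  {
    "Name": "MasterFleet",
    "InstanceFleetType": "MASTER",
    "TargetOnDemandCapacity": 1,
    "InstanceTypeConfigs": [
      { "InstanceType": "m4.large" }
    ]
  },
  {
    "Name": "CoreFleet",
    "InstanceFleetType": "CORE",
    "TargetSpotCapacity": 11,
    "LaunchSpecifications": {
      "SpotSpecification": {
        "TimeoutDurationMinutes": 120,
        "TimeoutAction": "SWITCH_TO_ON_DEMAND"
      }
    },
    "InstanceTypeConfigs": [
      { "InstanceType": "m4.large", "BidPrice": "0.5", "WeightedCapacity": 3 },
      { "InstanceType": "m4.2xlarge", "BidPrice": "0.9", "WeightedCapacity": 5 }
    ]
  }
]
EOF

# The subnet IDs are placeholders; quote the list so the shell passes it through unchanged.
$AWS_CMD emr create-cluster \
  --release-label emr-5.14.0 \
  --use-default-roles \
  --ec2-attributes "SubnetIds=[subnet-ab12345c,subnet-de67890f]" \
  --instance-fleets file://instance-fleets.json
```

In this sketch the core fleet requests 11 Spot units; each m4.large that launches counts for 3 units and each m4.2xlarge for 5.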

--name (string)

The name of the cluster. If not provided, the default is "Development Cluster".

--log-uri (string)

Specifies the location in Amazon S3 to which log files are periodically written. If a value is not provided, log files are not written to Amazon S3 from the master node and are lost if the master node terminates.

--service-role (string)

Specifies an IAM service role, which Amazon EMR requires to call other AWS services on your behalf during cluster operation. This parameter is usually specified when a customized service role is used. To specify the default service role, as well as the default instance profile, use the --use-default-roles parameter. If the role and instance profile do not already exist, use the aws emr create-default-roles command to create them.

--auto-scaling-role (string)

Specify --auto-scaling-role EMR_AutoScaling_DefaultRole if an automatic scaling policy is specified for an instance group using the --instance-groups parameter. This default IAM role allows the automatic scaling feature to launch and terminate Amazon EC2 instances during scaling operations.
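The following sketch shows one way --auto-scaling-role might be combined with an AutoScalingPolicy in --instance-groups. The file name, rule name, thresholds, and the CloudWatch metric and dimension values are illustrative assumptions and should be checked against the Amazon EMR Management Guide before use. AWS_CMD is a hypothetical helper variable introduced for this sketch so that the command is only printed unless AWS_CMD=aws is set.

```shell
# Dry run by default: AWS_CMD is a hypothetical helper variable for this sketch.
# Run with AWS_CMD=aws to actually create the cluster (this incurs charges).
AWS_CMD="${AWS_CMD:-echo aws}"

# autoscaling-groups.json is a hypothetical file name; all values are placeholders.
# The heredoc delimiter is quoted so that ${emr.clusterId} is written literally.
cat > autoscaling-groups.json <<'EOF'
[
  {
    "InstanceGroupType": "MASTER",
    "InstanceType": "m4.large",
    "InstanceCount": 1
  },
  {
    "InstanceGroupType": "CORE",
    "InstanceType": "m4.large",
    "InstanceCount": 2,
    "AutoScalingPolicy": {
      "Constraints": { "MinCapacity": 2, "MaxCapacity": 10 },
      "Rules": [
        {
          "Name": "ScaleOutOnLowMemory",
          "Description": "Add one node when available YARN memory is low",
          "Action": {
            "SimpleScalingPolicyConfiguration": {
              "AdjustmentType": "CHANGE_IN_CAPACITY",
              "ScalingAdjustment": 1,
              "CoolDown": 300
            }
          },
          "Trigger": {
            "CloudWatchAlarmDefinition": {
              "ComparisonOperator": "LESS_THAN",
              "EvaluationPeriods": 1,
              "MetricName": "YARNMemoryAvailablePercentage",
              "Namespace": "AWS/ElasticMapReduce",
              "Period": 300,
              "Statistic": "AVERAGE",
              "Threshold": 15,
              "Unit": "PERCENT",
              "Dimensions": [
                { "Key": "JobFlowId", "Value": "${emr.clusterId}" }
              ]
            }
          }
        }
      ]
    }
  }
]
EOF

$AWS_CMD emr create-cluster \
  --release-label emr-5.14.0 \
  --use-default-roles \
  --auto-scaling-role EMR_AutoScaling_DefaultRole \
  --instance-groups file://autoscaling-groups.json
```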

--use-default-roles (boolean)

Specifies that the cluster should use the default service role (EMR_DefaultRole) and instance profile (EMR_EC2_DefaultRole) for permissions to access other AWS services.

Make sure that the role and instance profile exist first. To create them, use the create-default-roles command.

--configurations (string)

Specifies a JSON file that contains configuration classifications, which you can use to customize applications that Amazon EMR installs when cluster instances launch. Applies only to Amazon EMR 4.0 and later. The file referenced can either be stored locally (for example, --configurations file://configurations.json ) or stored in Amazon S3 (for example, --configurations https://s3.amazonaws.com/myBucket/configurations.json ). Each classification usually corresponds to the xml configuration file for an application, such as yarn-site for YARN. For a list of available configuration classifications and example JSON, see the following topic in the Amazon EMR Release Guide:

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html

--ec2-attributes (structure)

Configures cluster and Amazon EC2 instance configurations. Accepts the following arguments:

  • KeyName - Specifies the name of the Amazon EC2 key pair that will be used for SSH connections to the master node and other instances on the cluster.
  • AvailabilityZone - Specifies the availability zone in which to launch the cluster. For example, us-west-1b .
  • SubnetId - Specifies the VPC subnet in which to create the cluster.
  • InstanceProfile - An IAM role that allows EC2 instances to access other AWS services, such as Amazon S3, that are required for operations.
  • EmrManagedMasterSecurityGroup - The security group ID of the Amazon EC2 security group for the master node.
  • EmrManagedSlaveSecurityGroup - The security group ID of the Amazon EC2 security group for the slave nodes.
  • ServiceAccessSecurityGroup - The security group ID of the Amazon EC2 security group for Amazon EMR access to clusters in VPC private subnets.
  • AdditionalMasterSecurityGroups - A list of additional Amazon EC2 security group IDs for the master node.
  • AdditionalSlaveSecurityGroups - A list of additional Amazon EC2 security group IDs for the slave nodes.

Shorthand Syntax:

ServiceAccessSecurityGroup=string,AvailabilityZone=string,AdditionalSlaveSecurityGroups=string,string,EmrManagedMasterSecurityGroup=string,SubnetIds=string,string,KeyName=string,InstanceProfile=string,SubnetId=string,AdditionalMasterSecurityGroups=string,string,AvailabilityZones=string,string,EmrManagedSlaveSecurityGroup=string

JSON Syntax:

{
  "ServiceAccessSecurityGroup": "string",
  "AvailabilityZone": "string",
  "AdditionalSlaveSecurityGroups": ["string", ...],
  "EmrManagedMasterSecurityGroup": "string",
  "SubnetIds": ["string", ...],
  "KeyName": "string",
  "InstanceProfile": "string",
  "SubnetId": "string",
  "AdditionalMasterSecurityGroups": ["string", ...],
  "AvailabilityZones": ["string", ...],
  "EmrManagedSlaveSecurityGroup": "string"
}

--termination-protected | --no-termination-protected (boolean)

Specifies whether to lock the cluster to prevent the Amazon EC2 instances from being terminated by API call, user intervention, or an error.

--scale-down-behavior (string)

Specifies the way that individual Amazon EC2 instances terminate when an automatic scale-in activity occurs or an instance group is resized.

Accepted values:

  • TERMINATE_AT_TASK_COMPLETION - Specifies that Amazon EMR blacklists and drains tasks from nodes before terminating the instance.
  • TERMINATE_AT_INSTANCE_HOUR - Specifies that Amazon EMR terminate EC2 instances at the instance-hour boundary, regardless of when the request to terminate was submitted.
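A minimal sketch of passing one of the accepted values follows. The release label, instance type, and instance count are placeholders, and AWS_CMD is a hypothetical helper variable introduced for this sketch so that the command is only printed unless AWS_CMD=aws is set.

```shell
# Dry run by default: AWS_CMD is a hypothetical helper variable for this sketch.
# Run with AWS_CMD=aws to actually create the cluster (this incurs charges).
AWS_CMD="${AWS_CMD:-echo aws}"

# One of the two accepted values listed above.
SCALE_DOWN_BEHAVIOR="TERMINATE_AT_TASK_COMPLETION"

$AWS_CMD emr create-cluster \
  --release-label emr-5.14.0 \
  --use-default-roles \
  --instance-type m4.large --instance-count 3 \
  --scale-down-behavior "$SCALE_DOWN_BEHAVIOR"
```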

--visible-to-all-users | --no-visible-to-all-users (boolean)

Specifies whether the cluster is visible to all IAM users of the AWS account associated with the cluster. If set to --visible-to-all-users , all IAM users of that AWS account can view it. If they have the proper policy permissions set, they can also manage the cluster. If it is set to --no-visible-to-all-users , only the IAM user that created the cluster can view and manage it. Clusters are visible by default.

--enable-debugging | --no-enable-debugging (boolean)

Specifies that the debugging tool is enabled for the cluster, which allows you to browse log files using the Amazon EMR console. Turning debugging on requires that you specify --log-uri because log files must be stored in Amazon S3 so that Amazon EMR can index them for viewing in the console.

--tags (list)

A list of tags to associate with a cluster, which apply to each Amazon EC2 instance in the cluster. Tags are key-value pairs that consist of a required key string with a maximum of 128 characters, and an optional value string with a maximum of 256 characters.

You can specify tags in key=value format or you can add a tag without a value using only the key name, for example key . Use a space to separate multiple tags.

Syntax:

"string" "string" ...

--bootstrap-actions (list)

Specifies a list of bootstrap actions to run on each EC2 instance when a cluster is created. Bootstrap actions run on each instance immediately after Amazon EMR provisions the EC2 instance and before Amazon EMR installs specified applications.

You can specify a bootstrap action as an inline JSON structure enclosed in single quotation marks, or you can use a shorthand syntax, specifying multiple bootstrap actions, each separated by a space. When using the shorthand syntax, each bootstrap action takes the following parameters, separated by commas with no trailing space. Optional parameters are shown in [square brackets].

  • Path - The path and file name of the script to run, which must be accessible to each instance in the cluster. For example, Path=s3://mybucket/myscript.sh .
  • [Name] - A friendly name to help you identify the bootstrap action. For example, Name=BootstrapAction1
  • [Args] - A comma-separated list of arguments to pass to the bootstrap action script. Arguments can be either a list of values (Args=arg1,arg2,arg3) or a list of key-value pairs, as well as optional values, enclosed in square brackets (Args=[arg1,arg2=arg2value,arg3]).

Shorthand Syntax:

Path=string,Args=string,string,Name=string ...

JSON Syntax:

[
  {
    "Path": "string",
    "Args": ["string", ...],
    "Name": "string"
  }
  ...
]
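The shorthand syntax above might be used as in the following sketch, which passes two bootstrap actions separated by a space. The script paths, names, and arguments are hypothetical placeholders. AWS_CMD is a hypothetical helper variable introduced for this sketch so that the command is only printed unless AWS_CMD=aws is set.

```shell
# Dry run by default: AWS_CMD is a hypothetical helper variable for this sketch.
# Run with AWS_CMD=aws to actually create the cluster (this incurs charges).
AWS_CMD="${AWS_CMD:-echo aws}"

# Each bootstrap action is one shell word: parameters are separated by commas
# with no spaces. Single quotes keep the brackets away from the shell.
BOOTSTRAP_1='Path=s3://mybucket/myscript.sh,Name=BootstrapAction1,Args=[arg1,arg2=arg2value,arg3]'
BOOTSTRAP_2='Path=s3://mybucket/otherscript.sh,Name=BootstrapAction2'

$AWS_CMD emr create-cluster \
  --release-label emr-5.14.0 \
  --use-default-roles \
  --instance-type m4.large --instance-count 3 \
  --bootstrap-actions "$BOOTSTRAP_1" "$BOOTSTRAP_2"
```

Quoting each action separately is what lets the CLI see two list items rather than one.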

--applications (list)

Specifies the applications to install on the cluster. Available applications and their respective versions vary by Amazon EMR release. For more information, see the Amazon EMR Release Guide:

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/

When using versions of Amazon EMR earlier than 4.0, some applications take optional arguments for configuration. Arguments should either be a comma-separated list of values (Args=arg1,arg2,arg3 ) or a bracket-enclosed list of values and key-value pairs (Args=[arg1,arg2=arg3,arg4] ).

Shorthand Syntax:

Args=string,string,Name=string ...

JSON Syntax:

[
  {
    "Args": ["string", ...],
    "Name": "MapR"|"HUE"|"HIVE"|"PIG"|"HBASE"|"IMPALA"|"GANGLIA"|"HADOOP"|"SPARK"
  }
  ...
]

--emrfs (structure)

Specifies EMRFS configuration options, such as consistent view and Amazon S3 encryption parameters.

When you use Amazon EMR release version 4.8.0 or later, we recommend that you use the --configurations option together with the emrfs-site configuration classification to configure EMRFS, and use security configurations to configure encryption for EMRFS data in Amazon S3 instead. For more information, see the following topic in the Amazon EMR Management Guide:

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-configure-consistent-view.html

Shorthand Syntax:

Args=string,string,Encryption=string,Consistent=boolean,ProviderType=string,KMSKeyId=string,CustomProviderLocation=string,SSE=boolean,RetryCount=integer,RetryPeriod=integer,CustomProviderClass=string

JSON Syntax:

{
  "Args": ["string", ...],
  "Encryption": "SERVERSIDE"|"CLIENTSIDE",
  "Consistent": true|false,
  "ProviderType": "KMS"|"CUSTOM",
  "KMSKeyId": "string",
  "CustomProviderLocation": "string",
  "SSE": true|false,
  "RetryCount": integer,
  "RetryPeriod": integer,
  "CustomProviderClass": "string"
}
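As a sketch of the shorthand syntax above, the following enables consistent view with a retry count and retry period. The values are placeholders. For release 4.8.0 and later, the recommendation above to use --configurations with the emrfs-site classification still applies; this sketch only illustrates the --emrfs shorthand. AWS_CMD is a hypothetical helper variable introduced for this sketch so that the command is only printed unless AWS_CMD=aws is set.

```shell
# Dry run by default: AWS_CMD is a hypothetical helper variable for this sketch.
# Run with AWS_CMD=aws to actually create the cluster (this incurs charges).
AWS_CMD="${AWS_CMD:-echo aws}"

# Keys come from the shorthand syntax above; the values are placeholders.
EMRFS_OPTIONS='Consistent=true,RetryCount=5,RetryPeriod=30'

$AWS_CMD emr create-cluster \
  --release-label emr-5.14.0 \
  --use-default-roles \
  --instance-type m4.large --instance-count 3 \
  --emrfs "$EMRFS_OPTIONS"
```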

--steps (list)

Specifies a list of steps to be executed by the cluster. Steps run only on the master node after applications are installed and are used to submit work to a cluster. A step can be specified using the shorthand syntax, by referencing a JSON file, or by specifying an inline JSON structure. Args supplied with steps should be a comma-separated list of values (Args=arg1,arg2,arg3) or a bracket-enclosed list of values and key-value pairs (Args=[arg1,arg2=value,arg4]).

Shorthand Syntax:

Name=string,Args=string,string,Jar=string,ActionOnFailure=string,MainClass=string,Type=string,Properties=string ...

JSON Syntax:

[
  {
    "Name": "string",
    "Args": ["string", ...],
    "Jar": "string",
    "ActionOnFailure": "TERMINATE_CLUSTER"|"CANCEL_AND_WAIT"|"CONTINUE",
    "MainClass": "string",
    "Type": "CUSTOM_JAR"|"STREAMING"|"HIVE"|"PIG"|"IMPALA",
    "Properties": "string"
  }
  ...
]
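The following sketch submits a single custom JAR step using the shorthand syntax above and terminates the cluster when the step finishes. The JAR location, main class, step name, and arguments are hypothetical placeholders. AWS_CMD is a hypothetical helper variable introduced for this sketch so that the command is only printed unless AWS_CMD=aws is set.

```shell
# Dry run by default: AWS_CMD is a hypothetical helper variable for this sketch.
# Run with AWS_CMD=aws to actually create the cluster (this incurs charges).
AWS_CMD="${AWS_CMD:-echo aws}"

# One step is one shell word; the bracketed Args list keeps the step's
# arguments visually separate from the other comma-separated parameters.
STEP='Type=CUSTOM_JAR,Name=MyJarStep,ActionOnFailure=CONTINUE,Jar=s3://myBucket/mytest.jar,MainClass=com.example.Main,Args=[arg1,arg2,arg3]'

$AWS_CMD emr create-cluster \
  --release-label emr-5.14.0 \
  --use-default-roles \
  --instance-type m4.large --instance-count 3 \
  --steps "$STEP" \
  --auto-terminate
```

To run several steps, pass additional quoted step strings after --steps, separated by spaces.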

--additional-info (string)

Specifies additional information during cluster creation.

--restore-from-hbase-backup (structure)

Applies only when using Amazon EMR release versions earlier than 4.0. Launches a new HBase cluster and populates it with data from a previous backup of an HBase cluster. HBase must be installed using the --applications option.

Shorthand Syntax:

BackupVersion=string,Dir=string

JSON Syntax:

{
  "BackupVersion": "string",
  "Dir": "string"
}

--security-configuration (string)

Specifies the name of a security configuration to use for the cluster. A security configuration defines data encryption settings and other security options. For more information, see the following topic in the Amazon EMR Management Guide:

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-encryption-enable-security-configuration.html

Use list-security-configurations to get a list of available security configurations in the active account.

--custom-ami-id (string)

Applies only to Amazon EMR release version 5.7.0 and later. Specifies the AMI ID of a custom AMI to use when Amazon EMR provisions EC2 instances. A custom AMI can be used to encrypt the Amazon EBS root volume. It can also be used instead of bootstrap actions to customize cluster node configurations. For more information, see the following topic in the Amazon EMR Management Guide:

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-custom-ami.html

--ebs-root-volume-size (string)

Applies only to Amazon EMR release version 4.0 and later. Specifies the size, in GiB, of the EBS root device volume of the Amazon Linux AMI that is used for each EC2 instance in the cluster.

--repo-upgrade-on-boot (string)

Applies only when a --custom-ami-id is specified. On first boot, by default, Amazon Linux AMIs connect to package repositories to install security updates before other services start. You can set this parameter using --repo-upgrade-on-boot NONE to disable these updates. CAUTION: This creates additional security risks.

--kerberos-attributes (structure)

Specifies required cluster attributes for Kerberos when Kerberos authentication is enabled in the specified --security-configuration. Takes the following arguments:

  • Realm - Specifies the name of the Kerberos realm to which all nodes in a cluster belong. For example, Realm=EC2.INTERNAL.
  • KdcAdminPassword - Specifies the password used within the cluster for the kadmin service, which maintains Kerberos principals, password policies, and keytabs for the cluster.
  • CrossRealmTrustPrincipalPassword - Required when establishing a cross-realm trust with a KDC in a different realm. This is the cross-realm principal password, which must be identical across realms.
  • ADDomainJoinUser - Required when establishing trust with an Active Directory domain. This is the user logon name of an AD account with sufficient privileges to join resources to the domain.
  • ADDomainJoinPassword - The AD password for ADDomainJoinUser.

Shorthand Syntax:

Realm=string,KdcAdminPassword=string,ADDomainJoinPassword=string,CrossRealmTrustPrincipalPassword=string,ADDomainJoinUser=string

JSON Syntax:

{
  "Realm": "string",
  "KdcAdminPassword": "string",
  "ADDomainJoinPassword": "string",
  "CrossRealmTrustPrincipalPassword": "string",
  "ADDomainJoinUser": "string"
}
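A minimal sketch of the shorthand syntax above follows. It assumes a security configuration named mySecurityConfiguration (a hypothetical name) that already enables Kerberos, and it reads the KDC admin password from an environment variable rather than hard-coding it. AWS_CMD and KDC_ADMIN_PASSWORD are hypothetical variable names introduced for this sketch; the command is only printed unless AWS_CMD=aws is set.

```shell
# Dry run by default: AWS_CMD is a hypothetical helper variable for this sketch.
# Run with AWS_CMD=aws to actually create the cluster (this incurs charges).
AWS_CMD="${AWS_CMD:-echo aws}"

# Take the password from the environment; the fallback is a placeholder
# that only exists so the dry run has something to print.
KDC_ADMIN_PASSWORD="${KDC_ADMIN_PASSWORD:-example-password}"

KERBEROS_ATTRS="Realm=EC2.INTERNAL,KdcAdminPassword=${KDC_ADMIN_PASSWORD}"

$AWS_CMD emr create-cluster \
  --release-label emr-5.14.0 \
  --use-default-roles \
  --instance-type m4.large --instance-count 3 \
  --security-configuration mySecurityConfiguration \
  --kerberos-attributes "$KERBEROS_ATTRS"
```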

See 'aws help' for descriptions of global parameters.

Examples

Most of these examples assume that you specified your Amazon EMR service role and Amazon EC2 instance profile. If you have not done this, you must specify each required IAM role or use the --use-default-roles parameter when creating your cluster. For more information about specifying IAM roles, see the following topic in the Amazon EMR Management Guide:

http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-roles-launch-jobflow.html

Quick start: create a cluster

Command:

aws emr create-cluster --release-label emr-5.14.0 --instance-type m4.large --instance-count 2

Create an Amazon EMR cluster with default ServiceRole and InstanceProfile roles

Create an Amazon EMR cluster that uses the --instance-groups configuration.

Command:

aws emr create-cluster --release-label emr-5.14.0 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large

Create an Amazon EMR cluster that uses the --instance-fleets configuration, specifying two instance types for each fleet and two EC2 Subnets.

Command:

aws emr create-cluster --release-label emr-5.14.0 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetIds=['subnet-ab12345c','subnet-de67890f'] --instance-fleets InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m4.large}'] InstanceFleetType=CORE,TargetSpotCapacity=11,InstanceTypeConfigs=['{InstanceType=m4.large,BidPrice=0.5,WeightedCapacity=3}','{InstanceType=m4.2xlarge,BidPrice=0.9,WeightedCapacity=5}'],LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=120,TimeoutAction=SWITCH_TO_ON_DEMAND}'}

Create a cluster with default roles

The following example uses the --use-default-roles parameter to specify the default service role and instance profile.

Command:

aws emr create-cluster --release-label emr-5.9.0 --use-default-roles --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate

Create a cluster and specify the applications to install

Use the --applications parameter to specify the applications that Amazon EMR installs. The following example installs Hadoop, Hive and Pig.

Command:

aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Pig --release-label emr-5.9.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate

The following example installs Spark.

Command:

aws emr create-cluster --release-label emr-5.9.0 --applications Name=Spark --ec2-attributes KeyName=myKey --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate

Specify a custom AMI to use for cluster instances

The following example creates a cluster whose instances are based on the custom Amazon Linux AMI with ID ami-a518e6df.

Command:

aws emr create-cluster --name "Cluster with My Custom AMI" --custom-ami-id ami-a518e6df --ebs-root-volume-size 20 --release-label emr-5.9.0 --use-default-roles --instance-count 2 --instance-type m4.large

Customize application configurations

The following examples use the --configurations parameter to specify a JSON configuration file that contains application customizations for Hadoop. For more information, see http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html.

The following example specifies configurations.json as a local file.

Command:

aws emr create-cluster --configurations file://configurations.json --release-label emr-5.9.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate

The following example specifies configurations.json as a file in Amazon S3.

Command:

aws emr create-cluster --configurations https://s3.amazonaws.com/myBucket/configurations.json --release-label emr-5.9.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate

The following demonstrates example contents of configurations.json.

[
 {
   "Classification": "mapred-site",
   "Properties": {
       "mapred.tasktracker.map.tasks.maximum": 2
   }
 },
 {
   "Classification": "hadoop-env",
   "Properties": {},
   "Configurations": [
       {
         "Classification": "export",
         "Properties": {
             "HADOOP_DATANODE_HEAPSIZE": 2048,
             "HADOOP_NAMENODE_OPTS": "-XX:GCTimeRatio=19"
         }
       }
   ]
 }
]

Create a cluster with master, core, and task instance groups

The following example creates a cluster, using --instance-groups to specify the type and number of EC2 instances to use for master, core, and task instance groups.

Command:

aws emr create-cluster --release-label emr-5.9.0 --instance-groups Name=Master,InstanceGroupType=MASTER,InstanceType=m4.large,InstanceCount=1 Name=Core,InstanceGroupType=CORE,InstanceType=m4.large,InstanceCount=2 Name=Task,InstanceGroupType=TASK,InstanceType=m4.large,InstanceCount=2

Specify that a cluster should terminate after completing all steps

The following example uses --auto-terminate to specify that the cluster should shut down automatically after completing all steps.

Command:

aws emr create-cluster --release-label emr-5.9.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate

Specify cluster configuration details such as the Amazon EC2 key pair, network configuration, and security groups

The following example creates a cluster with the Amazon EC2 key pair named myKey and a customized instance profile named myProfile. Key pairs are used to authorize SSH connections to cluster nodes, most often the master node. For more information, see http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-access-ssh.html.

Command:

aws emr create-cluster --ec2-attributes KeyName=myKey,InstanceProfile=myProfile --release-label emr-5.9.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate

The following example creates a cluster in an Amazon VPC subnet.

Command:

aws emr create-cluster --ec2-attributes SubnetId=subnet-xxxxx --release-label emr-5.9.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate

The following example creates a cluster in the us-east-1b availability zone.

Command:

aws emr create-cluster --ec2-attributes AvailabilityZone=us-east-1b --release-label emr-5.9.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large

The following example creates a cluster and specifies only the Amazon EMR-managed security groups.

Command:

aws emr create-cluster --release-label emr-5.9.0 --service-role myServiceRole --ec2-attributes InstanceProfile=myRole,EmrManagedMasterSecurityGroup=sg-master1,EmrManagedSlaveSecurityGroup=sg-slave1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large

The following example creates a cluster and specifies only additional Amazon EC2 security groups.

Command:

aws emr create-cluster --release-label emr-5.9.0 --service-role myServiceRole --ec2-attributes InstanceProfile=myRole,AdditionalMasterSecurityGroups=[sg-addMaster1,sg-addMaster2,sg-addMaster3,sg-addMaster4],AdditionalSlaveSecurityGroups=[sg-addSlave1,sg-addSlave2,sg-addSlave3,sg-addSlave4] --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large

The following example creates a cluster and specifies the EMR-Managed security groups, as well as additional security groups.

Command:

aws emr create-cluster --release-label emr-5.9.0 --service-role myServiceRole --ec2-attributes InstanceProfile=myRole,EmrManagedMasterSecurityGroup=sg-master1,EmrManagedSlaveSecurityGroup=sg-slave1,AdditionalMasterSecurityGroups=[sg-addMaster1,sg-addMaster2,sg-addMaster3,sg-addMaster4],AdditionalSlaveSecurityGroups=[sg-addSlave1,sg-addSlave2,sg-addSlave3,sg-addSlave4] --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large

The following example creates a cluster in a VPC private subnet and uses a specific Amazon EC2 security group to enable Amazon EMR service access, which is required for clusters in private subnets.

Command:

aws emr create-cluster --release-label emr-5.9.0 --service-role myServiceRole --ec2-attributes InstanceProfile=myRole,ServiceAccessSecurityGroup=sg-service-access,EmrManagedMasterSecurityGroup=sg-master,EmrManagedSlaveSecurityGroup=sg-slave --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large

The following example specifies security group configuration parameters within a JSON file, ec2_attributes.json, that is stored locally.

Command:

aws emr create-cluster --release-label emr-5.9.0 --service-role myServiceRole --ec2-attributes file://ec2_attributes.json --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large

The following example demonstrates the contents of ec2_attributes.json.

[
  {
    "SubnetId": "subnet-xxxxx",
    "KeyName": "myKey",
    "InstanceProfile": "myRole",
    "EmrManagedMasterSecurityGroup": "sg-master1",
    "EmrManagedSlaveSecurityGroup": "sg-slave1",
    "ServiceAccessSecurityGroup": "sg-service-access",
    "AdditionalMasterSecurityGroups": ["sg-addMaster1","sg-addMaster2","sg-addMaster3","sg-addMaster4"],
    "AdditionalSlaveSecurityGroups": ["sg-addSlave1","sg-addSlave2","sg-addSlave3","sg-addSlave4"]
  }
]

NOTE: JSON arguments must include options and values as their own items in the list.
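
Hand-edited JSON files are easy to break with a missing comma or bracket. As a minimal sketch (not part of the AWS CLI), the following Python script builds the same structure as ec2_attributes.json with the standard json module and prints it, so the output always parses. The helper name build_ec2_attributes is hypothetical, and every ID is a placeholder copied from the example.

```python
import json


def build_ec2_attributes():
    """Return the ec2_attributes.json structure as Python objects.

    All IDs are placeholders from the example; replace them with
    real values before use.
    """
    return [
        {
            "SubnetId": "subnet-xxxxx",
            "KeyName": "myKey",
            "InstanceProfile": "myRole",
            "EmrManagedMasterSecurityGroup": "sg-master1",
            "EmrManagedSlaveSecurityGroup": "sg-slave1",
            "ServiceAccessSecurityGroup": "sg-service-access",
            "AdditionalMasterSecurityGroups": [
                "sg-addMaster1", "sg-addMaster2", "sg-addMaster3", "sg-addMaster4",
            ],
            "AdditionalSlaveSecurityGroups": [
                "sg-addSlave1", "sg-addSlave2", "sg-addSlave3", "sg-addSlave4",
            ],
        }
    ]


if __name__ == "__main__":
    # json.dumps writes every comma and quote, so the result is valid JSON.
    print(json.dumps(build_ec2_attributes(), indent=2))
```

Redirect the output to ec2_attributes.json, then pass the file with --ec2-attributes file://ec2_attributes.json.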

Enable debugging and specify a log URI

The following example uses the --enable-debugging parameter, which allows you to view log files more easily using the debugging tool in the Amazon EMR console. The --log-uri parameter is required with --enable-debugging.

Command:

aws emr create-cluster --enable-debugging --log-uri s3://myBucket/myLog --release-label emr-5.9.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate

Add tags when creating a cluster

Tags are key-value pairs that help you identify and manage clusters. The following example uses the --tags parameter to create three tags for a cluster: one with the key name name and the value Shirley Rodriguez, a second with the key name age and the value 29, and a third with the key name department and the value Analytics.

Command:

aws emr create-cluster --tags name="Shirley Rodriguez" age=29 department="Analytics" --release-label emr-5.9.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate

The following example lists the tags applied to a cluster.

Command:

aws emr describe-cluster --cluster-id j-XXXXXXYY --query Cluster.Tags
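
Tag values that contain spaces must be quoted so that the shell passes each key=value pair as a single argument. As a sketch, the following Python script assembles the --tags portion of the command from a dictionary and lets shlex handle the quoting. The helper name tag_arguments is hypothetical, and the other create-cluster options are omitted for brevity.

```python
import shlex


def tag_arguments(tags):
    """Turn a mapping of tag keys to values into --tags arguments."""
    return [f"{key}={value}" for key, value in tags.items()]


tags = {"name": "Shirley Rodriguez", "age": 29, "department": "Analytics"}

# Only the tag portion is shown; add --release-label and the other options.
command = ["aws", "emr", "create-cluster", "--tags", *tag_arguments(tags)]

# shlex.join quotes any argument that contains spaces.
print(shlex.join(command))
# prints: aws emr create-cluster --tags 'name=Shirley Rodriguez' age=29 department=Analytics
```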

Use a security configuration to enable encryption and other security features

The following example uses the --security-configuration parameter to specify a security configuration for an EMR cluster. You can use security configurations with Amazon EMR version 4.8.0 or later.

Command:

aws emr create-cluster --instance-type m4.large --release-label emr-5.9.0 --security-configuration mySecurityConfiguration

Create a cluster with additional EBS storage volumes configured for the instance groups

When you specify additional EBS volumes using EbsBlockDeviceConfigs, the VolumeType and SizeInGB arguments are required.

The following example creates a cluster with multiple EBS volumes attached to EC2 instances in the core instance group.

Command:

aws emr create-cluster --release-label emr-5.9.0  --use-default-roles --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=d2.xlarge 'InstanceGroupType=CORE,InstanceCount=2,InstanceType=d2.xlarge,EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,SizeInGB=100}},{VolumeSpecification={VolumeType=io1,SizeInGB=100,Iops=100},VolumesPerInstance=4}]}' --auto-terminate

The following example creates a cluster with multiple EBS volumes attached to EC2 instances in the master instance group.

Command:

aws emr create-cluster --release-label emr-5.9.0 --use-default-roles --instance-groups 'InstanceGroupType=MASTER, InstanceCount=1, InstanceType=d2.xlarge, EbsConfiguration={EbsOptimized=true, EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=io1, SizeInGB=100, Iops=100}},{VolumeSpecification={VolumeType=standard,SizeInGB=50},VolumesPerInstance=3}]}' InstanceGroupType=CORE,InstanceCount=2,InstanceType=d2.xlarge --auto-terminate
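
The shorthand syntax becomes hard to read as the EBS configuration grows. As a sketch, the following Python script expresses the core instance group from the first EBS example as JSON-ready objects and checks the rule stated at the start of this section: every volume specification needs VolumeType and SizeInGB. The function name check_ebs_configs is hypothetical and the check is only illustrative; the service performs its own validation.

```python
import json

# Core instance group from the first EBS example, written as plain objects.
core_group = {
    "InstanceGroupType": "CORE",
    "InstanceCount": 2,
    "InstanceType": "d2.xlarge",
    "EbsConfiguration": {
        "EbsOptimized": True,
        "EbsBlockDeviceConfigs": [
            {"VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 100}},
            {
                "VolumeSpecification": {
                    "VolumeType": "io1",
                    "SizeInGB": 100,
                    "Iops": 100,
                },
                "VolumesPerInstance": 4,
            },
        ],
    },
}


def check_ebs_configs(group):
    """Return a list of problems found in a group's EBS volume specifications."""
    problems = []
    configs = group.get("EbsConfiguration", {}).get("EbsBlockDeviceConfigs", [])
    for index, config in enumerate(configs):
        spec = config.get("VolumeSpecification", {})
        for required in ("VolumeType", "SizeInGB"):
            if required not in spec:
                problems.append(f"volume {index}: missing {required}")
    return problems


problems = check_ebs_configs(core_group)
if problems:
    raise SystemExit("\n".join(problems))

# Prints only the core group; add the master group before using the output.
print(json.dumps([core_group], indent=2))
```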

Create a cluster with an automatic scaling policy

You can attach automatic scaling policies to core and task instance groups using Amazon EMR version 4.0 and later. The automatic scaling policy dynamically adds and removes EC2 instances in response to an Amazon CloudWatch metric. For more information, see http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-automatic-scaling.html.

When attaching an automatic scaling policy, you must also specify the default role for automatic scaling using --auto-scaling-role EMR_AutoScaling_DefaultRole.

The following example specifies the automatic scaling policy for the CORE instance group using the AutoScalingPolicy argument with an embedded JSON structure, which specifies the scaling policy configuration. Instance groups with an embedded JSON structure must have the entire collection of arguments enclosed in single quotes. Using single quotes is optional for instance groups without an embedded JSON structure.

Command:

aws emr create-cluster --release-label emr-5.9.0 --use-default-roles --auto-scaling-role EMR_AutoScaling_DefaultRole --instance-groups InstanceGroupType=MASTER,InstanceType=d2.xlarge,InstanceCount=1 'InstanceGroupType=CORE,InstanceType=d2.xlarge,InstanceCount=2,AutoScalingPolicy={Constraints={MinCapacity=1,MaxCapacity=5},Rules=[{Name=TestRule,Description=TestDescription,Action={Market=ON_DEMAND,SimpleScalingPolicyConfiguration={AdjustmentType=EXACT_CAPACITY,ScalingAdjustment=2}},Trigger={CloudWatchAlarmDefinition={ComparisonOperator=GREATER_THAN,EvaluationPeriods=5,MetricName=TestMetric,Namespace=EMR,Period=3,Statistic=MAXIMUM,Threshold=4.5,Unit=NONE,Dimensions=[{Key=TestKey,Value=TestValue}]}}}]}'

The following example uses a JSON file, instancegroupconfig.json, to specify the configuration of all instance groups in a cluster. The JSON file specifies the automatic scaling policy configuration for the core instance group.

Command:

aws emr create-cluster --release-label emr-5.9.0 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --instance-groups file://instancegroupconfig.json --auto-scaling-role EMR_AutoScaling_DefaultRole

The following example shows the contents of instancegroupconfig.json.

[
  {
      "InstanceCount": 1,
      "Name": "MyMasterIG",
      "InstanceGroupType": "MASTER",
      "InstanceType": "m4.large"
  },
  {
      "InstanceCount": 2,
      "Name": "MyCoreIG",
      "InstanceGroupType": "CORE",
      "InstanceType": "m4.large",
      "AutoScalingPolicy": {
          "Constraints": {
              "MinCapacity": 2,
              "MaxCapacity": 10
          },
          "Rules": [
              {
                  "Name": "Default-scale-out",
                  "Description": "Replicates the default scale-out rule in the console for YARN memory.",
                  "Action": {
                      "SimpleScalingPolicyConfiguration": {
                          "AdjustmentType": "CHANGE_IN_CAPACITY",
                          "ScalingAdjustment": 1,
                          "CoolDown": 300
                      }
                  },
                  "Trigger": {
                      "CloudWatchAlarmDefinition": {
                          "ComparisonOperator": "LESS_THAN",
                          "EvaluationPeriods": 1,
                          "MetricName": "YARNMemoryAvailablePercentage",
                          "Namespace": "AWS/ElasticMapReduce",
                          "Period": 300,
                          "Threshold": 15,
                          "Statistic": "AVERAGE",
                          "Unit": "PERCENT",
                          "Dimensions": [
                              {
                                  "Key": "JobFlowId",
                                  "Value": "${emr.clusterId}"
                              }
                          ]
                      }
                  }
              }
          ]
      }
  }
]
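
Scale-out and scale-in rules share the same shape, so they can be generated rather than written by hand. As a sketch, the hypothetical helper scaling_rule below produces a rule in the format used by instancegroupconfig.json. The first call reproduces the Default-scale-out rule. The name, threshold, and adjustment of the scale-in rule in the second call are illustrative assumptions, not values from this guide.

```python
import json


def scaling_rule(name, description, adjustment, operator, threshold):
    """Return one automatic scaling rule keyed on available YARN memory."""
    return {
        "Name": name,
        "Description": description,
        "Action": {
            "SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": adjustment,
                "CoolDown": 300,
            }
        },
        "Trigger": {
            "CloudWatchAlarmDefinition": {
                "ComparisonOperator": operator,
                "EvaluationPeriods": 1,
                "MetricName": "YARNMemoryAvailablePercentage",
                "Namespace": "AWS/ElasticMapReduce",
                "Period": 300,
                "Threshold": threshold,
                "Statistic": "AVERAGE",
                "Unit": "PERCENT",
                "Dimensions": [{"Key": "JobFlowId", "Value": "${emr.clusterId}"}],
            }
        },
    }


policy = {
    "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
    "Rules": [
        scaling_rule(
            "Default-scale-out",
            "Replicates the default scale-out rule in the console for YARN memory.",
            1, "LESS_THAN", 15,
        ),
        # Assumed values for illustration: remove one instance when more
        # than 75 percent of YARN memory is available.
        scaling_rule(
            "Example-scale-in",
            "Illustrative scale-in rule for YARN memory.",
            -1, "GREATER_THAN", 75,
        ),
    ],
}

print(json.dumps(policy, indent=2))
```

The printed object can be used as the value of AutoScalingPolicy for the core instance group in the JSON file.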

Add custom JAR steps when creating a cluster

The following example adds steps by specifying a JAR file stored in Amazon S3. Steps submit work to a cluster. The main function defined in the JAR file executes after EC2 instances are provisioned, any bootstrap actions have executed, and applications are installed. The steps are specified using Type=CUSTOM_JAR.

Custom JAR steps require the Jar= parameter, which specifies the path and file name of the JAR. The following parameters are optional.

Type, Name, ActionOnFailure, Args, MainClass

If MainClass is not specified, the JAR file should specify Main-Class in its manifest file.

Command:

aws emr create-cluster --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://myBucket/mytest.jar,Args=arg1,arg2,arg3 Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://myBucket/mytest.jar,MainClass=mymainclass,Args=arg1,arg2,arg3  --release-label emr-5.3.1  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate

Add streaming steps when creating a cluster

The following examples add a streaming step to a cluster that terminates after all steps run.

Streaming steps require the following parameters.

Type, Args

The following parameters are optional for streaming steps.

Name, ActionOnFailure

The following example specifies the step inline.

Command:

aws emr create-cluster --steps Type=STREAMING,Name='Streaming Program',ActionOnFailure=CONTINUE,Args=[-files,s3://elasticmapreduce/samples/wordcount/wordSplitter.py,-mapper,wordSplitter.py,-reducer,aggregate,-input,s3://elasticmapreduce/samples/wordcount/input,-output,s3://mybucket/wordcount/output] --release-label emr-5.3.1  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate

The following example uses a JSON configuration file, multiplefiles.json, which is stored locally. The JSON configuration specifies multiple files. To specify multiple files within a step, you must use a JSON configuration file to specify the step.

Command:

aws emr create-cluster --steps file://./multiplefiles.json --release-label emr-5.9.0  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate

The following example demonstrates the contents of multiplefiles.json.

[
  {
      "Name": "JSON Streaming Step",
      "Args": [
          "-files",
          "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
          "-mapper",
          "wordSplitter.py",
          "-reducer",
          "aggregate",
          "-input",
          "s3://elasticmapreduce/samples/wordcount/input",
          "-output",
          "s3://mybucket/wordcount/output"
      ],
      "ActionOnFailure": "CONTINUE",
      "Type": "STREAMING"
  }
]

NOTE: JSON arguments must include options and values as their own items in the list.
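
Hadoop streaming takes the value of -files as a single comma-separated list, which is why a step that ships several files is easier to express in JSON than in the comma-delimited shorthand. As a sketch, the hypothetical helper streaming_step below generates a step in the format of multiplefiles.json. The second script, wordHelper.py, is an invented file name used only to show how multiple files are joined.

```python
import json


def streaming_step(name, files, mapper, reducer, input_uri, output_uri):
    """Return a STREAMING step; each option and value is its own list item."""
    return {
        "Name": name,
        "Args": [
            "-files", ",".join(files),  # several files become one argument
            "-mapper", mapper,
            "-reducer", reducer,
            "-input", input_uri,
            "-output", output_uri,
        ],
        "ActionOnFailure": "CONTINUE",
        "Type": "STREAMING",
    }


step = streaming_step(
    name="JSON Streaming Step",
    files=[
        "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
        "s3://mybucket/wordcount/wordHelper.py",  # hypothetical second file
    ],
    mapper="wordSplitter.py",
    reducer="aggregate",
    input_uri="s3://elasticmapreduce/samples/wordcount/input",
    output_uri="s3://mybucket/wordcount/output",
)

print(json.dumps([step], indent=2))
```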

Add Hive steps when creating a cluster

Command:

aws emr create-cluster --steps Type=HIVE,Name='Hive program',ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/hive-ads/libs/model-build.q,-d,INPUT=s3://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3://mybucket/hive-ads/output/2014-04-18/11-07-32,-d,LIBS=s3://elasticmapreduce/samples/hive-ads/libs] --applications Name=Hive --release-label emr-5.3.1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large

Hive steps require the following parameters.

Type, Args

The following parameters are optional for Hive steps.

Name, ActionOnFailure

Add Pig steps when creating a cluster

Command:

aws emr create-cluster --steps Type=PIG,Name='Pig program',ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/pig-apache/do-reports2.pig,-p,INPUT=s3://elasticmapreduce/samples/pig-apache/input,-p,OUTPUT=s3://mybucket/pig-apache/output] --applications Name=Pig --release-label emr-5.3.1  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large

Pig steps require the following parameters.

Type, Args

The following parameters are optional for Pig steps.

Name, ActionOnFailure

Add bootstrap actions

The following example runs two bootstrap actions defined as scripts that are stored in Amazon S3.

Command:

aws emr create-cluster --bootstrap-actions Path=s3://mybucket/myscript1,Name=BootstrapAction1,Args=[arg1,arg2] Path=s3://mybucket/myscript2,Name=BootstrapAction2,Args=[arg1,arg2] --release-label emr-5.3.1  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate

Enable EMRFS consistent view and customize the RetryCount and RetryPeriod settings

The following example specifies the retry count and retry period for EMRFS consistent view. The Consistent=true argument is required.

Command:

aws emr create-cluster --instance-type m4.large --release-label emr-5.9.0 --emrfs Consistent=true,RetryCount=6,RetryPeriod=30

The following example specifies the same EMRFS configuration as the previous example, using a JSON configuration file, emrfsconfig.json, stored locally.

Command:

aws emr create-cluster --instance-type m4.large --release-label emr-5.9.0 --emrfs file://emrfsconfig.json

The following example demonstrates the contents of emrfsconfig.json.

{
  "Consistent": true,
  "RetryCount": 6,
  "RetryPeriod": 30
}

Create a cluster with Kerberos configured

The following examples create a cluster using a security configuration with Kerberos enabled, and establish Kerberos parameters for the cluster using --kerberos-attributes.

The following command specifies Kerberos attributes for the cluster inline.

Command:

aws emr create-cluster --instance-type m3.xlarge --release-label emr-5.10.0 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --security-configuration mySecurityConfiguration --kerberos-attributes Realm=EC2.INTERNAL,KdcAdminPassword=123,CrossRealmTrustPrincipalPassword=123

The following command specifies the same attributes, but references a JSON file, kerberos_attributes.json, for the properties and values. In this example, the file is saved in the same directory where you run the command. You can also reference a configuration file saved in Amazon S3.

Command:

aws emr create-cluster --instance-type m3.xlarge --release-label emr-5.10.0 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --security-configuration mySecurityConfiguration --kerberos-attributes file://kerberos_attributes.json

The contents of kerberos_attributes.json are shown below:

{
  "Realm": "EC2.INTERNAL",
  "KdcAdminPassword": "123",
  "CrossRealmTrustPrincipalPassword": "123"
}
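
Because this file holds passwords, you may prefer to generate it when needed rather than keep literal values under version control. As a sketch, the following Python script builds the same attributes from a mapping such as os.environ. The variable names KERBEROS_REALM, KDC_ADMIN_PASSWORD, and CROSS_REALM_PASSWORD, and the helper name kerberos_attributes, are assumptions made for this example and are not defined by the AWS CLI.

```python
import json
import os


def kerberos_attributes(env=None):
    """Build the Kerberos attributes from a mapping of variables.

    The variable names are hypothetical; pick any names that suit
    your environment. A missing password raises KeyError.
    """
    if env is None:
        env = os.environ
    return {
        "Realm": env.get("KERBEROS_REALM", "EC2.INTERNAL"),
        "KdcAdminPassword": env["KDC_ADMIN_PASSWORD"],
        "CrossRealmTrustPrincipalPassword": env["CROSS_REALM_PASSWORD"],
    }


# Demonstration with the placeholder values from the example file.
# In real use, call kerberos_attributes() to read from os.environ.
example_env = {"KDC_ADMIN_PASSWORD": "123", "CROSS_REALM_PASSWORD": "123"}
print(json.dumps(kerberos_attributes(example_env), indent=2))
```

Redirect the output to kerberos_attributes.json and restrict the file's permissions before passing it with --kerberos-attributes file://kerberos_attributes.json.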