create-cluster

Description

Creates an Amazon EMR cluster with the specified configurations.

Quick start:

aws emr create-cluster --release-label <release-label> --instance-type <instance-type> --instance-count <instance-count>

Values for the following can be set in the AWS CLI config file using the "aws configure set" command: --service-role, --log-uri, and InstanceProfile and KeyName arguments under --ec2-attributes.
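As a rough illustration of the sentence above, the following dry-run sketch prints the "aws configure set" commands one might use. The key names (emr.service_role, emr.instance_profile, emr.log_uri, emr.key_name) are assumptions based on the AWS CLI's EMR configuration conventions and should be checked against the CLI documentation. The role, bucket, and key pair names are placeholders.

```shell
# Dry run: print each "aws configure set" command instead of running it.
# The emr.* key names are assumptions; verify them before relying on them.
for setting in \
  "emr.service_role EMR_DefaultRole" \
  "emr.instance_profile EMR_EC2_DefaultRole" \
  "emr.log_uri s3://myBucket/myLog" \
  "emr.key_name myKey"
do
  echo "aws configure set $setting"
done
```

Running the printed commands (without the dry-run wrapper) would store the values in the CLI config file, so that later create-cluster calls can omit them.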

Synopsis

 create-cluster
--release-label <value>   | --ami-version <value>
--instance-fleets <value> | --instance-groups <value> | --instance-type <value> --instance-count <value>
[--auto-terminate | --no-auto-terminate]
[--use-default-roles]
[--service-role <value>]
[--auto-scaling-role <value>]
[--configurations <value>]
[--name <value>]
[--log-uri <value>]
[--additional-info <value>]
[--ec2-attributes <value>]
[--termination-protected | --no-termination-protected]
[--scale-down-behavior <value>]
[--visible-to-all-users | --no-visible-to-all-users]
[--enable-debugging | --no-enable-debugging]
[--tags <value>]
[--applications <value>]
[--emrfs <value>]
[--bootstrap-actions <value>]
[--steps <value>]
[--restore-from-hbase-backup <value>]
[--security-configuration <value>]
[--custom-ami-id <value>]
[--ebs-root-volume-size <value>]
[--repo-upgrade-on-boot <value>]
[--kerberos-attributes <value>]

Options

--release-label (string)

Specifies the Amazon EMR release version, which determines the versions of application software that are installed on the cluster. For example, --release-label emr-5.7.0 installs the application versions and features available in that version. For details about application versions and features available in each Amazon EMR release, see the Amazon EMR Release Guide (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html). Use --release-label only for Amazon EMR version 4.x and later. Use --ami-version for earlier versions. You cannot specify both a release label and an AMI version.

--ami-version (string)

Applies only to Amazon EMR versions earlier than 4.0. Use --release-label for version 4.x and later. Specifies the version of Amazon Linux Amazon Machine Image (AMI) to use when launching Amazon EC2 instances in the cluster. For example, --ami-version 3.1.0. For more information about compatible AMIs in earlier versions of Amazon EMR, see the Amazon EMR Developer Guide (http://docs.aws.amazon.com/emr/latest/DeveloperGuide/emr-dg.pdf).

--instance-groups (list)

Specifies the number and type of Amazon EC2 instances to create for each node type in a cluster, using uniform instance groups. You can specify either --instance-groups or --instance-fleets but not both. For more information, see Create a Cluster with Instance Fleets or Instance Groups (http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-group-configuration.html).

You can configure instance group parameters in a JSON file that is stored locally or in Amazon S3 and then reference the JSON file as the sole parameter, for example, --instance-groups s3://mybucket/instancegroupconfig.json. Alternatively, you can specify arguments individually using multiple InstanceGroupType argument blocks: one for the MASTER instance group, one for a CORE instance group, and, optionally, multiple TASK instance groups.

If you specify inline JSON structures, enclose the entire InstanceGroupType argument block in single quotation marks.

Each InstanceGroupType block takes the following inline arguments. Optional arguments are shown in [square brackets].

  • [Name] - An optional friendly name for the instance group.
  • InstanceGroupType - MASTER, CORE, or TASK.
  • InstanceType - The type of EC2 instance, for example m3.xlarge, to use for all nodes in the instance group.
  • InstanceCount - The number of EC2 instances to provision in the instance group.
  • [BidPrice] - If specified, indicates that the instance group uses Spot Instances, and establishes the bid price for the instances.
  • [EbsConfiguration] - Specifies additional Amazon EBS storage volumes attached to EC2 instances using an inline JSON structure.
  • [AutoScalingPolicy] - Specifies an automatic scaling policy for the instance group using an inline JSON structure.

JSON Syntax:

[
  {
    "InstanceCount": integer,
    "Name": "string",
    "InstanceGroupType": "MASTER"|"CORE"|"TASK",
    "AutoScalingPolicy": {
      "Rules": [
        {
          "Action": {
            "SimpleScalingPolicyConfiguration": {
              "ScalingAdjustment": integer,
              "CoolDown": integer,
              "AdjustmentType": "CHANGE_IN_CAPACITY"|"PERCENT_CHANGE_IN_CAPACITY"|"EXACT_CAPACITY"
            },
            "Market": "ON_DEMAND"|"SPOT"
          },
          "Trigger": {
            "CloudWatchAlarmDefinition": {
              "EvaluationPeriods": integer,
              "Dimensions": [
                {
                  "Key": "string",
                  "Value": "string"
                }
                ...
              ],
              "Namespace": "string",
              "Period": integer,
              "ComparisonOperator": "string",
              "Statistic": "string",
              "Threshold": double,
              "Unit": "string",
              "MetricName": "string"
            }
          },
          "Name": "string",
          "Description": "string"
        }
        ...
      ],
      "Constraints": {
        "MinCapacity": integer,
        "MaxCapacity": integer
      }
    },
    "EbsConfiguration": {
      "EbsOptimized": true|false,
      "EbsBlockDeviceConfigs": [
        {
          "VolumeSpecification": {
            "Iops": integer,
            "VolumeType": "string",
            "SizeInGB": integer
          },
          "VolumesPerInstance": integer
        }
        ...
      ]
    },
    "BidPrice": "string",
    "InstanceType": "string"
  }
  ...
]
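As a sketch of the file-based form described above, the following snippet writes a minimal instance group configuration to a local file and prints, rather than runs, the matching command. The file name instancegroupconfig.json comes from the description above. The temporary directory, the group names, the m3.xlarge types, and the counts are illustrative assumptions, and the file:// prefix is the AWS CLI convention for local files.

```shell
# Dry-run sketch: write an instance group file, then print the command.
# Directory, names, counts, and instance types are illustrative assumptions.
workdir=$(mktemp -d)
config="$workdir/instancegroupconfig.json"

cat > "$config" <<'EOF'
[
  {
    "Name": "Master",
    "InstanceGroupType": "MASTER",
    "InstanceType": "m3.xlarge",
    "InstanceCount": 1
  },
  {
    "Name": "Core",
    "InstanceGroupType": "CORE",
    "InstanceType": "m3.xlarge",
    "InstanceCount": 2
  }
]
EOF

# Printed only; remove "echo" to actually create (and pay for) a cluster.
echo aws emr create-cluster --release-label emr-5.3.1 \
  --instance-groups "file://$config"
```

Keeping the groups in a file avoids the quoting issues that inline JSON structures raise on the command line.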

--instance-type (string)

Shortcut parameter as an alternative to --instance-groups. Specifies the type of Amazon EC2 instance to use in a cluster. If used without the --instance-count parameter, the cluster consists of a single master node running on the EC2 instance type specified. When used together with --instance-count, one instance is used for the master node, and the remainder are used for the core node type.

--instance-count (string)

Shortcut parameter as an alternative to --instance-groups, when used together with --instance-type. Specifies the number of Amazon EC2 instances to create for a cluster. One instance is used for the master node, and the remainder are used for the core node type.
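The master/core split described above is simple arithmetic. A small sketch, with the count of 5 chosen purely for illustration:

```shell
# With --instance-type and --instance-count, one instance becomes the
# master node and the remainder become core nodes.
instance_count=5   # illustrative value
master_nodes=1
core_nodes=$((instance_count - master_nodes))

echo "master=$master_nodes core=$core_nodes"
# prints: master=1 core=4
```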

--auto-terminate | --no-auto-terminate (boolean)

Specifies whether the cluster should terminate after completing all the steps. Auto termination is off by default.

--instance-fleets (list)

Available in Amazon EMR version 5.0 and later. Specifies the number and type of Amazon EC2 instances to create for each node type in a cluster, using instance fleets. You can specify either --instance-fleets or --instance-groups but not both. For more information and examples, see Configure Instance Fleets (http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html).

You can configure instance fleet parameters in a JSON file that is stored locally or in Amazon S3 and then reference the JSON file as the sole parameter (for example, --instance-fleets s3://mybucket/instancefleetconfig.json). Alternatively, you can specify arguments individually using multiple InstanceFleetType argument blocks, one for the MASTER instance fleet, one for a CORE instance fleet, and an optional TASK instance fleet.

The following arguments can be specified for each instance fleet. Optional arguments are shown in [square brackets].

  • [Name] - An optional friendly name for the instance fleet.
  • InstanceFleetType - MASTER, CORE, or TASK.
  • TargetOnDemandCapacity - The target capacity of On-Demand units for the instance fleet, which determines how many On-Demand Instances to provision. The WeightedCapacity specified for an instance type within InstanceTypeConfigs counts toward this total when an instance type with the On-Demand purchasing option launches.
  • TargetSpotCapacity - The target capacity of Spot units for the instance fleet, which determines how many Spot Instances to provision. The WeightedCapacity specified for an instance type within InstanceTypeConfigs counts toward this total when an instance type with the Spot purchasing option launches.
  • [LaunchSpecifications] - When TargetSpotCapacity is specified, specifies the block duration and timeout action for Spot Instances.
  • InstanceTypeConfigs - Specifies up to five EC2 instance types to use in the instance fleet, including details such as Spot price and Amazon EBS configuration.

JSON Syntax:

[
  {
    "Name": "string",
    "InstanceFleetType": "MASTER"|"CORE"|"TASK",
    "LaunchSpecifications": {
      "SpotSpecification": {
        "TimeoutDurationMinutes": integer,
        "BlockDurationMinutes": integer,
        "TimeoutAction": "TERMINATE_CLUSTER"|"SWITCH_TO_ON_DEMAND"
      }
    },
    "TargetSpotCapacity": integer,
    "InstanceTypeConfigs": [
      {
        "WeightedCapacity": integer,
        "EbsConfiguration": {
          "EbsOptimized": true|false,
          "EbsBlockDeviceConfigs": [
            {
              "VolumeSpecification": {
                "Iops": integer,
                "VolumeType": "string",
                "SizeInGB": integer
              },
              "VolumesPerInstance": integer
            }
            ...
          ]
        },
        "BidPrice": "string",
        "BidPriceAsPercentageOfOnDemandPrice": double,
        "InstanceType": "string",
        "Configurations": "string"
      }
      ...
    ],
    "TargetOnDemandCapacity": integer
  }
  ...
]
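To make the relationship between target capacity and WeightedCapacity more concrete, here is a small arithmetic sketch. It assumes, as a simplification, that a fleet is filled with a single instance type and that instances are added until the accumulated weight meets or exceeds the target; real fleets can mix instance types. The numbers (a target of 11 units and a weight of 3) are illustrative.

```shell
# Simplified model: how many instances of one type are needed to reach
# the target capacity when each instance counts as "weighted_capacity" units.
target_spot_capacity=11   # illustrative TargetSpotCapacity
weighted_capacity=3       # illustrative WeightedCapacity

# Ceiling division: smallest instance count whose total weight >= target.
instances=$(( (target_spot_capacity + weighted_capacity - 1) / weighted_capacity ))
provisioned_units=$((instances * weighted_capacity))

echo "instances=$instances units=$provisioned_units"
# prints: instances=4 units=12
```

Under this model the provisioned units (12) can exceed the target (11) because instances are added in whole multiples of their weight.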

--name (string)

The name of the cluster. If not provided, the default is "Development Cluster".

--log-uri (string)

Specifies the location in Amazon S3 to which log files are periodically written. If a value is not provided, log files are not written to Amazon S3 from the master node and are lost if the master node terminates.

--service-role (string)

Specifies an IAM service role, which Amazon EMR requires to call other AWS services such as EC2 on your behalf during cluster operation. This parameter is usually specified when a customized service role is used. To specify the default service role, as well as the default instance profile, use the --use-default-roles parameter. If the role and instance profile do not already exist, use the aws emr create-default-roles command to create them.

--auto-scaling-role (string)

Specify --auto-scaling-role EMR_AutoScaling_DefaultRole if an automatic scaling policy is specified for an instance group using the --instance-groups parameter. This default IAM role allows the automatic scaling feature to launch and terminate Amazon EC2 instances during scaling operations.

--use-default-roles (boolean)

Specifies that the cluster should use the default service role (EMR_DefaultRole) and instance profile (EMR_EC2_DefaultRole) for permissions to access other AWS services.

Make sure that the role and instance profile exist first. To create them, use the create-default-roles command.

Specifying --use-default-roles does not include the EMR_AutoScaling_DefaultRole. If you use an automatic scaling policy with instance groups in Amazon EMR, use --auto-scaling-role EMR_AutoScaling_DefaultRole to specify the role individually.

--configurations (string)

Specifies a JSON file that contains configuration classifications, which you can use to customize applications that EMR installs when cluster instances launch. Applies only to EMR 4.x and later. The file referenced can either be stored locally (for example, --configurations file://configurations.json) or stored in Amazon S3 (for example, --configurations https://s3.amazonaws.com/myBucket/configurations.json). Each classification usually corresponds to the XML configuration file for an application, such as yarn-site for YARN. For a list of available configuration classifications and example JSON, see Configuring Applications (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html).

--ec2-attributes (structure)

Configures cluster and Amazon EC2 instance configurations. Accepts the following arguments:

  • KeyName - Specifies the name of the Amazon EC2 key pair that will be used for SSH connections to the master node and other instances on the cluster.
  • AvailabilityZone - Specifies the Availability Zone in which to launch the cluster. For example, us-west-1b.
  • SubnetId - Specifies the VPC subnet in which to create the cluster.
  • InstanceProfile - An IAM role that allows EC2 instances to access other AWS services, such as Amazon S3, that are required for operations.
  • EmrManagedMasterSecurityGroup - The security group ID of the Amazon EC2 security group for the master node.
  • EmrManagedSlaveSecurityGroup - The security group ID of the Amazon EC2 security group for the slave nodes.
  • ServiceAccessSecurityGroup - The security group ID of the Amazon EC2 security group for Amazon EMR access to clusters in VPC private subnets.
  • AdditionalMasterSecurityGroups - A list of additional Amazon EC2 security group IDs for the master node.
  • AdditionalSlaveSecurityGroups - A list of additional Amazon EC2 security group IDs for the slave nodes.

Shorthand Syntax:

ServiceAccessSecurityGroup=string,AvailabilityZone=string,AdditionalSlaveSecurityGroups=string,string,EmrManagedMasterSecurityGroup=string,SubnetIds=string,string,KeyName=string,InstanceProfile=string,SubnetId=string,AdditionalMasterSecurityGroups=string,string,AvailabilityZones=string,string,EmrManagedSlaveSecurityGroup=string

JSON Syntax:

{
  "ServiceAccessSecurityGroup": "string",
  "AvailabilityZone": "string",
  "AdditionalSlaveSecurityGroups": ["string", ...],
  "EmrManagedMasterSecurityGroup": "string",
  "SubnetIds": ["string", ...],
  "KeyName": "string",
  "InstanceProfile": "string",
  "SubnetId": "string",
  "AdditionalMasterSecurityGroups": ["string", ...],
  "AvailabilityZones": ["string", ...],
  "EmrManagedSlaveSecurityGroup": "string"
}

--termination-protected | --no-termination-protected (boolean)

Specifies whether to lock the cluster to prevent the Amazon EC2 instances from being terminated by API call, user intervention, or an error.

--scale-down-behavior (string)

Specifies the way that individual Amazon EC2 instances terminate when an automatic scale-in activity occurs or an instance group is resized.

Accepted values:

  • TERMINATE_AT_TASK_COMPLETION - Specifies that Amazon EMR blacklists and drains tasks from nodes before terminating the instance.
  • TERMINATE_AT_INSTANCE_HOUR - Specifies that Amazon EMR terminates EC2 instances at the instance-hour boundary, regardless of when the request to terminate was submitted.

--visible-to-all-users | --no-visible-to-all-users (boolean)

Specifies whether the cluster is visible to all IAM users of the AWS account associated with the cluster. If set to --visible-to-all-users, all IAM users of that AWS account can view it. If they have the proper policy permissions set, they can also manage the cluster. If it is set to --no-visible-to-all-users, only the IAM user that created the cluster can view and manage it. Clusters are visible by default.

--enable-debugging | --no-enable-debugging (boolean)

Specifies that the debugging tool is enabled for the cluster, which allows you to browse log files using the Amazon EMR console. Turning debugging on requires that you specify --log-uri because log files must be stored in Amazon S3 so that Amazon EMR can index them for viewing in the console.

--tags (list)

A list of tags to associate with a cluster, which apply to each Amazon EC2 instance in the cluster. Tags are key-value pairs that consist of a required key string with a maximum of 128 characters, and an optional value string with a maximum of 256 characters.

You can specify tags in key=value format, or you can add a tag without a value using only the key name, for example key. Use a space to separate multiple tags.

Syntax:

"string" "string" ...
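The length limits above can be checked before calling the CLI. The following is a minimal sketch, not part of the AWS CLI: check_tag is a hypothetical helper name, and the sample tags are illustrative. It accepts either a key=value tag or a key-only tag and applies the 128-character key limit and 256-character value limit.

```shell
# Hypothetical helper: returns 0 when a tag fits the documented limits
# (non-empty key of at most 128 characters, value of at most 256).
check_tag() {
  key=${1%%=*}              # text before the first "="
  case $1 in
    *=*) value=${1#*=} ;;   # text after the first "="
    *)   value="" ;;        # key-only tag has no value
  esac
  [ -n "$key" ] && [ "${#key}" -le 128 ] && [ "${#value}" -le 256 ]
}

check_tag "owner=data-team" && echo "ok: owner=data-team"
check_tag "temporary" && echo "ok: temporary"
```

Both sample tags pass, so the snippet prints an "ok" line for each.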

--bootstrap-actions (list)

Specifies a list of bootstrap actions to run when creating a cluster. Bootstrap actions are scripts that run on each cluster node immediately after Amazon EMR provisions the EC2 instance. Bootstrap actions run before applications are installed and before nodes begin running steps and processing data.

You can specify the bootstrap action as an inline JSON structure enclosed in single quotation marks, or you can use a shorthand syntax, specifying multiple bootstrap actions, each separated by a space. When using the shorthand syntax, each bootstrap action takes the following parameters, separated by commas with no trailing space. Optional parameters are shown in [square brackets].

  • Path - The path and file name of the script to run, which must be accessible to each instance in the cluster. For example, Path=s3://mybucket/myscript.sh .
  • [Name] - A friendly name to help you identify the bootstrap action. For example, Name=BootstrapAction1.
  • [Args] - A comma-separated list of arguments to pass to the bootstrap action script. Arguments can be either a list of values (Args=arg1,arg2,arg3) or a list of key-value pairs, as well as optional values, enclosed in square brackets (Args=[arg1,arg2=arg2value,arg3]).

Shorthand Syntax:

Path=string,Args=string,string,Name=string ...

JSON Syntax:

[
  {
    "Path": "string",
    "Args": ["string", ...],
    "Name": "string"
  }
  ...
]
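The shorthand form can be assembled from its parts as in the following dry-run sketch. The Path, Name, and Args values are the ones used in the parameter descriptions above. The release label and instance type are illustrative assumptions, and the command is printed rather than executed.

```shell
# Values taken from the parameter descriptions above.
path="s3://mybucket/myscript.sh"
name="BootstrapAction1"
args="[arg1,arg2=arg2value,arg3]"

# Parameters are joined with commas and no spaces.
bootstrap_action="Path=$path,Name=$name,Args=$args"

# Printed only; remove "echo" to run the command for real.
echo aws emr create-cluster --release-label emr-5.3.1 \
  --instance-type m3.xlarge \
  --bootstrap-actions "$bootstrap_action"
```

Quoting the assembled value keeps the shell from interpreting the square brackets as a glob pattern.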

--applications (list)

Specifies the applications to install on the cluster. Available applications and their respective versions vary by Amazon EMR release. For more information, see the Amazon EMR Release Guide (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/).

When using versions of Amazon EMR earlier than 4.0, some applications take optional arguments for configuration. Arguments should either be a comma-separated list of values (Args=arg1,arg2,arg3) or a bracket-enclosed list of values and/or key-value pairs (Args=[arg1,arg2=arg3,arg4]). For more information, see the Amazon EMR Developer Guide (http://docs.aws.amazon.com/emr/latest/DeveloperGuide/emr-dg.pdf).

Shorthand Syntax:

Args=string,string,Name=string ...

JSON Syntax:

[
  {
    "Args": ["string", ...],
    "Name": "MapR"|"HUE"|"HIVE"|"PIG"|"HBASE"|"IMPALA"|"GANGLIA"|"HADOOP"|"SPARK"
  }
  ...
]

--emrfs (structure)

Specifies EMRFS configuration options, such as consistent view and Amazon S3 encryption parameters. For more information, see Configuring Consistent View (http://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-configure-consistent-view.html). When using Amazon EMR version 4.8.0 or later, we recommend configuring data encryption using security configurations instead. For more information, see Specifying Encryption Options Using a Security Configuration (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-encryption-enable-security-configuration.html).

Shorthand Syntax:

Args=string,string,Encryption=string,Consistent=boolean,ProviderType=string,KMSKeyId=string,CustomProviderLocation=string,SSE=boolean,RetryCount=integer,RetryPeriod=integer,CustomProviderClass=string

JSON Syntax:

{
  "Args": ["string", ...],
  "Encryption": "SERVERSIDE"|"CLIENTSIDE",
  "Consistent": true|false,
  "ProviderType": "KMS"|"CUSTOM",
  "KMSKeyId": "string",
  "CustomProviderLocation": "string",
  "SSE": true|false,
  "RetryCount": integer,
  "RetryPeriod": integer,
  "CustomProviderClass": "string"
}

--steps (list)

A list of steps to be executed by the cluster. A step can be specified using the shorthand syntax, by referencing a JSON file, or by specifying an inline JSON structure. Args supplied with steps should be a comma-separated list of values (Args=arg1,arg2,arg3) or a bracket-enclosed list of values and key-value pairs (Args=[arg1,arg2=value,arg4]).

Shorthand Syntax:

Name=string,Args=string,string,Jar=string,ActionOnFailure=string,MainClass=string,Type=string,Properties=string ...

JSON Syntax:

[
  {
    "Name": "string",
    "Args": ["string", ...],
    "Jar": "string",
    "ActionOnFailure": "TERMINATE_CLUSTER"|"CANCEL_AND_WAIT"|"CONTINUE",
    "MainClass": "string",
    "Type": "CUSTOM_JAR"|"STREAMING"|"HIVE"|"PIG"|"IMPALA",
    "Properties": "string"
  }
  ...
]
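As a sketch of the "referencing a JSON file" form mentioned above, the following snippet writes a single custom JAR step to a local file and prints the command that would reference it. The step name, JAR location, main class, and arguments are hypothetical placeholders. The release label and instance type are illustrative assumptions, and the file:// prefix is the AWS CLI convention for local files.

```shell
# Dry-run sketch: write a steps file, then print the command.
# Step name, JAR path, main class, and args are hypothetical placeholders.
workdir=$(mktemp -d)
steps_file="$workdir/steps.json"

cat > "$steps_file" <<'EOF'
[
  {
    "Name": "ExampleStep",
    "Type": "CUSTOM_JAR",
    "Jar": "s3://mybucket/example.jar",
    "MainClass": "com.example.Main",
    "Args": ["arg1", "arg2", "arg3"],
    "ActionOnFailure": "CONTINUE"
  }
]
EOF

# Printed only; remove "echo" to actually create a cluster.
echo aws emr create-cluster --release-label emr-5.3.1 \
  --instance-type m3.xlarge \
  --steps "file://$steps_file"
```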

--additional-info (string)

Specifies additional information during cluster creation.

--restore-from-hbase-backup (structure)

Available only when using Amazon EMR versions earlier than 4.x. Launches a new HBase cluster and populates it with data from a previous backup of an HBase cluster. HBase must be installed using the --applications option.

Shorthand Syntax:

BackupVersion=string,Dir=string

JSON Syntax:

{
  "BackupVersion": "string",
  "Dir": "string"
}

--security-configuration (string)

Specifies the name of a security configuration to use for the cluster. A security configuration defines data encryption settings and other security options. For more information, see Specifying Encryption Options Using a Security Configuration (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-encryption-enable-security-configuration.html). Use list-security-configurations to get a list of available security configurations in the active account.

--custom-ami-id (string)

Available in Amazon EMR version 5.7.0 and later. Specifies the AMI ID of a custom AMI to use when Amazon EMR provisions EC2 instances. A custom AMI can be used to encrypt the Amazon EBS root volume. It can also be used instead of bootstrap actions to customize cluster node configurations. For more information, see Using a Custom AMI (http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-custom-ami.html).

--ebs-root-volume-size (string)

Available in Amazon EMR version 4.x and later. Specifies the size, in GiB, of the EBS root device volume of the Amazon Linux AMI that is used for each EC2 instance in the cluster.

--repo-upgrade-on-boot (string)

Applies only when a --custom-ami-id is specified. On first boot, by default, Amazon Linux AMIs connect to package repositories to install security updates before other services start. You can set this parameter using --repo-upgrade-on-boot NONE to disable these updates. CAUTION: This creates additional security risks.

--kerberos-attributes (structure)

Shorthand Syntax:

Realm=string,KdcAdminPassword=string,ADDomainJoinPassword=string,CrossRealmTrustPrincipalPassword=string,ADDomainJoinUser=string

JSON Syntax:

{
  "Realm": "string",
  "KdcAdminPassword": "string",
  "ADDomainJoinPassword": "string",
  "CrossRealmTrustPrincipalPassword": "string",
  "ADDomainJoinUser": "string"
}

Examples

Note: some of these examples assume that you have specified your Amazon EMR service role and Amazon EC2 instance profile in the AWS CLI configuration file. If you have not done this, you must specify each required IAM role or use the --use-default-roles parameter when creating your cluster. You can learn more about specifying parameter values for Amazon EMR commands here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-aws-cli-config.html

1. Quick start: to create an Amazon EMR cluster

  • Command:

    aws emr create-cluster --release-label emr-5.3.1  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate

2. Create an Amazon EMR cluster with default ServiceRole and InstanceProfile roles

  • Create an Amazon EMR cluster that uses the --instance-groups configuration:

    aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge

  • Create an Amazon EMR cluster that uses the --instance-fleets configuration, specifying two instance types for each fleet and two EC2 Subnets:

    aws emr create-cluster --release-label emr-5.3.1 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetIds=['subnet-ab12345c','subnet-de67890f'] --instance-fleets InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m3.xlarge}'] InstanceFleetType=CORE,TargetSpotCapacity=11,InstanceTypeConfigs=['{InstanceType=m3.xlarge,BidPrice=0.5,WeightedCapacity=3}','{InstanceType=m4.2xlarge,BidPrice=0.9,WeightedCapacity=5}'],LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=120,TimeoutAction=SWITCH_TO_ON_DEMAND}'}

3. Create an Amazon EMR cluster with default roles

  • Command:

    aws emr create-cluster --release-label emr-5.3.1  --use-default-roles --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate

4. Create an Amazon EMR cluster with applications

  • Create an Amazon EMR cluster with Hadoop, Hive and Pig installed:

    aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Pig --release-label emr-5.3.1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate

  • Create an Amazon EMR cluster with Spark installed:

    aws emr create-cluster --release-label emr-5.3.1 --applications Name=Spark --ec2-attributes KeyName=myKey --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate

5. Change configuration for Hadoop MapReduce

The following example changes the maximum number of map tasks and sets the DataNode heap size and NameNode JVM options:

  • Specifying configurations from a local file:

    aws emr create-cluster --configurations file://configurations.json --release-label emr-5.3.1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate
  • Specifying configurations from a file in Amazon S3:

    aws emr create-cluster --configurations https://s3.amazonaws.com/myBucket/configurations.json --release-label emr-5.3.1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate
  • Contents of configurations.json:

    [
     {
       "Classification": "mapred-site",
       "Properties": {
           "mapred.tasktracker.map.tasks.maximum": "2"
       }
     },
     {
       "Classification": "hadoop-env",
       "Properties": {},
       "Configurations": [
           {
             "Classification": "export",
             "Properties": {
                 "HADOOP_DATANODE_HEAPSIZE": "2048",
                 "HADOOP_NAMENODE_OPTS": "-XX:GCTimeRatio=19"
             }
           }
       ]
     }
    ]
    

6. Create an Amazon EMR cluster with MASTER, CORE, and TASK instance groups

  • Command:

    aws emr create-cluster --release-label emr-5.3.1  --auto-terminate --instance-groups Name=Master,InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 Name=Core,InstanceGroupType=CORE,InstanceType=m3.xlarge,InstanceCount=2 Name=Task,InstanceGroupType=TASK,InstanceType=m3.xlarge,InstanceCount=2

7. Specify whether the cluster should terminate after completing all the steps

  • Create an Amazon EMR cluster that terminates after completing all the steps:

    aws emr create-cluster --release-label emr-5.3.1   --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge  InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate

8. Specify EC2 Attributes

  • Create an Amazon EMR cluster with the Amazon EC2 key pair "myKey" and instance profile "myProfile":

    aws emr create-cluster --ec2-attributes KeyName=myKey,InstanceProfile=myProfile --release-label emr-5.3.1   --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate
  • Create an Amazon EMR cluster in an Amazon VPC subnet:

    aws emr create-cluster --ec2-attributes SubnetId=subnet-xxxxx --release-label emr-5.3.1   --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate
  • Create an Amazon EMR cluster in a specific Availability Zone, for example us-east-1b:

    aws emr create-cluster --ec2-attributes AvailabilityZone=us-east-1b --release-label emr-5.3.1  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
  • Create an Amazon EMR cluster specifying the Amazon EC2 security groups:

    aws emr create-cluster --release-label emr-5.3.1 --service-role myServiceRole --ec2-attributes InstanceProfile=myRole,EmrManagedMasterSecurityGroup=sg-master1,EmrManagedSlaveSecurityGroup=sg-slave1,AdditionalMasterSecurityGroups=[sg-addMaster1,sg-addMaster2,sg-addMaster3,sg-addMaster4],AdditionalSlaveSecurityGroups=[sg-addSlave1,sg-addSlave2,sg-addSlave3,sg-addSlave4] --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
  • Create an Amazon EMR cluster specifying only the Amazon EMR-managed Amazon EC2 security groups:

    aws emr create-cluster --release-label emr-5.3.1 --service-role myServiceRole --ec2-attributes InstanceProfile=myRole,EmrManagedMasterSecurityGroup=sg-master1,EmrManagedSlaveSecurityGroup=sg-slave1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
  • Create an Amazon EMR cluster specifying only the additional Amazon EC2 security groups:

    aws emr create-cluster --release-label emr-5.3.1 --service-role myServiceRole --ec2-attributes InstanceProfile=myRole,AdditionalMasterSecurityGroups=[sg-addMaster1,sg-addMaster2,sg-addMaster3,sg-addMaster4],AdditionalSlaveSecurityGroups=[sg-addSlave1,sg-addSlave2,sg-addSlave3,sg-addSlave4] --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
  • Create an Amazon EMR cluster in a VPC private subnet and use a specific Amazon EC2 security group to enable the Amazon EMR service access (required for clusters in private subnets):

    aws emr create-cluster --release-label emr-5.3.1 --service-role myServiceRole --ec2-attributes InstanceProfile=myRole,ServiceAccessSecurityGroup=sg-service-access,EmrManagedMasterSecurityGroup=sg-master,EmrManagedSlaveSecurityGroup=sg-slave --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
  • JSON equivalent (contents of ec2_attributes.json):

    [
      {
        "SubnetId": "subnet-xxxxx",
        "KeyName": "myKey",
        "InstanceProfile": "myRole",
        "EmrManagedMasterSecurityGroup": "sg-master1",
        "EmrManagedSlaveSecurityGroup": "sg-slave1",
        "ServiceAccessSecurityGroup": "sg-service-access",
        "AdditionalMasterSecurityGroups": ["sg-addMaster1","sg-addMaster2","sg-addMaster3","sg-addMaster4"],
        "AdditionalSlaveSecurityGroups": ["sg-addSlave1","sg-addSlave2","sg-addSlave3","sg-addSlave4"]
      }
    ]

NOTE: JSON arguments must include options and values as their own items in the list.

  • Command (using ec2_attributes.json):

    aws emr create-cluster --release-label emr-5.3.1 --service-role myServiceRole --ec2-attributes file://./ec2_attributes.json  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge

9. Enable debugging and specify a Log URI

  • Command:

    aws emr create-cluster --enable-debugging --log-uri s3://myBucket/myLog  --release-label emr-5.3.1  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate

10. Add tags when creating an Amazon EMR cluster

  • Add a list of tags:

    aws emr create-cluster --tags name="John Doe" age=29 address="123 East NW Seattle" --release-label emr-5.3.1  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate
  • List tags of an Amazon EMR cluster:

    aws emr describe-cluster --cluster-id j-XXXXXXYY --query Cluster.Tags

11. Use a security configuration to enable encryption

  • Command:

    aws emr create-cluster --instance-type m3.xlarge --release-label emr-5.3.1 --security-configuration mySecurityConfiguration

12. To create an Amazon EMR cluster with EBS volumes configured for the instance groups

  • Create a cluster with multiple EBS volumes attached to the CORE instance group. EBS volumes can be attached to MASTER, CORE, and TASK instance groups. For instance groups with EBS configurations, which have an embedded JSON structure, you should enclose the entire instance group argument with single quotes. For instance groups with no EBS configuration, using single quotes is optional.

  • Command:

    aws emr create-cluster --release-label emr-5.3.1  --use-default-roles --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=d2.xlarge 'InstanceGroupType=CORE,InstanceCount=2,InstanceType=d2.xlarge,EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,SizeInGB=100}},{VolumeSpecification={VolumeType=io1,SizeInGB=100,Iops=100},VolumesPerInstance=4}]}' --auto-terminate
  • Create a cluster with multiple EBS volumes attached to the MASTER instance group.

  • Command:

    aws emr create-cluster --release-label emr-5.3.1 --use-default-roles --instance-groups 'InstanceGroupType=MASTER, InstanceCount=1, InstanceType=d2.xlarge, EbsConfiguration={EbsOptimized=true, EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=io1, SizeInGB=100, Iops=100}},{VolumeSpecification={VolumeType=standard,SizeInGB=50},VolumesPerInstance=3}]}' InstanceGroupType=CORE,InstanceCount=2,InstanceType=d2.xlarge --auto-terminate
  • Required parameters:

    VolumeType and SizeInGB, if EbsBlockDeviceConfigs is specified
  • Create a cluster with an Auto Scaling policy attached to the CORE instance group. The Auto Scaling policy can be attached to CORE and TASK instance groups. For instance groups with an Auto Scaling policy attached, you should enclose the entire instance group argument with single quotes. For instance groups with no Auto Scaling policy, using single quotes is optional.

  • Command:

    aws emr create-cluster --release-label emr-5.3.1 --use-default-roles --auto-scaling-role EMR_AutoScaling_DefaultRole --instance-groups InstanceGroupType=MASTER,InstanceType=d2.xlarge,InstanceCount=1 'InstanceGroupType=CORE,InstanceType=d2.xlarge,InstanceCount=2,AutoScalingPolicy={Constraints={MinCapacity=1,MaxCapacity=5},Rules=[{Name=TestRule,Description=TestDescription,Action={Market=ON_DEMAND,SimpleScalingPolicyConfiguration={AdjustmentType=EXACT_CAPACITY,ScalingAdjustment=2}},Trigger={CloudWatchAlarmDefinition={ComparisonOperator=GREATER_THAN,EvaluationPeriods=5,MetricName=TestMetric,Namespace=EMR,Period=3,Statistic=MAXIMUM,Threshold=4.5,Unit=NONE,Dimensions=[{Key=TestKey,Value=TestValue}]}}}]}'
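The quoting advice in example 12 can be checked without creating a cluster. The sketch below is illustrative only: count_args is a hypothetical helper, and no aws command is run. It shows that a single-quoted instance group reaches a program as one argument, whereas an unquoted one may be rewritten by the shell first (bash, for example, applies brace expansion to {a,b} patterns), so the AWS CLI would not receive the text as typed.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration: print how many arguments were received.
count_args() {
  echo "$#"
}

# Quoted: the whole instance group definition is passed through as one argument.
quoted_count=$(count_args 'InstanceGroupType=CORE,InstanceCount=2,InstanceType=d2.xlarge,EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,SizeInGB=100}}]}')
echo "quoted:   $quoted_count argument(s)"

# Unquoted: the shell is free to expand the braces before the program sees them.
# The resulting count depends on the shell in use, so it is printed, not assumed.
unquoted_count=$(count_args InstanceGroupType=CORE,InstanceCount=2,InstanceType=d2.xlarge,EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,SizeInGB=100}}]})
echo "unquoted: $unquoted_count argument(s)"
```

If the two counts differ in your shell, the unquoted form is being split, which is why the single quotes are recommended for arguments with embedded structures.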

13. To add custom JAR steps to a cluster when creating an Amazon EMR cluster

  • Command:

    aws emr create-cluster --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://myBucket/mytest.jar,Args=arg1,arg2,arg3 Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://myBucket/mytest.jar,MainClass=mymainclass,Args=arg1,arg2,arg3  --release-label emr-5.3.1  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate
  • Custom JAR steps required parameters:

    Jar
    
  • Custom JAR steps optional parameters:

    Type, Name, ActionOnFailure, Args
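
As a minimal sketch of the required and optional parameters listed above, the following builds a step that supplies only Jar and leaves Type, Name, ActionOnFailure, and Args unset. It prints the command instead of running it. The bucket and JAR names are the placeholders used in the command above, and the assumption that the omitted parameters fall back to defaults rests on the optional-parameter list, not on a tested cluster.

```shell
#!/usr/bin/env bash
# Only the required parameter (Jar) is given; all optional parameters are omitted.
jar_step="Jar=s3://myBucket/mytest.jar"

# Assemble the command as text so it can be reviewed before it is run.
cmd="aws emr create-cluster --steps $jar_step --release-label emr-5.3.1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate"

# Print only; nothing is sent to AWS.
echo "$cmd"
```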
    

14. To add streaming steps when creating an Amazon EMR cluster

  • Command:

    aws emr create-cluster --steps Type=STREAMING,Name='Streaming Program',ActionOnFailure=CONTINUE,Args=[-files,s3://elasticmapreduce/samples/wordcount/wordSplitter.py,-mapper,wordSplitter.py,-reducer,aggregate,-input,s3://elasticmapreduce/samples/wordcount/input,-output,s3://mybucket/wordcount/output] --release-label emr-5.3.1  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate
  • Streaming steps required parameters:

    Type, Args
    
  • Streaming steps optional parameters:

    Name, ActionOnFailure
    
  • JSON equivalent (contents of step.json):

     [
      {
        "Name": "JSON Streaming Step",
        "Args": ["-files","s3://elasticmapreduce/samples/wordcount/wordSplitter.py","-mapper","wordSplitter.py","-reducer","aggregate","-input","s3://elasticmapreduce/samples/wordcount/input","-output","s3://mybucket/wordcount/output"],
        "ActionOnFailure": "CONTINUE",
        "Type": "STREAMING"
      }
    ]

NOTE: JSON arguments must include options and values as their own items in the list.

  • Command (using step.json):

    aws emr create-cluster --steps file://./step.json --release-label emr-4.0.0  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate
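
A malformed JSON file is an easy mistake to make with file:// arguments, so it can help to check the file before passing it to the CLI. The sketch below writes the step.json shown above into a scratch directory and, if python3 happens to be installed (an assumption about the local environment, not a requirement of the AWS CLI), runs a syntax check on it. The command is printed, not executed.

```shell
#!/usr/bin/env bash
# Use a scratch directory so no existing step.json is overwritten.
workdir=$(mktemp -d)
step_file="$workdir/step.json"

# Write the streaming step definition from the example above.
cat > "$step_file" <<'EOF'
[
  {
    "Name": "JSON Streaming Step",
    "Args": ["-files","s3://elasticmapreduce/samples/wordcount/wordSplitter.py","-mapper","wordSplitter.py","-reducer","aggregate","-input","s3://elasticmapreduce/samples/wordcount/input","-output","s3://mybucket/wordcount/output"],
    "ActionOnFailure": "CONTINUE",
    "Type": "STREAMING"
  }
]
EOF

# Optional syntax check: json.tool exits non-zero if the file is not valid JSON.
json_status="not checked"
if command -v python3 >/dev/null 2>&1; then
  if python3 -m json.tool "$step_file" >/dev/null 2>&1; then
    json_status="valid"
  else
    json_status="invalid"
  fi
fi
echo "step.json: $json_status"

# Print (do not run) the command that would consume the file.
echo "aws emr create-cluster --steps file://$step_file --release-label emr-4.0.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate"
```

The same check applies to the other JSON files in these examples, such as ec2_attributes.json and emrfs.json.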

15. To use multiple files in a streaming step (JSON only)

  • JSON (multiplefiles.json):

    [
      {
         "Name": "JSON Streaming Step",
         "Type": "STREAMING",
         "ActionOnFailure": "CONTINUE",
         "Args": [
             "-files",
             "s3://mybucket/mapper.py,s3://mybucket/reducer.py",
             "-mapper",
             "mapper.py",
             "-reducer",
             "reducer.py",
             "-input",
             "s3://mybucket/input",
             "-output",
             "s3://mybucket/output"]
      }
    ]
    
  • Command:

    aws emr create-cluster --steps file://./multiplefiles.json --release-label emr-5.3.1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate

16. To add Hive steps when creating an Amazon EMR cluster

  • Command:

    aws emr create-cluster --steps Type=HIVE,Name='Hive program',ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/hive-ads/libs/model-build.q,-d,INPUT=s3://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3://mybucket/hive-ads/output/2014-04-18/11-07-32,-d,LIBS=s3://elasticmapreduce/samples/hive-ads/libs] --applications Name=Hive --release-label emr-5.3.1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
  • Hive steps required parameters:

    Type, Args
    
  • Hive steps optional parameters:

    Name, ActionOnFailure
    

17. To add Pig steps when creating an Amazon EMR cluster

  • Command:

    aws emr create-cluster --steps Type=PIG,Name='Pig program',ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/pig-apache/do-reports2.pig,-p,INPUT=s3://elasticmapreduce/samples/pig-apache/input,-p,OUTPUT=s3://mybucket/pig-apache/output] --applications Name=Pig --release-label emr-5.3.1  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
  • Pig steps required parameters:

    Type, Args
    
  • Pig steps optional parameters:

    Name, ActionOnFailure
    

18. Add a list of bootstrap actions when creating an Amazon EMR cluster

  • Command:

    aws emr create-cluster --bootstrap-actions Path=s3://mybucket/myscript1,Name=BootstrapAction1,Args=[arg1,arg2] Path=s3://mybucket/myscript2,Name=BootstrapAction2,Args=[arg1,arg2] --release-label emr-5.3.1  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate

19. To enable consistent view in EMRFS and change the RetryCount and RetryPeriod settings when creating an Amazon EMR cluster

  • Command:

    aws emr create-cluster --instance-type m3.xlarge --release-label emr-5.3.1 --emrfs Consistent=true,RetryCount=5,RetryPeriod=30
  • Required parameters:

    Consistent=true
    
  • JSON equivalent (contents of emrfs.json):

    {
      "Consistent": true,
      "RetryCount": 5,
      "RetryPeriod": 30
    }
    
  • Command (using emrfs.json):

    aws emr create-cluster --instance-type m3.xlarge --release-label emr-5.3.1 --emrfs file://emrfs.json

20. To enable consistent view with arguments (for example, to change the DynamoDB read and write capacity) when creating an Amazon EMR cluster

  • Command:

    aws emr create-cluster --instance-type m3.xlarge --release-label emr-5.3.1 --emrfs Consistent=true,RetryCount=5,RetryPeriod=30,Args=[fs.s3.consistent.metadata.read.capacity=600,fs.s3.consistent.metadata.write.capacity=300]
  • Required parameters:

    Consistent=true
    
  • JSON equivalent (contents of emrfs.json):

    {
      "Consistent": true,
      "RetryCount": 5,
      "RetryPeriod": 30,
      "Args":["fs.s3.consistent.metadata.read.capacity=600", "fs.s3.consistent.metadata.write.capacity=300"]
    }
    
  • Command (using emrfs.json):

    aws emr create-cluster --instance-type m3.xlarge --release-label emr-5.3.1 --emrfs file://emrfs.json
  • Command (using a custom AMI ID):

    aws emr create-cluster --instance-type m3.xlarge --release-label emr-5.3.1 --custom-ami-id ami-9be6f38c
  • Command (using a custom EBS root volume size):

    aws emr create-cluster --instance-type m3.xlarge --release-label emr-5.3.1 --ebs-root-volume-size 20
  • Command (repository upgrade option on instance boot; this can be used only with custom AMIs):

    aws emr create-cluster --instance-type m3.xlarge --release-label emr-5.3.1 --repo-upgrade-on-boot ${RepoUpgrade}
    
    where ${RepoUpgrade} is one of:

       SECURITY
       NONE
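
To make the ${RepoUpgrade} placeholder concrete, the sketch below picks one of the two listed values, rejects anything else, and prints the resulting command without running it. Because the option applies only to custom AMIs, the sketch also passes --custom-ami-id, reusing the placeholder AMI ID from the custom AMI command above. The repo_upgrade variable name is an assumption introduced here for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical variable holding the chosen value; the listed values are SECURITY and NONE.
repo_upgrade="NONE"

# Reject values other than the two listed above before building the command.
case "$repo_upgrade" in
  SECURITY|NONE) ;;
  *) echo "unsupported --repo-upgrade-on-boot value: $repo_upgrade" >&2; exit 1 ;;
esac

# The option applies only to custom AMIs, so a custom AMI ID is supplied as well.
cmd="aws emr create-cluster --instance-type m3.xlarge --release-label emr-5.3.1 --custom-ami-id ami-9be6f38c --repo-upgrade-on-boot $repo_upgrade"

# Print only; nothing is sent to AWS.
echo "$cmd"
```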

21. To create an Amazon EMR cluster with Kerberos configured

  • Command:

    aws emr create-cluster --instance-type m3.xlarge --release-label emr-5.10.0 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --security-configuration mySecurityConfiguration --kerberos-attributes Realm=EC2.INTERNAL,KdcAdminPassword=123,CrossRealmTrustPrincipalPassword=123
  • JSON equivalent (contents of kerberos_attributes.json):

    {
      "Realm": "EC2.INTERNAL",
      "KdcAdminPassword": "123",
      "CrossRealmTrustPrincipalPassword": "123"
    }
    
  • Command (using kerberos_attributes.json):

    aws emr create-cluster --instance-type m3.xlarge --release-label emr-5.10.0 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --security-configuration mySecurityConfiguration --kerberos-attributes file://kerberos_attributes.json