Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Import Files with Distributed Cache

Distributed Cache is a Hadoop feature that can boost efficiency when a map or a reduce task needs access to common data. If your cluster depends on existing applications or binaries that are not installed when the cluster is created, you can use Distributed Cache to import these files. This feature lets a cluster node read the imported files from its local file system, instead of retrieving the files from other cluster nodes.

For more information, go to http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html.

You invoke Distributed Cache when you create the cluster. The files are cached just before the Hadoop job starts and remain cached for the duration of the job. You can cache files stored on any Hadoop-compatible file system, for example HDFS or Amazon S3. The default size of the file cache is 10 GB. To change the size of the cache, reconfigure the Hadoop parameter local.cache.size using a bootstrap action, as described in (Optional) Create Bootstrap Actions to Install Additional Software.
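As an illustrative sketch, the following command raises the cache size to roughly 20 GB at cluster creation. It assumes the standard configure-hadoop bootstrap action published in the elasticmapreduce S3 bucket and assumes that local.cache.size can be set through that action's -m (mapred-site) flag; the value is in bytes.

    # Assumes the configure-hadoop bootstrap action and its -m (mapred-site) flag;
    # 21474836480 bytes is roughly 20 GB.
    aws emr create-cluster --name "Larger Distributed Cache" --ami-version 3.3 \
    --use-default-roles --ec2-attributes KeyName=myKey \
    --instance-type m3.xlarge --instance-count 3 \
    --bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,\
    Args=["-m","local.cache.size=21474836480"]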

Supported File Types

Distributed Cache allows both single files and archives. Individual files are cached as read only. Executables and binary files have execution permissions set.

Archives are one or more files packaged using a utility, such as gzip. Distributed Cache passes the compressed files to each slave node and decompresses the archive as part of caching. Distributed Cache supports the following compression formats (a packaging example follows this list):

  • zip

  • tgz

  • tar.gz

  • tar

  • jar
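For example, one way to package a directory of data files as a tar.gz archive and copy it to Amazon S3 is shown below; the directory, bucket, and file names are placeholders.

    # Package the contents of the local "dataset" directory into a gzip-compressed tar archive.
    tar -czf sample_dataset.tgz -C dataset .
    # Copy the archive to Amazon S3 so a cluster step can add it to the Distributed Cache.
    aws s3 cp sample_dataset.tgz s3://my_bucket/sample_dataset.tgz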

Location of Cached Files

Distributed Cache copies files to slave nodes only. If there are no slave nodes in the cluster, Distributed Cache copies the files to the master node.

Distributed Cache associates the cache files to the current working directory of the mapper and reducer using symlinks. A symlink is an alias to a file location, not the actual file location. The value of the Hadoop parameter, mapred.local.dir, specifies the location of temporary files. Amazon Elastic MapReduce (Amazon EMR) sets this parameter to /mnt/var/lib/hadoop/mapred/. Cache files are located in a subdirectory of the temporary file location at /mnt/var/lib/hadoop/mapred/taskTracker/archive/.

If you cache a single file, Distributed Cache puts the file in the archive directory. If you cache an archive, Distributed Cache decompresses the file and creates a subdirectory in the archive directory with the same name as the archive file. The individual files are placed in that new subdirectory.
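As an illustrative sketch only (the exact nesting under the archive directory can vary by Hadoop version), you can see this layout by listing the cache directory from a shell on a node: a cached single file appears under the archive directory, and a cached archive is unpacked into a subdirectory named after the archive file.

    # List the Distributed Cache directory on a node (default Amazon EMR location described above);
    # the exact subdirectory structure is an assumption and may vary.
    ls -R /mnt/var/lib/hadoop/mapred/taskTracker/archive/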

You can use Distributed Cache only when using Streaming.

Access Cached Files From Streaming Applications

To access the cached files from your mapper or reducer applications, make sure that you have added the current working directory (./) to your application path and that you reference the cached files as though they are present in the current working directory.

For more information, go to http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#DistributedCache.
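As a minimal sketch (a hypothetical mapper, not one of this guide's sample scripts), a Streaming mapper can open the cached file by the name given after the pound (#) sign, relative to its working directory:

    #!/bin/bash
    # Hypothetical streaming mapper: for each input line, counts matching records in the
    # cached file. The name after the # sign (sample_dataset_cached.dat) is resolved
    # relative to the task's current working directory, where Distributed Cache
    # symlinks the file.
    CACHED=./sample_dataset_cached.dat
    while IFS= read -r line; do
      matches=$(grep -c -F -- "$line" "$CACHED")
      printf '%s\t%s\n' "$line" "$matches"
    done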

Access Cached Files From Streaming Applications Using the Amazon EMR Console

You can use the Amazon EMR console to create clusters that use Distributed Cache.

To specify Distributed Cache files using the console

  1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create cluster.

  3. In the Steps section, in the Add step field, choose Streaming program from the list and click Configure and add.

  4. In the Arguments field, include the files and archives to save to the cache and click Add.

    The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.

    If you want to add an individual file to the Distributed Cache, specify -cacheFile followed by the name and location of the file, the pound (#) sign, and then the name you want to give the file when it's placed in the local cache. For example:

      -cacheFile \
      s3://bucket_name/file_name#cache_file_name

    If you want to add an archive file to the Distributed Cache, specify -cacheArchive followed by the location of the files in Amazon S3, the pound (#) sign, and then the name you want to give the collection of files in the local cache. For example:

      -cacheArchive \
      s3://bucket_name/archive_name#cache_archive_name
    (Figure: Distributed Cache Configuration)
  5. Proceed with configuring and launching your cluster. Your cluster copies the files to the cache location before processing any cluster steps.

Access Cached Files From Streaming Applications Using the AWS CLI or the Amazon EMR CLI

You can use the CLI to create clusters that use Distributed Cache.

To specify Distributed Cache files using the AWS CLI

  • To submit a Streaming step when a cluster is created, type the create-cluster command with the --steps parameter. To specify Distributed Cache files using the AWS CLI, specify the appropriate arguments when submitting a Streaming step.

    If you want to add an individual file to the Distributed Cache, specify -cacheFile followed by the name and location of the file, the pound (#) sign, and then the name you want to give the file when it's placed in the local cache.

    If you want to add an archive file to the Distributed Cache, specify -cacheArchive followed by the location of the files in Amazon S3, the pound (#) sign, and then the name you want to give the collection of files in the local cache.

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

Example 1

Type the following command to launch a cluster and submit a Streaming step that uses -cacheFile to add one file, sample_dataset_cached.dat, to the cache. The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x.

For Hadoop 2.x, type the following command and replace myKey with the name of your EC2 key pair.

  • Linux, UNIX, and Mac OS X users:

    aws emr create-cluster --name "Test cluster" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig \
    --use-default-roles --ec2-attributes KeyName=myKey \
    --instance-type m3.xlarge --instance-count 3 \
    --steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,\
    Args=["--files","s3://my_bucket/my_mapper.py s3://my_bucket/my_reducer.py","-mapper","my_mapper.py","-reducer","my_reducer.py,"-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output", "-cacheFile","s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat"]
  • Windows users:

    aws emr create-cluster --name "Test cluster" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 --steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,Args=["--files","s3://my_bucket/my_mapper.py s3://my_bucket/my_reducer.py","-mapper","my_mapper.py","-reducer","my_reducer.py,"-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output", "-cacheFile","s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat"]

For Hadoop 1.x, use the following command and replace myKey with the name of your EC2 key pair.

  • Linux, UNIX, and Mac OS X users:

    aws emr create-cluster --name "Test cluster" --ami-version 2.4 --applications Name=Hue Name=Hive Name=Pig \
    --use-default-roles --ec2-attributes KeyName=myKey \
    --instance-type m3.xlarge --instance-count 3 \
    --steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,\
    Args=["-mapper","my_mapper.py","-reducer","my_reducer.py,"-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output", "-cacheFile","s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat"]
  • Windows users:

    aws emr create-cluster --name "Test cluster" --ami-version 2.4 --applications Name=Hue Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 --steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,Args=["-mapper","my_mapper.py","-reducer","my_reducer.py,"-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output", "-cacheFile","s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat"]

When you specify the instance count without using the --instance-groups parameter, a single Master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

Note

If you have not previously created the default EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.


Example 2

The following command shows the creation of a streaming cluster and uses -cacheArchive to add an archive of files to the cache. The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x.

For Hadoop 2.x, use the following command and replace myKey with the name of your EC2 key pair.

  • Linux, UNIX, and Mac OS X users:

    aws emr create-cluster --name "Test cluster" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig \
    --use-default-roles --ec2-attributes KeyName=myKey \
    --instance-type m3.xlarge --instance-count 3 \
    --steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,\
    Args=["--files","s3://my_bucket/my_mapper.py s3://my_bucket/my_reducer.py","-mapper","my_mapper.py","-reducer","my_reducer.py,"-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output", "-cacheArchive","s3://my_bucket/sample_dataset.tgz#sample_dataset_cached"]
  • Windows users:

    aws emr create-cluster --name "Test cluster" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 --steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,Args=["--files","s3://my_bucket/my_mapper.py s3://my_bucket/my_reducer.py","-mapper","my_mapper.py","-reducer","my_reducer.py,"-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output", "-cacheArchive","s3://my_bucket/sample_dataset.tgz#sample_dataset_cached"]

For Hadoop 1.x, use the following command and replace myKey with the name of your EC2 key pair.

  • Linux, UNIX, and Mac OS X users:

    aws emr create-cluster --name "Test cluster" --ami-version 2.4 --applications Name=Hue Name=Hive Name=Pig \
    --use-default-roles --ec2-attributes KeyName=myKey \
    --instance-type m3.xlarge --instance-count 3 \
    --steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,\
    Args=["-mapper","my_mapper.py","-reducer","my_reducer.py,"-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output", "-cacheArchive","s3://my_bucket/sample_dataset.tgz#sample_dataset_cached"]
  • Windows users:

    aws emr create-cluster --name "Test cluster" --ami-version 2.4 --applications Name=Hue Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 --steps Type=STREAMING,Name="Streaming program",ActionOnFailure=CONTINUE,Args=["-mapper","my_mapper.py","-reducer","my_reducer.py,"-input","s3://my_bucket/my_input","-output","s3://my_bucket/my_output", "-cacheArchive","s3://my_bucket/sample_dataset.tgz#sample_dataset_cached"]

When you specify the instance count without using the --instance-groups parameter, a single Master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

Note

If you have not previously created the default EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.


To specify Distributed Cache files using the Amazon EMR CLI

You can use the Amazon EMR CLI to create clusters that use Distributed Cache. To add files or archives to the Distributed Cache using the CLI, specify the --cache or --cache-archive option on the CLI command line.

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • Create a cluster and add the following parameters:

    For information on how to create a cluster using the Amazon EMR CLI, go to Launch a Cluster and Submit a Streaming Step.

    The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.

    If you want to add an individual file to the Distributed Cache, specify --cache followed by the name and location of the file, the pound (#) sign, and then the name you want to give the file when it's placed in the local cache.

    If you want to add an archive file to the Distributed Cache, specify --cache-archive followed by the location of the files in Amazon S3, the pound (#) sign, and then the name you want to give the collection of files in the local cache.

    Your cluster copies the files to the cache location before processing any job flow steps.

Example 1

The following command shows the creation of a streaming cluster and uses --cache to add one file, sample_dataset_cached.dat, to the cache.

In the directory where you installed the Amazon EMR CLI, type the following command. For more information, see the Command Line Interface Reference for Amazon EMR.

The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x.

For Hadoop 2.x, use the following command:

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --create --stream \
       --arg "--files" --arg "s3://my_bucket/my_mapper.py,s3://my_bucket/my_reducer.py" \
       --input s3://my_bucket/my_input \
       --output s3://my_bucket/my_output \
       --mapper my_mapper.py \
       --reducer my_reducer.py \
       --cache s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat
  • Windows users:

    ruby elastic-mapreduce --create --stream --arg "-files" --arg "s3://my_bucket/my_mapper.py,s3://my_bucket/my_reducer.py" --input s3://my_bucket/my_input --output s3://my_bucket/my_output --mapper my_mapper.py --reducer my_reducer.py --cache s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat

For Hadoop 1.x, use the following command:

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --create --stream \
       --input s3://my_bucket/my_input \
       --output s3://my_bucket/my_output \
       --mapper s3://my_bucket/my_mapper.py \
       --reducer s3://my_bucket/my_reducer.py \
       --cache s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat
  • Windows users:

    ruby elastic-mapreduce --create --stream --input s3://my_bucket/my_input --output s3://my_bucket/my_output --mapper s3://my_bucket/my_mapper.py --reducer s3://my_bucket/my_reducer.py --cache s3://my_bucket/sample_dataset.dat#sample_dataset_cached.dat

Example 2

The following command shows the creation of a streaming cluster and uses --cache-archive to add an archive of files to the cache.

The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x.

For Hadoop 2.x, use the following command:

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --create --stream \
      --arg "--files" --arg "s3://my_bucket/my_mapper.py,s3://my_bucket/my_reducer.py" \
      --input s3://my_bucket/my_input \
      --output s3://my_bucket/my_output \
      --mapper my_mapper.py \
      --reducer my_reducer.py \
      --cache-archive s3://my_bucket/sample_dataset.tgz#sample_dataset_cached
  • Windows users:

    ruby elastic-mapreduce --create --stream --arg "-files" --arg "s3://my_bucket/my_mapper.py,s3://my_bucket/my_reducer.py" --input s3://my_bucket/my_input --output s3://my_bucket/my_output --mapper my_mapper.py --reducer my_reducer.py --cache-archive s3://my_bucket/sample_dataset.tgz#sample_dataset_cached

For Hadoop 1.x, use the following command:

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --create --stream \
      --input s3://my_bucket/my_input \
      --output s3://my_bucket/my_output \
      --mapper s3://my_bucket/my_mapper.py \
      --reducer s3://my_bucket/my_reducer.py \
      --cache-archive s3://my_bucket/sample_dataset.tgz#sample_dataset_cached
  • Windows users:

    ruby elastic-mapreduce --create --stream --input s3://my_bucket/my_input --output s3://my_bucket/my_output --mapper s3://my_bucket/my_mapper.py --reducer s3://my_bucket/my_reducer.py --cache-archive s3://my_bucket/sample_dataset.tgz#sample_dataset_cached