Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Import files using Distributed Cache

Distributed Cache is a Hadoop feature that can boost efficiency when a map or a reduce task needs access to common data. If your cluster depends on existing applications or binaries that are not installed when the cluster is created, you can use Distributed Cache to import these files. This feature lets a cluster node read the imported files from its local file system, instead of retrieving the files from other cluster nodes.

For more information, go to http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html.

You invoke Distributed Cache when you create the cluster. The files are cached just before the Hadoop job starts, and they remain cached for the duration of the job. You can cache files stored on any Hadoop-compatible file system, for example HDFS or Amazon S3 native (s3n). The default size of the file cache is 10 GB. To change the size of the cache, reconfigure the Hadoop parameter local.cache.size using a bootstrap action. For more information, see Create Bootstrap Actions to Install Additional Software (Optional).
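As a sketch of that reconfiguration, the configure-hadoop bootstrap action (whose -m flag sets mapred-site values) can raise the cache size when the cluster is created; the bucket and script names below are hypothetical placeholders:

```shell
# Sketch: create a streaming cluster with a 20 GB Distributed Cache.
# The configure-hadoop bootstrap action sets Hadoop configuration values;
# -m writes a mapred-site key=value pair. Bucket/script names are examples.
./elastic-mapreduce --create --stream \
   --input s3n://my_bucket/my_input \
   --output s3n://my_bucket/my_output \
   --mapper s3n://my_bucket/my_mapper.py \
   --reducer s3n://my_bucket/my_reducer.py \
   --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
   --args "-m,local.cache.size=21474836480"
```

The value is in bytes (21474836480 bytes = 20 GB).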

Supported File Types

Distributed Cache supports both single files and archives. Individual files are cached as read-only. Executables and binary files have execution permissions set.

Archives are one or more files packaged using an archiving utility, such as tar or zip. Distributed Cache passes the compressed files to each slave node and decompresses the archive as part of caching. Distributed Cache supports the following compression formats:

  • zip

  • tgz

  • tar.gz

  • tar

  • jar
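You create the archive with standard tools before uploading it to Amazon S3. For example, a tgz archive of a hypothetical sample_dataset directory can be built with tar:

```shell
# Package two data files into a .tgz archive that Distributed Cache
# can decompress on each slave node. File names are illustrative.
mkdir -p sample_dataset
echo "alpha" > sample_dataset/part-00000
echo "beta"  > sample_dataset/part-00001
tar czf sample_dataset.tgz -C sample_dataset .
# Inspect the archive contents before uploading it to Amazon S3.
tar tzf sample_dataset.tgz
```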

Location of Cached Files

Distributed Cache copies files to slave nodes only. If there are no slave nodes in the cluster, Distributed Cache copies the files to the master node.

Distributed Cache links the cache files into the current working directory of the mapper and reducer using symlinks. A symlink is an alias to a file location, not the actual file. The Hadoop parameter mapred.local.dir specifies the location of temporary files. Amazon Elastic MapReduce (Amazon EMR) sets this parameter to /mnt/var/lib/hadoop/mapred/. Cache files are located in a subdirectory of that location, at /mnt/var/lib/hadoop/mapred/taskTracker/archive/.

If you cache a single file, Distributed Cache puts the file in the archive directory. If you cache an archive, Distributed Cache decompresses the archive and creates a subdirectory in the archive directory with the same name as the archive file; the individual files are located in that new subdirectory.

You can use Distributed Cache only when creating streaming clusters.

Access Cached Files From Mapper and Reducer Applications

To access the cached files from your mapper or reducer applications, make sure that you have added the current working directory (./) to your application path, and reference the cached files as though they were present in the current working directory.
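A minimal sketch of what this looks like in a streaming mapper, assuming a hypothetical tab-delimited lookup file cached under the local name sample_dataset_cached.dat (the name given after the # sign on the cache argument):

```python
#!/usr/bin/env python
# Streaming-mapper sketch: the cached file is opened via the symlink that
# Distributed Cache places in the current working directory. The file name
# and its key<TAB>value format are hypothetical examples.
import sys

def load_lookup(path="./sample_dataset_cached.dat"):
    """Read the cached lookup file into a dict."""
    lookup = {}
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t", 1)
            lookup[key] = value
    return lookup

def run(stdin=sys.stdin, stdout=sys.stdout):
    """Annotate each input record with its value from the cached lookup."""
    lookup = load_lookup()
    for line in stdin:
        word = line.strip()
        # Emit the enriched record; unknown keys pass through as UNKNOWN.
        stdout.write("%s\t%s\n" % (word, lookup.get(word, "UNKNOWN")))
```

In the actual mapper script you would call run() at module scope so that Hadoop streaming drives it through standard input and output.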

For more information, go to http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#DistributedCache.

Amazon EMR Console

You can use the Amazon EMR console to create clusters that use Distributed Cache.

To specify Distributed Cache files

  1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create cluster.

  3. In the Steps section, in the Add step field, choose Streaming program from the list and click Configure and add.

  4. In the Arguments field, include the files and archives to save to the cache and click Add.

    The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.

    To add an individual file to the Distributed Cache, specify -cacheFile followed by the name and location of the file, the pound (#) sign, and then the name you want to give the file when it's placed in the local cache. For example:

      -cacheFile \
      s3n://bucket_name/file_name#cache_file_name

    To add an archive file to the Distributed Cache, enter -cacheArchive followed by the location of the files in Amazon S3, the pound (#) sign, and then the name you want to give the collection of files in the local cache. For example:

      -cacheArchive \
      s3n://bucket_name/archive_name#cache_archive_name
  5. Proceed with configuring and launching your streaming cluster. Your cluster copies the files to the cache location before processing any cluster steps.

CLI

You can use the Amazon EMR CLI to create clusters that use Distributed Cache. To add files or archives to the Distributed Cache using the CLI, specify the --cache or --cache-archive option on the CLI command line.

To specify Distributed Cache files

  • Create a streaming cluster and add the following parameters:

    For information on how to create a streaming cluster using the CLI, go to Launch a Streaming Cluster.

    The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.

    To add an individual file to the Distributed Cache, specify --cache followed by the name and location of the file, the pound (#) sign, and then the name you want to give the file when it's placed in the local cache.

    To add an archive file to the Distributed Cache, enter --cache-archive followed by the location of the files in Amazon S3, the pound (#) sign, and then the name you want to give the collection of files in the local cache.

    The output looks similar to the following.

    Created jobflow JobFlowID

    Your cluster copies the files to the cache location before processing any job flow steps.

Example 1

The following command shows the creation of a streaming cluster and uses --cache to add one file, sample_dataset.dat, to the cache under the local name sample_dataset_cached.dat.

In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x.

For Hadoop 2.x, use the following command:

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --create --stream \
       --arg "-files" --arg "s3n://my_bucket/my_mapper.py,s3n://my_bucket/my_reducer.py" \
       --input s3n://my_bucket/my_input \
       --output s3n://my_bucket/my_output \
       --mapper my_mapper.py \
       --reducer my_reducer.py \
       --cache s3n://my_bucket/sample_dataset.dat#sample_dataset_cached.dat
  • Windows users:

    ruby elastic-mapreduce --create --stream --arg "-files" --arg "s3n://my_bucket/my_mapper.py,s3n://my_bucket/my_reducer.py" --input s3n://my_bucket/my_input --output s3n://my_bucket/my_output --mapper my_mapper.py --reducer my_reducer.py --cache s3n://my_bucket/sample_dataset.dat#sample_dataset_cached.dat

For Hadoop 1.x, use the following command:

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --create --stream \
       --input s3n://my_bucket/my_input \
       --output s3n://my_bucket/my_output \
       --mapper s3n://my_bucket/my_mapper.py \
       --reducer s3n://my_bucket/my_reducer.py \
       --cache s3n://my_bucket/sample_dataset.dat#sample_dataset_cached.dat
  • Windows users:

    ruby elastic-mapreduce --create --stream --input s3n://my_bucket/my_input --output s3n://my_bucket/my_output --mapper s3n://my_bucket/my_mapper.py --reducer s3n://my_bucket/my_reducer.py --cache s3n://my_bucket/sample_dataset.dat#sample_dataset_cached.dat

Example 2

The following command shows the creation of a streaming cluster and uses --cache-archive to add an archive of files to the cache.

The Hadoop streaming syntax is different between Hadoop 1.x and Hadoop 2.x.

For Hadoop 2.x, use the following command:

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --create --stream \
      --arg "-files" --arg "s3n://my_bucket/my_mapper.py,s3n://my_bucket/my_reducer.py" \
      --input s3n://my_bucket/my_input \
      --output s3n://my_bucket/my_output \
      --mapper my_mapper.py \
      --reducer my_reducer.py \
      --cache-archive s3n://my_bucket/sample_dataset.tgz#sample_dataset_cached
  • Windows users:

    ruby elastic-mapreduce --create --stream --arg "-files" --arg "s3n://my_bucket/my_mapper.py,s3n://my_bucket/my_reducer.py" --input s3n://my_bucket/my_input --output s3n://my_bucket/my_output --mapper my_mapper.py --reducer my_reducer.py --cache-archive s3n://my_bucket/sample_dataset.tgz#sample_dataset_cached

For Hadoop 1.x, use the following command:

  • Linux, UNIX, and Mac OS X users:

    ./elastic-mapreduce --create --stream \
      --input s3n://my_bucket/my_input \
      --output s3n://my_bucket/my_output \
      --mapper s3n://my_bucket/my_mapper.py \
      --reducer s3n://my_bucket/my_reducer.py \
      --cache-archive s3n://my_bucket/sample_dataset.tgz#sample_dataset_cached
  • Windows users:

    ruby elastic-mapreduce --create --stream --input s3n://my_bucket/my_input --output s3n://my_bucket/my_output --mapper s3n://my_bucket/my_mapper.py --reducer s3n://my_bucket/my_reducer.py --cache-archive s3n://my_bucket/sample_dataset.tgz#sample_dataset_cached