| « PreviousNext » | |
![]() ![]() ![]() | Did this page help you? Yes | No | Tell us about it... |
Topics
Distributed Cache is a Hadoop feature that allow you to transfer files from a distributed file system to the local file system. It can distribute data and text files as well as more complex types such as archives and jars. If your cluster depends on applications or binaries that are not installed when the cluster is created, you can use Distributed Cache to import these files. Using Distributed Cache can boost efficiency when a map or a reduce task needs access to common data. A cluster node can read files from its local file system, instead of retrieving the files from other cluster nodes.
You invoke Distributed Cache when you create the cluster. The files are cached just
before starting the Hadoop job and the files remain cached for the duration of the job. You
can cache files stored on any Hadoop-compatible file system, for example HDFS or S3 native.
The default size of the file cache is 10GB. To change the size of the cache, reconfigure
the Hadoop parameter, local.cache.size using the Create Bootstrap Actions to Install Additional Software (Optional)
bootstrap action.
Distributed Cache allows both single files and archives. Individual files are cached as read only. Executables and binary files have execution permissions set.
Archives are one or more files packaged using a utility, such as gzip.
Distributed Cache passes the compressed files to each slave node and decompresses the
archive as part of caching. Distributed Cache supports the following compression
formats:
zip
tgz
tar.gz
tar
jar
Distributed Cache copies files to slave nodes only. If there are no slave nodes in the cluster, Distributed Cache copies the files to the master node.
Distributed Cache associates the cache files to the current working directory of the
mapper and reducer using symlinks. A symlink is an alias to a file location, not the
actual file location. The value of the Hadoop parameter,
mapred.local.dir, specifies the location of temporary files.
Amazon Elastic MapReduce (Amazon EMR) sets this parameter to
/mnt/var/lib/hadoop/mapred/. Cache files are located in a
subdirectory of the temporary file location at
/mnt/var/lib/hadoop/mapred/taskTracker/archive/.
If you cache a single file, Distributed Cache puts the file in the
archive directory. If you cache an archive, Distributed Cache
decompresses the file, creates a subdirectory in /archive with the same
name as the archive file name. The individual files are located in the new
subdirectory.
You can use Distributed Cache only when creating streaming clusters.
To access the cached files from your mapper or reducer applications, make sure that you have added the current working directory (./) into your application path and referenced the cached files as though they are present in the current working directory.
For more information, go to http://hadoop.apache.org/docs/stable/mapred_tutorial.html#DistributedCache.
You can use the Amazon EMR console to create clusters that use Distributed Cache.
To specify Distributed Cache files
Launch the Create New Job Flow wizard, specify a streaming cluster, and click Continue.
For information on how to launch the Create New Job Flow wizard and specify a streaming cluster go to Launch a Streaming Cluster.
The Specify Parameters page opens.
In the Extra Args field, include the files and archives to save to the cache.
The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.
| If you want to ... | Action | Example |
|---|---|---|
| Add an individual file to the Distributed Cache | Specify -cacheFile followed by the name and location
of the file, the pound (#) sign, and then the name you want to give
the file when it's placed in the local cache. |
–cacheFile \ s3n://bucket_name/file_name#cache_file_name |
| Add an archive file to the Distributed Cache | Enter -cacheArchive followed by the location of the
files in Amazon S3, the pound (#) sign, and then the name you want to
give the collection of files in the local cache. |
–cacheArchive \ s3n://bucket_name/archive_name#cache_archive_name |

Proceed with configuring and launching your streaming cluster.
Your cluster copies the files to the cache location before processing any cluster steps.
You can use the Amazon EMR console to create clusters that use Distributed
Cache. To add files or archives to the Distributed Cache using the CLI, you specify the
options –-cache or --cache-archive to the
CLI command line.
To specify Distributed Cache files
Create a streaming cluster and add the following parameters:
For information on how to create a streaming cluster using the CLI, go to Launch a Streaming Cluster.
The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.
| If you want to ... | Add the following parameter to the cluster ... |
|---|---|
| add an individual file to the Distributed Cache | specify -cache followed by the name and location of
the file, the pound (#) sign, and then the name you want to give the
file when it's placed in the local cache. |
| add an archive file to the Distributed Cache | enter -cache-archive followed by the location of the
files in Amazon S3, the pound (#) sign, and then the name you want to
give the collection of files in the local cache. |
The output looks similar to the following.
Created jobflow JobFlowIDYour cluster copies the files to the cache location before processing any job flow steps.
Example 1
The following command shows the creation of a streaming cluster and uses
--cache to add one file,
sample_dataset_cached.dat, to the cache.
In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.
Linux, UNIX, and Mac OS X users:
./elastic-mapreduce --create --stream \ --input s3n://my_bucket/my_input \ --output s3n://my_bucket/my_output \ --mapper s3n://my_bucket/my_mapper.py \ --reducer s3n://my_bucket/my_reducer.py \ --cache s3n://my_bucket/sample_dataset.dat#sample_dataset_cached.dat
Windows users:
ruby elastic-mapreduce --create --stream --input s3n://my_bucket/my_input --output s3n://my_bucket/my_output --mapper s3n://my_bucket/my_mapper.py --reducer s3n://my_bucket/my_reducer.py --cache s3n://my_bucket/sample_dataset.dat#sample_dataset_cached.dat
Example 2
The following command shows the creation of a streaming cluster and uses
--cache-archive to add an archive of files to the cache.
Linux, UNIX, and Mac OS X users:
./elastic-mapreduce --create --stream \ --input s3n://my_bucket/my_input \ --output s3n://my_bucket/my_output \ --mapper s3n://my_bucket/my_mapper.py \ --reducer s3n://my_bucket/my_reducer.py \ --cache-archive s3n://my_bucket/sample_dataset.tgz#sample_dataset_cached
Windows users:
ruby elastic-mapreduce --create --stream --input s3n://my_bucket/my_input --output s3n://my_bucket/my_output --mapper s3n://my_bucket/my_mapper.py --reducer s3n://my_bucket/my_reducer.py --cache-archive s3n://my_bucket/sample_dataset.tgz#sample_dataset_cached
This section describes the Amazon EMR API Query request parameters needed to use Distributed Cache.
To specify Distributed Cache files
Create a streaming cluster and add the following parameters:
For information on how to create a streaming cluster using the CLI, go to Launch a Streaming Cluster.
The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.
| If you want to ... | Add the following parameter to the cluster ... | |
|---|---|---|
| add an individual file to the Distributed Cache | specify -cache followed by the name and location of
the file, the pound (#) sign, and then the name you want to give the
file when it's placed in the local cache. | |
| add an archive file to the Distributed Cache | enter -cache-archive followed by the location of the
files in Amazon S3, the pound (#) sign, and then the name you want to
give the collection of files in the local cache. |
The following JSON example describes a simple streaming cluster that uses the
Distributed Cache to store the file sample_data.dat.
[
{ "Name": "streaming cluster referencing distributed cache",
"HadoopJarStep":
{
"Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
"Args":
[
"-input", "s3n://elasticmapreduce/samples/wordcount/input",
"-output", "s3n://myawsbucket",
"-mapper", "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
"-reducer", "aggregate",
"-cache", "s3n://my_bucket/sample_data.dat#sample_data_cached.dat"
]
}}
]All paths are prefixed with their location. “s3://” refers to the Amazon S3 file
system. “s3n://” refers to the Amazon S3 native file system. If you use HDFS, prepend
the path with hdfs:///. Make sure to use three slashes (///), as in
hdfs:///home/hadoop/sampleInput2/.