HDFS configuration
The following table describes the default Hadoop Distributed File System (HDFS) parameters and their settings. You can change these values using the hdfs-site configuration classification. For more information, see Configure applications.
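As an illustration, a minimal hdfs-site classification that raises the replication factor might look like the following sketch. The property value shown is an example, not a recommendation for every workload.

```json
[
  {
    "Classification": "hdfs-site",
    "Properties": {
      "dfs.replication": "2"
    }
  }
]
```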
Warning
- Setting dfs.replication to 1 on clusters with fewer than four nodes can lead to HDFS data loss if a single node goes down. If your cluster has HDFS storage, we recommend that you configure the cluster with at least four core nodes for production workloads to avoid data loss.
- Amazon EMR does not allow clusters to scale core nodes below dfs.replication. For example, if dfs.replication = 2, the minimum number of core nodes is 2.
- When you use managed scaling, auto-scaling, or manually resize your cluster, we recommend that you set dfs.replication to 2 or higher.
Parameter | Definition | Default value |
---|---|---|
dfs.block.size | The size of HDFS blocks. When operating on data stored in HDFS, the split size is generally the size of an HDFS block. Larger numbers provide less task granularity, but also put less strain on the cluster NameNode. | 134217728 (128 MB) |
dfs.replication | The number of copies of each block to store for durability. Amazon EMR sets this value based on the number of core nodes the cluster is provisioned with. Adjust the value to meet your needs. To override the default value, use the hdfs-site classification. | Varies with the number of core nodes the cluster is provisioned with |
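To apply a classification when you create a cluster, you can pass the JSON to the AWS CLI with the --configurations option. The following sketch assumes the classification above is saved locally as hdfs-config.json; the cluster name, release label, and instance settings are placeholders to adapt to your environment.

```bash
# Illustrative only: the release label, instance type, and counts are placeholders.
# --instance-count 5 provisions one primary node and four core nodes,
# matching the four-core-node recommendation above.
aws emr create-cluster \
  --name "hdfs-config-example" \
  --release-label emr-6.15.0 \
  --applications Name=Hadoop \
  --instance-type m5.xlarge \
  --instance-count 5 \
  --use-default-roles \
  --configurations file://./hdfs-config.json
```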