EMRFS consistent view metadata

EMRFS consistent view uses a DynamoDB table to track objects in Amazon S3 that have been synced with or created by EMRFS. The metadata tracks all operations (read, write, update, and copy); no actual content from Amazon S3 is stored in it. EMRFS uses this metadata to validate that the objects or metadata received from Amazon S3 match what is expected. This confirmation gives EMRFS the ability to check list consistency and read-after-write consistency for new objects that EMRFS writes to Amazon S3 or for objects synced with EMRFS. Multiple clusters can share the same metadata.
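
Consistent view is enabled when you create a cluster; for example, the following sketch turns it on with the --emrfs option of the AWS CLI. The release label, application, and instance settings here are illustrative, and the metadata defaults to a DynamoDB table named EmrFSMetadata.

aws emr create-cluster --name "ConsistentViewCluster" \
--release-label emr-5.36.0 --applications Name=Hadoop \
--instance-type m5.xlarge --instance-count 3 \
--use-default-roles \
--emrfs Consistent=true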

How to add entries to metadata

You can use the sync or import subcommands to add entries to metadata. sync reflects the state of the Amazon S3 objects in a path, while import is used strictly to add new entries to the metadata. For more information, see EMRFS CLI Command Reference.
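
For example, assuming a bucket and prefix of your own, the following sketches sync or import metadata for a path, run from the cluster's master node:

emrfs sync s3://amzn-s3-demo-bucket/data/
emrfs import s3://amzn-s3-demo-bucket/data/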

How to check differences between metadata and objects in Amazon S3

To check for differences between the metadata and Amazon S3, use the diff subcommand of the EMRFS CLI. For more information, see EMRFS CLI Command Reference.
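
For example, assuming the same illustrative bucket and prefix:

emrfs diff s3://amzn-s3-demo-bucket/data/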

How to know if metadata operations are being throttled

EMRFS sets default throughput capacity limits on the metadata for its read and write operations at 400 and 100 units, respectively. Large numbers of objects or buckets may cause operations to exceed this capacity, at which point DynamoDB throttles operations. For example, EMRFS may throw a ProvisionedThroughputExceededException if an operation exceeds these capacity limits. Upon throttling, the EMRFS CLI tool attempts to retry writing to the DynamoDB table using exponential backoff until the operation finishes or until it reaches the maximum retry value for writing objects from Amazon EMR to Amazon S3.
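
If you see frequent throttling, you can also make EMRFS retry more persistently; for example, the following sketch uses the RetryCount and RetryPeriod keys of the --emrfs option when you create the cluster. The retry values, release label, and instance settings here are illustrative.

aws emr create-cluster --release-label emr-5.36.0 --applications Name=Hadoop \
--instance-type m5.xlarge --instance-count 3 --use-default-roles \
--emrfs Consistent=true,RetryCount=10,RetryPeriod=60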

You can configure your own throughput capacity limits. However, DynamoDB enforces strict partition limits of 3000 read capacity units (RCUs) and 1000 write capacity units (WCUs) per second. To avoid sync failures caused by throttling, we recommend that you limit read operations to fewer than 3000 RCUs and write operations to fewer than 1000 WCUs. For instructions on setting custom throughput capacity limits, see Configure consistent view.
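
For example, the following is a sketch of an emrfs-site configuration classification that raises the read and write capacity of the metadata table; the file name myConfig.json and the capacity values are illustrative. Pass the file to create-cluster with the --configurations option.

Contents of myConfig.json:

[
    {
        "Classification": "emrfs-site",
        "Properties": {
            "fs.s3.consistent.metadata.read.capacity": "600",
            "fs.s3.consistent.metadata.write.capacity": "300"
        }
    }
]

aws emr create-cluster --release-label emr-5.36.0 --applications Name=Hadoop \
--instance-type m5.xlarge --instance-count 3 --use-default-roles \
--emrfs Consistent=true --configurations file://./myConfig.json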

You can also view Amazon CloudWatch metrics for your EMRFS metadata in the DynamoDB console, where you can see the number of throttled read and write requests. A non-zero value for throttled requests indicates that your application may benefit from increased throughput capacity for read or write operations. You may also see a performance benefit if your operations approach the maximum allocated throughput capacity for reads or writes over an extended period of time.
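
You can also query these metrics from the command line. The following sketch sums throttled reads on the metadata table over one day, assuming the default table name EmrFSMetadata and an illustrative time range; substitute the WriteThrottleEvents metric to check throttled writes.

aws cloudwatch get-metric-statistics --namespace AWS/DynamoDB \
--metric-name ReadThrottleEvents \
--dimensions Name=TableName,Value=EmrFSMetadata \
--start-time 2024-06-01T00:00:00Z --end-time 2024-06-02T00:00:00Z \
--period 3600 --statistics Sum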

Throughput characteristics for notable EMRFS operations

The defaults for read and write operations are 400 and 100 throughput capacity units, respectively. The following performance characteristics give you an idea of the throughput required for certain operations. These tests were performed using a single-node m3.large cluster. All operations were single-threaded. Performance differs greatly based on particular application characteristics, and it may take experimentation to optimize file system operations.

Operation | Average read-per-second | Average write-per-second
create (object) | 26.79 | 6.70
delete (object) | 10.79 | 10.79
delete (directory containing 1,000 objects) | 21.79 | 338.40
getFileStatus (object) | 34.70 | 0
getFileStatus (directory) | 19.96 | 0
listStatus (directory containing 1 object) | 43.31 | 0
listStatus (directory containing 10 objects) | 44.34 | 0
listStatus (directory containing 100 objects) | 84.44 | 0
listStatus (directory containing 1,000 objects) | 308.81 | 0
listStatus (directory containing 10,000 objects) | 416.05 | 0
listStatus (directory containing 100,000 objects) | 823.56 | 0
listStatus (directory containing 1,000,000 objects) | 882.36 | 0
mkdir (continuous for 120 seconds) | 24.18 | 4.03
mkdir | 12.59 | 0
rename (object) | 19.53 | 4.88
rename (directory containing 1,000 objects) | 23.22 | 339.34

To submit a step that purges old data from your metadata store

You may want to remove particular entries from the DynamoDB-based metadata, which can help reduce storage costs associated with the table. You can purge particular entries manually or programmatically by using the EMRFS CLI delete subcommand. However, if you delete entries from the metadata, EMRFS no longer makes any consistency checks for those objects.

To purge entries programmatically after a job completes, submit a final step to your cluster that runs an EMRFS CLI command. For instance, type the following command to submit a step to your cluster that deletes all entries older than two days.

aws emr add-steps --cluster-id j-2AL4XXXXXX5T9 --steps Name="emrfsCLI",Jar="command-runner.jar",Args=["emrfs","delete","--time","2","--time-unit","days"]

{
    "StepIds": [
        "s-B12345678902"
    ]
}

Use the StepId value returned to check the logs for the result of the operation.
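
For example, assuming the cluster ID and step ID shown above, you can check the status of the step with:

aws emr describe-step --cluster-id j-2AL4XXXXXX5T9 --step-id s-B12345678902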