SUS04-BP05 Remove unneeded or redundant data
Remove unneeded or redundant data to minimize the storage resources required to store your datasets.
Common anti-patterns:
- You duplicate data that can be easily obtained or recreated.
- You back up all data without considering its criticality.
- You only delete data irregularly, on operational events, or not at all.
- You store data redundantly irrespective of the storage service's durability.
- You turn on Amazon S3 versioning without any business justification.
Benefits of establishing this best practice: Removing unneeded data reduces the storage size required for your workload and the workload's environmental impact.
Level of risk exposed if this best practice is not established: Medium
Implementation guidance
Do not store data that you do not need. Automate the deletion of unneeded data. Use technologies that deduplicate data at the file and block level. Leverage native data replication and redundancy features of services.
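As an illustration of file-level deduplication, the following is a minimal sketch that groups byte-identical files under a directory by content hash. The function names `file_digest` and `find_duplicates` are hypothetical and not part of any AWS service or SDK; managed features such as those listed in the implementation steps would normally be preferred over a hand-rolled scan.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> List[List[Path]]:
    """Group files under root whose contents are byte-for-byte identical."""
    # Bucket by size first so that only plausible candidates are hashed.
    by_size = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(paths) for paths in by_hash.values() if len(paths) > 1)
    return groups
```

Each returned group lists paths holding the same content, so all but one copy in a group is a candidate for removal once you have confirmed that nothing depends on the extra paths.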
Implementation steps
- Evaluate if you can avoid storing data by using existing publicly available datasets in AWS Data Exchange and Open Data on AWS.
- Use mechanisms that can deduplicate data at the block and object level. Here are some examples of how to deduplicate data on AWS:

  Storage service | Deduplication mechanism
  Amazon S3 | Use AWS Lake Formation FindMatches to find matching records across a dataset (including ones without identifiers) by using the FindMatches ML Transform.
  Amazon FSx | Use data deduplication on Amazon FSx for Windows File Server.
  Amazon EBS snapshots | Snapshots are incremental backups, which means that only the blocks on the device that have changed after your most recent snapshot are saved.

- Analyze the data access to identify unneeded data. Automate lifecycle policies. Leverage native service features like Amazon DynamoDB Time To Live, Amazon S3 Lifecycle, or Amazon CloudWatch log retention for deletion.
- Use data virtualization capabilities on AWS to maintain data at its source and avoid data duplication.
- Use backup technology that can make incremental backups.
- Leverage the durability of Amazon S3 and replication of Amazon EBS to meet your durability goals instead of self-managed technologies (such as a redundant array of independent disks (RAID)).
- Centralize log and trace data, deduplicate identical log entries, and establish mechanisms to tune verbosity when needed.
- Pre-populate caches only where justified.
- Establish cache monitoring and automation to resize the cache accordingly.
- Remove out-of-date deployments and assets from object stores and edge caches when pushing new versions of your workload.
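The automated deletion step above can be sketched as follows. This is an illustrative example under stated assumptions, not a definitive implementation: the helper names `s3_expiration_rule` and `ttl_epoch_seconds`, the rule ID format, and the seven-day multipart cleanup window are all hypothetical choices. The dictionary keys follow the rule structure accepted by the S3 `PutBucketLifecycleConfiguration` API, and the TTL helper returns epoch seconds, which is the format DynamoDB Time To Live reads from a Number attribute.

```python
import time
from typing import Optional

SECONDS_PER_DAY = 86_400


def s3_expiration_rule(prefix: str, expire_after_days: int,
                       noncurrent_days: Optional[int] = None) -> dict:
    """Build one S3 Lifecycle rule that expires objects under a prefix."""
    # Hypothetical ID scheme, chosen only so that rules are easy to recognize.
    label = prefix.strip("/").replace("/", "-") or "all"
    rule = {
        "ID": f"expire-{label}-{expire_after_days}d",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Expiration": {"Days": expire_after_days},
        # Assumed value: also clean up abandoned multipart uploads after 7 days.
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }
    if noncurrent_days is not None:
        # Only meaningful on versioned buckets: expire old object versions too.
        rule["NoncurrentVersionExpiration"] = {"NoncurrentDays": noncurrent_days}
    return rule


def ttl_epoch_seconds(retention_days: int, now: Optional[float] = None) -> int:
    """Return an expiry timestamp in epoch seconds for a DynamoDB TTL attribute."""
    current = time.time() if now is None else now
    return int(current) + retention_days * SECONDS_PER_DAY


# Example: expire temporary objects after 30 days, old versions after 7 days.
lifecycle_configuration = {"Rules": [s3_expiration_rule("tmp/", 30, noncurrent_days=7)]}
```

With boto3, a configuration like `lifecycle_configuration` could be applied through the S3 client's `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=...)` call, and the TTL value would be written to whichever item attribute the table's Time To Live setting names.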