SUS04-BP05 Remove unneeded or redundant data - Sustainability Pillar


Remove unneeded or redundant data to minimize the storage resources required to store your datasets.

Common anti-patterns:

  • You duplicate data that can be easily obtained or recreated.

  • You back up all data without considering its criticality.

  • You delete data irregularly, only in response to operational events, or not at all.

  • You store data redundantly irrespective of the storage service's durability.

  • You turn on Amazon S3 versioning without any business justification.

Benefits of establishing this best practice: Removing unneeded data reduces the storage required by your workload and its environmental impact.

Level of risk exposed if this best practice is not established: Medium

Implementation guidance

Do not store data that you do not need. Automate the deletion of unneeded data. Use technologies that deduplicate data at the file and block level. Leverage native data replication and redundancy features of services.

Implementation steps

  • Evaluate if you can avoid storing data by using existing publicly available datasets in AWS Data Exchange and Open Data on AWS.

  • Use mechanisms that can deduplicate data at the block and object level. Here are some examples of how to deduplicate data on AWS:

    Storage service: Deduplication mechanism

    Amazon S3: Use AWS Lake Formation FindMatches to find matching records across a dataset, including records without a common identifier, with the FindMatches ML transform.

    Amazon FSx: Use data deduplication on Amazon FSx for Windows File Server.

    Amazon EBS snapshots: Snapshots are incremental backups; only the blocks on the device that have changed after your most recent snapshot are saved.
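The block-level deduplication described above can be illustrated in principle (a generic sketch, not an AWS API): split data into fixed-size blocks, hash each block, and keep only one copy per unique hash. The `BLOCK_SIZE` value and function names here are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for illustration

def dedup_store(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks and keep one copy per unique block.

    Returns the ordered list of block hashes needed to reconstruct the data.
    """
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # identical blocks are stored once
        recipe.append(digest)
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Reassemble the original data from its block recipe."""
    return b"".join(store[digest] for digest in recipe)
```

Storing the same data twice adds recipe entries but no new blocks, which is where the storage saving comes from.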

  • Analyze data access patterns to identify unneeded data. Automate lifecycle policies. Leverage native service features like Amazon DynamoDB Time To Live, Amazon S3 Lifecycle, or Amazon CloudWatch log retention to delete data automatically.
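As a sketch of automating deletion with Amazon S3 Lifecycle, the helper below builds a rule in the structure that boto3's `put_bucket_lifecycle_configuration` accepts. The prefix, retention periods, and bucket name are illustrative assumptions, not recommendations.

```python
def expiration_lifecycle_rule(prefix: str, days: int, noncurrent_days: int) -> dict:
    """Build an S3 Lifecycle rule that expires current objects after `days`,
    deletes noncurrent versions after `noncurrent_days`, and cleans up
    incomplete multipart uploads."""
    return {
        "ID": f"expire-{prefix or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }

rule = expiration_lifecycle_rule("tmp/", 30, 7)
# Applying it would look like (bucket name is a placeholder):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket",
#     LifecycleConfiguration={"Rules": [rule]},
# )
```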

  • Use data virtualization capabilities on AWS to maintain data at its source and avoid data duplication.

  • Use backup technology that can make incremental backups.
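The incremental idea can be sketched in a few lines, assuming content hashes as the change detector: only files whose hash differs from the previous backup's index are stored again. Function and parameter names are hypothetical.

```python
import hashlib

def incremental_backup(files: dict, previous_index: dict) -> tuple:
    """Given a mapping of path -> content and the hash index from the last
    backup, return (files that changed and must be stored, updated index)."""
    changed = {}
    index = {}
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        index[path] = digest
        if previous_index.get(path) != digest:
            changed[path] = content  # unchanged files are not copied again
    return changed, index
```

The first backup stores everything; subsequent runs store only the delta, which is the property EBS snapshots provide at the block level.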

  • Leverage the durability of Amazon S3 and replication of Amazon EBS to meet your durability goals instead of self-managed technologies (such as a redundant array of independent disks (RAID)).

  • Centralize log and trace data, deduplicate identical log entries, and establish mechanisms to tune verbosity when needed.
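Deduplicating identical log entries can be sketched as collapsing consecutive repeats into a single line with a count. This is a toy example; in practice the collapsing would happen in the log shipper or aggregator.

```python
from itertools import groupby

def dedupe_log(lines: list) -> list:
    """Collapse runs of identical log lines into one line with a repeat count."""
    out = []
    for line, run in groupby(lines):
        count = len(list(run))
        out.append(line if count == 1 else f"{line} (x{count})")
    return out
```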

  • Pre-populate caches only where justified.

  • Establish cache monitoring and automation to resize the cache accordingly.
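A toy sizing policy illustrates the automation: grow the cache when the hit rate is low (the working set may not fit) and shrink it when the hit rate is very high (capacity may be oversized). The thresholds and scaling factors below are assumptions for illustration only.

```python
def recommend_cache_size(current_size_gb: float, hit_rate: float,
                         low: float = 0.6, high: float = 0.95) -> float:
    """Recommend a cache size from an observed hit rate.

    Grows by 50% below the `low` threshold, shrinks by 25% above `high`,
    otherwise keeps the current size. All values are illustrative.
    """
    if hit_rate < low:
        return round(current_size_gb * 1.5, 2)
    if hit_rate > high:
        return round(current_size_gb * 0.75, 2)
    return current_size_gb
```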

  • Remove out-of-date deployments and assets from object stores and edge caches when pushing new versions of your workload.
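Identifying stale deployment assets can be sketched as filtering object keys by deployment version, assuming a `<version>/<asset>` key layout (an assumption made here for illustration):

```python
def stale_asset_keys(keys: list, current_version: str) -> list:
    """Return object keys belonging to deployments other than the current
    version, given keys laid out as '<version>/<asset>'."""
    prefix = current_version + "/"
    return [key for key in keys if not key.startswith(prefix)]
```

The returned keys could then be passed to a batch delete against the object store or invalidated from edge caches.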
