Appendix F: Optimizing storage cost and data lifecycle management - Genomics Data Transfer, Analytics, and Machine Learning using AWS Services

Filtering the data files to be transferred to Amazon Simple Storage Service (Amazon S3) minimizes storage cost. In a production laboratory, unnecessary data can account for up to 10% of the total data produced. Consider optimizing storage by writing instrument run data to an Amazon S3 bucket configured for the S3 Standard-Infrequent Access (S3 Standard-IA) storage class, then archiving the data to Amazon S3 Glacier Deep Archive.
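The filtering step can be expressed as exclude patterns, similar to the pipe-delimited patterns an AWS DataSync task accepts in its Excludes filter. The sketch below is a minimal illustration of that idea in plain Python; the patterns and file paths are assumptions, not a prescribed set.

```python
import fnmatch

# Hypothetical exclude patterns for a sequencing run directory. In practice,
# similar patterns would be supplied to an AWS DataSync task as exclude filters
# so thumbnails, logs, and other unnecessary files never reach Amazon S3.
EXCLUDE_PATTERNS = ["*/Thumbnail_Images/*", "*.log", "*/FocusModels/*"]

def should_transfer(path: str) -> bool:
    """Return True if the file should be copied to Amazon S3."""
    return not any(fnmatch.fnmatch(path, pattern) for pattern in EXCLUDE_PATTERNS)

# Example run contents (illustrative paths only).
files = [
    "RunA/Data/Intensities/BaseCalls/L001/C1.1/s_1_1101.bcl",
    "RunA/Thumbnail_Images/L001/s_1_1101_A.jpg",
    "RunA/RTALogs/info.log",
]
to_copy = [f for f in files if should_transfer(f)]
```

Here only the BCL file survives the filter; the thumbnail and log files are excluded before transfer, which is where the storage savings come from.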

Enable data lifecycle management to optimize storage costs. Identify your Amazon S3 storage access patterns so that you can configure your S3 bucket lifecycle policy appropriately. Use Amazon S3 Storage Class Analysis to analyze your storage access patterns. After Storage Class Analysis has observed the access patterns of a filtered data set over a period of time, you can use the results to improve your lifecycle policies. For genomics data, the key metric is the amount of data retrieved over an observation period of at least 30 days. Knowing how much data was retrieved during the observation period helps you decide how long to keep data in infrequent access before archiving it.
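A lifecycle policy of the kind described above can be sketched as a bucket lifecycle configuration. This is a minimal example, assuming hypothetical `runs/` and `vcf/` key prefixes and the 90-day/one-year windows used later in this appendix; your prefixes and transition days should come from your own Storage Class Analysis results.

```python
# A minimal sketch of an S3 lifecycle configuration: run data under an
# assumed "runs/" prefix transitions to S3 Glacier Deep Archive after
# 90 days, while VCF copies under an assumed "vcf/" prefix expire after
# one year.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-run-data",
            "Filter": {"Prefix": "runs/"},  # assumed key prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "DEEP_ARCHIVE"},
            ],
        },
        {
            "ID": "expire-vcf-copies",
            "Filter": {"Prefix": "vcf/"},  # assumed key prefix
            "Status": "Enabled",
            "Expiration": {"Days": 365},
        },
    ]
}

# With boto3, the configuration would be applied like this (bucket name
# is an assumption):
#
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-genomics-bucket",
#     LifecycleConfiguration=lifecycle_configuration,
# )
```

The same rules can equally be defined in the S3 console or in CloudFormation; the structure of the rules is what matters.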

For example, a customer performs genomics sequencing on premises, where each run contains six samples for a given scientific study. The run produces 600 GB of Binary Base Call (BCL) file data that is transferred from on premises to an Amazon S3 bucket using AWS DataSync, with objects written in the S3 Standard-Infrequent Access storage class. Demultiplexing and secondary analysis then run, producing six 100 GB FASTQ files and one 1 GB Variant Call Format (VCF) file, all stored in the same S3 bucket. The BCL, FASTQ, and VCF files are tarred and archived to S3 Glacier Deep Archive after 90 days. A copy of the VCF files remains in the infrequent access storage class for twelve months, because VCF files are frequently used for a year; after a year, they are deleted from the S3 bucket. Upon request, an entire study is restored from S3 Glacier Deep Archive to infrequent access, making the data available in the original S3 bucket and through AWS Storage Gateway. To request a restore of a study, the customer attempts to retrieve the data through the Storage Gateway, which triggers a restore action. An email is sent to the person who requested the restore when the data is available in the infrequent access bucket and through the Storage Gateway. You can learn more about automating data restore from S3 Glacier in Automate restore of archived objects through AWS Storage Gateway.
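The restore action the Storage Gateway triggers corresponds to the S3 `RestoreObject` API. The sketch below builds the restore request payload; the bucket and key names in the commented call are assumptions. Note that S3 Glacier Deep Archive supports only the Standard and Bulk retrieval tiers.

```python
def build_restore_request(days: int, tier: str = "Standard") -> dict:
    """Build the RestoreRequest payload for s3.restore_object.

    `days` is how long the restored copy stays available; valid tiers
    for S3 Glacier Deep Archive are "Standard" and "Bulk".
    """
    if tier not in ("Standard", "Bulk"):
        raise ValueError("Deep Archive supports only Standard and Bulk tiers")
    return {"Days": days, "GlacierJobParameters": {"Tier": tier}}

# Issuing the restore with boto3 (sketch; bucket and key are assumptions):
#
# s3 = boto3.client("s3")
# s3.restore_object(
#     Bucket="my-genomics-bucket",
#     Key="runs/studyA/studyA.tar",
#     RestoreRequest=build_restore_request(days=30),
# )
```

Once the restore job completes, a GET on the object succeeds for the requested number of days, which is what makes the data visible again through the Storage Gateway file share.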