
Overview

The Genomics Tertiary Analysis and Data Lakes Using AWS Glue and Amazon Athena solution creates a scalable environment in AWS to prepare genomic data for large-scale analysis and perform interactive queries against a genomics data lake. This solution demonstrates how to 1) build, package, and deploy libraries used for genomics data conversion, 2) provision data ingestion pipelines for genomics data preparation and cataloging, and 3) run interactive queries against a genomics data lake.

Build, package, and deploy libraries used for genomics data conversion

Note:

Hail is an open-source library from the Broad Institute for scalable genomic data exploration.

The solution uses AWS CodeBuild and AWS CodePipeline to build, package, and deploy Hail as a jar file to an Amazon Simple Storage Service (Amazon S3) bucket. Hail is used to load Variant Call Format (VCF) files into Apache Spark DataFrames for data processing and format conversion. You can learn more about adding or modifying solution CodeBuild jobs to build, package, and deploy library dependencies in the Genomics Tertiary Analysis and Data Lakes using AWS Glue and Amazon Athena Developer Guide.
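For illustration, the following minimal sketch shows how a Spark job could use Hail to load a VCF and write the variant-level rows out as Parquet. The bucket name and file paths are placeholders, and this is an assumption about the general approach, not the solution's actual packaged job.

# Hypothetical sketch: load a VCF with Hail and write variant rows as Parquet.
# Bucket and paths are placeholders; the solution's packaged Glue job may differ.
import hail as hl

hl.init()  # attaches to the active Spark context (for example, in an AWS Glue job)

# Import the VCF as a Hail MatrixTable.
mt = hl.import_vcf('s3://example-data-lake-bucket/vcf/sample.vcf', reference_genome='GRCh37')

# Convert the variant-level rows to a Spark DataFrame and write them as Parquet.
variants_df = mt.rows().flatten().to_spark()
variants_df.write.mode('overwrite').parquet('s3://example-data-lake-bucket/variants/parquet/')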

Provision data ingestion pipelines for genomics data preparation and cataloging

During the solution setup, a clinvar.tsv.gz file, an example VCF file, and a subset of the 1000 Genomes dataset (in Parquet format and partitioned by sample ID) are copied into the solution data lake bucket. Parquet versions of each dataset are also included so that running the following Extract, Transform, and Load (ETL) jobs is optional.

Note:

Apache Parquet and Apache ORC are popular columnar data formats that are optimized for performance and cost savings when querying data in Amazon S3. They store data efficiently by using column-wise compression, encodings and compression chosen per data type, and predicate pushdown, and they are splittable. Better compression ratios and the ability to skip blocks of data mean reading fewer bytes from Amazon S3, leading to better query performance.
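As a brief illustration of why this layout matters, the following hypothetical PySpark snippet writes a compressed Parquet dataset partitioned by sample ID and reads it back with a filter; the paths and partition column are placeholders. Partition pruning and predicate pushdown let Spark (and Athena) skip files and row groups that cannot match the filter.

# Hypothetical sketch: write compressed, partitioned Parquet and read it back with a filter.
# Paths and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

variants = spark.read.parquet('s3://example-data-lake-bucket/variants/raw/')

# Column-wise (snappy) compression plus partitioning by sample ID.
(variants.write
    .mode('overwrite')
    .partitionBy('sampleid')
    .option('compression', 'snappy')
    .parquet('s3://example-data-lake-bucket/variants/by_sample/'))

# Partition pruning and predicate pushdown skip data that cannot match this filter.
na12_variants = (spark.read
    .parquet('s3://example-data-lake-bucket/variants/by_sample/')
    .filter("sampleid LIKE 'NA12%'"))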

AWS Glue jobs are provided to prepare data. The vcf-to-parquet job converts variant call data in the Variant Call Format (VCF) into Apache Parquet format. The clinvar-to-parquet job converts ClinVar data in Tab-Separated Values (TSV) format into Apache Parquet format.
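A minimal sketch of what a TSV-to-Parquet conversion like the clinvar-to-parquet job might look like as an AWS Glue PySpark script is shown below. The job argument names and paths are assumptions for illustration, not the solution's actual script.

# Hypothetical sketch of a TSV-to-Parquet Glue job similar to clinvar-to-parquet.
# Argument names and paths are placeholders; the solution's actual script may differ.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['input_path', 'output_path'])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the gzipped, tab-separated ClinVar file with a header row.
clinvar = (spark.read
    .option('sep', '\t')
    .option('header', 'true')
    .csv(args['input_path']))

# Write the data back out in Parquet format for the data lake.
clinvar.write.mode('overwrite').parquet(args['output_path'])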

AWS Glue crawlers are provided to catalog data. Crawlers crawl data files to infer their schema and use that schema to create or update tables in a data catalog. The variants crawler crawls the 1000 Genomes Parquet data files and adds or updates a variants table in the data catalog. The clinvar crawler crawls the ClinVar Parquet data files and adds or updates a clinvar table in the data catalog. The sample crawler crawls the sample Parquet data files and adds or updates a sample table in the data catalog. You can learn more about adding or modifying solution jobs and crawlers to prepare and catalog datasets in the Genomics Tertiary Analysis and Data Lakes using AWS Glue and Amazon Athena Developer Guide.
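For example, a crawler equivalent to the ones the solution provisions could be created and started with the AWS SDK for Python (Boto3) roughly as follows; the crawler name, IAM role, database, and S3 path are placeholders.

# Hypothetical sketch: create and start a Glue crawler with Boto3.
# The crawler name, IAM role, database, and S3 path are placeholders.
import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='variants-crawler',
    Role='arn:aws:iam::111122223333:role/example-glue-crawler-role',
    DatabaseName='genomicsanalysis',
    Targets={'S3Targets': [{'Path': 's3://example-data-lake-bucket/variants/by_sample/'}]},
)

# Running the crawler infers the schema and creates or updates the table in the catalog.
glue.start_crawler(Name='variants-crawler')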

Run big data queries against a genomics data lake

Note:

PyAthena is a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena.

An Amazon SageMaker notebook instance is provisioned with an example Jupyter notebook that demonstrates how to work with data in a genomics data lake. The solution notebook uses Amazon Athena to identify genomic variants related to drug response for a given cohort of individuals. The following query is run against data in the data lake using the PyAthena driver to 1) filter by samples in a subpopulation, 2) aggregate variant frequencies for the subpopulation of interest, 3) join on the ClinVar dataset, 4) filter by variants that have been implicated in drug response, and 5) order the results by highest-frequency variants. The query can also be run in the Amazon Athena console.

SELECT count(*)/cast(numsamples AS DOUBLE) AS genotypefrequency
    ,cv.rsid
    ,cv.phenotypelist
    ,sv.chromosome
    ,sv.startposition
    ,sv.endposition
    ,sv.referenceallele
    ,sv.alternateallele
    ,sv.genotype0
    ,sv.genotype1
FROM genomicsanalysis.onekg_chr22_by_sample sv
CROSS JOIN
    (SELECT count(1) AS numsamples
     FROM (SELECT DISTINCT sampleid
           FROM genomicsanalysis.onekg_chr22_by_sample
           WHERE sampleid LIKE 'NA12%'))
JOIN genomicsanalysis.clinvar cv
    ON sv.chromosome = cv.chromosome
    AND sv.startposition = cv.start - 1
    AND sv.endposition = cv.stop
    AND sv.referenceallele = cv.referenceallele
    AND sv.alternateallele = cv.alternateallele
WHERE assembly = 'GRCh37'
    AND cv.clinicalsignificance LIKE '%response%'
    AND sampleid LIKE 'NA12%'
GROUP BY sv.chromosome
    ,sv.startposition
    ,sv.endposition
    ,sv.referenceallele
    ,sv.alternateallele
    ,sv.genotype0
    ,sv.genotype1
    ,cv.clinicalsignificance
    ,cv.phenotypelist
    ,cv.rsid
    ,numsamples
ORDER BY genotypefrequency DESC
LIMIT 50
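A minimal sketch of how the notebook might submit this query through PyAthena and load the results into a pandas DataFrame is shown below; the staging bucket, Region, and variable names are placeholders, and the full query text is elided for brevity.

# Hypothetical sketch: run the query above with PyAthena and load the results into pandas.
# The staging bucket and Region are placeholders.
import pandas as pd
from pyathena import connect

conn = connect(
    s3_staging_dir='s3://example-athena-results-bucket/queries/',
    region_name='us-east-1',
)

# The full drug response query shown above, elided here for brevity.
drug_response_query = """SELECT count(*)/cast(numsamples AS DOUBLE) AS genotypefrequency ..."""

df = pd.read_sql(drug_response_query, conn)
df.head()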

The solution includes continuous integration and continuous delivery (CI/CD) using AWS CodeCommit source code repositories and AWS CodePipeline for building and deploying updates to the data preparation jobs, crawlers, data analysis notebooks, and the data lake infrastructure. This solution fully leverages infrastructure as code principles and best practices that enable you to rapidly evolve the solution. After deployment, you can modify the solution to fit your particular needs, for example, by adding new data preparation jobs and crawlers. Each change is tracked by the CI/CD pipeline, facilitating change control management, rollbacks, and auditing.

Cost

You are responsible for the cost of the AWS services used while running this reference deployment. As of the date of publication, the cost for running this solution with default settings in the US East (N. Virginia) Region is approximately $0.45 during setup to run the three crawlers, $0.05 per hour to run the Amazon SageMaker notebook instance, and approximately $0.0025 for each drug response query run with Amazon Athena to interpret the data. Prices are subject to change. For full details, see the pricing webpage for each AWS service used in this solution.

Note:

AWS Glue job execution: 2 DPUs * 1/6 hour at $0.44 per DPU-Hour or $0.15

AWS Glue crawler execution: 2 DPUs * 1/6 hour at $0.44 per DPU-Hour or $0.15

Drug response query execution: Less than 0.0005 TB scanned * $5/TB = $0.0025

If you customize the solution to analyze your genomics dataset, the cost factors include the storage size of the data being analyzed, the number of ETL jobs and crawlers being used, the compute resources required for each job, the number of notebook instances provisioned, and the volume of data scanned when using Amazon Athena. For a more accurate estimate of cost, we recommend working with a sample dataset of your choosing as a benchmark. Prices are subject to change. For full details, see the pricing webpage for each AWS service used in this solution.
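As a rough illustration of these cost factors, the following hypothetical Python sketch estimates Glue and Athena charges from the same unit prices quoted in the note above; actual prices vary by Region and over time, so check the relevant pricing pages.

# Hypothetical back-of-the-envelope estimate using the unit prices quoted above.
# Actual prices vary by Region and change over time.
GLUE_PRICE_PER_DPU_HOUR = 0.44   # USD per DPU-Hour
ATHENA_PRICE_PER_TB = 5.00       # USD per TB scanned


def glue_run_cost(dpus, hours):
    """Cost of a single AWS Glue job or crawler run."""
    return dpus * hours * GLUE_PRICE_PER_DPU_HOUR


def athena_query_cost(tb_scanned):
    """Cost of a single Amazon Athena query."""
    return tb_scanned * ATHENA_PRICE_PER_TB


# Example: three crawler runs at 2 DPUs for 10 minutes each, plus one drug response query.
setup_cost = 3 * glue_run_cost(dpus=2, hours=1 / 6)   # about $0.44 (the guide rounds to $0.45)
query_cost = athena_query_cost(tb_scanned=0.0005)     # about $0.0025
print(f'Setup: ${setup_cost:.2f}, per query: ${query_cost:.4f}')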