
Solution components

CI/CD pipeline

A complete continuous integration and continuous deployment (CI/CD) pipeline is created when you launch the solution. This pipeline is built using AWS CodeCommit as the source code repository, AWS CodeBuild projects to build the solution’s artifacts (for example, hail.jar), and an AWS CodePipeline pipeline that runs a build project and automates deployment (using AWS CloudFormation) after updated source code is published.

The AWS resources that compose the CI/CD pipeline are defined in the Pipe/template_cfn.yml file. Changes to the solution that require new artifacts to be built also require an update to the CI/CD pipeline definition.
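
A minimal sketch of interacting with the deployed pipeline is shown below. It assumes a hypothetical pipeline name (the actual name is assigned when the stack defined in Pipe/template_cfn.yml is created) and uses boto3 to inspect the latest run and start a new one.

    import boto3

    # Hypothetical pipeline name; substitute the name created by Pipe/template_cfn.yml.
    PIPELINE_NAME = "GenomicsAnalysisPipeline"

    codepipeline = boto3.client("codepipeline")

    # Print the status of the most recent execution of each stage.
    state = codepipeline.get_pipeline_state(name=PIPELINE_NAME)
    for stage in state["stageStates"]:
        status = stage.get("latestExecution", {}).get("status")
        print(stage["stageName"], status)

    # Manually start a new run, for example after changing the pipeline definition.
    codepipeline.start_pipeline_execution(name=PIPELINE_NAME)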

Solution demonstration datasets

This solution copies the following datasets into your solution data lake bucket; an example of browsing them follows the list.

  • 1000 Genomes, Chromosome 22 (cohort) – A portion of the 1000 Genomes public dataset of human genomic variant data, partitioned by sample ID and stored in Apache Parquet format. This dataset is used as the cohort dataset for creating the drug response report.

  • ClinVar (annotation) – The public dataset that aggregates information about genomic variation and its relationship to human health. Two copies of this dataset are copied into your data lake bucket, one in Tab Separated Values (TSV) format and another in Apache Parquet format. The TSV file is used as an input to the clinvar-to-parquet job, which produces the dataset in Apache Parquet format. The Parquet version of the dataset is copied into the data lake bucket after the solution is set up so that running the AWS Glue crawler is optional.

  • Individual Sample Variants (sample) – An individual sample Variant Call File (VCF) dataset used to demonstrate VCF to Apache Parquet conversion. Two copies of this dataset (one in VCF format and one in Apache Parquet format) are copied into your data lake bucket. The VCF copy is used as an input to the vcf-to-parquet AWS Glue job, which produces the dataset in Apache Parquet format. The Parquet version of the dataset is copied into the data lake bucket so that running the AWS Glue crawler is optional.
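
A minimal sketch of browsing these datasets with boto3 is shown below; the bucket name and key prefixes are hypothetical, so substitute the data lake bucket and prefixes created by your deployment.

    import boto3

    # Hypothetical bucket name and prefixes; replace with the values from your deployment.
    DATA_LAKE_BUCKET = "my-genomics-data-lake-bucket"
    PREFIXES = ["variants/", "annotation/clinvar/", "sample/"]

    s3 = boto3.client("s3")
    for prefix in PREFIXES:
        response = s3.list_objects_v2(Bucket=DATA_LAKE_BUCKET, Prefix=prefix, MaxKeys=10)
        for obj in response.get("Contents", []):
            print(obj["Key"], obj["Size"])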

AWS Glue jobs

This solution creates the following AWS Glue jobs.

  • vcf-to-parquet – Transforms variant calls in a Variant Call File (VCF) format into Apache Parquet format using Hail from the Broad Institute and writes the resulting files to the solution data lake.

  • clinvar-to-parquet – Transforms Clinical Variant (ClinVar) data in a Tab Separated Values (TSV) format into Apache Parquet format and writes the resulting files to the solution data lake (a simplified sketch of this conversion follows this list).

    Hail is built from open source code and is publicly available under an open source license. For more information, see Appendix A.
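
The clinvar-to-parquet transformation amounts to reading the TSV file into a DataFrame and rewriting it in Apache Parquet format. The PySpark sketch below illustrates that conversion under assumed input and output paths; it is not the solution’s actual job script, and the vcf-to-parquet job additionally relies on Hail to parse VCF.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("clinvar-to-parquet-sketch").getOrCreate()

    # Placeholder S3 paths; the real job receives its locations as job arguments.
    input_path = "s3://my-genomics-data-lake-bucket/annotation/clinvar/clinvar.tsv"
    output_path = "s3://my-genomics-data-lake-bucket/annotation/clinvar/parquet/"

    # Read the tab-separated ClinVar file (with a header row) and write it as Parquet.
    clinvar = spark.read.csv(input_path, sep="\t", header=True, inferSchema=True)
    clinvar.write.mode("overwrite").parquet(output_path)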

AWS Glue crawlers

This solution creates the following AWS Glue crawlers; an example of starting them on demand follows the list.

  • variants-crawler – Creates/Updates the variants table in the solution’s AWS Glue data catalog to reflect the data schema of the 1000 Genomes example cohort variant data in the solution data lake.

  • clinvar-crawler – Creates/Updates the clinvar table in the solution’s AWS Glue data catalog to reflect the data schema of the ClinVar dataset in the solution data lake.
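
A minimal sketch of starting the crawlers on demand with boto3 is shown below. The crawler names follow the list above, but the deployed names may include a stack-specific prefix, so adjust them as needed.

    import boto3

    glue = boto3.client("glue")

    # Crawler names as described above; deployed names may carry a stack prefix.
    for crawler_name in ["variants-crawler", "clinvar-crawler"]:
        glue.start_crawler(Name=crawler_name)
        state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
        print(crawler_name, state)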

AWS Glue data catalog

This solution creates an AWS Glue data catalog with a genomicsanalysis database that contains variants and clinvar tables. AWS Glue is configured to encrypt the metadata stored in the data catalog, data files stored in Amazon S3 buckets, and all logs stored in Amazon CloudWatch.
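
A minimal sketch of inspecting the catalog with boto3 is shown below; it lists the tables registered in the genomicsanalysis database and the Amazon S3 location each table points to.

    import boto3

    glue = boto3.client("glue")

    # List the tables registered in the solution's data catalog database.
    response = glue.get_tables(DatabaseName="genomicsanalysis")
    for table in response["TableList"]:
        print(table["Name"], table["StorageDescriptor"]["Location"])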

SageMaker notebook instance

This solution creates an Amazon SageMaker notebook instance that demonstrates how to use AWS Glue and Amazon Athena to identify variants related to drug response.
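
The notebook’s analysis reduces to Amazon Athena queries that join the cohort variants against ClinVar. A minimal boto3 sketch is shown below; the column names in the query and the query result location are hypothetical, so adjust them to the schemas the crawlers register in your account.

    import boto3

    athena = boto3.client("athena")

    # Hypothetical join on chromosome and position; the real column names depend on
    # the schemas registered by the crawlers.
    query = """
    SELECT v.sampleid, c.clnsig, c.geneinfo
    FROM variants v
    JOIN clinvar c
      ON v.chrom = c.chromosome AND v.pos = c.position
    LIMIT 10
    """

    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "genomicsanalysis"},
        # Placeholder output location for Athena query results.
        ResultConfiguration={"OutputLocation": "s3://my-athena-query-results-bucket/"},
    )
    print("Started query:", execution["QueryExecutionId"])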

Amazon Simple Storage Service buckets

This solution creates the following Amazon Simple Storage Service (Amazon S3) buckets. Each bucket has encryption and logging enabled; an example of checking these settings follows the list.

  • Data Lake Bucket – Stores genomic variant and ClinVar variant annotation data.

  • Resources Bucket – Stores notebooks and shell scripts.

  • Build Bucket – Stores build artifacts deployed through the pipeline.
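
A minimal sketch of confirming the encryption and logging settings with boto3 is shown below; the bucket names are hypothetical, so substitute the names created by your deployment.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket names; replace with the buckets created by your deployment.
    BUCKETS = [
        "my-genomics-data-lake-bucket",
        "my-genomics-resources-bucket",
        "my-genomics-build-bucket",
    ]

    for bucket in BUCKETS:
        encryption = s3.get_bucket_encryption(Bucket=bucket)
        rules = encryption["ServerSideEncryptionConfiguration"]["Rules"]
        logging_cfg = s3.get_bucket_logging(Bucket=bucket)
        print(bucket, rules, logging_cfg.get("LoggingEnabled"))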