Architecture

The following diagram shows the overall data lake architecture: how data is ingested, curated, cataloged, and queried.




Figure 1: Genomics Tertiary Analysis and Data Lakes Using AWS Glue and Amazon Athena data lake architecture

This solution demonstrates how to ingest common genomics datasets into a centralized data lake and work with that data using Amazon Athena and Jupyter notebooks. Example ingestion pipelines are provided for genomic variant call data (1000 Genomes), annotation data (ClinVar), and an individual Variant Call Format (VCF) file.
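
For example, once the solution is deployed, the data lake can be queried directly from Python, such as from the provided Jupyter notebook. The following is a minimal sketch using boto3; the database, table, and results bucket names are placeholders for the values created in your deployment.

```python
# Minimal sketch: run an Athena query against the solution's data lake from Python
# (for example, from the provided Jupyter notebook). The database, table, and
# results bucket names below are placeholders -- substitute the values created
# in your deployment.
import time
import boto3

athena = boto3.client("athena")

query = "SELECT chrom, pos, ref, alt FROM variants LIMIT 10"  # assumed table name

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "genomics_data_lake"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://<results-bucket>/athena/"},  # placeholder bucket
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```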

A portion of the 1000 Genomes dataset (Chromosome 22) is copied into the solution data lake bucket during solution setup. The dataset is in Apache Parquet format and partitioned by sample ID. The variants crawler is provided to crawl the dataset, infer the data schema, and add a table to the solution AWS Glue Data Catalog.
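
If new variant data is added to the data lake, the crawler can be re-run to pick up the new partitions. The following is a minimal sketch using boto3; the crawler, database, and table names are assumptions and should be replaced with the names used by your deployment.

```python
# Minimal sketch: re-run the variants crawler after new Parquet partitions are
# added, then confirm the table it maintains in the Glue Data Catalog.
# The crawler, database, and table names are assumptions -- use the names
# created by your deployment.
import boto3

glue = boto3.client("glue")

glue.start_crawler(Name="variants-crawler")  # assumed crawler name

# Once the crawler finishes, the table (partitioned by sample ID) is queryable.
table = glue.get_table(DatabaseName="genomics_data_lake", Name="variants")
print(table["Table"]["PartitionKeys"])  # e.g. [{'Name': 'sample_id', 'Type': 'string'}]
```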

A ClinVar file, in Tab Separated Values (TSV) format, is copied into the solution data lake bucket during solution setup. The clinvar-to-parquet job is provided to convert the dataset to Parquet format. The clinvar crawler is provided to crawl the dataset, infer the data schema, and add a table to the solution AWS Glue Data Catalog.
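
Conceptually, the conversion performed by the clinvar-to-parquet job resembles the following PySpark sketch. The S3 paths are placeholders, and the actual Glue job in the codebase (code) repository is the authoritative implementation.

```python
# Minimal sketch of a TSV-to-Parquet conversion similar in spirit to the
# clinvar-to-parquet job. Paths are placeholders; the actual job in the
# codebase (code) repository is the authoritative version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clinvar-to-parquet-sketch").getOrCreate()

# Read the tab-separated ClinVar annotation file from the data lake bucket.
clinvar = (
    spark.read
    .option("sep", "\t")
    .option("header", "true")
    .csv("s3://<data-lake-bucket>/clinvar/clinvar.tsv")  # placeholder path
)

# Write the annotation data as Parquet so the clinvar crawler can catalog it.
clinvar.write.mode("overwrite").parquet("s3://<data-lake-bucket>/clinvar/parquet/")
```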

An example Variant Call Format (VCF) file is copied into the solution data lake bucket during solution setup. The vcf-to-parquet job is provided to convert the dataset to Parquet format. The sample crawler is provided to crawl the dataset, infer the data schema, and add a table to the solution AWS Glue Data Catalog.
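
The conversion jobs can also be started programmatically. The following minimal boto3 sketch assumes a Glue job named vcf-to-parquet; replace the name with the one created by your deployment.

```python
# Minimal sketch: start the VCF conversion job and check its run status.
# The job name is an assumption -- use the Glue job name created by your deployment.
import boto3

glue = boto3.client("glue")

run = glue.start_job_run(JobName="vcf-to-parquet")  # assumed job name
status = glue.get_job_run(JobName="vcf-to-parquet", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])  # e.g. RUNNING, SUCCEEDED
```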

Note:

A version of each dataset in Parquet format is copied into the solution data lake bucket during setup, and the variants, clinvar, and sample crawlers are run during solution setup. This allows users to start working with the data as soon as setup is complete.

Deploying this solution with the default parameters builds the following environment in the AWS Cloud.

Figure 2: Genomics Tertiary Analysis and Data Lakes Using AWS Glue and Amazon Athena deployment architecture

The AWS CloudFormation template creates four CloudFormation stacks in your AWS account: a setup stack that installs the solution, a landing zone (zone) stack containing the common solution resources and artifacts, a deployment pipeline (pipe) stack defining the solution's CI/CD pipeline, and a codebase (code) stack providing the ETL scripts, jobs, crawlers, a data catalog, and notebook resources.

  1. The setup stack creates an AWS CodeBuild project containing the setup.sh script. This script creates the remaining CloudFormation stacks and provides the source code for both the AWS CodeCommit pipe repository and the code repository.

  2. The landing zone (zone) stack creates the CodeCommit pipe repository. After the landing zone (zone) stack completes its setup, the setup.sh script pushes source code to the CodeCommit pipe repository.

  3. The deployment pipeline (pipe) stack creates the CodeCommit code repository, an Amazon CloudWatch event, and the CodePipeline code pipeline. After the deployment pipeline (pipe) stack completes its setup, the setup.sh script pushes source code to the CodeCommit code repository.

  4. The CodePipeline (code) pipeline deploys the codebase (code) CloudFormation stack. After the AWS CodePipeline pipelines complete their setup, the resources deployed in your account include:

     - Amazon Simple Storage Service (Amazon S3) buckets for storing object access logs, build artifacts, and data in your data lake
     - CodeCommit repositories for source code
     - an AWS CodeBuild project for building code artifacts (for example, third-party libraries used for data processing)
     - an AWS CodePipeline pipeline for automating builds and deployment of resources
     - example AWS Glue jobs, crawlers, and a data catalog
     - an Amazon SageMaker Jupyter notebook instance
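
After setup finishes, the deployment can be verified programmatically. The following minimal boto3 sketch assumes example stack and pipeline names; the actual names depend on how the solution was launched in your account.

```python
# Minimal sketch: confirm the solution's CloudFormation stacks and the code
# pipeline completed successfully after setup. Stack and pipeline names are
# assumptions -- they follow the names chosen when the solution was launched.
import boto3

cfn = boto3.client("cloudformation")
pipeline = boto3.client("codepipeline")

for stack_name in ("GenomicsDataLake-setup", "GenomicsDataLake-zone",
                   "GenomicsDataLake-pipe", "GenomicsDataLake-code"):  # assumed stack names
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    print(stack_name, stack["StackStatus"])

state = pipeline.get_pipeline_state(name="GenomicsDataLake-code-pipeline")  # assumed pipeline name
for stage in state["stageStates"]:
    print(stage["stageName"], stage.get("latestExecution", {}).get("status"))
```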

The example code includes the resources needed to prepare genomic data for large-scale analysis and perform interactive queries against a genomics data lake.