Components - Guidance for Multi-Omics and Multi-Modal Data Integration and Analysis on AWS

Components

CI/CD pipeline

A complete continuous integration and continuous deployment (CI/CD) pipeline is created when you launch the guidance. The pipeline uses AWS CodeCommit as the source code repository, AWS CodeBuild projects to build the guidance’s artifacts (for example, hail.jar), and an AWS CodePipeline pipeline to run a build project and automate deployment (using AWS CloudFormation) after updated source code is published.

The AWS resources that compose the CI/CD pipeline are defined in the Pipe/template_cfn.yml file. Changes to the guidance that require new artifacts to be built also require an update to the CI/CD pipeline definition.

Demonstration datasets

This guidance copies the following datasets into your data lake bucket.

  • The Cancer Genome Atlas, Lung Adenocarcinoma and Lung Squamous Cell Carcinoma (TCGA-LUAD and TCGA-LUSC) – Two public cancer studies consisting of clinical, mutation, gene expression, and copy number data, partitioned by study ID and in Apache Parquet format. These studies are retrieved directly from The Cancer Genome Atlas (TCGA) APIs via a set of AWS Glue jobs. These datasets are used as the basis of cohort exploration and selection.

  • The Cancer Imaging Archive, Lung Adenocarcinoma and Lung Squamous Cell Carcinoma (TCIA) – Imaging metadata for subjects in the TCGA studies listed above, partitioned by study ID and in Apache Parquet format. This metadata is retrieved directly from The Cancer Imaging Archive (TCIA) APIs via an AWS Glue job. This dataset can be used to filter patients by availability of imaging data, as well as retrieve individual images for high-level exploration.

  • 1000 Genomes, Chromosome 22 (genome) – A portion of the 1000 Genomes public dataset of human genomic variant data in VCF format. This dataset is used as the cohort dataset for creating the drug response report.

  • ClinVar (annotation) – The public dataset that aggregates information about genomic variation and its relationship to human health. This guidance uses the VCF-format version of the dataset so that it can be used with Amazon Omics Variant Store.

  • Individual Sample Variants (sample) – An individual sample dataset in Variant Call Format (VCF), used to demonstrate queryability with Amazon Athena.

AWS Glue jobs

This guidance creates the following AWS Glue jobs:

  • Clinical, Cnv, Expression, and Mutation – Retrieves data from TCGA, filters and transforms it to Apache Parquet format, and writes the resulting files to the data lake. For each of the four data types, there are two jobs, one for each of the two source TCGA projects (LUAD and LUSC), for a total of eight jobs.

  • ImagingMetadata – Retrieves imaging metadata from The Cancer Imaging Archive (TCIA), transforms it to Apache Parquet format, and writes the resulting files to the data lake. There is one job for each of the two source TCGA projects (LUAD and LUSC).

  • TcgaSummary – Invokes Amazon Athena to generate a new database table containing summary metrics over all of the TCGA and TCIA data tables, saves the results in Apache Parquet format within the data lake, and registers the table with the AWS Glue data catalog.
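
The job counts above follow from a cross product of data types and source projects. As a minimal sketch, the snippet below enumerates one job per combination, plus the imaging and summary jobs. The `<project>-<type>` naming pattern is an assumption made for illustration and is not taken from the guidance's templates.

```python
from itertools import product

# Data types and source projects as described in the list above.
TCGA_DATA_TYPES = ["Clinical", "Cnv", "Expression", "Mutation"]
PROJECTS = ["LUAD", "LUSC"]


def glue_job_names():
    """Return illustrative job names; the naming pattern is hypothetical."""
    # Four data types x two projects = eight TCGA jobs.
    tcga_jobs = [
        f"{project}-{data_type}"
        for data_type, project in product(TCGA_DATA_TYPES, PROJECTS)
    ]
    # One imaging metadata job per project.
    imaging_jobs = [f"{project}-ImagingMetadata" for project in PROJECTS]
    # A single summary job runs over all tables.
    return tcga_jobs + imaging_jobs + ["TcgaSummary"]


if __name__ == "__main__":
    for name in glue_job_names():
        print(name)
```

Under these assumptions the list contains eleven jobs in total: eight TCGA jobs, two imaging metadata jobs, and the summary job.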

AWS Glue crawlers

This guidance creates the following AWS Glue crawlers:

  • tcga-clin – Creates/Updates the clinical_patient and other clinical tables in the AWS Glue data catalog to reflect the data schema of the TCGA clinical data in the data lake.

  • tcga-cnv – Creates/Updates the tcga_cnv table in the AWS Glue data catalog to reflect the data schema of the TCGA copy number data in the data lake.

  • tcga-exp – Creates/Updates the expression_tcga_luad and expression_tcga_lusc tables in the AWS Glue data catalog to reflect the data schema of the TCGA expression data in the data lake.

  • tcga-mut – Creates/Updates the tcga_mutation table in the AWS Glue data catalog to reflect the data schema of the TCGA mutation data in the data lake.

  • tcga-img – Creates/Updates the tcia_patients and tcia_image_series tables in the AWS Glue data catalog to reflect the data schema of the TCIA image metadata in the data lake.

  • tcga-sum – Creates/Updates the tcga_summary table in the AWS Glue data catalog to reflect the data schema of the TCGA summary data in the data lake.

AWS Glue workflow

This guidance creates an AWS Glue workflow that sequences and coordinates the AWS Glue jobs and crawlers that process the TCGA and TCIA datasets. For each TCGA data type and for the TCIA data, two Glue jobs are invoked, followed by a trigger that signals the corresponding Glue crawler to run. Once all Glue crawlers are complete, the TcgaSummary job is invoked to create the summary table.
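
The ordering described above can be modeled as a dependency graph. The following sketch uses Python's standard `graphlib` module (Python 3.9 or later) to derive one valid execution order. It does not call AWS Glue, and the job names are hypothetical; only the crawler names and the TcgaSummary job name come from this guidance. The tcga-sum crawler is left out because the text above does not state where it runs in the sequence.

```python
from graphlib import TopologicalSorter

PROJECTS = ["LUAD", "LUSC"]

# Crawler name -> job type that feeds it (job names are hypothetical).
CRAWLER_INPUTS = {
    "tcga-clin": "Clinical",
    "tcga-cnv": "Cnv",
    "tcga-exp": "Expression",
    "tcga-mut": "Mutation",
    "tcga-img": "ImagingMetadata",
}


def build_workflow_graph():
    """Map each step to the set of steps that must finish before it starts."""
    graph = {}
    for crawler, job_type in CRAWLER_INPUTS.items():
        jobs = {f"{project}-{job_type}" for project in PROJECTS}
        for job in jobs:
            graph[job] = set()   # jobs have no prerequisites
        graph[crawler] = jobs    # a crawler waits for both project jobs
    # The summary job waits for every crawler to complete.
    graph["TcgaSummary"] = set(CRAWLER_INPUTS)
    return graph


def execution_order():
    """Return one valid ordering of all workflow steps."""
    return list(TopologicalSorter(build_workflow_graph()).static_order())


if __name__ == "__main__":
    for step in execution_order():
        print(step)
```

Any order produced this way runs each crawler after its two jobs and runs TcgaSummary last, which matches the sequencing the workflow triggers are described as enforcing.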

AWS Glue data catalog

This guidance creates an AWS Glue data catalog with a genomicsanalysis database that contains all the tables used by this guidance. AWS Glue is configured to encrypt the metadata stored in the data catalog, data files stored in Amazon S3 buckets, and all logs stored in Amazon CloudWatch.

SageMaker AI notebook instance

This guidance creates an Amazon SageMaker AI notebook instance that demonstrates how to use AWS Glue and Amazon Athena to identify variants related to drug response.

Amazon S3 buckets

This guidance creates the following Amazon S3 buckets. Each bucket has encryption and logging activated:

  • Data Lake Bucket – Stores genomic variant and ClinVar variant annotation data.

  • Resources Bucket – Stores notebooks and shell scripts.

  • Build Bucket – Stores build artifacts deployed through the pipeline.

  • Logs Bucket – Stores S3 access logs related to the three buckets listed above.
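
To illustrate how access logging ties the first three buckets to the Logs Bucket, the sketch below builds the request parameters that the boto3 S3 client method `put_bucket_logging` accepts. The guidance itself configures its buckets through AWS CloudFormation, so this is only an illustration of the relationship. The bucket names and the per-bucket log prefix are hypothetical.

```python
# Hypothetical bucket names; the deployed names are generated by the guidance.
LOGS_BUCKET = "example-logs-bucket"
SOURCE_BUCKETS = [
    "example-data-lake-bucket",
    "example-resources-bucket",
    "example-build-bucket",
]


def logging_request(source_bucket, logs_bucket=LOGS_BUCKET):
    """Build keyword arguments for the S3 client's put_bucket_logging call."""
    return {
        "Bucket": source_bucket,
        "BucketLoggingStatus": {
            "LoggingEnabled": {
                "TargetBucket": logs_bucket,
                # Assumed layout: one log prefix per source bucket.
                "TargetPrefix": f"{source_bucket}/",
            }
        },
    }


if __name__ == "__main__":
    for bucket in SOURCE_BUCKETS:
        print(logging_request(bucket))
        # With credentials configured, the request could be sent with:
        #   boto3.client("s3").put_bucket_logging(**logging_request(bucket))
```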

QuickSight dataset

This guidance creates a QuickSight dataset derived from the Amazon Athena clinical_patient and tcga_summary tables, joined by the patient ID field, and filtered for key columns. The dataset is shared with the guidance owner and available for creating interactive analyses on clinical data. Before using QuickSight to explore the data in this guidance, sign up for a QuickSight subscription. For details, refer to Signing up for a QuickSight subscription in the QuickSight User Guide.
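
The shape of that dataset (two tables joined on patient ID, with only selected columns kept) can be sketched locally. The example below uses an in-memory SQLite database as a stand-in for Amazon Athena. Only the table names come from this guidance; the column names and the two rows are invented placeholders for illustration.

```python
import sqlite3


def build_demo_dataset():
    """Join the two tables on patient ID and keep a few key columns."""
    conn = sqlite3.connect(":memory:")
    # Placeholder schema and rows; the real tables have different columns.
    conn.executescript(
        """
        CREATE TABLE clinical_patient (patient_id TEXT, gender TEXT, notes TEXT);
        CREATE TABLE tcga_summary (patient_id TEXT, num_images INTEGER);
        INSERT INTO clinical_patient VALUES
            ('P1', 'female', 'x'), ('P2', 'male', 'y');
        INSERT INTO tcga_summary VALUES ('P1', 3), ('P2', 0);
        """
    )
    rows = conn.execute(
        """
        SELECT c.patient_id, c.gender, s.num_images
        FROM clinical_patient AS c
        JOIN tcga_summary AS s ON s.patient_id = c.patient_id
        ORDER BY c.patient_id
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in build_demo_dataset():
        print(row)
```

The `notes` column is dropped by the SELECT list, which mirrors the "filtered for key columns" step in the description above.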

Amazon Omics

This guidance creates the following Amazon Omics resources:

  • Reference Store – A data store for the storage of reference genomes in FASTA format.

  • Variant Store – A data store that stores variant data at a population scale in VCF format.

  • Annotation Store – A data store that stores annotation data in VCF, GFF3, and TSV/CSV formats for downstream queryability.

  • In addition, the example datasets (1000 Genomes, example VCF, and ClinVar VCF) are ingested into these stores as part of the guidance setup.