What is AWS HealthOmics?

AWS HealthOmics is an AWS service that helps users such as bioinformaticians, researchers, and scientists to store, query, analyze, and generate insights from genomics and other biological data. It simplifies and accelerates the process of storing and analyzing genomic information for research and clinical organizations, and makes scientific discovery and insight generation faster.

HealthOmics has three primary components. HealthOmics Storage helps you store and share petabytes of genomics data efficiently and at low cost per gigabase. HealthOmics Analytics simplifies how you prepare genomics data for multiomics and multimodal analyses. HealthOmics Workflows automatically provisions and scales the underlying infrastructure for your bioinformatics computation.

Important notice

HealthOmics isn't a substitute for professional medical advice, diagnosis, or treatment, and isn't intended to cure, treat, mitigate, prevent, or diagnose any disease or health condition. You are responsible for instituting human review as part of any use of AWS HealthOmics, including in association with any third-party product intended to inform clinical decision-making.

HealthOmics is intended only for the transferring, storing, formatting, or displaying of data, and for the provision of infrastructure and configuration support for managing workflows. AWS HealthOmics isn't intended to directly perform variant calling or genomic analysis and interpretation. AWS HealthOmics isn't intended to interpret or analyze clinical laboratory tests or other device data, results, and findings, and isn't a substitute for third-party tools intended for use in genomic analyses.

HealthOmics concepts

This topic defines key concepts and terms that are specific to HealthOmics, to help you understand the terminology used in this guide.

Storage

Data storage is separated into sequence stores, for your genomics sequences and related information, and a reference store, for all of your reference genomes. The following terms describe the implementations that are specific to HealthOmics.

  • Sequence store – A data store for the storage of genomics files. You can have one or more sequence stores within HealthOmics. Access permissions and AWS KMS encryption can be set on a sequence store to control who has access to the data.

  • Read set – An abstraction of genomics reads, which are stored in FASTQ, BAM, or CRAM format. Read sets can be imported into sequence stores and annotated with metadata. You can apply permissions to read sets using attribute-based access control (ABAC).

  • Reference – A genome reference is used with reads to identify where in a genome a specific read, or group of reads, is mapped to. These are in FASTA format and stored in the reference store.

  • Reference store – A data store for the storage of reference genomes. You can have a single reference store in each account and region.
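
To illustrate how these storage terms fit together, the following Python sketch assembles a request that imports a read set into a sequence store. It assumes the AWS SDK for Python (boto3) `omics` client and its `start_read_set_import_job` operation. The helper names and all IDs, ARNs, and S3 URIs are hypothetical placeholders, and treating the reference ARN as optional is an assumption. Check the HealthOmics API Reference before relying on any field.

```python
# Read set formats named in this guide.
READ_SET_FILE_TYPES = {"FASTQ", "BAM", "CRAM"}


def build_read_set_source(file_type, source1, subject_id, sample_id,
                          reference_arn=None, source2=None, name=None):
    """Build one entry for the `sources` list of a read set import job."""
    if file_type not in READ_SET_FILE_TYPES:
        raise ValueError(f"unsupported read set format: {file_type}")
    source_files = {"source1": source1}
    if source2 is not None:  # for example, the second file of paired-end FASTQ
        source_files["source2"] = source2
    source = {
        "sourceFileType": file_type,
        "sourceFiles": source_files,
        "subjectId": subject_id,   # metadata that annotates the read set
        "sampleId": sample_id,
    }
    if reference_arn is not None:  # assumption: optional for unaligned reads
        source["referenceArn"] = reference_arn
    if name is not None:
        source["name"] = name
    return source


def import_read_sets(client, sequence_store_id, role_arn, sources):
    """Start the import; `client` is expected to be boto3.client("omics")."""
    return client.start_read_set_import_job(
        sequenceStoreId=sequence_store_id,
        roleArn=role_arn,
        sources=sources,
    )
```

Keeping the request builder separate from the call lets you validate the payload locally before any AWS request is made.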

Analytics

You can transform and analyze your genomics data with HealthOmics Analytics. Create a variant store or annotation store to include additional information for your queries.

  • Variant store – A data store that holds variant data at population scale. Variant stores support both genomic Variant Call Format (gVCF) and VCF inputs.

  • Annotation store – A data store representing an annotation database, such as one from a TSV/CSV, VCF, or General Feature Format (GFF3) file. Annotation stores are mapped to the same coordinate system as variant stores during an import.
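
As a rough sketch of how a variant store might be created and loaded, the following Python helpers build the request payloads. They assume the boto3 `omics` client operations `create_variant_store` and `start_variant_import_job`. The helper names, the `s3://` prefix check, and every name, ARN, and URI are illustrative assumptions, not part of the service.

```python
def build_variant_store_request(name, reference_arn):
    """Payload for creating a variant store tied to one reference genome."""
    return {"name": name, "reference": {"referenceArn": reference_arn}}


def build_variant_import_request(store_name, role_arn, s3_uris):
    """Payload for importing VCF or gVCF files into an existing variant store."""
    items = []
    for uri in s3_uris:
        # Local sanity check only (an assumption of this sketch).
        if not uri.startswith("s3://"):
            raise ValueError(f"expected an S3 URI, got: {uri}")
        items.append({"source": uri})
    if not items:
        raise ValueError("at least one source file is required")
    return {"destinationName": store_name, "roleArn": role_arn, "items": items}
```

With a client from `boto3.client("omics")`, the payloads would be passed as keyword arguments, for example `client.create_variant_store(**request)` followed by `client.start_variant_import_job(**import_request)`.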

Workflows

With HealthOmics Workflows, you can process and analyze your genomics data.

  • Workflow – The overall definition of an end-to-end process, including parameters and references to tools. Workflow definitions can be expressed in WDL, Nextflow, or CWL. Each created workflow has a unique identifier.

  • Run/Workflow run – A single invocation of a workflow. An individual run uses your defined input data and produces an output. Each created run has a unique identifier.

  • Task – An individual process within a run. HealthOmics Workflows uses the compute specifications defined for each task to run it. Each task has a unique identifier.

  • Run group – A group of runs for which you can set the maximum vCPUs, maximum duration, or maximum concurrent runs to help limit the compute resources used per run. You can also set priorities for the workflow runs within a run group. For example, you can specify that a high-priority run is performed before a lower-priority one, which creates a priority queue. Using a run group is optional. Each run group has a unique identifier.
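
The relationship between a workflow, a run, and a run group can be sketched as two request payloads. This example assumes the boto3 `omics` client operations `create_run_group` and `start_run`. The helper names and all IDs, ARNs, and URIs are hypothetical, and reading `maxDuration` as minutes is an assumption to verify in the HealthOmics API Reference.

```python
def build_run_group_request(name, max_cpus=None, max_duration=None, max_runs=None):
    """Payload for a run group; only the limits that are set get included."""
    limits = {
        "maxCpus": max_cpus,          # cap on vCPUs
        "maxDuration": max_duration,  # cap on run time (assumed to be minutes)
        "maxRuns": max_runs,          # cap on concurrent runs
    }
    request = {"name": name}
    request.update({key: value for key, value in limits.items() if value is not None})
    return request


def build_run_request(workflow_id, role_arn, output_uri, parameters,
                      run_group_id=None, priority=None, name=None):
    """Payload for one invocation (run) of an existing workflow."""
    request = {
        "workflowId": workflow_id,
        "roleArn": role_arn,
        "outputUri": output_uri,
        "parameters": parameters,  # the input data defined for this run
    }
    if run_group_id is not None:   # using a run group is optional
        request["runGroupId"] = run_group_id
    if priority is not None:       # orders runs within the run group
        request["priority"] = priority
    if name is not None:
        request["name"] = name
    return request
```

A caller would pass each payload as keyword arguments, for example `client.create_run_group(**group_request)` and then `client.start_run(**run_request)`, using the run group identifier returned by the first call.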

HealthOmics features

HealthOmics offers the following features.

  • HealthOmics Storage — helps you store and share petabytes of raw genomics data efficiently and at low cost per gigabase.

  • HealthOmics Analytics — simplifies how you prepare genomics data for multiomics and multimodal analyses.

  • HealthOmics Workflows — automatically provisions and scales the underlying infrastructure for your bioinformatics workflows.

You can use each component independently, or as part of an integrated end-to-end solution.

HealthOmics offers you the following benefits.

  • Securely store and combine genomic data — HealthOmics integrates with other AWS services such as AWS Lake Formation and Amazon Athena. You can securely store your genomics data and then query or combine it with medical history data for better diagnoses and personalized treatment plans.

  • Protect patient privacy — HealthOmics is HIPAA eligible. It also integrates with IAM and Amazon CloudWatch so that you can control and log data access, and track how the data is used in analyses.

  • Built to scale — Support large population data analyses with simplified billing and new collaboration tools.

  • Maximize efficiency — Use automated workflows and integrated tools to streamline data processing and analysis.

You can use HealthOmics for the following biomedical applications:

  • Population sequencing — Query thousands of genomes at once to understand how genomic variation maps to phenotypes across a population.

  • Clinical genomics — Build reproducible genomics workflows from sequencer output to reportable data. You can also optimize for high volume throughput and set the compute requirements for high-priority clinical samples to reduce turnaround time.

  • Clinical trials — Integrate genome analysis into clinical trials to better understand the efficacy of new drug candidates. Simplify and accelerate clinical trials with long-term cost savings and data provenance to meet regulations from governing bodies.

  • Enhance research and innovation — Streamline and control storage, access, and analysis of anonymized genomics data with built-in row and column-based access control.

The following services work with HealthOmics.

  • Amazon Elastic Container Registry – Each private workflow uses an Amazon ECR image (in a private Amazon ECR repository) to contain all executables, libraries, and scripts required to run the workflow.

  • Amazon Simple Storage Service – Amazon S3 provides file storage for HealthOmics storage and workflow data.

  • AWS Lake Formation – Lake Formation manages data access to your Analytics data stores.

  • Amazon Athena – Use Athena to perform queries on your variant stores.

  • Amazon SageMaker – Use SageMaker to run HealthOmics tasks using Jupyter notebooks.

Regions and endpoints for AWS HealthOmics

For a full list of regions and endpoints, see the AWS General Reference.

In addition to the AWS Regions that are active by default, there are opt-in Regions that you must activate before use. To learn how to activate or deactivate a Region, see Specify which AWS Regions your account can use in the AWS Account Management guide.

How to access HealthOmics

You can access AWS HealthOmics features using the AWS Management Console, the AWS CLI, the AWS SDKs, or the API.

  • AWS Management Console – Provides a web interface that you can use to access HealthOmics.

  • AWS Command Line Interface (AWS CLI) – Provides commands for a broad set of AWS services, including AWS HealthOmics, and is supported on Windows, macOS, and Linux. For more information about installing the AWS CLI, see AWS Command Line Interface.

  • AWS SDKs – AWS provides SDKs (Software Development Kits) that consist of libraries and sample code for various programming languages and platforms (including Java, Python, Ruby, .NET, iOS, and Android). The SDKs provide a convenient way to use HealthOmics programmatically. For more information, see the AWS SDK Developer Center.

  • AWS API – You can use API operations to access and manage HealthOmics programmatically. For more information, see the HealthOmics API Reference.
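
As one example of programmatic access, the following Python sketch lists the workflows in an account through the AWS SDK for Python (boto3). It assumes that boto3 is installed, that credentials are configured, and that the service is exposed as the `omics` client with a `list_workflows` operation that pages with `startingToken` and `nextToken`. The helper names are hypothetical.

```python
def make_omics_client(region_name=None):
    """Create a HealthOmics client; boto3 is imported lazily on purpose."""
    import boto3  # requires the AWS SDK for Python and configured credentials
    return boto3.client("omics", region_name=region_name)


def iter_workflows(client, page_size=50):
    """Yield every workflow summary, following pagination tokens."""
    token = None
    while True:
        kwargs = {"maxResults": page_size}
        if token:
            kwargs["startingToken"] = token
        page = client.list_workflows(**kwargs)
        yield from page.get("items", [])
        token = page.get("nextToken")
        if not token:  # no further pages
            break
```

A typical call would be `for workflow in iter_workflows(make_omics_client("us-east-1")): print(workflow["id"])`, where the Region name is only an example.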
