
Importing read sets into a HealthOmics sequence store


After you create your sequence store, create import jobs to upload read sets into the data store. You can upload your files from an Amazon S3 bucket, or you can upload directly by using the synchronous API operations. Your Amazon S3 bucket must be in the same Region as your sequence store.

You can upload any combination of aligned and unaligned read sets into your sequence store. However, if any of the read sets in your import are aligned, you must include a reference genome.

You can reuse the IAM access policy that you used to create the Reference store.
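
If you need to create a new import role instead, the following is a minimal sketch of a permissions policy for it. It assumes your source files are in amzn-s3-demo-bucket and that the role's trust policy lets the omics.amazonaws.com service principal assume the role; adjust the bucket name and add any KMS permissions your bucket requires.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket",
                "arn:aws:s3:::amzn-s3-demo-bucket/*"
            ]
        }
    ]
}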

The following topics describe the major steps you follow to import a read set into your sequence store and then get information about the imported data.

Upload files to Amazon S3

The following example shows how to copy the sample files into your Amazon S3 bucket.

aws s3 cp s3://1000genomes/phase1/data/HG00100/alignment/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam s3://your-bucket
aws s3 cp s3://1000genomes/phase3/data/HG00146/sequence_read/SRR233106_1.filt.fastq.gz s3://your-bucket
aws s3 cp s3://1000genomes/phase3/data/HG00146/sequence_read/SRR233106_2.filt.fastq.gz s3://your-bucket
aws s3 cp s3://1000genomes/data/HG00096/alignment/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram s3://your-bucket
aws s3 cp s3://gatk-test-data/wgs_ubam/NA12878_20k/NA12878_A.bam s3://your-bucket

The sample BAM and CRAM used in this example require different genome references, Hg19 and Hg38. To learn more or to access these references, see The Broad Genome References in the Registry of Open Data on AWS.
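
If you haven't already imported these references into your reference store, you can do so with the StartReferenceImportJob API operation before importing the aligned read sets. The following is a sketch only; the reference store ID, role ARN, and FASTA location are placeholders that you replace with your own values, and GRCh38 is an example reference name.

aws omics start-reference-import-job \
    --reference-store-id 0123456789 \
    --role-arn arn:aws:iam::555555555555:role/OmicsImport \
    --sources sourceFile=s3://your-bucket/Homo_sapiens_assembly38.fasta,name=GRCh38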

Creating a manifest file

You must also create a JSON manifest file, import.json, that models the import job (see the following example). If you import read sets through the console, you don't have to specify the sequenceStoreId or roleArn, so your manifest file starts with the sources input.

API manifest

The following example imports four read sets by using the API: one BAM, one paired FASTQ, one CRAM, and one uBAM.

{ "sequenceStoreId": "3936421177", "roleArn": "arn:aws:iam::555555555555:role/OmicsImport", "sources": [ { "sourceFiles": { "source1": "s3://amzn-s3-demo-bucket/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam" }, "sourceFileType": "BAM", "subjectId": "mySubject", "sampleId": "mySample", "referenceArn": "arn:aws:omics:us-west-2:555555555555:referenceStore/0123456789/reference/0000000001", "name": "HG00100", "description": "BAM for HG00100", "generatedFrom": "1000 Genomes" }, { "sourceFiles": { "source1": "s3://amzn-s3-demo-bucket/SRR233106_1.filt.fastq.gz", "source2": "s3://amzn-s3-demo-bucket/SRR233106_2.filt.fastq.gz" }, "sourceFileType": "FASTQ", "subjectId": "mySubject", "sampleId": "mySample", // NOTE: there is no reference arn required here "name": "HG00146", "description": "FASTQ for HG00146", "generatedFrom": "1000 Genomes" }, { "sourceFiles": { "source1": "s3://amzn-s3-demo-bucket/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram" }, "sourceFileType": "CRAM", "subjectId": "mySubject", "sampleId": "mySample", "referenceArn": "arn:aws:omics:us-west-2:555555555555:referenceStore/0123456789/reference/0000000001", "name": "HG00096", "description": "CRAM for HG00096", "generatedFrom": "1000 Genomes" }, { "sourceFiles": { "source1": "s3://amzn-s3-demo-bucket/NA12878_A.bam" }, "sourceFileType": "UBAM", "subjectId": "mySubject", "sampleId": "mySample", // NOTE: there is no reference arn required here "name": "NA12878_A", "description": "uBAM for NA12878", "generatedFrom": "GATK Test Data" } ] }
Console manifest

This example manifest imports the same four read sets by using the console.

[ { "sourceFiles": { "source1": "s3://amzn-s3-demo-bucket/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam" }, "sourceFileType": "BAM", "subjectId": "mySubject", "sampleId": "mySample", "name": "HG00100", "description": "BAM for HG00100", "generatedFrom": "1000 Genomes" }, { "sourceFiles": { "source1": "s3://amzn-s3-demo-bucket/SRR233106_1.filt.fastq.gz", "source2": "s3://amzn-s3-demo-bucket/SRR233106_2.filt.fastq.gz" }, "sourceFileType": "FASTQ", "subjectId": "mySubject", "sampleId": "mySample", "name": "HG00146", "description": "FASTQ for HG00146", "generatedFrom": "1000 Genomes" }, { "sourceFiles": { "source1": "s3://your-bucket/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram" }, "sourceFileType": "CRAM", "subjectId": "mySubject", "sampleId": "mySample", "name": "HG00096", "description": "CRAM for HG00096", "generatedFrom": "1000 Genomes" }, { "sourceFiles": { "source1": "s3://amzn-s3-demo-bucket/NA12878_A.bam" }, "sourceFileType": "UBAM", "subjectId": "mySubject", "sampleId": "mySample", "name": "NA12878_A", "description": "uBAM for NA12878", "generatedFrom": "GATK Test Data" } ]

Alternatively, you can upload the manifest file in YAML format.
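
For example, the following sketch shows the first source from the API manifest expressed as YAML in a file named import.yaml; the values are the same placeholders used earlier. With AWS CLI version 2, you can then pass the file to the import command by using --cli-input-yaml instead of --cli-input-json.

sequenceStoreId: "3936421177"
roleArn: "arn:aws:iam::555555555555:role/OmicsImport"
sources:
  - sourceFiles:
      source1: "s3://amzn-s3-demo-bucket/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam"
    sourceFileType: "BAM"
    subjectId: "mySubject"
    sampleId: "mySample"
    referenceArn: "arn:aws:omics:us-west-2:555555555555:referenceStore/0123456789/reference/0000000001"
    name: "HG00100"
    description: "BAM for HG00100"
    generatedFrom: "1000 Genomes"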

Starting the import job

To start the import job, use the following AWS CLI command.

aws omics start-read-set-import-job --cli-input-json file://import.json

You receive the following response, which indicates successful job creation.

{ "id": "3660451514", "sequenceStoreId": "3936421177", "roleArn": "arn:aws:iam::111122223333:role/OmicsImport", "status": "CREATED", "creationTime": "2022-07-13T22:14:59.309Z" }

Monitor the import job

After the import job starts, you can monitor its progress with the following command. In the following example, replace sequence store id with your sequence store ID, and replace job import ID with the import ID.

aws omics get-read-set-import-job --sequence-store-id sequence store id --id job import ID

The following response shows the status of the import job and the status of each read set source in the job.

{ "id": "1234567890", "sequenceStoreId": "1234567890", "roleArn": "arn:aws:iam::111122223333:role/OmicsImport", "status": "RUNNING", "statusMessage": "The job is currently in progress.", "creationTime": "2022-07-13T22:14:59.309Z", "sources": [ { "sourceFiles": { "source1": "s3://amzn-s3-demo-bucket/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam" }, "sourceFileType": "BAM", "status": "IN_PROGRESS", "statusMessage": "The job is currently in progress." "subjectId": "mySubject", "sampleId": "mySample", "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/3242349265/reference/8625408453", "name": "HG00100", "description": "BAM for HG00100", "generatedFrom": "1000 Genomes", "readSetID": "1234567890" }, { "sourceFiles": { "source1": "s3://amzn-s3-demo-bucket/SRR233106_1.filt.fastq.gz", "source2": "s3://amzn-s3-demo-bucket/SRR233106_2.filt.fastq.gz" }, "sourceFileType": "FASTQ", "status": "IN_PROGRESS", "statusMessage": "The job is currently in progress." "subjectId": "mySubject", "sampleId": "mySample", "name": "HG00146", "description": "FASTQ for HG00146", "generatedFrom": "1000 Genomes", "readSetID": "1234567890" }, { "sourceFiles": { "source1": "s3://amzn-s3-demo-bucket/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram" }, "sourceFileType": "CRAM", "status": "IN_PROGRESS", "statusMessage": "The job is currently in progress." "subjectId": "mySubject", "sampleId": "mySample", "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/3242349265/reference/1234568870", "name": "HG00096", "description": "CRAM for HG00096", "generatedFrom": "1000 Genomes", "readSetID": "1234567890" }, { "sourceFiles": { "source1": "s3://amzn-s3-demo-bucket/NA12878_A.bam" }, "sourceFileType": "UBAM", "status": "IN_PROGRESS", "statusMessage": "The job is currently in progress." "subjectId": "mySubject", "sampleId": "mySample", "name": "NA12878_A", "description": "uBAM for NA12878", "generatedFrom": "GATK Test Data", "readSetID": "1234567890" } ] }

Find the imported sequence files

After the job completes, you can use the list-read-sets API operation to find the imported sequence files. In the following example, replace sequence store id with your sequence store ID.

aws omics list-read-sets --sequence-store-id sequence store id

You receive the following response.

{ "readSets": [ { "id": "0000000001", "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/01234567890/readSet/0000000001", "sequenceStoreId": "1234567890", "subjectId": "mySubject", "sampleId": "mySample", "status": "ACTIVE", "name": "HG00100", "description": "BAM for HG00100", "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/01234567890/reference/0000000001", "fileType": "BAM", "sequenceInformation": { "totalReadCount": 9194, "totalBaseCount": 928594, "generatedFrom": "1000 Genomes", "alignment": "ALIGNED" }, "creationTime": "2022-07-13T23:25:20Z" "creationType": "IMPORT", "etag": { "algorithm": "BAM_MD5up", "source1": "d1d65429212d61d115bb19f510d4bd02" } }, { "id": "0000000002", "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000002", "sequenceStoreId": "0123456789", "subjectId": "mySubject", "sampleId": "mySample", "status": "ACTIVE", "name": "HG00146", "description": "FASTQ for HG00146", "fileType": "FASTQ", "sequenceInformation": { "totalReadCount": 8000000, "totalBaseCount": 1184000000, "generatedFrom": "1000 Genomes", "alignment": "UNALIGNED" }, "creationTime": "2022-07-13T23:26:43Z" "creationType": "IMPORT", "etag": { "algorithm": "FASTQ_MD5up", "source1": "ca78f685c26e7cc2bf3e28e3ec4d49cd" } }, { "id": "0000000003", "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000003", "sequenceStoreId": "0123456789", "subjectId": "mySubject", "sampleId": "mySample", "status": "ACTIVE", "name": "HG00096", "description": "CRAM for HG00096", "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/0123456789/reference/0000000001", "fileType": "CRAM", "sequenceInformation": { "totalReadCount": 85466534, "totalBaseCount": 24000004881, "generatedFrom": "1000 Genomes", "alignment": "ALIGNED" }, "creationTime": "2022-07-13T23:30:41Z" "creationType": "IMPORT", "etag": { "algorithm": "CRAM_MD5up", "source1": "66817940f3025a760e6da4652f3e927e" } }, { "id": "0000000004", "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000004", "sequenceStoreId": "0123456789", "subjectId": "mySubject", "sampleId": "mySample", "status": "ACTIVE", "name": "NA12878_A", "description": "uBAM for NA12878", "fileType": "UBAM", "sequenceInformation": { "totalReadCount": 20000, "totalBaseCount": 5000000, "generatedFrom": "GATK Test Data", "alignment": "ALIGNED" }, "creationTime": "2022-07-13T23:30:41Z" "creationType": "IMPORT", "etag": { "algorithm": "BAM_MD5up", "source1": "640eb686263e9f63bcda12c35b84f5c7" } } ] }

Get details about a read set

To view more details about a read set, use the GetReadSetMetadata API operation. In the following example, replace sequence store id with your sequence store ID, and replace read set id with your read set ID.

aws omics get-read-set-metadata --sequence-store-id sequence store id --id read set id

You receive the following response.

{ "arn": "arn:aws:omics:us-west-2:123456789012:sequenceStore/2015356892/readSet/9515444019", "creationTime": "2024-01-12T04:50:33.548Z", "creationType": "IMPORT", "creationJobId": "33222111", "description": null, "etag": { "algorithm": "FASTQ_MD5up", "source1": "00d0885ba3eeb211c8c84520d3fa26ec", "source2": "00d0885ba3eeb211c8c84520d3fa26ec" }, "fileType": "FASTQ", "files": { "index": null, "source1": { "contentLength": 10818, "partSize": 104857600, "s3Access": { "s3Uri": "s3://accountID-sequence store ID-ajdpi90jdas90a79fh9a8ja98jdfa9jf98-s3alias/592761533288/sequenceStore/2015356892/readSet/9515444019/import_source1.fastq.gz" }, "totalParts": 1 }, "source2": { "contentLength": 10818, "partSize": 104857600, "s3Access": { "s3Uri": "s3://accountID-sequence store ID-ajdpi90jdas90a79fh9a8ja98jdfa9jf98-s3alias/592761533288/sequenceStore/2015356892/readSet/9515444019/import_source1.fastq.gz" }, "totalParts": 1 } }, "id": "9515444019", "name": "paired-fastq-import", "sampleId": "sampleId-paired-fastq-import", "sequenceInformation": { "alignment": "UNALIGNED", "generatedFrom": null, "totalBaseCount": 30000, "totalReadCount": 200 }, "sequenceStoreId": "2015356892", "status": "ACTIVE", "statusMessage": null, "subjectId": "subjectId-paired-fastq-import" }

Download the read set data files

You can access the objects for an active read set using the Amazon S3 GetObject API operation. The URI for the object is returned in the GetReadSetMetadata API response. For more information, see Accessing HealthOmics read sets with Amazon S3 URIs.
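
For example, the following sketch copies the first FASTQ source from the earlier GetReadSetMetadata response to a local file by using the s3Access.s3Uri value; the URI shown here is the placeholder value from that response.

aws s3 cp "s3://accountID-sequence store ID-ajdpi90jdas90a79fh9a8ja98jdfa9jf98-s3alias/592761533288/sequenceStore/2015356892/readSet/9515444019/import_source1.fastq.gz" ./import_source1.fastq.gz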

Alternatively, use the HealthOmics GetReadSet API operation. With GetReadSet, you can download a read set in parallel by retrieving its individual parts, which are similar to Amazon S3 parts. The following example downloads part 1 of a read set. Replace sequence store id with your sequence store ID, and replace read set id with your read set ID.

aws omics get-read-set --sequence-store-id sequence store id --id read set id --part-number 1 outfile.bam
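
To reassemble a complete file, download each part in order and append it to the output. The following bash sketch assumes the totalParts value reported by get-read-set-metadata for the source1 file, and that concatenating the parts in part-number order reproduces the original file.

total_parts=1   # set this to the totalParts value from get-read-set-metadata
rm -f outfile.bam
for part in $(seq 1 "$total_parts"); do
    aws omics get-read-set \
        --sequence-store-id sequence store id \
        --id read set id \
        --part-number "$part" "part_${part}"
    cat "part_${part}" >> outfile.bam
done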

You can also use the HealthOmics Transfer Manager to download files for a HealthOmics reference or read set. For more information about downloading, setting up, and using the Transfer Manager, see its GitHub repository.
