After you create your sequence store, create import jobs to upload read sets into the data store. You can upload your files from an Amazon S3 bucket, or you can upload directly by using the synchronous API operations. Your Amazon S3 bucket must be in the same Region as your sequence store.
You can upload any combination of aligned and unaligned read sets into your sequence store. However, if any of the read sets in your import are aligned, you must include a reference genome.
You can reuse the IAM access policy that you used to create the reference store.
The following topics describe the major steps you follow to import a read set into your sequence store and then get information about the imported data.
Upload files to Amazon S3
The following example shows how to move files into your Amazon S3 bucket.
aws s3 cp s3://1000genomes/phase1/data/HG00100/alignment/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam s3://your-bucket
aws s3 cp s3://1000genomes/phase3/data/HG00146/sequence_read/SRR233106_1.filt.fastq.gz s3://your-bucket
aws s3 cp s3://1000genomes/phase3/data/HG00146/sequence_read/SRR233106_2.filt.fastq.gz s3://your-bucket
aws s3 cp s3://1000genomes/data/HG00096/alignment/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram s3://your-bucket
aws s3 cp s3://gatk-test-data/wgs_ubam/NA12878_20k/NA12878_A.bam s3://your-bucket
The sample BAM and CRAM used in this example require different genome references, Hg19 and Hg38. To learn more or to access these references, see The Broad Genome References.
Creating a manifest file
You must also create a manifest file in JSON to model the import job in import.json (see the following example). If you create a sequence store in the console, you don't have to specify the sequenceStoreId or roleArn, so your manifest file starts with the sources input.
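The following example imports four read sets by using the API: one BAM, one FASTQ, one CRAM, and one uBAM. The lines that begin with // NOTE are annotations for this example. Remove them before you run the import, because JSON doesn't support comments.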
{
"sequenceStoreId": "3936421177",
"roleArn": "arn:aws:iam::555555555555:role/OmicsImport",
"sources":
[
{
"sourceFiles":
{
"source1": "s3://amzn-s3-demo-bucket/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam"
},
"sourceFileType": "BAM",
"subjectId": "mySubject",
"sampleId": "mySample",
"referenceArn": "arn:aws:omics:us-west-2:555555555555:referenceStore/0123456789/reference/0000000001",
"name": "HG00100",
"description": "BAM for HG00100",
"generatedFrom": "1000 Genomes"
},
{
"sourceFiles":
{
"source1": "s3://amzn-s3-demo-bucket/SRR233106_1.filt.fastq.gz",
"source2": "s3://amzn-s3-demo-bucket/SRR233106_2.filt.fastq.gz"
},
"sourceFileType": "FASTQ",
"subjectId": "mySubject",
"sampleId": "mySample",
// NOTE: there is no reference arn required here
"name": "HG00146",
"description": "FASTQ for HG00146",
"generatedFrom": "1000 Genomes"
},
{
"sourceFiles":
{
"source1": "s3://amzn-s3-demo-bucket/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram"
},
"sourceFileType": "CRAM",
"subjectId": "mySubject",
"sampleId": "mySample",
"referenceArn": "arn:aws:omics:us-west-2:555555555555:referenceStore/0123456789/reference/0000000001",
"name": "HG00096",
"description": "CRAM for HG00096",
"generatedFrom": "1000 Genomes"
},
{
"sourceFiles":
{
"source1": "s3://amzn-s3-demo-bucket/NA12878_A.bam"
},
"sourceFileType": "UBAM",
"subjectId": "mySubject",
"sampleId": "mySample",
// NOTE: there is no reference arn required here
"name": "NA12878_A",
"description": "uBAM for NA12878",
"generatedFrom": "GATK Test Data"
}
]
}
Alternatively, you can upload the manifest file in YAML format.
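As a rough sketch of how such a manifest could be generated instead of written by hand, the following Python builds the entries of the sources list and rejects BAM or CRAM entries that lack a referenceArn, mirroring the rule that aligned read sets need a reference genome. The helper names make_source and build_manifest are hypothetical, and the IDs, ARNs, and bucket name are the placeholder values from the example manifest.

```python
import json

# Aligned formats need a reference genome; FASTQ and uBAM do not.
ALIGNED_TYPES = {"BAM", "CRAM"}


def make_source(file_type, files, subject_id, sample_id, reference_arn=None, **optional):
    """Build one entry for the manifest's "sources" list (hypothetical helper)."""
    if file_type in ALIGNED_TYPES and not reference_arn:
        raise ValueError(f"{file_type} read sets require a referenceArn")
    source = {
        # Files are keyed source1, source2, ... in the order given.
        "sourceFiles": {f"source{i}": uri for i, uri in enumerate(files, start=1)},
        "sourceFileType": file_type,
        "subjectId": subject_id,
        "sampleId": sample_id,
    }
    if reference_arn:
        source["referenceArn"] = reference_arn
    source.update(optional)  # for example name, description, generatedFrom
    return source


def build_manifest(sequence_store_id, role_arn, sources):
    """Assemble the top-level manifest structure."""
    return {"sequenceStoreId": sequence_store_id, "roleArn": role_arn, "sources": sources}


manifest = build_manifest(
    "3936421177",
    "arn:aws:iam::555555555555:role/OmicsImport",
    [
        make_source(
            "FASTQ",
            [
                "s3://amzn-s3-demo-bucket/SRR233106_1.filt.fastq.gz",
                "s3://amzn-s3-demo-bucket/SRR233106_2.filt.fastq.gz",
            ],
            "mySubject",
            "mySample",
            name="HG00146",
        ),
    ],
)
# Print the manifest; redirect the output to import.json to use it with the CLI.
print(json.dumps(manifest, indent=2))
```

Validating at build time surfaces a missing reference before the job is submitted, which is usually cheaper than discovering it from a failed import.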
Starting the import job
To start the import job, use the following AWS CLI command.
aws omics start-read-set-import-job --cli-input-json file://import.json
You receive the following response, which indicates successful job creation.
{
"id": "3660451514",
"sequenceStoreId": "3936421177",
"roleArn": "arn:aws:iam::111122223333:role/OmicsImport",
"status": "CREATED",
"creationTime": "2022-07-13T22:14:59.309Z"
}
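If you prefer an SDK over the CLI, the same request can be sketched in Python. This assumes the AWS SDK for Python (boto3) and its omics client, whose start_read_set_import_job call takes the same three manifest fields. The function name start_import is hypothetical, and the client is passed in so that the sketch itself has no dependency beyond the standard library.

```python
def start_import(client, manifest):
    """Start an import job from a manifest dict and return the new job ID."""
    response = client.start_read_set_import_job(
        sequenceStoreId=manifest["sequenceStoreId"],
        roleArn=manifest["roleArn"],
        sources=manifest["sources"],
    )
    return response["id"]


# Typical use, assuming boto3 is installed and credentials are configured:
#   import boto3, json
#   with open("import.json") as fh:
#       job_id = start_import(boto3.client("omics"), json.load(fh))
```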
Monitor the import job
After the import job starts, you can monitor its progress with the following command. In the following example, replace sequence store id with your sequence store ID, and replace job import ID with the import ID.
aws omics get-read-set-import-job --sequence-store-id sequence store id --id job import ID
The following response shows the status of the import job and of each source file in it.
{
"id": "1234567890",
"sequenceStoreId": "1234567890",
"roleArn": "arn:aws:iam::111122223333:role/OmicsImport",
"status": "RUNNING",
"statusMessage": "The job is currently in progress.",
"creationTime": "2022-07-13T22:14:59.309Z",
"sources": [
{
"sourceFiles":
{
"source1": "s3://amzn-s3-demo-bucket/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam"
},
"sourceFileType": "BAM",
"status": "IN_PROGRESS",
"statusMessage": "The job is currently in progress.",
"subjectId": "mySubject",
"sampleId": "mySample",
"referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/3242349265/reference/8625408453",
"name": "HG00100",
"description": "BAM for HG00100",
"generatedFrom": "1000 Genomes",
"readSetID": "1234567890"
},
{
"sourceFiles":
{
"source1": "s3://amzn-s3-demo-bucket/SRR233106_1.filt.fastq.gz",
"source2": "s3://amzn-s3-demo-bucket/SRR233106_2.filt.fastq.gz"
},
"sourceFileType": "FASTQ",
"status": "IN_PROGRESS",
"statusMessage": "The job is currently in progress.",
"subjectId": "mySubject",
"sampleId": "mySample",
"name": "HG00146",
"description": "FASTQ for HG00146",
"generatedFrom": "1000 Genomes",
"readSetID": "1234567890"
},
{
"sourceFiles":
{
"source1": "s3://amzn-s3-demo-bucket/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram"
},
"sourceFileType": "CRAM",
"status": "IN_PROGRESS",
"statusMessage": "The job is currently in progress.",
"subjectId": "mySubject",
"sampleId": "mySample",
"referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/3242349265/reference/1234568870",
"name": "HG00096",
"description": "CRAM for HG00096",
"generatedFrom": "1000 Genomes",
"readSetID": "1234567890"
},
{
"sourceFiles":
{
"source1": "s3://amzn-s3-demo-bucket/NA12878_A.bam"
},
"sourceFileType": "UBAM",
"status": "IN_PROGRESS",
"statusMessage": "The job is currently in progress.",
"subjectId": "mySubject",
"sampleId": "mySample",
"name": "NA12878_A",
"description": "uBAM for NA12878",
"generatedFrom": "GATK Test Data",
"readSetID": "1234567890"
}
]
}
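Checking the job by hand works for a single import, but a script usually needs to wait until the job finishes. The following is a minimal polling sketch, assuming a boto3 omics client is passed in. The set of terminal states is an assumption based on the status names in the HealthOmics API reference, and the function names are hypothetical. Confirm the state names against the responses you actually receive.

```python
import time
from collections import Counter

# Assumed terminal job states; confirm against the API reference.
TERMINAL_STATES = {"COMPLETED", "COMPLETED_WITH_FAILURES", "FAILED", "CANCELLED"}


def wait_for_import(client, sequence_store_id, job_id,
                    delay=30, max_checks=120, sleep=time.sleep):
    """Poll the job until it reaches a terminal state, then return the last response."""
    for _ in range(max_checks):
        job = client.get_read_set_import_job(
            sequenceStoreId=sequence_store_id, id=job_id
        )
        if job["status"] in TERMINAL_STATES:
            return job
        sleep(delay)
    raise TimeoutError(f"import job {job_id} not finished after {max_checks} checks")


def source_status_counts(job):
    """Count the sources in a job response by their per-source status."""
    return Counter(source["status"] for source in job.get("sources", []))
```

The sleep function is a parameter only so that the loop can be exercised without real delays. Bounding the number of checks keeps a stuck job from blocking the caller indefinitely.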
Find the imported sequence files
After the job completes, you can use the list-read-sets API operation to find the imported sequence files. In the following example, replace sequence store id with your sequence store ID.
aws omics list-read-sets --sequence-store-id sequence store id
You receive the following response.
{
"readSets": [
{
"id": "0000000001",
"arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/01234567890/readSet/0000000001",
"sequenceStoreId": "1234567890",
"subjectId": "mySubject",
"sampleId": "mySample",
"status": "ACTIVE",
"name": "HG00100",
"description": "BAM for HG00100",
"referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/01234567890/reference/0000000001",
"fileType": "BAM",
"sequenceInformation": {
"totalReadCount": 9194,
"totalBaseCount": 928594,
"generatedFrom": "1000 Genomes",
"alignment": "ALIGNED"
},
"creationTime": "2022-07-13T23:25:20Z",
"creationType": "IMPORT",
"etag": {
"algorithm": "BAM_MD5up",
"source1": "d1d65429212d61d115bb19f510d4bd02"
}
},
{
"id": "0000000002",
"arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000002",
"sequenceStoreId": "0123456789",
"subjectId": "mySubject",
"sampleId": "mySample",
"status": "ACTIVE",
"name": "HG00146",
"description": "FASTQ for HG00146",
"fileType": "FASTQ",
"sequenceInformation": {
"totalReadCount": 8000000,
"totalBaseCount": 1184000000,
"generatedFrom": "1000 Genomes",
"alignment": "UNALIGNED"
},
"creationTime": "2022-07-13T23:26:43Z",
"creationType": "IMPORT",
"etag": {
"algorithm": "FASTQ_MD5up",
"source1": "ca78f685c26e7cc2bf3e28e3ec4d49cd"
}
},
{
"id": "0000000003",
"arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000003",
"sequenceStoreId": "0123456789",
"subjectId": "mySubject",
"sampleId": "mySample",
"status": "ACTIVE",
"name": "HG00096",
"description": "CRAM for HG00096",
"referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/0123456789/reference/0000000001",
"fileType": "CRAM",
"sequenceInformation": {
"totalReadCount": 85466534,
"totalBaseCount": 24000004881,
"generatedFrom": "1000 Genomes",
"alignment": "ALIGNED"
},
"creationTime": "2022-07-13T23:30:41Z",
"creationType": "IMPORT",
"etag": {
"algorithm": "CRAM_MD5up",
"source1": "66817940f3025a760e6da4652f3e927e"
}
},
{
"id": "0000000004",
"arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000004",
"sequenceStoreId": "0123456789",
"subjectId": "mySubject",
"sampleId": "mySample",
"status": "ACTIVE",
"name": "NA12878_A",
"description": "uBAM for NA12878",
"fileType": "UBAM",
"sequenceInformation": {
"totalReadCount": 20000,
"totalBaseCount": 5000000,
"generatedFrom": "GATK Test Data",
"alignment": "UNALIGNED"
},
"creationTime": "2022-07-13T23:30:41Z",
"creationType": "IMPORT",
"etag": {
"algorithm": "BAM_MD5up",
"source1": "640eb686263e9f63bcda12c35b84f5c7"
}
}
]
}
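A sequence store with many read sets returns its list in pages. The following sketch, which assumes a boto3 omics client is passed in, follows the nextToken field until the list is exhausted and then groups read set IDs by file type. The function names iter_read_sets and ids_by_file_type are hypothetical.

```python
def iter_read_sets(client, sequence_store_id):
    """Yield every read set in a sequence store, following nextToken pagination."""
    kwargs = {"sequenceStoreId": sequence_store_id}
    while True:
        page = client.list_read_sets(**kwargs)
        yield from page.get("readSets", [])
        token = page.get("nextToken")
        if not token:
            return
        kwargs["nextToken"] = token


def ids_by_file_type(read_sets):
    """Group read set IDs by their fileType (for example BAM, CRAM, FASTQ, UBAM)."""
    groups = {}
    for read_set in read_sets:
        groups.setdefault(read_set["fileType"], []).append(read_set["id"])
    return groups
```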
Get details about a read set
To view more details about a read set, use the GetReadSetMetadata API operation. In the following example, replace sequence store id with your sequence store ID, and replace read set id with your read set ID.
aws omics get-read-set-metadata --sequence-store-id sequence store id --id read set id
You receive the following response.
{
"arn": "arn:aws:omics:us-west-2:123456789012:sequenceStore/2015356892/readSet/9515444019",
"creationTime": "2024-01-12T04:50:33.548Z",
"creationType": "IMPORT",
"creationJobId": "33222111",
"description": null,
"etag": {
"algorithm": "FASTQ_MD5up",
"source1": "00d0885ba3eeb211c8c84520d3fa26ec",
"source2": "00d0885ba3eeb211c8c84520d3fa26ec"
},
"fileType": "FASTQ",
"files": {
"index": null,
"source1": {
"contentLength": 10818,
"partSize": 104857600,
"s3Access": {
"s3Uri": "s3://accountID-sequence store ID-ajdpi90jdas90a79fh9a8ja98jdfa9jf98-s3alias/592761533288/sequenceStore/2015356892/readSet/9515444019/import_source1.fastq.gz"
},
"totalParts": 1
},
"source2": {
"contentLength": 10818,
"partSize": 104857600,
"s3Access": {
"s3Uri": "s3://accountID-sequence store ID-ajdpi90jdas90a79fh9a8ja98jdfa9jf98-s3alias/592761533288/sequenceStore/2015356892/readSet/9515444019/import_source1.fastq.gz"
},
"totalParts": 1
}
},
"totalParts": 1
}
},
"id": "9515444019",
"name": "paired-fastq-import",
"sampleId": "sampleId-paired-fastq-import",
"sequenceInformation": {
"alignment": "UNALIGNED",
"generatedFrom": null,
"totalBaseCount": 30000,
"totalReadCount": 200
},
"sequenceStoreId": "2015356892",
"status": "ACTIVE",
"statusMessage": null,
"subjectId": "subjectId-paired-fastq-import"
}
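The files section of this response carries what a download step needs: the S3 URI, the size, and the number of parts for each file. As an illustration, the following sketch pulls those fields out of a parsed response and skips entries that are null, such as the index of a FASTQ read set. The function name file_summary is hypothetical.

```python
def file_summary(metadata):
    """Map each file that is present to its S3 URI, size in bytes, and part count."""
    summary = {}
    for name, info in metadata.get("files", {}).items():
        if info is None:  # for example "index": null on a FASTQ read set
            continue
        summary[name] = {
            "s3Uri": info["s3Access"]["s3Uri"],
            "contentLength": info["contentLength"],
            "totalParts": info["totalParts"],
        }
    return summary
```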
Download the read set data files
You can access the objects for an active read set using the Amazon S3 GetObject API operation. The URI for the object is returned in the GetReadSetMetadata API response. For more information, see Accessing HealthOmics read sets with Amazon S3 URIs.
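A minimal sketch of the Amazon S3 route follows, assuming a boto3 s3 client is passed in and that the URI comes from the s3Uri field of the metadata response. The bucket portion of that URI is an access point alias, which the sketch passes as the Bucket argument of get_object. The function names split_s3_uri and download_by_uri are hypothetical.

```python
import shutil


def split_s3_uri(uri):
    """Split s3://bucket/key into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI needs both a bucket and a key: {uri}")
    return bucket, key


def download_by_uri(s3_client, uri, out_file):
    """Stream the object behind an S3 URI into an open binary file object."""
    bucket, key = split_s3_uri(uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    shutil.copyfileobj(response["Body"], out_file)


# Typical use, assuming boto3 is installed and credentials are configured:
#   import boto3
#   with open("import_source1.fastq.gz", "wb") as fh:
#       download_by_uri(boto3.client("s3"), s3_uri, fh)
```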
Alternatively, use the HealthOmics GetReadSet API operation. You can use GetReadSet to download in parallel by downloading individual parts. These parts are similar to Amazon S3 parts. The following is an example of how to download part 1 from a read set. In the following example, replace sequence store id with your sequence store ID, and replace read set id with your read set ID.
aws omics get-read-set --sequence-store-id sequence store id --id read set id --part-number 1 outfile.bam
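To illustrate the parallel download, the following sketch fetches every part of one file with a thread pool and writes the parts in order. It assumes a boto3 omics client is passed in, that the part count comes from the totalParts field of the metadata response, and that the response body is exposed under the payload key. The function name download_read_set is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def download_read_set(client, sequence_store_id, read_set_id, total_parts,
                      out_file, file="SOURCE1", workers=4):
    """Fetch all parts concurrently and write them to out_file in part order."""

    def fetch(part_number):
        response = client.get_read_set(
            sequenceStoreId=sequence_store_id,
            id=read_set_id,
            file=file,
            partNumber=part_number,
        )
        return response["payload"].read()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in submission order, so parts stay in sequence
        # even when a later part finishes downloading first.
        for chunk in pool.map(fetch, range(1, total_parts + 1)):
            out_file.write(chunk)
```

This version holds each fetched part in memory until it is written, so for large read sets you might write each part to its own temporary file instead and concatenate afterward.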
You can also use the HealthOmics Transfer Manager to download files for a HealthOmics reference or read set. You can download the HealthOmics Transfer Manager here.