Direct upload to a sequence store - AWS HealthOmics

The HealthOmics Transfer Manager is recommended for adding files to your sequence store. You can also upload your read sets directly to a sequence store through the direct upload API operations.

Read sets created by direct upload start in the PROCESSING_UPLOAD state. In this state, the file parts are still being uploaded, but you can already access the read set metadata. After the parts are uploaded and the checksums are validated, the read set becomes ACTIVE and behaves the same as an imported read set.

If the direct upload fails, the read set status is shown as UPLOAD_FAILED. You can configure an Amazon S3 bucket as a fallback location for any files that fail to upload. The file parts for those read sets are transferred to the fallback location. Fallback locations are available on sequence stores created after May 15, 2023. You must also have an IAM policy that grants you read access to that Amazon S3 location.

To begin a multipart upload with the AWS CLI, first split your data into parts, as shown in the following example.

split -b 100MiB SRR233106_1.filt.fastq.gz source1_part_
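Because each part is later uploaded under its own part number, it can help to check how many parts split produced and in which order. The following is a minimal sketch, assuming the parts are numbered in filename order starting at 1. The helper name list_parts is hypothetical and is not part of the AWS CLI.

```shell
# list_parts is a hypothetical helper, not an AWS CLI command.
# It prints "<part number> <file name>" for every file that starts with
# the given prefix, numbering the files in filename order from 1.
list_parts() {
  prefix="$1"
  n=0
  for part in "$prefix"*; do
    # Skip the unexpanded pattern when no part files exist.
    [ -e "$part" ] || continue
    n=$((n + 1))
    printf '%s %s\n' "$n" "$part"
  done
}
```

For the split command above, you would call list_parts source1_part_ from the directory that holds the parts.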

After your source files are in parts, you can then create a multipart read set upload, as shown in the following. In the following example, replace sequence store ID and the other parameters with your sequence store ID and other values.

aws omics create-multipart-read-set-upload \
    --sequence-store-id sequence store ID \
    --name upload name \
    --source-file-type FASTQ \
    --subject-id subject ID \
    --sample-id sample ID \
    --description "description of upload" \
    --generated-from "source of imported files"

The response contains the uploadId and other metadata. Use the uploadId for the next step of the upload process.

{
    "sequenceStoreId": "1504776472",
    "uploadId": "7640892890",
    "sourceFileType": "FASTQ",
    "subjectId": "mySubject",
    "sampleId": "mySample",
    "generatedFrom": "1000 Genomes",
    "name": "HG00146",
    "description": "FASTQ for HG00146",
    "creationTime": "2023-11-20T23:40:47.437522+00:00"
}

Next, add the parts of your file to the upload. If your file is small enough, you only have to perform this step once. For larger files, you perform this step for each part of your file. If you upload a new part by using a previously used part number, it overwrites the previously uploaded part.

In the following example, replace sequence store ID, upload ID, and the other parameters with your values.

aws omics upload-read-set-part \
    --sequence-store-id sequence store ID \
    --upload-id upload ID \
    --part-source SOURCE1 \
    --part-number part number \
    --payload source1_part_aa
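For a file that was split into many parts, the same command can be repeated in a loop. The following is a minimal sketch, assuming the part files share a prefix and should be numbered in filename order starting at 1. The helper name upload_all_parts and its argument order are hypothetical. Only the upload-read-set-part options shown above are used.

```shell
# upload_all_parts is a hypothetical helper, not an AWS CLI command.
# Arguments: sequence store ID, upload ID, part source (for example
# SOURCE1), and the filename prefix of the part files.
upload_all_parts() {
  store_id="$1"
  upload_id="$2"
  part_source="$3"
  prefix="$4"
  n=0
  for part in "$prefix"*; do
    # Skip the unexpanded pattern when no part files exist.
    [ -e "$part" ] || continue
    n=$((n + 1))
    # Stop at the first failed upload so the part can be retried.
    aws omics upload-read-set-part \
      --sequence-store-id "$store_id" \
      --upload-id "$upload_id" \
      --part-source "$part_source" \
      --part-number "$n" \
      --payload "$part" || return 1
  done
}
```

Because uploading with a previously used part number overwrites that part, rerunning the helper after a failure re-uploads the parts under the same numbers.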

The response contains a checksum that you can use to verify that the uploaded part matches the part you intended to upload.

{ "checksum": "984979b9928ae8d8622286c4a9cd8e99d964a22d59ed0f5722e1733eb280e635" }

Continue uploading the parts of your file, if necessary. To verify that your parts have been uploaded, use the list-read-set-upload-parts API operation, as shown in the following example. Replace sequence store ID, upload ID, and the part source with your own values.

aws omics list-read-set-upload-parts \
    --sequence-store-id sequence store ID \
    --upload-id upload ID \
    --part-source SOURCE1

The response lists each uploaded part with its part number, size, source, and checksum, along with timestamps for when the part was created and most recently updated.

{
    "parts": [
        {
            "partNumber": 1,
            "partSize": 104857600,
            "partSource": "SOURCE1",
            "checksum": "MVMQk+vB9C3Ge8ADHkbKq752n3BCUzyl41qEkqlOD5M=",
            "creationTime": "2023-11-20T23:58:03.500823+00:00",
            "lastUpdatedTime": "2023-11-20T23:58:03.500831+00:00"
        },
        {
            "partNumber": 2,
            "partSize": 104857600,
            "partSource": "SOURCE1",
            "checksum": "keZzVzJNChAqgOdZMvOmjBwrOPM0enPj1UAfs0nvRto=",
            "creationTime": "2023-11-21T00:02:03.813013+00:00",
            "lastUpdatedTime": "2023-11-21T00:02:03.813025+00:00"
        },
        {
            "partNumber": 3,
            "partSize": 100339539,
            "partSource": "SOURCE1",
            "checksum": "TBkNfMsaeDpXzEf3ldlbi0ipFDPaohKHyZ+LF1J4CHk=",
            "creationTime": "2023-11-21T00:09:11.705198+00:00",
            "lastUpdatedTime": "2023-11-21T00:09:11.705208+00:00"
        }
    ]
}

To view all active multipart read set uploads, use list-multipart-read-set-uploads, as shown in the following. Replace sequence store ID with the ID for your own sequence store.

aws omics list-multipart-read-set-uploads --sequence-store-id sequence store ID

This API only returns multipart read set uploads that are in progress. After the ingested read sets are ACTIVE, or if the upload has failed, the upload will not be returned in the response to the list-multipart-read-set-uploads API. To view active read sets, use the list-read-sets API. An example response for list-multipart-read-set-uploads is shown in the following.

{
    "uploads": [
        {
            "sequenceStoreId": "1234567890",
            "uploadId": "8749584421",
            "sourceFileType": "FASTQ",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "generatedFrom": "1000 Genomes",
            "name": "HG00146",
            "description": "FASTQ for HG00146",
            "creationTime": "2023-11-29T19:22:51.349298+00:00"
        },
        {
            "sequenceStoreId": "1234567890",
            "uploadId": "5290538638",
            "sourceFileType": "BAM",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "generatedFrom": "1000 Genomes",
            "referenceArn": "arn:aws:omics:us-west-2:123456789012:referenceStore/8168613728/reference/2190697383",
            "name": "HG00146",
            "description": "BAM for HG00146",
            "creationTime": "2023-11-29T19:23:33.116516+00:00"
        },
        {
            "sequenceStoreId": "1234567890",
            "uploadId": "4174220862",
            "sourceFileType": "BAM",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "generatedFrom": "1000 Genomes",
            "referenceArn": "arn:aws:omics:us-west-2:123456789012:referenceStore/8168613728/reference/2190697383",
            "name": "HG00147",
            "description": "BAM for HG00147",
            "creationTime": "2023-11-29T19:23:47.007866+00:00"
        }
    ]
}

After you upload all parts of your file, use complete-multipart-read-set-upload to conclude the upload process, as shown in the following. Replace sequence store ID, upload ID, and the parameter for parts with your own values.

aws omics complete-multipart-read-set-upload \
    --sequence-store-id sequence store ID \
    --upload-id upload ID \
    --parts '[{"checksum":"gaCBQMe+rpCFZxLpoP6gydBoXaKKDA/Vobh5zBDb4W4=","partNumber":1,"partSource":"SOURCE1"}]'
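Typing the --parts value by hand is error-prone when there are many parts. The following is a minimal sketch that builds the value from the output of list-read-set-upload-parts, using the AWS CLI global --query (JMESPath) and --output options. The helper name complete_upload is hypothetical, and the sketch assumes a single source. An upload with both SOURCE1 and SOURCE2 would need the two part lists combined, which is not shown here.

```shell
# complete_upload is a hypothetical helper, not an AWS CLI command.
# Arguments: sequence store ID, upload ID, and part source (for example
# SOURCE1). Assumes all parts of that source are already uploaded.
complete_upload() {
  store_id="$1"
  upload_id="$2"
  part_source="$3"
  # Keep only the three fields that the --parts parameter expects.
  parts=$(aws omics list-read-set-upload-parts \
    --sequence-store-id "$store_id" \
    --upload-id "$upload_id" \
    --part-source "$part_source" \
    --query 'parts[].{partNumber: partNumber, partSource: partSource, checksum: checksum}' \
    --output json) || return 1
  aws omics complete-multipart-read-set-upload \
    --sequence-store-id "$store_id" \
    --upload-id "$upload_id" \
    --parts "$parts"
}
```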

The response for complete-multipart-read-set-upload contains the read set ID for the new read set.

{ "readSetId": "0000000001" }

To stop the upload, use abort-multipart-read-set-upload with the upload ID to end the upload process. Replace sequence store ID and upload ID with your own parameter values.

aws omics abort-multipart-read-set-upload \
    --sequence-store-id sequence store ID \
    --upload-id upload ID

After the upload is complete, you can retrieve your data from the read set by using get-read-set, as shown in the following. If the upload is still processing, get-read-set returns limited metadata, and the generated index files are unavailable. Replace sequence store ID and the other parameters with your own input.

aws omics get-read-set \
    --sequence-store-id sequence store ID \
    --id read set ID \
    --file SOURCE1 \
    --part-number 1 \
    myfile.fastq.gz
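The command above retrieves a single part. The following is a minimal sketch for retrieving every part of one file and joining them, assuming that concatenating the parts in part-number order reproduces the file, which this page does not state. The helper name download_source is hypothetical, and the total number of parts is assumed to come from the totalParts field in the read set metadata.

```shell
# download_source is a hypothetical helper, not an AWS CLI command.
# Arguments: sequence store ID, read set ID, file (for example SOURCE1),
# total number of parts, and the output file name.
# Assumption: joining the parts in order reproduces the original file.
download_source() {
  store_id="$1"
  read_set_id="$2"
  file="$3"
  total_parts="$4"
  out="$5"
  : > "$out"   # start from an empty output file
  n=1
  while [ "$n" -le "$total_parts" ]; do
    aws omics get-read-set \
      --sequence-store-id "$store_id" \
      --id "$read_set_id" \
      --file "$file" \
      --part-number "$n" \
      "$out.part" || return 1
    # Append the part that was just downloaded, then discard it.
    cat "$out.part" >> "$out"
    rm -f "$out.part"
    n=$((n + 1))
  done
}
```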

To check the metadata, including the status of your upload, use the get-read-set-metadata API operation.

aws omics get-read-set-metadata --sequence-store-id sequence store ID --id read set ID

The response includes metadata details such as the file type, the reference ARN, the number of files, and the length of the sequences. It also includes the status. Possible statuses are PROCESSING_UPLOAD, ACTIVE, and UPLOAD_FAILED.

{
    "id": "0000000001",
    "arn": "arn:aws:omics:us-west-2:555555555555:sequenceStore/0123456789/readSet/0000000001",
    "sequenceStoreId": "0123456789",
    "subjectId": "mySubject",
    "sampleId": "mySample",
    "status": "PROCESSING_UPLOAD",
    "name": "HG00146",
    "description": "FASTQ for HG00146",
    "fileType": "FASTQ",
    "creationTime": "2022-07-13T23:25:20Z",
    "files": {
        "source1": {
            "totalParts": 5,
            "partSize": 123456789012,
            "contentLength": 6836725
        },
        "source2": {
            "totalParts": 5,
            "partSize": 123456789056,
            "contentLength": 6836726
        }
    },
    "creationType": "UPLOAD"
}
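Because a read set stays in PROCESSING_UPLOAD until its checksums are validated, a script might poll the status before using the read set. The following is a minimal sketch that uses the AWS CLI global --query and --output options to read only the status field. The helper name wait_for_read_set, the default 30-second interval, and the return codes are assumptions made for illustration.

```shell
# wait_for_read_set is a hypothetical helper, not an AWS CLI command.
# Arguments: sequence store ID, read set ID, and an optional polling
# interval in seconds (default 30, an arbitrary choice).
# Returns 0 when the read set is ACTIVE, 1 for any status other than
# ACTIVE or PROCESSING_UPLOAD, and 2 if the CLI call itself fails.
wait_for_read_set() {
  store_id="$1"
  read_set_id="$2"
  interval="${3:-30}"
  while :; do
    status=$(aws omics get-read-set-metadata \
      --sequence-store-id "$store_id" \
      --id "$read_set_id" \
      --query 'status' \
      --output text) || return 2
    case "$status" in
      ACTIVE) return 0 ;;
      PROCESSING_UPLOAD) sleep "$interval" ;;
      *) echo "read set $read_set_id has status $status" >&2; return 1 ;;
    esac
  done
}
```

The helper has no timeout, so a caller that needs one would have to add a retry limit.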

If read set creation fails, the files can be moved to a fallback Amazon S3 location. This way, you can keep the files in Amazon S3 and re-import them after the file issues are resolved. The fallback location can be configured for a sequence store from the console, the AWS CLI, or the API.