How ETags are calculated - AWS HealthOmics

How ETags are calculated

The ETag is generated from a hash of the ingested file contents. The ETag algorithm family defaults to MD5up, but you can choose a different family when you create the sequence store. When the ETag is calculated, the algorithm and the calculated hashes are added to the read set. The following algorithms are supported for the MD5 family:

  • FASTQ_MD5up – Calculates the MD5 hash of an uncompressed, complete FASTQ read set source.

  • BAM_MD5up – Calculates the MD5 hash of the alignment section of an uncompressed BAM or uBAM read set source as represented in the SAM, based on the linked reference, if one is available.

  • CRAM_MD5up – Calculates the MD5 hash of the alignment section of an uncompressed CRAM read set source as represented in the SAM, based on the linked reference.
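As a rough illustration of the FASTQ case, the sketch below hashes the uncompressed bytes of a single FASTQ file, so a gzipped copy and a plain copy of the same file give the same digest. The function name `fastq_md5_uncompressed` is hypothetical, and the sketch assumes the hash covers one file's full uncompressed byte stream; the exact bytes HealthOmics hashes (for example, how paired sources are handled) are not specified here, so a locally computed value is not guaranteed to match a read set's ETag.

```python
import gzip
import hashlib


def fastq_md5_uncompressed(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 of a FASTQ file's uncompressed bytes.

    Hypothetical helper: it only illustrates "hash of the uncompressed
    content" and is not the service's implementation.
    """
    # Gzip streams start with the magic bytes 0x1f 0x8b.
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"

    opener = gzip.open if is_gzip else open
    digest = hashlib.md5()
    with opener(path, "rb") as stream:
        # Read in chunks so large FASTQ files never sit fully in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The BAM and CRAM variants are not sketched because they hash the alignment section as represented in the SAM, which requires converting the file with a SAM-aware tool first.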

Note

MD5 hashing is known to be vulnerable to collisions. As a result, two different files can have the same ETag if they were deliberately crafted to collide.
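If collision resistance matters, one option is to pick a SHA family when the sequence store is created. The sketch below assumes the AWS SDK for Python (boto3), and assumes that the `create_sequence_store` call accepts an `eTagAlgorithmFamily` parameter with the values `MD5up`, `SHA256up`, and `SHA512up`; verify both against the current API reference before relying on them. The helper names are hypothetical.

```python
# Assumed family values; confirm against the CreateSequenceStore API reference.
ALLOWED_FAMILIES = ("MD5up", "SHA256up", "SHA512up")


def sequence_store_request(name: str, etag_family: str = "SHA256up") -> dict:
    """Build the keyword arguments for creating a sequence store."""
    if etag_family not in ALLOWED_FAMILIES:
        raise ValueError(f"unsupported ETag algorithm family: {etag_family}")
    return {"name": name, "eTagAlgorithmFamily": etag_family}


def create_sequence_store(name: str, etag_family: str = "SHA256up") -> dict:
    """Create the store; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helper above works without the SDK

    client = boto3.client("omics")
    return client.create_sequence_store(
        **sequence_store_request(name, etag_family)
    )
```

Keeping the request construction separate from the API call lets the parameters be validated and inspected without contacting AWS.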

The following algorithms are supported for the SHA256 family:

  • FASTQ_SHA256up – Calculates the SHA-256 hash of an uncompressed, complete FASTQ read set source.

  • BAM_SHA256up – Calculates the SHA-256 hash of the alignment section of an uncompressed BAM or uBAM read set source as represented in the SAM, based on the linked reference, if one is available.

  • CRAM_SHA256up – Calculates the SHA-256 hash of the alignment section of an uncompressed CRAM read set source as represented in the SAM, based on the linked reference.

The following algorithms are supported for the SHA512 family:

  • FASTQ_SHA512up – Calculates the SHA-512 hash of an uncompressed, complete FASTQ read set source.

  • BAM_SHA512up – Calculates the SHA-512 hash of the alignment section of an uncompressed BAM or uBAM read set source as represented in the SAM, based on the linked reference, if one is available.

  • CRAM_SHA512up – Calculates the SHA-512 hash of the alignment section of an uncompressed CRAM read set source as represented in the SAM, based on the linked reference.
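The algorithm names listed in this topic follow a `<FILE TYPE>_<HASH>up` pattern. As a minimal sketch, the hypothetical helpers below map such a name to the matching Python `hashlib` digest and hash a block of bytes the caller has already uncompressed. For BAM and CRAM names, the sketch assumes the caller passes the SAM representation of the alignment section; it does not perform that conversion.

```python
import hashlib

# Hash suffixes used by the algorithm names in this topic, mapped to hashlib names.
_DIGESTS = {"MD5up": "md5", "SHA256up": "sha256", "SHA512up": "sha512"}
_FILE_TYPES = ("FASTQ", "BAM", "CRAM")


def digest_for_algorithm(algorithm: str):
    """Map a name such as 'FASTQ_SHA256up' to a fresh hashlib object."""
    file_type, _, family = algorithm.partition("_")
    if file_type not in _FILE_TYPES or family not in _DIGESTS:
        raise ValueError(f"unrecognized ETag algorithm: {algorithm}")
    return hashlib.new(_DIGESTS[family])


def hash_uncompressed(algorithm: str, data: bytes) -> str:
    """Hash bytes that the caller has already uncompressed."""
    digest = digest_for_algorithm(algorithm)
    digest.update(data)
    return digest.hexdigest()
```

The three families differ only in digest length: MD5 yields 32 hexadecimal characters, SHA-256 yields 64, and SHA-512 yields 128.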