Find reference fasta based on M5/MD5 string
3 months ago

I have downloaded CRAM files, but I don't know which exact version of hg38 was used to align the reads. Is it possible to find the corresponding FASTA from the M5 (MD5) strings in the header? For now, it seems I can only try candidate references and see which one works. The obvious solution is to ask the person who generated the files, but that does not always work out. Part of the header looks like:

@SQ     SN:chr1 LN:248956422    M5:6aef897c3d6ff0c78aff06ac189178dd     UR:/scratch/hg38.fa
@SQ     SN:chr2 LN:242193529    M5:f98db672eb0993dcfdabafe2a882905c     UR:/scratch/hg38.fa
@SQ     SN:chr3 LN:198295559    M5:76635a41ea913a405ded820447d067b0     UR:/scratch/hg38.fa
@SQ     SN:chr4 LN:190214555    M5:3210fecf1eb92d5489da4346b3fddc6e     UR:/scratch/hg38.fa
@SQ     SN:chr5 LN:181538259    M5:a811b3dc9fe66af729dc0dddf7fa4f13     UR:/scratch/hg38.fa
@SQ     SN:chr6 LN:170805979    M5:5691468a67c7e7a7b5f2a3a683792c29     UR:/scratch/hg38.fa
@SQ     SN:chr7 LN:159345973    M5:cc044cc2256a1141212660fb07b6171e     UR:/scratch/hg38.fa
@SQ     SN:chr8 LN:145138636    M5:c67955b5f7815a9a1edfaa15893d3616     UR:/scratch/hg38.fa
@SQ     SN:chr9 LN:138394717    M5:1b79085d423b806957b7564497cac5e4     UR:/scratch/hg38.fa
@SQ     SN:chr10        LN:133797422    M5:c0eeee7acfdaf31b770a509bdaa6e51a     UR:/scratch/hg38.fa
@SQ     SN:chr11        LN:135086622    M5:1511375dc2dd1b633af8cf439ae90cec     UR:/scratch/hg38.fa
@SQ     SN:chr12        LN:133275309    M5:96e414eace405d8c27a6d35ba19df56f     UR:/scratch/hg38.fa
@SQ     SN:chr13        LN:114364328    M5:787e7eb2d9187bbc20334062332569d4     UR:/scratch/hg38.fa
@SQ     SN:chr14        LN:107043718    M5:e0f0eecc3bcab6178c62b6211565c807     UR:/scratch/hg38.fa
@SQ     SN:chr15        LN:101991189    M5:f036bd11158407596ca6bf3581454706     UR:/scratch/hg38.fa
@SQ     SN:chr16        LN:90338345     M5:9adbaf8ef0094c71470e87eb18e9b5d4     UR:/scratch/hg38.fa
@SQ     SN:chr17        LN:83257441     M5:f9a0fb01553adb183568e3eb9d8626db     UR:/scratch/hg38.fa
@SQ     SN:chr18        LN:80373285     M5:11eeaa801f6b0e2e36a1138616b8ee9a     UR:/scratch/hg38.fa

Googling for a checksum value leads to ENA Browser pages that also list the MD5 sums for the relevant chromosomes, for example:

https://www.ebi.ac.uk/ena/browser/view/CM000664
https://www.ebi.ac.uk/ena/browser/view/CM000679


Great, that led me to GCA_000001405... However, it is not correct for all chromosomes. For example, chromosome 13 (https://www.ebi.ac.uk/ena/browser/view/CM000675.2) has an MD5 checksum of a5437debe2ef9c9ef8f3ea2874ae1d82 there, while my CRAM has 787e7eb2d9187bbc20334062332569d4 :-(

Someone on Twitter pointed me to the right one (https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1KG_ONT_VIENNA/reference/1KG_ONT_VIENNA_hg38.fa.gz).

Not sure if there could be a better way :)
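
One way to make the "try a candidate and see" step less manual is to compute the M5 value of every sequence in a candidate FASTA and compare it with the @SQ lines of the CRAM header. The following is only a minimal sketch: the function names (`parse_sq_m5`, `fasta_m5`, `compare`) are made up for illustration, and it assumes the M5 definition from the SAM specification (MD5 of the sequence after removing characters outside `!` to `~` and converting to uppercase).

```python
import hashlib

# Bytes outside the printable range '!'..'~' are dropped before hashing,
# following the M5 definition in the SAM specification.
_NON_PRINTABLE = bytes(c for c in range(256) if not 33 <= c <= 126)


def parse_sq_m5(header_text):
    """Map sequence name -> M5 value from the @SQ lines of a header."""
    m5_by_name = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Split on any whitespace so pasted headers (spaces, not tabs) parse too.
        fields = dict(f.split(":", 1) for f in line.split()[1:] if ":" in f)
        if "SN" in fields and "M5" in fields:
            m5_by_name[fields["SN"]] = fields["M5"].lower()
    return m5_by_name


def fasta_m5(handle):
    """Yield (name, md5) for each record of a FASTA opened in binary mode."""
    name, digest = None, None
    for line in handle:
        if line.startswith(b">"):
            if name is not None:
                yield name, digest.hexdigest()
            parts = line[1:].split()
            # Use the first word of the FASTA header as the sequence name.
            name = parts[0].decode() if parts else ""
            digest = hashlib.md5()
        elif name is not None:
            # Hash line by line so a whole chromosome is never held in memory.
            digest.update(line.translate(None, _NON_PRINTABLE).upper())
    if name is not None:
        yield name, digest.hexdigest()


def compare(header_text, fasta_handle):
    """Return {sequence name: True/False} for every @SQ line with an M5 tag."""
    expected = parse_sq_m5(header_text)
    observed = dict(fasta_m5(fasta_handle))
    return {name: observed.get(name) == m5 for name, m5 in expected.items()}
```

To use it, pass the header text (for example the output of `samtools view -H` on the CRAM) and a candidate FASTA opened with `open(path, "rb")` or `gzip.open(path, "rb")`. Any name that maps to `False` is either missing from the candidate or has a different sequence.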

3 months ago

Sometimes the sequences are hosted at the EBI. For example, your first sequence, with MD5 checksum 6aef897c3d6ff0c78aff06ac189178dd, is available (as a plain sequence string, not FASTA) at:

https://www.ebi.ac.uk/ena/cram/md5/6aef897c3d6ff0c78aff06ac189178dd

See REF_PATH and REF_CACHE in the samtools manual.
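
As a rough sketch of that lookup, the snippet below builds the URL for an M5 value using the pattern from the link above, downloads the sequence, and checks that it really hashes to the requested value. The helper names (`ena_url`, `matches_m5`, `fetch_sequence`) are hypothetical, and the check assumes the SAM specification's M5 definition (uppercase, characters outside `!` to `~` removed). Not every checksum is available there, so expect the request to fail for some sequences.

```python
import hashlib
import urllib.request

# URL pattern taken from the example link above.
ENA_MD5_URL = "https://www.ebi.ac.uk/ena/cram/md5/{md5}"

# Bytes outside '!'..'~' are ignored when computing an M5 value.
_NON_PRINTABLE = bytes(c for c in range(256) if not 33 <= c <= 126)


def ena_url(md5):
    """Build the lookup URL for one M5 value, rejecting malformed digests."""
    md5 = md5.strip().lower()
    if len(md5) != 32 or any(c not in "0123456789abcdef" for c in md5):
        raise ValueError(f"not an MD5 hex digest: {md5!r}")
    return ENA_MD5_URL.format(md5=md5)


def matches_m5(sequence, md5):
    """Return True if the raw sequence bytes hash to the given M5 value."""
    cleaned = sequence.translate(None, _NON_PRINTABLE).upper()
    return hashlib.md5(cleaned).hexdigest() == md5.strip().lower()


def fetch_sequence(md5, timeout=60):
    """Download one sequence by M5 value; urlopen raises if the request fails."""
    with urllib.request.urlopen(ena_url(md5), timeout=timeout) as response:
        sequence = response.read()
    if not matches_m5(sequence, md5):
        raise ValueError("downloaded sequence does not match the requested MD5")
    return sequence
```

Calling `fetch_sequence` for each M5 value in the header shows which chromosomes can be resolved this way and which cannot. Note that the whole response is read into memory, which is a few hundred megabytes for the largest human chromosomes.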


Aha, good start, but it doesn't work for every chromosome. It seems the FASTA is "special", then. Someone on Twitter pointed me to the right one (https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1KG_ONT_VIENNA/reference/1KG_ONT_VIENNA_hg38.fa.gz).
