Biostar Beta. Not for public use.
Meaning of BWA-MEM MAPQ values
4
Entering edit mode
2.8 years ago
Santa Clara, CA

Does anyone know what the MAPQ values produced by BWA-MEM mean?

I'm looking for something similar to what Keith Bradnam discovered for TopHat v1.4.1, where he realized that:

0 = maps to 5 or more locations

1 = maps to 3-4 locations

3 = maps to 2 locations

255 = unique mapping

I'm familiar with the notion that MAPQ should be theoretically be related to the probability of an "incorrect" alignment (10^(-MAPQ/10)), although this is vaguely-specified enough that aligners in practice tend to actually just use something like the above.

Note that this is different than the BWA MAPQ scoring interpretation, because BWA-MEM gives MAPQ scores in the range [0,60], rather than [0,37] as has been established for BWA.

ADD COMMENTlink
2
Entering edit mode

My experience is that MAPQ in bwa-aln has almost no predictive value, but it is pretty useful in bwa-mem and seems to correlate with the stated definition.

Note that statements such as "maps to 2 locations" have little meaning. Any read can map to (or originate from) any location in any genome. Even if you add "...equally well", that's still subjective. So, ultimately, MAPQ is just the aligner's confidence that the read originated from the stated location, and Tophat apparently does not attempt to implement the specification. Other aligners typically do. No per-aligner decoding table is needed; if the MAPQ is 3 or less, the aligner has roughly 50% or lower confidence that the mapping is correct, implying (but not requiring) that there is at least one other location with similarly high confidence.

ADD REPLYlink
0
Entering edit mode

Very good point regarding the meaninglessness of "maps to n locations."

it is pretty useful in bwa-mem and seems to correlate with the stated definition.

Interesting. Do you have any more evidence on this? If so, might be worth a blog post like the above.

ADD REPLYlink
0
Entering edit mode

I generated synthetic data and mapped it to create a ROC curve a while back, and my recollection was that it looked reasonable. You can do this yourself if you want for any aligner with BBMap's randomreads.sh and samtoroc.sh. It only works with single-ended reads though because of the sam specification's requirement for renaming reads.

ADD REPLYlink
0
Entering edit mode

Just quibbling, but the blog post you linked found for Tophat 1.4.1:

0 = maps to 5 or more locations
1 = maps to 3-4 locations
3 = maps to 2 locations
255 = unique mapping

And for Tophat2:

0 = maps to 10 or more locations
1 = maps to 4-9 locations
2 = maps to 3 locations
3 = maps to 2 locations
50 = unique mapping
ADD REPLYlink
0
Entering edit mode

...herein lies the utter confusion and madness of how various programs interpret MAPQ. The SAM specification clearly states that

a value 255 indicates that the mapping quality is not available

ADD REPLYlink
1
Entering edit mode

Well, for Bowtie, it isn't available :)

ADD REPLYlink
0
Entering edit mode

Maybe someone should make an attempt to find a common standard^^

ADD REPLYlink
0
Entering edit mode

I was think of changing it from a number to just the word "CORRECT" or "WRONG" which seems more straightforward.

ADD REPLYlink
0
Entering edit mode

but incompatible with every downstream software out there, right?

ADD REPLYlink
0
Entering edit mode

Ah, you got me there =)

ADD REPLYlink
2
Entering edit mode
15 months ago
John 12k
Germany

MAPQ scores are not meaningful because BAM is not meaningful - or rather, the field has yet to define the difference between read-alignments (what BAM officially stores) and fragment-alignments (what most aligners produce).

The issue goes far deeper than MAPQ scores, but if you want to read about MAPQ this is a good resource: https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/

But attempting to create a common standard is impossible. People want different things out of the MAPQ score, and so there can not be a single standard. Some people want it to represent how well the alignment aligns, others how well the fragment in it's entirety aligns, others a combination of both, sometimes with and sometimes without taking into account how many other possible alignments there might be for the read/fragment. It's a mess, and it all stems from the fact that pretty much every aligner produces non-standard BAMs except BWA, which produces BAMs no one would use if they needed to store more than one alignment per sequencing event.

ADD COMMENTlink
0
Entering edit mode
11 months ago
Republic of Ireland

From what I can see, the SAM specification document (http://samtools.github.io/hts-specs/SAMv1.pdf) states that the mapping quality can be anything from 0 to (2^8)-1, i.e., 0-255. See the table in section 1.4 of the document.

Regarding the values used by TopHat, it's regrettable that it drifts from the standard specification for MAPQ, but things like this happen in bioinformatics.

Kevin

ADD COMMENTlink
0
Entering edit mode

Yes, I realize that anything from 0 to 255 is acceptable according to the spec, but the meaning (in terms of number of mapping locations, likely) of the very small subset of that range that's actually used in practice by BWA-MEM is what I'm interested in.

ADD REPLYlink
0
Entering edit mode

As far as I know, BWA mem assignes 0-60 and 255, with 0 = equally well mapping to > 1 location, 255 = MAPQ not available and 1-60 the transformed probability as you stated. Every aligner uses different MAPQ scoring. Google around a bit, there a plenty of blogs about the MAPQ mess with the common alignment tools.

ADD REPLYlink
1
Entering edit mode

Note that BWA-MEM is not the same as BWA algorithmically and thus in terms of MAPQs produced, and lack of finding anything on Google® for BWA-MEM specifically led to this question.

ADD REPLYlink
0
Entering edit mode

Yes BWA mem doesn't go above 60, a value of which, if interpreted literally and untransformed, means that the read has been aligned with a 1-in-1 million error rate. Over the genomic scale, I guess that this is still not sufficient.

Lately I have been choosing only uniquely mapping reads, and dealing with the lost coverage. However, validation work we did in clinical genetics shows that variants are best called at 30 read depth (not 100, not 1000, etc)., with a minimum 'comfortable' cutoff of 18.

Kevin

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1