What's the precise definition for sensitivity and specificity for alignment?
8.2 years ago
scchess ▴ 640

When we talk about the sensitivity and specificity for NGS read alignments, what do we really mean?

For example, in the BWA paper, it talks about sensitivity. How would we define the true-positives and false-negatives? My guess (relative to a known genome):

TP: Number of reads that are aligned exactly and correctly (no gaps, no mismatches)

FN: Number of reads that fail to map but should have mapped (they come from the known genome)

Is my definition correct? Is this what we mean when we say alignment sensitivity? What about specificity? Can we define specificity for alignment (not mentioned in the BWA paper)?

In other words, my question is about what we really mean when we talk about sensitivity and specificity in alignments.

genome sequence bwa alignment • 5.0k views

Forget about the BWA paper. It is not quite right. Sorry for the confusion.

8.2 years ago
lelle ▴ 830

This is indeed a tricky question, because even the definition of "aligned exactly and correctly" is not that easy.

If we simulate reads, then we know where they came from and how they were created, but what if (by chance), after the random errors are introduced, the read mathematically maps better somewhere else? What if there are multiple mathematically best hits? This paper has some further thoughts.


Thanks. But I'm still unsure how the author in the paper calculates the alignment sensitivity. There is no Methods section for a precise definition.

8.2 years ago

The Wikipedia article on this is surprisingly good. The biggest issue with your definition is that TP alignments can contain gaps and mismatches, since reads are often simulated to contain them. Sensitivity and specificity are calculated with in silico generated datasets, so errors/variants are added in to see how an aligner's output is affected. Consequently, you tend to get a breakdown of the numbers by MAPQ (at which point sensitivity and specificity aren't terribly useful terms).
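A breakdown by MAPQ of the kind mentioned above could be tabulated roughly as follows. This is only a sketch: it assumes each alignment has already been reduced to a `(mapq, is_correct)` pair, and the function name, the input shape, and the threshold values are illustrative choices, not taken from any particular aligner paper.

```python
def mapq_breakdown(results, thresholds=(0, 10, 20, 30)):
    """For each MAPQ threshold, report (threshold, reads kept, reads wrong).

    `results` is an iterable of (mapq, is_correct) pairs; this input
    shape is an assumption made for the example.
    """
    results = list(results)
    rows = []
    for t in thresholds:
        # Keep only alignments the aligner reported with MAPQ >= t.
        kept = [ok for mapq, ok in results if mapq >= t]
        wrong = sum(1 for ok in kept if not ok)
        rows.append((t, len(kept), wrong))
    return rows
```

Raising the threshold usually keeps fewer reads but also fewer wrong ones, which is why a table like this tends to say more than a single sensitivity or specificity figure.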


Thanks. Do you have any reference for how sensitivity and specificity are calculated for an in silico dataset? I'm looking for a precise calculation: for example, what does a TP mean? Thanks.


There's not much to calculate; you typically want to ensure that a read overlaps the region its original sequence was drawn from. When you generate in silico data, you put the true mapping coordinates in the read name. After mapping, you check whether the alignment overlaps them. Ideally the coordinates would match exactly, but since you typically add variants and indels to the reads, it's not terribly useful to be so strict. Have a look through the source code of wgsim, Sherman, or the Rabema tool that lelle posted (I hadn't heard of that one before; it looks interesting), or one of the hundred other simulators out there. Almost all of them come with a function to check whether an alignment is correct.
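The overlap check described above can be sketched in a few lines of Python. This is a minimal illustration, not code from wgsim, Sherman, or Rabema: the read-name layout (`<chrom>_<start>_<end>_<id>`), the `Alignment` tuple, and all function names are assumptions made for the example. Counting every read that is unmapped or placed elsewhere as a miss is likewise one possible convention, valid only if every simulated read really comes from the reference.

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    read_name: str         # assumed layout: "<chrom>_<start>_<end>_<id>"
    chrom: Optional[str]   # None if the aligner left the read unmapped
    start: int             # 0-based, inclusive
    end: int               # 0-based, exclusive


def true_origin(read_name):
    """Recover the simulated origin from the (hypothetical) read name."""
    # Split from the right so chromosome names containing "_" survive.
    chrom, start, end, _ = read_name.rsplit("_", 3)
    return chrom, int(start), int(end)


def is_correct(aln):
    """True if the alignment overlaps the interval the read was drawn from."""
    if aln.chrom is None:
        return False
    chrom, start, end = true_origin(aln.read_name)
    return aln.chrom == chrom and aln.start < end and start < aln.end


def sensitivity(alignments):
    """Fraction of simulated reads placed correctly: TP / (TP + FN)."""
    alignments = list(alignments)
    if not alignments:
        return 0.0
    tp = sum(1 for a in alignments if is_correct(a))
    return tp / len(alignments)
```

A stricter variant would require the start coordinate to match within a few bases instead of accepting any overlap; which tolerance is appropriate depends on how many indels the simulator introduces.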
