Question

What Does Bwa'S '-N' Parameter Mean?

7

Entering edit mode

12.3 years ago

Panos ★ 1.8k

Although this parameter appears to be BWA's 'main' cutoff (like e-value in BLAST -- correct me if I'm wrong), I can't understand its meaning... Executing 'bwa aln' it gives a short description that says:

-n NUM    max #diff (int) or missing prob under 0.02 err rate (float) [0.04]

What I understand is that its default value is 0.04 but not a lot more than this!

Is the cutoff some kind of an e-value? Or is it based on the percentage of identities of the match?

bwa • 9.3k views

ADD COMMENT • link updated 8 months ago by Feng Tian ▴ 20 • written 12.3 years ago by Panos ★ 1.8k

0

Entering edit mode

Did you check the docs here?

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.3 years ago by brentp 24k

0

Entering edit mode

Yes, but I still don't understand the meaning of the '-n' cutoff! In BLAST, for example, an e-value of 1 means assigned to a hit means that you can get 1 hit with the same score just by chance. What does 0.04 in the -n cutoff mean? What exactly are the "missing alignments"?

ADD REPLY • link 12.3 years ago by Panos ★ 1.8k

0

Entering edit mode

I agree, the documentation is lacking here.

ADD REPLY • link 12.3 years ago by Michael 54k

score 9 · Answer 1 · 2012-01-14

9

Entering edit mode

12.3 years ago

Michael 54k

I think I got it, it is a guess still, please correct me: to achieve lambda, the base error rate has to be multiplied by read length. Then when random errors are modeled as Poisson distributed, e.g. given reads of length 100, this will yield an average error rate lambda of 2 per read. (For the Poisson distribution, lambda denotes the rate of discrete events occurring in an interval or unit volume of space, they are not divided by space size because the space volume is present on 'both sides of the equation' it cancels out).

Given the poisson-distribution and readlength=100, what is the probability of observing 2 or more mismatches (n.mismatches) by chance/ for reasons of sequencing error? This can be generally calculated like this (in R):

p = ppois(n.mismatches,lambda=0.02*readlength, lower.tail=FALSE)

ppois(2,lambda=0.02*100, lower.tail=FALSE)
[1] 0.3233236

This is acceptable because the probability to see this or a more extreme outcome is > 0.04. Now, for other values:

ppois(4,lambda=0.02*100, lower=F)
[1] 0.05265302 > 0.04

> ppois(5,lambda=2, lower=F)
[1] 0.01656361 < 0.04

Thus, 0 to 4 mismatches are ok in one read of length 100, but 5 or more are not. While for a read of length 200, that would be only 7 mismatches, because ppois(8,0.02*200) = 0.02136343 < 0.04.

ADD COMMENT • link 12.2 years ago by Michael 54k

2

Entering edit mode

I just google-plus-asked Heng Li to look at this, and you appear to be on the right track with this answer...

ADD REPLY • link 12.2 years ago by Madelaine Gogol 5.3k

1

Entering edit mode

According to the source code of bwa aln (https://github.com/lh3/bwa/blob/master/bwtaln.c#L42-L54), the max edit number is:
The minimum integer x which makes ppois(x,lambda=0.02*read_length, lower=F) < 0.04.
Thus, for the case you showed above (the read length of 100), the max edit number will be 5 (although the p-value < 0.04).

ADD REPLY • link 8 months ago by Feng Tian ▴ 20

0

Entering edit mode

Thanks to my colleague Pawel btw for the hint to the Poisson distribution.

ADD REPLY • link 12.3 years ago by Michael 54k

0

Entering edit mode

Thanks for accepting this, but as long as we look it up in the code or an author of BWA confirms, even I cannot be totally sure. Though, I believe this is the only reasonable method fitting the sentences used in the documentation.

ADD REPLY • link 12.3 years ago by Michael 54k

0

Entering edit mode

That's nice, Madelaine!

ADD REPLY • link 12.2 years ago by Michael 54k

score 4 · Answer 2 · 2012-01-13

4

Entering edit mode

12.3 years ago

Istvan Albert 100k

It is not an E-value - that concept does not apply to a short read aligner - it is a maximum edit distance that is allowed for an alignment to be reported:

... and maximum maxDiff differences are allowed in the whole sequence ...

The integer interpretation is straightforward (seen above)

I would agree that float interpretation is not entirely obvious - but it seems to be a way to specify a variable edit distance in case that the read lengths vary and a fixed distance is inappropriate. As a net effect it generates a different edit distance depending on the read length.

ADD COMMENT • link 12.3 years ago by Istvan Albert 100k

3

Entering edit mode

Actually, it is more complicated than I thought (the float interpretation in fact). Someone told me that they model the error as Poisson-distribution with uniform base error rate lambda=0.02. Why don't we ask the authors? They should definitely document this better.

ADD REPLY • link 12.3 years ago by Michael 54k

1

Entering edit mode

So if I have 100 bp reads, a 0.04 cutoff could mean that I can have at most 4 mismatches. Right?

ADD REPLY • link 12.3 years ago by Panos ★ 1.8k

1

Entering edit mode

sorry, I thought I understood it once, I tried to calculate, but the numeric values don't make sense. I thought the 0.04 is like a p-value cutoff (alpha value) to see this number of mismatches under the assumption that the base error is poisson distributed with lambda=0.02, meaning on average there are 2 in 100 mismatches.

ADD REPLY • link 12.3 years ago by Michael 54k

0

Entering edit mode

An E-value damn well applies tgo a short read aligner, it's just that so far nobody bothered to calculate it.

ADD REPLY • link 12.3 years ago by Marvin ▴ 890

0

Entering edit mode

What I meant that the use of E-value as we use it with Blast is not useful during the typical short read mapping since most tools will not generate short partial alignments. All matches would have E-Values far above that of random chance and the actual errors are dominated by various sequencing artifacts - ranking them by the E-values would actually be a bad idea. I suspect that this is why it is not computed in general.

ADD REPLY • link 12.3 years ago by Istvan Albert 100k

score 1 · Answer 3 · 2012-01-15

Bwa operates under the assumption that every read should actually align. It will, however, not look for very bad alignments. That is a tradeoff between computational effort and the number of missed alignments (the false negative rate). The float parameter to -n is simply the false negative rate you're willing to accept.

When given a false negative rate, bwa then selects appropriate limits on substitutions. To do so, it needs to know how likely a substitution is, and it simply assumes 2%. That is what ends up being described as "the fraction of alignments lost assuming an error rate of 0.02".