Question

Keeping Tags That Are Mapped To The Same Place By Two Aligners

1

Entering edit mode

11.3 years ago

KCC ★ 4.1k

I have been thinking for a while about the advantages of doing something like aligning with bowtie2 and bwa and then keeping only the tags while are aligned to the same positions by both programs.

One reservation I have about this idea is I don't think I've ever seen it anywhere and since it's a somewhat simple thing to do, I was wondering if anybody has tried it and what are the pluses and minuses to an approach like this?

alignment bwa bowtie • 2.2k views

ADD COMMENT • link updated 6.9 years ago by Biostar 20 • written 11.3 years ago by KCC ★ 4.1k

score 2 · Answer 1 · 2012-12-16

2

Entering edit mode

11.3 years ago

Gabriel R. ★ 2.9k

My guess is that you would downsample your set towards reads with low divergence to the reference. Those reads are probably those will little edit distance. Try the following:

1) Keep 3 sets a) BWA intersect with Bowtie b) BWA minus Bowtie c) Bowtie minus BWA 2) Measure the average edit distance for each (using the NM field).

Post the results for fun :-)

ADD COMMENT • link 11.3 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

Good suggestion, I'd be interested in seeing the results as well

ADD REPLY • link 11.3 years ago by Mikael Huss 4.8k

score 0 · Answer 2 · 2017-06-04

That is a thing, although not to the read-level. I've seen it used in bisulphite mapping where Bismark, PASH, and one other (but i forget the name) where used to map reads and the combined methylation calls were used.

The problem is that it doesn't always statistically make sense. It depends on the power/sensitivity of the tools, which is often unknown, and the size of the data. To help explain this a bit better i wrote you a little tool - although it might have a bug. i dont know. it's a sunday and i seem to write the most bugs on sundays.

gist.github.com/JohnLonginotto/bd1a846544b31ebe6f96190c1cd84a83

And you provide it arguments like:

python ./should_i_intersect_it.py --true_positives 100 --true_negatives 60000 --true_positives_A 34 --false_positives_A 10 --true_positives_B 40 --false_positives_B 13

And then it will tell you True Positive, True Negative, False Positive, and False Negative for each tool, and the intersect/union:

                 Tool A                                   Tool B                                 Intersect                                  Union
       TP      TN      FP      FN               TP      TN      FP      FN               TP      TN      FP      FN               TP      TN      FP      FN
      34.0    59990.  10.0    66.0             40.0    59987.  13.0    60.0             13.62   59977.  0.0     39.62            60.38   60000.  23.0    86.38
      34.0    59990.  10.0    66.0             40.0    59987.  13.0    60.0             13.16   59977.  0.0     39.16            60.84   60000.  23.0    86.84
      34.0    59990.  10.0    66.0             40.0    59987.  13.0    60.0             13.74   59977.  0.0     39.74            60.26   60000.  23.0    86.26
      34.0    59990.  10.0    66.0             40.0    59987.  13.0    60.0             13.1    59977.  0.0     39.1             60.9    60000.  23.0    86.9
      34.0    59990.  10.0    66.0             40.0    59987.  13.0    60.0             13.82   59977.  0.02    39.82            60.18   59999.  22.98   86.18

The above example values i chose because that sort-of reflects RNA-seq differentially expressed genes. Where you have 100 truly DE genes, 60000 genes that don't actually change, and then two tools, A and B, where A gets 34 right but 10 wrong, and B gets 40 right but 13 wrong.

The result is that neither the intersect nor the union look particularly appealing. It depends on what you're after really. There are values where an intersect or union look very appealing, but meh. Play with the values to figure it out :)

Also run it in pypy because i didn't optimise anything and it's real slow.

EDIT:

GAH Biostars bot you **er, i just burnt 2-3 hours writing a stupid tool for a 4.5 year old question D':