Question

BWA-MEM output for two species

6.6 years ago

maheetha.b ▴ 70

Hello.

I have a fastq file that contains targeted panel DNA from two species.

The current methodology is to map it to a combined reference genome.

I know that BWA-MEM gives a mapping quality of 0 for reads that have been mapped to multiple places, and if it's 2 locations, the MAPQ score is 3.

But I'm wondering whether this means that the read maps EQUALLY well to all of those locations. I believe it does, and that in this case BWA-MEM picks one location at random to report. However, if a read maps better to one location than another, I understand that BWA-MEM only reports the best-scoring location. I read somewhere that BWA-MEM has an option to list all locations and just wanted to confirm that.
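For reference, BWA-MEM's `-a` option outputs all found alignments for single-end or unpaired paired-end reads (flagged as secondary), and even without it the `AS` (best alignment score), `XS` (suboptimal alignment score) and `XA` (alternative hits, written as `chr,pos,CIGAR,NM;`) tags on each SAM record give a hint about ties. A minimal sketch of reading those tags from one SAM line follows; the function names and the example record are made up for illustration, and treating `AS == XS` as "tied" is a simplifying assumption rather than BWA's own definition.

```python
def parse_sam_tags(sam_line):
    """Split one SAM record into its fields and a dict of optional tags."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for item in fields[11:]:
        name, typ, value = item.split(":", 2)
        # Integer tags (type 'i') are converted; everything else stays a string.
        tags[name] = int(value) if typ == "i" else value
    return fields, tags


def alternative_hits(tags):
    """Return the XA alternative hits as (contig, strand, pos, cigar, nm) tuples."""
    hits = []
    for entry in tags.get("XA", "").split(";"):
        if not entry:
            continue
        # rsplit keeps a contig name intact even if it contains a comma.
        contig, pos, cigar, nm = entry.rsplit(",", 3)
        hits.append((contig, pos[0], int(pos[1:]), cigar, int(nm)))
    return hits


def is_tie(tags):
    """Assumption: a read is 'tied' when the second-best score equals the best."""
    return "XS" in tags and tags.get("AS") == tags["XS"]


# Hypothetical record: primary hit on a human contig, one alternative on mouse.
example = ("r1\t0\thuman_chr1\t100\t0\t50M\t*\t0\t0\tACGT\tIIII\t"
           "NM:i:0\tAS:i:50\tXS:i:50\tXA:Z:mouse_chr4,-2000,50M,0;")
fields, tags = parse_sam_tags(example)
print(fields[2], fields[4], is_tie(tags), alternative_hits(tags))
```

Note that `XA` is only written when the number of alternative hits is small, so a missing `XA` tag does not by itself mean the read is unique.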

The reason I'm asking is that I want to be able to identify reads that map to one species rather than the other. Reads that come from a sequence that is very similar in both species will be ambiguous. From my first few read-throughs, the recent tool "Disambiguate" (Ahdesmäki et al.) seems to align to each species separately (as opposed to mapping to a single combined reference), then collects all ambiguous reads that have equal mapping quality in both species and removes them from the reads assigned to either species altogether. I fear that this might lead to a severe underestimation of coverage at a particular location. However, if we were to go with my method of simply filtering on mapping quality and location, we might risk under- or over-counting because of the randomly chosen locations for those equally mapped (ambiguous) reads. The chance that the gene sequence is going to be exactly IDENTICAL between two different species is slim, given that the average insert size is longer than 50 bp, but I wanted to know what people would do in such situations. In short, is there really no way to know where an ambiguously mapped read comes from? How do people deal with this?
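To make the combined-reference idea concrete, here is a rough sketch of the filtering rule described above. It assumes (hypothetically) that contigs in the combined reference were renamed with a species prefix such as `human_chr1` or `mouse_chr4`, and that the `AS`/`XS` scores and the contigs listed in the `XA` tag have already been extracted for each read. The labels "ambiguous" and "unknown" are illustrative, and because `XA` does not report per-hit scores, the rule is deliberately conservative: any tie plus any hit in the other species counts as ambiguous.

```python
def species_of(contig):
    """Assumption: contig names look like '<species>_<name>', e.g. 'human_chr1'."""
    return contig.split("_", 1)[0]


def assign_read(primary_contig, as_score, xs_score, alt_contigs):
    """Assign a read to a species, or flag it when the evidence is tied.

    primary_contig: contig of the reported (primary) alignment
    as_score, xs_score: AS and XS tag values (xs_score may be None)
    alt_contigs: contig names taken from the XA tag (may be empty)
    """
    primary = species_of(primary_contig)
    tie = xs_score is not None and xs_score >= as_score
    if not tie:
        # The best hit is strictly better than any other: keep its species.
        return primary
    if not alt_contigs:
        # Tied, but the alternatives were not listed, so we cannot tell where.
        return "unknown"
    other_species = {species_of(c) for c in alt_contigs} - {primary}
    # Conservative: a tie plus any hit in the other species is ambiguous.
    return "ambiguous" if other_species else primary


print(assign_read("human_chr1", 50, 20, []))
print(assign_read("human_chr1", 50, 50, ["mouse_chr4"]))
print(assign_read("mouse_chr2", 50, 50, ["mouse_chr7"]))
```

Counting the "ambiguous" and "unknown" reads separately, rather than discarding them or assigning them at random, at least shows how much coverage at a given locus depends on reads that cannot be placed.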

bwamem Disambiguate • 2.6k views

I'm working with mouse and human. I know many genes are similar (about 80% with a 1-to-1 correspondence), but I'm not sure how much of that is in coding regions. I'll take a look! Thank you!

But again, I feel like there is no way to differentiate between reads that map equally well to both species (which is highly unlikely, but still possible).

6.6 years ago by maheetha.b ▴ 70

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

6.6 years ago by GenoMax 141k

score 2 · Answer 1 · 2017-09-08

If you have sequence for one (or both) genomes of interest then use BBSplit from BBMap Suite to bin your reads. Take a look at this thread for information. You can choose how you handle reads that multi-map across genomes.

That said, if the species are very similar then it was a bad strategy to mix the two samples in one pool. If they are not then BBSplit is your friend.