Question

how do aligners determine which reads to apply duplicate reads flag (1024)?

0

Entering edit mode

7.5 years ago

epigene ▴ 590

I was trying to understand how aligners, specifically bwa, determine which reads to apply the 1024 flag in the sam output.

I notice I have cases where two identical reads (same read seq and same quality scores) are mapped to different locations on yeast chr16 as shown below:

Mxxx:1:2103:26709:7555     0       chr16   560671  0       90M     *       0       0       ATTAGTATGTAGAAATATAGATTCCATTTTGAGGATTCCTATATCCTCGAGGAGAACTTCTAGTATATTCTGTATACCTAATATTATAGC  CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG      MD:Z:90 PG:Z:MarkDuplicates     NM:i:0  AS:i:90     XS:i:90

Mxxx:1:2108:18700:2139     0       chr16   844608  0       90M     *       0       0       ATTAGTATGTAGAAATATAGATTCCATTTTGAGGATTCCTATATCCTCGAGGAGAACTTCTAGTATATTCTGTATACCTAATATTATAGC  CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG      MD:Z:90 PG:Z:MarkDuplicates     NM:i:0  AS:i:90     XS:i:90

Why are they not considered duplicates? And how are duplicates determined?

bwa sam • 2.3k views

ADD COMMENT • link updated 7.5 years ago by dariober 14k • written 7.5 years ago by epigene ▴ 590

4

Entering edit mode

What makes you think these two reads have the same read name?

One is Mxxx:1:2103:26709:7555, and one is Mxxx:1:2108:18700:2139.

Duplicate calling is usually not done by the mapper like BWA, but rather done later by a duplicate-marking tool, in your case it looks like Picard's MarkDuplicates due to PG:Z:MarkDuplicates. This tool uses not the DNA sequence but the read's mapped position to decide if it's duplicate or not, and these reads despite having the same DNA have different mapped positions.

This is likely because the reads have a MAPQ of 0, which for BWA specifically means that the read maps in multiple locations. In other words, this could very well be a PCR duplicate, however, since it also maps to multiple places on the genome and BWA randomly assigned those reads to different places, they weren't called duplicates by Picard.

I agree this is a really dumb problem to be having but this is how it goes.

ADD REPLY • link 7.5 years ago by John 13k

0

Entering edit mode

I didn't say they have the same read name..and they don't. I meant for the seq and quality.

Yeah, after BWA, the bam file was feed to Picard's MarkDuplicates tool. Thanks for the explanations.

ADD REPLY • link 7.5 years ago by epigene ▴ 590

1

Entering edit mode

Right, sorry, i'm just trying to help clarify some terminology as often different uses of certain words leads to a lot of confusion down the line. For example, you didn't say "identical read names" you said "identical reads", however a "read" is not defined by the sequenced DNA (which was implied), otherwise PCR duplicates or legitimate biological duplicates would all have the same read name. It is defined by where on the flowcell the read came from, (or more generally, an individual "sequencing event" in the sequencer). Thus, when you said "two identical reads" I was expecting to see two entries from the same sequencing event -- which happens when a mapper like BWA also outputs secondary alignments.

Maybe you already knew all that, but when you spend a lot of time on a support forum trying to help people, it becomes a habit to spell everything out just to make sure everyone is on the same page. I really didn't intend to frustrate you or anything like that :)

ADD REPLY • link 7.5 years ago by John 13k

0

Entering edit mode

I understand.. I should have been more careful with the right terms as there are important distinctions.

ADD REPLY • link 7.5 years ago by epigene ▴ 590

score 0 · Answer 1 · 2016-10-11

0

Entering edit mode

7.5 years ago

dariober 14k

So your question is effectively about picard MarkDuplicates and how it works. Have seen this FAQ page http://broadinstitute.github.io/picard/faq.html?

ADD COMMENT • link 7.5 years ago by dariober 14k