Why Do We Need MarkDuplicates for Variant Detection in the GATK Processing Pipeline?
12.1 years ago
Lds ▴ 450

Hi fellows,

It's said that MarkDuplicates in Picard matches all read pairs that have identical 5' coordinates and orientations, and marks all but the 'best' pair as duplicates. Suppose I have three such pairs, one of which is the 'best' pair, and all three truly come from the target genome rather than from sequencing artifacts. If I set REMOVE_DUPLICATES=True, the two non-best pairs will be deleted, which decreases the coverage for that region. This doesn't make sense to me, so maybe I have misunderstood the purpose of MarkDuplicates. So my question is: what is the purpose of MarkDuplicates, and why does it delete the duplicates?
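
To make the rule described above concrete, here is a minimal Python sketch of it. It is an illustration only, not Picard's implementation: the `ReadPair` record, the `mark_duplicates` function, and the choice of "best" as the pair with the highest summed base quality are all assumptions made for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical, simplified record for one aligned read pair."""
    name: str
    chrom: str
    start1: int    # 5' coordinate of the first read
    strand1: str
    start2: int    # 5' coordinate of the mate
    strand2: str
    qual_sum: int  # summed base qualities (assumed scoring for "best")


def mark_duplicates(pairs):
    """Return the names of pairs that would be flagged as duplicates."""
    groups = defaultdict(list)
    for p in pairs:
        # Pairs with identical 5' coordinates and orientations share a key.
        key = (p.chrom, p.start1, p.strand1, p.start2, p.strand2)
        groups[key].append(p)

    duplicates = set()
    for group in groups.values():
        # Keep the highest-scoring pair; flag every other pair in the group.
        best = max(group, key=lambda p: p.qual_sum)
        duplicates.update(p.name for p in group if p is not best)
    return duplicates


pairs = [
    ReadPair("a", "chr1", 100, "+", 350, "-", 2100),
    ReadPair("b", "chr1", 100, "+", 350, "-", 2400),
    ReadPair("c", "chr1", 100, "+", 350, "-", 1900),
    ReadPair("d", "chr1", 120, "+", 360, "-", 2000),
]
dups = mark_duplicates(pairs)
print(sorted(dups))
```

In this toy input, pairs a, b and c share the same coordinates and orientations, so only b (the highest score) is kept unflagged, while d stands alone and is untouched.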

Thanks in advance

gatk markduplicates picard • 23k views

Lots of previous information in these threads: http://biostar.stackexchange.com/search?q=duplicates

12.1 years ago

Almost all statistical models for variant calling assume some sort of independence between measurements. The duplicates (if one assumes that they arise from PCR artifact) are not independent. This lack of independence will usually lead to a breakdown of the statistical model and measures of statistical significance that are incorrect.
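
A toy calculation may help show why this matters. The sketch below is not GATK's model: it assumes a simple per-read likelihood with a fixed error rate, and the read counts, the error rate of 0.01, and the function name `log10_het_vs_homref` are all made up for illustration. It compares a heterozygous genotype against homozygous reference at a site where a single fragment carrying an error was amplified into five copies.

```python
import math


def log10_het_vs_homref(n_ref, n_alt, error_rate=0.01):
    """log10 likelihood ratio of het vs hom-ref, treating reads as independent.

    Toy model (an assumption for illustration):
      hom-ref: a read shows ref with prob 1 - e, alt with prob e
      het:     a read shows ref or alt with prob 0.5 each
    """
    log_homref = (n_ref * math.log10(1 - error_rate)
                  + n_alt * math.log10(error_rate))
    log_het = (n_ref + n_alt) * math.log10(0.5)
    return log_het - log_homref


# 10 reference reads plus one erroneous fragment amplified into 5 copies.
with_duplicates = log10_het_vs_homref(n_ref=10, n_alt=5)

# After duplicate removal the 5 copies collapse to a single observation.
without_duplicates = log10_het_vs_homref(n_ref=10, n_alt=1)

print(f"with duplicates:    {with_duplicates:+.2f}")     # positive: het favoured
print(f"without duplicates: {without_duplicates:+.2f}")  # negative: hom-ref favoured
```

Under these assumptions, counting the five copies as independent evidence makes the heterozygous call look strongly supported, while counting the fragment once leaves hom-ref as the more likely genotype.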

There are experiments where one should not make the assumption that reads that have the same start positions are PCR duplicates. In that case, using MarkDuplicates is not justified.


Thanks so much. This is the discussion in seqanswers: http://seqanswers.com/forums/showthread.php?t=6854

I think that we should use MarkDuplicates in SNP calling.


Yes, you should.

12.1 years ago

MarkDuplicates is important for removing PCR duplicates -- which can introduce bias in your variant calling. If you did not mark duplicates, you would risk over-representing, in your sequencing data, the regions that were preferentially amplified during PCR. One way to think about it is that marking duplicates and removing them does not really have a detrimental effect on your overall depth of coverage -- but it increases the quality/reliability of the areas you have covered.

There is a good discussion covered here.

And also further discussion on the Picard Main Page.
