Removing PCR duplicates - fastq or BAM?
6.8 years ago
rbronste ▴ 420

I'm wondering about the pros and cons of removing duplicates from the raw FASTQ files versus from the raw BAM alignment. Thanks.

alignment BAM fastq duplicates • 8.7k views

It's less work if you dedupe up front. clumpify.sh from BBMap (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.") does this without needing alignments.
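
To make "deduping up front" concrete, here is a minimal Python sketch of alignment-free duplicate removal that keeps the first read seen for each distinct sequence. It is only an illustration of the idea, not how Clumpify is implemented; the function names and the toy records are made up for the example, and it only catches exact matches on single-end reads.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def dedupe_exact(records: Iterable[Record]) -> List[Record]:
    """Keep the first record for each distinct sequence (exact matches only)."""
    seen = set()
    kept = []
    for header, seq, qual in records:
        if seq not in seen:
            seen.add(seq)
            kept.append((header, seq, qual))
    return kept


# Hypothetical toy input: r2 is an exact duplicate of r1, r3 differs by one base.
example = [
    "@r1 1:N:0", "ACGTACGT", "+", "IIIIIIII",
    "@r2 1:N:0", "ACGTACGT", "+", "IIIIIIII",
    "@r3 1:N:0", "ACGTACGA", "+", "IIIIIIII",
]
unique = dedupe_exact(parse_fastq(example))
print([header for header, _, _ in unique])  # ['@r1 1:N:0', '@r3 1:N:0']
```

Note that r3 survives because exact matching treats a single sequencing error as a different read.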


Though personally I greatly prefer Clumpify for duplicate removal, mapping-based approaches can be more robust to reads with lots of errors (if you consider those duplicates). But in addition to the increased time, mapping-based deduplication also has the disadvantage of a lossy conversion to SAM/BAM format, which typically chops off part of the original header (everything after the first whitespace).
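
A small Python sketch of the header loss described above, assuming the aligner keeps only the text before the first whitespace as the read name. The read header and the helper name `sam_qname` are hypothetical.

```python
def sam_qname(fastq_header: str) -> str:
    """Return the part of a FASTQ header that typically survives as the SAM read name."""
    name = fastq_header[1:] if fastq_header.startswith("@") else fastq_header
    return name.split(None, 1)[0]  # everything before the first whitespace


# Hypothetical Illumina-style header; the comment field after the space is dropped.
header = "@M01234:55:000000000-ABCDE:1:1101:15589:1332 1:N:0:ATCACG"
print(sam_qname(header))  # M01234:55:000000000-ABCDE:1:1101:15589:1332
```

In this example the read number, filter flag and index sequence ("1:N:0:ATCACG") would not be recoverable from the BAM unless the aligner is told to carry them over.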

I think some mapping-based deduplication tools may not be robust to read pairs that map to different chromosomes, or when only one read is mapped, and certainly not when neither read is mapped. I wrote a mapping-based deduplication program that handles duplicates in the first two scenarios, but as a result it uses a lot of memory. My recollection was that one of samtools or GATK handled duplicates of pairs mapped to different chromosomes, and the other didn't. And as for unmapped reads - some aligners will not map reads that have a lot of adapter sequence, even if they came from the correct genome, so those short-insert reads would not be deduplicated based on mapping the raw reads.

Multi-mapping reads can also pose a problem for mapping-based deduplication methods, depending on how the aligner handles ambiguity (e.g. non-deterministic assignment is common), as can split alignments, which are produced by some aligners.
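
To illustrate the cases above, here is a rough Python sketch of a mapping-based duplicate key for read pairs. It is not the logic of samtools, Picard, or any other named tool; the `End` class, `pair_key`, and `mark_duplicates` are hypothetical names. The sketch treats two pairs as duplicates when both mates share chromosome, 5' position and strand, including pairs whose mates map to different chromosomes, and simply skips pairs with an unmapped mate (a simplification that shows why such pairs escape mapping-based deduplication).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class End:
    chrom: Optional[str]  # None means the mate is unmapped
    pos5: int             # 5' coordinate of the alignment
    strand: str           # '+' or '-'


def pair_key(a: End, b: End) -> Optional[Tuple]:
    """Order-independent key for a pair; None if either mate is unmapped."""
    if a.chrom is None or b.chrom is None:
        return None
    return tuple(sorted([(a.chrom, a.pos5, a.strand), (b.chrom, b.pos5, b.strand)]))


def mark_duplicates(pairs: Dict[str, Tuple[End, End]]) -> List[str]:
    """Return names of pairs whose key was already seen (first occurrence is kept)."""
    seen = set()
    dups = []
    for name, (a, b) in pairs.items():
        key = pair_key(a, b)
        if key is None:
            continue  # cannot be deduplicated by position
        if key in seen:
            dups.append(name)
        else:
            seen.add(key)
    return dups


# Hypothetical toy alignments.
pairs = {
    "p1": (End("chr1", 100, "+"), End("chr1", 300, "-")),
    "p2": (End("chr1", 100, "+"), End("chr1", 300, "-")),  # same-chromosome duplicate
    "p3": (End("chr1", 500, "+"), End("chr7", 900, "-")),
    "p4": (End("chr7", 900, "-"), End("chr1", 500, "+")),  # inter-chromosomal duplicate
    "p5": (End("chr2", 50, "+"), End(None, 0, "+")),
    "p6": (End("chr2", 50, "+"), End(None, 0, "+")),       # duplicate, but mate unmapped
}
print(mark_duplicates(pairs))  # ['p2', 'p4']
```

Pair p6 is a duplicate of p5 but goes undetected here because one mate is unmapped. Handling that case means keeping more state per read, which fits the higher memory use mentioned above.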


To support your contention, Picard misses paired-end duplicates with mates mapping to different chromosomes.


Ah, thanks, Picard was indeed what I was thinking of.


Have you got a reference for that? I've read that Picard's MarkDuplicates can handle inter-chromosomal pairs and that samtools rmdup cannot:

  1. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4965708/
  2. https://sourceforge.net/p/picard/wiki/Main_Page/#q-what-is-the-difference-between-markduplicates-and-samtools-rmdup

Plain observation. I've recently been improving the duplicate marking in deepTools, and this is one of the few sources of difference between its output and Picard's. So even if Picard is documented as catching them, it doesn't always.


Interesting and quite surprising. I'll double-check my data.

6.8 years ago

Basically duplicates are of two kinds:

  • natural duplicates - caused by the biological system producing identical DNA fragments
  • artificial duplicates - caused by library preparation (e.g. PCR amplification) or the sequencing instrument producing identical copies of the same DNA fragment

Of course, we'd want to keep the first kind of duplicates and remove the second kind. But rarely, if ever, is a clear distinction possible between the two. Hence the conundrum.

While we are at it, an empirical observation of mine is that data with high rates of artificial duplication are often useless even after the duplicates are removed, because many other problems turn up. So it does not really matter what you do with such data: it remains useless.

In general, from what I understand, people tend to deduplicate their data when uniform coverage is expected across the genome and when the coverage at a given position has major implications for the results. For example, in SNP calling the number of reads supporting a variant is essential in deciding whether to trust that variant. We'd want to avoid counting artificial duplicates there.
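
As a toy illustration of the SNP-calling point, the Python sketch below counts allele support at one site before and after collapsing reads that share fragment coordinates. The reads, coordinates and function names are invented for the example, and a real variant caller does far more than count bases.

```python
from collections import Counter
from typing import List, Tuple

Read = Tuple[int, int, str]  # (fragment_start, fragment_end, base at the SNP site)

# Hypothetical reads covering one site; the three "A" reads share coordinates,
# which is what a PCR-amplified fragment would look like.
reads: List[Read] = [
    (100, 250, "A"),
    (100, 250, "A"),
    (100, 250, "A"),
    (120, 270, "G"),
    (130, 290, "G"),
    (90, 240, "G"),
]


def allele_support(reads: List[Read]) -> Counter:
    """Count how many reads support each base at the site."""
    return Counter(base for _, _, base in reads)


def collapse_by_coords(reads: List[Read]) -> List[Read]:
    """Keep the first read for each (start, end) fragment."""
    first = {}
    for start, end, base in reads:
        first.setdefault((start, end), (start, end, base))
    return list(first.values())


print(allele_support(reads))                      # A: 3, G: 3
print(allele_support(collapse_by_coords(reads)))  # A: 1, G: 3
```

Before collapsing, "A" appears in half of the reads; afterwards it is supported by one of four distinct fragments, which could change whether the variant is trusted.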

In most other cases, and especially when the expected coverage varies wildly and there are reasons for a fragment to occur very frequently (e.g. a highly expressed short transcript in a transcriptome study), duplicate removal is not recommended.


Thanks for the breakdown. I am doing ATAC-seq, so I'm trying to understand the overall pros and cons in that application.
