Question

How do Read Alignment tools detect PCR duplicates?


5.1 years ago

madischapel • 0

Hi everyone, I'm working on a term project involving read alignment tools, and I had a question regarding how these programs detect and report PCR duplicates.

As I understand it, a proportion of reported PCR duplicates will be false positives. One read of a pair may have a sequence identical to other reads, but if its mate aligns to a different region of the genome, the pair is not a true PCR duplicate, because it cannot have originated from the same DNA fragment. Programs like FastQC consider only one read at a time, without looking at the paired-end data.

But the SAM output from read alignment tools also contains a flag for PCR duplicates. When flagging a PCR duplicate, do read alignment tools look only at individual reads, or do they take into consideration the position of the other pair when the reads come from a paired-end library?

If anyone could give more insight into this I would appreciate it!

alignment • 3.1k views

updated 5.1 years ago by Friederike 8.9k • written 5.1 years ago by madischapel • 0


Take a look at clumpify.sh in the thread here: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. One advantage is that you don't need to align the data to identify duplicates.

If both reads of a pair have sequences identical to those of another pair from the fragments you are sequencing, then there is a chance they are PCR duplicates. You can't be 100% sure unless you use UMIs in your library prep to label individual RNA molecules.
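To make the idea concrete, here is a minimal Python sketch of alignment-free duplicate grouping: pairs are grouped by the sequences of both mates, and a UMI can optionally be added to the key. It only illustrates the concept and is not how clumpify.sh is implemented; the function name `group_by_sequence` and the `(name, umi, seq1, seq2)` tuple layout are made up for this example.

```python
from collections import defaultdict

def group_by_sequence(pairs, use_umi=False):
    """Group read pairs whose two mates have identical sequences.

    pairs: iterable of (name, umi, seq1, seq2) tuples (hypothetical layout).
    Returns one list of names per candidate duplicate group (size > 1).
    """
    groups = defaultdict(list)
    for name, umi, seq1, seq2 in pairs:
        # Both mates are part of the key; the UMI is added only on request.
        key = (seq1, seq2, umi) if use_umi else (seq1, seq2)
        groups[key].append(name)
    return [names for names in groups.values() if len(names) > 1]

pairs = [
    ("pairA", "AACG", "ACGTAC", "TTGCAA"),
    ("pairB", "AACG", "ACGTAC", "TTGCAA"),  # same sequences, same UMI
    ("pairC", "GGTA", "ACGTAC", "TTGCAA"),  # same sequences, different UMI
    ("pairD", "AACG", "ACGTAC", "CCCGGG"),  # read 1 matches, mate differs
]

print(group_by_sequence(pairs))                # [['pairA', 'pairB', 'pairC']]
print(group_by_sequence(pairs, use_umi=True))  # [['pairA', 'pairB']]
```

Without UMIs, pairC is indistinguishable from a duplicate; with UMIs it is recognised as a separate molecule. pairD is never grouped, because its mate has a different sequence.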

5.1 years ago by GenoMax 141k


Keep in mind also that the label "PCR duplicate" is a bit misleading. In fact, it refers to positional duplicates, i.e. reads or read pairs with identical alignment coordinates. As far as I know, in typical Illumina sequencing libraries there is no way to tell positional duplicates apart from true PCR duplicates.
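As a rough illustration of what "identical alignment coordinates" means for paired-end data, the Python sketch below builds a key from both mates and groups pairs that share it. This is a simplification written for this thread, not the algorithm of any particular tool: the field names (`chrom1`, `pos1`, `strand1`, and so on) are hypothetical, and real duplicate markers also deal with details such as soft clipping and which copy to keep, so check the documentation of the tool you use.

```python
from collections import defaultdict

def pair_key(pair):
    """Build a positional key from BOTH mates of a read pair.

    pair: dict with hypothetical fields chrom1/pos1/strand1 and
    chrom2/pos2/strand2, where pos* is the 5' alignment coordinate.
    """
    end1 = (pair["chrom1"], pair["pos1"], pair["strand1"])
    end2 = (pair["chrom2"], pair["pos2"], pair["strand2"])
    return (end1, end2)

def positional_duplicates(pairs):
    """Return lists of pair names that share the same positional key."""
    groups = defaultdict(list)
    for pair in pairs:
        groups[pair_key(pair)].append(pair["name"])
    return [names for names in groups.values() if len(names) > 1]

example = [
    {"name": "p1", "chrom1": "chr1", "pos1": 1000, "strand1": "+",
     "chrom2": "chr1", "pos2": 1300, "strand2": "-"},
    {"name": "p2", "chrom1": "chr1", "pos1": 1000, "strand1": "+",
     "chrom2": "chr1", "pos2": 1300, "strand2": "-"},
    # Same read 1 coordinates as p1, but the mate aligns elsewhere.
    {"name": "p3", "chrom1": "chr1", "pos1": 1000, "strand1": "+",
     "chrom2": "chr7", "pos2": 5000, "strand2": "-"},
]

print(positional_duplicates(example))  # [['p1', 'p2']]
```

Because the mate's position is part of the key, p3 is not grouped with p1 and p2 even though its first read starts at the same coordinate.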

5.1 years ago by dariober 14k

score 1 · Answer 1 · 2019-03-22

To sum up what swbarnes and genomax wrote:

Alignment tools usually don't change the FLAG entry that indicates whether a given read may be a duplicate or not.
FastQC never changes anything in the FASTQ or BAM files it is looking at.
Commonly used tools that do detect duplicates include samtools markdup, Picard's MarkDuplicates, and the clumpify tool mentioned by genomax. Different tools may handle specific details differently, so if you need to know for certain, it pays to read the documentation of the tool you settle on. The general consensus, though, is that for paired-end reads both reads of a pair are taken into consideration.

score 0 · Answer 2 · 2019-03-22

FastQC is not an aligner. It reports any sequences it sees over and over again, as that might indicate a quality issue.

Many aligners don't touch the PCR duplicate flag. Most people use programs like Picard Tools to flag PCR duplicates after alignment. Picard Tools can use both reads of a pair if told to do so.
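For reference, the duplicate flag is bit 0x400 (decimal 1024) of the FLAG field, which the SAM specification describes as "PCR or optical duplicate". A small Python sketch for checking whether a record has been marked; the helper name `is_marked_duplicate` and the two example records are invented for illustration:

```python
DUPLICATE_FLAG = 0x400  # 1024: "PCR or optical duplicate" in the SAM spec

def is_marked_duplicate(sam_line):
    """Return True if the FLAG field (column 2) has bit 0x400 set."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & DUPLICATE_FLAG)

# Two made-up records that differ only in the duplicate bit (1107 = 83 + 1024).
records = [
    "read1\t83\tchr1\t1000\t60\t50M\t=\t800\t-250\t*\t*",
    "read2\t1107\tchr1\t1000\t60\t50M\t=\t800\t-250\t*\t*",
]

for rec in records:
    print(rec.split("\t")[0], is_marked_duplicate(rec))
# read1 False
# read2 True
```

If this bit is unset on every record straight out of the aligner, that usually just means no duplicate-marking step has been run yet, not that the library is free of duplicates.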