Hi everyone, I'm working on a term project involving read alignment tools, and I had a question regarding how these programs detect and report PCR duplicates.
As I understand it, a proportion of reads reported as PCR duplicates will be false positives. One read from a pair may have a sequence identical to other reads, but if its mate aligns to a different region of the genome, it's not a true PCR duplicate, as it wouldn't originate from the same DNA fragment. Programs like FastQC only consider one read at a time, without looking at the paired-end data.
But the SAM output from read alignment tools also contains a flag for PCR duplicates. When flagging a PCR duplicate, do read alignment tools look only at individual reads, or do they take into consideration the position of the other read in the pair when the reads come from a paired-end library?
If anyone could give more insight into this I would appreciate it!
Take a look at clumpify.sh in the thread here: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. One advantage is that you don't need to align the data to identify duplicates. If both reads of a pair have the same sequences as both reads of another pair, then there is a chance they are PCR duplicates. You can't be 100% sure unless you use UMIs in your library prep to label individual RNA molecules.
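To make the alignment-free idea concrete, here is a minimal Python sketch. It is not how clumpify.sh is implemented; it only illustrates grouping read pairs by the exact sequence of both mates, so that a pair counts as a duplicate candidate only when read 1 and read 2 both match. The function name and the toy reads are made up for illustration.

```python
from collections import defaultdict

def find_duplicate_pairs(pairs):
    """Group read pairs whose read 1 AND read 2 sequences are both identical.

    pairs: iterable of (name, seq_r1, seq_r2) tuples (hypothetical toy input).
    Returns a list of name groups, one per set of duplicate candidates.
    """
    groups = defaultdict(list)
    for name, seq_r1, seq_r2 in pairs:
        # The key uses both mates, so a shared read 1 alone is not enough.
        groups[(seq_r1, seq_r2)].append(name)
    return [names for names in groups.values() if len(names) > 1]

pairs = [
    ("pairA", "ACGTACGT", "TTGACCAA"),
    ("pairB", "ACGTACGT", "TTGACCAA"),  # same read 1 and read 2 as pairA
    ("pairC", "ACGTACGT", "GGCATTCA"),  # same read 1 only, different mate
]
dup_groups = find_duplicate_pairs(pairs)
print(dup_groups)
```

Here pairC shares read 1 with the others but has a different mate, so it is not grouped with them. That is exactly the false positive a single-read check would produce.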
Keep in mind also that the label "PCR duplicate" is a bit misleading. In fact, it refers to positional duplicates, i.e. reads or read pairs with identical alignment coordinates. As far as I know, in typical Illumina sequencing libraries there is no way to tell apart positional duplicates from PCR duplicates.
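As a rough illustration of "positional duplicate" for paired-end data, the sketch below keys each pair on the alignment coordinates and strands of both mates and sets the SAM duplicate bit (0x400) on every pair after the first one seen at a given key. The tuple layout is hypothetical, and keeping the first pair is a simplification of my own; real duplicate-marking tools apply more careful rules (for example, which pair to keep and how clipping is handled).

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit for "PCR or optical duplicate"

def mark_positional_duplicates(pairs):
    """Set the duplicate bit on pairs whose BOTH mates share coordinates.

    pairs: list of (name, chrom, pos1, strand1, pos2, strand2, flag) tuples
           (a hypothetical, simplified stand-in for aligned SAM records).
    Returns a dict mapping pair name to its updated flag.
    """
    seen = set()
    flags = {}
    for name, chrom, pos1, strand1, pos2, strand2, flag in pairs:
        key = (chrom, pos1, strand1, pos2, strand2)
        if key in seen:
            flag |= DUPLICATE_FLAG  # same position for both mates
        else:
            seen.add(key)           # first pair at this position is kept
        flags[name] = flag
    return flags

pairs = [
    ("pairA", "chr1", 100, "+", 350, "-", 99),
    ("pairB", "chr1", 100, "+", 350, "-", 99),  # both mates match pairA
    ("pairC", "chr1", 100, "+", 420, "-", 99),  # only read 1 matches
]
flags = mark_positional_duplicates(pairs)
for name, flag in flags.items():
    print(name, flag, bool(flag & DUPLICATE_FLAG))
```

Only pairB gets the duplicate bit here. pairC starts at the same position as pairA for read 1, but its mate aligns elsewhere, so it is treated as a separate fragment.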