How do Read Alignment tools detect PCR duplicates?
5.1 years ago

Hi everyone, I'm working on a term project involving read alignment tools, and I had a question regarding how these programs detect and report PCR duplicates.

As I understand it, a proportion of the reads reported as PCR duplicates will be false positives. One read of a pair may have a sequence identical to other reads, but if its mate aligns to a different region of the genome, the pair is not a true PCR duplicate, because it cannot originate from the same DNA fragment. Programs like FastQC consider only one read at a time, without looking at the paired-end data.

But the SAM output from read alignment tools also contains a flag for PCR duplicates. When flagging a PCR duplicate, do read alignment tools look only at individual reads, or do they take the position of the mate into consideration when the reads come from a paired-end library?

If anyone could give more insight into this I would appreciate it!


Take a look at clumpify.sh in the thread here: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. One advantage is that you don't need to align the data to identify duplicates.

If both reads of a pair have sequences identical to those of another pair, there is a chance they are PCR duplicates. You can't be 100% sure unless you use UMIs in your library prep to label individual molecules.
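
To make the idea concrete, here is a minimal sketch of alignment-free duplicate counting on read pairs, with and without a UMI in the comparison key. This is not how Clumpify works internally; the function name `count_extra_copies` and the `(r1_seq, r2_seq, umi)` record layout are assumptions made up for illustration, and only exact sequence matches are considered.

```python
from collections import Counter

def count_extra_copies(records, use_umi=False):
    """Count read pairs that are extra copies of an earlier pair.

    records: iterable of (r1_seq, r2_seq, umi) tuples (hypothetical layout).
    A pair counts as a copy only if BOTH reads match exactly; with
    use_umi=True the UMI must match as well.
    """
    counts = Counter()
    for r1, r2, umi in records:
        key = (r1.upper(), r2.upper())
        if use_umi:
            key += (umi,)
        counts[key] += 1
    # Every group of n identical keys contributes n - 1 extra copies.
    return sum(n - 1 for n in counts.values())

# Toy data, invented for this example.
records = [
    ("ACGT", "TTGA", "AAC"),
    ("ACGT", "TTGA", "AAC"),  # same sequences, same UMI
    ("ACGT", "TTGA", "GGT"),  # same sequences, different UMI
    ("ACGT", "CCCA", "AAC"),  # R1 matches, R2 differs: not a copy
]

print(count_extra_copies(records))                # sequence only
print(count_extra_copies(records, use_umi=True))  # sequence + UMI
```

Without the UMI, the third pair looks like a duplicate; with the UMI in the key it is treated as an independent molecule that happens to share the same sequence.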


Keep in mind also that the label "PCR duplicate" is a bit misleading. In practice it refers to positional duplicates, i.e. reads or read pairs with identical alignment coordinates. As far as I know, in typical Illumina sequencing libraries there is no way to tell positional duplicates apart from true PCR duplicates.
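
A rough sketch of what "positional duplicate" means, and why looking at the mate matters for paired-end data. The `Fragment` record and `mark_positional_duplicates` function are hypothetical names for this illustration only. Real tools are more careful (for example, they typically work from 5' positions adjusted for clipping and choose which copy to keep by quality), whereas this sketch simply keeps the first fragment seen.

```python
from collections import namedtuple

# Hypothetical, simplified record for one aligned read and its mate.
Fragment = namedtuple(
    "Fragment", "name chrom pos strand mate_chrom mate_pos mate_strand"
)

def mark_positional_duplicates(fragments, paired=True):
    """Return the names of fragments whose coordinates were already seen.

    paired=False compares only the read's own chrom/pos/strand;
    paired=True also requires the mate's coordinates to match.
    """
    seen = set()
    duplicates = set()
    for f in fragments:
        key = (f.chrom, f.pos, f.strand)
        if paired:
            key += (f.mate_chrom, f.mate_pos, f.mate_strand)
        if key in seen:
            duplicates.add(f.name)  # simplification: first one seen is kept
        else:
            seen.add(key)
    return duplicates

# Toy alignments, invented for this example.
fragments = [
    Fragment("a", "chr1", 100, "+", "chr1", 350, "-"),
    Fragment("b", "chr1", 100, "+", "chr1", 350, "-"),  # same at both ends
    Fragment("c", "chr1", 100, "+", "chr1", 420, "-"),  # mate maps elsewhere
]

print(sorted(mark_positional_duplicates(fragments, paired=False)))
print(sorted(mark_positional_duplicates(fragments, paired=True)))
```

With `paired=False`, fragment `c` is flagged because its first read starts at the same position as `a`; with `paired=True` it is not, because its mate aligns elsewhere.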

5.1 years ago

To sum up what swbarnes and genomax wrote:

  1. Alignment tools usually don't change the FLAG entry that indicates whether a given read may be a duplicate or not.
  2. FastQC never changes anything in the FASTQ or BAM files it is looking at.
  3. Commonly used tools that do detect duplicates include samtools markdup, Picard's MarkDuplicates, and the clumpify tool mentioned by genomax. Different tools may handle specific details differently, so if you need to be absolutely sure, it pays to read the documentation of the tool you settle on. The general consensus is that, for paired-end reads, both reads of a pair are taken into consideration.
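
To check whether a record has been marked, you can test the duplicate bit of the FLAG field yourself. In the SAM specification this bit is 0x400 (decimal 1024). The sketch below parses plain SAM text lines; the helper name `is_marked_duplicate` and the two example records are made up for illustration.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit for "PCR or optical duplicate"

def is_marked_duplicate(sam_line):
    """Return True if the duplicate bit is set in a SAM alignment line."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second tab-separated column
    return bool(flag & DUPLICATE_FLAG)

# Two invented records: FLAG 99 (paired, proper pair, mate reverse,
# first in pair) and FLAG 1123 (the same bits plus 1024).
unmarked = "r1\t99\tchr1\t100\t60\t4M\t=\t350\t254\tACGT\tIIII"
marked = "r2\t1123\tchr1\t100\t60\t4M\t=\t350\t254\tACGT\tIIII"

print(is_marked_duplicate(unmarked))
print(is_marked_duplicate(marked))
```

Straight out of most aligners this bit will be unset for every read; it only becomes informative after a duplicate-marking tool has been run on the file.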
5.1 years ago

FastQC is not an aligner. It reports any sequences it sees over and over again, as that might indicate a quality issue.

Many aligners don't touch the PCR duplicate flag. Most people use programs like Picard Tools to flag PCR duplicates after alignment. Picard can take both reads of a pair into account when marking duplicates.
