Question

Difference between marking duplicates and filtering BAM on phred score

5.8 years ago

Nalini ▴ 20

Hey!

I am working with whole-genome sequencing data from bacterial samples, sequenced on an Illumina platform. After receiving the paired-end .fastq files, I aligned them to my reference genome with bowtie2 to get a SAM file, converted it to BAM, and sorted the BAM file. I then filtered the sorted BAM file to obtain one with phred score >30. This filtered BAM file is what I have used for my downstream analysis.

My question is whether the final output file would differ if I used Picard's MarkDuplicates. I read up on how MarkDuplicates works: it identifies optical artifacts and PCR duplicates, and as I understand it, only pairs with Q>15 are considered. Hope I have got that right! So does that mean that by filtering on Q>30 I have already taken care of duplicates, or are these two totally different quality checks?

Please advise whether I can go ahead with the Q>30 filtered files or whether I also need to use the MarkDuplicates tool. It would be great if you could give me a simple explanation.

Thanks in advance!! :)

alignment next-gen quality duplicates picard • 3.3k views

updated 5.8 years ago by d-cameron ★ 2.9k • written 5.8 years ago by Nalini ▴ 20

score 2 · Answer 1 · 2018-06-16

are these two totally different quality checks?

These are two totally different quality checks.

MarkDuplicates flags (and can optionally remove) fragments that have been sequenced multiple times due to PCR amplification.

MAPQ filtering removes reads that are ambiguously placed by the aligner. When a read aligner places a read, it also reports a MAPQ (mapping quality), which is a phred-scaled quality score. Since many genomes contain repetitive sequence, many reads cannot be placed unambiguously because they align equally well to two or more locations in the genome (multi-mapping reads).
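To make "phred-scaled" concrete, here is a minimal sketch of the conversion between a phred score Q and the error probability it nominally encodes, Q = -10 * log10(P). The function names are made up for illustration, and aligners do not necessarily calibrate MAPQ exactly to this scale, so treat the numbers as the nominal meaning only.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Return the error probability nominally encoded by phred score q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Return the phred score that corresponds to error probability p."""
    return -10 * math.log10(p)


# Print the nominal probability that the reported value is wrong
# (for MAPQ: that the read is placed at the wrong location).
for q in (10, 20, 30):
    print(f"Q{q}: error probability {phred_to_error_probability(q):.4f}")
```

On this scale a MAPQ threshold of 30 keeps only reads whose reported chance of being misplaced is at most about 1 in 1000. It says nothing about whether a read is a duplicate.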

Note that there is also a phred-scaled base quality score for each base. When you say "phred score >30", it is not obvious whether you are talking about filtering out multi-mapping reads, or trimming reads with runs of low base quality scores (which you should also do).
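The two kinds of phred score live in different places in an alignment record, which a short sketch may help show. It assumes the standard tab-separated SAM column order (MAPQ is the 5th column, QUAL the 11th) and Phred+33 encoded base qualities. The record itself and the helper name `parse_sam_record` are made up for illustration.

```python
def parse_sam_record(line: str) -> dict:
    """Split one SAM alignment line into the fields relevant here."""
    fields = line.rstrip("\n").split("\t")
    return {
        "name": fields[0],
        "flag": int(fields[1]),
        "chrom": fields[2],
        "pos": int(fields[3]),
        # One mapping quality per read, assigned by the aligner.
        "mapq": int(fields[4]),
        # One base quality per base, assigned by the sequencer (Phred+33).
        "base_quals": [ord(c) - 33 for c in fields[10]],
    }


# Hypothetical record: confidently placed read with two poor bases.
sam_line = "read1\t99\tchr1\t100\t42\t8M\t=\t250\t158\tACGTACGT\tIIII##II"
rec = parse_sam_record(sam_line)

print("MAPQ:", rec["mapq"])
print("base qualities:", rec["base_quals"])
```

A read can have a high MAPQ and still contain low-quality bases, or the reverse, so a ">30" filter means something different depending on which of the two fields it is applied to.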

Edit: due to the difficulty/impossibility of determining the actual source location of multi-mapping reads, these reads are also quite difficult to correctly deduplicate with MarkDuplicates since MarkDuplicates relies on the read alignments of the duplicate reads to also match.

score 1 · Answer 2 · 2018-06-16

Hello,

MarkDuplicates finds duplicates based on the mapping information. The quality values are only taken into account to determine which of the duplicates should stay as the "original" read.

So filtering your bam file by phred scores doesn't remove duplicates.
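A toy sketch of that idea, not Picard's actual implementation: reads are grouped by mapping position and strand, the read with the highest summed base quality in each group is kept as the "original", and the rest receive the SAM duplicate flag (0x400). The reads are invented single-end records, and the Q>15 cut-off from the question is used here as an assumed threshold for the tie-break score. The real tool also handles read pairs, clipping, and libraries.

```python
from collections import defaultdict

DUPLICATE_FLAG = 0x400  # SAM flag bit for "PCR or optical duplicate"

# Hypothetical reads: (name, chrom, pos, strand, mapq, base_quals)
reads = [
    ("r1", "chr1", 100, "+", 42, [40, 40, 38, 40]),
    ("r2", "chr1", 100, "+", 42, [40, 12, 30, 40]),  # same placement as r1
    ("r3", "chr1", 250, "-", 42, [35, 36, 37, 38]),
    ("r4", "chr1", 400, "+", 1, [40, 40, 40, 40]),   # ambiguously placed
]


def mark_duplicates(reads, min_base_qual=15):
    """Return {read name: flag}, flagging all but the best read per position."""
    groups = defaultdict(list)
    for read in reads:
        _, chrom, pos, strand, _, _ = read
        groups[(chrom, pos, strand)].append(read)

    def score(read):
        # Base qualities are used only to rank reads within a group.
        return sum(q for q in read[5] if q >= min_base_qual)

    flags = {}
    for group in groups.values():
        best = max(group, key=score)
        for read in group:
            flags[read[0]] = 0 if read is best else DUPLICATE_FLAG
    return flags


flags = mark_duplicates(reads)
mapq_filtered = [r[0] for r in reads if r[4] >= 30]

print("duplicate flags:", flags)
print("reads passing MAPQ >= 30:", mapq_filtered)
```

In this example r2 is flagged as a duplicate of r1 yet passes the MAPQ filter, while r4 fails the MAPQ filter without being a duplicate of anything: the two steps select different reads.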

Also, I guess it's better to run MarkDuplicates first and then do any filtering steps.

fin swimmer