Differences between FastQC duplication levels for paired fastq and bam files.
5.8 years ago

I am using FastQC to summarize QC statistics for an RNA-seq assay and I am a little confused about the sequence duplication levels. FastQC reports an error (i.e. a high level of duplication) for both paired fastq files, Col0_1_R1.fq and Col0_1_R2.fq.

[Image: Col0 R1 fq]

However, no such error is reported for the corresponding aligned file, Col0_1.bam.

[Image: Col0 1 bam]

Does FastQC use different heuristics to detect duplicates in these two situations?

Thanks in advance

RNA-Seq rna-seq fastqc • 3.3k views

Post about sequence duplication from FastQC authors.

5.8 years ago
drkennetz ▴ 560

Sure. FastQC's duplication detection assumes that the library is unenriched. If you have enriched for a particular region or gene and captured many copies of the same sequence, FastQC will report a high duplication level for the unaligned fastq files. After alignment, the bam file carries more information about each read, including its coordinate location, which helps tell whether the read is actually a duplicate (a PCR artifact) or not.
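To make the distinction concrete, here is a minimal Python sketch contrasting a sequence-only duplicate count with a coordinate-aware one for paired reads. It is not FastQC's implementation; the function names, the toy reads, and the choice of (chromosome, R1 start, R2 end, strand) as the pair key are all assumptions made for illustration.

```python
from collections import Counter

def sequence_duplicate_fraction(seqs):
    """Fraction of reads whose exact sequence repeats an earlier read."""
    counts = Counter(seqs)
    return 1 - len(counts) / len(seqs)

def pair_duplicate_fraction(pairs):
    """Fraction of pairs whose (chrom, r1_start, r2_end, strand) key repeats."""
    counts = Counter(pairs)
    return 1 - len(counts) / len(pairs)

# Hypothetical toy data: four fragments from one highly expressed transcript.
# Every R1 read starts at the same position, so the R1 sequences are identical,
# but the fragments end at different positions.
r1_seqs = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTACGT"]
pairs = [
    ("chr1", 100, 350, "+"),
    ("chr1", 100, 362, "+"),
    ("chr1", 100, 350, "+"),  # same start and end as the first pair
    ("chr1", 100, 410, "+"),
]

seq_frac = sequence_duplicate_fraction(r1_seqs)
pair_frac = pair_duplicate_fraction(pairs)
print(f"sequence-only duplicates (R1): {seq_frac:.0%}")
print(f"pair-coordinate duplicates:    {pair_frac:.0%}")
```

In this toy case the sequence-only view calls three of the four R1 reads duplicates, while the pair-aware view calls only one pair a duplicate, because it also looks at where the mate ends.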

An important note on FastQC: it should not be used to determine the final metrics for a library, only as a quick check up front.

I hope this helps,

Dennis


Thanks for your prompt response drkennetz! I suspected that somehow aligned pairs were used to remove degeneracy in duplicate detection but couldn't find how this is actually achieved. Could you refer me to some publications or technical descriptions of the involved heuristics?


The only technical notes I can point you to are in the FastQC documentation, which says that raw fastq files are treated differently from mapped sam/bam files: the latter are opened as sam or bam files, and all mapped and unmapped sequences are used to determine the analysis results. If you have the software installed, you can check the documentation here:

/apps/fastqc/install/0.11.5/Help/2 Basic Operations

It also says this for RNA-seq libraries:

"In RNA-Seq libraries sequences from different transcripts will be present at wildly different levels in the starting population. In order to be able to observe lowly expressed transcripts it is therefore common to greatly over-sequence high expressed transcripts, and this will potentially create large set of duplicates. This will result in high overall duplication in this test, and will often produce peaks in the higher duplication bins. This duplication will come from physically connected regions, and an examination of the distribution of duplicates in a specific genomic region will allow the distinction between over-sequencing and general technical duplication, but these distinctions are not possible from raw fastq files. A similar situation can arise in highly enriched ChIP-Seq libraries although the duplication there is less pronounced. Finally, if you have a library where the sequence start points are constrained (a library constructed around restriction sites for example, or an unfragmented small RNA library) then the constrained start sites will generate huge dupliction levels which should not be treated as a problem, nor removed by deduplication. In these types of library you should consider using a system such as random barcoding to allow the distinction of technical and biological duplicates."

I cannot point you to anything that explicitly states how the software works, and it is all compiled when installed. Sorry about that. I don't think the documentation would say that mapped sequences are treated differently from fastq files unless the software used all of the sam/bam information. But I cannot say this for certain!
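The over-sequencing effect described in the quoted documentation can be approximated with a small calculation. The sketch below is a simplification and not anything taken from FastQC: it assumes single-end reads, start sites drawn uniformly at random, and that two reads are identical exactly when they share a start site. The function name and the read counts are hypothetical.

```python
def expected_duplicate_fraction(n_reads, n_start_sites):
    """Expected fraction of reads that repeat an already-seen start site.

    Assumes each read starts at one of n_start_sites positions chosen
    uniformly at random, and that reads sharing a start site are identical.
    """
    # Probability that a given start site is hit by none of the reads.
    p_site_missed = (1 - 1 / n_start_sites) ** n_reads
    # Expected number of distinct start sites (distinct sequences) observed.
    expected_distinct = n_start_sites * (1 - p_site_missed)
    return 1 - expected_distinct / n_reads

# Hypothetical numbers: two transcripts with the same number of possible
# start sites but very different expression levels.
high = expected_duplicate_fraction(n_reads=100_000, n_start_sites=1_000)
low = expected_duplicate_fraction(n_reads=100, n_start_sites=1_000)
print(f"highly expressed transcript: {high:.1%} duplicates")
print(f"lowly expressed transcript:  {low:.1%} duplicates")
```

Under these assumptions the highly expressed transcript shows roughly 99% duplication and the lowly expressed one about 5%, without any PCR artifacts being involved.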

5.8 years ago

Ok, I see. Thanks again Dennis!
