Question

Duplicate Reads in RNA-seq

5

11.6 years ago

Ashutosh Pandey 12k

In genomic analyses that include variant discovery, it is advisable to ignore or remove duplicate reads, because an amplification error could easily be mistaken for a SNP. So we usually keep only reads with unique start positions.
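To make "reads with unique start positions" concrete, here is a minimal sketch of how the unique fraction could be computed. It assumes each read has already been reduced to a (chromosome, start, strand) tuple; the function name and the toy reads are hypothetical, and real tools apply more elaborate rules (for example, considering both mates of a pair).

```python
from collections import Counter


def unique_start_fraction(alignments):
    """Fraction of reads left after collapsing reads that share the same
    (chromosome, start, strand). Returns 0.0 for an empty input."""
    if not alignments:
        return 0.0
    distinct_positions = Counter(alignments)
    return len(distinct_positions) / len(alignments)


# Hypothetical toy alignments: (chromosome, 1-based start, strand)
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),  # same start and strand as the first read: a duplicate
    ("chr1", 100, "-"),  # same start but opposite strand: counted separately
    ("chr1", 250, "+"),
    ("chr2", 100, "+"),
]

# 4 distinct positions among 5 reads
print(unique_start_fraction(reads))
```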

Library preparation protocols for RNA-seq involve amplification at some point. I know amplification is required for sequencing, and I assume it acts uniformly across the transcriptome (at least in theory). For example, if Gene A has 4 RNA copies in the sample and Gene B has 10 copies, then after amplification, if Gene A has 16 copies, Gene B should have 40. I also know that in a very deep library many duplicate reads are expected for reasons other than amplification.

a) Does anyone have an idea what percentage (range) of duplicate reads among total reads is considered normal for RNA-seq data?

Also, do we need to discard duplicate reads in RNA-seq experiments where we compare gene expression between two samples whose duplicate rates differ significantly (one sample was amplified more than the other)? I know that dividing the read counts by the total number of mapped reads in a sample removes the bias from unequal numbers of reads across samples, but will this normalization step also take care of duplicates, so that we get a true representation of the transcriptome after normalization?
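The arithmetic behind this question can be sketched as follows. The example reuses the Gene A / Gene B copy numbers from above; the "skewed" amplification factors are purely hypothetical and are only there to show what happens if amplification is not uniform. The function name `counts_per_million` is an illustrative helper, not taken from any particular package.

```python
def counts_per_million(counts):
    """Scale each gene's count by the sample's total count (times 1e6)."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


# Copy numbers from the example: Gene A has 4 copies, Gene B has 10.
original = {"geneA": 4, "geneB": 10}

# Uniform amplification: every gene multiplied by the same factor (here 4).
uniform = {gene: n * 4 for gene, n in original.items()}  # 16 and 40

# Hypothetical non-uniform amplification: gene-specific factors (8 and 2).
skewed = {"geneA": 4 * 8, "geneB": 10 * 2}

cpm_original = counts_per_million(original)
cpm_uniform = counts_per_million(uniform)
cpm_skewed = counts_per_million(skewed)

# The uniform factor cancels in the division by the total;
# the gene-specific factors do not.
for label, cpm in [("original", cpm_original),
                   ("uniform", cpm_uniform),
                   ("skewed", cpm_skewed)]:
    print(label, cpm["geneB"] / cpm["geneA"])
```

Under these assumptions, library-size normalization recovers the original proportions only when every transcript is amplified by the same factor; it cannot correct a bias that differs from gene to gene.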

Some protocols that start from very small quantities of RNA require a lot of amplification before sequencing. I have RNA-seq data for two such samples: in the first, only 30% of the reads are unique (non-duplicate), and in the second, around 50% are. Can I still carry out RPKM normalization and use tools like DEGseq and edgeR to get a list of differentially expressed genes?
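For reference, RPKM (reads per kilobase of transcript per million mapped reads) can be written as a one-line calculation. This is a minimal sketch with hypothetical numbers; whether duplicates are included in `read_count` and `total_mapped_reads` is exactly the choice being asked about, and the formula itself does not decide it.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp transcript
# in a library of 10 million mapped reads.
example_value = rpkm(500, 2000, 10_000_000)
print(example_value)
```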

Thanks

duplicates rna-seq rpkm • 14k views

Updated 11.6 years ago by Ketil 4.1k • written 11.6 years ago by Ashutosh Pandey 12k

0

I think we DON'T need to discard duplicate reads in RNA-seq experiments.

Reply • 11.6 years ago by Rm 8.3k

0

I agree with that too.

Reply • 11.6 years ago by JC 13k

score 3 · Answer 1 · 2012-09-13

11.6 years ago

Rm 8.3k

For some insights on PCR duplicates in RNA-seq data, follow this thread on SEQanswers.

score 1 · Answer 2 · 2012-09-17

I think the number of duplicates depends on many factors, so it is hard to give any general, useful rule of thumb. Usually, duplicates are correlated with too little sample material and/or difficulties in the lab. I expect more complex procedures may cause more duplicates, but I don't have any hard numbers on that. In my experience, duplication rates seem to be higher and less evenly distributed with 454 than with Illumina, but that could be a bias from the types of data I've seen.

I'm curious if others' experiences agree with mine.