Hi everyone,
I have some paired-end RNA-seq samples with high levels of duplication (in some, only 6% of reads remain after de-duplication). I think this is due to the low input RNA (~1 ng) and to a smaller subset of genes being expressed (the RNA is from a specific cell type isolated from brain). Even after poly-A selection, the most highly expressed gene in my samples was a ribosomal RNA transcript.
I used Picard's MarkDuplicates to remove duplicate reads from my samples and looked at how that affected the counts. I was happy to see that the counts for the rRNA gene were greatly reduced, but the counts for almost every other gene were reduced as well. I had thought that only highly expressed genes would have duplicate reads. I also ran a correlation analysis between the original and de-duplicated samples and saw excellent correlation between them, so now I'm just confused.
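For anyone who wants to run the same comparison, here is a rough sketch of what I mean, not my exact pipeline. It assumes you already have per-gene counts before and after de-duplication as two dictionaries. The gene names, the numbers, and the function names (`retained_fraction`, `log_count_correlation`) are all made up for illustration. It computes the fraction of counts each gene keeps after de-duplication and the Pearson correlation of log2(count + 1) between the two sets. Note that a high correlation only says the ranking of genes is similar; it can still be high when every gene has lost counts.

```python
import math

def retained_fraction(raw, dedup):
    """Per-gene fraction of counts that survive de-duplication."""
    return {g: dedup.get(g, 0) / raw[g] for g in raw if raw[g] > 0}

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def log_count_correlation(raw, dedup):
    """Pearson correlation of log2(count + 1), raw vs de-duplicated."""
    genes = sorted(raw)
    xs = [math.log2(raw[g] + 1) for g in genes]
    ys = [math.log2(dedup.get(g, 0) + 1) for g in genes]
    return pearson(xs, ys)

# Hypothetical counts, for illustration only (not real data).
raw = {"rRNA_gene": 100000, "geneA": 2000, "geneB": 150, "geneC": 12}
dedup = {"rRNA_gene": 3000, "geneA": 400, "geneB": 90, "geneC": 10}

fractions = retained_fraction(raw, dedup)
for gene in sorted(fractions, key=fractions.get):
    print(f"{gene}\t{fractions[gene]:.2f}")
print("correlation of log counts:", round(log_count_correlation(raw, dedup), 3))
```

Sorting genes by retained fraction makes it easy to see whether the loss is concentrated in a few highly expressed genes or spread across nearly all of them.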
If basically every gene has duplicates, what does it mean? Should I only use de-duplicated samples for further analysis? I know there are lots of other threads on this issue but it seems like my duplication is more severe.
Someone with better experimental chops will need to confirm, but perhaps extra cycles of amplification caused this problem?
If you feel that the experiment did not work as intended, then perhaps it is time to consider redoing it, or at least the library prep. That is easy for someone like me to say, so apologies in advance if this is an irreplaceable sample or a difficult experiment.