Question: Should We Remove Duplicated Reads In RNA-Seq?
11

Is it OK to remove duplicated reads in RNA-seq before testing for differential expression?

ADD COMMENTlink 7.3 years ago siddharth.sethi5 • 230 • updated 6.2 years ago earonesty • 220
0

How do you define a duplicated read? If you delete duplicate reads, how will you get information on expression levels?

ADD REPLYlink 7.3 years ago
Whetting
♦ 1.5k
2

A duplicated read is not the same as a duplicated transcript. The level of natural read duplication will be a function of coverage and expression levels. And since most of the time the latter is not known, we can only estimate the natural duplication rate.

ADD REPLYlink 7.3 years ago
Istvan Albert
80k
0

Reads which are exactly the same and on the same strand. They get aligned to the same place and look like multiple copies of one read. So, I assume the answer is NO?

ADD REPLYlink 7.3 years ago
siddharth.sethi5
• 230
23

The general consensus seems to be NOT to remove duplicates from RNA-seq. Read the Biostars discussions:

and this SEQanswers thread and the other threads it links to.

ADD COMMENTlink 7.3 years ago Obi Griffith 17k • updated 7.3 years ago Istvan Albert 80k
5

The general consensus I see from those discussions, rather than "no", is "it depends".

The first link: Malachi Griffith writes: "Observing high rates of read duplicates in RNA-seq libraries is common. It may not be an indication of poor library complexity caused by low sample input or over-amplification. It might be caused by such problems but it is often because of very high abundance of a small number of genes."

Second link: Istvan Albert writes: "My personal opinion is to investigate the duplication rates and remove them if there is indication that these are artificial ones."

Third link: Ketil writes: "I think the number of duplicates depend on many factors, so it is hard to give any general and useful rules of thumb. Usually, duplicates are correlated with too little sample material, and/or difficulties in the lab."

And the seqanswers thread goes both ways.

ADD REPLYlink 7.3 years ago
Mikael Huss
4.6k
0

Yes. I should have said that the general consensus seems to be to not blindly remove duplicates.

ADD REPLYlink 7.3 years ago
Obi Griffith
17k
0

But are type I/II errors from duplication rates expression/quartile dependent?

The coefficient of variation is often lowest in the most abundant genes, which are also the genes with the highest probability of duplicates (optical, amplification, or true duplicates) by virtue of the amount of data they consume. Variance is often artificially inflated for these genes anyway in most regularization protocols for differential expression, so the conclusions from DE analyses should be largely the same... In contrast, genes with low expression stand to be the most affected by duplicates for this exact same reason (i.e. their variance is artificially reduced by regularization AND duplicates are a larger proportion of their count totals to begin with, thus increasing type I error). The odds of these arguments affecting opinions are low, but I think it's a conversation worth having.

I would recommend re-running differential expression analyses with/without deduplication and then comparing the resulting gene sets for each quartile of expression. My educated guess is that conclusions for the top quartile would be fairly consistent (due to the low CoV) while results could be quite different for genes with low expression, where a larger percentage of the counts may be optical or technical duplicates (in comparison to genes in the top quartile).
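
A toy version of this with/without-dedup comparison could look like the sketch below. All gene names, expression values, and DE calls are invented for illustration; the point is only the quartile-stratified overlap measure:

```python
import statistics

def jaccard(a, b):
    """Overlap of two DE gene sets (1.0 means identical calls)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def quartile_bins(expr):
    """Assign each gene to an expression quartile (0 = lowest)."""
    cuts = statistics.quantiles(expr.values(), n=4)
    return {g: sum(v > c for c in cuts) for g, v in expr.items()}

# 100 invented genes with expression levels 1..100:
expr = {f"gene{i}": float(i) for i in range(1, 101)}
bins = quartile_bins(expr)

# Invented DE calls: the two runs agree in the top quartiles, not the bottom.
de_with_dups = {"gene5", "gene30", "gene60", "gene90", "gene95"}
de_deduped = {"gene12", "gene28", "gene60", "gene90", "gene95"}

for q in range(4):
    genes = {g for g, b in bins.items() if b == q}
    # Agreement rises with expression quartile in this toy example.
    print(q, round(jaccard(de_with_dups & genes, de_deduped & genes), 2))
```

Any real comparison would of course use the DE caller's actual gene lists rather than hand-picked sets.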

Edit: rephrasing and grammar

ADD REPLYlink 2.9 years ago
mrals89
• 20
0

This is an interesting comment and I'd like to understand it better. Are you using the word "regularization" in the statistical sense, for example as in "Tikhonov regularization"? If so, can you describe the statistical methods you have in mind? Also, how do you know that there are more duplicates per read for low-expressed genes?

ADD REPLYlink 2.5 years ago
eric.kern13
• 150
4

It really depends. I have seen alignments with so many identical reads at a certain position (5' and 3' ends matching exactly) that they were surely PCR duplicates (at other positions in the alignment there was no such coverage). However, you might lose information about strongly expressed genes this way. I would check for un/equal coverage for every gene/contig.
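
A crude screen along those lines, flagging positions whose depth dwarfs the gene's median depth, might look like this sketch (the fold threshold and the depth values are arbitrary):

```python
import statistics

def coverage_spikes(depths, fold=10):
    """Flag positions whose depth exceeds `fold` times the median depth,
    a crude screen for suspicious PCR pile-ups (threshold is arbitrary)."""
    med = statistics.median(depths)
    return [i for i, d in enumerate(depths) if med > 0 and d > fold * med]

# Mostly even coverage with one extreme pile-up at position 3:
depths = [12, 15, 11, 400, 14, 13, 12]
print(coverage_spikes(depths))  # -> [3]
```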

ADD COMMENTlink 7.3 years ago Biomonika (Noolean) 3.1k
4

If you have paired-end reads, I definitely think you should remove duplicates (alignments that start at the same locations for both read 1 and read 2). These are very unlikely to occur by chance because of the variation in fragment size.

If you have a small amount of RNA going into the experiment, you will have run a lot of PCR cycles before sequencing and the representation of some fragments will have become very biased. Duplicate removal is a way to mitigate this effect although it will not solve it.

It continues to baffle me why people keep saying that you have to keep duplicates because you will lose information - but apparently it's perfectly fine to get grossly distorted read counts because of amplification artifacts! There is no way to avoid bias completely.
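
A minimal sketch of this fragment-based duplicate removal is below; the coordinates are invented and real tools (e.g. Picard MarkDuplicates) operate on BAM records, but the signature idea is the same:

```python
def dedup_fragments(fragments):
    """Keep one read pair per unique fragment signature
    (chrom, strand, read-1 start, read-2 start); pairs sharing all
    four coordinates are treated as PCR duplicates and collapsed."""
    seen = set()
    kept = []
    for frag in fragments:
        if frag not in seen:
            seen.add(frag)
            kept.append(frag)
    return kept

# Three invented pairs; the second is a coordinate-identical duplicate:
pairs = [
    ("chr1", "+", 100, 350),
    ("chr1", "+", 100, 350),  # same fragment ends -> flagged as duplicate
    ("chr1", "+", 100, 360),  # different read-2 start -> kept
]
print(len(dedup_fragments(pairs)))  # -> 2
```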

ADD COMMENTlink 7.3 years ago Mikael Huss 4.6k
7

The problem is that your first premise, that "These are very unlikely to occur by chance because of the variation in fragment size", is true for DNA but NOT necessarily true for RNA. If you have a short RNA (e.g., a small transcript, miRNA, etc.) that is very highly expressed, there might be many, many legitimate duplicate copies of that RNA with exactly the same fragment size/position. The size of that transcript might even be such that effectively no fragmentation occurs. Also, fragmentation may not occur in a totally random fashion across the transcript. By removing those duplicates you may be introducing a new source of "grossly distorted read counts". As you say, there is no way to avoid bias completely. But, depending on the complexity and quality of your library, you may be better off living with a small amount of amplification artifacts.

ADD REPLYlink 7.3 years ago
Obi Griffith
17k
0

OK, but for miRNA etc. you wouldn't do paired-end sequencing anyway; what I had in mind was the typical mRNA case. Anyway, your point is noted, and again underlines that one has to think about the biases, and not just blindly follow a rule such as "always remove duplicates" or "never remove duplicates".

ADD REPLYlink 7.3 years ago
Mikael Huss
4.6k
4

The eXpress pipeline (Berkeley) does this for you. It calculates a coverage distribution and then removes the "spikes" from that distribution. Because duplicates are expected in RNA-seq, removing them "blindly" is a bad idea. You should be smoothing, not removing.
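
For illustration only (this is not the actual eXpress algorithm), the "smooth, don't remove" idea can be sketched as capping per-position counts at a multiple of the median, trimming spikes while leaving normal duplication untouched:

```python
import statistics

def cap_spikes(counts, fold=5):
    """Cap per-position read counts at `fold` times the median count,
    trimming outlier spikes instead of deleting duplicate reads."""
    ceiling = fold * statistics.median(counts)
    return [min(c, ceiling) for c in counts]

counts = [10, 12, 9, 300, 11, 10]
print(cap_spikes(counts))  # only the spike at position 3 is reduced
```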

ADD COMMENTlink 6.2 years ago earonesty • 220
0

How much of an issue are read duplicates in RNA-seq experiments, both single-fragment and PE sequenced? Do they affect a large proportion of the data? What are the duplicate rates like for an average human sample experiment? Are they on the order of 1% duplicates? Or 10% duplicates?

ADD REPLYlink 4.6 years ago
14134125465346445
♦ 3.4k
0

The rate of natural duplication of the reads coming from a transcript will be a function of the length of the transcript, the expression level of the transcript, and the overall coverage of the sample. Hence there is no rule that you could use a priori to separate natural duplicates (which should be kept) from artificial duplicates (which should be removed).
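
That intuition can be made concrete with a simple occupancy model: if reads start uniformly at random among a transcript's possible 5' positions, the expected "duplicate" fraction follows from the expected number of unique start positions. A hedged sketch (uniform coverage is of course an idealization, and the numbers are invented):

```python
def expected_duplicate_fraction(n_reads, n_positions):
    """Expected fraction of reads flagged as 'duplicates' when n_reads
    start uniformly at random among n_positions possible 5' positions
    (a simplified occupancy model; real coverage is not uniform)."""
    unique = n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)
    return 1.0 - unique / n_reads

# Short, highly expressed transcript: duplicates are nearly unavoidable.
print(round(expected_duplicate_fraction(10_000, 500), 2))   # -> 0.95
# Long, modestly expressed transcript: natural duplicates are rare.
print(round(expected_duplicate_fraction(1_000, 100_000), 3))
```

Even this toy model shows why a single genome-wide cutoff cannot separate natural from artificial duplicates: the expected rate swings from near zero to near one depending on transcript length and expression.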

ADD REPLYlink 4.6 years ago
Istvan Albert
80k
0
I have seen genome-wide duplication rates of up to 50% in our own data (50 bp SE) and think this is nothing unusual.
ADD REPLYlink 4.6 years ago
Christian
♦ 2.8k
3

Short answer - NO!

You are checking for differential expression of genes; the number of reads mapping to a gene is a measure of its expression. By removing reads, you will bias the true expression measurements.

ADD COMMENTlink 7.3 years ago VS • 710
2

I don't agree at all. By not removing duplicates, you bias the expression measurements towards PCR artifacts, especially from paired-end data.

ADD REPLYlink 7.3 years ago
Mikael Huss
4.6k
3

Well, I have not worked with paired-end data so far, so I can't speak to that. But I produced 'tag' files to track the number of copies of 'duplicate reads' for single-end seq. data. The very highly expressed genes tend to have more of such 'duplicate' reads. We also did validations with qPCR that confirm the higher expression. So I don't think you can say duplicate reads result purely from PCR amplification artifacts.
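
An in-memory version of such a 'tag' count, keyed on the single-end alignment coordinates (the example reads are invented), could be:

```python
from collections import Counter

def tag_counts(alignments):
    """Count copies of each identical single-end alignment,
    keyed on (chrom, strand, start): a minimal in-memory 'tag file'."""
    return Counter(alignments)

# Invented reads; a highly expressed locus yields identical alignments:
reads = [
    ("chr2", "+", 5000),
    ("chr2", "+", 5000),
    ("chr2", "+", 5000),
    ("chr2", "-", 7000),
]
print(tag_counts(reads)[("chr2", "+", 5000)])  # -> 3
```

Dumping such counts per gene is one way to get the dupl/all ratios by expression tier that are asked about below in the thread.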

ADD REPLYlink 7.3 years ago
VS
• 710
0

OK, that's an interesting finding actually. Sure, duplicate reads are not always from PCR amplification artifacts, but sometimes they are! Ideally you should probably decide on a case-by-case basis.

ADD REPLYlink 7.3 years ago
Mikael Huss
4.6k
0

But highly expressed genes will have a higher number of reads in general - I would be very interested in the ratios (dupl/all) for highly/medium/low expressed genes, if you could provide this information please.

ADD REPLYlink 6.4 years ago
Biomonika (Noolean)
3.1k
