How Much Coverage Do We Need for an RNA-Seq Experiment?
11.2 years ago

I often get this question from collaborators and PIs trying to plan their experiments and budgets. How much coverage is sufficient for an RNA-seq experiment?

One problem with this question is that it is difficult to define a single meaningful coverage value for RNA-seq. Any sample might have a different total amount of transcription, a different number of transcribed genes/transcripts, a different amount of transcriptome complexity (more or less alternative expression) and a different distribution of expression levels across those transcripts, not to mention common confounding factors like 3' end-bias. All of these factors effectively change the denominator of any overall coverage calculation. More useful metrics, in my opinion, are things like the total number of reads (and the percentage of those that map to the transcriptome) and the number of transcripts detected with at least X% of their junctions covered at least Y-fold. We usually target at least 10,000 transcripts with at least 50% of their junctions at 10-20x coverage. That is approximately what we currently get from a single HiSeq lane of 200-300M reads.
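
For what it's worth, here is a minimal sketch of that detection metric, assuming per-junction read counts have already been tabulated per transcript (the input format is illustrative, not from any particular tool):

```python
# Minimal sketch of the "transcripts detected" metric described above.
# Assumes read coverage per splice junction has already been tabulated and
# grouped by transcript (e.g., parsed from a spliced aligner's junction
# output); the input structure here is illustrative.

def transcripts_detected(junction_cov, min_cov=10, min_frac=0.5):
    """Count transcripts where at least `min_frac` of junctions
    have coverage >= `min_cov`.

    junction_cov: dict mapping transcript ID -> list of per-junction read counts.
    """
    detected = 0
    for tx, covs in junction_cov.items():
        if not covs:          # single-exon transcripts have no junctions; skip
            continue
        frac = sum(c >= min_cov for c in covs) / len(covs)
        if frac >= min_frac:
            detected += 1
    return detected

# Toy example: one transcript passes (2/3 junctions at >= 10x), one fails.
example = {"TX1": [15, 12, 3], "TX2": [2, 1, 0, 4]}
print(transcripts_detected(example))   # -> 1
```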

But how much coverage is sufficient? That is even harder to answer because it really depends on what you hope to accomplish. If you only need gene expression levels equivalent to, say, an Affymetrix gene expression array, then that single lane is probably more than sufficient. The same is true if you only want to validate variants in medium-to-highly expressed genes. But I would argue that if that is all you want, don't waste time and money on RNA-seq. What we hope to get from RNA-seq are the above two items plus the ability to confirm variants in lower-expressed genes, get good estimates of expressed VAFs, identify lowly or rarely expressed tumor-specific isoforms, show significant differences between alternative splicing patterns, etc. For all of these purposes, the one HiSeq lane described above is just enough to get us started, in my opinion. At present I think it is a good compromise between cost and benefit, but as sequencing prices go down we will want to increase depth, not decrease it.

We recently found a known promoter mutation (in TERT) in some tumors (hepatocellular carcinoma) we were studying. The mutation is predicted to increase binding of a transcription factor and has been shown to drive subtle but significant 2-4 fold increases in transcription. When we look at expression levels for this gene in RNA-seq data, we just barely detect it; in fact, the FPKM values would normally be considered in the noise range. A typical filter of FPKM > 1 in at least 20% of samples would eliminate this gene before we ever tested for a significant difference between normal/tumor or mutant/wild-type. This is a very important cancer gene, with a known mutation causing functional up-regulation, and it is almost undetectable at current depth levels if you don't already know to look for it! So I argue that more depth is still needed (cost permitting). I would love to hear other people's thoughts on this.
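
To illustrate, here is a sketch of that kind of filter with invented FPKM values, showing how a lowly but consistently expressed gene gets dropped before any tumor/normal test is ever run:

```python
import numpy as np

# Sketch of the expression filter described above (FPKM > 1 in at least 20%
# of samples). The FPKM values are invented to illustrate how a real but
# barely expressed gene (a TERT-like case) is discarded before any
# tumor/normal or mutant/wild-type test is performed.

fpkm = {
    "GAPDH":     np.array([1500.0, 1320.0, 1710.0, 1600.0, 1450.0]),
    "TERT_like": np.array([0.4, 0.9, 0.2, 0.8, 0.6]),   # near the noise floor
}

def passes_filter(values, min_fpkm=1.0, min_frac=0.2):
    return np.mean(values > min_fpkm) >= min_frac

for gene, values in fpkm.items():
    print(gene, "kept" if passes_filter(values) else "filtered out")
# GAPDH is kept; TERT_like is filtered out, even though a 2-4 fold shift
# between mutant and wild-type samples might be biologically important.
```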

rna-seq coverage • 78k views

One issue to keep in mind: some (many? most?) samples are dominated by reads from just one or a few genes. Since these genes take up a large fraction of your fixed supply of total reads, the rest of the genes get lower coverage as a result. For example, in blood samples, globin genes generally make up 50% or more of the total reads, leaving you with less than 50% of the reads for the rest of the transcriptome. Keep this in mind when deciding how many reads you need if your samples are known to be dominated by certain transcripts.
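
A quick way to check this in your own data is to look at the cumulative fraction of reads taken by the top few genes; a minimal sketch with invented counts:

```python
import numpy as np

# Cumulative fraction of reads consumed by the most highly expressed genes.
# The counts below are invented (a "globin-heavy" toy library); in practice
# they would come from your gene-level count matrix.

counts = np.array([10_000_000, 5_000_000, 3_000_000] + [1_000] * 15_000)
counts = np.sort(counts)[::-1]
cumfrac = np.cumsum(counts) / counts.sum()
for k in (1, 3, 10):
    print(f"top {k:>2d} genes: {cumfrac[k - 1]:.1%} of all reads")
# If a few genes already eat half the reads, the effective depth for the
# rest of the transcriptome is correspondingly lower.
```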

Indeed. This falls under the topic of "different distribution of expression levels for those transcripts". While the overall shape of the distribution does change somewhat from sample to sample, I think it is always the case that you will "lose" a large fraction of your reads to a relatively small number of highly expressed genes. It's probably the single biggest reason why we need so many reads to properly cover the transcriptome. This effect can be mitigated somewhat with cDNA capture or other strategies, but that is probably a topic for another post.

11.2 years ago
Huanwei Wang ▴ 270

Have you read the paper "Standards, Guidelines and Best Practices for RNA-Seq" published by ENCODE? It may be helpful for you. It includes the following guidance on sequencing depth:

The amount of sequencing needed for a given sample is determined by the goals of the experiment and the nature of the RNA sample. Experiments whose purpose is to evaluate the similarity between the transcriptional profiles of two polyA+ samples may require only modest depths of sequencing (e.g. 30M pair-end reads of length > 30NT, of which 20-25M are mappable to the genome or known transcriptome). Experiments whose purpose is discovery of novel transcribed elements and strong quantification of known transcript isoforms require more extensive sequencing. The ability to reliably detect low copy number transcripts/isoforms depends upon the depth of sequencing and on a sufficiently complex library. For experiments from a typical mammalian tissue, or in which sensitivity of detection is important, a minimum depth of 100-200M 2 x 76 bp or longer reads is currently recommended.

Specialized studies in which the prevalence of different RNAs has been intentionally altered (e.g. “normalizing” using DSN) as part of sample preparation need more than the read amounts (>30M paired-end reads) used for simple comparison (see above). Reasons for this include: (1) overamplification of inserts as a result of an additional round of PCR after DSN, and (2) much broader coverage given the nature of A(-) and low-abundance transcripts.
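
For rough orientation, those read counts can be converted into an approximate mean transcriptome coverage, with the big caveat (noted in the question) that the denominator is poorly defined; the ~100 Mb expressed-transcriptome size used below is purely an assumption:

```python
# Back-of-the-envelope conversion from the read counts quoted above to a mean
# transcriptome coverage. The expressed-transcriptome size (~100 Mb here) is a
# rough assumption and varies a lot between samples, which is exactly why a
# single "coverage" number is hard to pin down for RNA-seq. Coverage is also
# far from uniform, since it tracks expression level.

def mean_coverage(n_read_pairs, read_length, transcriptome_bases):
    return n_read_pairs * 2 * read_length / transcriptome_bases

print(mean_coverage(30e6, 76, 100e6))    # ~46x for the "modest" 30M PE recommendation
print(mean_coverage(150e6, 76, 100e6))   # ~228x for the 100-200M PE recommendation
```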

11.2 years ago
Chris Cole ▴ 800

As always, it depends.

You have very specific needs here: annotating the transcriptome for splice sites and confirming variants. Although I have to ask, why aren't you doing exome capture for the variant analysis? You wouldn't need half as many reads, and coverage wouldn't depend on expression level, which is probably why you need so many reads in the first place.

For quantitation of any RNA, depth is much less important than replicates. We always insist on a good number of replicates for any quantitative RNA-seq experiment: at least 5. No amount of depth will give you the confidence in the quantification that replicates will. This was learned years ago with microarrays but seems to have been forgotten with RNA-seq. This letter in Nature Biotech puts it quite succinctly: http://www.nature.com/nbt/journal/v29/n7/full/nbt.1910.html
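
To make the trade-off concrete, here is a small numeric illustration (not from the letter above) using the standard negative-binomial model for RNA-seq counts, where the variance of a count is mu + phi*mu^2 and phi is the biological dispersion; the dispersion value below is an assumption chosen only to show the shape of the effect:

```python
import numpy as np

# Why replicates beat extra depth for quantification: under the usual
# negative-binomial model, var(count) = mu + phi * mu**2, where phi is the
# biological (between-replicate) dispersion. Extra depth only shrinks the
# Poisson term, while extra replicates shrink the standard error of the mean
# across the board. Numbers here are illustrative assumptions.

phi = 0.05  # assumed biological dispersion for a well-behaved gene

def cv_of_mean(mu, n_reps, phi=phi):
    var_one = mu + phi * mu**2          # variance of a single replicate's count
    return np.sqrt(var_one / n_reps) / mu

for mu, n in [(50, 3), (500, 3), (5000, 3), (500, 6)]:
    print(f"mean count {mu:5d}, {n} replicates: CV of mean = {cv_of_mean(mu, n):.3f}")
# Going from 500 to 5000 reads per gene barely moves the CV (~0.13),
# while doubling replicates from 3 to 6 cuts it by roughly 30%.
```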

This recent preprint shows some results comparing depth versus replicate count: http://arxiv.org/abs/1301.5277

Depth is typically the least important aspect of our experiments. It's a process of diminishing returns: the more reads you have, the more you'll see, but the increase is not linear. This paper (I think; I don't have access at home) shows that even with 800M reads in mouse you're still not seeing everything: http://www.nature.com/nmeth/journal/v5/n7/full/nmeth.1226.html

I was talking about confirming, in the RNA-seq data, transcribed variants that were discovered by exome-seq or WGS, not using RNA-seq for primary variant discovery. Your points about replicates are excellent; I fully agree that we need to get back to the fundamentals we learned years ago with arrays. Although in some cases we are doing n-of-1 studies of real, live patients where replicates are not possible. In those cases we want lots of depth to make sure we don't miss anything.

Ah, right.

In that case I'd look at any public data that exists for your species/tissue and see what kind of depth you get in your genes of interest. Then dial up/down your depth relative to that data.
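
A minimal sketch of that kind of check, assuming an indexed, coordinate-sorted public BAM and using pysam; the file name and gene coordinates below are placeholders:

```python
import pysam

# Count reads over genes of interest in an existing public RNA-seq BAM, then
# scale your planned depth relative to that library's total mapped reads.
# The file name and coordinates are placeholders for illustration.

bam = pysam.AlignmentFile("public_liver_rnaseq.bam", "rb")

genes_of_interest = {
    "TERT": ("chr5", 1253000, 1296000),        # illustrative coordinates
    "MYC":  ("chr8", 128747000, 128755000),    # illustrative coordinates
}

total_mapped = sum(stat.mapped for stat in bam.get_index_statistics())
for gene, (chrom, start, end) in genes_of_interest.items():
    n = bam.count(chrom, start, end)
    print(f"{gene}: {n} reads over the locus "
          f"({n / total_mapped * 1e6:.1f} per million mapped reads)")
```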

11.2 years ago
Ido Tamir 5.2k

If you have pilot experiment data you could use Scotty (which I saw advertised here): website and paper. It specifically addresses the question of read depth versus biological replicates, which you don't seem to consider at all. They also have a nice blog detailing some of the points.

Nice. Another cool tool from the Marth lab.

Yes, please use Scotty!

I ended up making Scotty because, when I was trying to answer this question, I looked at different data sets that were supposedly measuring the same thing (human liver) and got answers that differed by an order of magnitude depending on which experiment the data came from. If you read the papers those data sets come from, one says X reads can replace a microarray while the other says something like 10X is not enough (because the authors were looking at very different data). These are fairly old data sets, because they were the only publicly available ones I could find where different labs did more or less the same thing. The protocols are more established now, but if you have tricky library preps, as you often do with clinical samples, you might not know until you see the data what sort of complexity you will get out of the library and how deeply you need to sequence.

I like rarefaction plots for measuring complexity, with the x-axis being reads sequenced and the y-axis being transcripts detected with at least 5 uniquely aligned reads.

But then, of course, you have to look at your duplication rate to make sure new reads are actually sequencing new molecules. It's possible to run out of molecules before you run out of reads, and then Scotty's statistical models break (as do almost everyone else's).
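
For illustration, here is a minimal sketch of such a rarefaction curve; the per-transcript counts are simulated, and binomial thinning stands in for real read subsampling:

```python
import numpy as np

# Sketch of the rarefaction plot described above: subsample the library in
# silico and count transcripts with at least 5 uniquely aligned reads at each
# depth. `full_counts` would normally hold unique-read counts per transcript
# from the complete library; here it is simulated for illustration.

rng = np.random.default_rng(0)
full_counts = rng.negative_binomial(n=0.3, p=0.3 / (0.3 + 20), size=20000)  # skewed toy library

def rarefaction(full_counts, fractions, min_reads=5):
    total = full_counts.sum()
    curve = []
    for f in fractions:
        sub = rng.binomial(full_counts, f)  # simple stand-in for true subsampling
        curve.append((int(f * total), int((sub >= min_reads).sum())))
    return curve

for reads, detected in rarefaction(full_counts, [0.1, 0.25, 0.5, 0.75, 1.0]):
    print(f"{reads:>10,d} reads sampled -> {detected:,d} transcripts with >= 5 unique reads")
# The curve flattening out (while the duplicate fraction climbs) is the sign
# that additional reads are re-sequencing the same molecules rather than new ones.
```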

Nice question!

11.2 years ago

15M reads sounds awfully low. At my lab we only do differential expression, just straight gene counts, and we always shoot for 30-50M reads.

We use ERCC spike-ins to get a handle on our accuracy at lower FPKM values.
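
As a sketch of what that spike-in check can look like (the concentrations and FPKM values below are invented placeholders; the real input concentrations come from the ERCC mix annotation and the FPKMs from your quantifier):

```python
import numpy as np

# Compare measured FPKM against known ERCC input concentrations to see where
# quantification breaks down at the low end. All values here are invented
# placeholders for illustration.

known_attomol = np.array([0.01, 0.1, 1, 10, 100, 1000, 10000], dtype=float)
measured_fpkm = np.array([0.0, 0.05, 0.6, 8.0, 95.0, 1100.0, 9800.0])

detected = measured_fpkm > 0
r = np.corrcoef(np.log10(known_attomol[detected]),
                np.log10(measured_fpkm[detected]))[0, 1]
print(f"log-log correlation over detected spike-ins: {r:.3f}")
print("lowest input concentration detected:",
      known_attomol[detected].min(), "attomoles/ul")
# Pushing that detection floor lower is essentially what extra depth buys you.
```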

11.1 years ago
Wrf ▴ 210

The Nature Methods paper mentioned above (from 2008) that suggested 700 million mouse reads was using 35bp reads.

For some perspective, the Trinity paper (http://www.ncbi.nlm.nih.gov/pubmed/21572440) maxed out at 106 million reads of 75bp for mouse cultured cells. The ENCODE data for the mouse organs have 4 replicates of about 70-80 million reads (75bp, paired end).

From my own experience, we get about 60 million reads (30M pairs, 100bp). We used to get 100M reads and did an experiment with subsets of them. We found that sequencing errors tend to cause trouble for the de novo assembly after about 70M reads (they start causing misassemblies with far fewer), and that beyond 70M we were not really adding more information, only noise.
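
A minimal sketch of how such read subsets can be generated by random thinning of a FASTQ (file names are placeholders):

```python
import gzip
import random

# Randomly keep a fraction of reads from a gzipped FASTQ to build subsets of
# a larger library. File names are placeholders. For paired-end data, run the
# same function with the same seed on both mate files so pairing is preserved.

def subsample_fastq(path_in, path_out, fraction, seed=42):
    rng = random.Random(seed)
    with gzip.open(path_in, "rt") as fin, gzip.open(path_out, "wt") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # one FASTQ record
            if not record[0]:
                break
            if rng.random() < fraction:
                fout.writelines(record)

# e.g. a 70M-read subset from a 100M-read library:
# subsample_fastq("reads_R1.fastq.gz", "reads_R1.70pct.fastq.gz", 0.7)
```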

If you are not assembling but rather mapping reads onto a genome, you could probably get away with more (100M+), since the mappers are usually set to require something like 97% identity. I guess it also depends on how much money you have for sequencing.

11.2 years ago
Ryan Thompson ★ 3.6k

Here's the simple answer that I've heard from at least 2 independent sources: in order to match microarrays in terms of power to detect differential expression, you need somewhere around 10 to 15 million reads per human RNA sample.

Edit:

Okay, here's my longer answer. Two groups that I trust have independently set out to answer the question "How deeply do we have to sequence to replace microarrays for differential gene expression with no loss in statistical power?" They both came up with the same answer: 10-15 million reads per sample. One group is in industry and won't be publishing the result, and the other has a paper in review (I'll try to remember to link to it when it is published). This figure assumes that all you want is to use RNA-seq as a direct replacement for human Affy expression arrays for the purpose of testing differential expression at the gene level: no alternative splicing, per-exon expression, etc.

Also, remember that for doing differential anything, more biological replicates always help more than additional sequencing depth.

Very interesting. I can definitely imagine that for most transcripts you don't need 200-300 million reads, but I wouldn't feel totally comfortable with less than 50 million; there is sometimes a large range in the percentage of reads that map successfully. Mostly, though, that's just a feeling. Looking forward to your elaboration.

You are speaking about human RNA-seq only, right? Thanks.

I can't speak for others, but I am only talking about human RNA-seq. I imagine the same points will also be relevant for mouse, etc., but if you are studying a model organism with a significantly smaller genome or simpler transcriptome, then a lot of these calculations will certainly change.
