Question

RNA-seq dispersion estimation

5.1 years ago

jjv2124 ▴ 10

Hi everyone,

We did two time-series experiments with mixed microbial communities to figure out how expression changes as a previously major respiration shift happens. We are an environmental engineering lab without prior RNA-seq experience, and we didn't design our experiment properly. Specifically, we made the common mistake of having few or no biological replicates. I found a lot of posts about RNA-seq experiments with no replicates, which led us to choose edgeR for the analysis, and we looked at its authors' suggestions for dealing with no replicates. I was wondering if anyone had any insight on the best approach to dispersion estimation.

We have two samples from the same mixed community at T=0. At that point, the two experiments were the same, so I believe they can be regarded as true biological replicates. None of the other time points really have replicates. We are considering the following for our analysis:

1) Estimate dispersion based on the two T=0 samples.

2) Identify a set of 'housekeeping' genes in the community and estimate biological dispersion from that across all of our samples.

The benefit of the first approach is that all the transcripts would be included in the dispersion estimate. However, it is only two replicates. The second approach benefits from having many 'replicates' (16 samples in total), but not all the transcripts are included, and we worry that our identification of housekeeping genes would have to hinge on transcripts with low count variation across the samples, and would thus be biased towards an underestimate of dispersion.
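To make option 1 concrete, here is a rough sketch of what a dispersion estimate from the two T=0 samples amounts to. It assumes the usual negative binomial relation Var = mu + phi * mu^2 and pools a simple method-of-moments estimate over genes. This is not what edgeR does internally (edgeR uses likelihood-based estimation with empirical Bayes moderation); the function name `moment_dispersion`, the `min_mean` filter, and the counts are all made up for illustration, and the two libraries are assumed to have comparable sequencing depth.

```python
from statistics import mean, variance


def moment_dispersion(rep_a, rep_b, min_mean=10.0):
    """Crude pooled method-of-moments estimate of a common NB dispersion.

    Assumes Var = mu + phi * mu**2 and two replicates of comparable depth.
    Genes with a mean count below `min_mean` are skipped (arbitrary cutoff).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(rep_a, rep_b):
        mu = mean((a, b))
        if mu < min_mean:
            continue
        var = variance((a, b))  # sample variance, only 1 degree of freedom
        numerator += var - mu   # variance in excess of Poisson noise
        denominator += mu * mu
    if denominator == 0.0:
        raise ValueError("no genes passed the mean-count filter")
    # Clamp at zero: with so little data the raw estimate can go negative.
    return max(numerator / denominator, 0.0)


# Hypothetical counts for six genes in the two T=0 samples.
t0_a = [100, 250, 40, 900, 15, 3]
t0_b = [120, 200, 60, 1000, 25, 5]

phi = moment_dispersion(t0_a, t0_b)
bcv = phi ** 0.5  # biological coefficient of variation
print(f"common dispersion ~ {phi:.4f}, BCV ~ {bcv:.3f}")
```

Each gene contributes a single degree of freedom here, so the per-gene values are very noisy; only the pooled (common) estimate is even roughly usable, which is the main limitation of relying on two replicates.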

What do y'all think of our options? Would two replicates be enough to get an estimate? I understand that the reliability of the analysis will be impacted by not having enough replicates for all time points.

RNA-Seq metatranscriptomics dispersion • 1.4k views

updated 5.1 years ago by Charles Warden 8.2k • written 5.1 years ago by jjv2124 ▴ 10

score 0 · Answer 1 · 2019-03-20

In general, I would try to test DESeq2 / edgeR / limma-voom for every project. All three of those programs will work with time-series data (in addition to regular linear regression on log-transformed expression).

Strictly speaking, I think the modeling really should be done per-gene, and having a limited number of replicates probably has some effect on the accuracy of the dispersion estimates. However, I think testing different freely available programs can help when no individual program can be used in all situations. Copying part of my earlier answer to another question:

While hard to define precisely, I would do things like i) compare the size of gene lists, ii) visually inspect a heatmap of the differentially expressed genes (to check things like clustering of replicates), iii) see if functional enrichment can inform the use of upstream strategies (such as the p-value method, differential comparison strategy or strategies), and iv) (if available) check the status of genes that are known to change between the conditions that you are comparing.

So, I wouldn't recommend normalizing based upon housekeeping genes, but visualizing some candidate genes (or genes that are known to vary between your conditions) may be helpful.
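As a rough illustration of point i) above, comparing gene lists from the different programs could look something like the sketch below. The gene IDs and the `compare_gene_lists` helper are hypothetical; in practice the sets would be filled with whatever passes your significance cutoff in each tool's results table.

```python
from itertools import combinations


def compare_gene_lists(lists):
    """Return list sizes, pairwise Jaccard overlaps, and genes shared by all tools."""
    sizes = {name: len(genes) for name, genes in lists.items()}
    overlaps = {}
    for (name_1, genes_1), (name_2, genes_2) in combinations(lists.items(), 2):
        union = genes_1 | genes_2
        overlaps[(name_1, name_2)] = len(genes_1 & genes_2) / len(union) if union else 0.0
    shared = set.intersection(*lists.values())
    return sizes, overlaps, shared


# Hypothetical differentially expressed gene sets, one per program.
de_lists = {
    "DESeq2": {"geneA", "geneB", "geneC"},
    "edgeR": {"geneB", "geneC", "geneD"},
    "limma-voom": {"geneC", "geneD"},
}

sizes, overlaps, shared = compare_gene_lists(de_lists)
for pair, jaccard in overlaps.items():
    print(pair, round(jaccard, 2))
print("sizes:", sizes)
print("called by all tools:", sorted(shared))
```

Genes called by all three programs are probably the safest ones to follow up on when replication is this limited, while large disagreements in list size are a hint that the dispersion estimates are driving the results.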