Question

RNA-Seq: using GLMM to detect differentially expressed genes



9.2 years ago

alesssia ▴ 580

Hi All.

I have a set of raw count data and I am interested in using a (G)LMM to detect differentially expressed genes. However, I have a number of questions about how to set up the best (correct?) pipeline for this task.

1. I am aware that using linear models (instead of well-known tools such as DESeq2) will give me less power unless I have a large set of samples. I know that this is a dumb question, but how many samples count as "large"?
2. To get meaningful results I believe that filtering and normalisation steps are needed beforehand. Is this assumption correct? What is a reliable approach to filter/normalise my data?
3. Might it be useful to work with transformed versions of the count data?
4. I usually use LMMs (lme4 R package) when looking for differentially expressed genes in microarray data: I work with multiplex family data and I want to correct for the samples' relatedness. However, when working with RNA-Seq counts, is it better to use zero-inflated Poisson models? Or can I assume that there is only an overdispersion problem? Can the answer to this question be data-dependent?

Thanks in advance for your help,

Alessia

GLMM differential-expression RNA-Seq • 3.3k views

updated 2.0 years ago by Ram 43k • written 9.2 years ago by alesssia ▴ 580

score 2 · Answer 1 · 2015-02-17

1. I suspect that Gordon Smyth has given a recommendation on this somewhere, though I haven't ever come across it. My gut says you should have a hundred samples or so, but take that with a large grain of salt in the absence of empirical data. I should note that you'll always have lower power without sharing information across genes; it's just a question of how much you've lost. Of course, the more complicated the model, the more samples you'd really need.
2. Normalization yes, filtering no. Well, filtering beyond removing rows with 0 counts (or anything else that would break the (G)LMM function) isn't necessary. You'll need to perform a library-size normalization. The most straightforward way to do this is to first run DESeq or DESeq2 and extract the resulting size factors with sizeFactors(). These can then be used as weights in your GLMM. You can perform independent filtering after the fact, once you have raw p-values. The genefilter package is convenient for this.
3. Possibly. If you run everything through limma::voom() first, then you'd have data in a nice format for a more traditional LMM.
4. I've not seen much gain from zero-inflated models over "simple" negative binomial models. There are a couple of papers out there comparing negative binomial, zero-inflated negative binomial, and zero-inflated Poisson models if you want some hard numbers on this.
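
To make the library-size normalization point more concrete, here is a minimal sketch of the median-of-ratios idea behind DESeq-style size factors. It is written in plain Python purely for illustration and is not the DESeq2 implementation; the function name `size_factors` and the toy count matrix are hypothetical. The last line turns the factors into log offsets, which is one common way to pass library size to a count model and is an assumption of this sketch rather than something prescribed above.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch).

    counts: list of genes, each a list of raw counts, one per sample.
    Returns one size factor per sample.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        # A gene with a zero in any sample has no finite log geometric
        # mean, so it is skipped when estimating the factors.
        if any(c <= 0 for c in gene):
            continue
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("every gene has at least one zero count")
    # Per sample: median over genes of the log ratio to the gene's
    # geometric mean, transformed back to the count scale.
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical toy data: sample 2 is sequenced twice as deeply as
# sample 1, and sample 3 four times as deeply.
counts = [
    [10, 20, 40],
    [100, 200, 400],
    [5, 10, 20],
    [0, 3, 7],  # contains a zero, ignored for factor estimation
]
factors = size_factors(counts)
offsets = [math.log(f) for f in factors]  # log scale, e.g. for a model offset
```

In practice you would take the factors from DESeq2 itself rather than recompute them; the sketch is only meant to show what they represent: a per-sample scaling that puts the libraries on a comparable depth.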