Biostar Beta. Not for public use.
Best Way To Quantify Influence Of Different Covariates On Gene Expression?
6
Entering edit mode
2.9 years ago
Stockholm

I am looking for input on ways to put numbers on how much different factors/covariates contribute to gene expression patterns. For example, let's say we have RNA-seq (or microarray) data from male and female mice that have received two different treatments (or no treatment) and the samples have been prepped in two different batches. Now we would like a bar plot (or similar) of how much the different covariates (1) sex, (2) treatment and (3) batch effects contribute, respectively, to the gene expression.

Some ideas:

  1. Use PCA on the expression matrix and calculate the correlation between PC1 scores and the covariate vectors.
  2. Use some machine learning algorithm to try to discriminate between male/female (or treatments, or batches), and evaluate by cross-validation how well this is working. Use the prediction performance as a measure of how strongly the covariate is reflected in the expression profile.
  3. Is there a way to find out this sort of summarized importance from the linear models fitted as part of limma/edgeR/DESeq workflows?

I'd be grateful for any thoughts.

rna-seq microarray • 7.3k views
ADD COMMENTlink
0
Entering edit mode

Do you want this for each gene or some sort of summarized metric for the whole experiment (I'm guessing the latter from your mention of just using PCA)?

ADD REPLYlink
0
Entering edit mode

Yes, a summarized metric for the whole experiment.

ADD REPLYlink
11
Entering edit mode
12 months ago
Irsan ♦ 6.9k
Amsterdam

As a first step, it might be useful to remove genes with count 0 in more than 90% of your samples. Then obtain normalized and log-transformed expression estimates (logFPKM; maybe edgeR and voom() can be of use?) for all your samples and genes. I assume you already have obtained them somehow. Then prepare your data in such a way that you have a data frame where a row represents 1 measurement and the columns geneId, sex, treatment, batch, logFpkm (maybe aggregate() or melt() functions can be useful to reshape your data). Then perform anova and look at the sums of squares and the p-values of the F-test. Most likely, geneId has the biggest influence on your logFpkms and hopefully treatment is a good second.

fit <- lm(logFpkm ~  geneId + sex + treatment + batch,data=yourData)
anova(fit)

When you have many samples and many genes you can also take a random sample of your genes (say 1000) to get an idea and then scale up to bigger sample sizes.

Edit: Brentp, you are right. Its maybe better to exclude the geneId from the linear model

ADD COMMENTlink
0
Entering edit mode

you mean to have a single table with n_samples * n_genes and fit one model to that?

ADD REPLYlink
0
Entering edit mode

so take an expression matrix, melt it, add covariate info and fit model / do anova. Maybe there are better alternatives but it comes into my mind. I just tried it on one of my datasets and it said that geneId is by far the biggest source of variance, then tissue, then patientGroup, then batch which makes sense for my dataset.

ADD REPLYlink
0
Entering edit mode

Thanks for the tip. I had been thinking along these lines but hadn't hatched the idea about using geneId as a feature. I'll definitely try it.

ADD REPLYlink
0
Entering edit mode

you can also leave it out of the model of course. Its not surprising that it is the biggest source of variation.

ADD REPLYlink
0
Entering edit mode

Would you please explain more about how to prepare data?

ADD REPLYlink
2
Entering edit mode
12 months ago
brentp 23k
Salt Lake City, UT

This is a good question, in ecology, for example, one might have a single set of data and they will scrutinize a model for years and find the "best" one. In genomics, we choose one model and apply it to N (million) probes / measurements / genes / CpG's / SNPs etc.

You could look at the distribution of p-values, either of your covariate of interest or of the potential confounder with and without the confounder. You can also look at the distribution of covariate estimates. If you see genomic inflation without correction for some important covariate, then that might be a good thing to check.

You could also do something like SVA or PEER on a simple model and see how the inferred surrogate variables correlate with the covariates that you did not include in the model (this is similar to your PCA idea).

Of course, it could be that different covariates only have an effect at certain sites in the genome--we know this will be true in the case of, gender and the sex chromosomes. For other effects, it will be more subtle. In the end, you will have to rely on biological insight to a large extent.

ADD COMMENTlink
0
Entering edit mode

The SVA idea is interesting - I have been playing around with this package recently but hadn't thought to apply it to this question.

ADD REPLYlink
1
Entering edit mode
9 months ago
Duarte, CA

I would check the signal distributions: if the male and female samples look very different (for example, I have seen this be the case for germ-line samples), I would just analyze them separately.

The most direct answer to your question is that you can use a 2-way or 3-way ANOVA to factor out the influence of covariates. It's not super rigorous, but I think you can use something like the average sum of squares to give an idea about the relative contribution. For example, I know Partek makes a sources of variation box-plot like you described, and I think that is how they do it.

DESeq, edgeR, etc. also have workflows for multivariate analysis (I believe using a generalized linear model in both of those cases).

ADD COMMENTlink
0
Entering edit mode

Thanks for your comments. I have used DESeq, edgeR and limma to look at these data, but I am not sure what is the best way to use information about the fitted coefficients to something in general about the covariates. I should say that in my case the covariates are not in fact sex, treatment and batch but three other things; those were just an example.

ADD REPLYlink
1
Entering edit mode
10 months ago
chris86 • 290
United Kingdom, London

Since quite a few people are looking at this thread I should inform that there is a package called PVCA (principle variance component analysis) which features in a lot of good papers to determine the effect of covariates, especially batch effects on the data. It fits a mixed model to principle components to determine the affects of a number of defined variables. I have found this approach really helpful in my various analyses.

https://www.bioconductor.org/packages/release/bioc/html/pvca.html

ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1