you mean to have a single table with n_samples * n_genes and fit one model to that?

Entering edit mode

I am looking for input on ways to put numbers on how much different factors/covariates contribute to gene expression patterns. For example, let's say we have RNA-seq (or microarray) data from male and female mice that have received two different treatments (or no treatment) and the samples have been prepped in two different batches. Now we would like a bar plot (or similar) of how much the different covariates (1) sex, (2) treatment and (3) batch effects contribute, respectively, to the gene expression.

Some ideas:

- Use PCA on the expression matrix and calculate the correlation between PC1 scores and the covariate vectors.
- Use some machine learning algorithm to try to discriminate between male/female (or treatments, or batches), and evaluate by cross-validation how well this is working. Use the prediction performance as a measure of how strongly the covariate is reflected in the expression profile.
- Is there a way to find out this sort of summarized importance from the linear models fitted as part of limma/edgeR/DESeq workflows?

I'd be grateful for any thoughts.

Entering edit mode

As a first step, it might be useful to remove genes with count 0 in more than 90% of your samples. Then obtain normalized and log-transformed expression estimates (logFPKM; maybe edgeR and voom() can be of use?) for all your samples and genes. I assume you already have obtained them somehow. Then prepare your data in such a way that you have a data frame where a row represents 1 measurement and the columns geneId, sex, treatment, batch, logFpkm (maybe aggregate() or melt() functions can be useful to reshape your data). Then perform anova and look at the sums of squares and the p-values of the F-test. Most likely, geneId has the biggest influence on your logFpkms and hopefully treatment is a good second.

```
fit <- lm(logFpkm ~ geneId + sex + treatment + batch,data=yourData)
anova(fit)
```

When you have many samples and many genes you can also take a random sample of your genes (say 1000) to get an idea and then scale up to bigger sample sizes.

Edit: Brentp, you are right. Its maybe better to exclude the geneId from the linear model

Entering edit mode

0

you mean to have a single table with n_samples * n_genes and fit one model to that?

Entering edit mode

0

so take an expression matrix, melt it, add covariate info and fit model / do anova. Maybe there are better alternatives but it comes into my mind. I just tried it on one of my datasets and it said that geneId is by far the biggest source of variance, then tissue, then patientGroup, then batch which makes sense for my dataset.

Entering edit mode

This is a good question, in ecology, for example, one might have a single set of data and they will scrutinize a model for years and find the "best" one. In genomics, we choose one model and apply it to N (million) probes / measurements / genes / CpG's / SNPs etc.

You could look at the distribution of p-values, either of your covariate of interest or of the potential confounder with and without the confounder. You can also look at the distribution of covariate estimates. If you see genomic inflation without correction for some important covariate, then that might be a good thing to check.

You could also do something like SVA or PEER on a simple model and see how the inferred surrogate variables correlate with the covariates that you did not include in the model (this is similar to your PCA idea).

Of course, it could be that different covariates only have an effect at certain sites in the genome--we know this will be true in the case of, gender and the sex chromosomes. For other effects, it will be more subtle. In the end, you will have to rely on biological insight to a large extent.

Entering edit mode

I would check the signal distributions: if the male and female samples look very different (for example, I have seen this be the case for germ-line samples), I would just analyze them separately.

The most direct answer to your question is that you can use a 2-way or 3-way ANOVA to factor out the influence of covariates. It's not super rigorous, but I think you can use something like the average sum of squares to give an idea about the relative contribution. For example, I know Partek makes a sources of variation box-plot like you described, and I think that is how they do it.

DESeq, edgeR, etc. also have workflows for multivariate analysis (I believe using a generalized linear model in both of those cases).

Entering edit mode

0

Thanks for your comments. I have used DESeq, edgeR and limma to look at these data, but I am not sure what is the best way to use information about the fitted coefficients to something in general about the covariates. I should say that in my case the covariates are not in fact sex, treatment and batch but three other things; those were just an example.

Entering edit mode

Since quite a few people are looking at this thread I should inform that there is a package called PVCA (principle variance component analysis) which features in a lot of good papers to determine the effect of covariates, especially batch effects on the data. It fits a mixed model to principle components to determine the affects of a number of defined variables. I have found this approach really helpful in my various analyses.

https://www.bioconductor.org/packages/release/bioc/html/pvca.html

Loading Similar Posts

Do you want this for each gene or some sort of summarized metric for the whole experiment (I'm guessing the latter from your mention of just using PCA)?

Yes, a summarized metric for the whole experiment.