Question

Detecting highly variable genes within replicates

3

Entering edit mode

5.1 years ago

Alternative ▴ 270

Hello,

We are looking for a method to detect/extract genes that show variability within biological replicates (typically 3 replicates) of the same condition.

Differential expression algorithms (DESeq2, EdgeR) will consider only genes (or group of genes) that show similarity in their expression profiles between biological replicates. In our case, we are interested in genes that are significantly variable within replicates. Is this information available in DESeq2/EdgeR objects? If not, what would be the best approach for that?

Any though or hint will help,

Best and thanks

Gene expression variability • 4.7k views

ADD COMMENT • link updated 5.1 years ago by Michele Busby ★ 2.2k • written 5.1 years ago by Alternative ▴ 270

5

Entering edit mode

5.1 years ago

Friederike 8.9k

You can calculate the variance (or any other measure of variability) fairly easily in R, e.g. with the rowVars or rowWeightedVars, e.g. :

# normalize and rlog-transform the data to get rid of most of the technical/uninteresting variation
rlog_ds <- rlog(DESeq.ds)
# calculate variance (or whatever measure you prefer) per gene
rv <- rowVars(assay(rlog_ds))
# sort, so that the most variable genes will be on top of the object
rv <- order(rv, decreasing = TRUE)

You may want to do subset by condition to find genes that seem to be least reproducible between replicates.

EDIT: updated to include the rlog-transformation following ATPoint's suggestion

ADD COMMENT • link 5.1 years ago by Friederike 8.9k

1

Entering edit mode

Adding on this (not sure if you did this), it might be advisable to normalize your data prior to calculating variance. See Calculating significant variance of gene expression across multiple samples for a code example pretty much the same as Friederike showed but with vst-transformed data.

ADD REPLY • link 5.1 years ago by ATpoint 81k

0

Entering edit mode

Can a vst transformation be used insted of rlog ?

ADD REPLY • link 4.4 years ago by pawan12394 • 0

1

Entering edit mode

yes, you can (also see the documentation of either ?DESeq2::rlog or ?DESeq2::vst

ADD REPLY • link 4.3 years ago by Friederike 8.9k

1

Entering edit mode

In many cases you will be forced to use vst anyway as rlog is super slow for many samples. Once you have cohorts of samples, like dozens or hundreds or more, rlog is basically unusable as it would take ages. I think that vst is currently the recommendation by the DESeq2 maintainer as a default choice unless there is a specific reason against it. Cannot find the website with the reference for this sentence though.

ADD REPLY • link 4.3 years ago by ATpoint 81k

1

Entering edit mode

5.1 years ago

JC 13k

After you compute the dispersion in your samples, you can extract that value with disp <- getDispersion(y), check the plotBCV code to see how is used to create dispersion plots.

ADD COMMENT • link 5.1 years ago by JC 13k

score 7 · Accepted Answer · 2019-02-25

This is a good question and you were wise to ask it because I don't think the obvious approach is the correct one.

For RNA Seq, the observed variance that you get from calculating the variance in the replicates will have a mixture of biological + technical variance.

If the libraries are good, the total variance you observe in biological replicates will be as follows:

Total variance=Biological variance + library prep and sequencing variance + Poisson variance (shot noise, counting noise) from sampling the molecules

To isolate the biological variance I would:

1) Normalize the data

2) Calculate the total variance as the variance of the normalized counts across replicates

3) ignore the library prep and sequencing variance by assuming it is zero (which is usually OK, you can also estimate it from technical replicates if you have them)

4) Estimate the poisson variance as the mean of the expression. This is because the variance of a Poisson distribution is the mean of that distribution.

5)Therefore, the biological variance of the gene can be calculated as:

Total variance=Biological variance + [zero for technical]+the observed mean expression [for Poisson]

Biological variance = Total variance - the observed mean expression

6) Then you can sort by the isolated biological variance. If you don't do this you will see a lot more variance in low expression genes, and low expression genes will appear to be more variable* just because the noise is higher (i.e. 1 vs 2 is more likely to vary as an artifact of sampling noise than 100 vs 200, which is certainly not counting noise).

This blog explains the ideas better: http://michelebusby.tumblr.com/post/26913184737/thinking-about-designing-rna-seq-experiments-to

The supplement here goes into depth on all the formulas: https://academic.oup.com/bioinformatics/article/29/5/656/252753

You don't need to read the paper (trust me, I wrote it), just the first section of the supplement. Supplement 1. Sources of variance in expression measurements

Here is a toy example in python. You can also do it in Excel (don't at me):

import numpy as np
measurements=[5, 10, 18]
v=np.var(measurements)
print("Observed variance:", v)

m=np.mean(measurements)
print("Mean expression:", m)

poissonVariance=m

biologicalVariance=v-poissonVariance
print("Isolated Biological Variance:", biologicalVariance)

Observed variance: 28.666666666666668

Mean expression: 11.0

Isolated Biological Variance: 17.666666666666668

*Itay Tirosh has some old papers in yeast looking at variability in expression and what it is correlated with but I can't remember all the details.