Question

logFoldChange computed by EdgeR different from naive log2FCcomputed based on reads

1

Entering edit mode

3.0 years ago

ZheFrench ▴ 570

For example, for one gene:

logFC edgeR = 4

log2(mean reads condition A / mean reads condtition B ) = 3

For all genes, all the logFC computed from edgeR seems to be different ~ +1 from naive computation based on reads log2(mean Reads samples cond A / means Reads samples cond B).

It seems a huge difference, quite difficult to explain to a lambda user. Also curious to understand what could be the factor responsible for this kind of offset.

How can you explain that with edge R and glmfit ?

I'm doing something classical :

cm     <- makeContrasts(contrasts = comp,levels = dge$samples$group)

dge    <- estimateDisp(dge, de.design,robust=T)

fit.y  <- glmFit(dge, de.design)
lrt    <- glmLRT(fit.y,contrast = cm)

edgeR fold-change • 745 views

ADD COMMENT • link updated 3 months ago by Gordon Smyth ★ 7.1k • written 3.0 years ago by ZheFrench ▴ 570

score 0 · Answer 1 · 2024-02-02

The short answer is that your naive computation is incorrect because it doesn't take into account different sequencing depths / library sizes between the different samples.

It isn't generally true that the edgeR logFC are consistently different from the naive computution by a constant offset, that's just an accident of your data or of how you did the calculation.

edgeR estimates the logFC by a negative binomial generalized linear model, which is described in detail in the published papers. The model takes into account the counts, the effective library sizes and dispersions and also applies some logFC shrinkage. In the simplest case of a oneway layout, constant effective library sizes and no shrinkage, the logFC returned by glmFit would be exactly the same as the log2 ratio of the mean counts. In general, though, you cannot reproduce the edgeR calculation by a simple formula.

Here is a toy example where the edgeR logFC agrees exactly with the naive formula. In this example, I have set all the library sizes to be the same (equal to 1 million). I have also set prior.count=0 so as to turn off the logFC shrinkage:

> group <- factor(c("A","A","B","B"))
> design <- model.matrix(~group)
> counts <- matrix(c(1,1,8,8),1,4)
> fit <- glmFit(counts,design,dispersion=0,prior.count=0,lib.size=rep(1e6,4))
> lrt <- glmLRT(fit)
> topTags(lrt)
Coefficient:  groupB 
  logFC   logCPM       LR       PValue          FDR
1     3 2.700434 12.39534 0.0004304059 0.0004304059

Here logFC = log2(8/1) = 3. In general, though, the library sizes are very unlikely to be identical for every sample, in which case the naive formula really is too naive and edgeR will give a different result.