Question

Design for GSVA - gene set enrichment analysis

2

Entering edit mode

7.3 years ago

mforde84 ★ 1.4k

I'm a little confused by some of the documentation for R package GSVA.

Here's the sample analysis provided in the documentation:

library(GSVA)
library(limma)
p <- 10 ## number of genes
n <- 30 ## number of samples
nGrp1 <- 15 ## number of samples in group 1
nGrp2 <- n - nGrp1 ## number of samples in group 2
## consider three disjoint gene sets
geneSets <- list(set1=paste("g", 1:3, sep=""),
set2=paste("g", 4:6, sep=""),
set3=paste("g", 7:10, sep=""))
## sample data from a normal distribution with mean 0 and st.dev. 1
y <- matrix(rnorm(n*p), nrow=p, ncol=n,
dimnames=list(paste("g", 1:p, sep="") , paste("s", 1:n, sep="")))
## genes in set1 are expressed at higher levels in the last 10 samples
y[geneSets$set1, (nGrp1+1):n] <- y[geneSets$set1, (nGrp1+1):n] + 2
## build design matrix
design <- cbind(sampleGroup1=1, sampleGroup2vs1=c(rep(0, nGrp1), rep(1, nGrp2)))
## fit linear model
fit <- lmFit(y, design)
## estimate moderated t-statistics
fit <- eBayes(fit)
## genes in set1 are differentially expressed
topTable(fit, coef="sampleGroup2vs1")
## estimate GSVA enrichment scores for the three sets
gsva_es <- gsva(y, geneSets, mx.diff=1)$es.obs
## fit the same linear model now to the GSVA enrichment scores
fit <- lmFit(gsva_es, design)
## estimate moderated t-statistics
fit <- eBayes(fit)
## set1 is differentially expressed
topTable(fit, coef="sampleGroup2vs1")

I'm somewhat confused by the design variable:

>design <- cbind(sampleGroup1=1, sampleGroup2vs1=c(rep(0, nGrp1), rep(1, nGrp2)))
> design
      sampleGroup1 sampleGroup2vs1
 [1,]            1               0
 [2,]            1               0
 [3,]            1               0
 [4,]            1               0
 [5,]            1               0
 [6,]            1               0
 [7,]            1               0
 [8,]            1               0
 [9,]            1               0
[10,]            1               0
[11,]            1               0
[12,]            1               0
[13,]            1               0
[14,]            1               0
[15,]            1               0
[16,]            1               1
[17,]            1               1
[18,]            1               1
[19,]            1               1
[20,]            1               1
[21,]            1               1
[22,]            1               1
[23,]            1               1
[24,]            1               1
[25,]            1               1
[26,]            1               1
[27,]            1               1
[28,]            1               1
[29,]            1               1
[30,]            1               1

If I understand the documentation correctly, then the sample group 1 and 2 should be mutually exclusive. However, sample group 1 includes all samples. Is this supposed to be the intercept?

Also, another question concerning multiple comparisons. Say I wanted to add a third exclusive group, and perform all pairwise comparisons where 1vs2, 1vs3, 2vs3.

1) How should I set up the design for this? Do I just add another grouping category to the design?:

design <- cbind(sampleGroup1=1, sampleGroup2vs1=c(rep(0, nGrp1), rep(1, nGrp2), rep(2,nGrp3)))

But then how do I specify result for all possible post-hoc comparisons?

2) When I report the enriched pathways for a particular coefficient (e.g., 1vs2), are the reported adjusted-pvalues corrected test or group-wise? In theory, should I aggregate all of the pvalues for all comparisons, then do FDR or Bonferroni multiple testing corrections?

gsea gsva enrichment regression lmFit • 4.1k views

ADD COMMENT • link 7.3 years ago by mforde84 ★ 1.4k

score 3 · Answer 1 · 2017-01-13

Figured it out with some help from bioconducter support.

For the first question, it's an intercept.

For the design question, I used modelmatrix() and makecontrasts() to generate the appropriate comparisons, first running lmfit() using the main group effects, then running contrast.fit() using the group comparisons generate by makecontrasts().

For the pvalue question, I did BH multiple testing corrections on the pvalues for all of the post-hoc comparisons from the fit, and the results were consistent with results from topTable().