parametric ANOVA (slow; simple `for` loop):

Question

Running ANOVA in R on three tables of gene expression data

4

Entering edit mode

5.6 years ago

Volka ▴ 180

Hi all, I am currently trying to reproduce some results of previous work done at a lab. What I have right now are gene expression data in tables for three ethnic groups, with columns formatted as such:

ID Batch Gender Gene1 Gene2 Gene3 ...

..and so on for about 20 thousand genes. I have three of these tables for three ethnic groups, and I would like to know how to perform ANOVA in R on these three groups, in order to get the most differentially expressed genes. I have already tried limma for this, and I am looking to use ANOVA now to compare the results.

Thanks in advance!

anova RNA-Seq microarray gene expression • 7.2k views

ADD COMMENT • link updated 4.1 years ago by janainamace ▴ 10 • written 5.6 years ago by Volka ▴ 180

0

Entering edit mode

Hello. This was the best tutorial on how to apply the command in the analysis of gene expression. It worked perfectly with my data. Please, if I want to store all the values shown by the print command, is it possible?

ADD REPLY • link 4.1 years ago by janainamace ▴ 10

0

Entering edit mode

Take a look at write.table()

ADD REPLY • link 4.1 years ago by Kevin Blighe 87k

score 8 · Answer 1 · 2018-08-24

You should first aim to merge your datasets. To do this, they'll require common colnames. Here's a reproducible example using random-generated data:

data1 <- matrix(rexp(200, rate=.1), ncol=20)
data2 <- matrix(rexp(400, rate=.1), ncol=20)
data3 <- matrix(rexp(100, rate=.1), ncol=20)

colnames(data1) <- paste("gene", 1:ncol(data1), sep="")
colnames(data2) <- paste("gene", 1:ncol(data2), sep="")
colnames(data3) <- paste("gene", 1:ncol(data3), sep="")

data1 <- data.frame(rep("Black", nrow(data1)), data1)
colnames(data1)[1] <- "ethnicity"
data2 <- data.frame(rep("Asian", nrow(data2)), data2)
colnames(data2)[1] <- "ethnicity"
data3 <- data.frame(rep("Hispanic", nrow(data3)), data3)
colnames(data3)[1] <- "ethnicity"

merge <- rbind(data1, data2, data3)
merge[,1:5]

   ethnicity       gene1       gene2       gene3       gene4
1      Black  1.16514388 12.30338069  3.82688868 11.36645599
2      Black 24.38089444  9.03542219  6.80817973  0.10700084
3      Black 17.43543444 12.92760579 35.47927735  5.04846676
...
9      Black 12.98195603 21.59992268  0.71956284  3.55260945
10     Black 16.53245930  3.64460684 18.28704018 13.95026225
11     Asian  6.50373870  1.76874482 41.11393353 22.17943616
12     Asian  6.58345454 13.25700767  2.04473476  8.48304042
13     Asian 22.83291833  0.06681459  2.58220889  7.63248332
...
27     Asian  7.97365692  7.67312685  4.41263392 10.72628836
28     Asian  1.87894546  1.47503110  9.30714773 16.05473038
29     Asian  9.10527813 11.70059070 26.77912256  2.67517565
30     Asian  2.11775040  0.62188180 15.13412334  7.83838828
31  Hispanic 27.56932722  1.56425509  9.30974219  6.38043914
32  Hispanic  3.40418082 19.97802677  8.17777047  2.53841623
33  Hispanic 31.19901835  5.39049237  0.43565871  6.44909321
34  Hispanic  5.12689485 11.20531531 17.32246742  5.86373778
35  Hispanic 12.60268618  9.75769548 12.34081183 27.50402493

Then, set the ethnicity variable as categorical:

merge$ethnicity <- factor(merge$ethnicity, levels=c("Asian","Black","Hispanic"))
merge$ethnicity
 [1] Black    Black    Black    Black    Black    Black    Black    Black   
 [9] Black    Black    Asian    Asian    Asian    Asian    Asian    Asian   
[17] Asian    Asian    Asian    Asian    Asian    Asian    Asian    Asian   
[25] Asian    Asian    Asian    Asian    Asian    Asian    Hispanic Hispanic
[33] Hispanic Hispanic Hispanic
Levels: Asian Black Hispanic

Then. we can do either parametric or non-parametric ANOVA:

parametric ANOVA (slow; simple `for` loop):

baseformula <- " ~ ethnicity"
for (i in 2:ncol(merge)) {
    formula <- paste(colnames(merge)[i], baseformula, sep="")

    p <- summary(aov(as.formula(formula), data=merge))[[1]][["Pr(>F)"]][1]

    print(paste(formula, ": p=", p, sep=""))
}
[1] "gene1 ~ ethnicity: p=0.622349829393588"
[1] "gene2 ~ ethnicity: p=0.63661393769293"
[1] "gene3 ~ ethnicity: p=0.948436180231233"
[1] "gene4 ~ ethnicity: p=0.875863416358478"
[1] "gene5 ~ ethnicity: p=0.306303439037517"
[1] "gene6 ~ ethnicity: p=0.622549894820414"
...
[1] "gene15 ~ ethnicity: p=0.848376444462962"
[1] "gene16 ~ ethnicity: p=0.802850599451517"
[1] "gene17 ~ ethnicity: p=0.756847943886486"
[1] "gene18 ~ ethnicity: p=0.431368157733262"
[1] "gene19 ~ ethnicity: p=0.780660298718924"
[1] "gene20 ~ ethnicity: p=0.819316198089084"

non-parmetric ANOVA (Kruskal-Wallis test) (slow; simple `for` loop):

baseformula <- " ~ ethnicity"
for (i in 2:ncol(merge)) {
    formula <- paste(colnames(merge)[i], baseformula, sep="")

    p <- kruskal.test(as.formula(formula), data=merge)$p.value

    print(paste(formula, ": p=", p, sep=""))
}

[1] "gene1 ~ ethnicity: p=0.329166862417036"
[1] "gene2 ~ ethnicity: p=0.761201522936653"
[1] "gene3 ~ ethnicity: p=0.812129687344178"
[1] "gene4 ~ ethnicity: p=0.553535953984283"
[1] "gene5 ~ ethnicity: p=0.692051268905425"
[1] "gene6 ~ ethnicity: p=0.77824469542416"
...
[1] "gene15 ~ ethnicity: p=0.729788874269058"
[1] "gene16 ~ ethnicity: p=0.894555285927434"
[1] "gene17 ~ ethnicity: p=0.759029765430655"
[1] "gene18 ~ ethnicity: p=0.178257916287501"
[1] "gene19 ~ ethnicity: p=0.634599044972864"
[1] "gene20 ~ ethnicity: p=0.852143788966208"

---------------------------------------------------------------

You have Batch and Gender in your data, too. If you want to adjust for those, you may instead consider multinomial logistic regression and including them as covariates, i.e., same as above but:

glm(ethnicity ~ gene1 + Batch + Gender, data=merge)

Kevin

parametric ANOVA (slow; simple for loop):

non-parmetric ANOVA (Kruskal-Wallis test) (slow; simple for loop):

---------------------------------------------------------------

parametric ANOVA (slow; simple `for` loop):

non-parmetric ANOVA (Kruskal-Wallis test) (slow; simple `for` loop):