Running ANOVA in R on three tables of gene expression data
1
4
Entering edit mode
5.6 years ago
Volka ▴ 180

Hi all, I am currently trying to reproduce some results of previous work done at a lab. What I have right now are gene expression data in tables for three ethnic groups, with columns formatted as such:

ID Batch Gender Gene1 Gene2 Gene3 ...

..and so on for about 20 thousand genes. I have three of these tables for three ethnic groups, and I would like to know how to perform ANOVA in R on these three groups, in order to get the most differentially expressed genes. I have already tried limma for this, and I am looking to use ANOVA now to compare the results.

Thanks in advance!

anova RNA-Seq microarray gene expression • 7.2k views
ADD COMMENT
0
Entering edit mode

Hello. This was the best tutorial on how to apply the command in the analysis of gene expression. It worked perfectly with my data. Please, if I want to store all the values ​​shown by the print command, is it possible?

ADD REPLY
0
Entering edit mode

Take a look at write.table()

ADD REPLY
8
Entering edit mode
5.6 years ago

You should first aim to merge your datasets. To do this, they'll require common colnames. Here's a reproducible example using random-generated data:

data1 <- matrix(rexp(200, rate=.1), ncol=20)
data2 <- matrix(rexp(400, rate=.1), ncol=20)
data3 <- matrix(rexp(100, rate=.1), ncol=20)

colnames(data1) <- paste("gene", 1:ncol(data1), sep="")
colnames(data2) <- paste("gene", 1:ncol(data2), sep="")
colnames(data3) <- paste("gene", 1:ncol(data3), sep="")

data1 <- data.frame(rep("Black", nrow(data1)), data1)
colnames(data1)[1] <- "ethnicity"
data2 <- data.frame(rep("Asian", nrow(data2)), data2)
colnames(data2)[1] <- "ethnicity"
data3 <- data.frame(rep("Hispanic", nrow(data3)), data3)
colnames(data3)[1] <- "ethnicity"

merge <- rbind(data1, data2, data3)
merge[,1:5]

   ethnicity       gene1       gene2       gene3       gene4
1      Black  1.16514388 12.30338069  3.82688868 11.36645599
2      Black 24.38089444  9.03542219  6.80817973  0.10700084
3      Black 17.43543444 12.92760579 35.47927735  5.04846676
...
9      Black 12.98195603 21.59992268  0.71956284  3.55260945
10     Black 16.53245930  3.64460684 18.28704018 13.95026225
11     Asian  6.50373870  1.76874482 41.11393353 22.17943616
12     Asian  6.58345454 13.25700767  2.04473476  8.48304042
13     Asian 22.83291833  0.06681459  2.58220889  7.63248332
...
27     Asian  7.97365692  7.67312685  4.41263392 10.72628836
28     Asian  1.87894546  1.47503110  9.30714773 16.05473038
29     Asian  9.10527813 11.70059070 26.77912256  2.67517565
30     Asian  2.11775040  0.62188180 15.13412334  7.83838828
31  Hispanic 27.56932722  1.56425509  9.30974219  6.38043914
32  Hispanic  3.40418082 19.97802677  8.17777047  2.53841623
33  Hispanic 31.19901835  5.39049237  0.43565871  6.44909321
34  Hispanic  5.12689485 11.20531531 17.32246742  5.86373778
35  Hispanic 12.60268618  9.75769548 12.34081183 27.50402493

Then, set the ethnicity variable as categorical:

merge$ethnicity <- factor(merge$ethnicity, levels=c("Asian","Black","Hispanic"))
merge$ethnicity
 [1] Black    Black    Black    Black    Black    Black    Black    Black   
 [9] Black    Black    Asian    Asian    Asian    Asian    Asian    Asian   
[17] Asian    Asian    Asian    Asian    Asian    Asian    Asian    Asian   
[25] Asian    Asian    Asian    Asian    Asian    Asian    Hispanic Hispanic
[33] Hispanic Hispanic Hispanic
Levels: Asian Black Hispanic

Then. we can do either parametric or non-parametric ANOVA:

parametric ANOVA (slow; simple for loop):

baseformula <- " ~ ethnicity"
for (i in 2:ncol(merge)) {
    formula <- paste(colnames(merge)[i], baseformula, sep="")

    p <- summary(aov(as.formula(formula), data=merge))[[1]][["Pr(>F)"]][1]

    print(paste(formula, ": p=", p, sep=""))
}
[1] "gene1 ~ ethnicity: p=0.622349829393588"
[1] "gene2 ~ ethnicity: p=0.63661393769293"
[1] "gene3 ~ ethnicity: p=0.948436180231233"
[1] "gene4 ~ ethnicity: p=0.875863416358478"
[1] "gene5 ~ ethnicity: p=0.306303439037517"
[1] "gene6 ~ ethnicity: p=0.622549894820414"
...
[1] "gene15 ~ ethnicity: p=0.848376444462962"
[1] "gene16 ~ ethnicity: p=0.802850599451517"
[1] "gene17 ~ ethnicity: p=0.756847943886486"
[1] "gene18 ~ ethnicity: p=0.431368157733262"
[1] "gene19 ~ ethnicity: p=0.780660298718924"
[1] "gene20 ~ ethnicity: p=0.819316198089084"

non-parmetric ANOVA (Kruskal-Wallis test) (slow; simple for loop):

baseformula <- " ~ ethnicity"
for (i in 2:ncol(merge)) {
    formula <- paste(colnames(merge)[i], baseformula, sep="")

    p <- kruskal.test(as.formula(formula), data=merge)$p.value

    print(paste(formula, ": p=", p, sep=""))
}

[1] "gene1 ~ ethnicity: p=0.329166862417036"
[1] "gene2 ~ ethnicity: p=0.761201522936653"
[1] "gene3 ~ ethnicity: p=0.812129687344178"
[1] "gene4 ~ ethnicity: p=0.553535953984283"
[1] "gene5 ~ ethnicity: p=0.692051268905425"
[1] "gene6 ~ ethnicity: p=0.77824469542416"
...
[1] "gene15 ~ ethnicity: p=0.729788874269058"
[1] "gene16 ~ ethnicity: p=0.894555285927434"
[1] "gene17 ~ ethnicity: p=0.759029765430655"
[1] "gene18 ~ ethnicity: p=0.178257916287501"
[1] "gene19 ~ ethnicity: p=0.634599044972864"
[1] "gene20 ~ ethnicity: p=0.852143788966208"

---------------------------------------------------------------

You have Batch and Gender in your data, too. If you want to adjust for those, you may instead consider multinomial logistic regression and including them as covariates, i.e., same as above but:

glm(ethnicity ~ gene1 + Batch + Gender, data=merge)

Kevin

ADD COMMENT
0
Entering edit mode

Hi Kevin, thanks very much for the reply. I am able to follow up until the function for the parametric ANOVA. Here, it is giving me this error:

Error in levels(x)[x] : only 0's may be mixed with negative subscripts In addition: Warning messages: 1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored 2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors

The 'formula' variable at the very beginning of my loop is "X7896863 ~ Ethnic_group", for my first gene. I am not sure where the negative subscript may have come in. Do you have any suggestions on what the problem is?

ADD REPLY
1
Entering edit mode

If you have hyphens in your gene names or your factor levels, then that will cause an error. You should change these to underscores ("_") or full-stops ("."), or something else. Your gene data should also be numeric. What happens when you run:

as.numeric(MyData$X7896863)

?

What about:

MyData$Ethnic_group

?

ADD REPLY
0
Entering edit mode

Ah yes, MyData$Ethnic_group worked! Thank you!

ADD REPLY
1
Entering edit mode

Okay, great!

ADD REPLY

Login before adding your answer.

Traffic: 2700 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6