Why Would P-Value Histogram Skew Right For Anovas?
Eric ▴ 40 • 13.8 years ago

I have a linear regression model fit to someone else's unpublished microarray data, in which I'm testing the significance of certain 2nd- and 3rd-order variable interactions via ANOVA on nested models (i.e., with and without the specific interactions). For many of the interactions, however, I obtain a p-value distribution that is skewed towards high p-values. Can incorrect model selection lead to something like this? Could it be because I encoded the model matrix incorrectly? Might the skewed distribution reflect some underlying structure in the data? Am I choosing the wrong model vs. null model?

[Image: P-value histogram]

The Model

There are three cell types, dummy-coded with the two indicator variables cell.type.b and cell.type.c, and four drug treatment conditions (untreated, drug1 only, drug2 only, and drug1+drug2), encoded by the absence or presence of each drug.

intercept    cell.type.b    cell.type.c    drug1    drug2
1            0              0              0        0
1            0              0              0        1
1            0              0              1        0
1            0              0              1        1
1            1              0              0        0
1            1              0              0        1
1            1              0              1        0
1            1              0              1        1
1            0              1              0        0
1            0              1              0        1
1            0              1              1        0
1            0              1              1        1

I also have (but did not include in the table for simplicity) the 6 interaction parameters, i.e., each cell-type indicator crossed with each drug term:

(cell.type.b + cell.type.c) * (drug1 + drug2 + drug1:drug2)

For each test, the null hypothesis is that the reduced model without a particular cell.type × drug interaction fits the data as well as the full model that includes it (i.e., the corresponding interaction coefficients are zero).
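
For reference, a minimal sketch of how the full and reduced model matrices could be built in R; the phenotype data frame pheno and its column names are illustrative, not taken from the original analysis, and this reduced model drops all six interactions at once (for a single interaction you would drop only the corresponding columns):

# Hypothetical phenotype table: one row per array/sample
pheno <- expand.grid(cell.type = c("a", "b", "c"),
                     drug1     = c(0, 1),
                     drug2     = c(0, 1))

# Full model: cell type, drugs, and all cell.type x drug interactions
# (12 columns, matching the 12 parameters described above)
mod  <- model.matrix(~ cell.type * drug1 * drug2, data = pheno)

# Reduced (null) model: drop the cell.type x drug interaction columns
mod0 <- model.matrix(~ cell.type + drug1 * drug2, data = pheno)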

Calculating p-values

To calculate the p-values I use the following R function, which I essentially copied and pasted from Leek and Storey's sva R package. It is just a vectorized use of the hat matrix to extract residuals. Models are supplied as matrices like the table above; my model matrix has additional columns for the interactions, of course. A separate F statistic is calculated for each row of the data matrix (i.e., each gene in the microarray data set):

f.pvalue <- function(dat,mod,mod0){
  # This is a function for performing
  # parametric f-tests on the data matrix
  # dat comparing the null model mod0
  # to the alternative model mod. 
  n <- dim(dat)[2]      # number of arrays (columns of dat)
  m <- dim(dat)[1]      # number of genes (rows of dat)
  df1 <- dim(mod)[2]    # number of parameters in the full model
  df0 <- dim(mod0)[2]   # number of parameters in the null model
  p <- rep(0,m)
  Id <- diag(n)

  # Residuals from each model via the hat matrix: dat %*% (I - X (X'X)^-1 X')
  resid <- dat %*% (Id - mod %*% solve(t(mod) %*% mod) %*% t(mod))
  resid0 <- dat %*% (Id - mod0 %*% solve(t(mod0) %*% mod0) %*% t(mod0))

  # Per-gene residual sums of squares under each model
  rss1 <- resid^2 %*% rep(1,n)
  rss0 <- resid0^2 %*% rep(1,n)

  # Nested-model F statistic and its upper-tail p-value for each gene
  fstats <- ((rss0 - rss1)/(df1-df0))/(rss1/(n-df1))
  p <- 1-pf(fstats,df1=(df1-df0),df2=(n-df1))
  return(p)
}
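
A quick usage sketch, assuming dat is the genes x samples expression matrix whose columns line up with the rows of mod and mod0 (object names as above):

pvals <- f.pvalue(dat, mod, mod0)   # one p-value per gene
hist(pvals, breaks = 50)            # should look flat for null genes, with a spike near 0 for genes with real interactions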
microarray • statistics • r
Hi @Eric, I have seen this pattern in two different projects while analyzing second-level interactions using the R-Maanova package. I did not find any particular reason or explanation, so any sensible reply to your question is also of high interest to me :) Did you use any particular package for the analysis, or did you simply use a standard ANOVA test implemented in R? Cheers

I'm using the hat matrix to extract residuals from the microarrays and then directly calculating F statistics, so I'm using the standard R function pf(q, df1, df2) to get the p-values.

David W 4.9k • 10.9 years ago

One possibility is that your models are missing an important covariate.

Imagine a simple case in which you are looking at 3 drugs across 3 cell types...

cells <- rep(c("A", "B", "C"), each=30) 
drugs <- rep(c("I", "II", "III"), 30)

...and the drugs don't have an effect on what you're measuring but the cell-types do...

cell_type_effects <- c("A"=0, "B"=1, "C"=2)
drug_effects <- c("I"=0, "II"=0, "III"=0)

...if you run a bunch of studies looking only at the drug effects and failing to include the cell type, you won't get a uniform distribution of p-values; instead they pile up towards 1 (the right skew you describe), because the unmodelled cell-type effect inflates the residual variance and so deflates the F statistics:

simulate_study <- function(){
  # all drug effects are 0, so the true model is y ~ cell type + normal error
  y <- rnorm(90, cell_type_effects[cells] + drug_effects[drugs], 1)
  # fit drugs only (omitting cell type) and keep the p-value for the drug term
  anova(lm(y ~ drugs))[["Pr(>F)"]][1]
}
pvals <- replicate(1e4, simulate_study())
hist(pvals)

mean(pvals < 0.1)
## [1] 0.0327
mean(pvals > 0.9)
## [1] 0.13

This isn't the only way you could wind up with the distribution of p-values you have, but it's probably worth thinking about.
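
For contrast, a minimal sketch of the same simulation with the cell-type covariate included in the model (reusing the objects defined above; the function name is illustrative). With a correctly specified model the drug-term p-values should come out roughly uniform:

simulate_study_adjusted <- function(){
  y <- rnorm(90, cell_type_effects[cells] + drug_effects[drugs], 1)
  # now cell type is in the model, so the residual variance is no longer inflated
  anova(lm(y ~ cells + drugs))[["Pr(>F)"]][2]   # p-value for the drug term
}
hist(replicate(1e4, simulate_study_adjusted()))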

Ryan D ★ 3.4k • 10.9 years ago

The most likely explanation, it seems to me, is that you are violating some of the assumptions of the ANOVA test. I would check the normality of your signal data and try plotting the p-values from an equivalent non-parametric test, such as the Kruskal-Wallis test.
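
A minimal sketch of that check, assuming dat is the genes x samples expression matrix and group is a factor giving each sample's cell-type/treatment combination (both names are illustrative):

# Per-gene Kruskal-Wallis p-values as a non-parametric comparison
kw.pvals <- apply(dat, 1, function(x) kruskal.test(x ~ group)$p.value)
hist(kw.pvals, breaks = 50)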
