Question

Easy Microarray Statistics Question

12.7 years ago

User 7352 • 0

Let's say I'm looking at 5 independent microarrays, and some number of genes are upregulated on each microarray. If 200 of the same genes are upregulated on every microarray, what's the statistical test to prove that it's a significant enrichment? What if the genes are upregulated on 4 out of 5 microarrays?

microarray • 3.4k views

updated 12.7 years ago by Stefano Berri 4.4k • written 12.7 years ago by User 7352 • 0

I think Stefano's answer is on the right track. The reason: to find up-regulation you need to run a statistical test across replicates (e.g. t-test, limma, ANOVA) if all arrays stem from the same experiment. After that, your testing options are more or less used up, as there is no way to test whether a gene is regulated on one array only. However, did you instead intend to check differential regulation between unrelated arrays without replicates? That is not a good idea.

12.7 years ago by Michael 54k

But, on the other hand, ignoring whether or not it actually makes sense, the hypergeometric distribution could be applicable.

12.7 years ago by Michael 54k

I agree: if your 5 arrays are simply replicates of the same condition, then Michael and Stefano are correct. However, if you wanted to check whether, say, 5 different conditions affect the same gene set, then it's a question of set overlap, and a different test is used. For instance, if you independently knocked out 5 genes thought to be part of the same protein complex and then did microarrays (with replicates for each condition), you might expect similar genes to be affected in each condition, and it would be a very interesting question to compare them.

12.7 years ago by seidel 11k

So, I'm asking a slightly different question. I'm not looking to identify genes which are upregulated; I can do that fine. What I want to show is that in these 5 different yeast strains, a significant number of the same genes are upregulated. So, to simplify, let's say that the array has 5000 genes. On each array 500 genes are upregulated. 100 of the same genes are upregulated on every array. If you find 100 of the same genes upregulated on two arrays, you can use the hypergeometric test to show that that's significant. But how do you factor in all 5 arrays?
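For the two-array case described above, a minimal sketch in R might look like the following. It assumes the simplified numbers from this comment (5000 genes, 500 upregulated per array, 100 shared) and a null model in which the second list is a random draw from the array. The variable names are illustrative.

```r
# Simplified numbers from the comment above
n_genes <- 5000   # genes on the array
up_1    <- 500    # upregulated on array 1
up_2    <- 500    # upregulated on array 2
overlap <- 100    # upregulated on both

# Overlap expected by chance if the two lists were unrelated
expected_overlap <- up_1 * up_2 / n_genes

# P(overlap >= 100) under the hypergeometric null.
# lower.tail = FALSE returns P(X > q), so q = overlap - 1 includes the observed value.
p_overlap <- phyper(overlap - 1, up_1, n_genes - up_1, up_2, lower.tail = FALSE)

print(expected_overlap)
print(p_overlap)
```

The expected overlap by chance here is 50 genes, so an overlap of 100 is twice what unrelated lists would give on average.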

12.7 years ago by User 7352 • 0

Yes, Seidel, that's the question I'm trying to answer

12.7 years ago by User 7352 • 0

That cracks me up: our comments are 1 second apart, and I was using yeast as an example, which is what you're actually using. I think that to extend the analysis across the 5 data sets you multiply the p-values, because each one is like asking for the probability of a given event, and you have 4 events (the 4 comparisons against the first array), so in my mind it is like the odds of 4 successive dice rolls.
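The dice-roll intuition can also be applied to the five-way overlap directly. The sketch below is not the pairwise method proposed in this thread. It assumes the simplified numbers from the earlier comment (5000 genes, 500 upregulated per array, 100 shared by all five) and a null model in which each array's list is an independent random draw. Under that assumption a given gene lands in all five lists with probability (500/5000)^5, and a simulation gives an empirical p-value for the observed overlap. The seed, the number of simulations, and all names are illustrative.

```r
set.seed(1)               # arbitrary seed, only for reproducibility
n_genes       <- 5000
n_up          <- 500
n_arrays      <- 5
observed_all5 <- 100      # genes upregulated on every array (from the comment)
n_sim         <- 2000     # illustrative number of simulated experiments

# Chance that one gene is picked on a single array under the null
p_gene <- n_up / n_genes

# Expected number of genes in all 5 lists, and in at least 4 of 5, by chance
expected_all5 <- n_genes * p_gene^n_arrays
expected_ge4  <- n_genes * pbinom(n_arrays - 2, n_arrays, p_gene, lower.tail = FALSE)

# Simulate: each array independently picks n_up genes at random
null_counts <- replicate(n_sim, {
  picks <- unlist(lapply(seq_len(n_arrays), function(i) sample.int(n_genes, n_up)))
  hits  <- tabulate(picks, nbins = n_genes)   # number of lists each gene appears in
  c(all5 = sum(hits == n_arrays), ge4 = sum(hits >= n_arrays - 1))
})

# Empirical p-value for the observed five-way overlap (add-one correction)
p_emp <- (sum(null_counts["all5", ] >= observed_all5) + 1) / (n_sim + 1)

print(c(expected_all5 = expected_all5, expected_ge4 = expected_ge4))
print(p_emp)
```

With these numbers the chance expectation is 0.05 genes in all five lists and about 2.3 genes in at least four of five, so the same simulated counts can serve as a reference for the 4-out-of-5 version of the question. An empirical p-value can never be smaller than 1 / (n_sim + 1), which is why it only bounds the significance here.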

12.7 years ago by seidel 11k

If the experiments are comparable and you have different strains, you could still treat them as biological replicates. Just ask yourself a slightly different question: which genes are regulated across these yeast strains? If the log2 ratio of expression of geneX across the different strains is statistically different from zero, it will be picked up.

12.7 years ago by Stefano Berri 4.4k

score 4 · Answer 1 · 2011-08-06

Maybe I am getting this wrong, but I do not think the hypergeometric is the way to go. Am I right that you are talking about 5 microarrays for the same "experiment", like 5 biological replicates?

Install limma from Bioconductor, load the microarray data, follow the documentation, and perform a "standard" analysis. It fits a linear model and does not use the hypergeometric distribution but a t-test (or rather a derivative of it, the moderated t-statistic). If your arrays are Affymetrix, use the affy package first and then limma.

The hypergeometric test takes into account neither HOW MUCH the genes are upregulated nor how consistent the up-regulation is. The t-test does.

Then, of course, correct for multiple testing.
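As a rough illustration of a per-gene test followed by multiple-testing correction, here is a sketch in base R. It is not limma (which uses a moderated t-statistic). It runs an ordinary one-sample t-test per gene on simulated log2 ratios and then applies the Benjamini-Hochberg correction via p.adjust. The matrix, its dimensions, the simulated effect, and the 0.05 cutoff are all assumptions made up for the example.

```r
set.seed(1)     # arbitrary seed, only for reproducibility
n_genes <- 1000 # hypothetical array size
n_rep   <- 5    # five replicate arrays

# Simulated log2 ratios: rows are genes, columns are arrays (made-up data)
log_ratios <- matrix(rnorm(n_genes * n_rep, mean = 0, sd = 0.5),
                     nrow = n_genes,
                     dimnames = list(paste0("gene", seq_len(n_genes)),
                                     paste0("array", seq_len(n_rep))))
log_ratios[1:50, ] <- log_ratios[1:50, ] + 2   # 50 genes made truly upregulated

# One-sample, one-sided t-test per gene: is the mean log2 ratio above zero?
p_raw <- apply(log_ratios, 1, function(v) {
  t.test(v, mu = 0, alternative = "greater")$p.value
})

# Benjamini-Hochberg correction for testing many genes at once
p_adj <- p.adjust(p_raw, method = "BH")

# Genes called upregulated at an (assumed) 5% false discovery rate
called_up <- rownames(log_ratios)[p_adj < 0.05]
print(length(called_up))
```

Unlike a list overlap, this uses both the size and the consistency of the change across the five arrays, which is the point made above.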

I would use the hypergeometric test only when comparing the results of different experiments (using different platforms or different conditions), but that does not sound like your case.

In general, try to learn about microarray analysis as much as you can before starting the analysis.

Good luck

score 2 · Answer 2 · 2011-08-06

I think I know the answer, but let me say up front that I'm not a statistician. I think you use the hypergeometric distribution, and the first array forms the basis of a question that you then evaluate against the other arrays. Using the phyper function in R, you can calculate the probability of obtaining the same gene set between two array results, and I think you then simply repeat the process and multiply the resulting p-values (the same way you would multiply the odds of a given repeated dice roll).

The help for phyper uses the urn analogy, so that's what I'll use. Say that a given array has 10,000 spots and you identify 300 top genes. Then you perform a second array and also select 300 top genes. When you examine the overlap, it is 200 genes. What is the likelihood of getting a 200-gene overlap by chance? The first array sets up the urn as follows: there are 10,000 balls in total, and 300 of them are white. The second array asks the question: what is the likelihood of drawing 300 balls from such an urn and having 200 of them be white? (Or, more generally, for a top gene set of a given size from the second array, what are the chances that 200 of them will be white?)

In R, the phyper function takes the following arguments: x = number of white balls drawn (the number of genes from array 2 that were found in common with array 1; this argument is named q in phyper's own signature), m = total number of white balls in the urn (size of the original top gene set from array 1), n = total number of black balls in the urn (number of array spots minus the top gene set from array 1), k = number of balls drawn (size of the top gene set from array 2). So the answer for the overlap between arrays 1 and 2 is:

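# phyper function in R for the hypergeometric distribution
# upper tail P(X >= x): lower.tail = FALSE returns P(X > q), so pass q = x - 1
phyper(x - 1, m, n, k, lower.tail = FALSE)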
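As a worked instance of the call above, here is a sketch using the numbers from this answer (10,000 spots, 300 top genes on each array, an overlap of 200). One detail is worth checking: with the default lower tail, phyper(x, ...) returns P(X <= x), so 1 - phyper(x, ...) is P(X > x) and leaves out the observed overlap itself, while P(X >= x) needs x - 1. For very small p-values, lower.tail = FALSE also avoids the rounding that the subtraction from 1 incurs. The variable total_spots is an illustrative name.

```r
total_spots <- 10000
x <- 200                 # genes in common between array 1 and array 2
m <- 300                 # top genes on array 1 (white balls)
n <- total_spots - m     # remaining spots (black balls)
k <- 300                 # top genes on array 2 (balls drawn)

# Overlap expected by chance
expected_overlap <- k * m / total_spots

# P(X >= x), computed directly in the upper tail
p_at_least_x <- phyper(x - 1, m, n, k, lower.tail = FALSE)

# P(X > x) via subtraction; for probabilities this small the subtraction
# loses all precision in double arithmetic and typically evaluates to 0
p_subtraction <- 1 - phyper(x, m, n, k)

print(expected_overlap)
print(p_at_least_x)
print(p_subtraction)
```

The chance expectation here is only 9 shared genes, so an overlap of 200 gives an extremely small tail probability.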

Then you calculate the same thing for array 1 and 3, and 1 and 4, and 1 and 5, and then you multiply them. That's my guess.
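The pairwise procedure described above can be sketched as follows. The gene lists are hypothetical (simulated so that 100 genes are shared by design), and the names are illustrative. One caution: the product of several p-values is not itself a p-value, because under the null hypothesis it is systematically too small. For comparison, the sketch also shows Fisher's method, a standard way to combine independent p-values. That is a swapped-in technique, not what this answer proposes, and with discrete tests like this one it is only approximate.

```r
set.seed(42)                        # arbitrary seed, only for reproducibility
n_genes   <- 5000                   # hypothetical array size
all_genes <- paste0("gene", seq_len(n_genes))

# Hypothetical top-gene lists for 5 arrays: 100 genes shared by design,
# plus 400 genes drawn at random for each array
shared    <- all_genes[1:100]
top_lists <- lapply(1:5, function(i) c(shared, sample(all_genes[-(1:100)], 400)))

# Array 1 is the reference; compare arrays 2..5 against it
reference  <- top_lists[[1]]
pairwise_p <- sapply(top_lists[-1], function(other) {
  x <- length(intersect(reference, other))   # overlap with array 1
  phyper(x - 1, length(reference), n_genes - length(reference),
         length(other), lower.tail = FALSE)  # P(overlap >= x)
})

# The answer's proposal: multiply the four pairwise p-values
product_p <- prod(pairwise_p)

# Fisher's method: -2 * sum(log p) compared with chi-squared on 2k degrees of freedom
fisher_stat <- -2 * sum(log(pairwise_p))
fisher_p    <- pchisq(fisher_stat, df = 2 * length(pairwise_p), lower.tail = FALSE)

# Genes actually present in all five lists
in_all_five <- Reduce(intersect, top_lists)

print(pairwise_p)
print(c(product = product_p, fisher = fisher_p))
print(length(in_all_five))
```

Large pairwise overlaps with array 1 do not by themselves show that the same genes recur on every array, so it is worth checking the five-way intersection (in_all_five here) as well.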