Question

How To Determine The Statistical Significance Of Overlap (Intersect) Between Three Sets

5

11.1 years ago

bsmith030465 ▴ 240

I have three overlapping sets and I want to find the probability of observing a larger intersection for 'A intersect B intersect C' by chance (in the example below, the probability of finding more than 135 elements that are common to sets A, B and C). For a two-set problem, I guess I would use a Fisher or chi-square test. Here is what I have attempted so far:

### Prepare a 3 way contingency table:
mytable <- array(c(135,116,385,6256,
                    48,97,274,9555),
                  dim = c(2,2,2),
                  dimnames = list(
                    Is_C = c('Yes','No'),
                    Is_B = c('Yes','No'),
                    Is_A = c('Yes','No')))

## test (the table built above is called mytable)
mantelhaen.test(mytable, exact = TRUE, alternative = "greater")

Is this the right test (along with the current parameters) to determine what I want, or is there a more appropriate test for this?
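For the two-set case mentioned in the question, a minimal sketch in R might look like the following. It rebuilds the same 2x2x2 table, collapses it over A, and runs a one-sided Fisher's exact test on B versus C. The name `tab_BC` is only illustrative, and collapsing over A is an assumption made for this example, not part of the original attempt.

```r
# Same counts as in the question, rebuilt so the sketch is self-contained.
mytable <- array(c(135, 116, 385, 6256,
                   48, 97, 274, 9555),
                 dim = c(2, 2, 2),
                 dimnames = list(Is_C = c("Yes", "No"),
                                 Is_B = c("Yes", "No"),
                                 Is_A = c("Yes", "No")))

# Collapse over A (an assumption for this example) to get a 2x2 table of C vs. B.
tab_BC <- margin.table(mytable, c(1, 2))

# One-sided Fisher's exact test: is the B/C overlap larger than expected by chance?
res <- fisher.test(tab_BC, alternative = "greater")
print(tab_BC)
print(res$p.value)
```

This only addresses a pairwise overlap; it says nothing about the three-way intersection.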

statistics r • 13k views

ADD COMMENT • link updated 7.8 years ago by Biostar 20 • written 11.1 years ago by bsmith030465 ▴ 240

1

I was going to suggest you also post this at Cross Validated, but then I saw this! Glad biostars are more responsive...

ADD REPLY • link 11.1 years ago by Madelaine Gogol 5.3k

0

I'm interested to hear what others say as to whether mantelhaen is the right test here. Don't forget that if your sets are genomic intervals, the standard methods are less likely to apply due to the non-randomness of the genome. E.g. if all 3 of your datasets are likely to occur in gene bodies, then that is the real relationship, but it will make them appear to be co-occurring if you're considering the entire genome as the background.

ADD REPLY • link 11.1 years ago by brentp 24k

0

Each set consists of a group of genes, and I'm trying to see if the overlap is significant. All the sets are drawn from the full complement of genes across the genome (~17k). Does that answer your question?

ADD REPLY • link 11.1 years ago by bsmith030465 ▴ 240

0

Can you tell us if you are looking for genomic overlap?

ADD REPLY • link 11.1 years ago by Zev.Kronenberg 12k

score 4 · Answer 1 · 2013-03-13

4

11.1 years ago

brentp 24k

I think you probably want the multivariate version of the hypergeometric. You can find an implementation and documentation on that for R here:

http://rss.acs.unt.edu/Rdoc/library/BiasedUrn/html/BiasedUrn-3-Multivariate.html

ADD COMMENT • link 11.1 years ago by brentp 24k

0

For a strawman case, if we assume that there is no bias, I'm not sure if the above models will apply.

ADD REPLY • link 11.1 years ago by bsmith030465 ▴ 240

1

That may well be. Can you elaborate? Bias is a loaded term.

ADD REPLY • link 11.1 years ago by brentp 24k

score 1 · Answer 2 · 2013-03-13

1

11.1 years ago

Larry_Parnell 16k

The approaches described in this report, "Annotation Enrichment Analysis: An Alternative Method for Evaluating the Functional Properties of Gene Sets", may be useful to you, depending on exactly what you're comparing and where you want to take the results. The report is available here. Although the authors discuss an approach for dealing with gene set enrichment using GO terms when not all genes are equally annotated, the approach could be applied to other labels of the entities for which you are looking for overlap/enrichment.

ADD COMMENT • link 11.1 years ago by Larry_Parnell 16k

0

That looks like an interesting paper. In the abstract, they say that their method "is able to predict biologically meaningful results that are obscured by the many false-positive enrichment scores that occur in FET (Fisher's Exact Test)...." I wonder if simply using a FDR with FET would correct for some of this. I've done this in the past, but a quick search to find some support for this idea turns up this related paper with a potentially useful Perl package (from the same paper) for doing these calculations.
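As a rough illustration of the "FDR with FET" idea, the sketch below runs Fisher's exact test on a few 2x2 tables and then applies a Benjamini-Hochberg adjustment with `p.adjust`. The tables and their counts are made up for this example and do not come from the thread or the paper.

```r
# Hypothetical 2x2 overlap tables (made-up counts, one per gene set being tested).
tables <- list(
  set1 = matrix(c(30, 70, 200, 9700), nrow = 2),
  set2 = matrix(c(5, 95, 400, 9500), nrow = 2),
  set3 = matrix(c(12, 88, 600, 9300), nrow = 2)
)

# One-sided Fisher's exact test for each table.
p_raw <- sapply(tables, function(m) fisher.test(m, alternative = "greater")$p.value)

# Benjamini-Hochberg adjustment to control the false discovery rate.
p_fdr <- p.adjust(p_raw, method = "BH")
print(cbind(p_raw, p_fdr))
```

Whether this addresses the false positives the paper describes is a separate question; the adjustment only accounts for the number of tests performed.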

ADD REPLY • link 11.1 years ago by SES 8.6k

0

FDR (which we often employ) and FET may be adequate. We have not yet done what is described in the paper to which I linked, but intend to. It is an interesting approach indeed.

ADD REPLY • link 11.1 years ago by Larry_Parnell 16k

0

Interesting paper - will go into it a little later.

At the moment, I'm just trying to get a 'strawman' probability. If we assume independence and no bias (i.e. assume that there are ~17k numbered balls in an urn), what is the probability of finding more than 135 balls that are common to all three draws?

Although blatantly incorrect from a biological/genetic point of view, this is just one number that I'll be presenting...
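One way this strawman number could be computed exactly is sketched below, under the assumption that the three sets are independent random draws of fixed size from the same urn. In that case the size of 'A intersect B' is hypergeometric, and given that size, the triple overlap is hypergeometric again. The function name `p_triple_overlap` is hypothetical, and the set sizes are read off the 2x2x2 table in the question.

```r
# P(triple overlap > observed) when A, B and C are independent random draws
# of fixed sizes nA, nB, nC from the same N items (the "urn" assumption).
p_triple_overlap <- function(N, nA, nB, nC, observed) {
  k <- 0:min(nA, nB)                      # possible sizes of 'A intersect B'
  p_ab <- dhyper(k, nA, N - nA, nB)       # P(size of 'A intersect B' = k)
  # Given k items shared by A and B, the triple overlap is hypergeometric:
  # k "marked" items among N, with nC items drawn for C.
  p_tail <- phyper(observed, k, N - k, nC, lower.tail = FALSE)
  sum(p_ab * p_tail)
}

# Set sizes read off the 2x2x2 table in the question.
N  <- 16866   # all eight cells summed
nA <- 6892    # 135 + 116 + 385 + 6256
nB <- 396     # 135 + 116 + 48 + 97
nC <- 842     # 135 + 385 + 48 + 274

expected <- nA * nB * nC / N^2            # expected triple overlap under independence
p_value  <- p_triple_overlap(N, nA, nB, nC, observed = 135)
print(expected)
print(p_value)
```

Conditioning on the pairwise overlap keeps the calculation exact without simulation, but it inherits every caveat of the independence assumption.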

Thanks for the replies!

ADD REPLY • link 11.1 years ago by bsmith030465 ▴ 240