Question

Gene list overlap - Null distribution

7.3 years ago

Wario • 0

Hello everyone,

This is probably a stupid question but I need help.

I want to calculate the null distribution for the gene overlap between 2 lists.

The first list comes from ChIP-seq data and the second from RNA-seq. The background genome is 20,000 genes. I have this data for 50 samples.

The first sample has a ChIP-seq list of 751 genes and an RNA-seq list of 590 genes.

I tried it in R, but the result looks odd.

ts = replicate(5000,t.test(rnorm(751),rnorm(590))$statistic) 
range(ts)

pts = seq(-3.5, 3.5,length=100)
plot(pts,dt(pts,df=25),col='red',type='l') 
lines(density(ts))

RNA-Seq ChIP-Seq • 1.5k views

ADD COMMENT • link updated 7.3 years ago by i.sudbery 19k • written 7.3 years ago by Wario • 0

I formatted your code (using the 101010 button) for readability, but perhaps you should check I did it correctly.

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

Thanks, didn't know about that.

ADD REPLY • link 7.3 years ago by Wario • 0

score 2 · Answer 1 · 2017-01-14

As was mentioned by @Lars Juhl Jensen, the standard null distribution for two gene lists is the hypergeometric distribution. However, this assumes that all genes are independent and equally likely to show up. There are several reasons why this might not be the case:

Longer genes are more likely to be called differentially expressed, because higher read counts give you more power to detect a change.
You don't say how your ChIP-seq gene list is derived. If it is by overlapping peaks with the gene region, then again, longer genes are more likely to overlap. If you are assigning peaks to genes based on a promoter region or gene territory, are all promoters/territories the same length?
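To see how much this can matter, here is a rough simulation sketch. It assumes made-up log-normal gene lengths and assumes that a gene's chance of landing on either list is proportional to its length; both are illustrative assumptions, not something taken from the question's data. It compares the overlap you get by chance under equal probabilities with the overlap under length-weighted sampling.

```r
set.seed(42)

n_background <- 20000  # background size from the question
n_chip       <- 751    # ChIP-seq list size from the question
n_rna        <- 590    # RNA-seq list size from the question

# Assumption: simulated gene lengths, not real annotation
gene_length <- rlnorm(n_background, meanlog = 8, sdlog = 1)

# Draw two random lists with the given per-gene weights and count shared genes
overlap_once <- function(weights) {
  chip <- sample.int(n_background, n_chip, prob = weights)
  rna  <- sample.int(n_background, n_rna,  prob = weights)
  length(intersect(chip, rna))
}

# Null with all genes equally likely vs. null with length-weighted selection
uniform_null <- replicate(1000, overlap_once(rep(1, n_background)))
biased_null  <- replicate(1000, overlap_once(gene_length))

mean(uniform_null)
mean(biased_null)
```

If the length-weighted null sits well above the uniform one, an overlap that looks significant under the simple model may just reflect shared length bias.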

There are a couple of ways around this. First, the package goseq is designed to manage gene length bias in differential expression analysis. While you are not doing GO analysis, the problem is conceptually equivalent.

Alternatively, the program GAT (Genomic Association Tester) tests whether a set of intervals overlaps with another set of intervals more often than you would expect, accounting for length bias, GC content bias, and so on.

score 1 · Answer 2 · 2017-01-14

7.3 years ago

Lars Juhl Jensen 11k

You could model this with a simple hypergeometric distribution, if you make the assumption that all genes are equally likely to appear on the two lists.
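As a minimal sketch of that in R, using the list sizes from the question: the observed overlap of 40 below is a made-up value for illustration only, since the question does not give one. `phyper` gives the upper-tail probability directly, and the random draws reproduce the same null distribution empirically.

```r
n_background <- 20000  # genes in the background
n_chip       <- 751    # genes in the ChIP-seq list
n_rna        <- 590    # genes in the RNA-seq list
observed     <- 40     # hypothetical observed overlap (not from the question)

# Expected overlap if both lists are random draws from the background
expected <- n_chip * n_rna / n_background

# P(overlap >= observed) under the hypergeometric null
p_value <- phyper(observed - 1, n_chip, n_background - n_chip, n_rna,
                  lower.tail = FALSE)

# The same null distribution by simulation: random lists of the same sizes
set.seed(1)
genes <- seq_len(n_background)
null_overlap <- replicate(5000, {
  length(intersect(sample(genes, n_chip), sample(genes, n_rna)))
})

expected
mean(null_overlap)
p_value
hist(null_overlap, breaks = 30, main = "Null distribution of overlap")
```

For the 50 samples, the same calculation can be repeated per sample with that sample's list sizes and observed overlap.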
