Question

Using hypergeometric distribution to determine significance of gene overlap in RNA-seq

6.3 years ago

nmac12 • 0

What would be the correct way to go about determining the significance of gene overlap in RNA-seq? For example, I am comparing my RNA-seq data against other known regulons in mycobacteria. These regulons include genes that respond to acid stress, hypoxia, osmolarity, and so on. Would it be correct to use the "phyper" function in R, as described in the post "Probability of gene list overlap"?

Thanks

RNA-Seq R bacteria • 1.9k views

updated 5.2 years ago by Biostar • written 6.3 years ago by nmac12

What do you mean by "RNA-seq data" and by "overlap"? Is the RNA-seq data a list of differentially regulated genes? And by overlap, do you mean, say, for a regulon with 10 genes, what fraction of those 10 genes are in the differential list?

ADD REPLY • link 6.3 years ago by i.sudbery 19k

By RNA-seq data, I mean I have a list of genes and their associated change in regulation. Analysis of the raw RNA-seq data is complete; now I need to compare the genes that respond to the environmental stress I induced with other known stress-induced regulons. Also, yes, that is what I mean by overlap. Thanks!

ADD REPLY • link 6.3 years ago by nmac12 • 0

score 3 · Answer 1 · 2018-01-11

This task is basically the same as doing a GO enrichment analysis, and thus the tools that are used for this task are also suitable here.
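
As a rough illustration of the basic over-representation calculation behind such tools (and behind the "phyper" approach the question mentions), here is a minimal Python sketch. The function name and the example numbers are invented for illustration, and it does nothing about the biases that come with RNA-seq-derived gene lists.

```python
from math import comb

def overlap_pvalue(n_universe, n_regulon, n_de, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    Corresponds to R's
    phyper(n_overlap - 1, n_regulon, n_universe - n_regulon, n_de,
           lower.tail = FALSE)
    """
    max_overlap = min(n_regulon, n_de)
    total = comb(n_universe, n_de)  # all ways to pick the DE list
    tail = sum(
        comb(n_regulon, k) * comb(n_universe - n_regulon, n_de - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

# Hypothetical numbers: 4000 genes tested, a 50-gene regulon,
# 200 differentially expressed genes, 10 of them in the regulon.
print(overlap_pvalue(4000, 50, 200, 10))
```

The universe should be the set of genes that could have been called differentially expressed (those actually tested), not simply every annotated gene.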

However, when your gene list comes from a list of genes significantly up/down regulated by RNA-seq, there are a couple of confounding factors you need to bear in mind. Both are connected to the fact that, in RNA-seq, more reads means higher power to detect a change.

In any RNA-seq experiment, longer genes attract more reads. More reads mean that any difference is more likely to be statistically significant. For example, if a 2kb gene goes from 1TPM to 2TPM, you are more likely to call that significant than if a 1kb gene goes from 1TPM to 2TPM, because the 2kb gene will have a higher read count. If some of your regulons have longer-than-average genes, they are more likely to appear enriched than regulons with short genes.
Genes that are more highly expressed are more likely to be called significant than genes with lower expression. Thus a 1kb gene that goes from 10TPM to 20TPM is more likely to be called significant than a 1kb gene that goes from 1TPM to 2TPM. If you have regulons that are more highly or lowly expressed at baseline, this will skew your results. For example, you could end up in a situation where the genes of a highly expressed regulon all change by 1.1-fold and are called significant, while the genes of another, more lowly expressed regulon change by 2- or even 4-fold and are not.
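
A back-of-the-envelope sketch of why read count drives power: assume, as a simplification, that a gene's read count is proportional to TPM times length and that counts are Poisson, so a change can be scored with a crude z-like statistic (difference over the square root of the sum). The scaling constant and function names are hypothetical, and real differential-expression tools use more elaborate models.

```python
from math import sqrt

# Hypothetical scaling factor; in practice it depends on sequencing depth.
READS_PER_TPM_PER_KB = 20

def expected_count(tpm, length_kb):
    """Expected reads, assuming count is proportional to TPM * length."""
    return tpm * length_kb * READS_PER_TPM_PER_KB

def crude_z(tpm_before, tpm_after, length_kb):
    """Poisson-style score for the change between two conditions."""
    before = expected_count(tpm_before, length_kb)
    after = expected_count(tpm_after, length_kb)
    return (after - before) / sqrt(before + after)

# The same two-fold change scores differently by length and expression.
for label, args in [
    ("1kb gene,  1 ->  2 TPM", (1, 2, 1)),
    ("2kb gene,  1 ->  2 TPM", (1, 2, 2)),
    ("1kb gene, 10 -> 20 TPM", (10, 20, 1)),
]:
    print(label, round(crude_z(*args), 2))
```

Under these assumptions the longer gene and the more highly expressed gene both get a larger score for an identical fold change.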

There are two solutions to these problems. You could use the goseq R package, which was specifically designed to tackle this problem for GO enrichment analysis but can be adapted to use your own gene categories (regulons) rather than GO categories. Or you could use a competitive test rather than a self-contained one. An example of this would be cameraPR from the limma package, making sure that the statistic you provide for each gene is the log fold change, not the p-value.
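
To show the general idea of a competitive test on log fold changes, here is a minimal Python sketch. It is not cameraPR itself (which also adjusts for inter-gene correlation) but a plain Wilcoxon rank-sum comparison of regulon genes against all other genes, using a normal approximation with no tie correction in the variance. Gene names and values are invented.

```python
from math import erfc, sqrt

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def rank_sum_test(logfc, regulon):
    """Return (z, two-sided p): do regulon genes rank unlike the rest?

    Assumes the regulon and its complement are both non-empty.
    """
    genes = list(logfc)
    ranks = average_ranks([logfc[g] for g in genes])
    n1 = sum(g in regulon for g in genes)
    n2 = len(genes) - n1
    rank_sum = sum(r for g, r in zip(genes, ranks) if g in regulon)
    u = rank_sum - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return z, erfc(abs(z) / sqrt(2))

# Invented log fold changes for ten genes; the hypothetical regulon
# contains the three most strongly induced ones.
logfc = {"g1": -0.4, "g2": -0.2, "g3": -0.1, "g4": 0.0, "g5": 0.1,
         "g6": 0.3, "g7": 0.5, "g8": 1.8, "g9": 2.1, "g10": 2.6}
print(rank_sum_test(logfc, {"g8", "g9", "g10"}))
```

Because every gene contributes its log fold change, no significance cutoff is applied, so a gene's length or expression level cannot decide whether it makes it onto the list.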