Using hypergeometric distribution to determine significance of gene overlap in RNA-seq
6.3 years ago
nmac12 • 0

What would be the correct way to determine the significance of gene overlap in RNA-seq? For example, I am comparing my RNA-seq data against known regulons in mycobacteria. These regulons include genes that respond to acid stress, hypoxia, osmolarity, and so on. Would it be correct to use the "phyper" function in R, as described in the post "Probability of gene list overlap"?

Thanks

RNA-Seq R bacteria • 1.9k views

What do you mean by "RNA-seq data" and by "overlap"? Is the RNA-seq data a list of differentially regulated genes? And by overlap, do you mean, say, for a regulon with 10 genes, the fraction of those 10 genes that are in the differential list?


By RNA-seq data, I mean I have a list of genes and their associated change in regulation. Analysis of the raw RNA-seq data is complete; now I need to compare the genes that respond to the environmental stress I induced with other known stress-induced regulons. Also, yes, that is what I mean by overlap. Thanks!

6.3 years ago

This task is basically the same as doing a GO enrichment analysis, and thus the tools that are used for this task are also suitable here.
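For reference, the plain over-representation test asked about in the question can be sketched in base R as below. All counts are made-up placeholders, and the variable names are mine, so treat this as a rough illustration of the `phyper` call rather than a recommended pipeline. The background should be the genes that were actually tested in the RNA-seq analysis.

```r
# Toy numbers, all hypothetical: substitute your own counts.
n_background <- 4000  # genes tested in the RNA-seq experiment (the universe)
n_regulon    <- 50    # regulon genes that are part of that universe
n_de         <- 300   # differentially expressed genes
n_overlap    <- 12    # DE genes that belong to the regulon

# P(X >= n_overlap) under the hypergeometric null.
# phyper() returns P(X <= q) by default, so ask for the upper tail
# starting at n_overlap - 1 to include n_overlap itself.
p_enrich <- phyper(n_overlap - 1, n_regulon, n_background - n_regulon, n_de,
                   lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test on the 2x2 table
# (rows: DE / not DE, columns: in regulon / not in regulon).
tab <- matrix(c(n_overlap, n_regulon - n_overlap,
                n_de - n_overlap,
                n_background - n_regulon - n_de + n_overlap), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(phyper = p_enrich, fisher = p_fisher))
```

If several regulons are tested, the resulting p-values would normally be corrected for multiple testing, for example with `p.adjust(p_values, method = "BH")`.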

However, when your gene list comes from a list of genes significantly up/down regulated by RNA-seq, there are a couple of confounding factors you need to bear in mind. Both are connected to the fact that, in RNA-seq, more reads means more power to detect a change.

  1. In any RNA-seq experiment, longer genes attract more reads. More reads means that any difference is more likely to be statistically significant. For example, if a 2kb gene goes from 1TPM to 2TPM you are more likely to call that significant than if a 1kb gene goes from 1TPM to 2TPM, because the 2kb gene will have a higher read count. If some of your regulons have longer genes than average, they are more likely to come out as enriched than regulons with short genes.

  2. Genes that are more highly expressed are more likely to be called significant than genes with lower expression. Thus a 1kb gene that goes from 10TPM to 20TPM is more likely to be called significant than a 1kb gene that goes from 1TPM to 2TPM. If you have regulons that are more highly or lowly expressed at baseline, this will skew your results. For example, you could end up in a situation where the genes of a highly expressed regulon all change by 1.1-fold and are called significant, while the genes of another, more lowly expressed regulon change by 2- or even 4-fold and are not.
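To make the read-count effect concrete, here is a deliberately oversimplified sketch in base R. It assumes a single Poisson count per condition with no replicates and no overdispersion, which is not how a real differential expression tool models the data. The counts are hypothetical. The point is only that the same 2-fold change gives very different p-values at different count levels.

```r
# Hypothetical read counts: the same 2-fold change at two count levels.
low_counts  <- c(20, 10)    # treated, control for a short or lowly expressed gene
high_counts <- c(200, 100)  # treated, control for a long or highly expressed gene

# With two counts, poisson.test() compares the two rates (null: ratio = 1).
p_low  <- poisson.test(low_counts)$p.value
p_high <- poisson.test(high_counts)$p.value

print(c(low = p_low, high = p_high))
```

In this toy setting the low-count pair does not reach p < 0.05 while the high-count pair does, even though the fold change is identical.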

There are two solutions to these problems. You could use the goseq R package, which was specifically designed to tackle this problem for GO enrichment analysis but could be adapted to use your own gene categories (regulons) rather than GO categories. Or you could use a competitive test rather than a self-contained one. An example of this would be cameraPR from the limma package, making sure that the statistic you provide for each gene is the log fold change, not the p-value.
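As a rough sketch of the second option: the code below runs a rank-based test on simulated log fold changes. The data, the gene and regulon names, and the `rank_test` helper are all hypothetical. The base-R part is a plain Wilcoxon rank-sum comparison of regulon genes against all other genes. It is a simplified stand-in and not cameraPR itself, which additionally allows for inter-gene correlation. The actual `limma::cameraPR` call is included but only runs if limma is installed.

```r
set.seed(1)  # toy data only

# Hypothetical per-gene log2 fold changes for 1000 genes.
logfc <- rnorm(1000, mean = 0, sd = 1)
names(logfc) <- paste0("gene", seq_along(logfc))

# Hypothetical regulons as a named list of gene IDs.
regulons <- list(
  shifted   = names(logfc)[1:40],
  unshifted = names(logfc)[41:80]
)
# Make one regulon respond by shifting its log fold changes upwards.
logfc[regulons$shifted] <- logfc[regulons$shifted] + 1

# Base-R stand-in: do genes in the set rank differently from the rest?
rank_test <- function(genes, stat) {
  in_set <- names(stat) %in% genes
  wilcox.test(stat[in_set], stat[!in_set])$p.value
}
p_rank <- sapply(regulons, rank_test, stat = logfc)
print(p_rank)

# The limma version, run only if the package is available.
if (requireNamespace("limma", quietly = TRUE)) {
  idx <- limma::ids2indices(regulons, names(logfc))
  print(limma::cameraPR(statistic = logfc, index = idx))
}
```

Because the test works on the statistic for every gene rather than on a thresholded list of significant genes, no significance cutoff has to be chosen before testing the regulons.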
