GO enrichment analysis using R
7.4 years ago
rubic ▴ 270

Hi,

I'm trying to run a GO enrichment analysis in R. I'm using the gage package, and the GO terms are downloaded from Ensembl using the biomaRt package. My problem is that I'm getting too many enriched categories, and they're pretty redundant. This is after applying an FDR-adjusted p-value cutoff of 0.05 and only testing GO categories with 10-50 genes, in order to avoid categories that are either too specific or too general.
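
For reference, the two filters described above can be sketched in base R. This is only an illustration under assumptions: the gene sets, term names, and raw p-values are made up, and Benjamini-Hochberg is assumed as the FDR method, which may not match what gage reports internally.

```r
# Made-up gene sets: names are placeholder term IDs, values are gene IDs.
gene_sets <- list(
  term_small  = paste0("gene", 1:5),
  term_medium = paste0("gene", 1:20),
  term_large  = paste0("gene", 1:200)
)

# Keep only sets with 10 to 50 genes before testing.
set_sizes <- lengths(gene_sets)
tested_sets <- gene_sets[set_sizes >= 10 & set_sizes <= 50]

# Made-up raw p-values, adjusted with Benjamini-Hochberg (an assumed FDR method).
raw_p <- c(0.001, 0.02, 0.04, 0.30)
fdr <- p.adjust(raw_p, method = "BH")

# Flag the tests that pass the 0.05 FDR cutoff.
significant <- fdr <= 0.05
```

Filtering by size before testing, rather than after, also reduces the number of tests that the FDR correction has to account for.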

I came across two solutions to this issue:

  1. Cluster the GO terms using pairwise distances between them, which can be obtained with packages such as GOSim via the function getTermSim. However, with a few hundred enriched terms to cluster in order to remove redundancy, getTermSim takes a very long time, which makes this impractical.

  2. Use GO-slim terms. For that I use the GSEABase package with GO-slim files downloaded from geneontology.org, and use them to trim the GO terms downloaded with biomaRt. The problem here is that, at least for human data (which is what I'm analyzing), the GO-slim terms seem a bit too limited to me.
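
The clustering idea in option 1 can be sketched in base R once a similarity matrix is available. Everything here is an assumption for illustration: the term names, similarity values, and p-values are made up, and the 0.5 cut height and average linkage are arbitrary choices rather than recommended settings.

```r
# Toy pairwise similarity matrix (values in [0, 1]) for five enriched terms.
# In practice this would come from a semantic-similarity package.
ids <- c("term_A", "term_B", "term_C", "term_D", "term_E")
sim <- matrix(c(
  1.0, 0.9, 0.2, 0.1, 0.1,
  0.9, 1.0, 0.3, 0.1, 0.2,
  0.2, 0.3, 1.0, 0.8, 0.1,
  0.1, 0.1, 0.8, 1.0, 0.2,
  0.1, 0.2, 0.1, 0.2, 1.0
), nrow = 5, byrow = TRUE, dimnames = list(ids, ids))

# Made-up enrichment p-values for the same terms.
pvals <- c(term_A = 0.001, term_B = 0.0005, term_C = 0.01,
           term_D = 0.02, term_E = 0.03)

# Convert similarity to distance and cluster hierarchically.
hc <- hclust(as.dist(1 - sim), method = "average")

# Cut the tree at distance 0.5 to get groups of closely related terms.
clusters <- cutree(hc, h = 0.5)

# Keep the most significant term from each cluster as its representative.
representatives <- vapply(
  split(names(clusters), clusters),
  function(members) members[which.min(pvals[members])],
  character(1)
)
```

The clustering step itself is fast; the slow part remains computing the similarity matrix that feeds it.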

So my question is whether there's a solution to this, some happy medium?

Is there a precomputed file of all pairwise GO term distances that can be downloaded? That would save calling getTermSim each time I run the script.
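
Short of a ready-made download, a similar effect can be sketched by caching the matrix locally after the first run. The helper `cached_similarity` and the stand-in `slow_similarity` below are hypothetical names, not part of GOSim; the real similarity call would be passed in place of the stand-in.

```r
# Hypothetical helper: compute a pairwise matrix once and reuse it from disk.
# `compute_fun` stands in for whatever slow similarity call is being used.
cached_similarity <- function(go_terms, compute_fun, cache_file) {
  if (file.exists(cache_file)) {
    cached <- readRDS(cache_file)
    # Reuse the cache only if it covers every requested term.
    if (all(go_terms %in% rownames(cached))) {
      return(cached[go_terms, go_terms, drop = FALSE])
    }
  }
  sim <- compute_fun(go_terms)
  saveRDS(sim, cache_file)
  sim
}

# Stand-in for the slow computation, counting how often it is called.
n_calls <- 0
slow_similarity <- function(go_terms) {
  n_calls <<- n_calls + 1
  m <- diag(length(go_terms))
  dimnames(m) <- list(go_terms, go_terms)
  m
}

cache_file <- tempfile(fileext = ".rds")
go_terms <- c("term_A", "term_B", "term_C")

# The first call computes and stores; the second is served from the cache.
first  <- cached_similarity(go_terms, slow_similarity, cache_file)
second <- cached_similarity(go_terms[1:2], slow_similarity, cache_file)
```

This sketch recomputes everything when a new term appears; computing only the missing rows would be a natural refinement.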

GO GO-slim enrichment-analysis R • 28k views

I usually find that topGO does a good job of reducing the excessive redundancy among GO terms. It also often reports medium-sized categories as the most significant ones.

7.4 years ago
Guangchuang Yu ★ 2.6k

Maybe you can try clusterProfiler, which can do GO enrichment analysis using either a hypergeometric test or GSEA.

It can simplify the result by removing highly similar terms, with the similarity calculated by GOSemSim.
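
To illustrate what removing highly similar terms means, here is a base R sketch of one greedy approach. It is not clusterProfiler's actual code: the helper `remove_redundant` is hypothetical, the similarity matrix and p-values are made up, and the 0.7 cutoff is an assumed value.

```r
# Hypothetical helper: greedily keep the most significant terms, dropping any
# term whose similarity to an already kept term exceeds `cutoff`.
remove_redundant <- function(sim, pvals, cutoff = 0.7) {
  ordered <- names(sort(pvals))   # most significant first
  kept <- character(0)
  for (term in ordered) {
    if (length(kept) == 0 || all(sim[term, kept] <= cutoff)) {
      kept <- c(kept, term)
    }
  }
  kept
}

# Made-up similarity matrix and p-values for four placeholder terms.
ids <- c("term_A", "term_B", "term_C", "term_D")
sim <- matrix(c(
  1.0, 0.8, 0.3, 0.2,
  0.8, 1.0, 0.4, 0.1,
  0.3, 0.4, 1.0, 0.9,
  0.2, 0.1, 0.9, 1.0
), nrow = 4, byrow = TRUE, dimnames = list(ids, ids))
pvals <- c(term_A = 0.004, term_B = 0.001, term_C = 0.02, term_D = 0.03)

kept <- remove_redundant(sim, pvals)
```

With these toy values, term_A is dropped because it is too similar to the more significant term_B, and term_D is dropped in favor of term_C.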


But like GOSim, clusterProfiler generates a pairwise semantic distance matrix, which takes a very long time.


It should produce output in a reasonable time.

7.4 years ago

My problem is that I'm getting too many enriched categories and they're pretty redundant.

A third solution could be to filter out enriched GO categories based on:

  • p-value (be more stringent)
  • number of genes in the category (very big groups are often not very informative - yes I'm talking to you "cellular process")
  • minimum number of your genes found in the category (sometimes a category is reported as significant with just one enriched gene in it, especially if the category is very small)
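
These three filters could be applied to a results table roughly as follows. This is a sketch under assumptions: the data frame, its column names, and the thresholds are all made up for illustration, and the helper `filter_enrichment` is hypothetical.

```r
# Made-up enrichment results; column names are assumptions for illustration.
results <- data.frame(
  term     = c("term_A", "term_B", "term_C", "term_D"),
  p_adjust = c(0.001, 0.04, 0.0005, 0.002),
  set_size = c(25, 30, 1500, 12),  # genes annotated to the category
  n_hits   = c(6, 5, 40, 1),       # genes from the input list in the category
  stringsAsFactors = FALSE
)

# Hypothetical helper applying a stricter p-value cutoff, a maximum category
# size, and a minimum number of input genes per category.
filter_enrichment <- function(res, max_p = 0.01, max_size = 500, min_hits = 3) {
  keep <- res$p_adjust <= max_p &
    res$set_size <= max_size &
    res$n_hits >= min_hits
  res[keep, , drop = FALSE]
}

filtered <- filter_enrichment(results)
```

With these toy values only term_A survives: term_B fails the p-value cutoff, term_C is too large, and term_D has a single hit.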

Thanks for the response. I'm actually already applying these filters - just updated that in my post.
