Question

Some clarification on enrichment analyses and pathway analyses?

6.4 years ago

kirannbishwa01 ★ 1.6k

I have analyzed my RNA-seq data and identified the genes with significant fold changes and p-values. The next step is to do enrichment analyses.

I have been doing some extensive reading for the last two weeks and have tried some analyses using command-line tools, but things are getting confusing because so many different packages and references are available. What is missing is a comprehensive, conceptual tutorial on how and why to do things in enrichment analyses.

Just a few questions:

Should I only select significant genes for my enrichment analyses, pathway analyses? Why, why not?
I have found several tutorials on DESeq/2, but I am not finding any one that gives a clean and comprehensive view on how to further process the data for downstream enrichment and visualization?
What is the difference between doing GO enrichment by CC vs. BP vs MF?
What is the difference between GO vs KEGG?
I am working with a non-model organism: in that case, is it best to do these analyses by matching the gene ID/name of my organism to the ortholog gene ID/name of a model organism? This may or may not be a good idea because certain pathways might differ between organisms, but is there any proposed solution?

Any ideas please.

pathview R gene-enrichment RNA-Seq • 4.8k views

updated 11 months ago by Ram • written 6.4 years ago by kirannbishwa01

score 12 · Accepted Answer · 2017-11-12

Well, gene enrichment (or 'gene-set enrichment analysis'; GSEA) is one of those things on which everyone has their own take, i.e., opinion. I've met everyone from people who don't even want to hear about it to those who apparently idolise it. The careful way that you've written your question tells me that you're somewhere between these two extremes.

The first thing to consider is that gene enrichment is an in silico analysis, but many of the enrichment terms are based on curated datasets. For the Gene Ontology terms, for example, each and every term has an assigned evidence code, which can be taken into account when interpreting a particular enrichment. Take a look at my answer here: A: Go annotation reliability ?
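As a rough illustration of taking evidence codes into account, the sketch below keeps only annotations backed by experimental evidence. The code names (EXP, IDA, IPI, IMP, IGI, IEP, and IEA for electronic annotation) are GO evidence codes; the gene names, term labels, and the `filter_by_evidence` helper are made up for this example and do not come from any particular package.

```python
# GO evidence codes that denote experimental support.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def filter_by_evidence(annotations, allowed=EXPERIMENTAL_CODES):
    """Keep only (gene, term, evidence_code) tuples whose code is in `allowed`."""
    return [a for a in annotations if a[2] in allowed]


# Toy annotations: gene names and term labels are hypothetical.
toy_annotations = [
    ("geneA", "term_1", "IDA"),  # inferred from direct assay
    ("geneB", "term_1", "IEA"),  # inferred from electronic annotation
    ("geneC", "term_2", "IMP"),  # inferred from mutant phenotype
]

curated = filter_by_evidence(toy_annotations)
print(curated)
```

Whether to drop electronically inferred annotations is a judgment call: for a poorly studied species they may be most of what is available.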

Should I only select significant genes for my enrichment analyses, pathway analyses? Why, why not?

The general idea of gene enrichment is that you have identified a group of genes as being statistically significantly associated with a particular condition and that you want to learn more about the potential functions, processes, pathways, et cetera, that may be altered as a result. Thus, it does not make much sense to perform the enrichment on non-significant genes.
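To make the idea concrete, enrichment on a list of significant genes is commonly framed as an over-representation test: given how many genes in the tested background belong to a term, is the overlap with the significant list larger than chance would give? Here is a minimal sketch using a one-sided hypergeometric test. The function name and the toy numbers are assumptions for illustration, and real tools differ in how they define the background.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_in_set, n_selected, n_overlap):
    """One-sided hypergeometric p-value, P(overlap >= n_overlap).

    n_universe: number of genes tested (the background)
    n_in_set:   background genes annotated to the term
    n_selected: number of significant genes
    n_overlap:  significant genes annotated to the term
    """
    total = comb(n_universe, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 20 genes tested, 5 annotated to the term,
# 4 significant genes, 3 of which carry the annotation.
p = hypergeom_enrichment_p(20, 5, 4, 3)
print(p)
```

The background should normally be the genes that were actually tested in the experiment, not the whole genome, otherwise the p-values tend to look better than they should.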

Edit: 11th January 2019: some programs can specifically take all genes in your dataset, perform enrichment, and then determine degree/level of enrichment by utilising the p-values and fold-changes. These methods are more powerful, I feel.
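For intuition about these all-genes methods, here is a simplified, unweighted running-sum score over a ranked gene list (ranked, say, by fold change or a signed p-value statistic). It is a sketch of the general idea only, not the algorithm of any specific program: published methods typically weight hits by the ranking statistic and assess significance by permutation. The function name and toy gene names are hypothetical.

```python
def running_sum_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up at each gene in the set and
    down at each gene outside it; return the largest deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some but not all ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranking, most up-regulated first; gene names are made up.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_score = running_sum_score(ranked, {"g1", "g2"})     # set sits at the top
bottom_score = running_sum_score(ranked, {"g5", "g6"})  # set sits at the bottom
print(top_score, bottom_score)
```

A positive score means the set is concentrated near the top of the ranking and a negative score near the bottom; no significance cut-off on individual genes is needed.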

I have found several tutorials on DESeq/2, but I am not finding any one that gives a clean and comprehensive view on how to further process the data for downstream enrichment and visualization?

You will never find a 'clean and comprehensive' tutorial - everyone has their own take on it. DESeq2 is excellent at conducting analyses of [primarily] RNA-seq data but it's not a gene enrichment program.

What is the difference between doing GO enrichment by CC vs. BP vs MF?

CC, cellular component
BP, biological process
MF, molecular function

Think of these as sub-classifications. Each of these contains thousands of terms that are organised in a hierarchical fashion. Most people will be interested in just BP and MF.
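The hierarchy matters in practice because a gene annotated to a specific term is implicitly annotated to that term's ancestors too, which is one reason enrichment results often list several closely related parent and child terms. A small sketch of walking up a term hierarchy follows; the term names and the `toy_parents` structure are invented for illustration, and real GO terms can have more than one parent.

```python
def ancestors(term, parents):
    """Return every term reachable by following parent links upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy hierarchy (hypothetical labels): child -> list of parents.
toy_parents = {
    "specific_term": ["broader_term", "other_broader_term"],
    "broader_term": ["root_term"],
    "other_broader_term": ["root_term"],
    "root_term": [],
}

print(sorted(ancestors("specific_term", toy_parents)))
```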

What is the difference between GO vs KEGG?

These are different organisations/groups.

The Gene Ontology (GO) Consortium is based in the USA and is funded by the NHGRI. The consortium has been in existence for almost 20 years and its aim is to define natural/healthy biological processes, molecular functions, and components (as per the sub-classifications mentioned above). Their gene enrichment categories and terms are based on either in silico or confirmed laboratory evidence (as per the evidence codes that I mentioned above).
The Kyoto Encyclopaedia of Genes and Genomes (KEGG) is a consortium based in Japan. It has been in existence slightly longer than GO and is most recognised for the curation of pathways in human and other species. KEGG covers a lot of things other than pathways, though. Also, KEGG covers both normal/healthy and disease-related pathways.

NB - it's important to remember that some GO terms relate to pathways too.

I am working with a non-model organism: in that case, is it best to do these analyses by matching the gene ID/name of my organism to the ortholog gene ID/name of a model organism? This may or may not be a good idea because certain pathways might differ between organisms, but is there any proposed solution?

If you use an enrichment tool like DAVID, your species of interest is most likely included. In addition, with DAVID you can do enrichment on both GO and KEGG (and other databases) at the same time. On DAVID's main page, go to Functional Annotation and there you'll see a text box where you can input your genes.
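If your species turns out not to be covered and you do fall back on orthologs, it is worth tracking how many genes fail to map, since a low mapping rate weakens any enrichment result. The sketch below assumes you already have a one-to-one ortholog table from some source; the gene IDs, the table, and the `map_to_orthologs` helper are all hypothetical.

```python
def map_to_orthologs(gene_ids, ortholog_table):
    """Split genes into those with a model-organism ortholog and those without."""
    mapped = {}
    unmapped = []
    for gene in gene_ids:
        if gene in ortholog_table:
            mapped[gene] = ortholog_table[gene]
        else:
            unmapped.append(gene)
    return mapped, unmapped


# Hypothetical IDs: left side is the non-model species, right side the model species.
toy_orthologs = {"myorg_0001": "MODEL_A", "myorg_0002": "MODEL_B"}
significant = ["myorg_0001", "myorg_0002", "myorg_0003"]

mapped, unmapped = map_to_orthologs(significant, toy_orthologs)
mapping_rate = len(mapped) / len(significant)
print(mapped, unmapped, round(mapping_rate, 2))
```

One-to-many and many-to-one ortholog relationships are common in real data and need an explicit decision (keep all, keep the best hit, or drop); this sketch ignores them.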

My advice to you is to do the enrichment but to be cautious about the interpretation of the results. It is quite easy to 'cherry pick' the enrichment terms that you want to see, i.e., those that fit your hypothesis(es). Even if you get lucky and everything that you had hoped for comes up, I would still exercise caution. Don't get too excited by gene enrichment.
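One guard against cherry-picking is to decide on a multiple-testing-adjusted cut-off before looking at the term names. The sketch below implements the Benjamini-Hochberg adjustment by hand on toy p-values. It assumes that the adjusted values reported by enrichment tools under a 'Benjamini' heading follow this procedure; check the documentation of the tool you actually use.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values for four enrichment terms (made-up numbers).
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
kept = [i for i, p in enumerate(adj_p) if p < 0.05]
print(adj_p, kept)
```

In this toy case only the first term survives a 0.05 cut-off after adjustment, although three of the four raw p-values are below 0.05.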

In terms of filtering enriched terms, if you use DAVID, you can filter enrichment terms based on a Benjamini P value. In terms of displaying gene enrichments, I would recommend simple displays like these: