Question

Transcript-Level Versus Gene-Level GO Enrichment Analysis (For Non-Model Organism)

12


11.5 years ago

dsbreak ▴ 160

I have a basic question about which test/reference sets can be used for GO enrichment analysis. All of the studies I come across ask whether certain gene subsets are enriched for a GO term. Is it appropriate to ask whether a transcript subset is enriched? Or would that skew the statistics for/against genes with multiple isoforms?

I ask because I am working with a non-model organism (i.e., I need to do my own GO annotation) and would like to know whether the genes/transcripts that are differentially expressed between two conditions are enriched for specific GO terms. I have a draft genome, a draft transcriptome (annotated using blast2go), and mRNA-Seq data. However, I find several cases where a gene with multiple isoforms has different GO terms associated with each isoform.

My specific questions:

1. Is it appropriate to do transcript-level GO enrichment analysis?
2. Any references to studies that have done this successfully before?
3. Alternatively, I could run a gene-level analysis if someone could suggest how to "collapse" different isoforms into a single sequence for use as input for blast2go :)

go enrichment transcript isoform • 12k views

updated 11.5 years ago by Obi Griffith 20k • written 11.5 years ago by dsbreak ▴ 160

score 15 · Answer 1 · 2012-10-28

I think you answered your own question when you observed that many genes have multiple transcript/protein isoforms, each with different GO annotations. This is because the Gene Ontology attaches terms from its three ontologies (molecular function, biological process, and cellular component) to gene products, not genes. In other words, terms are associated with specific protein isoforms.

In many cases people have information only at the gene-locus level (e.g., their expression arrays don't do a good job of measuring specific transcripts), or they have transcript-level data but map those transcripts to the gene locus rather than to the protein isoform. However, if you do have good transcript-level data, I would argue that it is better to map each transcript to its corresponding protein isoform (e.g., UniProt) and use that as input for your Gene Ontology analysis.

Most GO over-representation software will allow you to upload your own "total/complete" list from which your protein subset was derived. Because the subset and the background are then counted in the same units, this will prevent the skewing of statistics that you are quite wisely concerned about. As an illustrative example, check out DAVID. Choose their 'Functional Annotation' (gene-annotation enrichment analysis) tool and you will see that you can upload many different types of transcript IDs or protein IDs for both your "gene list" and your "background" list. Running their statistics will tell you which GO terms are over-represented in your subset of transcript/protein IDs relative to the total/background list. Most GO enrichment tools follow this pattern. You can explore a list maintained by GO here.

All of this was a really long way of answering your first question: YES, it is appropriate to do transcript-level GO enrichment analysis.

For your second question, there must be many references for this. Unfortunately, it is so common now that most people don't really explain what they have done in their publications.

For your third question, given the above, I would not "collapse" different isoforms.
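To make the subset-versus-background idea concrete, here is a minimal sketch in Python. It is not what DAVID or any other tool does internally; it only assumes the usual one-sided hypergeometric test for over-representation. The function names, transcript IDs, and GO IDs are all made-up placeholders, and it does not propagate annotations to parent terms or correct for multiple testing, both of which a real analysis would need.

```python
from math import comb  # requires Python 3.8+


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n items without replacement from N items,
    K of which carry the GO term."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def go_enrichment(subset, background_annotations):
    """Return {GO term: p-value} for terms seen at least once in the subset.

    background_annotations: dict mapping every transcript in the background
        (the "total/complete" list) to its set of GO term IDs.
    subset: transcript IDs of interest, e.g. differentially expressed ones.
    """
    background = set(background_annotations)
    subset = set(subset) & background  # ignore IDs outside the background
    N, n = len(background), len(subset)

    # Invert the mapping: GO term -> transcripts annotated with it.
    term_to_transcripts = {}
    for transcript, terms in background_annotations.items():
        for term in terms:
            term_to_transcripts.setdefault(term, set()).add(transcript)

    pvalues = {}
    for term, members in term_to_transcripts.items():
        k = len(members & subset)
        if k > 0:
            pvalues[term] = hypergeom_upper_tail(k, n, len(members), N)
    return pvalues


# Toy data (placeholder IDs): 10 background transcripts, 3 in the subset.
annotations = {f"tx{i}": set() for i in range(1, 11)}
for tid in ("tx1", "tx2", "tx3"):
    annotations[tid].add("GO:0000001")
for tid in ("tx3", "tx4", "tx5", "tx6", "tx7"):
    annotations[tid].add("GO:0000002")

pvalues = go_enrichment(["tx1", "tx2", "tx3"], annotations)
for term, p in sorted(pvalues.items(), key=lambda item: item[1]):
    print(term, round(p, 4))
```

The point to notice is that the subset and the background are counted in the same units (transcripts), which is what keeps multi-isoform genes from distorting the counts. Whether transcripts with no GO annotation belong in the background is a separate choice; the sketch keeps them.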

Not having a model organism creates a lot more challenges. I've never worked with blast2go, but I suppose that if you have a complete set of transcripts and get functional annotations for many of them from blast2go, you should be able to build your own transcript-annotation database and use it for over-representation analysis of transcript subsets versus the total list. This will likely require custom analysis as opposed to tools like DAVID. I suggest you investigate Bioconductor packages like GOstats. They actually have a short vignette for your situation. This thread looks really helpful for someone trying to figure that vignette out for the first time.
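As a rough illustration of the "build your own transcript-annotation database" step, the sketch below reads a two-column, tab-separated table of transcript ID and GO ID into a Python dictionary. The file layout is an assumption made for the example, not blast2go's actual export format, so the parsing would need adjusting to whatever your export looks like. The function name and all IDs are placeholders.

```python
import csv
import io


def load_annotations(handle):
    """Read 'transcript_id<TAB>GO_id' lines into {transcript_id: {GO_id, ...}}.

    Assumed layout: one annotation per line, with lines starting with '#'
    treated as comments. Each isoform keeps its own annotation set.
    """
    annotations = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, malformed, or comment lines
        transcript_id, go_id = row[0].strip(), row[1].strip()
        annotations.setdefault(transcript_id, set()).add(go_id)
    return annotations


# In-memory stand-in for a real file; with a file on disk you would use
# open("annotations.tsv", newline="") instead (hypothetical file name).
example = io.StringIO(
    "# transcript\tGO term\n"
    "gene1.iso1\tGO:0000001\n"
    "gene1.iso1\tGO:0000002\n"
    "gene1.iso2\tGO:0000001\n"
    "gene2.iso1\tGO:0000003\n"
)
annotations = load_annotations(example)
print(annotations)
```

The result is keyed by transcript, so two isoforms of the same gene can carry different GO terms, and the full set of keys is the "total" list that any subset gets compared against.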