Question

Ortholog in superfamily with promiscuous domain

8.4 years ago

Anand Rao ▴ 630

Background: I have ~18K proteins from 30+ plant species. The only thing they have in common is a particular domain of ~50 aa, which itself varies in length, so a domain-based phylogeny is ruled out. The domain is also very promiscuous, i.e. it occurs in many different domain combinations.

For ortholog inference, I binned the proteins by their protein domain architectures (PDAs) and ran OrthoFinder within each bin. OrthoFinder clusters BLAST hits with the standard MCL software, aligns the putative orthologs with MAFFT using the L-INS-i option, and returns a tree for each group built with FastTree (an approximate ML method).
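To make the binning step concrete, here is a minimal sketch of grouping proteins by PDA. It assumes the domain hits have already been parsed into a dictionary of protein ID to (start position, domain name) pairs; the function name, the input layout, and the toy domain names are all hypothetical, not part of OrthoFinder or any domain-annotation tool.

```python
from collections import defaultdict


def bin_by_architecture(domain_hits):
    """Group protein IDs by their ordered domain architecture.

    domain_hits: dict mapping protein ID -> list of (start, domain_name).
    Returns a dict mapping architecture tuple -> list of protein IDs.
    """
    bins = defaultdict(list)
    for protein_id, hits in domain_hits.items():
        # Order domains N- to C-terminal by start position.
        architecture = tuple(name for _, name in sorted(hits))
        bins[architecture].append(protein_id)
    return dict(bins)


# Hypothetical toy input: "DomX" stands in for the promiscuous domain.
example_hits = {
    "protA": [(120, "Kinase"), (10, "DomX")],
    "protB": [(15, "DomX"), (140, "Kinase")],
    "protC": [(30, "DomX")],
}
pda_bins = bin_by_architecture(example_hits)
```

Each bin's members would then be written to their own FASTA file before the OrthoFinder run.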

Problem: Even within each PDA bin, the proteins vary considerably in length (due to recombination events). I think this is why the alignments returned by OrthoFinder are extremely gappy, which suggests that proteins of disparate lengths, and therefore probably of different evolutionary origins, are being incorrectly placed in the same orthologous group. Is there a way to prevent or compensate for this?
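One way to put a number on "extremely gappy" is to compute, for each orthogroup alignment, the overall gap fraction and the spread of ungapped sequence lengths. The sketch below assumes the alignment has already been read into a list of equal-length strings with "-" as the gap character; the function names and the toy alignment are illustrative only, and any cutoff for flagging a group would need to be chosen empirically.

```python
from statistics import mean, pstdev


def gap_fraction(aligned_seqs):
    """Fraction of all alignment cells that are gap characters."""
    total = sum(len(s) for s in aligned_seqs)
    gaps = sum(s.count("-") for s in aligned_seqs)
    return gaps / total if total else 0.0


def length_cv(aligned_seqs):
    """Coefficient of variation of the ungapped sequence lengths."""
    lengths = [len(s.replace("-", "")) for s in aligned_seqs]
    avg = mean(lengths)
    return pstdev(lengths) / avg if avg else 0.0


# Hypothetical toy alignment of three sequences.
toy_alignment = ["MKT--LLA", "MKTAALLA", "M------A"]
toy_gap_fraction = gap_fraction(toy_alignment)
toy_length_cv = length_cv(toy_alignment)
```

Orthogroups with both a high gap fraction and a high length variation would be the first candidates for manual inspection or re-clustering.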

My specific questions are:

Should I pre-process my dataset differently before the OrthoFinder step? Should that involve PDA-based binning or not? For example, should I cluster the proteins by length, perhaps using UCLUST, and if so, before or after the PDA-based binning?
I have not used OrthoMCL, so I wonder whether it would suffer from the same sort of misclassification.
And/or should I post-process my alignments instead, using something like Gblocks?
OrthoFinder uses MAFFT L-INS-i (probably most accurate; recommended for <200 sequences; iterative refinement method incorporating local pairwise alignment information). Perhaps I should force OrthoFinder to use E-INS-i (suitable for sequences containing large unalignable regions; recommended for <200 sequences) instead?
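For the last point, one option that does not depend on OrthoFinder's own settings is to re-align each orthogroup's sequences outside OrthoFinder. The sketch below builds the MAFFT command line for either strategy, using the long-form flags that MAFFT documents as equivalent to L-INS-i and E-INS-i. The helper names and file paths are hypothetical, and `mafft` is assumed to be on the PATH.

```python
import subprocess

# Long-form equivalents of the MAFFT strategies, per the MAFFT manual.
MAFFT_MODES = {
    "linsi": ["--localpair", "--maxiterate", "1000"],
    "einsi": ["--genafpair", "--maxiterate", "1000", "--ep", "0"],
}


def mafft_command(fasta_path, mode="einsi"):
    """Return the argument list for aligning fasta_path with the given mode."""
    return ["mafft", *MAFFT_MODES[mode], fasta_path]


def realign(fasta_path, out_path, mode="einsi"):
    """Run MAFFT and write the alignment (MAFFT prints it to stdout)."""
    with open(out_path, "w") as out:
        subprocess.run(mafft_command(fasta_path, mode), stdout=out, check=True)


# Hypothetical orthogroup file name, for illustration only.
example_command = mafft_command("orthogroup_0001.fa", mode="einsi")
```

Calling `realign("orthogroup_0001.fa", "orthogroup_0001.einsi.aln")` would then produce an E-INS-i alignment that can be compared against the L-INS-i one for the same group.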

The short version of my question:

For a short, promiscuous domain that is itself of variable length, found in plant proteins that also vary in length and domain architecture, what would be the best bioinformatics pipeline to infer orthologs accurately?

length-variation ortholog • 2.0k views

Updated 20 months ago by Ram • written 8.4 years ago by Anand Rao

Ram · Answer 1 · 2015-12-05

The thread "What Is The Best Method To Find Orthologous Genes Of A Species?" asks a more general question about finding orthologs. Here, BLAST- and alignment-based methods are going to suck. You likely want to use a phylogeny-based method. For that you need a species tree for your species (which you could build from another, unambiguous set of orthologs).

I would start small, with an evolutionarily representative subset of your species, and try to roughly define groups of orthologous proteins. Take a look at the introduction of http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0053786. You could then try building an HMM for each group and searching all your species with it to define clusters. Hopefully your groups will be different enough from one another.
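As a rough sketch of the HMM step: assuming one profile per group has been built with HMMER's `hmmbuild` and searched against all proteins with `hmmsearch --tblout`, each protein can be assigned to the group whose HMM gives it the best full-sequence score. The function name, the E-value cutoff, and the toy table below are assumptions for illustration; the toy rows are trimmed to the first seven columns of a real tblout file (target name, accession, query name, accession, E-value, score, bias).

```python
def best_group_per_protein(tblout_text, max_evalue=1e-5):
    """Assign each protein to the group HMM with the highest full-sequence score.

    Expects hmmsearch --tblout style rows, where the target is the protein
    and the query is the group HMM. The cutoff is an arbitrary example value.
    """
    best = {}
    for line in tblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        protein, group = fields[0], fields[2]
        evalue, score = float(fields[4]), float(fields[5])
        if evalue > max_evalue:
            continue
        if protein not in best or score > best[protein][1]:
            best[protein] = (group, score)
    return {protein: group for protein, (group, _) in best.items()}


# Hypothetical, trimmed toy table combining hits from two group HMMs.
toy_tblout = """\
# target name  accession  query name  accession  E-value  score  bias
protA - groupA - 1e-30 100.2 0.1
protA - groupB - 1e-12 45.0 0.0
protB - groupB - 2e-20 70.5 0.2
protC - groupA - 0.5 5.0 0.0
"""
assignments = best_group_per_protein(toy_tblout)
```

Proteins that pass the cutoff for several groups with similar scores are worth flagging rather than silently assigning, since they may indicate that the groups are not yet different enough from one another.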

You can then try to refine, adding more species and comparing small relevant groups in closely related species. You could then use OrthoMCL and others, but never on your whole dataset.

Just make sure you are not redoing a job that has already been done: take a look at Ensembl Compara and similar resources. Then, if you still have no results, you will likely need to create an original workflow.