Metrics to evaluate tree congruence
7.3 years ago
fhsantanna ▴ 610

I have multiple phylogenetic trees of different marker genes. Each one contains the same organisms. I would like to assess the congruence of these trees in a pairwise fashion (ideally I would end up with a matrix of congruence values). Since I have about 100 trees, I do not want to do it by eye. It would be great if I could evaluate it with some kind of "congruence metric". For example, I would like to know whether a gyrB gene tree is more congruent with a 16S rRNA gene tree than a recA gene tree is. Do you know of such metrics? Which software do you recommend?

phylogeny congruence software • 2.6k views
7.3 years ago
Joe 21k

Here's a partial solution you might be able to run with: I did something like this recently, though I did do the first step 'by eye'. I clustered my trees by eye (there are some tools like TOPD that will do it, but I don't know how good they are), and I got a couple of other unbiased people to corroborate my cluster estimates.

I created a score matrix like so (this is shortened):

Gene      Tree1  Tree2  Tree3  Tree4  Tree5  Tree6  Tree7  Tree8  Tree9
PAU_pnf   1      1      1      1      1      1      1      1      1
PAK_pnf   1      1      1      1      1      1      1      1      1
PAU_cif   2      2      2      2      2      2      2      2      2
PAK_cif   2      2      2      2      2      2      2      2      2
PLT_cif   2      7      8      2      2      6      2      2      2
PAU_lopT  3      3      9      3      3      1      3      3      3
PAK_lopT  3      3      10     3      3      8      3      3      3
PLT_lopT  3      3      2      3      3      5      3      3      3
PAU_U4    4      4      4      4      4      4      4      4      4
PLT_U4    4      4      4      4      5      4      5      4      4
PAK_U2    4      6      4      4      4      4      6      4      4

I.e. cluster number 1 is arbitrarily assigned to the node that joins PAU_pnf and PAK_pnf in my dataset. This node persists across all my gene trees here.

Then take that matrix and use the Adjusted Wallace Test described here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3209087/. There is MATLAB code for this, though I simply ran it through their webserver here: http://www.comparingpartitions.info/index.php?link=Tool

It's a reasonably robust metric for comparing Sequence Types and can be repurposed for congruency :)

That spits you back out a matrix of congruency across each set of clusters/trees: https://s30.postimg.org/jpp2y1969/Screen_Shot_2016_05_14_at_13_42_18.png

Which I then replotted with ggplotly in RStudio:

Voila: https://s30.postimg.org/uo0cg7xrl/heatmap2k_transp1.png
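If you would rather script the Adjusted Wallace step than use the webserver, here is a minimal Python sketch. It assumes the usual definition, AW(A→B) = (W - Wi) / (1 - Wi), where W is the Wallace coefficient and Wi is its expected value under independence (one minus Simpson's index of diversity of B). The function names are mine, not from the paper or the webserver, and I have not checked the output against the webserver's.

```python
from collections import Counter


def same_cluster_pairs(counts):
    """Number of item pairs that share a cluster, given the cluster sizes."""
    return sum(c * (c - 1) // 2 for c in counts)


def adjusted_wallace(a, b):
    """Adjusted Wallace coefficient AW(A -> B) for two cluster assignments.

    `a` and `b` are equal-length sequences of cluster labels, one per gene
    (i.e. two columns of the score matrix). Hypothetical helper, not the
    published implementation.
    """
    if len(a) != len(b):
        raise ValueError("both partitions must label the same items")
    n = len(a)
    total_pairs = n * (n - 1) // 2
    pairs_a = same_cluster_pairs(Counter(a).values())
    pairs_b = same_cluster_pairs(Counter(b).values())
    pairs_ab = same_cluster_pairs(Counter(zip(a, b)).values())
    # Undefined when A has no co-clustered pairs or B puts everything together.
    if pairs_a == 0 or pairs_b == total_pairs:
        return float("nan")
    wallace = pairs_ab / pairs_a          # W(A -> B)
    expected = pairs_b / total_pairs      # Wi = 1 - SID(B)
    return (wallace - expected) / (1 - expected)


# Columns Tree1 and Tree2 from the score matrix above.
tree1 = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
tree2 = [1, 1, 2, 2, 7, 3, 3, 3, 4, 4, 6]

print(adjusted_wallace(tree1, tree2))
print(adjusted_wallace(tree2, tree1))
```

Note that the coefficient is directional: for these two columns the sketch gives 27/49 (about 0.55) for Tree1→Tree2 but 1.0 for Tree2→Tree1, because every pair that Tree2 clusters together is also together in Tree1. Looping over all pairs of columns fills the full matrix.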


Interesting idea! The only problem for me is building the score matrix "by eye"... I will take a look at TOPD (nice suggestion).


Interesting, I'd not found that. That's useful. I had originally intended to use Prunier to infer lateral gene transfers from the phylip alignments in conjunction with the gene trees (I had already used ASTRAL to create the species tree to which I was comparing everything). I just could not get it to work in the end for some reason, and never got a reply from the developers :(

I'd add that I'd be interested in hearing about any other suggestions people have for decent (and ideally easy to use) congruency analysers.


Another option: http://phylo.io/

7.3 years ago
jhc ★ 3.0k

Perhaps normalized Robinson-Foulds distances would help here. Take a look at the ete-compare tool too. It would allow you to compute all those distances very easily from the command line.
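To make the metric concrete, here is a small pure-Python sketch of a normalized Robinson-Foulds distance. It is not how ete-compare is implemented; it assumes trees are given as nested tuples of leaf names (a made-up representation, so you would still need a Newick parser in practice), treats them as unrooted, and normalizes by the total number of non-trivial splits in both trees.

```python
def leaf_set(tree):
    """All leaf names under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial splits of an unrooted tree, each stored as the side
    that does NOT contain an arbitrary reference leaf."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def normalized_rf(tree_a, tree_b):
    """Robinson-Foulds distance divided by its maximum for these two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must contain the same leaves")
    splits_a, splits_b = bipartitions(tree_a), bipartitions(tree_b)
    max_rf = len(splits_a) + len(splits_b)
    if max_rf == 0:
        return 0.0
    return len(splits_a ^ splits_b) / max_rf


# Two hypothetical five-taxon trees that differ in one internal branch.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
print(normalized_rf(tree_a, tree_b))
```

A value of 0 means identical topologies and 1 means no shared internal splits. Branch lengths are ignored, so this only measures topological congruence.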

7.3 years ago
apa@stowers ▴ 600

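The cophenetic correlation coefficient can be used for that purpose. For example, in R, if your trees are in Newick format then the "ape" package should be able to read them (this gives objects of class "phylo"; "cophenetic" also works on "hclust" and "dendrogram" objects). Then compute something like "cor( as.dist(cophenetic(tree1)), as.dist(cophenetic(tree2)) )", after making sure both distance matrices list the tips in the same order, and use the result to populate a pairwise matrix. Note that calling "cor" directly on two full matrices returns a column-by-column correlation matrix rather than a single value. In Matlab, the function would be "cophenet".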

Basically, you regenerate the sample distance matrices from the branch lengths, linearize, and correlate. Any pair of trees which encodes the same dendrogram distances between genes will have an R value of 1. From simulations, R <= 0.75 generally indicates unrelated trees.
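The "regenerate, linearize, correlate" recipe can be sketched in plain Python as well. This is only an illustration under my own assumptions: a tree is a tuple of (child, branch_length) pairs with leaves as strings (a made-up representation, not what ape uses), and the helper names are hypothetical.

```python
import math
from itertools import combinations


def patristic(tree):
    """Pairwise leaf-to-leaf path lengths, keyed by frozenset({leaf1, leaf2})."""
    dist = {}

    def visit(node):
        # Returns {leaf: distance from this node down to the leaf}.
        if isinstance(node, str):
            return {node: 0.0}
        per_child = [
            {leaf: d + length for leaf, d in visit(child).items()}
            for child, length in node
        ]
        # Leaves in different children are joined through this node.
        for left, right in combinations(per_child, 2):
            for a, da in left.items():
                for b, db in right.items():
                    dist[frozenset((a, b))] = da + db
        merged = {}
        for depths in per_child:
            merged.update(depths)
        return merged

    visit(tree)
    return dist


def pearson(x, y):
    """Pearson correlation; fails if either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def cophenetic_correlation(tree_a, tree_b):
    """Correlate the linearized distance matrices of two trees."""
    d_a, d_b = patristic(tree_a), patristic(tree_b)
    if set(d_a) != set(d_b):
        raise ValueError("trees must contain the same leaves")
    pairs = list(d_a)  # same pair order for both vectors
    return pearson([d_a[p] for p in pairs], [d_b[p] for p in pairs])


# Hypothetical three-taxon trees: same shape, then A/B vs A/C grouped.
tree_a = (((("A", 1.0), ("B", 1.0)), 1.0), ("C", 3.0))
tree_b = (((("A", 2.0), ("B", 2.0)), 2.0), ("C", 6.0))
tree_c = (((("A", 1.0), ("C", 1.0)), 1.0), ("B", 3.0))
print(cophenetic_correlation(tree_a, tree_b))
print(cophenetic_correlation(tree_a, tree_c))
```

Because the pair order is taken from one tree and reused for the other, the tip-ordering problem is avoided. Uniformly rescaling all branch lengths (tree_a vs tree_b) leaves the correlation at 1, while regrouping the tips (tree_a vs tree_c) lowers it.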

7.3 years ago
Joseph Hughes ★ 3.0k

You could use the Robinson-Foulds measure in Mesquite.
