Biostar Beta. Not for public use.
Metrics to evaluate tree congruence
12 months ago
fhsantanna • 440
Brazil

I have multiple phylogenetic trees built from different marker genes, each containing the same organisms. I would like to assess the congruence of these trees in a pairwise fashion (ideally ending up with a matrix of congruence values). Since I have about 100 trees, I do not want to do it by eye. It would be great if I could evaluate them with some kind of "congruence metric". For example, I would like to know whether a gyrB gene tree is more congruent with the 16S rRNA gene tree than a recA gene tree is. Do you know of such metrics? Which software do you recommend?

3 months ago
Joe 12k
United Kingdom

Here's a partial solution you might be able to run with: I did something like this recently, though I did do it 'by eye'. I clustered my trees by eye (there are tools like TOPD that will do it, but I don't know how good they are), and I got a couple of other unbiased people to corroborate my cluster estimates.

I created a score matrix like so (this is shortened):

Gene      Tree1  Tree2  Tree3  Tree4  Tree5  Tree6  Tree7  Tree8  Tree9
PAU_pnf   1      1      1      1      1      1      1      1      1
PAK_pnf   1      1      1      1      1      1      1      1      1
PAU_cif   2      2      2      2      2      2      2      2      2
PAK_cif   2      2      2      2      2      2      2      2      2
PLT_cif   2      7      8      2      2      6      2      2      2
PAU_lopT  3      3      9      3      3      1      3      3      3
PAK_lopT  3      3      10     3      3      8      3      3      3
PLT_lopT  3      3      2      3      3      5      3      3      3
PAU_U4    4      4      4      4      4      4      4      4      4
PLT_U4    4      4      4      4      5      4      5      4      4
PAK_U2    4      6      4      4      4      4      6      4      4

I.e. cluster number 1 is arbitrarily assigned to the node that joins PAU_pnf and PAK_pnf in my dataset. This node persists across all my gene trees here.

Then take that matrix and use the Adjusted Wallace Test described here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3209087/. There is MATLAB code for this, though I simply ran it through their webserver here: http://www.comparingpartitions.info/index.php?link=Tool

It's a reasonably robust metric for comparing Sequence Types and can be repurposed for congruency :)

That spits you back out a matrix of congruency across each set of clusters/trees: https://s30.postimg.org/jpp2y1969/Screen_Shot_2016_05_14_at_13_42_18.png

Which I then replotted with ggplotly in RStudio:

Voila: https://s30.postimg.org/uo0cg7xrl/heatmap2k_transp1.png
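For anyone who would rather script the Adjusted Wallace step than use the webserver, here is a minimal Python sketch. It assumes the usual definition, AW(A→B) = (W − Wi) / (1 − Wi), where W is the Wallace coefficient and Wi = 1 − Simpson's index of diversity of B is its expected value under independence. The function names are made up for illustration and this is not the authors' MATLAB code. The two example partitions are the Tree1 and Tree2 columns of the score matrix above.

```python
from collections import Counter
from itertools import combinations


def wallace(a, b):
    """W(A->B): of the pairs that share a cluster in A, the fraction that also share one in B."""
    same_a = [(i, j) for i, j in combinations(range(len(a)), 2) if a[i] == a[j]]
    if not same_a:
        raise ValueError("partition A has no co-clustered pairs")
    both = sum(1 for i, j in same_a if b[i] == b[j])
    return both / len(same_a)


def simpson_diversity(part):
    """Simpson's index of diversity: chance that two distinct items fall in different clusters."""
    n = len(part)
    return 1 - sum(c * (c - 1) for c in Counter(part).values()) / (n * (n - 1))


def adjusted_wallace(a, b):
    """Wallace coefficient corrected for chance agreement (assumed formula, see lead-in).

    Undefined (division by zero) if B puts every item in a single cluster.
    """
    w = wallace(a, b)
    w_indep = 1 - simpson_diversity(b)  # expected W if A and B were independent
    return (w - w_indep) / (1 - w_indep)


# Cluster labels per gene, in the row order of the score matrix above
tree1 = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
tree2 = [1, 1, 2, 2, 7, 3, 3, 3, 4, 4, 6]

print(round(adjusted_wallace(tree1, tree2), 3))  # 0.551
print(round(adjusted_wallace(tree2, tree1), 3))  # 1.0
```

The coefficient is directional: Tree2's clusters are all nested inside Tree1's, so Tree2 fully predicts Tree1 but not the other way round. Looping over every ordered pair of columns would give the full congruency matrix.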

Interesting idea! The only problem for me is building the score matrix "by eye"... I will take a look at TOPD (nice suggestion).

Interesting, I'd not found that. That's useful. I had originally intended to use Prunier to infer lateral gene transfers from the phylip alignments in conjunction with the gene trees (I had already used ASTRAL to create the species tree to which I was comparing everything). I just could not get it to work in the end for some reason, and never got a reply from the developers :(

I'd add to this that I'd be interested in hearing about any other offerings people have for decent (and ideally easy to use) congruency analysers.

Another option: http://phylo.io/

12 months ago
jhc ♦ 2.8k
Germany

Perhaps normalized Robinson-Foulds distances would help here. Take a look at the ete-compare tool. It would allow you to compute all those distances very easily from the command line.
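To make the idea concrete, here is a rough pure-Python sketch of a normalized Robinson-Foulds distance between unrooted trees. It is not what ete-compare does internally: the tiny Newick reader is a simplification (it assumes plain labels without quoting), and dividing by the total number of non-trivial splits in both trees is just one possible normalization. The gene names come from the question, but the five-taxon topologies are invented.

```python
import re


def leaf_sets(newick):
    """Return (all leaf names, leaf set under each internal node) for a Newick string."""
    stack, clades, leaves, prev = [], [], set(), None
    for tok in re.findall(r"[(),;]|[^(),;]+", newick):
        tok = tok.strip()
        if not tok:
            continue
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            if stack:
                stack[-1] |= done
        elif tok not in ",;" and prev != ")":
            # A leaf; text right after ")" is an internal label or support value.
            name = tok.split(":")[0].strip()  # drop the branch length
            leaves.add(name)
            stack[-1].add(name)
        prev = tok
    return frozenset(leaves), clades


def splits(newick):
    """Non-trivial bipartitions, each stored as the side without a fixed reference leaf."""
    leaves, clades = leaf_sets(newick)
    ref = min(leaves)
    found = set()
    for clade in clades:
        side = leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(leaves) - 2:
            found.add(side)
    return leaves, found


def normalized_rf(nw1, nw2):
    """Splits found in only one tree, divided by all splits in both (0 = same topology)."""
    leaves1, s1 = splits(nw1)
    leaves2, s2 = splits(nw2)
    if leaves1 != leaves2:
        raise ValueError("trees must contain the same taxa")
    total = len(s1) + len(s2)
    return len(s1 ^ s2) / total if total else 0.0


# Invented toy topologies on taxa A-E
trees = {
    "16S": "((A,B),(C,D),E);",
    "gyrB": "((A,B),(C,E),D);",
    "recA": "((A,C),(B,D),E);",
}
for name in ("gyrB", "recA"):
    print(name, normalized_rf(trees["16S"], trees[name]))
```

In this toy case gyrB scores 0.5 against 16S and recA scores 1.0, so gyrB would be the more congruent of the two. Running `normalized_rf` over every pair of trees fills the pairwise matrix asked for in the question.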

23 months ago
apa@stowers • 420
Kansas City

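The cophenetic correlation coefficient can be used for that purpose. For example, in R, get your trees into objects of class "dendrogram" or "phylo" (if your trees are in Newick format, read.tree() from the "ape" package will read them as "phylo"), then compute "cor( cophenetic(tree1), cophenetic(tree2) )" for each pair of trees to populate a pairwise matrix. Note that for "phylo" objects cophenetic() returns a full square matrix, so put the tips in the same order and correlate the off-diagonal entries (for example the lower triangles) rather than passing the matrices to cor() directly. In Matlab, the relevant function is "cophenet".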

Basically, you regenerate the pairwise distance matrices from the branch lengths, linearize them, and correlate. Any pair of trees that encode the same dendrogram distances between genes will have an R value of 1. From simulations, R <= 0.75 generally indicates unrelated trees.
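The linearize-and-correlate step can be sketched in a few lines of Python. This assumes the cophenetic (tip-to-tip) distances have already been extracted from each tree and stored in a dict keyed by alphabetically sorted taxon pairs. That layout, the function names and the toy distances are all invented for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Plain Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cophenetic_correlation(dist1, dist2, taxa):
    """Linearize both distance matrices in the same pair order, then correlate."""
    pairs = list(combinations(sorted(taxa), 2))
    return pearson([dist1[p] for p in pairs], [dist2[p] for p in pairs])


taxa = ["A", "B", "C", "D"]
# Toy cophenetic distances: tree_a groups (A,B) and (C,D)
tree_a = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 6,
          ("B", "C"): 6, ("B", "D"): 6, ("C", "D"): 2}
# Same topology, with every branch twice as long
tree_b = {pair: 2 * d for pair, d in tree_a.items()}
# Different grouping: (A,C) and (B,D)
tree_c = {("A", "B"): 6, ("A", "C"): 2, ("A", "D"): 6,
          ("B", "C"): 6, ("B", "D"): 2, ("C", "D"): 6}

print(cophenetic_correlation(tree_a, tree_b, taxa))
print(cophenetic_correlation(tree_a, tree_c, taxa))
```

The rescaled tree correlates at essentially 1 with the original, because Pearson correlation ignores overall scale, while the regrouped tree comes out at about -0.5.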

12 months ago
Joseph Hughes ♦ 2.7k
Scotland, UK

You could use the Robinson-Foulds measure in Mesquite.
