Question

Phylogenetic Analysis Of Whole Genomes

12


13.7 years ago

Aparna ▴ 130

Hi, can anyone tell me the name of software for aligning whole genomes and constructing a phylogenetic tree from them? Thanks in advance.

phylogenetics tree • 27k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 13.7 years ago by Aparna ▴ 130

2


I think you need to elaborate on what exactly you are trying to accomplish. Are you trying to make a species tree or gene trees? How many genomes are you starting from? Are they prokaryotic or eukaryotic genomes?

ADD REPLY • link 13.7 years ago by Lars Juhl Jensen 11k

0


I need a species tree containing 24 species, all of them prokaryotic.

ADD REPLY • link 13.7 years ago by Aparna ▴ 130

9


6.2 years ago

Eli Korvigo ▴ 230

Genome-scale multiple sequence alignments are not well suited to phylogenies: they take a long time to compute and are rarely accurate. Moreover, it's hard to imagine a general-purpose sequence evolution model that would be equally adequate for protein-coding genes, rRNA and tRNA genes, repeats and other regions. Picking a small subset of genes manually is not a great option either, because you lose a lot of phylogenetic resolution. I would thus recommend building a tree based on all orthologous genes, which is the most common approach as far as I can tell. Here is a general pipeline:

1. Annotate your genomes using Prokka (for prokaryotes) or another tool;
2. Find one-to-one protein-coding orthologs using OrthoFinder or OrthoMCL;
3. Run multiple sequence alignments (MSAs) for each group (any MSA tool will do, but I prefer MAFFT);
4. Filter each MSA using Gblocks;
5. Merge the filtered alignments (I use Python for that, but I'm pretty sure there are some tools that don't require programming skills);
6. Use RAxML (maximum likelihood) or BEAST (Bayesian inference) to infer the phylogeny.
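
The merging step can be sketched in plain Python. This is only a minimal illustration, not the script the answer's author uses: it assumes each filtered alignment is a FASTA file whose headers have already been reduced to the genome name, and the function names `parse_fasta` and `concatenate_alignments` are made up for this example.

```python
def parse_fasta(text):
    """Parse FASTA text into a {name: sequence} dict (first word of each header)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def concatenate_alignments(alignments, missing="-"):
    """Join per-gene alignments into one long alignment.

    alignments: list of {genome: aligned_sequence} dicts, one per ortholog group.
    A genome absent from a gene is padded with `missing` (assumed convention).
    """
    taxa = sorted({t for aln in alignments for t in aln})
    merged = {t: [] for t in taxa}
    for aln in alignments:
        lengths = {len(s) for s in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment differ in length")
        width = lengths.pop()
        for t in taxa:
            merged[t].append(aln.get(t, missing * width))
    return {t: "".join(parts) for t, parts in merged.items()}


gene1 = parse_fasta(">A\nAC-T\n>B\nACGT\n")
gene2 = parse_fasta(">A\nGG\n>B\nGA\n>C\nTT\n")
supermatrix = concatenate_alignments([gene1, gene2])
```

With strict one-to-one orthologs present in every genome the padding branch never triggers; it is there only so that a missing gene does not silently shift the columns.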

ADD COMMENT • link 6.2 years ago by Eli Korvigo ▴ 230

5


13.7 years ago

Science_Robot ★ 1.1k

RAxML: I'm not sure, but I think this program works for entire genomes and is supposed to be very fast:

Results: In this paper we present the latest release of our program RAxML-III for rapid maximum likelihood-based inference of large evolutionary trees which allows for computation of 1.000-taxon trees in less than 24 hours on a single PC processor. We compare RAxML-III to the currently fastest implementations for maximum likelihood and bayesian inference: PHYML and MrBayes. Whereas RAxML-III performs worse than PHYML and MrBayes on synthetic data it clearly outperforms both programs on all real data alignments used in terms of speed and final likelihood values.

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.7 years ago by Science_Robot ★ 1.1k

1


Like PhyML and MrBayes, RAxML takes a multiple sequence alignment as input and uses maximum-likelihood to infer an evolutionary tree. It is thus not a tool that you can just hand a bunch of genomes and get a tree from; you'd first have to make, for example, a 16S rRNA alignment or a concatenated ribosomal protein alignment.

ADD REPLY • link 13.7 years ago by Lars Juhl Jensen 11k

0


MrBayes is not an ML method. It's based on Bayesian inference.

ADD REPLY • link 6.2 years ago by Eli Korvigo ▴ 230

5


13.2 years ago

Jarretinha 3.4k

Alignment of whole genomes is quite a delicate task, and it is a pain to parse a lot of different output formats until a measure of distance/similarity emerges. Good aligners are MUMmer and Mauve. I really like Mauve; I used it to play with a lot of genomes from different strains of E. coli. That's the advantage of whole-genome comparison! You can find a "species" tree even when 16S says that the distance is zero.

For the phylogeny part of the work, you can use RAxML, as said by some folks here. For a high number of taxa, this guy is the fastest one on the road. In your case a more precise approach is feasible, so you can use ERATE, which is Sean Eddy's version of DNAML from PHYLIP. It can deal with indels, and I recommend it even in the 16S case.

But if you really don't wanna suffer, just check the Genome-To-Genome Distance Calculator service and choose your own setup. After getting the distances, just use Clearcut to generate an NJ tree. Fast and cheap! It is not very accurate if you work with very divergent species, though.
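
To get from a list of pairwise distances to something a neighbor-joining program can read, a square distance matrix is the usual intermediate. The sketch below is an assumption-laden illustration: the input dict of pairwise distances is a made-up stand-in for whatever the distance service returns, `to_phylip_matrix` is a hypothetical helper name, and you should check that your NJ tool really expects this PHYLIP-style square layout.

```python
def to_phylip_matrix(pairwise):
    """Turn {(name_a, name_b): distance} into PHYLIP-style square matrix text.

    Assumes every pair of genomes appears exactly once, in either order,
    and that names contain no whitespace.
    """
    names = sorted({n for pair in pairwise for n in pair})

    def dist(a, b):
        if a == b:
            return 0.0
        if (a, b) in pairwise:
            return pairwise[(a, b)]
        return pairwise[(b, a)]  # raises KeyError if the pair is missing

    lines = [f"{len(names):5d}"]  # first line: number of taxa
    for a in names:
        row = " ".join(f"{dist(a, b):.6f}" for b in names)
        lines.append(f"{a:<10} {row}")  # name padded to 10 characters
    return "\n".join(lines) + "\n"


# Toy distances, invented for illustration only.
distances = {("A", "B"): 0.1, ("A", "C"): 0.2, ("B", "C"): 0.3}
matrix_text = to_phylip_matrix(distances)
```

Writing `matrix_text` to a file gives you one input per run of the tree builder; a missing pair fails loudly with a `KeyError` rather than being filled in with a guess.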

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.2 years ago by Jarretinha 3.4k

2


13.7 years ago

Dave Lunt ★ 2.0k

Hi, the approach at MicrobesOnline looks interesting. If the genomes of the 24 species are public and high quality, their phylogenetic positions may already be there for you (click on "Species Tree"). If they are unpublished genomes, they also allow you to host data privately, although I am only assuming that you would then be able to add them to the existing data sets; I don't know for sure.

The trees are made from 78 protein-coding loci, so not "whole genomes", but the difference is probably trivial for most species.

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.7 years ago by Dave Lunt ★ 2.0k

1


13.2 years ago

Adam Witney ▴ 10

Which program would be a more modern and better alternative to Phylip PARS for clustering 0/1 data representing presence/absence of genes amongst multiple strains of bacteria?

ADD COMMENT • link 13.2 years ago by Adam Witney ▴ 10

1


Adam: don't open new questions inside another discussion. Open a new thread instead, otherwise nobody will be able to answer you.

ADD REPLY • link 13.2 years ago by Giovanni M Dall'Olio 28k

0


I was actually following on from Dave Lunt's comment that said there are better alternatives to Phylip now, but maybe I put the question in the wrong place (should have been a comment on his comment). Thanks

ADD REPLY • link 13.2 years ago by Adam Witney ▴ 10

1


9.3 years ago

Chrispin Chaguza ▴ 280

It's a very old post, but I thought I could add to it to help others who might want to do a similar analysis, i.e. create phylogenies from whole genomes for prokaryotic species. I have created a basic analysis pipeline that tries to simplify the process of creating species-level phylogenetic trees using only the conserved (otherwise known as the core) genomic content of all the 'bacterial' species. The steps used are described and the script is available at http://mcgp.sourceforge.net/

ADD COMMENT • link updated 4.5 years ago by Ram 43k • written 9.3 years ago by Chrispin Chaguza ▴ 280

0


Hi. Why don't you put it on GitHub?

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by GouthamAtla 12k

0


At first glance, it appears that your pipeline is what community microbiologists/metagenomics people do as a day-to-day part of a standard analysis. How does yours differ from established pipelines/workflows in the currently published literature?

(Also, to respond to the other comment: It's on SourceForge as an SVN repository.)

ADD REPLY • link 9.3 years ago by Brice Sarver ★ 3.8k

0


Sounds good, such a nice tool. So if I align 100 genomes using Mauve and generate a whole-genome alignment tree, and on the other hand use your tool, how much do you think the results will differ?

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 9.3 years ago by HG ★ 1.2k

1


6.2 years ago

ofanoyi ▴ 160

This tool may work.

https://realphy.unibas.ch/fcgi/realphy

ADD COMMENT • link 6.2 years ago by ofanoyi ▴ 160

0


5.0 years ago

rm.umayal24 ▴ 10

The VCF2PopTree software would be helpful if you are constructing a phylogenetic tree from a VCF or SNP file. It can even read human genome data. It is so cool, and it has no dependencies.

The software link is as follows: http://sankarsubramanian.net/dat/index.html

ADD COMMENT • link 5.0 years ago by rm.umayal24 ▴ 10

Ram · Accepted Answer · 2010-07-29

20


13.7 years ago

Lars Juhl Jensen 11k

I am not aware of an easy way to construct reliable species trees based on complete genomes. The general approach that you need to take is to pick one or more genes on which to base your phylogeny. This could be either 16S rRNA, all ribosomal-protein-coding genes, or other highly conserved genes that are universally present and rarely subject to gene duplications or lateral gene transfer.
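
The "universally present and not duplicated" part of that criterion is easy to check mechanically once genes have been grouped into families. A minimal sketch, assuming a made-up input layout where each family maps to a list of (genome, gene ID) pairs; `single_copy_core` is a hypothetical name, and the check says nothing about lateral gene transfer, which needs a separate analysis.

```python
from collections import Counter


def single_copy_core(families, genomes):
    """Keep families with exactly one member in every genome.

    families: {family_name: [(genome, gene_id), ...]}  (assumed layout)
    genomes:  iterable of all genome names in the study
    Returns {family_name: {genome: gene_id}} for the families that pass.
    """
    genomes = set(genomes)
    selected = {}
    for family, members in families.items():
        counts = Counter(genome for genome, _ in members)
        universal = set(counts) == genomes
        single_copy = all(c == 1 for c in counts.values())
        if universal and single_copy:
            selected[family] = dict(members)
    return selected


# Toy input, invented for illustration only.
families = {
    "rpoB": [("g1", "a"), ("g2", "b"), ("g3", "c")],
    "dup": [("g1", "x"), ("g1", "y"), ("g2", "z"), ("g3", "w")],
    "partial": [("g1", "p"), ("g2", "q")],
}
core = single_copy_core(families, ["g1", "g2", "g3"])
```

Here only `rpoB` survives: `dup` has two copies in one genome and `partial` is missing from another.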

Once you have picked the genes, you need to make a multiple sequence alignment for each of the genes that you want to use for your phylogeny. For this I would tend to use either muscle or mafft. After that I would use Gblocks to extract the conserved blocks in the alignment(s), so that potentially misaligned parts are not used as the basis for tree building.
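
To make the idea of trimming an alignment concrete, here is a deliberately simple stand-in. It is not Gblocks' algorithm: it only drops columns in which more than a chosen fraction of sequences have a gap, whereas Gblocks selects blocks using conservation and flanking criteria as well. The function name and the 0.5 default are assumptions for illustration.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Remove alignment columns whose gap fraction exceeds the threshold.

    alignment: {name: aligned_sequence}, all sequences of equal length.
    """
    names = list(alignment)
    seqs = [alignment[n] for n in names]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must all have the same aligned length")
    keep = []
    for i in range(len(seqs[0])):
        gaps = sum(1 for s in seqs if s[i] == "-")
        if gaps / len(seqs) <= max_gap_fraction:
            keep.append(i)
    return {n: "".join(s[i] for i in keep) for n, s in zip(names, seqs)}


# Toy alignment, invented for illustration only.
aln = {"A": "AC-GT", "B": "A--GT", "C": "AC-GA"}
trimmed = drop_gappy_columns(aln)
strict = drop_gappy_columns(aln, max_gap_fraction=0.0)
```

The same column indices are removed from every sequence, so the trimmed sequences stay aligned with each other and can go straight into concatenation or tree building.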

If you decided to use multiple genes as the basis for your phylogeny, you now have to make a big decision, namely whether to go for a concatenated alignment approach or a supertree approach. In the first case, you would concatenate all of the multiple alignments and use the resulting big alignment as input for a phylogenetic tree reconstruction program, for example PhyML. In the second case, you would use such a program to make a separate tree for each of the genes of interest, and subsequently use one of several supertree programs to derive a consensus tree based on these. If you went for just using a single gene as the basis for your tree, you obviously just build a tree for that one gene and you are done.

I hope this helps, although it is certainly very far from a "push of a button" solution.

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.7 years ago by Lars Juhl Jensen 11k

1


Depends a bit on what you want to do, but as long as the 24 genomes are not too far apart, I agree that 16S rRNA is a good choice. If one wants to attempt to resolve very deep-branching parts of the tree, I believe you need a multi-locus approach to get enough information to be able to do much. But in that case using just 24 genomes would be unlikely to work anyway.

ADD REPLY • link 13.7 years ago by Lars Juhl Jensen 11k

1


Just don't use any of the alignment software suggested; try something with a "profile"-based alignment or something geared to rRNA.

ADD REPLY • link 13.7 years ago by Paulo Nuin ★ 3.7k

0


I would recommend ssu-align for 16S multiple sequence alignment. It uses a 16S HMM.

ADD REPLY • link 6.2 years ago by Eli Korvigo ▴ 230

0


+1 for 16S rRNA instead of whole genome

ADD REPLY • link 13.7 years ago by Michael Schubert ★ 7.1k

0


+1 and agree on 16S, all other genes will lead to a sort of 'non-standard' approach.

ADD REPLY • link 13.7 years ago by Michael 54k

0


@Paulo, good point. I completely agree that if you want to do rRNA alignment you should use dedicated, profile-based tools. The alignment tools were meant as suggestions for how to make multiple alignments of protein-coding genes.

ADD REPLY • link 13.7 years ago by Lars Juhl Jensen 11k

0


I would try the FastTree program: it gives results comparable to PhyML but is much faster, which would be beneficial on a genome-wide scale. In any case, if you use multiple loci of whole genomes for phylogeny reconstruction, there will be only a very tiny difference between programs. And if you have whole-genome sequences available, do not rely on 16S rRNA alone but take as much sequence data as possible into account.

ADD REPLY • link 13.2 years ago by Peter ▴ 90

0


Could you give a recommendation for a "supertree" program? I have trees built from genotypes from individual chromosomes and I want to generate a consensus tree.

ADD REPLY • link 12.8 years ago by User 3875 ▴ 50