Phylogenetic Tree Of Fragments Of The Same Protein (From A Metagenome)
12.2 years ago
Gimly_Gloin ▴ 70

OK, I have several hundred fragments (699 sequences) of a protein of interest that I would like to align and build a neighbor-joining tree from. In many cases these fragments do not align well to one another, because they cover different regions of the same or similar proteins.

However, whole-protein sequences have been defined and submitted to NCBI and other databases, and trees for this protein have been published in the literature. Is there a way to take the fragments from my metagenome and align them to the known sequences, to define where each fragment lies on the published tree? My only solution so far is to run each sequence (or cluster of sequences) against the predefined tree (using the original whole-protein sequences from the publication) to see where each fragment would fall.

My sequences are unassembled reads (I can't assemble them; they are too diverse).

Average read length is 400 bp.

Typical protein length is around 350 aa.
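
The two figures above already explain why many fragments fail to overlap. As a rough back-of-the-envelope sketch (the function name is illustrative, and it assumes a read is coding over its whole length, which gives an upper bound):

```python
# Figures taken from the post; everything else is an illustrative assumption.
READ_LENGTH_BP = 400
PROTEIN_LENGTH_AA = 350


def max_residues_covered(read_length_bp: int) -> int:
    """Upper bound on encoded residues: one residue per complete codon."""
    return read_length_bp // 3


covered = max_residues_covered(READ_LENGTH_BP)
fraction = covered / PROTEIN_LENGTH_AA

print(f"A {READ_LENGTH_BP} bp read encodes at most {covered} aa, "
      f"about {fraction:.0%} of a {PROTEIN_LENGTH_AA} aa protein")
```

So a single read spans well under half of the protein, and two reads from opposite ends of the gene share no homologous positions at all.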

Is there an easier way to do this?

How accurate would diversity statistics be on this protein? (I will not be adding the known protein sequences for this one.)

Thanks for any advice/help in advance.

phylogenetics metagenomics • 4.5k views

12.2 years ago
Ari ▴ 120

PAGAN could be helpful in the alignment part. Please see http://code.google.com/p/pagan-msa/wiki/PAGAN?tm=6 and contact the author if you have any questions. The program is actively developed and some recent features (e.g. translated and ORF alignment) are still undocumented.

You could try (1) "pileup alignment" (with one reference sequence) and (2) "unguided placement" (with a reference alignment and tree):

(1)

pagan --reads-pileup --ref-seqfile ref_peptide.fas --readsfile prot_frags.fas

(2)

pagan --ref-seqfile ref_alignment.fas --ref-treefile ref_tree.nh --readsfile \
prot_frags.fas --fast-placement --test-every-node

With the second option, PAGAN adds the new sequences "inside" the reference alignment at the phylogenetic positions that they match best. This is based on a greedy search, though, and should not be taken as a proper phylogenetic analysis.
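
Once the fragments sit inside the reference alignment, it is easy to check which region of the protein each one covers. A minimal sketch, assuming the placement output is an aligned FASTA file with `-` (or `.`) as gap characters; the sequences and names below are made up for illustration and are not PAGAN output:

```python
from typing import Dict, Optional, Tuple


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def aligned_span(seq: str, gap_chars: str = "-.") -> Optional[Tuple[int, int]]:
    """Return 1-based (first, last) alignment columns holding a residue."""
    columns = [i for i, char in enumerate(seq) if char not in gap_chars]
    if not columns:
        return None
    return columns[0] + 1, columns[-1] + 1


# Toy alignment (hypothetical data): one full-length reference, two fragments.
example = """>ref_full
MKTAYIAKQRQISFVK
>frag_1
MKTAYI----------
>frag_2
---------RQISFVK
"""

spans = {name: aligned_span(seq) for name, seq in read_fasta(example).items()}
for name, span in spans.items():
    print(name, span)
```

Fragments whose spans do not intersect cannot be compared with each other directly, only through the reference sequences.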

12.2 years ago

I think you can use PaPaRa for this. Build a phylogenetic tree from the full-length proteins and align your queries to the tree(s).

Check the publication from Berger & Stamatakis: http://bioinformatics.oxfordjournals.org/content/27/15/2068.long

or their web page: http://www.exelixis-lab.org/


See my answer below for the EPA algorithm from the same group, which places aligned reads onto a reference tree in an ML framework; PaPaRa can supply the alignment it needs.

12.2 years ago
ALchEmiXt ★ 1.9k

The trees displayed at NCBI are more or less based on pairwise BLAST (or do you mean some other trees?). If you have the sequences behind a given tree, you should be able to reproduce that tree quite easily from pairwise sequence comparison (i.e. using BLAST or MUMmer).

If that is the case, you can add your own sequences and "see" which clade they end up in, all in one go. This assumes the tree is not disturbed too much by the additional sequences. These algorithms are quite fast, so they leave plenty of room to test settings and compare outcomes.


I would avoid the NCBI reference trees whenever possible. They are essentially hierarchical distance trees and not necessarily representative of the true phylogeny, depending on the sequence in question and the taxa represented.


No, it isn't an NCBI tree. What I meant was that the sequence data for the protein is from NCBI; the tree itself is a maximum-likelihood tree generated with PHYML. Thanks for the suggestion, though. I can already do this by clustering with USEARCH, which gives me a rough idea of where each fragment is but doesn't provide enough data for statistical analysis using OTUs in MOTHUR...

12.2 years ago
DG 7.3k

Don't do an NJ tree. NJ phylogenetic tree algorithms are prone to all sorts of biases and artefacts, like long-branch attraction, that could be particularly problematic for this sort of problem.
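
Apart from those biases, a distance method needs a distance for every pair of sequences, and fragments covering different regions of the protein cannot provide one. A minimal sketch of the issue, using a plain p-distance over shared alignment columns (the sequences are made up, and real NJ implementations handle missing data in their own ways):

```python
from typing import Optional


def p_distance(a: str, b: str, min_overlap: int = 1) -> Optional[float]:
    """Fraction of mismatches over columns where both sequences have a residue.

    Returns None when fewer than `min_overlap` columns are shared, i.e. the
    distance is undefined rather than zero.
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    shared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if len(shared) < min_overlap:
        return None
    mismatches = sum(1 for x, y in shared if x != y)
    return mismatches / len(shared)


# Hypothetical aligned fragments of one protein family.
frag_a = "MKTAYI------"
frag_b = "------RQISFV"
frag_c = "---AYLAKQ---"

print(p_distance(frag_a, frag_b))  # no shared columns, so undefined
print(p_distance(frag_a, frag_c))  # three shared columns, one mismatch
```

With 699 fragments, many pairs will fall into the first case, which is one more reason to place fragments on a reference tree instead of building a tree from the fragments alone.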

RAxML, a maximum-likelihood phylogenetics program (http://www.exelixis-lab.org/), includes the Evolutionary Placement Algorithm, or EPA (Paper is here). You can use a reference phylogeny and aligned sequences to map the short reads from your metagenomic data onto the tree in a full maximum-likelihood context. Including models of substitution, frequency estimates, etc. is very important, especially if you are dealing with a large number of taxa and a lot of diversity.
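
As a rough sketch of what an EPA run looks like with the standard RAxML binary, the snippet below only assembles the command line and prints it. The file names, run name and the `PROTGAMMAWAG` model are placeholder assumptions, not recommendations; `-f v` is the RAxML option that classifies query sequences into a reference tree, and it expects an alignment containing both the reference and the query sequences. Check `raxmlHPC -h` for your installed version before relying on any of this.

```python
import shlex
from typing import List


def build_epa_command(alignment: str, reference_tree: str, run_name: str,
                      model: str = "PROTGAMMAWAG",
                      binary: str = "raxmlHPC") -> List[str]:
    """Assemble (but do not run) a RAxML EPA command line.

    `alignment` must hold reference AND query sequences, already aligned;
    `reference_tree` contains the reference taxa only.
    """
    return [
        binary,
        "-f", "v",             # evolutionary placement of query sequences
        "-s", alignment,       # combined reference + fragment alignment
        "-t", reference_tree,  # reference tree (reference taxa only)
        "-m", model,           # protein substitution model (assumed here)
        "-n", run_name,        # suffix for the output files
    ]


# Hypothetical file names, for illustration only.
cmd = build_epa_command("ref_plus_frags.phy", "ref_tree.nh", "epa_frags")
print(shlex.join(cmd))
```

The command could then be passed to `subprocess.run(cmd, check=True)` on a machine where RAxML is installed.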


Thanks for the advice. After rereading the source of the phylogeny for my protein, it appears they built a maximum-likelihood tree using PHYML. I will have a look at RAxML!
