How to convert fasta file format to phylip file format
5.7 years ago
Mike ★ 1.9k

Hi all,

I have FASTA sequences of some proteins and want to convert them to PHYLIP format to build a phylogenetic tree using ggtree. I tried the online EMBOSS seqret tool to convert the FASTA file to PHYLIP format, but I got an error when I read the output in ggtree.

My input sequences:

>proteinsA
MGDSRDLCPHLDSIGEVTKEDLLLKSKGTCQSCGVTGPNLWACLQVACPYVGCGESFADH
RTDKKPALCKSYQKLVSEVWHKKRPSYVVP
>proteinsB
MTGSNSHITILTLKVLPHFESLGKQEKIPNKMSAFRNHCPHLDSVGEITKEDLIQKSLGT
SHVSFP
>proteinsC
MGDSRDLCPHLDSIGEVTKEDLLLKSKGTCQSCGVTGPNLWACLQVACPYVGCGESFADD
ITTEETMEEDKSQSDVDFQSCESCSNSDRAENENGSRCFSEDNNETTMLIQDDENN

and the EMBOSS seqret output is:

 3 116

proteinsA MGDSRDLCPH LDSIGEVTKE DLLLKSKGTC QSCGVTGPNL WACLQVACPY
proteinsB MTGSNSHITI LTLKVLPHFE SLGKQEKIPN KMSAFRNHCP HLDSVGEITK
proteinsC MGDSRDLCPH LDSIGEVTKE DLLLKSKGTC QSCGVTGPNL WACLQVACPY

          VGCGESFADH RTDKKPALCK SYQKLVSEVW HKKRPSYVVP ----------
          EDLIQKSLGT SHVSFP---- ---------- ---------- ----------
          VGCGESFADD ITTEETMEED KSQSDVDFQS CESCSNSDRA ENENGSRCFS

          ---------- ------
          ---------- ------
          EDNNETTMLI QDDENN

But I got an error when reading this PHYLIP file:

tree <- read.phylip("emboss_seqret_output.txt")

 

Error in read.phylip("emboss_seqret_output.txt") : 
  input file is not phylip tree format...

Could you please help me understand what the problem is with my input file, or suggest some alternative approaches?

Thanks a lot.

phylip fasta ggtree Phylogenetic tree R • 20k views

I use different tools to build phylogenetic trees. But I also need to convert fasta to phylip.

To convert fasta to phylip: http://sequenceconversion.bugaco.com/converter/biology/sequences/fasta_to_phylip.php

A program for phylogenetic trees: http://www.atgc-montpellier.fr/phyml/

Other useful programs from that site: http://www.atgc-montpellier.fr/index.php?type=pg


Don’t convert fasta to phylip. That tool is steering you wrong. While it is possible to represent the 2 files in a visually similar manner you should not do this as a text manipulation. The input sequences should be fed to an alignment program.


I was able to load the PHYLIP file you posted here (the EMBOSS seqret output) using the read.phylip function from phylotools. I tried both the .phy and .txt extensions and saw no difference. @Mike, I think you are using the read.phylip function from the treeio package. The minor changes I made were to insert an extra space between the sequence IDs and the sequences, and to remove the blank line between the first line and the next one.

> library("phylotools")
> read.phylip("file.phy")
   seq.name
1 proteinsA
2 proteinsB
3 proteinsC
                                                                                                              seq.text
1 MGDSRDLCPHLDSIGEVTKEDLLLKSKGTCQSCGVTGPNLWACLQVACPYVGCGESFADHRTDKKPALCKSYQKLVSEVWHKKRPSYVVP--------------------------
2 MTGSNSHITILTLKVLPHFESLGKQEKIPNKMSAFRNHCPHLDSVGEITKEDLIQKSLGTSHVSFP--------------------------------------------------
3 MGDSRDLCPHLDSIGEVTKEDLLLKSKGTCQSCGVTGPNLWACLQVACPYVGCGESFADDITTEETMEEDKSQSDVDFQSCESCSNSDRAENENGSRCFSEDNNETTMLIQDDENN
Warning message:
In readLines(infile) : incomplete final line found on 'file.phy'
> read.phylip("file.txt")
   seq.name
1 proteinsA
2 proteinsB
3 proteinsC
                                                                                                              seq.text
1 MGDSRDLCPHLDSIGEVTKEDLLLKSKGTCQSCGVTGPNLWACLQVACPYVGCGESFADHRTDKKPALCKSYQKLVSEVWHKKRPSYVVP--------------------------
2 MTGSNSHITILTLKVLPHFESLGKQEKIPNKMSAFRNHCPHLDSVGEITKEDLIQKSLGTSHVSFP--------------------------------------------------
3 MGDSRDLCPHLDSIGEVTKEDLLLKSKGTCQSCGVTGPNLWACLQVACPYVGCGESFADDITTEETMEEDKSQSDVDFQSCESCSNSDRAENENGSRCFSEDNNETTMLIQDDENN
Warning message:
In readLines(infile) : incomplete final line found on 'file.txt'

Thanks cpad0112. Yes, I can read the file with the read.phylip function from phylotools, but not with the one from ggtree/treeio. How can I build a tree from this PHYLIP file using phylotools?


I guess you have resolved the issue. For future reference, the tool needs sequential PHYLIP format, not interleaved format. It also needs dendrogram information (Nexus, maybe) at the end of the .phy file.
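
For anyone who wants to try the sequential layout mentioned above, here is a minimal Python sketch that rewrites interleaved PHYLIP text (like the seqret output in the question) as sequential PHYLIP. It assumes names contain no spaces and are separated from the sequence by whitespace, which is a relaxed reading of the format. The function name interleaved_to_sequential is made up for illustration. This only changes the layout; it does not add any tree information.

```python
def interleaved_to_sequential(text):
    """Rewrite interleaved PHYLIP text as sequential PHYLIP text."""
    # Group non-blank lines into blocks separated by blank lines.
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)

    # The first non-blank line is the header: number of sequences, alignment length.
    ntax, nchar = (int(field) for field in blocks[0][0].split())
    if len(blocks[0]) > 1:
        blocks[0] = blocks[0][1:]   # header sits directly above the first block
    else:
        blocks.pop(0)               # header is followed by a blank line

    # The first block carries the names; later blocks carry only sequence chunks.
    names, seqs = [], []
    for line in blocks[0]:
        name, *chunks = line.split()
        names.append(name)
        seqs.append("".join(chunks))
    for block in blocks[1:]:
        for i, line in enumerate(block):
            seqs[i] += "".join(line.split())

    if len(names) != ntax or any(len(seq) != nchar for seq in seqs):
        raise ValueError("sequences do not match the counts in the header")

    # Pad short names to 9 characters and add one space before the sequence.
    out = [f" {ntax} {nchar}"]
    out += [f"{name:<9} {seq}" for name, seq in zip(names, seqs)]
    return "\n".join(out) + "\n"


# Toy interleaved input (made-up sequences, same layout as the seqret output).
example = """ 2 12

seqA      ACGTAC
seqB      AC--AC

          GTTTAA
          GT--AA
"""
print(interleaved_to_sequential(example))
```

The length check against the header is there because a mismatch usually means the blocks were parsed wrongly or the file was truncated.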


You may need to make the file extension “.phy”.

Also, I’m not sure if it’s just how you’ve copied and pasted, but there isn’t normally a gap between the first line (the one with the 2 numbers) and the start of the alignment itself (at least as far as I have seen in the past, and PHYLIP is one of the stricter formats).

The bigger issue here is that you should not be “converting” a fasta to a PHYLIP. A phylip is an alignment file, not just a sequence representation. For your tree to be meaningful at all you need to align the sequences, using something like CLUSTAL or MUSCLE.


It is not a copied-and-pasted file; I downloaded it from the EMBOSS seqret result page, as shown below:

[image: EMBOSS seqret result page]

I also have a MAFFT alignment file, but I don’t know how to use it to generate a tree.


That confirms my suspicions about the spacing of the first and second lines in your pasted example.

You can try to fix it, but it’s not the file you should be using. Can you paste what your MAFFT output looks like?

>proteinsA
-------------------------------MGDSRDLCPHLDSIGEVTKEDLLLKSKGT
CQSCGVTGPNLWACLQVACPYVGCGESFADHRT-------DKKPA-----LCKSY-----
------QKLVSEVWHKKRPSYVVP-----
>proteinsB
MTGSNSHITILTLKVLPHFESLGKQEKIPNKMSAFRNHCPHLDSVGEITKEDLIQKSLGT
S--------------HVSFP----------------------------------------
-----------------------------
>proteinsC
-------------------------------MGDSRDLCPHLDSIGEVTKEDLLLKSKGT
CQSCGVTGPNLWACLQVACPYVGCGESFADDITTEETMEEDKSQSDVDFQSCESCSNSDR
AENENGSRCFSE--DNNETTMLIQDDENN

That’s an aligned fasta (though to my eye it looks to be a fairly poor alignment) - proceed with caution.

Most tree building software will be able to accept fasta as an input. Otherwise you have 2 options:

  1. Go back to MAFFT and request phylip as the output format directly.
  2. Convert the aligned fasta to a phylip.
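
Option 2 can be done with a few lines of Python and no extra libraries. The sketch below is only an illustration: read_fasta and fasta_to_phylip are hypothetical helper names, names are assumed to contain no spaces, and the output is sequential PHYLIP. It refuses input with unequal sequence lengths, since equal length is the minimum requirement for an alignment (although equal length alone does not prove the sequences are well aligned).

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the sequence name.
            records.append([line[1:].split()[0], ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def fasta_to_phylip(text):
    """Convert aligned FASTA text to sequential PHYLIP text."""
    records = read_fasta(text)
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; align them first")
    nchar = lengths.pop()
    lines = [f" {len(records)} {nchar}"]
    for name, seq in records:
        # Pad short names to 9 characters and add one space before the sequence.
        lines.append(f"{name:<9} {seq}")
    return "\n".join(lines) + "\n"


# Toy aligned input (made-up sequences).
aligned = ">a\nAC-T\nGG\n>b\nACGT\n-G\n"
print(fasta_to_phylip(aligned))
```

Feeding it the unaligned sequences from the original question would raise the ValueError, which is the point made earlier in the thread: the conversion is only meaningful after alignment.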

Additionally, ggtree is not a tree construction program; it is just for rendering/plotting precalculated trees. From their documentation, it apparently supports “phylip tree format”, not a format I’m familiar with, but that still requires a Newick representation of the tree in the PHYLIP file along with the aligned sequences.

I would probably start over from your original fasta, align with MAFFT/Clustal/whatever, output directly as a phylip, then use something like IQTREE to actually calculate the tree itself.

Lastly I would just ask: is this a toy data set for our benefit or have you really only got 3 sequences?


Thanks jrj.healey for your help. I have around 150 protein sequences; this is just toy/example data.


ggtree expects a phylip file with the newick string. The file you have converted using Seqret does not have the newick string.

Please see 'Parser functions defined in treeio' table in the ggtree documentation for more info.

 read.phylip    parsing phylip file (phylip alignment + newick string)

Thanks Sej, that’s my problem: how do I generate a PHYLIP file (PHYLIP alignment + Newick string) for plotting in ggtree?


No need. You don’t need the phylip at all, you just need a newick formatted tree, which is the most common output for any phylogenetics tool.

Use a tool like IQTREE, and just take the treefile it gave you. You do not need to do anything else.
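
Before plotting, it can be worth checking that the Newick tree file actually contains the same names as the alignment. The Python sketch below is a rough illustration, not a full Newick parser: it assumes unquoted labels and no bracketed comments, the helper names tip_labels and missing_tips are made up, and the example tree with its branch lengths is invented purely for the demo.

```python
import re


def tip_labels(newick):
    """Return tip (leaf) names from a Newick string, in order of appearance."""
    # A tip label directly follows "(" or ","; internal node labels follow ")".
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def missing_tips(newick, expected_names):
    """Return the expected names that do not appear as tips in the tree."""
    return sorted(set(expected_names) - set(tip_labels(newick)))


# Invented example tree using the sequence names from this thread.
tree = "((proteinsA:0.1,proteinsC:0.2):0.05,proteinsB:0.3);"
print(tip_labels(tree))
print(missing_tips(tree, ["proteinsA", "proteinsB", "proteinsC"]))
```

An empty list from missing_tips suggests the tree and the alignment refer to the same set of sequences.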


See if this is what you want, @Mike:

input:

$ cat test.fa
>proteinsA
-------------------------------MGDSRDLCPHLDSIGEVTKEDLLLKSKGT
CQSCGVTGPNLWACLQVACPYVGCGESFADHRT-------DKKPA-----LCKSY-----
------QKLVSEVWHKKRPSYVVP-----
>proteinsB
MTGSNSHITILTLKVLPHFESLGKQEKIPNKMSAFRNHCPHLDSVGEITKEDLIQKSLGT
S--------------HVSFP----------------------------------------
-----------------------------
>proteinsC
-------------------------------MGDSRDLCPHLDSIGEVTKEDLLLKSKGT
CQSCGVTGPNLWACLQVACPYVGCGESFADDITTEETMEEDKSQSDVDFQSCESCSNSDR
AENENGSRCFSE--DNNETTMLIQDDENN

output:

#NEXUS
begin data;
    dimensions ntax=3 nchar=149;
    format datatype=protein missing=? gap=-;
matrix
proteinsA -------------------------------MGDSRDLCPHLDSIGEVTKEDLLLKSKGTCQSCGVTGPNLWACLQVACPYVGCGESFADHRT-------DKKPA-----LCKSY-----------QKLVSEVWHKKRPSYVVP-----
proteinsB MTGSNSHITILTLKVLPHFESLGKQEKIPNKMSAFRNHCPHLDSVGEITKEDLIQKSLGTS--------------HVSFP---------------------------------------------------------------------
proteinsC -------------------------------MGDSRDLCPHLDSIGEVTKEDLLLKSKGTCQSCGVTGPNLWACLQVACPYVGCGESFADDITTEETMEEDKSQSDVDFQSCESCSNSDRAENENGSRCFSE--DNNETTMLIQDDENN
;
end;

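code (works with Python > 3.5 and Biopython 1.71, the latest version at the time of writing):

import sys

from Bio import AlignIO
# Note: Bio.Alphabet was removed in Biopython 1.78, so this import needs an older release.
from Bio.Alphabet import IUPAC, Gapped

# The input (aligned FASTA) and output (NEXUS) paths come from the command line.
input_file = sys.argv[1]
output_file = sys.argv[2]

with open(output_file, "w") as o:
    with open(input_file, "r") as i:
        # Parse the aligned FASTA as gapped protein sequences, then write it out as NEXUS.
        infa = AlignIO.parse(i, "fasta", alphabet=Gapped(IUPAC.protein))
        AlignIO.write(infa, o, "nexus")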

This doesn’t solve OP’s problem because it still contains no dendrogram information.

5.7 years ago
Mike ★ 1.9k

I found a nice tutorial on building phylogenetic trees from FASTA sequences:

http://www.cbs.dtu.dk/courses/biosys/binfintro/phylogeny.php

Step 1: Open the sequence file (FASTA), select the entire file, and copy the sequences.

Step 2: Align the sequences using the MAFFT server at EBI with default settings.

Step 3: Open the TreeHugger web server. (The TreeHugger server constructs a neighbor-joining tree from an aligned set of sequences.)

Step 4: Download the data in "Newick/Phylip format" (treehugger_newick.nwk).

Step 5: Visualize the tree using ggtree:

library(ggtree)
tree <- read.tree("treehugger_newick.nwk")
ggtree(tree) + geom_tiplab()
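
Step 3 above relies on neighbor joining, which works from a matrix of pairwise distances between the aligned sequences. As a rough illustration of that first stage only, the Python sketch below computes a simple p-distance (the fraction of differing residues over columns where neither sequence has a gap). This is an assumption chosen for clarity and not necessarily the distance that TreeHugger or any other tool uses. The function names and the toy alignment are made up.

```python
from itertools import combinations


def p_distance(seq1, seq2):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        return None  # no shared columns, so the distance is undefined
    return differing / compared


def distance_matrix(alignment):
    """Map each pair of names to its p-distance; alignment is {name: sequence}."""
    return {
        (n1, n2): p_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(alignment, 2)
    }


# Toy alignment (made-up sequences).
toy = {"x": "ACGT", "y": "ACGA", "z": "AC-T"}
for pair, dist in distance_matrix(toy).items():
    print(pair, dist)
```

A pair with very few shared columns, like proteinsB against the others in the MAFFT output earlier in the thread, gives a distance based on little data, which is one way to see why that alignment deserves caution.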

You can download the phylogeny information from the EBI server itself.

5.6 years ago
Guangchuang Yu ★ 2.6k

ggtree supports the PHYLIP tree format, but not a PHYLIP multiple sequence alignment.

A PHYLIP tree file contains the MSA in the well-known PHYLIP format, with an additional record of the corresponding tree as Newick text.

ggtree visualizes phylogenetic trees, so you need to have a tree before passing it to ggtree.

A PHYLIP sequence file contains only sequences, so you need to construct the tree before visualizing it.

I am the author of ggtree and recommend posting ggtree questions to the Google group.
