Question

How to Replace Sequence IDs in a Text (Tree) File with Taxonomy Strings from a Corresponding Tab-Delimited Taxonomy File

4

10.8 years ago

samlambrechts299 ▴ 170

Hi!

I have a phylogenetic tree file, which is basically a text file that looks like this:

((((((EF019104:0.08997,(EU135350:0.05132,((EU135320:0.04788,FJ592807:0.04392)0.465:0.01790,(EU134780:0.05467,(EU135316:0.03496,((EU135328:0.04185,((AY456902:0.01911,HM445090:0.00744)0.991:0.01269,(EU135341:0.03902,((EU135339:0.01451,EU135345:0.01593)0.726:0.00145,JN580049:0.05998)0.771:0.00197)0.783:0.00175)0.862:0.00467)0.689:0.01098,HM445243:0.03376)0.871:0.01490)0.701:0.00326)0.813:0.01180)0.818:0.00986)0.954:0.02106)0.942:0.01986,(HM187318:0.10108,(HM186778:0.03166,(EU132538:0.05510,(

Next I have a corresponding taxonomy file that looks like this:

JN178341    Bacteria;__Verrucomicrobia;__OPB35_soil_group;__o;__f;__g
GQ898616    Bacteria;__Firmicutes;__Clostridia;__Clostridiales;__Ruminococcaceae;__Incertae_Sedis

The identifiers in the first column of the tab-delimited taxonomy file also appear in the tree file. The goal is to replace each identifier in the tree file with its corresponding taxonomy string from the taxonomy file.

I'm guessing that this isn't a complicated problem for experienced bioinformaticians, but I'm currently drawing a blank here.

Any help would be greatly appreciated!

perl taxonomy identifiers tree search • 7.2k views

ADD COMMENT • link updated 10.8 years ago by jhc ★ 3.0k • written 10.8 years ago by samlambrechts299 ▴ 170

score 4 · Answer 1 · 2013-07-18

4

10.8 years ago

Kenosis ★ 1.3k

Given your data sets, perhaps the following will help:

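use strict;
use warnings;

# The last argument is the tree file; set it aside so <> reads only the taxonomy file.
my $treeFile = pop;

# Map each identifier (first column) to its taxonomy string (rest of the line).
# Lines that do not match (e.g. blank lines) are skipped.
my %taxonomy = map { /^(\S+)\s+(.+)/ ? ( $1 => $2 ) : () } <>;

# Queue the tree file so the next <> reads it.
push @ARGV, $treeFile;

while ( my $line = <> ) {
    # Replace every whole-word occurrence of each identifier with its taxonomy string.
    $line =~ s/\b\Q$_\E\b/$taxonomy{$_}/g for keys %taxonomy;
    print $line;
}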

Usage: perl script.pl taxonomyFile phylogeneticTreeFile [>outFile]

The optional redirection at the end sends the output to a file instead of the terminal.

First, the tree file name is popped off the (implicit) @ARGV and saved for later. Then the taxonomy file is read into a hash, where the identifiers become the keys and the taxonomy strings become the values. Next, the tree file is read a line at a time, and in each line every identifier key is globally replaced with its associated value. Finally, the line is printed.
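For readers who prefer Python, the same idea can be sketched roughly as follows. This is not the script above, only an illustration of the approach. It differs in building one combined pattern and replacing all identifiers in a single pass, so inserted taxonomy strings are never rescanned. The function names `load_taxonomy` and `relabel` are made up for this example, and the table is assumed to hold an identifier, some whitespace, then the taxonomy string on each line.

```python
import re


def load_taxonomy(lines):
    """Map each identifier (first column) to its taxonomy string."""
    taxonomy = {}
    for line in lines:
        # Split on the first run of whitespace only, so the taxonomy
        # string is kept whole even if it contains spaces.
        parts = line.strip().split(None, 1)
        if len(parts) == 2:  # skip blank or malformed lines
            taxonomy[parts[0]] = parts[1]
    return taxonomy


def relabel(newick, taxonomy):
    """Replace every whole-word identifier that has a taxonomy entry."""
    if not taxonomy:
        return newick
    # Longest identifiers first, so an ID never loses to one of its prefixes.
    keys = sorted(taxonomy, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda match: taxonomy[match.group(0)], newick)


if __name__ == "__main__":
    # Hypothetical two-entry table, taken from the question's example lines.
    table = load_taxonomy([
        "JN178341\tBacteria;__Verrucomicrobia;__OPB35_soil_group;__o;__f;__g",
        "GQ898616\tBacteria;__Firmicutes;__Clostridia;__Clostridiales;__Ruminococcaceae;__Incertae_Sedis",
    ])
    print(relabel("(JN178341:0.08997,GQ898616:0.05132);", table))
```

Identifiers that are missing from the table are left untouched, which matches the behaviour of the Perl version.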

ADD COMMENT • link 10.8 years ago by Kenosis ★ 1.3k

0

Works perfectly! Thanks

ADD REPLY • link 10.8 years ago by samlambrechts299 ▴ 170

0

perfect! Thanks much!

ADD REPLY • link 5.8 years ago by Prakki Rama ★ 2.7k

0

Works perfectly. Thank you!

ADD REPLY • link 4.0 years ago by JMMM ▴ 10

score 4 · Answer 2 · 2013-07-19

You could use any of the phyloinformatics toolkits available. A possible approach using Python and the ETE toolkit would look like this:

    # ete3 is the Python 3-compatible release of the ETE toolkit
    from ete3 import Tree

    # Load the taxonomy table as a dictionary (assumes two tab-delimited
    # columns: identifier and taxonomy string)
    with open("taxonomyTable.tab") as handle:
        name2tax = dict(
            [field.strip() for field in line.split("\t", 1)]
            for line in handle if line.strip()
        )

    # Load your tree
    t = Tree("myTreeFile.nw")
    print(t)
    #     /-JN178341
    #----|
    #     \-GQ898616

    # Replace leaf names
    for leaf in t.iter_leaves():
        leaf.name = name2tax[leaf.name]

    # Check that everything is OK
    print(t)
    t.show()
    #     /-Bacteria;__Verrucomicrobia;__OPB35_soil_group;__o;__f;__g
    #----|
    #     \-Bacteria;__Firmicutes;__Clostridia;__Clostridiales;__Ruminococcaceae;__Incertae_Sedis

    # Export Newick (note that you may need to replace some special
    # characters, e.g. ";", in the node names)
    t.write(outfile="newTree.nw")
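
The note about special characters matters here because `;`, `:`, `,`, parentheses and brackets all have structural meaning in Newick, and the taxonomy strings are full of semicolons. One possible way to clean the labels before export is sketched below. The helper names `newick_safe` and `safe_labels` and the choice of `|` as the replacement character are assumptions made for this example, not part of ETE.

```python
import re

# Characters with structural meaning in Newick, plus whitespace and quotes.
_UNSAFE = re.compile(r"[\s;:,()\[\]']+")


def newick_safe(label, replacement="|"):
    """Return the label with each run of Newick-reserved characters replaced."""
    return _UNSAFE.sub(replacement, label).strip(replacement)


def safe_labels(names, name2tax):
    """Look up each name; keep the original name when no taxonomy entry exists."""
    return [newick_safe(name2tax.get(name, name)) for name in names]


if __name__ == "__main__":
    name2tax = {"JN178341": "Bacteria;__Verrucomicrobia;__OPB35_soil_group"}
    print(safe_labels(["JN178341", "EF019104"], name2tax))
```

In the leaf loop this would amount to `leaf.name = newick_safe(name2tax.get(leaf.name, leaf.name))`, which also avoids a `KeyError` for leaves that are missing from the taxonomy table.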