Replace Sequence IDs in many text files (newick format) - Change script
2
0
Entering edit mode
7.1 years ago
ibasan ▴ 50

Dear all, i found this great script to replace sequence identifiers in a text file (newick format) with strings from a tab delimited file in this post

use strict;
use warnings;
my $treeFile = pop;
my %taxonomy = map { /(\S+)\s+(.+)/; $1 => $2 } <>;
push @ARGV, $treeFile;
while ( my $line = <> ) {
    $line =~ s/\b$_\b/$taxonomy{$_}/g for keys %taxonomy;
    print $line;
}

It works perfect for one text file (newick) but i have thousonds (<20000) of such text files (newick) and one tab delimited file. How can i change this script for all of my text files? Any help would be great!

perl newick tree replace • 2.9k views
ADD COMMENT
5
Entering edit mode
7.1 years ago
Joe 21k

You don't need to modify the perl script to achieve this, just call the script in a loop in the Shell (if your script works the way I assume it does).

Something like:

for file in *.tree ; do perl scriptname.pl tabfile.tsv "${file}" ; done

I don't know if your script 'modifies-in-place' (I'm no perl-aficionado). If not, just redirect the STDOUT to make a new file:

for file in *.tree ; do perl scriptname.pl tabfile.tsv "${file}" > "${file%.*}"_modified.tree ; done

Would give you a new file with _modified.tree appended to the end, for every input tree you had:

tree1.tree -> tree1_modified.tree etc etc...

ADD COMMENT
1
Entering edit mode

ibasan : If you have access to a cluster you could use this script to fire those jobs in parallel. Adjust do perl line to include relevant job scheduler commands (warning: test with a small set of files first before you fire off 20000 incorrect jobs).

ADD REPLY
0
Entering edit mode

Hi genomax2, atm sridhar56 script is running. I tried your version like this:

!/usr/bin/perl
for file in *.txt ; do perl script.pl tabfile.csv "${file}" > "${file%.*}"_modified.tree;
done

But this did not work. Can you please explain what i have to do?

ADD REPLY
0
Entering edit mode

You either need to run that on the terminal using the line,

for file in *.txt ; do perl script.pl tabfile.csv "${file}" > "${file%.*}"_modified.tree;done

Or copy the script into a shell file, Ex: tree.sh,

#!/bin/bash
 for file in *.txt ; do perl script.pl tabfile.csv "${file}" > "${file%.*}"_modified.tree;done

and run on the terminal,

sh tree.sh
ADD REPLY
0
Entering edit mode

I tested it with 1000 files and it works! Thanks for helping me again!

ADD REPLY
1
Entering edit mode
7.1 years ago
sridhar56 ▴ 110

So, to make your script work for the n number of file you have, these are the things to do:

  1. Read all the files from a directory and modify them
  2. Replace the ID's as done above
  3. Save the files to an array
  4. Print them out as individual files

Below is my code, I modified to fit your situation, you will have to modify it accordingly with your input files

ADD COMMENT
0
Entering edit mode
#!/usr/bin/perl
use strict;
use warnings;
use Cwd;                      #Takes in the current path of the directory
use FileHandle;
my $cwd = getcwd();              #Takes in the current directory of files
my $opendir = "$cwd";               #Obtains the path of your current directory
my $treeFile = pop;
my %taxonomy = map { /(\S+)\s+(.+)/; $1 => $2 } <>;
push @ARGV, $treeFile;
my @file_name;
my @file;
my @directory = grep /tre$/, readdir DIR;  #assuming all your tree files have the tre as file extension
opendir (DIR, $opendir) or die $!;              #Opens the current directory, the path of all your files
my $scalar = scalar(@directory);                #Counting the number of files and printing it in the next line
for (my $i=0; $i<scalar(@directory); $i++){
my $name = $directory[$i];
print "The file read in: $name\n";
push @file_name,$name;
my $tmp = "$opendir/$directory[$i]";
open(FILE, $tmp) or die "Can't open $tmp\n";              #Reading the file and storing them in a string in the rep_line array
while ( <FILE> ) {
    $line = $_;
    $line =~ s/\b$_\b/$taxonomy{$_}/g for keys %taxonomy;
    #print $line;
    push @file,$line;
    }
}


mkdir "$cwd/Individual_tree_files"; 
for (my $i=0;$i<scalar(@file);$i++){
printer($file_name[$i],$file[$i]);                  #Calls the subroutine for printing to new files
}


sub printer { 
my @print_array = @_;
my $print_name = substr $print_array[0],1,-1;                           #Use a substring to grab the first part of the file name, modify accordingly
$print_name=~ s/\//_/g;
print "$print_name\n";
my $tree= "\.tre";
my $name_for_folder = "Individual_tree_files/";
my $name_for_file = "$name_for_folder$print_name$tree";                        #Creating the name for every output file
print "$name_for_file\n";
open(my $fh, '>',$name_for_file) or die "Could not open file '$name_for_file' $!";
 print $fh "$print_array[1]";                                                          #Printing the output file into the folder
 print "File for $print_name copied to Individual_tree_files\n";
 close $fh;
}
ADD REPLY
0
Entering edit mode

Tried to format it, was unable to, if you copy paste the code to a text file, save it as perl file and then run the script, it should work

ADD REPLY
1
Entering edit mode

Highlight text you want to format, click on the 101010 button.

ADD REPLY
0
Entering edit mode

Shall do that in the future!

ADD REPLY
0
Entering edit mode

Thanks a lot sridhar56 for helping me! Do i use the script as follow:

perl script.pl taxonomyfile.csv

in the directory with all my treefiles? I added my $line = $_; in the esction with variables And now its calculating...

ADD REPLY
0
Entering edit mode

The problem is, $line does not have a my before it at line 23. To make the script function well for your files, it would be best to copy all the taxonomy ID's within the script and assign them to the hash %taxonomy. Then map them to your tree files. That will work.

To keep it easy, if you have access to linux, try the solution suggested by,@jrj.healey. His solution is simpler than mine

ADD REPLY

Login before adding your answer.

Traffic: 1839 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6