Question

Need Help Writing A Perl Script For Extracting Sequences Based On A List

1

Entering edit mode

10.5 years ago

biolab ★ 1.4k

Dear All, I am a true starter of perl. I have finished reading several chapters of LEARNING PERL. Now I tried to write script. SEQ file contains multiple gene sequences, while LIST file is a list of gene names. Below is an example.

SEQ file:

>gene 1
gctagtcagtacgtacgtac
>gene 2
cgagagagaggccgcgagcatcagtacgtgtac
>gene 3
ccgcggggcggccgcggcccccgcggcgcgcgatta

LIST file:

>gene2
>gene3

My script is as follows:

#!/usr/bin/perl
use strict;

my ($seq, $list) =@ARGV;
open SEQ, "$seq";
open LIST, "$list";
my @seq= <SEQ>;
my @seq = split;
my @list= <LIST>;
my @list = split;

my $i=0;
my $j=0;
while ($i<$#seq){
    while ($j < $#list){
        if ($seq[$i] eq $list[$j]){
            print "$seq[$i] is found\n";
        }
        $j++;
    }
    $i +=2;
}
close SEQ;
close LIST;

Could anyone working on perl make some corrections, suggestions and comments. I appreciate your help, which I believe drive me forward in the bioinformatics field. THANKS A LOT!

perl • 6.4k views

ADD COMMENT • link updated 10.4 years ago by emazep ▴ 30 • written 10.5 years ago by biolab ★ 1.4k

1

Entering edit mode

if you are working on a unix machine and your 'list' file isn't too long you should probably go with:

grep -f list.txt sequences.fa

if your list-file is really large have a look at Kent tools and there the progamme faSomeRecords.

Cheers

ADD REPLY • link 10.4 years ago by Phil S. ▴ 700

0

Entering edit mode

THANKS for handy method!

ADD REPLY • link 10.4 years ago by biolab ★ 1.4k

0

Entering edit mode

no worries... ;)

ADD REPLY • link 10.4 years ago by Phil S. ▴ 700

0

Entering edit mode

I think you should mentioned what do you want your script to do.

ADD REPLY • link 10.5 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

Sorry for unclear description. I aim to extract sequences from SEQ file according to LIST file. In my script the line print "$seq[$i] is found\n" should be print "$seq[$i] ($seq[$i+1]) is found\n". Anyway, my script is wrong somewhere, as when running perl test.pl seq.fa list.txt, I didn't get any responding. THANKS for your help!

ADD REPLY • link 10.4 years ago by biolab ★ 1.4k

1

Entering edit mode

First ask yourself, what does this do :

my @list= <LIST>; 
my @list = split;

What do you hope to achieve ? Tip, use the line "use Data::Dumper;" and print a variable using print Dumper($variable); to start debugging. I will not give you the answer since I feel this is a learning opportunity for you.

ADD REPLY • link 10.4 years ago by Gabriel R. ★ 2.9k

score 3 · Answer 1 · 2013-11-09

In a comment, you said, "I aim to extract sequences from SEQ file according to LIST file." Given your list and sequence IDs, you'll never find a match, since the list doesn't contain any spaces, but the sequence IDs do.

Here are just a few items for you to consider:

Always:

use strict;
use warnings;

When opening files, use the three-argument form of open, e.g.,

open my $seq, '<', $seq or die "Unable to open $seq: $!";

Note that this form of open uses lexical varibles (e.g., my $seq) for file handles and not bare words (e.g., SEQ and LIST). This form also traps errors.

It's not necessary to read both files into an array; no arrays are necessary here. In fact, in this case, a hash whose keys are the list items would work better:

Create a hash where the keys are the list items. Use chomp on each line read to remove the trailing EOL character.
Read through the seq file, line-by-line, and solate the seq ID. If that seq ID exists within the list hash, you've got a match, so print the seq.

split is not needed for your task.

Hope this helps!

score 2 · Answer 2 · 2013-11-11

In case it's helpful, here's how I would solve this. Feel free to ask questions about anything that doesn't make sense :)

use v5.10.0;
use strict;
use warnings;

# checks that you have input files
unless ($ARGV[0] and $ARGV[1]) {
    die "Missing input file\nUsage: $0 list.txt seq.fa\n";
}

# opens the list.txt file and stores the names in a hash
my %interestingnames;
open FILE, $ARGV[0];
while (my $line = <FILE>) {
    chomp $line;
    $interestingnames{$line}++;
}
close FILE;

# opens the fasta file
# goes through line by line checking for a match
my $name;
my $seq;
open FILE1, $ARGV[1];
while (my $line = <FILE1>) {
    chomp $line;

    #checks whether it's a header line
    if ($line =~ />(.+)/) {
        # stores the regular expression match as a name
        $name = $1; 
    } else {
        $seq = $line;

        #checks whether name exists in list file, by checking hash keys
        if(exists($interestingnames{$name})) {
            print ">$name\n";
            print "$seq\n";
        }
    }
}
close FILE1;

score 0 · Answer 3 · 2013-11-09

0

Entering edit mode

10.4 years ago

Charles Warden 8.2k

One advantage to using Perl (or Python, or R) is having pre-defined packages for bioinformatics analysis: You should definitely become familiar with BioPerl:

http://www.bioperl.org/wiki/Main_Page

http://www.bioperl.org/wiki/Getting_Started

For example, there is a module that is specifically designed for parsing FASTA files:

http://www.bioperl.org/wiki/HOWTO:Beginners#Retrieving_a_sequence_from_a_file

You just need to selectively print the values from the keys matching your list to a new file.

ADD COMMENT • link 10.4 years ago by Charles Warden 8.2k

1

Entering edit mode

Excellent suggestion to use bioperl! But the OP first needs to get a handle (no pun intended) on some Perl basics as a foundation for using such modules. This is scope and sequence in pedagogical terms, and the OP did say, "I am a true starter of perl..." Additionally, bioperl will not work on the OP's list file, but only on the fasta-type file. Even then, though, no match will be found between the two files, given the OP's datasets.

ADD REPLY • link 10.4 years ago by Kenosis ★ 1.3k

Devon Ryan · Answer 4 · 2013-11-11

0

Entering edit mode

10.4 years ago

biolab ★ 1.4k

Dear all, Thank you for your answers. I still faced trouble with the script to extract seqeunces based on a list. An example is shown below.

FASTA FILE:
>gene1
cgcggccccccccccccccccccc
>gene2
ggggggggggggggggggcgcgcg
>gene3
tagctacgatagcgtcagatgcaacg
>gene4
attttttttttttttttttttttttttttaaaaaaa

LIST FILE:

>gene4
>gene3

MY SCRIPT: #!/usr/bin/perl use strict; use warnings;

open SEQ, "seq.txt";
my @seq=<SEQ>;
open LIST, "list.txt";
my @list=<LIST>;

my $i=0;
while ($i<$#seq){
    my $j=0;
    while ($j<$#list){
        if ($seq[$i] eq $list[$j]){
            print "$seq[$i]\t$seq[$i+1]";
        }
        $j++;
    }
    $i +=2;
}
close SEQ;
close LIST;

THE OUTPUT RESULT:

>gene4
attttttttttttttttttttttttttttaaaaaaa

Where is the error in my script? I have two names in the LIST file, but only got one sequence. THANKS A LOT!!

ADD COMMENT • link updated 10.4 years ago by Devon Ryan 104k • written 10.4 years ago by biolab ★ 1.4k

0

Entering edit mode

Formatting is really important for writing scripts that you and other people can read and debug. While Perl lets you format your scripts however you like, following best practices is still recommended. These are:

One command per line
Use whitespace for clarity, to separate different sections of your code.
Always, always indent loops. Always.
Your aim should not be to write a script as short as possible - it should be to write a script that's as humanly readable as possible. This includes giving variables informative names, and leaving comments about what different sections of your code are doing.

This sort of one line script is not particularly human-readable, therefore finding errors in it is basically impossible. If you post a friendly, formatted version, I'd be happy to have a look :)

ADD REPLY • link 10.4 years ago by Jelena Aleksic ▴ 920

0

Entering edit mode

Though this could just be a formatting issue on the forum rather than in your script :)

ADD REPLY • link 10.4 years ago by Jelena Aleksic ▴ 920

1

Entering edit mode

I've fixed that for biolab, though he/she should really not post that as an answer.

ADD REPLY • link 10.4 years ago by Devon Ryan 104k

0

Entering edit mode

THANKS for advices. Actually when coping my original formatted script here, its format changed. Anyway I will try to sort out, as I am a new member in Biostars.

ADD REPLY • link 10.4 years ago by biolab ★ 1.4k

0

Entering edit mode

No worries, the formatting takes some time to get used to!

ADD REPLY • link 10.4 years ago by Devon Ryan 104k

score 0 · Answer 5 · 2013-11-12

Assuming that your files are always as shown (no extraneous lines from your examples), a more concise and faster (and probably more readable too) script is possible, which is maybe also interesting as learning material since it uses a couple of useful Perl features:

#!/usr/bin/perl
use strict;
use warnings;

my $listfile = shift;
my %wanted_seq;

{
    local @ARGV = $listfile;
    while ( my $name = <> ) {
        chomp $name;
        $wanted_seq{$name} = 1 unless exists $wanted_seq{$name}
    }
}

while ( my $name = <> ) {
    chomp $name; my $seq = <>;
    print $name, "\n", $seq if $wanted_seq{$name}
}

Usage:

perl script.pl listfile seqfile > outfile

A couple of general advices, besides the ones already offered by others:

unless you need sophisticate error checking, don't explicitly open files in your script, pass instead the filenames on the command line and let Perl open them for you: this way your script can be (more easily) combined with other commands;
avoid slurping files into arrays: it's less efficient both in time and in space - really in this particular case it's better to keep listfile in memory (inside %wanted_seq in my implementation), but seqfile can be examined one line (well, two lines) at a time, with potentially considerable savings;
when iterating over arrays, foreach is nearly always preferable over while (or C-style for loops).

Good work!