Question

Closed:[Discussion] Parsing Fasta Without Bioperl

0

Entering edit mode

11.5 years ago

Puriney ▴ 330

We all known the typical fasta format is like:

>id_123 range=chr1_1234_5678 strand=+
atgcatgc

Now I wish to covert them into single one line, like:

id_123 chr1 1234 5678 atgcatgc

The ID and location information can be obtained in the > line, which is easy. But how to print the format i want? I'll use Perl.

Regular Expression

Straightforward is to use regular expression.

my $infile =shift;
open IN, $infile;
my $seq = '';
while ( my $line = <IN> ) {
    chomp $line;
    if ($line =~ /^>/){
        print $seq . "\n";
        $seq = '';
        print $line."\t"; #header information, u can do other processing. 
    }
    elsif ($line !~ /\s+/){
        $seq = $seq . $line ; 
    }
}

The above codes will leave the sequence of last gene, since only when the $line=~ /^>/ the code will output the sequence.

Thus, a trick is to add a > as the very bottom line of the file before file processing. BTW, we do have hash available, but for thousands of lines, hash will cost great amount of memory, so I don't like hash.

Bio::SeqIO

BioPerl is much more easier.

my $infile =shift;
use Bio::SeqIO;
my $seqio = Bio::SeqIO->new (-file =>$infile, -format=>'fasta');
while (my $seq = $seqio->next_seq){
    my ($id ,$desc, $fullseq) = ($seq->id, $seq->desc, $seq->seq);
# any other processing. Straightforward. 
}

The reason why I don't want to use BioPerl is that not every computer I use will get BioPerl module installed. And this code reminds me of the "Locate non-zero numbers start and end from a string of numbers". Somehow we can treat the > as the first none-zero number ( meanwhile as the start ), and the sequence as zero. But...I cannot figure it out because the > actually induce output, while in the numbers_code.pl the output-inducing duty is on zero number.

Thus, any suggestion?

deleted-post • 373 views

ADD COMMENT • link updated 11.5 years ago by Farhat ★ 2.9k • written 11.5 years ago by Puriney ▴ 330