Closed:[Discussion] Parsing Fasta Without Bioperl
4
0
Entering edit mode
11.5 years ago
Puriney ▴ 330

We all known the typical fasta format is like:

>id_123 range=chr1_1234_5678 strand=+
atgcatgc

Now I wish to covert them into single one line, like:

id_123 chr1 1234 5678 atgcatgc

The ID and location information can be obtained in the > line, which is easy. But how to print the format i want? I'll use Perl.

  • Regular Expression

Straightforward is to use regular expression.

my $infile =shift;
open IN, $infile;
my $seq = '';
while ( my $line = <IN> ) {
    chomp $line;
    if ($line =~ /^>/){
        print $seq . "\n";
        $seq = '';
        print $line."\t"; #header information, u can do other processing. 
    }
    elsif ($line !~ /\s+/){
        $seq = $seq . $line ; 
    }
}

The above codes will leave the sequence of last gene, since only when the $line=~ /^>/ the code will output the sequence.

Thus, a trick is to add a > as the very bottom line of the file before file processing. BTW, we do have hash available, but for thousands of lines, hash will cost great amount of memory, so I don't like hash.

  • Bio::SeqIO

BioPerl is much more easier.

my $infile =shift;
use Bio::SeqIO;
my $seqio = Bio::SeqIO->new (-file =>$infile, -format=>'fasta');
while (my $seq = $seqio->next_seq){
    my ($id ,$desc, $fullseq) = ($seq->id, $seq->desc, $seq->seq);
# any other processing. Straightforward. 
}

The reason why I don't want to use BioPerl is that not every computer I use will get BioPerl module installed. And this code reminds me of the "Locate non-zero numbers start and end from a string of numbers". Somehow we can treat the > as the first none-zero number ( meanwhile as the start ), and the sequence as zero. But...I cannot figure it out because the > actually induce output, while in the numbers_code.pl the output-inducing duty is on zero number.

Thus, any suggestion?

deleted-post • 373 views
ADD COMMENT
This thread is not open. No new answers may be added
Traffic: 2662 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6