We all known the typical fasta format is like:
>id_123 range=chr1_1234_5678 strand=+
atgcatgc
Now I wish to covert them into single one line, like:
id_123 chr1 1234 5678 atgcatgc
The ID and location information can be obtained in the >
line, which is easy. But how to print
the format i want? I'll use Perl
.
- Regular Expression
Straightforward is to use regular expression.
my $infile =shift;
open IN, $infile;
my $seq = '';
while ( my $line = <IN> ) {
chomp $line;
if ($line =~ /^>/){
print $seq . "\n";
$seq = '';
print $line."\t"; #header information, u can do other processing.
}
elsif ($line !~ /\s+/){
$seq = $seq . $line ;
}
}
The above codes will leave the sequence of last gene, since only when the $line=~ /^>/
the code will output the sequence.
Thus, a trick is to add a >
as the very bottom line of the file before file processing. BTW, we do have hash
available, but for thousands of lines, hash
will cost great amount of memory, so I don't like hash.
- Bio::SeqIO
BioPerl is much more easier.
my $infile =shift;
use Bio::SeqIO;
my $seqio = Bio::SeqIO->new (-file =>$infile, -format=>'fasta');
while (my $seq = $seqio->next_seq){
my ($id ,$desc, $fullseq) = ($seq->id, $seq->desc, $seq->seq);
# any other processing. Straightforward.
}
The reason why I don't want to use BioPerl is that not every computer I use will get BioPerl module installed. And this code reminds me of the "Locate non-zero numbers start and end from a string of numbers". Somehow we can treat the >
as the first none-zero number ( meanwhile as the start ), and the sequence as zero. But...I cannot figure it out because the >
actually induce output, while in the numbers_code.pl
the output-inducing duty is on zero number.
Thus, any suggestion?