How to match a FASTA header for extraction using Perl?
15 months ago

Hi!

I have a FASTA file containing sequences, and I want to replace the old FASTA headers with new ones. The first step is to match the header names, that is, the name that comes after the '>'. How do I do this? All sequences have headers that look something like this:

>Halobacterium_salinarum

This is the part of the code where I find the headers:

     while (my $line = <$IN>) {
         if ($line =~ /^>/) {
             my $x =   # Here I want to match "Halobacterium_salinarum"
                       # and all the other species names

I have tried for hours to work out the right pattern. Is it "any word character", \w? I also want to save the old species name in a hash, so should I capture it with (\w+) and end the pattern with \s, since that is where the name ends?

Perl • 417 views

Try the script from the following article.

https://www.perlmonks.org/?node_id=975419


So, people still use Perl for Bioinformatics!


Using BioPerl will probably make your life easier:

use strict;
use warnings;
use Bio::SeqIO;

my $file  = shift @ARGV;    # path to the FASTA file, given on the command line
my $fasta = Bio::SeqIO->new(-file => $file, -format => 'Fasta');
while ( my $seq = $fasta->next_seq() ) {
  # id() returns the first word of the header, without the leading '>'
  my $header = $seq->id;
  print "My species name = $header\n";
}
14 months ago
Juke-34 ♦ 2.2k
Sweden
while (my $line = <$IN>) {
  # Match header lines and capture everything after the '>'
  if ($line =~ m/^>(.+)/) {
     print "My species name = $1\n";
  }
}
13 months ago
JC 7.9k
Mexico

In Perl, \w matches any alphanumeric character and the underscore, so (\w+) matches a whole word and stops at the first non-word character (such as a space or a newline). If you want to save this in a hash:

#!/usr/bin/perl

use strict;
use warnings;

my %species = ();
while (<>) {
    if ( m/^>(\w+)/ ) {
        $species{$1}++;    # count how often each species name appears
    }
}

print "Species\tCount\n";
while (my ($sp, $cnt) = each %species) {
    print "$sp\t$cnt\n";
}
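
Since the original goal was to replace old headers with new ones, the same (\w+) capture can probably be reused for the renaming step. Below is a minimal sketch, not a definitive solution: the mapping %new_name_for, the new name 'H_salinarum', the helper rename_header and the in-memory @fasta lines are all made up for illustration. Note that \w+ stops at a hyphen or a dot, so names containing those characters would need a different pattern such as \S+.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical mapping from old header names to new ones (illustrative values only).
my %new_name_for = (
    Halobacterium_salinarum => 'H_salinarum',
);

# Return the line with its header name replaced when the old name is in the map.
# Sequence lines and headers with unknown names are returned unchanged.
sub rename_header {
    my ($line, $map) = @_;
    if ($line =~ /^>(\w+)/ && exists $map->{$1}) {
        my $old = $1;
        # \Q...\E quotes the old name so it is matched literally.
        $line =~ s/^>\Q$old\E/>$map->{$old}/;
    }
    return $line;
}

# Made-up FASTA lines standing in for the contents of a real file.
my @fasta = (
    ">Halobacterium_salinarum some description\n",
    "ATGC\n",
    ">Unknown_species\n",
    "GGCC\n",
);

my @renamed = map { rename_header($_, \%new_name_for) } @fasta;
print @renamed;
```

With a real file, the same helper could be called on each line read from the input filehandle instead of on @fasta.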