Splitting a large FASTA file by domain, keeping only HUMAN sequences
8.8 years ago
hb273 • 0

Hi all,

I have a FASTA file (Pfam-A.fasta) that contains protein domain sequences from many species. I need to split it into multiple FASTA files, one per domain, keeping only the HUMAN sequences. What is the best way to do this? Ideally, each file would be named after its domain. I am using Perl.

Example records from the FASTA file:

>B4DW62_HUMAN/123-274  PF00198.17;2-oxoacid_dh;
WDGEGPKQLPFIDISVAVATDKGLLTPIIKDAAAKGIQEIADSVKALSKK
ARDGKLLPEEYQGGSFSISNLGMFGIDEFTAVINPPQACILAVGRFRPVL
KLTEDEEGNAKLQQRQLITVTMSSDSRVVDDELATRFLKSFKANLENPIR
LA
>D2HF00_AILME/272-501  PF00198.17;2-oxoacid_dh;
PGTFTEIPASNIRRVIAKRLTESKSTVPHAYATADCDLGAVLKARQSLVR
DDIKVSVNDFIIKAAAVTLKQMPDVNVSWDGEGPKQLPFIDISVAVATDK
GLITPIIKDAAAKGVQEIADSVKALSKKARDGKLLPEEYQGGSFSISNLG
MFGIDEFTAVINPPQACILAVGRFRPVLKLEQDEEGNARLQPHQLITVTM
SSDSRVVDDELATRFLENFKANLENPIRLA

Many Thanks

Hanadi

sequence • 2.1k views

Can you post your perl code?


I am new to Perl.

I used this script to split the file:

#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;

my $file = "Pfam-A.fasta";   # name of the input FASTA file
my $records_per_file = 4;    # number of records per output file (chunk size)
my $file_number = 1;         # numeric suffix used in the output file names
my $counter = 0;             # number of records seen so far

open(my $in, "<", $file) or die "Cannot open file $file: $!";

my $out;    # filehandle of the chunk currently being written
while (my $line = <$in>) {
    # A header line that starts a new chunk opens the next output file.
    if ($line =~ /^>/ && $counter++ % $records_per_file == 0) {
        close($out) if $out;    # nothing to close before the first chunk
        my $basename = basename($file);
        my $new_file_name = $basename . $file_number++ . ".fasta";
        open($out, ">", $new_file_name) or die "Cannot open file $new_file_name: $!";
    }
    # Skip any lines that appear before the first header.
    print {$out} $line if $out;
}
close($out) if $out;
close($in);
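
The script above splits the input into fixed-size chunks of records rather than by domain. A minimal sketch of the per-domain split asked for in the question might look like the following. It assumes every header follows the layout shown in the example records (`>ACCESSION_SPECIES/start-end  PFxxxxx.version;domain_name;`); headers in any other layout are skipped. The name `split_human_by_domain` and the optional output-directory argument are hypothetical, introduced only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of the original script): writes the HUMAN
# records of $in_path into one FASTA file per Pfam domain inside $out_dir.
# Returns a hash reference mapping domain name => number of records written.
sub split_human_by_domain {
    my ($in_path, $out_dir) = @_;
    my %count;            # domain name => records written so far
    my $out;              # filehandle of the file currently being written
    my $current = '';     # domain that $out belongs to
    my $keep = 0;         # true while inside a HUMAN record

    open(my $in, "<", $in_path) or die "Cannot open $in_path: $!";
    while (my $line = <$in>) {
        if ($line =~ /^>/) {
            # Assumed header layout: >ACC_SPECIES/start-end  PFxxxxx.v;name;
            if ($line =~ m{^>\S+_HUMAN/\S+\s+PF\d+\.\d+;([^;\s]+);}) {
                (my $domain = $1) =~ s/[^A-Za-z0-9._-]/_/g;   # safe file name
                if ($domain ne $current) {
                    close($out) if $out;
                    # Truncate on first use and append afterwards, so only
                    # one output file is open at any time.
                    my $mode = $count{$domain} ? ">>" : ">";
                    my $path = File::Spec->catfile($out_dir, "$domain.fasta");
                    open($out, $mode, $path) or die "Cannot open $path: $!";
                    $current = $domain;
                }
                $count{$domain}++;
                $keep = 1;
            }
            else {
                $keep = 0;    # non-HUMAN record or unexpected header
            }
        }
        print {$out} $line if $keep;
    }
    close($out) if $out;
    close($in);
    return \%count;
}

# Runs only when arguments are given, for example:
#   perl split_human.pl Pfam-A.fasta out_dir
if (@ARGV) {
    my ($in_path, $out_dir) = @ARGV;
    $out_dir //= ".";
    my $counts = split_human_by_domain($in_path, $out_dir);
    printf "%s\t%d\n", $_, $counts->{$_} for sort keys %$counts;
}
```

For the example records in the question, this would write the B4DW62_HUMAN record to 2-oxoacid_dh.fasta and skip the D2HF00_AILME record. Reopening a file in append mode when a domain reappears keeps the number of open filehandles at one, which avoids hitting the per-process open-file limit on inputs with many domains. The output directory must already exist.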