Biostar Beta. Not for public use.
select a specific length from protein file
0
Entering edit mode
12 months ago
Jason • 0

Hey everyone, I want to use shell command or Perl script

I have 200 protein sequences with different length.

I want to select only sequences with length size less than 310. Any sequences size more than or equal 310 will not save to the new file.

For example, here, I have three proteins with different lengths. All sequences are in one line.

>sp|P01892|1A02_HUMAN HLA class I histocompatibility antigen, A-2 alpha chain OS=Homo sapiens OX=9606 GN=HLA-A PE=1 SV=1
MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAGSHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDWTAADMAAQTTKHKWEAAHVAEQLRAYLEGTCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGILVLFGAVITGAVVAAVMWRRKSSDRKGGSYSQAASSDSAQGSDVSLTACKV

>sp|P61916|NPC2_HUMAN NPC intracellular cholesterol transporter 2 OS=Homo sapiens OX=9606 GN=NPC2 PE=1 SV=1
MRFLAATFLLLALSTAAQAEPVQFKDCGSVDGVIKEVNVSPCPTQPCQLSKGQSYSVNVT
FTSNIQSKSSKAVVHGILMGVPVPFPIPEPDGCKSGINCPIQKDKTYSYLNKLPVKSEYP
SIKLVVEWQLQDDKNQSLFCWEIPVQIVSHL

>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVL
TAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKR
TRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCESDLRHY
YDSTIELCVGDPEIKKTSFKGDSGGPLVCNKVAQGIVSYGRNNGMPPRACTKVSSFVHWI
KKTMKRY

Result:

>sp|P61916|NPC2_HUMAN NPC intracellular cholesterol transporter 2 OS=Homo sapiens OX=9606 GN=NPC2 PE=1 SV=1
MRFLAATFLLLALSTAAQAEPVQFKDCGSVDGVIKEVNVSPCPTQPCQLSKGQSYSVNVT
FTSNIQSKSSKAVVHGILMGVPVPFPIPEPDGCKSGINCPIQKDKTYSYLNKLPVKSEYP
SIKLVVEWQLQDDKNQSLFCWEIPVQIVSHL

>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVL
TAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKR
TRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCESDLRHY
YDSTIELCVGDPEIKKTSFKGDSGGPLVCNKVAQGIVSYGRNNGMPPRACTKVSSFVHWI
shell perl • 121 views
ADD COMMENTlink
1
Entering edit mode

Take a look at seqkit (https://bioinf.shenwei.me/seqkit/usage/ ). You should be able to find a tool that will do what you need.

ADD REPLYlink
0
Entering edit mode

Asked from time to time: How To Filter Multi Fasta By Length??

There are other posts with solutions, please search the site in case the ones from the link above (or genomax's) doesn't suit your needs.

ADD REPLYlink
0
Entering edit mode
11 months ago
JC 7.9k
Mexico
#!/usr/bin/perl

use strict;
use warnings;

my $min_size = 310;
$/="\n>"; # read fasta per sequence block

while (<>) {
    my ($header, @seq) = split (/\n/, $_);
    my $seq = join ("", @seq);
    print $_ if ( lenght $seq <= $min_size);
}

Usage:

perl filterBySize.pl < FASTA_IN > FASTA_OUT
ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1