Biostar Beta. Not for public use.
How to extract a region of protein or small domain '150-300' from a long sequences of protein from a protein length of 2000AA. through command.
0
Entering edit mode
19 months ago
United States

hello all,

i have a large number of protein made up of 1000 to 2000 amino acid. now in these i want to extract only a small region for example '150-300' from each protein. how can i do this through terminal.i need to extract different region from different protein what i have to do.

ADD COMMENTlink
0
Entering edit mode

well do you know which sequences are involved in your processing and which regions you want to splice out ?? if yes then :

perl -lne 'chomp; if($l ==1){push(@a, split("",$_))}; if(/>(.*)/){if(@a){print splice @a, 0,100; @a = ()}; if( $1 =~ /1/){$l = 1;print $_}else{$l=0;}}; ' in_file

NOTE: code not tested!!

the idea is to locate a sequence :$1 =~ /my_seq_id/ and then extract a region starting at position 0 of length 100 ( print splice @a, 0,100;) Please note that the above line of code can be extended to extract the same position within any sequence sharing soem regex in its header.

ADD REPLYlink
0
Entering edit mode

Thank you sir..

but if i need to extract different region from different protein what i have to do

ADD REPLYlink
0
Entering edit mode

repeat the procedure :

extract 122-222 from sequence xxcx :

perl -lne 'chomp; if($l ==1){push(@a, split("",$_))}; if(/>(.*)/){if(@a){print splice @a, 121,100; @a = ()}; if( $1 =~ /xxcx/){$l = 1;print $_}else{$l=0;}}; ' in_file

extract 22-522 from sequence yyy :

perl -lne 'chomp; if($l ==1){push(@a, split("",$_))}; if(/>(.*)/){if(@a){print splice @a, 21,500; @a = ()}; if( $1 =~ /yyy/){$l = 1;print $_}else{$l=0;}}; ' in_file 
ADD REPLYlink
0
Entering edit mode

write a script like python or perl and run it from terminal , which will do your job!!!

ADD REPLYlink
1
Entering edit mode
19 months ago
Lesley Sitter • 460
Netherlands

in python you could do something like this

import sys
file_open = open(sys.argv[1],'r').readlines()
file_out = open("outputfile.txt",'w')
for lines in file_open:
    if '>' in lines:
        file_out.write(lines)
    else:
        file_out.write(lines[149:300]+"\n")

You can then start this script by typing
Python my_script_name.py my_input_file.fasta

ADD COMMENTlink
1
Entering edit mode
18 months ago
France/Nantes/Institut du Thorax - INSE…

prepare a list with the regions you want to extract (fasta-header, start,end)

$ cat list.txt
>gi|72201761    10    20
>gi|14666145    11    40
>gi|83671954    13    50

linearize the fasta , join both sorted files, extract the substring with awk

join -t '      ' -1 1 -2 1 <(awk '/>/ {printf("\n%s\t",$0);next;} {printf("%s",$0);} END {printf("\n");}' input.fasta  | sort -k1,1) <(sort -k1,1 list.tsv) | awk -F '     ' '{printf("%s(%s-%s)\n%s\n",$1,$3,$4,substr($2,int($3),int($4)-int($3)));}'
>gi|14666145(11-40)
CATTCATCATGATAACAGCACCCTAAATA
>gi|72201761(10-20)
GTGCCATTTA
>gi|83671954(13-50)
CCGGAATTCCCGGGATATCGTCGACCCACGCGTCCGC
ADD COMMENTlink
0
Entering edit mode
20 months ago
venu 6.2k
Germany

You can use this perl script. It also works if it is MSA file.

$ perl select_sites.pl  file_name.faa  <start_position>  <end_position>

or

grep -v '^>' file_name.faa | tr -d "\n" | cut -c 150-300

but this works well when there is one sequence per one file. However you can loop it over all the files.

ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3.1