Biostar Test Site

This is site is used for testing only. Visit: https://www.biostars.org to ask a question.

How to extract a region of protein or small domain '150-300' from a long sequences of protein from a protein length of 2000AA. through command.
3
0
Entering edit mode
5.9 years ago

hello all,

i have a large number of protein made up of 1000 to 2000 amino acid. now in these i want to extract only a small region for example '150-300' from each protein. how can i do this through terminal.i need to extract different region from different protein what i have to do.

sequence protein command terminal programming • 1.9k views
ADD COMMENT
0
Entering edit mode

well do you know which sequences are involved in your processing and which regions you want to splice out ?? if yes then :

 

perl -lne 'chomp; if($l ==1){push(@a, split("",$_))}; if(/>(.*)/){if(@a){print splice @a, 0,100; @a = ()}; if( $1 =~ /1/){$l = 1;print $_}else{$l=0;}}; ' in_file

 

NOTE: code not tested!!

 

the idea is to locate a sequence :$1 =~ /my_seq_id/  and then extract a region starting at position 0 of length 100 ( print splice @a, 0,100;) Please note that the above line of code can be extended to extract the same position within any sequence sharing soem regex in its header.

ADD REPLY
0
Entering edit mode

Thank you sir..

but if i need to extract different region from different protein what i have to do

ADD REPLY
0
Entering edit mode

repeat the procedure :

extract 122-222 from  sequence xxcx :

 
perl -lne 'chomp; if($l ==1){push(@a, split("",$_))}; if(/>(.*)/){if(@a){print splice @a, 121,100; @a = ()}; if( $1 =~ /xxcx/){$l = 1;print $_}else{$l=0;}}; ' in_file

 

extract 22-522 from  sequence yyy :

 

perl -lne 'chomp; if($l ==1){push(@a, split("",$_))}; if(/>(.*)/){if(@a){print splice @a, 21,500; @a = ()}; if( $1 =~ /yyy/){$l = 1;print $_}else{$l=0;}}; ' in_file 

 

ADD REPLY
0
Entering edit mode

write a script like python or perl and run it from terminal , which will do your job!!!

ADD REPLY
1
Entering edit mode
5.9 years ago
Lesley Sitter ▴ 560

in python you could do something like this

 

import sys
file_open = open(sys.argv[1],'r').readlines()
file_out = open("outputfile.txt",'w')
for lines in file_open:
    if '>' in lines:
        file_out.write(lines)
    else:
        file_out.write(lines[149:300]+"\n")

 

You can then start this script by typing
Python my_script_name.py my_input_file.fasta

ADD COMMENT
1
Entering edit mode
5.9 years ago

prepare a list with the regions  you want to extract (fasta-header, start,end)

$ cat list.txt
>gi|72201761    10    20
>gi|14666145    11    40
>gi|83671954    13    50

 

linearize the fasta , join both sorted files, extract the substring with awk

join -t '      ' -1 1 -2 1 <(awk '/>/ {printf("\n%s\t",$0);next;} {printf("%s",$0);} END {printf("\n");}' input.fasta  | sort -k1,1) <(sort -k1,1 list.tsv) | awk -F '     ' '{printf("%s(%s-%s)\n%s\n",$1,$3,$4,substr($2,int($3),int($4)-int($3)));}'
>gi|14666145(11-40)
CATTCATCATGATAACAGCACCCTAAATA
>gi|72201761(10-20)
GTGCCATTTA
>gi|83671954(13-50)
CCGGAATTCCCGGGATATCGTCGACCCACGCGTCCGC

 

ADD COMMENT
0
Entering edit mode
5.9 years ago
venu 6.8k

You can use this perl script. It also works if it is MSA file.

$ perl select_sites.pl  file_name.faa  <start_position>  <end_position>

or

grep -v '^>' file_name.faa | tr -d "\n" | cut -c 150-300

but this works well when there is one sequence per one file. However you can loop it over all the files.

 

 

ADD COMMENT

Login before adding your answer.

Traffic: 181 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6