Remove hypothetical proteins
5.8 years ago
biodano.geo ▴ 20

Hi,

I want to remove all hypothetical proteins from a proteome FASTA file.

I used grep -v "hypothetical" proteome.fasta > filtered.fasta, but it only removes the header lines, not the whole records (header and sequence) of the hypothetical proteins.

Please help me with an awk, Perl or Python script.

FASTA script grep • 2.1k views

Please use appropriate tags. "software error" is not an appropriate tag here.


biodano.geo - See how the tags are more relevant now. Please invest more effort into future posts.

5.8 years ago

This should work (tested in Bash on Linux):

cat test.fasta 
> GENE
TTTT
CCCC
ATGC
> hypothetical A
ATCG
ATCG
ATTT
AAAA
> LOC5,HypothetICAL
ATCG
ATCG
ATTT
AAAA
> TP53
ATCG
ATCG
ATTT
AAAA
> hypothetical, LOC12354
ATCG
ATCG
ATTT
AAAA
> BRCA1
ATCG
ATCG
ATTT
AAAA



awk '/^>/ && toupper($0) ~ /HYPOTHETICAL/ {bool=1}; /^>/ && toupper($0) !~ /HYPOTHETICAL/ {bool=0}; {if (bool==0) print}' test.fasta 
> GENE
TTTT
CCCC
ATGC
> TP53
ATCG
ATCG
ATTT
AAAA
> BRCA1
ATCG
ATCG
ATTT
AAAA

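This uses a boolean flag that is set to 1 when a header containing 'hypothetical' is found, and reset to 0 when a header without it is found. Lines are printed only while the flag is 0.

The match is case-insensitive because the header is passed through the toupper function before the comparison, so 'hypothetical' can be written in any mix of upper and lower case.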
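
The same flag idea can probably be written more compactly with a single header rule. The following is only a sketch, not a tested replacement for the command above; the function name drop_hypothetical is made up for illustration, and it uses tolower instead of toupper, which should behave the same for this purpose.

```shell
# Hypothetical helper name (an assumption, not part of the original answer).
# Usage: drop_hypothetical proteome.fasta > filtered.fasta
drop_hypothetical() {
    awk '
        # On every header line, recompute the flag: 1 if the header
        # mentions "hypothetical" in any letter case, 0 otherwise.
        /^>/ { skip = (tolower($0) ~ /hypothetical/) }
        # Pattern with no action: print the line while the flag is 0.
        !skip
    ' "$1"
}
```

Because the flag is recomputed on each header, there is no need for a separate rule to reset it.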

Kevin


Thank you so much. Could you please explain how this awk script works?

/^>/ && toupper($0) ~ /HYPOTHETICAL/ {bool=1};

If the line starts with > and its upper-cased text contains the word HYPOTHETICAL, set the value of bool to 1.

/^>/ && toupper($0) !~ /HYPOTHETICAL/ {bool=0};

If the line starts with > and its upper-cased text doesn't contain the word HYPOTHETICAL, set the value of bool to 0.

{if (bool==0) print}

If bool is 0, print the line. This is the case for a header line that doesn't contain HYPOTHETICAL and for all lines that follow it, until the next line starting with > is reached and the check for the word HYPOTHETICAL is done again.
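
For readability, the three rules of the one-liner can also be laid out over several lines with a comment on each. This is just an annotated sketch of the same logic; the wrapper name filter_fasta_annotated is made up for illustration.

```shell
# Hypothetical wrapper name (an assumption); the awk rules are the same
# three rules as in the one-liner, only spread out and commented.
# Usage: filter_fasta_annotated test.fasta
filter_fasta_annotated() {
    awk '
        # Rule 1: header line that mentions HYPOTHETICAL -> start skipping.
        /^>/ && toupper($0) ~ /HYPOTHETICAL/  { bool = 1 }

        # Rule 2: header line that does not mention it -> stop skipping.
        /^>/ && toupper($0) !~ /HYPOTHETICAL/ { bool = 0 }

        # Rule 3: runs for every line; prints only while bool is 0.
        # bool is unset before the first header, which awk treats as 0.
        { if (bool == 0) print }
    ' "$1"
}
```

All three rules are evaluated for every input line, in order, so a header first updates bool and is then printed or skipped by the last rule.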

fin swimmer


Thank you so much, I understand now.


Hello biodano.geo,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.


Thanks finswimmer! :)
