How can I pull out specific protein fastas from one file using information from the protein header?
1
0
Entering edit mode
2.9 years ago
ges29 ▴ 50

I have a file containing all the protein fastas from my genome of interest (downloaded from NCBI):

example of my .faa file

I've been conducting some analysis and have a list of genes I'm interested in; however, I don't have the accession numbers, I have a list of partial gene IDs assigned by the people who did the annotation:

list of gene IDs

Is there a way I can search the protein fastas file for a list of proteins using information from the protein headers?

I've tried grep -f genes_of_interest.txt protein_fastas_file.faa > output.fa

in an attempt to at least pull out the headers so I can get the accession numbers but all it did was return every single protein fasta (ie. exactly the same file as protein_fastas_file.faa). I'm assuming this is because the actual sequence part doesn't actually exist on a new line or something?

Thanks in advance for any help anyone can give!

protein grep accession fasta • 1.8k views
ADD COMMENT
0
Entering edit mode

Input files are taken from GenoMax post

input:

$ cat test.fa                            

>QBMSHHS Hypotherical protein METSCH_A00100 someting
MPPVGGKKAKKGILERLNAGEIVIGDGGFVFALEKRGYVKAGPWTPEAAVEHPEAVRQLHREFLRAGSNV
MQTFTFYASEDKLENRGNYVLEKISGQEVNEAACDIARQVADEGDALVAGGVSQTPSYLSCKSETEVKKV
>QBMSMMS Hypotherical protein METSCH_A00101 someting
FLQQLEVFMKKNVDFLIAEYFEHVEEAVWAVETLIASGKPVAATMCIGPEGDLHGVPPGECAVRLVKAGA
SIIGVNCHFDPTISLKTVKLMKEGLEAARLKAHLMSQPLAYHTPDCNKQGFIDLPEFPFGLEPRVATRWD
>QQDSHHS Hypotherical protein METSCH_A20100 someting
IQKYAREAYNLGVRYIGGCCGFEPYHIRAIAEELAPERGFLPPASEKHGSWGSGLDMHTKPWVRARARKE
YWENLRIASGRPYNPSMSKPDGWGVTKGTAELMQQKEATTEQQLKELFEKQKFKSQ

$ cat ids.txt

A00100

ouput:

$ seqkit -w 0 grep -nr -f ids.txt test.fa

>QBMSHHS Hypotherical protein METSCH_A00100 someting
MPPVGGKKAKKGILERLNAGEIVIGDGGFVFALEKRGYVKAGPWTPEAAVEHPEAVRQLHREFLRAGSNVMQTFTFYASEDKLENRGNYVLEKISGQEVNEAACDIARQVADEGDALVAGGVSQTPSYLSCKSETEVKKV

Download seqkit from here: https://bioinf.shenwei.me/seqkit/download/

ADD REPLY
2
Entering edit mode
2.9 years ago
GenoMax 141k
$ more prot.fa 
>QBMSHHS Hypotherical protein METSCH_A00100 someting
MPPVGGKKAKKGILERLNAGEIVIGDGGFVFALEKRGYVKAGPWTPEAAVEHPEAVRQLHREFLRAGSNV
MQTFTFYASEDKLENRGNYVLEKISGQEVNEAACDIARQVADEGDALVAGGVSQTPSYLSCKSETEVKKV
>QBMSMMS Hypotherical protein METSCH_A00101 someting
FLQQLEVFMKKNVDFLIAEYFEHVEEAVWAVETLIASGKPVAATMCIGPEGDLHGVPPGECAVRLVKAGA
SIIGVNCHFDPTISLKTVKLMKEGLEAARLKAHLMSQPLAYHTPDCNKQGFIDLPEFPFGLEPRVATRWD
>QQDSHHS Hypotherical protein METSCH_A20100 someting
IQKYAREAYNLGVRYIGGCCGFEPYHIRAIAEELAPERGFLPPASEKHGSWGSGLDMHTKPWVRARARKE
YWENLRIASGRPYNPSMSKPDGWGVTKGTAELMQQKEATTEQQLKELFEKQKFKSQ

$ more id
A00100
A20100

# linearize fasta (code from @Pierre) 

$ awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' prot.fa | grep -f id | tr "\t" "\n" > new.fa

$ more new.fa 
>QBMSHHS Hypotherical protein METSCH_A00100 someting
MPPVGGKKAKKGILERLNAGEIVIGDGGFVFALEKRGYVKAGPWTPEAAVEHPEAVRQLHREFLRAGSNVMQTFTFYASEDKLENRGNYVLEKISGQEVNEAACDIARQVADEGDALVAGGVSQTPSYLSCKSETEVKKV
>QQDSHHS Hypotherical protein METSCH_A20100 someting
IQKYAREAYNLGVRYIGGCCGFEPYHIRAIAEELAPERGFLPPASEKHGSWGSGLDMHTKPWVRARARKEYWENLRIASGRPYNPSMSKPDGWGVTKGTAELMQQKEATTEQQLKELFEKQKFKSQ

Add fold -w 70 if you want the sequence lines folded every 70 aa.

$ awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' prot.fa | grep -f id | tr "\t" "\n"| fold -w 70 > new.fa
ADD COMMENT
0
Entering edit mode

Hey Genomax, thanks for your answer.

I tried awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' prot_one.fa | grep -f id | tr "\t" "\n" > new.fa

but this has resulted in the same issue as before, the output file (new.fa) has all genes in it.

I checked this with grep -c "^>" new.fa - which returns 5800 genes.

but my list of genes I'm interested in only has 1550

ADD REPLY
0
Entering edit mode

I showed a working example modeled after what your screen shot had (BTW you should copy paste data rather than screen-shots). It should work. Not sure what is going on.

ADD REPLY
0
Entering edit mode

Apologies, I did try to do that with my original question but whenever I copied in the data the ">" symbol for the next gene causes half of it to turn into a quote rather than code. As below:

 >QBM85393.1 hypothetical protein METSCH_A00100 [Metschnikowia aff. pulcherrima]
MIQAHESLQFHQGFTQQMVKYLHKIVVEDFENFPGFHVLPKMHKSPPSSRPIVPTHSWGTMRISKVLSFMLRPQLEHLPW
VILDISIVRDRLREFQMHKDKDYYIMTGDVTSLYTNIPLAAGYQAIREMTVIALRALKPTDLNEVDQRVYEREAALMADT
LGTLTKFVMENNYFTSGKRLFRQIQGTAMGTNMAPEFANIYMGVAEKQKLTEVVNMGSCLYLRYIDDVLIIGTKAELDRV
AEQEHFLDLPLEVNWEKPAKSLPFLDLQLSISEHSIVYELYRKPINTYQYLEWDSDHPRATKKGFVIGEIVRIFRNCSER
ATALKHVNFFHQKIRARGYPPRLTNTWIKAAIDSIQERELGHRQAPPAENSALKIFRLKVKYSPSWEIFDAKVLRKRIEQ
TLIREGILTQGEYPRYSELHVQVSKGRNRNILDFAGSNQKQVLHTHFDRADSIYFPTLSSNPDLEAQFRYHVFWQREDLD
LGIRRHAKRARSGRNRRLRPRKRHVRRTVVRRRKRARGNTDPKAPKRLRYQPTLDSLRGVHTS
>QBM85394.1 hypothetical protein METSCH_A00110 [Metschnikowia aff. pulcherrima]
MLEVSLKLWVSVLLNLILAAVTFYHSNQRADIQNYVFGVHSPDQSLEIQYENATENYLQFPIDLANATAVFNSVYGAGRE
KDSNLNPVGVNFFPAYIPPNTLMYHSREEPNIPELFEWLAMDYEFSYSFARFTRGKSRGGPHKPPKDPGDPGQPGIPGGR
SLRKDPPEKNKGLSLDEKTQKRDHLGAGPSYLFTFRNKTPLNRLILLDGASAAKSSTGEMDQQMILSQQKDVEGYVDEEI
AANNICNWGKLFGLQGVIRLEVGFEIILCDFHKDIELVSNVTLSNVTKMVGFPYEPSEPTTDLEKSRTKLIDRWQAMSEF
EWMQAGGRVNNGDGRILIDYANMVTPLNKTWVNSDPYQRRINKLSEPLKSQIITQLENTILKGVDPFYKTDWPLVIGAIV
EKFSPMLVGVNATFTVFEHEANINLSSALKNLSDNLSSSTFNFVRRYTDGNAHEAIQGTCARRDAVEDYVHNTFAISTTL
ERLIYSSIYRVTSEIVSMIFDLFDLCEKIFADRYVTPSDAHDIKLREIVLAKSADIKALIDTLKWSSFIRCTQVCDWNEV
CAVPGWGPGPMGWSFGGRKSLHFENGSYRINNELECINVDQLRLR
ADD REPLY
0
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY
0
Entering edit mode

As I was saying above, I tried that but when I pasted it in, the formatting messed up. Apologies, I'm relatively new to the site and don't know how to fix the issue.

ADD REPLY
0
Entering edit mode

This sequence works for me in same was as my fake example. Are you using one ID per line in file where you are specifying ID's you want to keep? What OS are you working on?

ADD REPLY
0
Entering edit mode

Given that your worked example is working fine, I suspect there's something wrong with the files I've got.

A00120
A00140
A00170
A00280

This is directly copied from my file. As above, I don't understand the formatting for this forum but should I assume that the values aren't separated by a new line but rather a tab?

EDIT: Sorry, forgot to also say I'm on MacOS Catalina 10.15.7

EDIT2: Ok, nope, the lines are definitely separated by a newline

ADD REPLY
1
Entering edit mode

Ok I got it working. I had two trailing newlines at the end of the id file which meant grep -f was returning everything rather than lines that matched my search terms.

Thanks GenoMax!

ADD REPLY
0
Entering edit mode

Just tested this on macOS big sur (v.11.3.1). Code works for me.

ADD REPLY

Login before adding your answer.

Traffic: 1874 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6