Dear all, is there a direct way to convert a list of bacterial protein locus_tags into the corresponding protein sequences? I’ve found a few posts about genes, locus_tags and Biopython:
NCBI : Obtain "gene symbol" and fasta sequence from locus_tag list
Parsing Genbank File: Get Locus Tag Vs Product
Get locus_tag list from gene list using genbank file and Biopython
However, I need protein sequences.
esearch seems to work for genes and proteins:
NCBI : Obtain "gene symbol" and fasta sequence from locus_tag list
esearch -db gene -query "CGI_10028029" | elink -target nuccore | efetch -format fasta
But I don’t know my protein names, only their locus_tags.
esearch -db protein -query "lycopene cyclase" | efetch -format fasta
I could pick each bacterium individually, following my answer in this post:
where can I get environmental bacteria genome in fasta format (as many as possible)?
That means looking it up in
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt
and then scanning its GPFF file, but that is a long way round since I have 30 bacteria.
Is there a better solution to transform locus_tags to the corresponding protein sequences?
Many thanks!
Natasha
Are you asking how to do this for every gene_locus in entire bacterial genomes? The post you linked (NCBI : Obtain "gene symbol" and fasta sequence from locus_tag list) already does what you seem to be asking for, for one locus. If so, why not get the .faa files for the genome?
Thanks for your reply! I tried that and started with the *.faa files, hoping they would be a solution.
It turned out there are no locus_tags in their headers (I looked at a bacterium from 2016).
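One possible workaround for the missing locus_tags in the .faa headers is to join the FASTA file with a separate locus_tag-to-accession table. The sketch below assumes the assembly directory also provides a tab-separated feature table whose header names `locus_tag` and `product_accession` columns; that file layout is an assumption to check for each assembly, and the function names are hypothetical.

```python
import csv
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {accession: sequence}; the accession is the first word of the header."""
    parts: Dict[str, List[str]] = {}
    acc = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            acc = line[1:].split()[0]
            parts[acc] = []
        elif acc is not None:
            parts[acc].append(line)
    return {a: "".join(p) for a, p in parts.items()}


def tag_to_accession(table_lines: Iterable[str]) -> Dict[str, str]:
    """Map locus_tag -> product_accession, locating the columns by header name."""
    reader = csv.reader(table_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    header[0] = header[0].lstrip("# ")  # the header line is assumed to start with "# "
    tag_col = header.index("locus_tag")
    acc_col = header.index("product_accession")
    mapping: Dict[str, str] = {}
    for row in reader:
        if len(row) <= max(tag_col, acc_col):
            continue
        # Rows without a protein accession (e.g. gene rows) are skipped.
        if row[tag_col] and row[acc_col]:
            mapping[row[tag_col]] = row[acc_col]
    return mapping


def proteins_by_locus_tag(table_lines: Iterable[str],
                          faa_lines: Iterable[str],
                          wanted: Iterable[str]) -> Dict[str, str]:
    """Return {locus_tag: protein_sequence} for the wanted tags present in both files."""
    acc_for = tag_to_accession(table_lines)
    seqs = read_fasta(faa_lines)
    return {t: seqs[acc_for[t]] for t in wanted if acc_for.get(t) in seqs}
```

Looking the columns up by name rather than by position keeps the sketch independent of the exact column order, which may differ between releases.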
I also looked at this post: NCBI : Obtain "gene symbol" and fasta sequence from locus_tag list.
I don't have a gene symbol, just the locus_tag. Right now I do it manually: I submit a locus_tag at ncbi.nlm.nih.gov
and eventually get the protein sequence, but I would prefer an automated way.
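The manual lookup could in principle be automated through the NCBI E-utilities web API, which is what the esearch/efetch commands wrap. The following is a rough sketch, assuming that a bare locus_tag used as the search term in the `protein` database returns the wanted record; that should be verified on a few known tags first. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import List

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(locus_tag: str, db: str = "protein") -> str:
    """Build the ESearch URL that looks up record IDs for one locus_tag."""
    params = {"db": db, "term": locus_tag, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urllib.parse.urlencode(params)}"


def efetch_fasta_url(ids: List[str], db: str = "protein") -> str:
    """Build the EFetch URL that returns the given records as FASTA text."""
    params = {"db": db, "id": ",".join(ids), "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urllib.parse.urlencode(params)}"


def fetch_protein_fasta(locus_tag: str) -> str:
    """Search for a locus_tag and return the matching FASTA text ("" if no hit).

    Needs network access; the assumption that the term matches only the
    wanted protein is not checked here.
    """
    with urllib.request.urlopen(esearch_url(locus_tag)) as resp:
        ids = json.load(resp)["esearchresult"]["idlist"]
    if not ids:
        return ""
    with urllib.request.urlopen(efetch_fasta_url(ids)) as resp:
        return resp.read().decode()
```

When looping over many locus_tags, a short pause between calls (for example `time.sleep(0.4)`) helps stay within NCBI's request limits.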