Question

How to transform locus_tags to protein sequences (bacteria)

7.2 years ago

natasha.sernova ★ 4.0k

Dear ALL, Is there a direct way to transform several bacterial protein locus_tags to the protein sequences? I’ve found a few posts about genes, locus_tags and biopython.

NCBI : Obtain "gene symbol" and fasta sequence from locus_tag list

Parsing Genbank File: Get Locus Tag Vs Product

Get locus_tag list from gene list using genbank file and Biopython, but I need protein sequences.

esearch seems to work for genes and proteins:

NCBI : Obtain "gene symbol" and fasta sequence from locus_tag list

esearch -db gene -query "CGI_10028029"|elink -target nuccore|efetch -format fasta

The equivalent protein query needs a protein name, but I know only the locus_tags, not the names:

esearch -db protein -query "lycopene cyclase" | efetch -format fasta

I can choose a particular bacterium, following my answer in this post

where can I get environmental bacteria genome in fasta format (as many as possible)?

from

ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt,

and scan its GPFF file, but that is a long way round since I have 30 bacteria.
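The assembly_summary route could be partly automated. The following is a minimal sketch, assuming the summary file has `organism_name` and `ftp_path` columns and that each assembly directory holds a file named after the directory plus `_protein.gpff.gz`; the function name `gpff_urls` is hypothetical.

```python
def gpff_urls(summary_text, organism_substring):
    """Return (organism_name, GPFF URL) pairs for matching assemblies.

    summary_text is the content of assembly_summary_genbank.txt.
    Assumption: the last '#' line before the data rows is the column header.
    """
    header = None
    rows = []
    for line in summary_text.splitlines():
        if line.startswith("#"):
            # Later comment lines overwrite earlier ones, so the real
            # header (the last comment line) wins.
            header = line.lstrip("# ").split("\t")
        elif line.strip():
            rows.append(line.split("\t"))
    if header is None:
        raise ValueError("no header line found")

    name_col = header.index("organism_name")
    path_col = header.index("ftp_path")
    needle = organism_substring.lower()

    hits = []
    for row in rows:
        name, path = row[name_col], row[path_col].rstrip("/")
        if needle in name.lower() and path not in ("", "na"):
            # Files inside an assembly directory are prefixed with its name.
            base = path.rsplit("/", 1)[-1]
            hits.append((name, f"{path}/{base}_protein.gpff.gz"))
    return hits
```

Calling `gpff_urls(text, "Salmonella enterica")` on the downloaded summary would list candidate GPFF files, which still have to be downloaded and scanned for the locus_tags of interest.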

Is there a better solution to transform locus_tags to the corresponding protein sequences?

Many thanks!

Natasha

bacteria locus eutils • 3.3k views

ADD COMMENT • link updated 7.2 years ago by pure_reason ▴ 10 • written 7.2 years ago by natasha.sernova ★ 4.0k

Do you need to do this for every locus in entire bacterial genomes? The post you linked (NCBI : Obtain "gene symbol" and fasta sequence from locus_tag list) already does what you seem to be asking for a single locus. If so, why not get the .faa files for the genomes?

ADD REPLY • link 7.2 years ago by GenoMax 142k

Thanks for your reply! I tried that and started with the *.faa files, hoping they would be a solution.

It turned out there are no locus_tags in their headers. I looked at a bacterium from 2016.

I looked at this post. NCBI : Obtain "gene symbol" and fasta sequence from locus_tag list.

I don't have a gene symbol, just a locus_tag. Right now I do it manually: I submit a locus_tag at ncbi.nlm.nih.gov and eventually get a protein sequence, but I would prefer an automatic way.

ADD REPLY • link 7.2 years ago by natasha.sernova ★ 4.0k

score 1 · Answer 1 · 2017-03-09

7.2 years ago

pure_reason ▴ 10

Hmm, usually, .faa files do have locus_tags in their headers.

Do you have genbank files for these bacteria? If you do, there are two ways to go:

1) Make a dictionary [locus_tag:protein_id] from the genbank file, select the protein_ids for your locus_tag list, and then extract those proteins from the .faa file.

2) Or extract your sequences directly from the genbank file, if it includes the sequence itself. In Biopython, you can go through the features, select the necessary CDSs by locus_tag, extract the sequences by coordinates using the 'location' and 'strand' attributes, and translate them.
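As a rough illustration of the two routes above, the sketch below pulls the locus_tag, protein_id and /translation qualifier out of CDS features in GenBank flat-file text using only the Python standard library. It assumes the file carries /translation qualifiers and that qualifier values contain no embedded double quotes; the function name `proteins_by_locus_tag` is hypothetical. For real data, a full parser such as Biopython's `SeqIO.parse(handle, "genbank")` is likely the more robust choice.

```python
import re

# A feature starts with exactly five spaces followed by the feature key;
# qualifier lines are indented further, so they do not match.
FEATURE_START = re.compile(r"^ {5}(?=\S)", re.MULTILINE)


def _qualifier(block, name):
    """Return the quoted value of /name="..." in a feature block, or None."""
    match = re.search(r'/%s="([^"]*)"' % re.escape(name), block)
    return match.group(1) if match else None


def proteins_by_locus_tag(genbank_text, wanted):
    """Map each wanted locus_tag to (protein_id, protein sequence).

    CDS features without a /translation (e.g. pseudogenes) are skipped.
    """
    wanted = set(wanted)
    found = {}
    for block in FEATURE_START.split(genbank_text):
        if not block.startswith("CDS "):
            continue
        tag = _qualifier(block, "locus_tag")
        translation = _qualifier(block, "translation")
        if tag in wanted and translation:
            # Translations wrap over several lines; drop all whitespace.
            sequence = "".join(translation.split())
            found[tag] = (_qualifier(block, "protein_id"), sequence)
    return found
```

The returned protein_id can be used to pick records out of a .faa file (route 1), while the sequence itself covers route 2 without any translation step.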

ADD COMMENT • link 7.2 years ago by pure_reason ▴ 10

My bacterial *.faa files do not have locus_tags in their headers. They may be an exception, but they don't.

I have a couple of ideas.

One is to search GPFF files. They are the protein (GenPept) counterpart of *.gbk files and have both the locus_tag and the protein sequence close together. But I only have a list of locus_tags and no idea about the hosts, so that's not a solution.

What I do right now: I take my locus_tag, go to www.ncbi.nlm.nih.gov, paste the locus_tag into the empty search box at the top and press "OK". Then I manually select "FASTA" from the menu at the top of the resulting page, and that is my protein. But I don't know how to do this automatically for the whole list. It's possible, I did it in Biopython once, but have since forgotten how.

ADD REPLY • link 7.2 years ago by natasha.sernova ★ 4.0k

Can you post a selection of locus_tags you are interested in? There must be a better way but we need to know what exactly you have.

ADD REPLY • link 7.2 years ago by GenoMax 142k

Thank you for your question! It turned out that old locus_tags are not recognized on the NCBI site.

This is a modern example:

SEEE7246_00015

SEEE7246_00040

SEEE7246_00050

SEEE7246_00075

SEEE7246_00080

All of them have corresponding protein sequences.

But how can I find them automatically? Thank you!

ADD REPLY • link 7.2 years ago by natasha.sernova ★ 4.0k

These are working as expected using your example above.

$ esearch -db protein -query "SEEE7246_00015" | efetch -format fasta
>EJI43197.1 hypothetical protein SEEE7246_00015 [Salmonella enterica subsp. enterica serovar Enteritidis str. 639672-46]
MFAFTAKIQLQQTINTMDPFVIPRVSLPPQNLEKFWKTVPWVSLCHCLQRDNNGFVTRVIRTVMVNRSAQ
VQTPAGLTDTESKCRYQMSDQLTLKGWF
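For a whole list of locus_tags, the same esearch and efetch steps could be scripted against the NCBI E-utilities HTTP endpoints. The following is a minimal sketch, assuming each locus_tag matches records in the protein database as in the example above; the function names are made up for illustration, and the one-second pause is a conservative guess meant to stay under NCBI's request limits.

```python
import json
import time
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(locus_tag, db="protein"):
    """URL that searches a database for a locus_tag and returns JSON."""
    query = urllib.parse.urlencode(
        {"db": db, "term": locus_tag, "retmode": "json"}
    )
    return f"{EUTILS}/esearch.fcgi?{query}"


def efetch_url(uids, db="protein"):
    """URL that fetches the given record UIDs as FASTA text."""
    query = urllib.parse.urlencode(
        {"db": db, "id": ",".join(uids), "rettype": "fasta", "retmode": "text"}
    )
    return f"{EUTILS}/efetch.fcgi?{query}"


def fetch_fasta(locus_tag):
    """Return FASTA text for a locus_tag, or '' if the search finds nothing."""
    with urllib.request.urlopen(esearch_url(locus_tag)) as response:
        uids = json.load(response)["esearchresult"]["idlist"]
    if not uids:
        return ""
    with urllib.request.urlopen(efetch_url(uids)) as response:
        return response.read().decode()


def fetch_all(locus_tags, delay=1.0):
    """Fetch FASTA text for each locus_tag, pausing between tags."""
    records = {}
    for tag in locus_tags:
        records[tag] = fetch_fasta(tag)
        time.sleep(delay)  # be polite to the NCBI servers
    return records
```

Something like `fetch_all(["SEEE7246_00015", "SEEE7246_00040"])` would then return a dictionary mapping each tag to its FASTA text, with an empty string for tags the search does not recognize.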

So you want to convert the old tags to these new ones? Can you post the old examples that correspond to these new ones?

ADD REPLY • link 7.2 years ago by GenoMax 142k

No, that's perfect! A thousand thanks!

I meant that the NCBI site does not recognize locus_tags from old gbk files.

I don't need the old versions, though I hope they are still reachable from older NCBI versions.

ADD REPLY • link 7.2 years ago by natasha.sernova ★ 4.0k