How can I get sequencing data from NCBI with UniProt taxonomy identifiers? Automating with an API
9.5 years ago
wewolf • 0

Hello,

I am interested in downloading complete genomes to build a phylogenetic tree. NCBI provides a toolkit for this called the Entrez Programming Utilities (E-utilities, or eutils for short). I found an excellent resource that walks through everything I need to know, complete with a Python script that automates downloading these genomes from NCBI:

http://angus.readthedocs.org/en/2014/howe-ncbi.html#comment-1660809538

I have an "interesting-genomes.txt" file listing the organisms I'd like to find complete genomes for. However, the IDs in this list are taxonomy identifiers from UniProt (e.g. http://www.uniprot.org/taxonomy/1000588).

For example, Streptococcus mitis bv. 2 str. SK95 has the taxonomy identifier 1000588 in UniProt. In NCBI, its ID is NC_013853.

My file contains a long list of taxonomy identifiers like 1000588, not NCBI IDs like NC_013853. Any ideas on how I can get around this?

Thank you!

sequencing genome biopython NCBI enterez • 4.0k views

The NCBI ID you provided looks like a contig (sequence) accession number, not a taxID. The taxID for your organism of interest is Streptococcus mitis bv. 2 str. SK95 (taxid:1000588), which is the same as in the UniProt database.

9.5 years ago
onuralp ▴ 190

This is tricky because there are usually many assemblies or genomes available for a given taxon. When you try to map a taxon ID back to genomes using, say, Batch Entrez, you will end up retrieving a huge number of sequences associated with that taxon ID.
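To check how many records a single taxon ID maps to before downloading anything, something like the following sketch may help. It builds an E-utilities `esearch` query restricted to one taxon. The choice of the `assembly` database, the `retmax` value, and the helper names `build_esearch_url` and `count_records` are assumptions made for illustration, not part of the tutorial linked in the question.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Base URL of the E-utilities esearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(taxid, db="assembly", retmax=20):
    """Build an esearch URL for all records under one taxonomy ID.

    The db and retmax defaults are assumptions; adjust them as needed.
    """
    # txid<ID>[Organism:exp] searches the taxon and its descendants.
    term = f"txid{taxid}[Organism:exp]"
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def count_records(taxid, db="assembly"):
    """Query NCBI (network access required) and return the hit count."""
    with urllib.request.urlopen(build_esearch_url(taxid, db)) as resp:
        data = json.load(resp)
    return int(data["esearchresult"]["count"])


if __name__ == "__main__":
    # Prints the query URL only; call count_records(1000588) to hit NCBI.
    print(build_esearch_url(1000588))
```

If the count comes back large, that is the many-to-one situation described above, and filtering down to a representative genome is probably the easier route.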

A possible way around this is to stick to representative genomes / assemblies, which gives you a one-to-one correspondence between taxon ID and genome. In principle, this should work for almost all cases in your list, excluding those that were sequenced very recently or that have some strain-specific complication.

Download the following file, which lists species names and RefSeq complete genome IDs: ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prok_representative_genomes.txt

Then you can write a simple script to parse this file and extract the corresponding accession number (e.g., NC_013853) for a given species (e.g., Streptococcus mitis).
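A minimal sketch of such a parsing script is below. It assumes the report is tab-delimited with a header row (possibly prefixed by `#`); the column names `TaxID` and `Chromosome RefSeq` are assumptions and should be checked against the header of the file you actually download. The function name `load_taxid_to_accession` is hypothetical, and the sample rows are made up from the values in the question, not copied from the real file.

```python
import csv


def load_taxid_to_accession(lines, taxid_col="TaxID",
                            accession_col="Chromosome RefSeq"):
    """Map each taxonomy ID to a list of accessions.

    `lines` is any iterable of text lines (an open file works).
    The column names are assumptions; check the real header.
    """
    reader = csv.reader(lines, delimiter="\t")
    # Strip a leading '#' and surrounding spaces from header fields.
    header = [h.lstrip("#").strip() for h in next(reader)]
    tax_i = header.index(taxid_col)
    acc_i = header.index(accession_col)

    mapping = {}
    for row in reader:
        if len(row) <= max(tax_i, acc_i):
            continue  # skip short or blank lines
        taxid = row[tax_i].strip()
        # A cell may hold several comma-separated accessions.
        for acc in row[acc_i].split(","):
            acc = acc.strip()
            if taxid and acc and acc != "-":
                mapping.setdefault(taxid, []).append(acc)
    return mapping


if __name__ == "__main__":
    # Illustrative rows only, not actual file contents.
    sample = [
        "#Species\tTaxID\tChromosome RefSeq\n",
        "Streptococcus mitis\t1000588\tNC_013853\n",
    ]
    table = load_taxid_to_accession(sample)
    for taxid in ["1000588", "999999999"]:
        print(taxid, table.get(taxid, "no representative genome found"))
```

With the real file, you would pass `open("prok_representative_genomes.txt")` as `lines` and loop over the IDs in "interesting-genomes.txt". Looking up by taxon ID rather than by species name avoids problems with spelling and strain suffixes, though strain-level IDs may not appear in a species-level table.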


I think the new genomes FTP site has a representative directory for every species, for example: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Streptococcus_mitis/representative/
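If that directory layout holds, the path for a given species could be built along these lines. This is a sketch that generalizes from the single example path above; the helper name `representative_dir_url` and the assumption that every species follows the same underscore-separated pattern are mine and have not been checked against the FTP site.

```python
def representative_dir_url(species, group="bacteria"):
    """Guess the RefSeq 'representative' directory for a species.

    Assumes the directory name is the species name with spaces
    replaced by underscores, as in the Streptococcus_mitis example.
    """
    name = "_".join(species.split())
    return (
        "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/"
        f"{group}/{name}/representative/"
    )


if __name__ == "__main__":
    print(representative_dir_url("Streptococcus mitis"))
```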
