Automatically Getting The Ncbi Taxonomy Id From The Genbank Identifier
5
15
Entering edit mode
12.7 years ago
Alf ▴ 490

The question is whether, given a (long) list of Genbank identifiers, is possible to get the ncbi taxonomy identifier for each one. I know it may seem very easy, but I have not found any web service which makes this, and I wouldn't like to do this manually.

genbank conversion ncbi taxonomy • 34k views
ADD COMMENT
24
Entering edit mode
12.7 years ago

NCBI efetch can use an accession number instead of a gi. and the XML/Fasta returned by efetch contains the taxonomy-ID:

$ curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=HQ844023&rettype=fasta&retmode=xml" 

 http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
 <TSeqSet>
<TSeq>
  <TSeq_seqtype value="nucleotide"/>
  <TSeq_gi>341832806</TSeq_gi>
  <TSeq_accver>HQ844023.1</TSeq_accver>
  <TSeq_taxid>1056490</TSeq_taxid>
  <TSeq_orgname>Rotavirus A HC91xUK reassortant (UKg9KC-1)</TSeq_orgname>
  <TSeq_defline>Rotavirus A HC91xUK reassortant (UKg9KC-1) NSP3 protein gene, complete cds</TSeq_defline>
  <TSeq_length>942</TSeq_length>
  <TSeq_sequence>(...)</TSeq_sequence>
</TSeq>

</TSeqSet>

So you can use efetch and egrep to get the taxonomic-id for each of your ACNs.

for ACC in A00002 X53307 BB145968 CAA42669 V00181  AH002406  HQ844023
do
   echo -n -e "$ACC\t"
   curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ACC}&rettype=fasta&retmode=xml" |\
   grep TSeq_taxid |\
   cut -d '>' -f 2 |\
   cut -d '<' -f 1 |\
   tr -d "\n"
   echo
 done


A00002    9913
X53307    1423
BB145968    10090
CAA42669    9913
V00181    7154
AH002406    538120
HQ844023    1056490
ADD COMMENT
3
Entering edit mode

Man, the script kick asses. I only was looking for some orientation, but the script is _exactly_ what I want. Thanks a lot!

ADD REPLY
2
Entering edit mode

I modified your script slightly to retrieve these data for some accession IDs that I failed to recover when running blastdbcmd -entry_batch against my local blast database install, even though I could get information on them when using Entrez (my local data was the latest release from the FTP). I ran the following, where failed_accessions.txt was a list of 662 accession IDs, one per line:

cat failed_accessions.txt | xargs ./get_failed_acc.sh > failed_acctax.txt

The get_failed_acc.sh script was a modification of yours as follows:

for ACC in $@
do
    echo -n "$ACC\t"
    curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ACC}&rettype=fasta&retmode=xml" |\
    grep TSeq_taxid |\
    cut -d '>' -f 2 |\
    cut -d '<' -f 1 |\
    tr -d "\n"
echo
sleep 0.25
done

I added the sleep in so I didn't hammer the server too much!

ADD REPLY
2
Entering edit mode

NCBI is now using https:// instead of http://

Make sure you include the "s" in your link!

ADD REPLY
1
Entering edit mode

Hi thanks for the script! I am new to linux and I got a question. I wanna use my own accession numbers to replace "A00002 X53307 BB145968 CAA42669 V00181 AH002406 HQ844023". But my accession numbers are in a .CSV file. There are several hundreds of them. How can I copy them to the for loop? Or is it possible to read the .csv file directly? Thanks a lot!

ADD REPLY
2
Entering edit mode

Just do tr ',' ' ' < filename.csv | xargs ./get_failed_acc.sh. The tr command will replace all comma's by spaces an convert them to arguments.

ADD REPLY
1
Entering edit mode

This may work only if the url goes https not http, as ncbi has turned https only in 2017.

ADD REPLY
0
Entering edit mode

I've fixed this, thanks.

ADD REPLY
0
Entering edit mode

How to use this script for multiple file as input, I want to extract the locus_tag by using start and stop genomic position

For example ACC=accession.txt START=start.txt STOP=stop.txt

curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ACC}&seq_start=$(START)&$(STOP)&rettype=gb" |\ grep locus_tag |\

I could only manage with - for ACC in $(cat accession.txt), and could not make a nested for loop to take other variables from input files.

If the loop works, i am getting duplicate/triplicates hits like, without accession number. So, I could not use the retrieved data :( curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=AE006468.2&seq_start=8718&seq_stop=9319&rettype=gb" | grep "/locus_tag "

Output is
                 /locus_tag="STM0008"
                 /locus_tag="STM0008"
                 /locus_tag="STM0008"

What to do ?

ADD REPLY
7
Entering edit mode
12.7 years ago
Yogesh Pandit ▴ 520

The BioIDMapper package in R might be helpful

 library(BioIDMapper)
 data(glist)

 # 1=GI number, 31=NCBI Taxid
 myMap <- bio.convert(glist, 1, 31)

Quick Guide for the BioIDMapper package

ADD COMMENT
1
Entering edit mode

Hi!!

I didn't know about this package... !

Can you explain me please, how should I load my ID_list file? I have been tried with the function data("myfile") and read.table("myfile") but does not work...

Thank you so much :)

ADD REPLY
5
Entering edit mode
12.7 years ago
Neilfws 49k

There's a discussion of this problem on the Bioperl mailing list.

The best solution is to use ELink from EUtils. If you have a GenBank GI number such as 341926284, you construct a query linking nucleotide and taxonomy like this:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nucleotide&db=taxonomy&id=341926284

Result: taxid 9606 (Homo sapiens).

 
http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_101123.dtd"> 
    <eLinkResult>

        <LinkSet> 
            <DbFrom>nuccore</DbFrom> 
            <IdList> 
                <Id>341926284</Id> 
        </IdList> 
        <LinkSetDb> 
            <DbTo>taxonomy</DbTo> 
            <LinkName>nuccore_taxonomy</LinkName> 
            <Link> 
                <Id>9606</Id> 
            </Link> 
        </LinkSetDb> 
    </LinkSet> 
</eLinkResult>

A second solution is to use the file gi2taxonid which can be downloaded from the NCBI FTP site (I forget where exactly; see the Bioperl thread or search Google for it).

Of course if by GenBank identifier, you mean something other than GI, then you'll need a separate initial query to find the GI for whatever identifier you have.

ADD COMMENT
5
Entering edit mode

Here is the link to download the gi2taxid table ftp://ftp.ncbi.nih.gov/pub/taxonomy/

ADD REPLY
5
Entering edit mode
7.2 years ago
-_- ★ 1.1k

First, you need to map accession numbers (GI is deprecated) to tax ids based on nucl_*accession2taxid.gz files from here, ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/. Then based on a tax id, you can trace its whole lineage.

The whole NCBI taxonomy database is not that big. I have written some code to convert NCBI taxdump into lineages identified by tax ids, https://github.com/zyxue/ncbitax2lin. You may find it useful.

ADD COMMENT
3
Entering edit mode
12.7 years ago
Will 4.5k

Just for completeness ...

If you have a LARGE number of IDS and don't want to be limited by the EUtils pipe there is a set of files in the ftp directory: ftp://ftp.ncbi.nih.gov/pub/taxonomy/ that map TaxIDs to various other identifiers. This would only be helpful if you had too many IDS to submit to EUtils.

ADD COMMENT
1
Entering edit mode

Where do I find accession numbers?

ADD REPLY

Login before adding your answer.

Traffic: 2623 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6