Question

Automatically Getting The Ncbi Taxonomy Id From The Genbank Identifier

15

Entering edit mode

12.7 years ago

Alf ▴ 490

The question is whether, given a (long) list of Genbank identifiers, is possible to get the ncbi taxonomy identifier for each one. I know it may seem very easy, but I have not found any web service which makes this, and I wouldn't like to do this manually.

genbank conversion ncbi taxonomy • 34k views

ADD COMMENT • link updated 7.2 years ago by -_- ★ 1.1k • written 12.7 years ago by Alf ▴ 490

score 24 · Answer 1 · 2011-08-07

24

Entering edit mode

12.7 years ago

Pierre Lindenbaum 161k

NCBI efetch can use an accession number instead of a gi. and the XML/Fasta returned by efetch contains the taxonomy-ID:

$ curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=HQ844023&rettype=fasta&retmode=xml" 

 http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
 <TSeqSet>
<TSeq>
  <TSeq_seqtype value="nucleotide"/>
  <TSeq_gi>341832806</TSeq_gi>
  <TSeq_accver>HQ844023.1</TSeq_accver>
  <TSeq_taxid>1056490</TSeq_taxid>
  <TSeq_orgname>Rotavirus A HC91xUK reassortant (UKg9KC-1)</TSeq_orgname>
  <TSeq_defline>Rotavirus A HC91xUK reassortant (UKg9KC-1) NSP3 protein gene, complete cds</TSeq_defline>
  <TSeq_length>942</TSeq_length>
  <TSeq_sequence>(...)</TSeq_sequence>
</TSeq>

</TSeqSet>

So you can use efetch and egrep to get the taxonomic-id for each of your ACNs.

for ACC in A00002 X53307 BB145968 CAA42669 V00181  AH002406  HQ844023
do
   echo -n -e "$ACC\t"
   curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ACC}&rettype=fasta&retmode=xml" |\
   grep TSeq_taxid |\
   cut -d '>' -f 2 |\
   cut -d '<' -f 1 |\
   tr -d "\n"
   echo
 done


A00002    9913
X53307    1423
BB145968    10090
CAA42669    9913
V00181    7154
AH002406    538120
HQ844023    1056490

ADD COMMENT • link 6.4 years ago by Pierre Lindenbaum 161k

3

Entering edit mode

Man, the script kick asses. I only was looking for some orientation, but the script is _exactly_ what I want. Thanks a lot!

ADD REPLY • link 12.7 years ago by Alf ▴ 490

2

Entering edit mode

I modified your script slightly to retrieve these data for some accession IDs that I failed to recover when running blastdbcmd -entry_batch against my local blast database install, even though I could get information on them when using Entrez (my local data was the latest release from the FTP). I ran the following, where failed_accessions.txt was a list of 662 accession IDs, one per line:

cat failed_accessions.txt | xargs ./get_failed_acc.sh > failed_acctax.txt

The get_failed_acc.sh script was a modification of yours as follows:

for ACC in $@
do
    echo -n "$ACC\t"
    curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ACC}&rettype=fasta&retmode=xml" |\
    grep TSeq_taxid |\
    cut -d '>' -f 2 |\
    cut -d '<' -f 1 |\
    tr -d "\n"
echo
sleep 0.25
done

I added the sleep in so I didn't hammer the server too much!

ADD REPLY • link 10.6 years ago by Steve Moss 2.3k

2

Entering edit mode

NCBI is now using https:// instead of http://

Make sure you include the "s" in your link!

ADD REPLY • link 7.4 years ago by toddknutson ▴ 60

1

Entering edit mode

Hi thanks for the script! I am new to linux and I got a question. I wanna use my own accession numbers to replace "A00002 X53307 BB145968 CAA42669 V00181 AH002406 HQ844023". But my accession numbers are in a .CSV file. There are several hundreds of them. How can I copy them to the for loop? Or is it possible to read the .csv file directly? Thanks a lot!

ADD REPLY • link 8.1 years ago by wintergb ▴ 10

2

Entering edit mode

Just do tr ',' ' ' < filename.csv | xargs ./get_failed_acc.sh. The tr command will replace all comma's by spaces an convert them to arguments.

ADD REPLY • link 8.0 years ago by steven.vanmaldeghem ▴ 20

1

Entering edit mode

This may work only if the url goes https not http, as ncbi has turned https only in 2017.

ADD REPLY • link 6.4 years ago by Naren ▴ 990

0

Entering edit mode

I've fixed this, thanks.

ADD REPLY • link 6.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

How to use this script for multiple file as input, I want to extract the locus_tag by using start and stop genomic position

For example ACC=accession.txt START=start.txt STOP=stop.txt

curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ACC}&seq_start=$(START)&$(STOP)&rettype=gb" |\ grep locus_tag |\

I could only manage with - for ACC in $(cat accession.txt), and could not make a nested for loop to take other variables from input files.

If the loop works, i am getting duplicate/triplicates hits like, without accession number. So, I could not use the retrieved data :( curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=AE006468.2&seq_start=8718&seq_stop=9319&rettype=gb" | grep "/locus_tag "

Output is
                 /locus_tag="STM0008"
                 /locus_tag="STM0008"
                 /locus_tag="STM0008"

What to do ?

ADD REPLY • link 5.7 years ago by vishwaas1704 ▴ 30

score 7 · Answer 2 · 2011-08-08

7

Entering edit mode

12.7 years ago

Yogesh Pandit ▴ 520

The BioIDMapper package in R might be helpful

 library(BioIDMapper)
 data(glist)

 # 1=GI number, 31=NCBI Taxid
 myMap <- bio.convert(glist, 1, 31)

Quick Guide for the BioIDMapper package

ADD COMMENT • link 12.7 years ago by Yogesh Pandit ▴ 520

1

Entering edit mode

Hi!!

I didn't know about this package... !

Can you explain me please, how should I load my ID_list file? I have been tried with the function data("myfile") and read.table("myfile") but does not work...

Thank you so much :)

ADD REPLY • link 8.4 years ago by marcelavillegasp ▴ 10

score 5 · Answer 3 · 2011-08-07

There's a discussion of this problem on the Bioperl mailing list.

The best solution is to use ELink from EUtils. If you have a GenBank GI number such as 341926284, you construct a query linking nucleotide and taxonomy like this:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nucleotide&db=taxonomy&id=341926284

Result: taxid 9606 (Homo sapiens).

 
http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_101123.dtd"> 
    <eLinkResult>

        <LinkSet> 
            <DbFrom>nuccore</DbFrom> 
            <IdList> 
                <Id>341926284</Id> 
        </IdList> 
        <LinkSetDb> 
            <DbTo>taxonomy</DbTo> 
            <LinkName>nuccore_taxonomy</LinkName> 
            <Link> 
                <Id>9606</Id> 
            </Link> 
        </LinkSetDb> 
    </LinkSet> 
</eLinkResult>

A second solution is to use the file gi2taxonid which can be downloaded from the NCBI FTP site (I forget where exactly; see the Bioperl thread or search Google for it).

Of course if by GenBank identifier, you mean something other than GI, then you'll need a separate initial query to find the GI for whatever identifier you have.

score 5 · Answer 4 · 2017-02-13

First, you need to map accession numbers (GI is deprecated) to tax ids based on nucl_*accession2taxid.gz files from here, ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/. Then based on a tax id, you can trace its whole lineage.

The whole NCBI taxonomy database is not that big. I have written some code to convert NCBI taxdump into lineages identified by tax ids, https://github.com/zyxue/ncbitax2lin. You may find it useful.

score 3 · Answer 5 · 2011-08-08

3

Entering edit mode

12.7 years ago

Will 4.5k

Just for completeness ...

If you have a LARGE number of IDS and don't want to be limited by the EUtils pipe there is a set of files in the ftp directory: ftp://ftp.ncbi.nih.gov/pub/taxonomy/ that map TaxIDs to various other identifiers. This would only be helpful if you had too many IDS to submit to EUtils.