Biostar Beta. Not for public use.
Question: Getting taxid from accession number
1
Entering edit mode

Hi - I have a question that is related to this thread:

Automatically Getting The Ncbi Taxonomy Id From The Genbank Identifier

I assumed if I had a new question, I needed to open a new thread. I apologize if I should have posted under the original question.

I have a file with a list of accession numbers in it:

KU587513
KU587514
KU605633

I have a bash shell script I am using from that referenced thread to get the taxid from each accession using efetch:

#!/bin/bash
file="accessions_out.txt"
while read -r ACC
do
     touch taxid_out.txt
     curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id={ACC}&rettype=fasta&retmode=xml" |\
     grep TSeq_taxid |\
     cut -d '>' -f 2 |\
     cut -d '>' -f 1 >> taxid_out.txt
done <"$file"

This works and gives me an output file taxid_out.txt containing:

1844429
1636871
129076

The problem I have is when the file with the accession numbers includes an accession number that has no corresponding taxid, I can't figure out how to ouptut something like "no_match". For example, this accessions file:

KU587513
KUBOGUS
KU605633

gives the following output file:

1844429
129076

I want it to give:

1844429
no_match
129076

I don't know how to code this correctly. This is what I have tried:

 #!/bin/bash
file="accessions_out.txt"
while read -r ACC
do
     touch taxid_out.txt
     VAR = $(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id={ACC}&rettype=fasta&retmode=xml")
     if [ -z "$VAR" ]
     then
        "no_match"  >> taxid_out.txt 
     else
         taxid="$(grep 'TSeq_taxid' $VAR)" #at this point I'm happy to just print the whole TSeq_taxid
         "$taxid" >> taxid_out.txt
     fi
done <"$file"

The problem with this is if the accession number has no corresponding taxid, the curl returns something with <error> but I am not sure how to grab that. My real list of accession numbers is ~10000, and there are a few "bad apples" in the list that I am having trouble finding. Getting "no_match" printed would help me find those. I am not very familiar with shell scripting, but someone who is could probably figure this out very quickly. I would sincerely appreciate any help. Thanks -

1
Entering edit mode

Without testing this, my first guess would be that your VAR test is failing. Also, you shouldn't have spaces around the VAR =.

You'll want to try something like this instead, to tell the differece:

if [ -z ${VAR+x} ]; then echo "VAR is unset"; else echo "var is set to '$VAR'"; fi

There is a difference between a variable that is unset, versus empty.

See:

https://stackoverflow.com/questions/3601515/how-to-check-if-a-variable-is-set-in-bash

As an alternative approach, you might be interested in parsing the tabular assembly_summary files which contain the accessions and the TaxIDs:

e.g.

$ wget --no-check-certificate https://ftp.ncbi.nih.gov/genomes/genbank/assembly_summary_genbank.txt

or

$ wget --no-check-certificate https://ftp.ncbi.nih.gov/genomes/refseq/assembly_summary_refseq.txt

Provides the following (see columns 1 and 6-7)

$ head assembly_summary_genbank.txt
#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession    bioproject  biosample   wgs_master  refseq_category taxid   species_taxid   organism_name   infraspecific_name  isolate version_status  assembly_level  release_type    genome_rep  seq_rel_date    asm_name    submitter   gbrs_paired_asm paired_asm_comp ftp_path    excluded_from_refseq    relation_to_type_material
GCA_000001215.4 PRJNA13812  SAMN02803731        reference genome    7227    7227    Drosophila melanogaster         latest  Chromosome  Major   Full    2014/08/01  Release 6 plus ISO1 MT  The FlyBase Consortium/Berkeley Drosophila Genome Project/Celera Genomics   GCF_000001215.4 identicalftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/215/GCA_000001215.4_Release_6_plus_ISO1_MT
GCA_000001405.27    PRJNA31257          reference genome    9606    9606    Homo sapiens            latest  Chromosome  Patch   Full    2017/12/21  GRCh38.p12  Genome Reference Consortium GCF_000001405.38    different   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.27_GRCh38.p12
GCA_000001515.5 PRJNA13184  SAMN02981217    AACZ00000000.4  na  9598    9598    Pan troglodytes     Yerkes chimp pedigree #C0471 (Clint)    latest  Chromosome  Major   Full    2016/05/03  Pan_tro 3.0 Chimpanzee Sequencing and Analysis Consortium   GCF_000001515.7 different   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/515/GCA_000001515.5_Pan_tro_3.0
GCA_000001545.3 PRJNA20869  SAMN02981238    ABGA00000000.1  na  9601    9601    Pongo abelii        ISIS 71 latest  Chromosome  Major   Full    2008/11/13  P_pygmaeus_2.0.2    Orangutan Genome Sequencing Consortium  na  na  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/545/GCA_000001545.3_P_pygmaeus_2.0.2
GCA_000001635.8 PRJNA20689          reference genome    10090   10090   Mus musculus            latest  Chromosome  Patch   Full    2017/09/15  GRCm38.p6   Genome Reference Consortium GCF_000001635.26    identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.8_GRCm38.p6
ADD COMMENTlink 22 months ago Joe 12k
Entering edit mode
0

Thanks for taking the time to respond to me, and the information about the unset versus empty variable, plus the error in the spacing in my code. I think I have it figured out and will post the updated code below.

I will also look into the assembly summary files as well as an alternative to my approach. Thanks again -

ADD REPLYlink 22 months ago
auninsaw
• 10
Entering edit mode
0

This seems to get the job done, but there are probably still errors present or ways to make it shorter:

 #!/bin/bash
file="accessions_out.txt"
while read -r ACC
do
     touch taxid_out.txt
     VAR=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id {ACC}&rettype=fasta&retmode=xml")
     if  echo "$VAR" | grep -q "<ERROR>"; then
          echo "Not present" >> taxid_out.txt         
    elif echo "$VAR" | grep -q "TSeq_taxid"; then
        taxid=$(echo "$VAR" | grep TSeq_taxid)
        echo "$taxid" |\
        cut -d '>' -f 2 |\
        cut -d '<' -f 1 >> taxid_out.txt
 fi
done <"$file"
ADD REPLYlink 22 months ago
auninsaw
• 10

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0