Mistakes in a bash script for adding taxid for nt database
6.9 years ago
wyc661217 ▴ 10

Hi, I want to reformat the whole nt database by adding the taxid and scientific name to each header with the following script. $file is one of the files the database was split into.

# Copy $file to t_$file, rewriting each FASTA header with organism name and taxid
while IFS= read -r line
    do
    if [[ "$line" == ">"* ]]; then
        # Accession: text between '>' and the first space
        GI=$(echo "$line" | cut -d '>' -f 2 | cut -d ' ' -f 1)
        taxID=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${GI}&rettype=fasta&retmode=xml" | grep TSeq_taxid | cut -d '>' -f 2 | cut -d '<' -f 1)
        orgname_name=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${GI}&rettype=fasta&retmode=xml" | grep TSeq_orgname | cut -d '>' -f 2 | cut -d '<' -f 1)
        echo ">${GI} scientific_name=${orgname_name}; taxid=${taxID};" >> "t_${file}"
    else
        echo "$line" >> "t_${file}"
    fi
    done < "$file"

But it is not stable: it always produces some records with no taxid value, like the second one in the following example:

>JQ288447.1 scientific_name=uncultured Geobacter sp.; taxid=186741;
>JQ288458.1 scientific_name=uncultured Geobacter sp.; taxid=;

When I tested the failed records manually, the request worked and returned the correct taxID.

What is causing these failures?

Thank you!!

Bash nt database taxid shell efetch • 1.6k views

You should first try it with some echo statements, e.g.

echo curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${GI}&rettype=fasta&retmode=xml"

to see what the command expands to and why it doesn't work.

6.9 years ago

What you are doing is very inefficient: you are bombarding NCBI repeatedly for the same ID. You download the same data twice (and many more times when you rerun the script) to extract two different pieces of information.

What you should do instead is first download each XML into a file, then process those files separately. This, in turn, will help you understand what might be going wrong.

I think you are being denied access for overloading the service.
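
As a rough illustration of the "download first, parse later" idea, here is a minimal sketch. The function names fetch_record and xml_field and the xml_cache directory are made up for this example; the URL and the TSeq_taxid / TSeq_orgname tags are the ones from the original script.

```shell
#!/usr/bin/env bash
# Sketch: fetch each record once, keep the XML on disk, then parse the saved file.

EFETCH="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Download the XML for one accession into xml_cache/<id>.xml
# (xml_cache is a hypothetical directory name).
# The download is skipped when a non-empty file is already there.
fetch_record() {
    local id="$1" out="xml_cache/$1.xml"
    mkdir -p xml_cache
    [ -s "$out" ] && return 0
    curl -s "${EFETCH}?db=nuccore&id=${id}&rettype=fasta&retmode=xml" > "$out"
}

# Print the text of the first <tag>...</tag> element found in an XML file.
xml_field() {
    local tag="$1" xml="$2"
    sed -n "s:.*<${tag}>\(.*\)</${tag}>.*:\1:p" "$xml" | head -n 1
}
```

A call such as fetch_record JQ288458.1 followed by xml_field TSeq_taxid xml_cache/JQ288458.1.xml then needs only one request per record, and a failed request leaves behind an empty or unexpected file that can be opened to see what the server actually returned.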


Thanks Istvan!

I will follow your suggestion and download the XML for each record to a file, then extract the two pieces of information from that file. But if the failures come from the NCBI server denying requests, I guess this won't help, as the script still needs to access the server for every record.

I know this is not a smart way. Could you suggest some other possible solutions?

Thanks again!!


I think you are being denied because you are asking for the exact same data in two consecutive, immediate requests. If you slow down your script a bit, it might not run into that problem.

Some tips:

  • perform a search on a common field and pipe that into a fetch; you might be able to download all the data in one shot
  • put multiple IDs in the same query; this will download all those entries in one shot
  • have the program sleep a few seconds before downloading the next entry
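
The last two tips could be combined roughly as follows. This is only a sketch: the function names, the batch size of 100, the one-second pause, and the batch_N.xml output names are all assumptions for illustration, not values recommended anywhere in this thread. The id parameter of efetch accepts a comma-separated list.

```shell
#!/usr/bin/env bash
# Sketch: put many accessions into each efetch request and pause between requests.

EFETCH="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Print the accession from every FASTA header: text after '>' up to the first space.
fasta_ids() {
    sed -n 's/^>\([^ ]*\).*/\1/p' "$1"
}

# Read one ID per line on stdin; print comma-joined groups of at most $1 IDs,
# one group per output line.
batch_ids() {
    xargs -n "$1" | tr ' ' ','
}

# Fetch every batch into batch_<n>.xml, sleeping between requests.
# Arguments: FASTA file, batch size (assumed default 100), pause in seconds (assumed default 1).
fetch_batches() {
    local fasta="$1" size="${2:-100}" pause="${3:-1}" n=0
    fasta_ids "$fasta" | batch_ids "$size" | while read -r ids; do
        n=$((n + 1))
        curl -s "${EFETCH}?db=nuccore&id=${ids}&rettype=fasta&retmode=xml" > "batch_${n}.xml"
        sleep "$pause"
    done
}
```

Running fetch_batches "$file" would then issue one request per 100 headers instead of two per header. Each batch file holds several records, so the taxid and organism name have to be matched back to their accession when parsing; the accession should appear in each record of the XML, but check a downloaded file to confirm the tag name.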