Question

Mistakes in a bash script for adding taxid for nt database

0

6.9 years ago

wyc661217 ▴ 10

Hi, I want to reformat the whole nt database by adding the taxid and scientific name to each FASTA header, using the following script. $file is one of the split files.

cat $file | while read line
    do 
    if [[ "$line" == ">"* ]]; then
        GI=`echo "$line" | cut -d '>' -f 2 | cut -d ' ' -f 1`
        taxID=`curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${GI}&rettype=fasta&retmode=xml" | grep TSeq_taxid | cut -d '>' -f 2 | cut -d '<' -f 1`
        orgname_name=`curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${GI}&rettype=fasta&retmode=xml" | grep TSeq_orgname | cut -d '>' -f 2 | cut -d '<' -f 1`
        echo ">${GI} scientific_name=${orgname_name}; taxid=${taxID};" >>  t_${file}
    else
        echo "$line" >> t_${file}
    fi
    done

But it is not stable: every run produces some records with an empty taxid value, like the second one in the following example:

>JQ288447.1 scientific_name=uncultured Geobacter sp.; taxid=186741;
>JQ288458.1 scientific_name=uncultured Geobacter sp.; taxid=;

When I ran the same request manually for the failed records, it worked and returned the correct taxID.

What is causing these failures?

Thank you!!

Bash nt database taxid shell efetch • 1.6k views

ADD COMMENT • link updated 6.9 years ago by Istvan Albert 100k • written 6.9 years ago by wyc661217 ▴ 10

1

You should first try it with some echo statements, e.g.

echo curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${GI}&rettype=fasta&retmode=xml"

That lets you see the exact command being run and why it doesn't work.

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

score 1 · Answer 1 · 2017-06-20

1

6.9 years ago

Istvan Albert 100k

What you are doing is very inefficient: you are bombarding NCBI repeatedly for the same ID. You download the same data twice (and many more times when you rerun the script) to extract two different pieces of information.

What you should do instead is first download each XML record into a file, then process those files separately. This, in turn, will help you understand what might be going wrong.

I think you are being denied access for overloading the service.
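
The download-once approach can be sketched roughly as follows. This is a minimal sketch, not a tested pipeline: the helper names fetch_xml, xml_field and make_header are made up for illustration, and it assumes the saved XML keeps each TSeq_taxid and TSeq_orgname element on a single line, as the grep/cut version in the question also assumes.

```shell
#!/usr/bin/env bash
# Sketch: fetch each record once, save it, then parse the saved file.
# fetch_xml, xml_field and make_header are hypothetical helper names.

# Download the XML for one accession into a file (one request per ID).
fetch_xml() {
    local acc=$1 out=$2
    curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${acc}&rettype=fasta&retmode=xml" > "$out"
}

# Print the text content of the first <TAG>...</TAG> element found in a file.
# Assumes the opening and closing tags sit on the same line.
xml_field() {
    local tag=$1 xml_file=$2
    sed -n "s:.*<${tag}>\(.*\)</${tag}>.*:\1:p" "$xml_file" | head -n 1
}

# Build the new FASTA header from a saved XML file.
# Returns non-zero and reports on stderr when no taxid is found,
# so failed records can be inspected instead of silently written.
make_header() {
    local acc=$1 xml_file=$2 taxid orgname
    taxid=$(xml_field TSeq_taxid "$xml_file")
    orgname=$(xml_field TSeq_orgname "$xml_file")
    if [ -z "$taxid" ]; then
        echo "no taxid for ${acc}; inspect ${xml_file}" >&2
        return 1
    fi
    echo ">${acc} scientific_name=${orgname}; taxid=${taxid};"
}
```

A run for one record would then be fetch_xml JQ288458.1 JQ288458.1.xml followed by make_header JQ288458.1 JQ288458.1.xml. Because the response is kept on disk, a record that comes back without a taxid can be opened to see what the server actually returned.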

ADD COMMENT • link 6.9 years ago by Istvan Albert 100k

0

Thanks Istvan!

I will follow your suggestion: download the file for each record and then search it for those two pieces of information. But if the failures come from the NCBI server denying requests, I guess this won't fully help, as the script still needs to access the server for every record.

I know this is not a smart approach. Could you suggest some other possible solutions?

Thanks again!!

ADD REPLY • link 6.8 years ago by wyc661217 ▴ 10

0

I think you are being denied because you are asking for the exact same data in two immediately consecutive requests. If you slow down your script a bit, it might not run into that problem.

Some tips:

- Perform a search on a common field and pipe that into a fetch; you might be able to download the whole data set in one shot.
- Put multiple IDs in the same query; this will download all those entries in one shot.
- Have the program sleep a few seconds before downloading the next entry.
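
The second and third tips can be combined in a rough sketch like the one below. The helper names batch_ids, efetch_url and fetch_batches are hypothetical, the batch size and pause are arbitrary values to tune, and it relies on efetch accepting a comma-separated list in its id parameter. It also assumes accessions contain no spaces or quote characters.

```shell
#!/usr/bin/env bash
# Sketch: request many accessions per call and pause between calls.
# batch_ids, efetch_url and fetch_batches are hypothetical helper names.

# Read one accession per line on stdin; print comma-joined batches of SIZE.
batch_ids() {
    local size=$1
    xargs -n "$size" | tr ' ' ','
}

# Build the efetch URL for a comma-separated list of accessions.
efetch_url() {
    echo "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=$1&rettype=fasta&retmode=xml"
}

# Read accessions on stdin, fetch them batch by batch into OUTDIR,
# sleeping PAUSE seconds after every request.
fetch_batches() {
    local size=$1 pause=$2 outdir=$3
    mkdir -p "$outdir"
    batch_ids "$size" | {
        n=0
        while IFS= read -r ids; do
            n=$((n + 1))
            curl -s "$(efetch_url "$ids")" > "${outdir}/batch_${n}.xml"
            sleep "$pause"
        done
    }
}
```

For example, grep '^>' "$file" | cut -d '>' -f 2 | cut -d ' ' -f 1 | fetch_batches 100 1 xml_batches would collect the accessions from one split file and fetch them 100 at a time with a one-second pause. Each saved batch file then holds several records, so the taxid and organism name have to be matched back to their accession when parsing.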

ADD REPLY • link 6.8 years ago by Istvan Albert 100k