Hi, I want to reformat the whole nt database by adding the taxid and scientific name to each header, using the following script. $file is one of the split files.
while IFS= read -r line
do
    if [[ "$line" == ">"* ]]; then
        # Sequence identifier: the text between '>' and the first space
        GI=$(echo "$line" | cut -d '>' -f 2 | cut -d ' ' -f 1)
        # Two separate requests per header: one for the taxid, one for the organism name
        taxID=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${GI}&rettype=fasta&retmode=xml" | grep TSeq_taxid | cut -d '>' -f 2 | cut -d '<' -f 1)
        orgname_name=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${GI}&rettype=fasta&retmode=xml" | grep TSeq_orgname | cut -d '>' -f 2 | cut -d '<' -f 1)
        echo ">${GI} scientific_name=${orgname_name}; taxid=${taxID};" >> "t_${file}"
    else
        echo "$line" >> "t_${file}"
    fi
done < "$file"
But it is not stable: every run produces some records with no taxid value, like the second one in the following example:
>JQ288447.1 scientific_name=uncultured Geobacter sp.; taxid=186741;
>JQ288458.1 scientific_name=uncultured Geobacter sp.; taxid=;
I tested the failed records manually, and each one returned the correct taxID.
What is causing these failures?
Thank you!!
You should first try adding some echo statements that print what each request actually returns, to see how it turns out and why it doesn't work.
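For instance, a minimal sketch along these lines prints the raw response whenever no taxid comes back, and retries a few times. The helper names `extract_tag` and `fetch_record` are hypothetical, not part of the original script. It assumes the empty values come from failed or throttled requests. That seems plausible, since the script sends two requests per header and NCBI limits clients without an API key to about three requests per second, which would also explain how a record can end up with a name but no taxid.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for debugging; the names are illustrative only.

# Print the text content of the first <tag>...</tag> element read from stdin.
extract_tag() {
    local tag=$1
    sed -n "s:.*<${tag}>\(.*\)</${tag}>.*:\1:p" | head -n 1
}

# Fetch the TSeq XML record for one accession. If the reply contains no
# <TSeq_taxid> element, echo the raw reply to stderr, wait, and retry.
fetch_record() {
    local acc=$1 tries=${2:-3} attempt=1 xml
    local url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${acc}&rettype=fasta&retmode=xml"
    while [ "$attempt" -le "$tries" ]; do
        xml=$(curl -s "$url")
        case $xml in
            *'<TSeq_taxid>'*)
                printf '%s\n' "$xml"
                return 0
                ;;
        esac
        # Show what actually came back so the failure can be diagnosed.
        echo "attempt ${attempt} for ${acc} failed; response was: ${xml}" >&2
        sleep 1
        attempt=$((attempt + 1))
    done
    return 1
}

# Example use inside the loop (one request per header instead of two):
#   if xml=$(fetch_record "$GI"); then
#       taxID=$(printf '%s\n' "$xml" | extract_tag TSeq_taxid)
#       orgname_name=$(printf '%s\n' "$xml" | extract_tag TSeq_orgname)
#   fi
```

Parsing both fields from a single response halves the number of requests, and the stderr message shows whether the failed replies are empty, error messages, or something else.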