Question

How to get annotation information from identifiers in NCBI nt database?

5.0 years ago

O.rka ▴ 710

I have some OTUs that I suspect are contaminants from faulty primers. I blasted them against the nt database, and the results look like this (with outfmt=6):

Otu000056       gi|1163074592|gb|CP020116.1|    99.52   209     1       0       4       212     4182852 4183060 4e-102  381
Otu000056       gi|1163034213|gb|CP020058.1|    99.52   209     1       0       4       212     2494582 2494790 4e-102  381
Otu000056       gi|1163027546|gb|CP020055.1|    99.52   209     1       0       4       212     2523673 2523465 4e-102  381
Otu000056       gi|1163005269|gb|CP020048.1|    99.52   209     1       0       4       212     2155734 2155942 4e-102  381
Otu000056       gi|1162933287|gb|CP020107.1|    99.52   209     1       0       4       212     559506  559298  4e-102  381
Otu000056       gi|1162922101|gb|CP020106.1|    99.52   209     1       0       4       212     5006373 5006581 4e-102  381
Otu000056       gi|1162894325|gb|CP020092.1|    99.52   209     1       0       4       212     4746448 4746656 4e-102  381

I have thousands of these hits. How can I go from an identifier such as gi|1163074592|gb|CP020116.1| to the description of the annotated hit, for example "Escherichia coli strain AR_0104, complete genome"? Preferably with a command-line tool that I can feed my blast6 output into directly.

assembly • 718 views

updated 5.0 years ago by GenoMax 141k • written 5.0 years ago by O.rka ▴ 710

score 1 · Answer 1 · 2019-05-08

$ for G in 1163074592 1163034213 1163027546 1163005269; do echo -n "$G " && wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=nucleotide&id=${G}" | xmllint --xpath '//eSummaryResult/DocSum/Item[@Name="Title"]/text()' - && echo ; done

1163074592 Escherichia coli strain AR_0104, complete genome
1163034213 Escherichia coli strain AR_0061, complete genome
1163027546 Escherichia coli strain AR_0069, complete genome
1163005269 Escherichia coli strain AR_0118, complete genome
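
To run the loop above over a whole BLAST table instead of hand-typed IDs, the GI numbers first have to be pulled out of the second column. A minimal sketch, assuming the subject IDs follow the gi|NUMBER|db|ACCESSION| pattern shown in the question; the function name extract_gis and the file names hits.blast6 and gi_list.txt are hypothetical:

```shell
# extract_gis: print the unique GI numbers found in column 2 of a blast6 table.
# Reads the files given as arguments, or stdin when none are given.
# Assumes subject IDs look like gi|1163074592|gb|CP020116.1| (as in the question).
extract_gis() {
    awk '{
        n = split($2, f, "|")            # break the subject ID on "|"
        if (n >= 2 && f[1] == "gi")      # keep only gi-style identifiers
            print f[2]
    }' "$@" | sort -u                    # one line per distinct GI
}

# Hypothetical usage: one GI per line, ready to feed into the loop above.
# extract_gis hits.blast6 > gi_list.txt
```

Deduplicating first matters here because many OTUs tend to hit the same subject sequences, so the number of lookups is usually far smaller than the number of hits.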

score 1 · Answer 2 · 2019-05-08

Using Entrez Direct (EDirect):

$ for G in 1163074592 1163034213 1163027546 1163005269; do esearch -db nuccore -query $G | efetch -format docsum | xtract -pattern DocumentSummary -element Caption,Title; sleep 3; done
CP020116    Escherichia coli strain AR_0104, complete genome
CP020058    Escherichia coli strain AR_0061, complete genome
CP020055    Escherichia coli strain AR_0069, complete genome
CP020048    Escherichia coli strain AR_0118, complete genome
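
Once the accession and title pairs from the command above are saved as a two-column, tab-separated lookup file, they can be appended to every BLAST hit. This is only a sketch: the function name annotate_hits and the file names titles.tsv and hits.blast6 are hypothetical, and it assumes the blast6 file is tab-delimited and that the accession is the fourth "|"-separated field of the subject ID.

```shell
# annotate_hits LOOKUP BLAST6
# LOOKUP: tab-separated "accession<TAB>title" (e.g. CP020116<TAB>Escherichia coli ...)
# BLAST6: tab-separated outfmt 6 table with gi|GI|gb|ACCESSION.VERSION| subject IDs
# Prints each hit with the title appended as an extra column ("NA" if not found).
annotate_hits() {
    awk -F'\t' '
        NR == FNR { title[$1] = $2; next }       # first file: accession -> title
        {
            split($2, f, "|")                     # gi | GI | gb | ACCESSION.VERSION
            acc = f[4]
            sub(/\.[0-9]+$/, "", acc)             # drop the version suffix (.1)
            print $0 "\t" ((acc in title) ? title[acc] : "NA")
        }' "$1" "$2"
}

# Hypothetical usage:
# annotate_hits titles.tsv hits.blast6 > hits.annotated.tsv
```

The version suffix is stripped because the Caption column in the output above is unversioned (CP020116), while the BLAST subject ID carries the version (CP020116.1).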