GenBank - Finding Unique Taxon IDs For A Specific Locus
11.5 years ago

Does anyone know of a way to retrieve quantitative information on specific taxa for a specific locus in GenBank? For example, I can easily pull out the number of COI sequences for Annelida, but what if I want to know how many species of Annelida are represented by COI in GenBank? If I search for Annelida AND COI, I get every sequence for (e.g.) Eisenia fetida, but that species should count only once when I am counting the unique species that have COI sequences in GenBank. Does that make sense? I appreciate all ideas on this.

11.5 years ago
Neilfws 49k

I don't know that there is an easy way to do this within the NCBI website. However, you can get some of the way with a well-defined search and some scripting.

I would start with a taxonomy search for Annelida. Clicking through (just keep clicking on "Annelida") gets to the summary page, showing (at present) 31 606 nucleotide records.

We can then qualify that search further:

txid6340[Organism:exp] AND COI[Gene]

to retrieve 11 208 records.

At this point, click on "Send to -> File" and choose an output format. "GenBank" or "XML" are good choices, since they are structured formats and contain a field for the species. Be aware that with this many records the download may be large.
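If the full download is too large, one possible alternative is to fetch only document summaries from the command line. The sketch below assumes NCBI's EDirect tools (esearch, efetch, xtract) are installed and that the nucleotide summaries expose TaxId and Organism elements; the function name fetch_organisms and the output file name are hypothetical. It is not part of the approach described in this answer, just a variation on it.

```shell
# Sketch only: assumes NCBI EDirect (esearch, efetch, xtract) is on PATH.
# fetch_organisms is a hypothetical helper name, not part of EDirect.
fetch_organisms() {
  query='txid6340[Organism:exp] AND COI[Gene]'
  esearch -db nuccore -query "$query" |
    efetch -format docsum |
    xtract -pattern DocumentSummary -element TaxId Organism
}

# Usage (needs network access, so it is left commented out here):
# fetch_organisms > taxid_organism.tsv
# cut -f 1 taxid_organism.tsv | sort -u | wc -l   # distinct taxon IDs
```

Counting distinct taxon IDs rather than names avoids treating spelling variants of the same organism as separate entries, at the cost of depending on the summary fields being present.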

GenBank format includes the field ORGANISM, e.g.

ORGANISM  Drilonereis longa

So we can count the names that follow the word ORGANISM using grep, sed and awk:

# sed strips the ORGANISM label and its padding, however many spaces there are
grep '^ *ORGANISM' sequence.gb |
sed 's/^ *ORGANISM *//' |
awk 'BEGIN{OFS="\t"} {n[$0]++} END {for (i in n) {print i, n[i]}}' > wormcount.txt

Result (first 5 lines only):

Nephtys sp. CMC05    1
Nephtys sp. CMC06    1
Amynthas lini    9
Mesenchytraeus solifugus    64
Satchellius mammalis    11
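Note that awk's "for (i in n)" loop returns keys in no particular order, which is why the lines above look shuffled. If reproducible output is wanted, a sort/uniq pipeline is one option. The following is a minimal sketch on a made-up five-line input (sample.gb and sample_counts.txt are illustrative names, not files from the real download):

```shell
# Small made-up input, standing in for the downloaded sequence.gb.
printf '%s\n' \
  '  ORGANISM  Amynthas lini' \
  '  ORGANISM  Nephtys sp. CMC05' \
  '  ORGANISM  Amynthas lini' \
  '  ORGANISM  Eisenia fetida' \
  '  ORGANISM  Amynthas lini' > sample.gb

# Strip the ORGANISM label, count duplicates, sort by count (highest first,
# ties broken by name), then emit "name<TAB>count".
grep '^ *ORGANISM' sample.gb |
  sed 's/^ *ORGANISM *//' |
  sort | uniq -c | sort -k1,1nr -k2 |
  awk '{c = $1; sub(/^ *[0-9]+ +/, ""); print $0 "\t" c}' > sample_counts.txt

cat sample_counts.txt
```

The output format matches wormcount.txt, so the same "wc -l" count applies to it.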

Unique ORGANISM with COI:

wc -l wormcount.txt
# 1995 wormcount.txt
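One caveat: this counts distinct ORGANISM strings, which is not quite the same as distinct species. Entries such as "Nephtys sp. CMC05" are unidentified specimens, and any subspecies names would be counted separately from their parent species. A rough way to estimate named species only is sketched below on made-up rows (sample_wormcount.txt and the subspecies line are invented for illustration); the list of qualifiers to exclude is an assumption and may need extending.

```shell
# Made-up wormcount.txt-style input: organism name, a tab, record count.
printf '%s\t%s\n' \
  'Nephtys sp. CMC05' 1 \
  'Nephtys sp. CMC06' 1 \
  'Amynthas lini' 9 \
  'Eisenia fetida' 40 \
  'Eisenia fetida subsp. A' 3 > sample_wormcount.txt

# Keep genus + specific epithet only, drop names whose second word is an
# open-nomenclature qualifier (sp., cf., aff., nr.), then count what is left.
named_species=$(cut -f 1 sample_wormcount.txt |
  awk '$2 != "" && $2 !~ /^(sp|cf|aff|nr)\.$/ {print $1, $2}' |
  sort -u | wc -l | tr -d ' ')

echo "$named_species"
```

Whether unidentified "sp." entries should be excluded, or counted as putative species, depends on the question being asked, so both numbers are worth reporting.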
11.5 years ago

Thanks a lot, I thought that it might come down to a bit of simple scripting. This seems straightforward; I'll post if I run into trouble. Thanks again!

11.4 years ago

Should I ever publish this information, I would like to reference you in the acknowledgements. What name would you want me to put in the acknowledgements? Thanks a lot again, this saved me a lot of time!! Sebastian
