Get number of available genomes per taxon - NCBI
2.3 years ago
tlorin • 250
Switzerland

Dear all,

I am running tblastn with a protein query against the WGS databases on NCBI, to search directly in the genomes of some taxa.

The protein is not present in every genome, and I would like to be able to say "Protein X is present in n organisms out of the N in this lineage." For that I need N, the total number of sequenced genomes per taxon.

I have found two ways that give quite different results.

1. tblastn the protein on, say "arthropoda", and retrieve the number appearing in the corresponding field in the output page: "wgs (676 databases)"

2. use this page and retrieve the number. Here, for "Arthropoda", it is 552

Would you know any other command line or online tool to get N? The ideal way would be to use a taxon number as input (6656 for Arthropoda).

ncbi genome • 428 views
tail -n +2 assembly_summary_genbank.txt | datamash -sH -g 6 count 6 collapse 8

@cpad0112 thank you for your help. What is this line doing exactly? I cannot get the number of available genomes for Arthropoda for instance.

It is doing something similar to what I did below using a different program and eliminating a couple of lines at the beginning of that file.

Did you see my note below?
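
For anyone wondering what the datamash line above does: it groups the rows of assembly_summary_genbank.txt by column 6 (taxid), counts the rows in each group, and collapses column 8 (organism name) into a comma-separated list. A rough awk equivalent is sketched below. The function name group_by_taxid is made up for illustration, and the column positions (6 = taxid, 8 = organism name) are assumed from the datamash command rather than checked against the current file layout.

```shell
# group_by_taxid FILE
# Prints one line per taxid: taxid<TAB>count<TAB>name1,name2,...
# Assumes a tab-separated file with taxid in column 6 and organism name in
# column 8; lines starting with "#" are treated as comments/headers.
group_by_taxid() {
    awk -F '\t' '
        /^#/ { next }                         # skip comment and header lines
        {
            count[$6]++
            # append the organism name to the list kept for this taxid
            names[$6] = (names[$6] == "" ? $8 : names[$6] "," $8)
        }
        END {
            for (t in count)
                printf "%s\t%d\t%s\n", t, count[t], names[t]
        }
    ' "$1" | sort -n                          # awk array order is unspecified
}
```

Each output row is a taxid exactly as annotated in the file, so a higher-level taxon such as Arthropoda does not appear as a row of its own.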

2.3 years ago
tlorin • 250
Switzerland

I found a way.

Count the number of available genomes for a given taxon (here, arthropods; note the wgs):

w3m -dump 'https://www.ncbi.nlm.nih.gov/nuccore/?term=wgs-master+%5Bprop%5D+AND+arthropoda+%5Borgn%5D' | grep "Items:" | rev | cut -f1 -d" " | rev


Count the number of available transcriptomes for a given taxon (here, arthropods; note the tsa):

w3m -dump 'https://www.ncbi.nlm.nih.gov/nuccore/?term=tsa-master+%5Bprop%5D+AND+arthropoda+%5Borgn%5D' | grep "Items:" | rev | cut -f1 -d" " | rev
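
The grep | rev | cut | rev chain simply takes the last space-separated field of the "Items:" line. A slightly more defensive sketch of that step is below. The helper name items_total is hypothetical, and the line shape "Items: 1 to 20 of 676" is an assumption about what the dumped page contains. It stops at the first match, in case the line appears more than once in the dump.

```shell
# items_total: read a text dump on stdin and print the last field of the
# first line that contains "Items:" (assumed to look like
# "Items: 1 to 20 of 676").
items_total() {
    awk '/Items:/ { print $NF; exit }'
}

# Usage (needs w3m and network access; the URL is quoted so that the shell
# does not treat "?" as a glob character):
# w3m -dump 'https://www.ncbi.nlm.nih.gov/nuccore/?term=wgs-master+%5Bprop%5D+AND+arthropoda+%5Borgn%5D' | items_total
```

Swapping wgs-master for tsa-master in the URL gives the transcriptome count in the same way.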

These links are for whole genome shotgun sequence records. As such there is no guarantee that these genomes are complete or usable. You would want to include this source of "genome" records in your paper when you mention X out of Y genomes.

@genomax That's true, thanks for mentioning this. For a list of "complete or usable" genomes, what would you suggest instead?

4 weeks ago
genomax 68k
United States

Get the assembly_summary_genbank.txt from here. awk -F '\t' '{print $6}' assembly_summary_genbank.txt | sort | uniq -c > file will give you the number of genomes for each taxid. Similar files can be found for RefSeq genomes here.

I see 20 genomes for arthropoda (taxid: 552) as of this writing. taxid annotations are at species level.
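
To look up a single taxid, and optionally keep only assemblies at a given level (one way to approach the "complete or usable" question), something like the following could work. This is only a sketch: count_assemblies is a made-up name, and it assumes the tab-separated layout with taxid in column 6 and assembly_level in column 12 (values such as "Complete Genome", "Chromosome", "Scaffold", "Contig"). Check the header line of your copy of the file before relying on those positions.

```shell
# count_assemblies FILE TAXID [LEVEL]
# Counts rows whose taxid (column 6, assumed) equals TAXID. If LEVEL is
# given, only rows whose assembly_level (column 12, assumed) matches it
# exactly are counted.
count_assemblies() {
    awk -F '\t' -v taxid="$2" -v level="${3:-}" '
        /^#/ { next }                                  # skip header lines
        $6 == taxid && (level == "" || $12 == level) { n++ }
        END { print n + 0 }                            # print 0 if no match
    ' "$1"
}

# Usage:
# count_assemblies assembly_summary_genbank.txt 7227
# count_assemblies assembly_summary_genbank.txt 7227 "Complete Genome"
```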

@genomax thanks! If I understand correctly, in this file each line corresponds to one species. How would I count for any taxonomic level (say, "Arthropoda" = taxon ID 6656)?

taxid annotations in that file seem to be provided at genus/species level.

OK, so there is no direct way to get the number of genomes for any given taxon based on this file alone: the counts are only available at the species level.
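
One possible workaround, not suggested in this thread, is to combine the file with the NCBI taxonomy dump: nodes.dmp lists the parent of every taxid, so each row can be walked up the tree until it either reaches the taxon of interest or the root. The sketch below assumes nodes.dmp has been downloaded separately, that its fields are separated by tab, pipe, tab (so splitting on tabs puts the taxid in field 1 and the parent in field 3), and that the assembly summary has taxid in column 6 and species_taxid in column 7. The name count_in_clade is hypothetical.

```shell
# count_in_clade NODES_DMP ASSEMBLY_SUMMARY CLADE_TAXID
# Prints "<assemblies> <distinct species>" for rows whose taxid lies at or
# below CLADE_TAXID in the taxonomy tree.
count_in_clade() {
    awk -F '\t' -v clade="$3" '
        # Walk from taxid t towards the root; return 1 if clade is reached.
        function in_clade(t,    steps) {
            while (t != "" && steps++ < 100) {   # cap guards against bad input
                if (t == clade) return 1
                if (!(t in parent) || parent[t] == t) return 0
                t = parent[t]
            }
            return 0
        }
        NR == FNR { parent[$1] = $3; next }      # first file: nodes.dmp
        /^#/ { next }                            # second file: skip headers
        in_clade($6) { assemblies++; species[$7] = 1 }
        END {
            n = 0
            for (s in species) n++
            print assemblies + 0, n
        }
    ' "$1" "$2"
}

# Usage (6656 = Arthropoda):
# count_in_clade nodes.dmp assembly_summary_genbank.txt 6656
```

The second number counts distinct species_taxid values, which is probably closer to "N organisms" than the raw number of assemblies, since one species can have several assemblies.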

Using the NCBI Unix utilities (Entrez Direct), the information still seems to be at the same level, in case you want to confirm it another way:

esearch -db genome -query genome | esummary | xtract -pattern DocumentSummary -element Organism_Name TaxId > file