Tool: SubDBer: Flexible taxonomy-based subsampling of sequence databases
5.7 years ago
qiyunzhu • 130

Dear community,

I present SubDBer, a small program that builds customized sequence databases by subsampling large, standard databases, so that the result includes only the data of interest at a chosen level of comprehensiveness.

The sampling criteria are flexible. For example, starting with NCBI nr, one may want to include only Proteobacteria sequences and keep just one organism per genus, while retaining all _Escherichia_ spp. The program will do the job.

This program may be useful in various situations. It not only saves considerable computation time by discarding large numbers of unwanted sequences, but also concentrates the search on the narrow taxonomic range that actually interests the user (evenly truncated databases cannot do this). Moreover, because user-defined databases are easy to create, one can keep a toolkit up to date and standardized.


python -in nt -out sub_nt -outfmt blast -within 2 -exclude 1117 -rank genus -size 1 -keep 816,838,1263

This command takes the NCBI nt database as input -> starts with all organisms from Bacteria (TaxID: 2) except for Cyanobacteria (1117) -> picks one representative organism per genus -> except for _Bacteroides_ (816), _Prevotella_ (838) and _Ruminococcus_ (1263), for which all organisms are included -> creates a new BLAST database, sub_nt, that contains sequence data from the selected organisms.
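To make the selection steps concrete, here is a minimal Python sketch of the same filtering logic on a toy taxonomy. It is not SubDBer's actual code: the function names `lineage` and `subsample`, the toy tree, and the placeholder organism IDs (900001 and up) are all assumptions made for illustration, and the rule for choosing the representative (lowest TaxID first) is a guess, since the post does not say how representatives are picked.

```python
from collections import defaultdict

# Toy taxonomy: taxid -> (parent taxid, rank). Intermediate ranks are omitted.
# IDs 900001+ are made-up placeholder organisms, not real NCBI TaxIDs.
TOY_NODES = {
    1: (1, "no rank"),            # root (its own parent)
    2: (1, "superkingdom"),       # Bacteria
    2157: (1, "superkingdom"),    # Archaea
    1117: (2, "phylum"),          # Cyanobacteria
    561: (2, "genus"),            # Escherichia
    816: (2, "genus"),            # Bacteroides
    900001: (561, "species"),
    900002: (561, "species"),
    900003: (816, "species"),
    900004: (816, "species"),
    900005: (1117, "species"),
    900006: (2157, "species"),
}


def lineage(taxid, nodes):
    """Return the taxids from `taxid` up to the root, inclusive."""
    path = [taxid]
    while nodes[taxid][0] != taxid:
        taxid = nodes[taxid][0]
        path.append(taxid)
    return path


def subsample(organisms, nodes, within, exclude=(), rank="genus", size=1, keep=()):
    """Select organisms under `within`, outside `exclude`, at most `size`
    per group at `rank`, but retain everything under a `keep` taxon."""
    exclude, keep = set(exclude), set(keep)
    per_group = defaultdict(int)
    selected = []
    for org in sorted(organisms):  # assumption: lowest ID is the representative
        path = set(lineage(org, nodes))
        if within not in path or exclude & path:
            continue
        if keep & path:
            selected.append(org)   # retained in full, no per-rank limit
            continue
        group = next((t for t in lineage(org, nodes) if nodes[t][1] == rank), None)
        if group is None:
            selected.append(org)   # assumption: keep organisms lacking that rank
        elif per_group[group] < size:
            per_group[group] += 1
            selected.append(org)
    return selected


picked = subsample(TOY_NODES.keys() - {1, 2, 2157, 1117, 561, 816},
                   TOY_NODES, within=2, exclude=[1117],
                   rank="genus", size=1, keep=[816])
print(picked)
```

In this toy run, one of the two _Escherichia_ placeholders is kept, both _Bacteroides_ placeholders are kept because 816 is in the keep list, and the cyanobacterial and archaeal placeholders are dropped. A real implementation would read the tree from the NCBI taxonomy dump and then extract the matching sequences, which this sketch does not attempt.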

The program is available on GitHub. It has no dependencies.

taxonomy blast database genome subsampling Tool • 1.1k views

Is there a publication with more information about your program?

