This is a method I followed to get an organism's complete lineage from NCBI taxonomy data. The procedure involves:
- Creating a MySQL database.
- Initialising the database with the BioSQL schema.
- Downloading and importing the NCBI taxonomy data into the database.
- Using a python script to query the database and return an organism's complete lineage.
Much of the information in this post comes from the documentation of the BioSQL and BioPython projects. In addition, I wrote a python script to query the database and output results.
Here is how the script works:
python3 lineager.py -n Bos taurus
Output:
2018-07-14 04:47 INFO Processing organism name provided
at the command line: Bos taurus
Organism,Lineage
Bos taurus,cellular organisms;Eukaryota;Opisthokonta;Metazoa;
Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;
Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;
Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;
Eutheria;Boreoeutheria;Laurasiatheria;Cetartiodactyla;
Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus
Instead of a name, an input file containing names of organisms can be provided using the -f option.
Requirements
I'm doing this on BioVM (a virtual machine based on Ubuntu Linux 16.04) but these steps can be performed on any system where the following programs can be installed. Versions I used are in brackets:
- MySQL (5.7.22) for the database.
- Perl (5.22.1) including the DBI module and the DBD module for MySQL for initialising the database with the biosql schema and for the load_taxonomy.pl script.
- Python (3.5.2) and the MySQLdb module (1.3.7) for the lineager.py script.
All these steps are done as a user (here it is biovm). sudo privileges are required for installing packages, creating the database and creating the database user.
Install the MySQL database server:
sudo apt install -y mysql-server
During the installation, there will be a request to set a password for the mysql admin user (-u root in mysql commands). You will need this password to connect and create the database later.
After installation, you can check if the MySQL server is up and running using:
sudo netstat -tlpn | grep mysql
tcp 0 0 127.0.0.1:3306 0.0.0.0:* LISTEN 3113/mysqld
Install the Perl DBI, DBD modules and the Python MySQL client libraries required for interacting with the MySQL database:
sudo apt install -y libdbi-perl libdbd-mysql-perl \
libmysqlclient-dev python3-mysqldb
Create the biosql database
Login as the mysql admin user:
mysql -u root -p
Create the database. Note the use of back ticks ` instead of single quotes ' for the database name:
create database `biosql` collate 'utf8_general_ci';
Create the database user and grant permissions to the biosql database. note: set a strong password under identified by:
create user `biovm`@`localhost` identified by 'your-password-here';
grant all privileges on `biosql`.* to `biovm`@`localhost`;
Quit the mysql shell by typing exit
.
For the database connection to work in the lineager.py script, create a file .my.cnf in the user's HOME directory with the following content:
[client]
user = biovm
password = your-password-here
database = biosql
Download the latest release of the BioSQL schema:
wget -c https://github.com/biosql/biosql/archive/biosql-release-1-0-1.tar.gz
Decompress the archive, change directory:
tar zxvf biosql-release-1-0-1.tar.gz
cd biosql-biosql-release-1-0-1
Initialise the database with the BioSQL schema
Edit file biosqldb-mysql.sql in the sql directory and remove TYPE=INNODB from all CREATE TABLE statements. See footnotes for the reason.
Initialise database by executing the SQL script:
mysql -u biovm -D biosql < sql/biosqldb-mysql.sql
Import NCBI taxonomy data
While still in the biosql-biosql-release-1-0-1 directory, create a directory to store the taxonmy data:
mkdir taxdata
Download the taxonomy database, verify integrity of the downloaded archive and then uncompress the archive:
cd taxdata
wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.md5
md5sum -c taxdump.tar.gz.md5
# this should return
taxdump.tar.gz: OK
# uncompress archive
tar zxvf taxdump.tar.gz
# Move to parent directory
cd ..
Import taxonomy data into the biosql database. For the script to connect to the database, the DBI_PASSWORD environment variable will need to be set. The password is the one set for the database user earlier:
export DBI_PASSWORD='your-password-here'
Now load the data:
perl scripts/load_ncbi_taxonomy.pl --dbname biosql --dbuser biovm
This will take some time to complete:
Loading NCBI taxon database in taxdata:
... retrieving all taxon nodes in the database
... reading in taxon nodes from nodes.dmp
... insert / update / delete taxon nodes
... updating new parent IDs
... (committing nodes)
... rebuilding nested set left/right values
... reading in taxon names from names.dmp
... deleting old taxon names
... inserting new taxon names
... cleaning up
Done.
To view all the options the script supports, install the perl-doc program and then use the --help option:
perl scripts/load_ncbi_taxonomy.pl --help
Run lineager.py to get lineage
Clone the git repository:
cd
git clone --depth 1 https://gitlab.com/vimalkvn/lineager
Run script:
cd lineager
python3 lineager.py -n Escherichia coli
If you have a file input.txt containing names like this:
Canis lupus familiaris
Bos taurus
Escherichia
AMBIGUOUS
Arabidopsis thaliana
then run the script like this:
python3 lineager.py -f input.txt
At the end of the run, an output file lineage.csv will be generated in the same directory.
Footnotes
Without this, the following error will occur when attempting to initialise the database:
ERROR 1064 (42000) at line 49: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'TYPE=INNODB' at line 8.
This happens because the TYPE=INNODB clause was changed to ENGINE=INNODB since MySQL version 5.5. As INNODB is the default, it is not necessary to specify it in the CREATE TABLE statements.
This looks great! I was looking for such functionality. I am surprised this cannot be done by using a function from BioSQL.py. Is it possible to modify the code to be compatible with
import mysql.connector
?