Question

Downloading All The Incomplete Bacterial Genomes

10.5 years ago

Eric Normandeau 11k

Following the post "Download all the bacterial genomes from NCBI", I was able to download all the completed bacterial genomes easily from here: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/

However, there are a lot of bacteria for which only genome drafts of varying qualities exist.

The 'draft' portion of the NCBI bacterial genomes (ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/) also lists some, but is this list complete? Plus, there is no compiled file (e.g. all_draft_bacterial_genomes.fna) like the one in ftp://ftp.ncbi.nih.gov/genomes/Bacteria/. There seem to be 6970 drafts in there.

My question is: where could I download all the sequences (contigs / scaffolds) from all those incomplete genomes?

I would exclude species where only a small proportion of the genome, say less than 5 or 10%, is available.

For now, it looks like I will have to retrieve all of the files ending in scaffold.fna.tgz from the 6970 draft folders with wget. This is satisfactory for ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/, but are there other sources I should consider?

bacteria • 4.4k views

Updated 8.1 years ago by rattus8 ▴ 40 • written 10.5 years ago by Eric Normandeau 11k

score 1 · Answer 1 · 2016-03-18

I had to write to NCBI about this.

Here is my recipe, adapted from Case 1 in this document: ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf

-- at Bash/Mac OSX prompt in the desired directory:

curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt' | \
awk '{FS="\t"} \!/^#/ {print $20}' | \
sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)(GCA_.+)|\1\2/\2_genomic.fna.gz|' > genomic_file

-- final command, in the same directory, where you want to install the files:

wget -i genomic_file
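As a rough sketch, the same recipe can be written without the `\!` escape and without hard-coding the directory layout in the sed pattern. The assumptions are the ones the recipe above already makes: column 20 of assembly_summary.txt is the assembly's FTP directory, and that directory holds a file named after it with the suffix `_genomic.fna.gz`. The function name `fna_urls` is made up for illustration.

```shell
# Hypothetical helper: read an assembly_summary.txt table on stdin and print
# one *_genomic.fna.gz URL per assembly.
# Assumption: column 20 holds the FTP directory of the assembly.
fna_urls() {
    awk -F '\t' '
        /^#/ { next }                       # skip header and comment lines
        $20 == "" || $20 == "na" { next }   # skip rows without a directory
        {
            n = split($20, parts, "/")      # last path component is the directory name
            print $20 "/" parts[n] "_genomic.fna.gz"
        }'
}

# Usage (needs network access, so it is left commented out):
# curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt' | fna_urls > genomic_file
# wget -i genomic_file
```

Setting the field separator with `-F '\t'` applies it from the first line on, and taking the file name from the last path component means the same function should work whether the accession starts with GCA or GCF.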

GenBank is where all current complete and incomplete sequences have been stored and updated since Dec 12, 2015. Note that if you want a different taxonomic branch, you have to look at the NCBI ftp site (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank) and replace "bacteria" in the ftp address above with the folder you'd like. To get only RefSeq genomes (complete and canonically curated sequences), replace "genbank" with "refseq" in the curl command and "GCA" with "GCF" in the sed command. This is also described in the factsheet, which covers the switch in the other direction: "NOTE: if you need the assembly submitted to GenBank, you will need to change the curl command's "refseq" to "genbank". Since these assembly's accession initial are different, you will need change sed command's "GCF" to "GCA"."
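The substitutions described above can be sketched as two small helpers. The function names `assembly_summary_url` and `accession_prefix` are hypothetical; the URL pattern and the GCA/GCF prefixes are the ones given in this answer and in the factsheet note.

```shell
# Hypothetical helper: print the assembly_summary.txt URL for a database
# ("genbank" or "refseq") and a taxonomic folder (e.g. "bacteria", "archaea").
assembly_summary_url() {
    local db="${1:-genbank}"
    local group="${2:-bacteria}"
    printf 'ftp://ftp.ncbi.nlm.nih.gov/genomes/%s/%s/assembly_summary.txt\n' "$db" "$group"
}

# Hypothetical helper: print the accession prefix to use in the sed pattern.
accession_prefix() {
    case "$1" in
        genbank) echo GCA ;;
        refseq)  echo GCF ;;
        *) echo "unknown database: $1" >&2; return 1 ;;
    esac
}

# Usage:
# curl "$(assembly_summary_url refseq archaea)"
```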

If copying and pasting the lines above, or from the factsheet, gives you an error, paste them into a text editor with syntax highlighting (Emacs, TextEdit, etc.) and re-type any weirdly colored characters, plus all quotes on principle (curly quotes copied from a PDF or a web page are not treated as quote characters by the shell).

Some shells (csh and tcsh) require an escape character ("\") before the "!" used within the awk command, even inside single quotes; Bash does not. If you still have errors at a Bash prompt, as a second round of treatment, try removing the escape \ in front of the !.

Finally, the downloaded folder is not yet a database for standalone BLAST. (I used blastn, part of the BLAST+ suite of command-line tools provided by NCBI.) You will have to use makeblastdb (included if you downloaded the suite to get blastn) to reformat and index the files for use by blastn. Furthermore (I had to write to NCBI about this too), a complete taxonomic download contains too many files for makeblastdb to handle individually. However, if you cat them into one file, it's fine!

cat *.fna > all_bacteria_fna_files.fna

makeblastdb -in all_bacteria_fna_files.fna -parse_seqids -dbtype nucl -title bacteria -out bacteria
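One caveat: the files fetched by the recipe end in `.fna.gz`, so they have to be decompressed before or while they are concatenated, and with tens of thousands of files a `*` glob may exceed the shell's command-line length limit. A minimal sketch that handles both is below. The function name `concat_fna` is made up, and the order of the sequences in the output is whatever order find returns.

```shell
# Hypothetical helper: decompress every *_genomic.fna.gz under a directory
# and write the concatenated FASTA to a single output file.
concat_fna() {
    local dir="$1"
    local out="$2"
    # find hands the files to gzip in batches, so the number of files is
    # not limited by the maximum command-line length.
    find "$dir" -type f -name '*_genomic.fna.gz' -exec gzip -dc {} + > "$out"
}

# Usage:
# concat_fna . all_bacteria_fna_files.fna
# makeblastdb -in all_bacteria_fna_files.fna -parse_seqids -dbtype nucl -title bacteria -out bacteria
```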

Then, you have to make sure blastn can find the new database: the BLASTDB environment variable must include the folder containing it.

export BLASTDB="$BLASTDB:$HOME/genomes/bacteria/genbank_2_3_2016"

Then you can run blastn on your new bacterial database. (Or, as you can see, this should work with any complete taxonomic download.) Good luck!!

Kim

Here are some of my extremely messy notes on the process. Feel free to ignore.

bacteria
- Use the awk/sed/curl recipe from NCBI (ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf) to get files by parsing the local genome/...assembly_summary.txt file for directories for species of interest.
- Get the subdirectory "bacteria" from genbank. Content of this directory, from the NCBI ftp genomes/genbank README (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/README.txt): "2) genbank: content includes primary submissions of assembled genome sequence and associated annotation data, if any, as exchanged among members of the International Nucleotide Sequence Database Collaboration, of which NCBI's GenBank database is a member. The GenBank directory area includes genome sequence data for a larger number of organisms than the RefSeq directory area; however, some assemblies are unannotated. The sub-directory structure includes: a. archaea b. bacteria c. fungi d. invertebrate e. other - this directory includes synthetic genomes f. plant g. protozoa h. vertebrate_mammalian i. vertebrate_other"
- http://www.linuxtopia.org/online_books/linux_tool_guides/the_sed_faq/sedfaq5_010.html - 5.11. My script aborts with an error message, "event not found". This error is generated by the csh or tcsh shells, not by sed. The exclamation mark (!) is special to csh/tcsh, and if you use it in command-line or shell scripts--even within single quotes--it must be preceded by a backslash. Thus, under the csh/tcsh shell:
  sed '/regex/!d' # will fail
  sed '/regex/\!d' # will succeed
  The exclamation mark should not be prefixed with a backslash when the script is called from a file, as "-f script.file".
- Put into emacs and re-typed anything that it colored as being a… strange character (some underscores were replacing spaces), as well as all single and double quotes.
- Final command: wget -i genomic_file
- FINISHED --2016-02-10 01:56:15--
- Downloaded: 58953 files, 62G in 18h 8m 34s (1002 KB/s)

score 0 · Answer 2 · 2013-10-09

10.5 years ago

irinagaranina24 ▴ 10

A very useful site for working with bacterial genes and genomes is MicrobesOnline: http://meta.microbesonline.org/programmers.html#Locus I gave you a link to the SQL server, where you can download scaffolds from the tables Scaffold, ScaffoldSeq, etc.

After scanning the site, it appears to contain information about only a few bacteria and a handful of metagenome data sets. Am I missing something?

Reply • 10.5 years ago by Eric Normandeau 11k

Eric, try for example this query to get strain names and scaffold IDs:

mysql -h pub.microbesonline.org -u guest -pguest genomics -B -e 'source scaf.sql' > scaf.out

where "scaf.sql" contains:

SELECT Taxonomy.name, Scaffold.scaffoldId FROM ScaffoldSeq INNER JOIN Scaffold ON Scaffold.scaffoldId=ScaffoldSeq.scaffoldId INNER JOIN Taxonomy ON Taxonomy.taxonomyId=Scaffold.taxonomyId;

To get the scaffold sequence, add ScaffoldSeq.sequence to the first line (the SELECT list). Also try exploring this page: http://meta.microbesonline.org/programmers.html#Taxonomy

Reply • 10.5 years ago by irinagaranina24 ▴ 10

All I get in scaf.out is the mysql help, so it looks like there is a mistake somewhere. At this point, I am not sure that this resource will help me.

Reply • 10.5 years ago by Eric Normandeau 11k