Download Assembly Files from NCBI Genomes Site in Batch
6.6 years ago
taraeicher ▴ 50

I'd like to download the assembly files for bacteria, archaea, virus, fungi, and protozoa from the NCBI website. Since there are so many files, it isn't practical for me to download each one manually. Using wget, I'm able to download at the directory level. For instance, using wget -r -l 20 --no-parent --reject "index.html*" "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/" gives me everything in the archaea directory for each species. The problem is that it skips the assembly directory, which is the part I really need. For instance, I get everything in ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/Acidianus_hospitalis/ except for latest_assembly_versions/GCF_000213215.1_ASM21321v1, which is the assembly directory. Does anybody know how I can download this data in batch?

Assembly NCBI genome • 4.3k views
6.6 years ago

This requires a series of convoluted (as well as ridiculous) steps, as described in:

https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#allcomplete

Approximately this for bacteria:

# Get the summary as a tabular text file.
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt

# Filter for complete genomes.
awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt > ftpdirpaths

# Build the URLs of the FASTA files (genomic.fna.gz); change filesuffix to fetch other file types.
awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths > ftpfilepaths

# Download everything in parallel
mkdir -p all
cat ftpfilepaths | parallel -j 20 --verbose --progress "cd all && curl -O {}"
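
Since the question covers several groups, the bacteria-only steps above could be wrapped in functions and repeated per group. This is a minimal sketch that has not been run against the live server. The function names `summary_to_urls` and `download_group`, the `JOBS` variable, and the group directory names (in particular `viral` rather than `virus`) are assumptions, so check the listing of genomes/refseq/ before relying on them.

```shell
#!/usr/bin/env bash
# Sketch: generalize the bacteria-only steps to any RefSeq group directory.
# ASSUMPTION: group names such as "archaea" or "viral" exist under BASE.

BASE="ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq"
JOBS="${JOBS:-4}"   # hypothetical knob: number of parallel downloads

# Print one genomic FASTA URL per complete, latest assembly in a summary file.
# Columns used (same as above): $11 = version status, $12 = assembly level,
# $20 = FTP directory of the assembly.
summary_to_urls() {
    awk -F '\t' '$12 == "Complete Genome" && $11 == "latest" && $20 != "na" {
        # The last path component is the assembly name; taking it via split()
        # avoids depending on how many "/" precede it in the URL.
        n = split($20, parts, "/")
        print $20 "/" parts[n] "_genomic.fna.gz"
    }' "$1"
}

# Fetch the summary for one group, build the URL list, download in parallel.
download_group() {
    local group="$1"
    mkdir -p "$group"
    curl -s -o "$group/assembly_summary.txt" "$BASE/$group/assembly_summary.txt"
    summary_to_urls "$group/assembly_summary.txt" > "$group/ftpfilepaths"
    (cd "$group" && parallel -j "$JOBS" --verbose "curl -O {}" < ftpfilepaths)
}
```

Nothing is downloaded until a function is called, for example: for g in archaea bacteria fungi protozoa viral; do download_group "$g"; done. Each group ends up in its own directory next to its assembly_summary.txt and ftpfilepaths.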

Hi, I just wanted to say thanks for the solution. This worked well.
