Biostar Beta. Not for public use.
Download RefSeq FASTA files for thousands of assemblies from NCBI
2.1 years ago
Shelle • 0

I want to download many bacterial FASTA files with the .fna.gz extension from NCBI. I have tried the commands below, but neither works as it should: I get the directories but not the FASTA files. Can anyone tell me what I should change to get the RefSeq FASTA files?

wget -b -r --no-parent -A 'GCF_*_genomic.fna.gz'  ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/

wget -b -r --no-parent accept-regex=*/latest_assembly_versions/*/*_genomic.fna.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria

Modify @5heikki's solution in "How to download all the complete genomes for mycobacteria from NCBI?". It covers only mycobacterial genomes, but you can remove that restriction.

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
cat assembly_summary_refseq.txt \
| awk 'BEGIN{FS="\t"}{print $20}' \
| awk 'BEGIN{OFS=FS="/"}{print $0,$NF"_genomic.fna.gz"}' \
> urls.txt


Then limit the resulting URLs to the assemblies listed in the bacterial directory.
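
A minimal sketch of that limiting step, assuming the bacteria-only summary table at ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt is used in place of the full RefSeq one, and that column 11 holds the version status and column 20 the FTP directory (check the header of your copy). The helper name build_urls and the one-row sample table are made up for illustration; the real download is left as comments.

```shell
#!/bin/sh
# build_urls (hypothetical helper): read an NCBI assembly summary on stdin and
# print one *_genomic.fna.gz URL per assembly whose version status is "latest".
build_urls() {
    awk 'BEGIN { FS = "\t" }
         /^#/ { next }                          # skip the header lines
         $11 == "latest" && $20 != "" && $20 != "na" {
             n = split($20, part, "/")          # last path element = assembly name
             print $20 "/" part[n] "_genomic.fna.gz"
         }'
}

# Tiny stand-in for the real table: one header line and one data row (illustrative).
sample=$(mktemp)
{
    printf '# assembly_accession\t...\n'
    awk 'BEGIN { OFS = "\t"
                 $20 = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/973/605/GCF_002973605.1_ASM297360v1"
                 $1 = "GCF_002973605.1"; $11 = "latest"; print }'
} > "$sample"

build_urls < "$sample"

# Real use (needs network access), not run here:
#   wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
#   build_urls < assembly_summary.txt > urls.txt
#   wget -i urls.txt
```

Skipping the "#" header lines avoids the bogus first URLs that the plain two-awk pipeline would otherwise write into urls.txt.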

Thanks for your answer, but what I want are the FASTA files from this site: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/. The link below is one of the files I am interested in:

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Abditibacterium_utsteinense/latest_assembly_versions/GCF_002973605.1_ASM297360v1/GCF_002973605.1_ASM297360v1_genomic.fna.gz

I used the command from your post, but it doesn't give me the right files!

That is not the final command. As I said, you will need to limit the URLs it produces if you need only bacterial data.

You may be better off using the program @joe included below. There you can just say ncbi-genome-download bacteria to get all bacterial genomes.

Thanks. I am using this. It gives me a bunch of different directories. When I go into each directory, there is a file with content like the following:

d3d4a4c01a15dee5a054b38a3178bf12  ./GCF_000007725.1_ASM772v1_assembly_report.txt
c132f1a3ba2b00383f2a1d92e4460e2b  ./GCF_000007725.1_ASM772v1_assembly_stats.txt
7e65c3da25f5a35d8a7860d6c478bf67  ./GCF_000007725.1_ASM772v1_feature_count.txt.gz
2d82d4315ca7a2004a3b03bc55aa42af  ./GCF_000007725.1_ASM772v1_feature_table.txt.gz
576cc643ef00d289009c95518f3792f5  ./GCF_000007725.1_ASM772v1_genomic.fna.gz
5a491b9ae2550dd9b6379e4f9054c4a2  ./GCF_000007725.1_ASM772v1_genomic.gbff.gz
25b139d63e6cd46484ac27daa8532b79  ./GCF_000007725.1_ASM772v1_genomic.gff.gz
10e61215025d12b872b28847e4a389fa  ./GCF_000007725.1_ASM772v1_protein.faa.gz
88b5db0e6f27fd5455e76d2b9180a67b  ./GCF_000007725.1_ASM772v1_protein.gpff.gz
02c785fd2336a0cc3fd20687d3053460  ./GCF_000007725.1_ASM772v1_translated_cds.faa.gz
313d29e74f85d37ae6d701f606f1acac  ./annotation_hashes.txt


I am only interested in "./GCF_000007725.1_ASM772v1_genomic.fna.gz". Does anyone know how I can extract this file and work with it separately?

You could use something like find . -name "*.fna.gz" and move those files to a new location and then delete the rest of the files if you don't want to keep them.
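
A rough sketch of that suggestion, run on a throwaway directory tree so it can be tried safely. The helper name collect_fasta and all paths are hypothetical, and it copies rather than moves, so nothing is lost if the pattern turns out to be wrong; swap cp for mv once the result looks right.

```shell
#!/bin/sh
# collect_fasta (hypothetical helper): copy every *.fna.gz found under SRC into DEST.
# DEST should live outside SRC so find does not pick up the copies.
collect_fasta() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    find "$src" -type f -name '*.fna.gz' -exec cp {} "$dest"/ \;
}

# Demo on a temporary tree that mimics one downloaded assembly directory.
work=$(mktemp -d)
mkdir -p "$work/refseq/bacteria/GCF_000007725.1"
: > "$work/refseq/bacteria/GCF_000007725.1/GCF_000007725.1_ASM772v1_genomic.fna.gz"
: > "$work/refseq/bacteria/GCF_000007725.1/GCF_000007725.1_ASM772v1_genomic.gbff.gz"
: > "$work/refseq/bacteria/GCF_000007725.1/MD5SUMS"

collect_fasta "$work/refseq" "$work/fasta_only"
ls "$work/fasta_only"
```

Quoting the pattern ('*.fna.gz') matters: unquoted, the shell may expand the asterisk before find ever sees it.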

The command you mentioned doesn't work, unfortunately. I used the command grep -w "GCF_000007365.1_ASM736v1_genomic.fna.gz" MD5SUMS.txt >> newfile/new.txt to separate out the FASTA file, but the resulting new.txt contains something like this:

576cc643ef00d289009c95518f3792f5  ./GCF_000007725.1_ASM772v1_genomic.fna.gz


which isn't useful either, as I want to work with the FASTA file itself later on, for example to decompress it.

For genomax's command to work you need to be working in a directory at the top of the tree. find is absolutely one of the best ways to do what you want, so you'll need to give us more info about what didn't work.

It worked, thanks for the comment about being at the top of the tree! But how can I decompress the files when they are listed like this:

./bacteria/bacteriagz/GCF_000008885.1_ASM888v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000009305.1_ASM930v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000019705.1_ASM1970v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000024505.1_ASM2450v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000147695.2_ASM14769v3_genomic.fna.gz
./bacteria/bacteriagz/GCF_000156275.1_ASM15627v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000285255.1_ASM28525v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000287295.1_ASM28729v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000831225.1_ASM83122v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_001700895.1_ASM170089v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_001705605.1_ASM170560v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_002083165.2_ASM208316v2_genomic.fna.gz
./bacteria/bacteriagz/GCF_002257505.1_ASM225750v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_002849875.1_ASM284987v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_002855775.1_ASM285577v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_003019755.1_ASM301975v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_003019785.1_ASM301978v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_003034925.1_ASM303492v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_003043915.1_ASM304391v1_genomic.fna.gz

Are you familiar with genomax's use of the . syntax with find? It is shorthand for "the current directory".

More fully that command would look like:

find /path/to/search/from -name "some_string"
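
Building on that find syntax, here is one possible way to decompress everything it lists. This is only a sketch: the helper name decompress_fasta and the demo file are invented, and it uses gzip -dc with a redirect so the compressed originals are kept.

```shell
#!/bin/sh
# decompress_fasta (hypothetical helper): write an uncompressed .fna next to each
# .fna.gz under DIR, leaving the compressed original untouched.
decompress_fasta() {
    find "$1" -type f -name '*.fna.gz' -exec sh -c '
        for gz in "$@"; do
            gzip -dc "$gz" > "${gz%.gz}"
        done
    ' sh {} +
}

# Demo on a temporary tree with one small gzipped FASTA file (made-up content).
work=$(mktemp -d)
mkdir -p "$work/bacteria/bacteriagz"
printf '>seq1\nACGT\n' | gzip -c > "$work/bacteria/bacteriagz/example_genomic.fna.gz"

decompress_fasta "$work"
cat "$work/bacteria/bacteriagz/example_genomic.fna"
```

If keeping the .gz files is not needed, find /path/to/search/from -name "*.fna.gz" -exec gunzip {} + would replace each archive with its decompressed file instead.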

Hi jrj.healey, I noticed that the command 'find . -name ".fna.gz"' is not working at all. Even if I change '.' to the path where all the downloaded bacteria directories are, the command returns no results. Even when I am at the top of the directory tree, it gives me nothing! When I commented earlier that it worked, that was my mistake: I had some other bacterial FASTA files in a different directory, and those showed up when I ran 'find . -name ".fna.gz"'. I deleted them to confirm whether the command works. It turns out that with the bacteria directories downloaded by this NCBI command, find does not work in any of the forms I have tried. Any idea how to solve this?
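
One thing worth checking, sketched below on a throwaway directory: find's -name test matches the whole file name, so the pattern ".fna.gz" without a leading * only matches a file literally called .fna.gz. If the asterisk was simply lost when the command was posted, this does not apply. The directory and file names here are made up.

```shell
#!/bin/sh
# Temporary tree with a single dummy file ending in .fna.gz (names are illustrative).
work=$(mktemp -d)
mkdir -p "$work/GCF_000000000.1"
: > "$work/GCF_000000000.1/example_genomic.fna.gz"

# Without a wildcard the pattern must equal the entire file name.
without_star=$(find "$work" -type f -name ".fna.gz" | wc -l | tr -d ' ')
# With a leading * any file name ending in .fna.gz matches.
with_star=$(find "$work" -type f -name "*.fna.gz" | wc -l | tr -d ' ')

echo "matches without *: $without_star, with *: $with_star"
```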

8 months ago
Joe 12k
United Kingdom

You should be able to use ncbi-genome-download for this I think.
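
A hedged sketch of such a call. As far as I know, ncbi-genome-download fetches GenBank-format files (*.gbff.gz) unless another format is requested, which would explain why the MD5SUMS listing quoted earlier in the thread mentions a *_genomic.fna.gz that find cannot locate on disk. The wrapper name fetch_bacterial_fasta, the output folder and the RUN_DOWNLOAD switch are made up; the wrapper only prints the command unless RUN_DOWNLOAD=1 is set, because the real download is very large.

```shell
#!/bin/sh
# fetch_bacterial_fasta (hypothetical wrapper): print the ncbi-genome-download
# call that requests FASTA files, and run it only when RUN_DOWNLOAD=1.
fetch_bacterial_fasta() {
    outdir=${1:-refseq_fasta}
    set -- ncbi-genome-download --section refseq --formats fasta \
           --output-folder "$outdir" bacteria
    echo "$*"
    if [ "${RUN_DOWNLOAD:-0}" = 1 ]; then
        "$@"        # needs ncbi-genome-download installed and network access
    fi
}

fetch_bacterial_fasta my_genomes
```

Adding --dry-run to the printed command should list what would be downloaded without fetching anything, which is a cheap way to check the selection first.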
