Question: download RefSeq FASTA files for thousands of assemblies from NCBI
14 months ago
Shelle • 0

I want to download many bacterial FASTA files with the .fna.gz extension from NCBI. I have tried the commands below, but neither works as it should: I get the directories but not the FASTA files. Can anyone tell me what I should change to get the RefSeq FASTA files?

wget -b -r --no-parent -A 'GCF_*_genomic.fna.gz'  ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/ 

wget -b -r --no-parent accept-regex=*/latest_assembly_versions/*/*_genomic.fna.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria

Modify @5heikki's solution in "how to download all the complete genomes for mycobacteria from NCBI?". It refers only to mycobacterial genomes, but you can remove that restriction.

$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
$ awk 'BEGIN{FS="\t"} !/^#/ && $20 != "na" {print $20}' assembly_summary_refseq.txt \
    | awk 'BEGIN{OFS=FS="/"}{print $0,$NF"_genomic.fna.gz"}' \
    > urls.txt

Then limit the URLs to the assemblies you have listed from the bacterial directory.
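One way to apply that restriction is to start from the summary table that NCBI keeps inside the bacterial directory itself, rather than filtering the full RefSeq table. The sketch below assumes that ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt uses the same column layout as assembly_summary_refseq.txt (version_status in column 11, ftp_path in column 20). The function name summary_to_urls and the output directory fna are only illustrative.

```shell
# Print one genomic FASTA URL per "latest" assembly in an NCBI summary table.
# Assumption: column 11 is version_status and column 20 is ftp_path.
summary_to_urls() {
    awk -F '\t' '
        /^#/ { next }                      # skip header and comment lines
        $11 == "latest" && $20 != "na" {
            n = split($20, parts, "/")     # last path element names the assembly directory
            print $20 "/" parts[n] "_genomic.fna.gz"
        }' "$1"
}

# Example (needs network access):
#   wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
#   summary_to_urls assembly_summary.txt > urls.txt
#   wget --continue --input-file=urls.txt --directory-prefix=fna
```

Because the URL list is a plain text file, it can be inspected or trimmed with grep before anything is downloaded.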

genomax • 68k • 14 months ago

Thanks for your answer, but what I want is the FASTA files from this site: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/. The link below is one of the files I am interested in:

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Abditibacterium_utsteinense/latest_assembly_versions/GCF_002973605.1_ASM297360v1/GCF_002973605.1_ASM297360v1_genomic.fna.gz

I used the command you mentioned in your post, but it doesn't give me the right files!

Shelle • 0 • 14 months ago

That is not the final command. As I said, you will need to limit the URLs produced if you need just bacterial data.

You may be better off using the program @joe included below. There you can just say ncbi-genome-download bacteria to get all bacterial genomes.

genomax • 68k • 14 months ago

Thanks. I am using this. It gives me a bunch of different directories. When I go into each directory, there is a file with content like the below:

d3d4a4c01a15dee5a054b38a3178bf12  ./GCF_000007725.1_ASM772v1_assembly_report.txt
c132f1a3ba2b00383f2a1d92e4460e2b  ./GCF_000007725.1_ASM772v1_assembly_stats.txt
7a2f6dc85caefaf326362077f72bb1ad  ./GCF_000007725.1_ASM772v1_cds_from_genomic.fna.gz
7e65c3da25f5a35d8a7860d6c478bf67  ./GCF_000007725.1_ASM772v1_feature_count.txt.gz
2d82d4315ca7a2004a3b03bc55aa42af  ./GCF_000007725.1_ASM772v1_feature_table.txt.gz
576cc643ef00d289009c95518f3792f5  ./GCF_000007725.1_ASM772v1_genomic.fna.gz
5a491b9ae2550dd9b6379e4f9054c4a2  ./GCF_000007725.1_ASM772v1_genomic.gbff.gz
25b139d63e6cd46484ac27daa8532b79  ./GCF_000007725.1_ASM772v1_genomic.gff.gz
10e61215025d12b872b28847e4a389fa  ./GCF_000007725.1_ASM772v1_protein.faa.gz
88b5db0e6f27fd5455e76d2b9180a67b  ./GCF_000007725.1_ASM772v1_protein.gpff.gz
64ae08d0ceff2696234aded52fdf8955  ./GCF_000007725.1_ASM772v1_rna_from_genomic.fna.gz
02c785fd2336a0cc3fd20687d3053460  ./GCF_000007725.1_ASM772v1_translated_cds.faa.gz
313d29e74f85d37ae6d701f606f1acac  ./annotation_hashes.txt

I am only interested in "./GCF_000007725.1_ASM772v1_genomic.fna.gz". Does anyone know how I can extract this and work with it separately?

Shelle • 0 • 14 months ago

You could use something like find . -name "*.fna.gz" and move those files to a new location and then delete the rest of the files if you don't want to keep them.
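Spelled out, that suggestion might look like the sketch below. The function name collect_fasta and the directory names are hypothetical. Note that a pattern such as *_genomic.fna.gz also matches the cds_from_genomic and rna_from_genomic files that sit in the same directories, so the sketch excludes those explicitly. It copies rather than moves, so nothing is lost if the pattern turns out to be wrong.

```shell
# Copy every whole-genome FASTA archive under a download tree into one flat directory.
collect_fasta() {
    src=$1    # top of the downloaded tree
    dest=$2   # flat output directory; keep it outside "$src"
    mkdir -p "$dest"
    find "$src" -type f -name '*_genomic.fna.gz' \
        ! -name '*_from_genomic.fna.gz' \
        -exec cp {} "$dest"/ \;
}

# Example: collect_fasta refseq/bacteria fasta_only
```

Once the copies have been checked, the original tree can be deleted if the other files are not needed.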

genomax • 68k • 14 months ago

Unfortunately, the command you mentioned doesn't work. I used the command "grep -w "GCF_000007365.1_ASM736v1_genomic.fna.gz" MD5SUMS.txt >> newfile/new.txt" to separate out the FASTA file, but the resulting new.txt contains only a line like the one below:

576cc643ef00d289009c95518f3792f5  ./GCF_000007725.1_ASM772v1_genomic.fna.gz

which isn't useful either, because I want to work with the FASTA file itself later on, for example to decompress it.
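The checksum file only records a hash and a relative path for each file, so grepping it yields that one line rather than the sequence data. The FASTA file itself sits next to it in the same directory. What the line can be used for is verifying the download. A minimal sketch, assuming GNU md5sum is available and that the checksum file has the two-column layout shown earlier in this thread (verify_genomic_fasta and its arguments are hypothetical names):

```shell
# Check the whole-genome FASTA archive in one directory against its checksum file.
verify_genomic_fasta() {
    dir=$1     # directory holding the downloaded files
    sums=$2    # name of the checksum file inside that directory
    (
        cd "$dir" || exit 1
        # keep only the whole-genome FASTA entry, then let md5sum compare hashes
        grep '_genomic\.fna\.gz$' "$sums" \
            | grep -v '_from_genomic' \
            | md5sum -c -
    )
}

# Example: verify_genomic_fasta path/to/one/assembly/directory MD5SUMS.txt
```

The function returns a non-zero status when the hash does not match, which makes it easy to call in a loop over many directories.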

Shelle • 0 • 14 months ago

For genomax's command to work you need to be working in a directory at the top of the tree. find is absolutely one of the best ways to do what you want, so you'll need to give us more info about what didn't work.

jrj.healey • 12k • 14 months ago

It worked, thanks for your comment about being at the top of the tree! But how can I decompress the files when they are listed like below:

./bacteria/bacteriagz/GCF_000008885.1_ASM888v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000009305.1_ASM930v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000019705.1_ASM1970v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000024505.1_ASM2450v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000147695.2_ASM14769v3_genomic.fna.gz
./bacteria/bacteriagz/GCF_000156275.1_ASM15627v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000285255.1_ASM28525v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000287295.1_ASM28729v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_000831225.1_ASM83122v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_001700895.1_ASM170089v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_001705605.1_ASM170560v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_002083165.2_ASM208316v2_genomic.fna.gz
./bacteria/bacteriagz/GCF_002257505.1_ASM225750v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_002849875.1_ASM284987v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_002855775.1_ASM285577v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_003019755.1_ASM301975v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_003019785.1_ASM301978v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_003034925.1_ASM303492v1_genomic.fna.gz
./bacteria/bacteriagz/GCF_003043915.1_ASM304391v1_genomic.fna.gz
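Files collected like this can be decompressed in one pass. A small sketch, assuming gunzip is installed; decompress_fasta is a hypothetical helper name. gunzip replaces each .gz file with its decompressed version, so work on a copy if the compressed originals should be kept.

```shell
# Decompress every *.fna.gz file found under the given directory.
decompress_fasta() {
    # "$1" is the directory containing the archives, e.g. ./bacteria/bacteriagz
    find "$1" -type f -name '*.fna.gz' -exec gunzip {} +
}

# Example: decompress_fasta ./bacteria/bacteriagz
# To read a file without unpacking it on disk:
#   gzip -dc ./bacteria/bacteriagz/GCF_000008885.1_ASM888v1_genomic.fna.gz | head
```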
Shelle • 0 • 14 months ago • updated 14 months ago by jrj.healey • 12k

Are you familiar with genomax's use of the . syntax with find? It's shorthand for the current directory.

More fully that command would look like:

find /path/to/search/from -name "some_string"
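One detail worth checking when find prints nothing: -name compares the pattern against the whole file name, so it needs a wildcard to match by extension. The sketch below builds a throwaway directory (every path and accession in it is made up for the demonstration) and runs both forms.

```shell
# Build a throwaway tree containing one fake FASTA archive.
demo=$(mktemp -d)
mkdir -p "$demo/GCF_000000000.1_example"
: > "$demo/GCF_000000000.1_example/GCF_000000000.1_example_genomic.fna.gz"

# Without a wildcard the pattern only matches a file literally named ".fna.gz"
no_wildcard=$(find "$demo" -name ".fna.gz")

# With a wildcard it matches any file name ending in .fna.gz
with_wildcard=$(find "$demo" -name "*.fna.gz")

printf 'without wildcard: %s\n' "${no_wildcard:-<nothing>}"
printf 'with wildcard:    %s\n' "$with_wildcard"
```

Quoting the pattern matters as well: an unquoted *.fna.gz may be expanded by the shell before find ever sees it.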
jrj.healey • 12k • 14 months ago

Hi jrj.healey, I noticed that the command 'find . -name ".fna.gz" ' is not working at all. Even if I change '.' to the path where all the downloaded bacteria directories are, I get no result when running the command. Even at the top of the directory tree, it gives me nothing! When I commented earlier that it worked, that was my mistake: I had some other bacterial FASTA files in a different directory, and those showed up when I used 'find . -name ".fna.gz" '. I deleted those to confirm whether the command works. It turns out that after downloading the bacteria directories with this ncbi command, the find command does not work in any of the forms I have tried. Any idea how to solve this issue?

Shelle • 0 • 14 months ago
14 months ago
jrj.healey 12k
United Kingdom

You should be able to use ncbi-genome-download for this I think.

https://github.com/kblin/ncbi-genome-download
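As far as I know, the tool downloads GenBank-format files unless told otherwise, so FASTA has to be requested explicitly. A hedged sketch of such a call follows. The wrapper name download_bacterial_fasta and the output folder are hypothetical, and the option names are the ones I recall from the tool's README, so check ncbi-genome-download --help for the installed version.

```shell
# Ask ncbi-genome-download for RefSeq bacterial genomes in FASTA format only.
download_bacterial_fasta() {
    out=${1:-refseq_bacteria}   # hypothetical output folder name
    ncbi-genome-download \
        --section refseq \
        --formats fasta \
        --output-folder "$out" \
        bacteria
}

# Example (needs network access and a lot of disk space):
#   download_bacterial_fasta refseq_bacteria
```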
