Biostar Beta. Not for public use.
How to best get ALL Bacterial proteins from NCBI
17 months ago

Hey all,

I already have a head start on this question (following this tutorial). However, that method is taking a _really_ long time, since I have a list of ~0.5 billion sequences to get. Additionally, some of my threads are throwing errors during sequence filtering, and I'm afraid this method might not work.

So! I'm asking whether you have a better idea for how to get every bacterial protein sequence from NCBI. I don't think EDirect will work (I'll be blocked). One idea I had was to use esearch and efetch on a local copy of all the protein records (nr.fa). However, EDirect doesn't support local queries out of the box (at least to my knowledge).

Any advice on how to wrangle EDirect into doing local queries, or any other ideas, would be much appreciated.

protein big data • 282 views

You can also download the .faa.gz files for every bacterium in RefSeq; see another tutorial.


"how to get every bacterial protein sequence from NCBI"

That requirement, if absolute, will not be satisfied by these two things.


Yes, I know. I'd guess the proteins of bacteria in RefSeq are enough for their purpose, though we don't yet know what the data will be used for.

Anyway, one can try

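# download the RefSeq bacterial assembly summary
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt

# reformat: drop the first comment line, strip the leading "#" (with or
# without a following space) from the header line, and replace double
# quotes so that csvtk does not treat them as field quoting
sed 1d assembly_summary.txt | sed '1s/^# *//' \
    | sed 's/"/$/g' > assembly_summary.tsv

# where to download
dir=download
mkdir -p "$dir"

# fetch the protein FASTA of every assembly, 10 jobs at a time;
# -c with -C download.rush lets rush skip finished jobs when re-run
csvtk cut -t -f ftp_path assembly_summary.tsv | sed 1d \
    | rush -v prefix='{}/{%}' -v dir="$dir" \
        ' \
            wget -c {prefix}_protein.faa.gz -O {dir}/{%}_protein.faa.gz \
        ' \
        -j 10 -c -C download.rush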
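
Once the downloads finish, it may be worth checking that each archive is intact before merging, since interrupted transfers can leave truncated files. A minimal sketch, assuming the files sit in the download directory used above; the function name merge_faa and the output file name are made up for illustration:

```shell
# merge_faa DIR OUT
#   DIR: directory holding *_protein.faa.gz files
#   OUT: merged gzip file to create (keep it outside DIR's glob pattern)
# Returns non-zero if any archive failed the integrity test.
merge_faa() {
    dir=$1
    out=$2
    bad=0
    : > "$out"                           # start with an empty output file
    for f in "$dir"/*_protein.faa.gz; do
        [ -e "$f" ] || continue          # glob matched nothing
        if gzip -t "$f" 2>/dev/null; then
            cat "$f" >> "$out"           # concatenated gzip members are a valid gzip stream
        else
            echo "corrupt or truncated: $f" >&2
            bad=$((bad + 1))
        fi
    done
    [ "$bad" -eq 0 ]
}
```

Usage would be something like: merge_faa download refseq_bacteria.faa.gz. Files reported as corrupt can be deleted and fetched again by re-running the download step.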


"all protein" sequences is a moving target, anyway...

4 months ago
genomax 68k
United States

You could download nr blast indexes and then use blastdbcmd from BLAST+ (v. 2.8.1) package to do something like this:

 blastdbcmd -db /path_to/nr_v5 -taxids 2 -outfmt %f -out file.fa

This may not be completely foolproof but should mostly work.

Note: You will need to get new v.5 blast indexes for this to work.
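
Since the extraction may not be foolproof, a quick spot check of the result can help. NCBI protein deflines normally end with the organism name in square brackets, so tallying those gives a rough picture of what ended up in file.fa. A sketch using only standard tools; top_organisms is a hypothetical helper name, and merged deflines or unusual headers will skew the counts a little:

```shell
# top_organisms FASTA [N]
#   Print the N most frequent organism names (default 20), taken from
#   the last [...] field of each header line, with their counts.
top_organisms() {
    grep '^>' "$1" \
        | sed -n 's/.*\[\([^]]*\)][^[]*$/\1/p' \
        | sort | uniq -c | sort -rn | head -n "${2:-20}"
}
```

Running top_organisms file.fa 50 and seeing eukaryotic or viral names near the top would suggest that the taxid filter did not do what was intended.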


I may try this. I am looking for the most sequences possible right now, not just RefSeq.


Just occurred to me to ask: What would be the difference between the blast index filtered for bacteria and all of the RefSeq bacterial protein faa files?


The BLAST index will have data for all bacteria, whereas RefSeq will likely be restricted to well-characterized, manually curated datasets.

14 months ago
Carambakaracho ♦ 1.2k
Switzerland/Basel

From blast/db/README

  1. Contents of the /blast/db/FASTA directory

    [...]

    nr.gz* | non-redundant protein sequence database with entries from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq

From README.genbank

Protein sequences

The protein sequences present in GenBank releases, via coding regions annotated on GenBank records, are made available via files located elsewhere at the NCBI FTP site:

These files replace the single, comprehensive protein FASTA which used to be provided in this directory ( relNNN.fsa_aa.gz ).

Please see the README in the /protein_fasta directory for further information.

This is what it points to: ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/ and its README

Is this what you're looking for?


The gbbct* files in this directory would work, but there is going to be a lot of redundancy. It may still be worth using the nr database to avoid this issue, but that is something the original poster will have to decide.
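
If the gbbct* route is taken anyway, exact duplicates can be collapsed afterwards. A rough sketch, assuming plain FASTA on standard input; dedup_fasta is a made-up name, only byte-identical sequences are merged, and every distinct sequence is held in memory, so for hundreds of millions of records a sort-based or hashing approach would probably be needed instead:

```shell
# dedup_fasta: read FASTA on stdin, write the first record seen for each
# distinct sequence. Sequences are written unwrapped, on a single line.
dedup_fasta() {
    awk '
        function emit() {
            if (header != "" && !(seq in seen)) {
                seen[seq] = 1
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
        { seq = seq $0 }                               # join wrapped sequence lines
        END { emit() }
    '
}
```

Piping the decompressed files through dedup_fasta keeps the header of the first occurrence only, so the other accessions for the same sequence are lost, unlike in nr where they are merged into one record.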


This may be a good backup to using the nr_v5 database.


I couldn't believe it was gone, and indeed it is still there:

ftp://ftp.ncbi.nih.gov/blast/db/FASTA/

Specifically, the nr.gz file (the link points to a 45 GB file). It still needs to be filtered for bacterial entries, though...
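
One way that filtering could look, sketched under assumptions: given a plain-text list of wanted accession.version identifiers, one per line (for example derived from NCBI's prot.accession2taxid mapping; building that list is not shown here), an awk pass over the decompressed FASTA can keep the matching records. The function name and the ID file are hypothetical. In the nr FASTA, identical sequences share one record whose deflines are joined by Ctrl-A characters, so the sketch checks every accession in a header:

```shell
# filter_fasta_by_ids IDS
#   IDS: file with one accession.version per line
#   Reads FASTA on stdin and prints records in which any defline
#   (deflines are separated by Ctrl-A, octal 001) starts with a listed ID.
filter_fasta_by_ids() {
    awk -v ids="$1" '
        BEGIN {
            while ((getline line < ids) > 0)
                if (line != "") keep[line] = 1
            close(ids)
        }
        /^>/ {
            printing = 0
            n = split(substr($0, 2), entries, "\001")
            for (i = 1; i <= n; i++) {
                split(entries[i], tok, " ")      # tok[1] is the accession.version
                if (tok[1] in keep) { printing = 1; break }
            }
        }
        printing                                 # print header and sequence lines of kept records
    '
}
```

Usage would be along the lines of: gzip -dc nr.gz | filter_fasta_by_ids bacterial_ids.txt > nr_bacteria.fa. The ID list is held in memory, which may become a problem for very large lists.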
