Question

Downloading the RefSeq proteins complete data set

2.9 years ago

Dunois ★ 2.5k

The RefSeq "complete" dataset is available for download via FTP here. I am interested in the protein sequences therein (the *.protein.faa.gz files).

There seem to be two "sets" of files for the "complete" division:

complete.[0-9]+.protein.faa.gz

complete.nonredundant_protein.[0-9]+.protein.faa.gz

According to the relevant NCBI documentation these non-redundant data sets cover bacterial and archaeal sequences.

My question is: to get the complete "complete" dataset, do I need to download all the complete.nonredundant_protein.[0-9]+.protein.faa.gz files alongside all the complete.[0-9]+.protein.faa.gz files, or would this be double-counting? Or do the complete.[0-9]+.protein.faa.gz files on their own cover all protein sequence data available at NCBI?

ftp refseq • 1.1k views

updated 2.9 years ago by GenoMax 142k • written 2.9 years ago by Dunois ★ 2.5k

score 1 · Answer 1 · 2021-07-07

You will need to download all protein files (both sets) to get the "complete" dataset. Looking at the summary stats, there is only one "Protein" line for that directory, so that count presumably spans both sets of files. You can always email the NCBI help desk to confirm.

Directory: complete

    Number of taxids: 111743

    Number of Accessions and total length per molecule type:

    Genomic:    36631677    2303089292160
    RNA:        38417656    100386516185
    Protein:    204185448   79078139531
    Wgs master: 191069  0
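
If you do go with downloading both sets, a minimal Python sketch along these lines could help pick the protein files out of a directory listing. The two patterns are the ones from the question (with the dots escaped); the function names `split_protein_files` and `list_directory` are made up for illustration, and the FTP host and directory are left as arguments because the exact location is not spelled out in this thread.

```python
import re
from ftplib import FTP

# Patterns from the question; the only change is that literal dots are escaped.
PLAIN = re.compile(r"^complete\.[0-9]+\.protein\.faa\.gz$")
NONREDUNDANT = re.compile(
    r"^complete\.nonredundant_protein\.[0-9]+\.protein\.faa\.gz$"
)


def split_protein_files(names):
    """Sort directory entries into the two protein file sets.

    Returns (plain, nonredundant), each a lexicographically sorted list.
    Anything that matches neither pattern (RNA, genomic, etc.) is dropped.
    """
    plain = sorted(n for n in names if PLAIN.match(n))
    nonredundant = sorted(n for n in names if NONREDUNDANT.match(n))
    return plain, nonredundant


def list_directory(host, directory):
    """Return the entry names of an FTP directory using anonymous login.

    host and directory are supplied by the caller (assumption: the server
    allows anonymous access and supports name listings).
    """
    with FTP(host) as ftp:
        ftp.login()          # anonymous login
        ftp.cwd(directory)
        return ftp.nlst()
```

Calling `split_protein_files(list_directory(host, directory))` with the RefSeq release host and the "complete" directory would then give the two lists of file names to fetch. Since the two patterns cannot match the same name, no file ends up in both lists.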