The RefSeq "complete" dataset is available for download via FTP here. I am interested in the protein sequences therein (the *.protein.faa.gz files).
There seem to be two "sets" of files for the "complete" division:
complete.[0-9]+.protein.faa.gz
complete.nonredundant_protein.[0-9]+.protein.faa.gz
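To make the two naming patterns concrete, here is a minimal Python sketch that sorts a directory listing into the two sets. The regular expressions are the two patterns above. The function names and the sample file names in the usage note are hypothetical and only for illustration. This does not answer whether the sets overlap in content; it only separates them by name.

```python
import re

# The two file-name patterns quoted in the question, anchored to the full name.
GENERAL = re.compile(r"^complete\.[0-9]+\.protein\.faa\.gz$")
NONREDUNDANT = re.compile(
    r"^complete\.nonredundant_protein\.[0-9]+\.protein\.faa\.gz$"
)


def classify(filename):
    """Return which of the two sets a file name belongs to (hypothetical helper)."""
    if NONREDUNDANT.match(filename):
        return "nonredundant"
    if GENERAL.match(filename):
        return "general"
    return "other"


def group_listing(filenames):
    """Group an iterable of file names by set, preserving input order."""
    groups = {"general": [], "nonredundant": [], "other": []}
    for name in filenames:
        groups[classify(name)].append(name)
    return groups
```

For example, `group_listing(["complete.1.protein.faa.gz", "complete.nonredundant_protein.1.protein.faa.gz"])` would place the first name under "general" and the second under "nonredundant" (both names are made up to fit the patterns).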
According to the relevant NCBI documentation, these non-redundant data sets cover bacterial and archaeal sequences.
My question is: to get the complete "complete" dataset, do I need to download all the complete.nonredundant_protein.[0-9]+.protein.faa.gz files alongside all the complete.[0-9]+.protein.faa.gz files, or would this be double-counting? Or do the complete.[0-9]+.protein.faa.gz files on their own cover all protein sequence data available at NCBI?