If I downloaded all of the data accessible from NCBI's FTP site, how much space would I need?
5.6 years ago
breckuh ▴ 30

I love how NCBI has a simple FTP server setup (ftp://ftp.ncbi.nih.gov/) where I can access data from GEO, GenBank, etc.

I was just curious: if I were to download all of that data, how much space would I need? My guess is it would be in the petabytes or exabytes.

Does NCBI publish any kind of stat like that?

data • 1.7k views

I guess you could write an FTP crawler and add up file sizes, or establish an SFTP connection and check the size of the top folder, if that's possible.
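To make the crawler idea concrete, here is a minimal sketch using Python's standard `ftplib`. It assumes the server answers the `MLSD` command (not every FTP server does, and I have not checked whether NCBI's does), and `total_size` is just a name made up for this example. It issues one listing request per directory and adds up the reported file sizes without downloading anything.

```python
from ftplib import FTP  # needed only for the real connection shown in the usage note


def total_size(ftp, root="/"):
    """Sum the sizes in bytes of all files under `root`.

    Assumes `ftp` behaves like ftplib.FTP and that the server supports MLSD.
    Uses an explicit stack instead of recursion so deep trees are not a problem.
    """
    total = 0
    pending = [root]
    while pending:
        path = pending.pop()
        # One listing request per directory.
        for name, facts in ftp.mlsd(path, facts=["type", "size"]):
            kind = facts.get("type")
            if kind == "file":
                total += int(facts.get("size", 0))
            elif kind == "dir":
                # "cdir" and "pdir" (current/parent entries) are skipped.
                pending.append(path.rstrip("/") + "/" + name)
    return total


# Usage (makes real network requests, so it is left commented out):
# with FTP("ftp.ncbi.nih.gov") as ftp:
#     ftp.login()  # anonymous login
#     print(total_size(ftp, "/"))
```

If the server rejects `MLSD`, a fallback would be to parse the output of `LIST`, which is less reliable because its format is not standardized.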


I was thinking of doing something like that. I know they don't support SFTP.

I haven't done any FTP crawling before, but I imagine it's straightforward. I just don't know how feasible it is. If there were 1B folders and I made 1k requests per second, it would take about 12 days. Perhaps there are a few simple commands and the whole thing would take an hour. Or perhaps I'd be rate limited to 100 requests per second and it would take nearly 4 months :).
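As a rough check on those back-of-envelope numbers, assuming one request per folder and a constant request rate (both simplifications), with `crawl_days` as a helper name invented here:

```python
SECONDS_PER_DAY = 86_400


def crawl_days(n_requests, requests_per_second):
    """Days needed to issue n_requests at a constant rate."""
    return n_requests / requests_per_second / SECONDS_PER_DAY


# 1 billion folders, one listing request each (hypothetical figure from the post).
print(round(crawl_days(1_000_000_000, 1_000), 1))  # 11.6 days at 1k requests/s
print(round(crawl_days(1_000_000_000, 100), 1))    # 115.7 days at 100 requests/s
```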

Figured asking around first would be a good starting place.


Asking around is always better than jumping head first. Good luck!

5.6 years ago
Carambakaracho ★ 3.2k

At least GenBank is still in the terabyte range:

GenBank exceeds 3 Terabases in release 224

GenBank release 225: Over 1 billion sequence records stored!

GenBank release 225.0 (4/14/2018) has 208,452,303 traditional records (including non-bulk-oriented TSA) containing 260,189,141,631 base pairs of sequence data. In addition, there are 621,379,029 WGS records containing 2,784,740,996,536 base pairs of sequence data, 227,364,990 TSA records containing 205,232,396,043 base pairs of sequence data, and 14,782,654 TLS records containing 5,612,769,448 base pairs of sequence data.
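Adding up the release 225.0 figures quoted above gives a feel for the scale. The byte estimate assumes one byte per base of raw sequence, which is only a rough proxy: the actual flat files also carry annotation, and the distributed files are compressed.

```python
# Record and base-pair counts as quoted from GenBank release 225.0.
records = {
    "traditional": 208_452_303,
    "WGS": 621_379_029,
    "TSA": 227_364_990,
    "TLS": 14_782_654,
}
bases = {
    "traditional": 260_189_141_631,
    "WGS": 2_784_740_996_536,
    "TSA": 205_232_396_043,
    "TLS": 5_612_769_448,
}

total_records = sum(records.values())
total_bases = sum(bases.values())

print(f"{total_records:,} records")  # 1,071,978,976 records
print(f"{total_bases:,} bases")      # 3,255,775,303,658 bases
# Assumption: 1 byte per base, decimal terabytes.
print(f"{total_bases / 1e12:.2f} TB of raw sequence")  # 3.26 TB of raw sequence
```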


Thanks!

Looks like SRA is about 8.7 Petabytes:

https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?
