Suggestions on Sharing VERY LARGE files?
7.4 years ago
jeremy.cox.2 ▴ 130

I am building a large database (~3 terabytes) for my analysis tool.

The issue is distributing this database so that people can use it.

A solution we have learned from working with sequencing cores is to ship a hard drive by FedEx, which is a potentially expensive and annoying proposition.

Another solution is to host the files for download, but this seems potentially expensive or impractical for the audience. (On the point of practicality: most universities have gigabit-level bandwidth, so perhaps 3 terabytes is no longer a ludicrous download size, simply a big one.) At 1 gigabit per second, I estimate the download would take about 400 minutes.
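For reference, a minimal sketch of that back-of-the-envelope estimate, assuming an ideal, sustained 1 Gbit/s link and ignoring protocol overhead:

    # Rough transfer-time estimate for a 3 TB download over a 1 Gbit/s link.
    # Assumes an ideal, sustained link and no protocol overhead.
    size_bytes = 3 * 10**12          # ~3 terabytes
    link_bits_per_s = 1 * 10**9      # 1 gigabit per second
    seconds = size_bytes * 8 / link_bits_per_s
    print(f"{seconds / 60:.0f} minutes (~{seconds / 3600:.1f} hours)")
    # -> 400 minutes (~6.7 hours)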

Since this is a niche problem, I am hoping someone has run into a similar issue and can provide advice.

Thank you, Jeremy Cox

hosting files

Any way the database can be split up? Without knowing what your tool is for I can only guess, but maybe a given researcher is only interested in a specific species (e.g. human) or in a large group of species (e.g. bacteria). This would reduce the size of the file that needs to be downloaded, and it would make your tool more accessible and approachable. I know I wouldn't want to download a 3 TB database just to use an exceedingly small portion of it.

As far as the download being too large for slower connections to handle, this is what weekends are for: set up the download Friday evening and let it run over the weekend. Put your database on an FTP server, and if someone wants to download it they'll just have to live with whatever connection speed they get. Worst case, they can mail you a drive with return postage.
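If you do go the FTP route, plain clients can usually resume an interrupted transfer, so a dropped connection doesn't mean starting over. A minimal sketch with Python's ftplib (the host, path, and credentials below are placeholders, not a real server):

    import os
    from ftplib import FTP

    # Resume an interrupted FTP download by restarting from the current file size.
    # ftp.example.org and the remote path are placeholders.
    def resume_download(host, remote_path, local_path, user="anonymous", passwd=""):
        offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
        with FTP(host) as ftp, open(local_path, "ab") as out:
            ftp.login(user, passwd)
            # rest=offset asks the server to start sending from where we left off
            ftp.retrbinary(f"RETR {remote_path}", out.write,
                           blocksize=1024 * 1024, rest=offset)

    resume_download("ftp.example.org", "/db/part_001.tar", "part_001.tar")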


Good question. I could offer smaller versions of the database, but I fear some people will want "the whole enchilada", as it were. Previewing the database through the web is also an excellent suggestion, but for high-volume computations they'll want to download the database and use their own machines. So ultimately the problem doesn't go away. Still, if we can cut down the number of people who want to ship a disk, that would be a big success. Good thinking.

7.4 years ago
GenoMax 141k

You have already covered all the bases. There is no reduced-cost alternative here.

Amazon or Google cloud storage buckets are perhaps the cheapest option if no compute needs to be done on the data. Storing the data in one place and sharing it as needed is the best solution. Google storage has good access-control options that work well with Google Compute.
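On the Amazon side, for example, you can keep the bucket private and hand out time-limited download links per collaborator. A rough sketch with boto3 (the bucket and key names are hypothetical):

    import boto3

    # Generate a time-limited download link for a private S3 object.
    # "my-3tb-database" and the key below are made-up names.
    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-3tb-database", "Key": "db/full/part_001.tar"},
        ExpiresIn=7 * 24 * 3600,  # link valid for one week
    )
    print(url)  # send this URL to the collaborator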

7.4 years ago

Shipping a disk is the most reliable solution once you've reached the TB range. The problem with sharing over the network is that the speed is only going to be as high as the slowest bottleneck between the two ends (including bottlenecks inside the institutional networks), and then there is some overhead depending on the protocol. Connections are also often not stable over the many hours (up to days) required to transfer a large data set, so you need mechanisms that can automatically reconnect and resume where the transfer was interrupted, eventually replacing files that were corrupted when the connection was lost. There are technologies for large data transfer over the network (e.g. Aspera, GridFTP), but you end up running into costs and deployment issues.
You should also consider not transferring the data. Maybe you can provide local compute to your collaborators or allow them to extract relevant subsets and only transfer those.
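To catch files corrupted by dropped connections, one common approach is to publish checksums alongside the data and have downloaders verify, and re-fetch, anything that doesn't match. A small sketch, assuming a manifest of expected SHA-256 sums in sha256sum format (file names below are illustrative):

    import hashlib
    from pathlib import Path

    # Verify downloaded files against a manifest of expected SHA-256 checksums.
    # Assumed manifest format: "<sha256>  <filename>" per line (sha256sum output).
    def verify(manifest_path, data_dir):
        bad = []
        for line in Path(manifest_path).read_text().splitlines():
            expected, name = line.split(maxsplit=1)
            h = hashlib.sha256()
            with open(Path(data_dir) / name, "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    h.update(chunk)
            if h.hexdigest() != expected:
                bad.append(name)   # these files need to be re-downloaded
        return bad

    print(verify("checksums.sha256", "downloaded_db"))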


""You should also consider not transferring the data. Maybe you can provide local compute to your collaborators or allow them to extract relevant subsets and only transfer those.""

There's the solution: run a web service that lets users browse the data without needing a full download.
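As a toy illustration of that idea, assuming the database has already been pre-split into per-species subset files on the server (the "subsets" directory and port below are made up), a stdlib-only Python file server would let users browse the listing and fetch only the slice they need:

    from functools import partial
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    # Serve pre-split subset files over HTTP so users can browse the directory
    # listing and download only the portion they need. "./subsets" is a
    # hypothetical layout, not something from the original tool.
    handler = partial(SimpleHTTPRequestHandler, directory="subsets")
    HTTPServer(("0.0.0.0", 8080), handler).serve_forever()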

7.4 years ago
Joseph Hughes ★ 3.0k

I would suggest setting up an FTP server on Amazon. Here are some notes on how we did it: sharing big files. I have shared a large file (500 GB) this way; I don't know about files of 3 TB, though.

7.4 years ago
dyollluap ▴ 310

My current best solution for sharing terabyte quantities of files (without having dedicated servers) is to host them on Amazon S3 and set up permissions as appropriate. You can store datasets at a fixed cost per TB per month and have individuals pay for their own downloads/transfers. If the data is already hosted on S3, collaborators don't necessarily need the local resources or bandwidth to download it at all: they can use the database/tool on EC2 machines as needed, without duplicating a snapshot of the source data if you later update the database.
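The "individuals pay for their own downloads/transfers" part corresponds to S3's Requester Pays bucket setting; a collaborator's download would then look roughly like this with boto3 (bucket and key names are hypothetical):

    import boto3

    # Download one object from a Requester Pays bucket: the downloader's AWS
    # account is billed for the data transfer. Bucket/key names are made up.
    s3 = boto3.client("s3")
    s3.download_file(
        Bucket="my-3tb-database",
        Key="db/human/index.tar",
        Filename="index.tar",
        ExtraArgs={"RequestPayer": "requester"},
    )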
