Question: Suggestions on Sharing VERY LARGE files?

I am building a large database (~3 terabytes) for my analysis tool.

The issue is distributing this database so that people can use it.

One solution we have learned from working with sequencing cores is to ship a hard drive by FedEx, which is a potentially expensive and annoying proposition.

Another solution is to host the files for download, but this seems potentially expensive or impractical for the audience. (On the point of practicality: most universities have gigabit-level bandwidth, so perhaps 3 terabytes is no longer a ludicrous download size, simply a big one.) At 1 gigabit per second, I estimate the download would take about 400 minutes.
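A rough sketch of that arithmetic, assuming an ideal, sustained link with no protocol overhead:

```
# Back-of-the-envelope transfer-time estimate (idealized: no overhead,
# no contention, sustained line rate).
db_bytes = 3e12                      # ~3 TB (decimal terabytes)
for gbit_per_s in (0.1, 1, 10):
    seconds = db_bytes * 8 / (gbit_per_s * 1e9)
    print(f"{gbit_per_s:>4} Gbit/s -> {seconds / 60:,.0f} minutes")
```

At 1 Gbit/s this gives 400 minutes (about 6.7 hours); at a more typical sustained 100 Mbit/s it is closer to 67 hours.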

Since this is a niche problem, I am hoping someone has faced a similar issue and can provide advice.

Thank you, Jeremy Cox

3.3 years ago by jeremy.cox.2 • 90 • updated 3.3 years ago by dyollluap • 300

Is there any way the database can be split up? Without knowing what your tool is for I can only guess, but a given researcher may only be interested in a specific species (e.g. human) or a large group of species (e.g. bacteria). This would reduce the size of the file that needs to be downloaded, and it could make your tool more accessible and approachable. I know I wouldn't want to download a 3 TB database just to use an exceedingly small portion of it.
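If the records carry an organism or taxon label, carving out per-organism subsets is mechanically simple. A rough sketch, assuming (hypothetically) a FASTA-style database where each record header contains a tag like organism=Homo_sapiens; the file names and tag format are placeholders:

```
# Split one large FASTA-style file into per-organism files, keyed on a
# hypothetical "organism=..." tag in each record header.
handles = {}

def handle_for(organism):
    if organism not in handles:
        handles[organism] = open(f"subset_{organism}.fa", "w")
    return handles[organism]

current = "unclassified"
with open("full_database.fa") as fh:            # hypothetical input file
    for line in fh:
        if line.startswith(">"):
            tags = [f for f in line.split() if f.startswith("organism=")]
            current = tags[0].split("=", 1)[1] if tags else "unclassified"
        handle_for(current).write(line)

for h in handles.values():
    h.close()
```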

As far as the download being too large for slower connections to handle, this is what weekends are for: set up the download Friday evening and let it run over the weekend. Put your database on an FTP server, and if someone wants to download it they'll just have to live with whatever connection speed they get. Worst case, they can mail you a drive with return postage.
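For the interrupted-download case, a resumable client helps. A minimal sketch using Python's standard ftplib and the FTP REST command; the host, path, and file names are hypothetical:

```
# Resume a partially downloaded file from an FTP server by restarting the
# transfer at the current local file size (REST offset).
import os
from ftplib import FTP

host = "ftp.example.org"                 # hypothetical server
remote_path = "db/part-001.tar.gz"       # hypothetical remote file
dest = "part-001.tar.gz"

resume_from = os.path.getsize(dest) if os.path.exists(dest) else 0

with FTP(host) as ftp, open(dest, "ab" if resume_from else "wb") as fh:
    ftp.login()                          # anonymous login
    ftp.retrbinary(f"RETR {remote_path}", fh.write,
                   rest=resume_from if resume_from else None)
```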

3.3 years ago by pld • 4.8k

Good question. I could offer smaller versions of the database, but I fear some people will want "the whole enchilada", as it were. Previewing the database through the web is also an excellent suggestion, but for high-volume computations they'll want to download the database and use their own machines. So ultimately the problem doesn't go away. Still, if we can cut down the number of people who want to ship a disk, that would be a big success. Good thinking.

3.3 years ago by jeremy.cox.2 • 90

You have already covered all the bases; there is no reduced-cost alternative here.

Amazon or Google cloud storage is perhaps the cheapest option for data buckets if no compute needs to be done on the data. Storing the data in one place and sharing it as needed is the best solution. Google storage has good access-control options that work with Google Compute.
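For the S3 variant, a minimal sketch of handing out time-limited download links with boto3; the bucket and key names are placeholders (Google Cloud Storage has an equivalent signed-URL mechanism):

```
# Generate a presigned (time-limited) download URL for an object in S3.
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-database-bucket", "Key": "db/part-001.tar.gz"},
    ExpiresIn=7 * 24 * 3600,             # valid for one week
)
print(url)                               # hand this to the collaborator
```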

3.3 years ago by genomax • 68k

Shipping a disk is the most reliable solution once you've reached the terabyte range. The problem with sharing over the network is that the speed will only be as high as the slowest bottleneck between the two ends (including bottlenecks inside the institutional networks), plus some overhead depending on the protocol. Also, connections are often not stable over the many hours (up to days) required to transfer a large data set, so you need mechanisms that can automatically reconnect and resume where the transfer was interrupted, and eventually replace files that were corrupted when the connection was lost. There are technologies for large data transfers over the network (e.g. Aspera, GridFTP), but you end up running into costs and deployment issues.
You should also consider not transferring the data. Maybe you can provide local compute to your collaborators or allow them to extract relevant subsets and only transfer those.
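One way to catch files corrupted mid-transfer is to ship a checksum manifest alongside the data. A small sketch, assuming a hypothetical MANIFEST.sha256 file with one "<hexdigest>  <path>" pair per line:

```
# Verify transferred files against a SHA-256 manifest and flag any that
# need to be re-fetched.
import hashlib
from pathlib import Path

def sha256sum(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

for line in Path("MANIFEST.sha256").read_text().splitlines():
    expected, name = line.split(maxsplit=1)
    status = "OK" if sha256sum(name) == expected else "CORRUPT, re-fetch"
    print(f"{name}: {status}")
```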

3.3 years ago by Jean-Karim Heriche • 19k

""You should also consider not transferring the data. Maybe you can provide local compute to your collaborators or allow them to extract relevant subsets and only transfer those.""

That's the solution: run a web service so users can browse the data without needing a full download.
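As a toy illustration of that idea, a Flask endpoint that returns only the slice of the database matching a query; the lookup function and record store are entirely hypothetical:

```
# Minimal query service: clients fetch small result sets instead of the
# whole 3 TB database.
from flask import Flask, jsonify, request

app = Flask(__name__)

def lookup(query):
    # Placeholder: a real service would hit an indexed store backed by
    # the full database and return only the matching records.
    return {"query": query, "hits": []}

@app.route("/search")
def search():
    return jsonify(lookup(request.args.get("q", "")))

if __name__ == "__main__":
    app.run(port=5000)
```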

3.3 years ago by karl.stamm • 3.5k

I would suggest setting up an FTP server on Amazon. Here are some notes on how we did it: sharing big files. I have shared a large file (500 GB) this way, though I don't know about files in the 3 TB range.

3.3 years ago by Joseph Hughes ♦ 2.7k

My current best solution for sharing terabyte quantities of files (without dedicated servers) is to host them on Amazon S3 and set up permissions as appropriate. You can store the dataset at a fixed cost per TB per month and have individuals pay for their own downloads/transfers. With the data already on S3, collaborators don't necessarily need to pull it down to local resources at all: they can run the database/tool on EC2 machines as needed, without duplicating a snapshot of the source data when you later update the database.
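A minimal sketch of the "individuals pay for their own transfers" piece, assuming the bucket is configured as Requester Pays; the bucket, key, and file names are placeholders:

```
# Download an object from a Requester Pays S3 bucket; the caller's AWS
# account is billed for the data transfer.
import boto3

s3 = boto3.client("s3")
s3.download_file(
    Bucket="my-database-bucket",
    Key="db/part-001.tar.gz",
    Filename="part-001.tar.gz",
    ExtraArgs={"RequestPayer": "requester"},
)
```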

3.3 years ago by dyollluap • 300
