What Is A Good Strategy For Distributing NGS Data From A Core Facility?
4
5
11.1 years ago
Dan D 7.4k

Let's say you're a core facility serving customers both internal and external to your institution. You want to have an efficient way to distribute NGS data such as BAM files, FASTQ, and custom QC reports to your customers. You're open to pretty much any method as long as it is reliable, cost-effective, and convenient for both you and your customers. How would you accomplish this?

3
11.1 years ago

I would love to see more core facilities storing their runs short-term on Amazon S3. Keep the data on S3 for a few months, after which it can automatically be moved to Glacier for low-cost, long-term, off-site storage (a sketch of such a lifecycle rule follows the list below). The costs of such a system are easily low enough to be amortized into the facility fee. The upshots of this system:

  • The data are on a fast, high-availability distribution network
  • The data can immediately be imported into an Amazon EC2 instance for analysis
  • There are no up-front hardware costs
  • Costs for storage and archival can be pushed onto the end user via "requester pays" billing
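
To make the lifecycle idea concrete, here is a minimal sketch using boto3 (the modern successor to the boto library mentioned elsewhere in this thread); the bucket name and "runs/" prefix are hypothetical placeholders:

```
import boto3

# Sketch: automatically transition run data to Glacier after ~3 months.
# Bucket name and "runs/" prefix are hypothetical placeholders.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-core-facility-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-runs-to-glacier",
                "Filter": {"Prefix": "runs/"},  # only sequencing runs
                "Status": "Enabled",
                "Transitions": [
                    # After 90 days on S3, move objects to Glacier
                    {"Days": 90, "StorageClass": "GLACIER"}
                ],
            }
        ]
    },
)
```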
0

In principle, I like this idea too.

However, there are two main drawbacks: 1) confidentiality: people often don't like data leaving the institution; 2) doing analysis in the cloud is hardly routine at the moment, so you'd be wasting time and money putting the data in the cloud only for it to be downloaded to local storage again.

0

1) True (see the USA PATRIOT Act); data encryption should be considered. 2) The question is not about computation but about sharing data.

0

Indeed, the original question was about sharing, but I replied to the answer which suggested EC2 storage was a good solution because of computation (amongst other reasons). I was referring to that, not the original question.

0

I appreciate the thoughts. You are absolutely correct that people don't like data to leave the institution. S3 is FISMA "moderate" certified, which is good enough for protected health information. Public key encryption would secure data transmission, and some kind of passkey encryption should be used for securing access to the files.

With respect to the comment about "putting data in the cloud": S3 is a content distribution network, plain and simple. You are likely using it every day without realizing it (especially if you use Dropbox). EC2, on the other hand, is a platform for "cloud" computing. It would definitely be wasteful to store your genomic data on an EC2 server for distribution, just as it is wasteful to transfer files "in" to the cloud for analysis. Placing the data on S3 eliminates both issues by providing fast and reliable distribution to the wider internet, as well as immediate transfer within all of Amazon's web services.
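
As one hedged illustration of at-rest protection: S3 supports server-side encryption, which can be requested per upload. A minimal boto3 sketch, with hypothetical file, bucket, and key names:

```
import boto3

# Sketch: upload a BAM so it is stored encrypted at rest (SSE-S3).
# Transport to S3 is already protected by TLS.
# File, bucket, and key names are hypothetical placeholders.
s3 = boto3.client("s3")

s3.upload_file(
    "sample01.bam",
    "my-core-facility-data",
    "runs/run_001/sample01.bam",
    ExtraArgs={"ServerSideEncryption": "AES256"},  # encrypt at rest
)
```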

0

Ah, ok. I may have misunderstood what S3 is. Thanks for the clarification.

1
11.1 years ago

You can use Amazon S3 together with the boto Python library to organize your data into buckets. S3 lets you generate "shared/public" URLs on the fly to share data with customers, and you can set a lifetime on each URL (see the sketch below).

It is cost-effective, and you can choose a data center close to your customers.
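
A minimal sketch of such an expiring link, shown here with boto3 rather than the classic boto this answer refers to; bucket and key names are hypothetical placeholders:

```
import boto3

# Sketch: generate a time-limited download link for a delivered file.
# Bucket and key names are hypothetical placeholders.
s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-core-facility-data",
            "Key": "runs/run_001/sample01.fastq.gz"},
    ExpiresIn=7 * 24 * 3600,  # the link expires after one week
)
print(url)  # send this link to the customer
```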

0

I like this idea a whole lot. Uploading the data to a central location and then just distributing links to customers means no worries about upkeep of storage resources, and it takes the burden off tracking access permissions as well. I'm going to research S3 and similar providers and see how robust the various APIs are, because it would be nice to automate the process of sending data to cloud storage and then generate reports and download links accordingly. Thanks!

1

Take a look at botocore: https://github.com/boto/botocore and the AWS CLI: https://github.com/aws/aws-cli. A minimal botocore sketch follows.
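
For a sense of what botocore looks like in practice, here is a small sketch; the bucket name is a hypothetical placeholder:

```
import botocore.session

# Sketch: botocore is the low-level core that boto3 and the AWS CLI
# are built on. The bucket name is a hypothetical placeholder.
session = botocore.session.get_session()
client = session.create_client("s3", region_name="us-east-1")

# List what has been delivered to the distribution bucket so far.
response = client.list_objects(Bucket="my-core-facility-data")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```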

1
11.1 years ago
Chris Cole ▴ 800

I personally like ftp.

I'm mostly a user, and ftp is completely cross-platform, so I can use any tool I like to access the data (even a short script, as sketched below). It can be slow, but that doesn't matter as I only need to do it once per dataset; it's easy to leave a wget session open for a few hours.
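
For example, a resume-capable download can be scripted with nothing but Python's standard ftplib; this is a sketch with a hypothetical host and paths, not a recommendation over wget:

```
from ftplib import FTP
import os

# Sketch: a resume-capable FTP download using only Python's standard
# library. Host, remote path, and file names are hypothetical.
HOST = "ftp.example.org"
REMOTE = "runs/run_001/sample01.fastq.gz"
LOCAL = "sample01.fastq.gz"

# Resume from however many bytes are already on disk.
offset = os.path.getsize(LOCAL) if os.path.exists(LOCAL) else 0

ftp = FTP(HOST)
ftp.login()  # anonymous login
with open(LOCAL, "ab") as out:  # append mode matches the resume offset
    ftp.retrbinary(f"RETR {REMOTE}", out.write, rest=offset)
ftp.quit()
```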

I like the idea of Aspera: it's much faster than standard ftp. However, as a user you need to install clients (tricky for the command-line tools), and as a facility it's non-trivial to set up. Plus it costs.

0

+1 for Aspera. An order of magnitude faster transfer compared to ftp.

0

As much as I hate to admit it, Aspera, or something like it, is really the way of the future. I worry that Aspera costs too much, is geared too much toward browser-plugin-based downloads, and its configuration seems daunting. But, aside from raw speed, Aspera calculates block-based checksums on the server and checks all the data you download for consistency. The lack of a reasonable checksumming strategy for most FTP transfers concerns me (a manifest sketch is below). I would like to see something like rsync but optimized for throughput and included in every Linux OS.
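
As a stopgap, a facility can at least publish a checksum manifest that standard tools can verify. A minimal sketch of a facility-side manifest generator, with a hypothetical delivery directory:

```
import hashlib
from pathlib import Path

# Sketch: write an md5 manifest next to the delivered files so
# customers can verify their downloads. The delivery directory
# name is a hypothetical placeholder.
delivery = Path("run_001_delivery")

def md5sum(path: Path, chunk: int = 1 << 20) -> str:
    """Stream the file in 1 MB chunks so large BAMs fit in memory."""
    h = hashlib.md5()
    with path.open("rb") as fh:
        while block := fh.read(chunk):
            h.update(block)
    return h.hexdigest()

with open(delivery / "MD5SUMS", "w") as manifest:
    for f in sorted(delivery.glob("*")):
        if f.is_file() and f.name != "MD5SUMS":
            # Same format that "md5sum -c MD5SUMS" understands
            manifest.write(f"{md5sum(f)}  {f.name}\n")
```

Customers can then verify their downloads with "md5sum -c MD5SUMS".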

0

Good point, an rsync-like tool would be great.

Regarding checksumming: too many sites don't provide any checksum info, so you've no idea whether your download is correct or not. I've tried asking, but haven't got much interest.

0
11.1 years ago
DG 7.3k

In my experience as an end user receiving data externally from a core facility, we often do straight FTP transfers of raw data (FASTQ) and QC reports. When collaborating with colleagues we've used Dropbox for sharing small BAM files (exome data, but not whole genome) because we both have paid accounts where that is feasible. For sharing massive amounts of data, sneakernet is often a better alternative to large transfers over FTP: sending a hard drive by courier is often worth the slight extra cost, depending on the amount of data that needs to be transferred.

0

I second the hard-drive shipping option; recently I had to transfer ~100 GB, and it was better to use DHL.
