NGS storage on AWS S3
fanli.gcb ▴ 730 · 8.0 years ago

Hi all,

Does anyone have recommendations or experience using Amazon S3 or Glacier for NGS data storage? We're currently exploring migrating from local (lab-owned) servers to AWS, but I'm just not that comfortable with the logistics and the risk/reward trade-off.

A bit of background info: we operate a sequencing core (low-ish throughput, say 250GB a month) and currently manage everything on a few locally hosted machines. To start, I'd probably want to migrate all of the existing sequencing data (e.g. BCLs and FASTQs, currently about 10TB) to S3, and also set up some mechanism for automatically transferring new sequencing runs to S3. We have enough local storage that we could batch these transfers every month or so. We also have enough local and university compute infrastructure that I don't envision needing EC2 any time soon.
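To make the "mechanism" concrete, here's roughly what I have in mind; this is an untested sketch using boto3, and the bucket name and paths are placeholders I made up:

```python
# Untested sketch of the monthly transfer I have in mind; the bucket
# name and local paths are placeholders, not real resources.
import os
import boto3

s3 = boto3.client("s3")

BUCKET = "our-core-sequencing-archive"  # hypothetical bucket name
RUN_ROOT = "/data/runs"                 # where the instruments write run folders

def upload_run(run_name):
    """Mirror one Illumina run folder into S3 under runs/<run_name>/."""
    run_dir = os.path.join(RUN_ROOT, run_name)
    for dirpath, _dirs, filenames in os.walk(run_dir):
        for fname in filenames:
            local_path = os.path.join(dirpath, fname)
            # Key mirrors the on-disk layout relative to the run root
            key = "runs/" + os.path.relpath(local_path, RUN_ROOT)
            s3.upload_file(local_path, BUCKET, key)

upload_run("170101_M00001_0001_000000000-ABCDE")  # made-up run folder name
```

In practice `aws s3 sync` from cron would do much the same thing, but a script makes it easier to add checks (e.g. only upload once the run's completion marker file exists).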

So then, how should I proceed? I'm thinking of using S3 with lifecycle rules to automatically transition objects to Glacier storage, since this data will only be retrieved in rare cases. This keeps the advantage of the S3 API with its rsync-like syntax; I'm not familiar enough with Glacier's own API to know exactly how I'd transfer Illumina run folders into it directly.
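Concretely, I'm picturing a lifecycle rule along these lines (again an untested sketch; the bucket name and prefix are placeholders):

```python
# Untested sketch: transition everything under runs/ to Glacier after 30 days.
# Bucket name and prefix are placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="our-core-sequencing-archive",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-runs-to-glacier",
                "Status": "Enabled",
                "Filter": {"Prefix": "runs/"},  # only archive run folders
                # Objects transition to Glacier 30 days after upload
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```

As far as I understand, objects archived this way stay visible through the S3 API and just need a restore request before download, so I'd never have to touch the native Glacier API directly.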

On a more detailed note, how would you recommend structuring the data on S3? Would you create a separate bucket for each sequencing instrument? Or for each run?

Sorry for the length of the post - any advice would be greatly appreciated.

next-gen cloud storage • 2.8k views

Since this is archival storage, you may also want to look at Nearline storage on Google Cloud. Google Cloud is HIPAA compliant and they will sign a business associate agreement (check for local policy restrictions before you plan to use any cloud provider). It is also possible to use NetBackup to back up directly to Google Cloud Storage.


A note: AWS has lots of resources for making your storage HIPAA compliant, and they also have a special region (AWS GovCloud) for US government institutions and those operating under similar security/privacy constraints.


I like Backblaze (I use them for personal backup), but I wonder if it's better to future-proof and stick with Amazon in case we migrate to EC2 in a few years' time. Also, has anyone here actually done a migration? I'm really curious about the inevitable "oops" and "oh, I didn't know that" moments that will come up.

DG 7.3k · 7.8 years ago

As well as the logistical things (setting up the infrastructure, how to automate transfers, etc.), you want to do a complete costing. Keep in mind that you have your S3 storage costs plus the data transfer costs as well. It can be cheaper than local storage once you factor in infrastructure costs and server maintenance, but you also need to think about cash flow and funding sources. Sometimes it can be hard for us in science, even as sequencing cores, to be paying monthly storage costs versus larger upfront capital investments.
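As a back-of-the-envelope example using the numbers in the question (the per-GB prices below are illustrative assumptions only, not current rates; check the AWS pricing pages before deciding):

```python
# Illustrative costing only. The per-GB prices are assumptions for the
# sake of the arithmetic, not current AWS rates.
S3_STANDARD_PER_GB_MONTH = 0.023  # assumed USD per GB-month
GLACIER_PER_GB_MONTH = 0.004      # assumed USD per GB-month

backlog_gb = 10 * 1024     # ~10 TB already on disk
monthly_growth_gb = 250    # new sequencing data per month

# Steady state: backlog archived in Glacier, newest month still in S3 Standard
monthly_cost = (backlog_gb * GLACIER_PER_GB_MONTH
                + monthly_growth_gb * S3_STANDARD_PER_GB_MONTH)
print(f"~${monthly_cost:.0f}/month, growing ~$1/month as runs accumulate")
```

Note that transfer into S3 is free, but Glacier retrievals and transfer out are billed separately, which matters if you ever need to pull the whole archive back down.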

For the S3 buckets you can go either way, really. Per instrument makes it easy to retrieve data when you know the instrument; per run makes it easier when you have the run specification. Keep in mind that S3 bucket names need to be globally unique across all of AWS, and per-run buckets also mean that at a certain point you are dealing with a very large number of buckets.
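A middle ground worth considering is a single bucket with per-instrument and per-run key prefixes, which sidesteps both the global-uniqueness and bucket-count issues. A quick sketch (bucket and instrument names here are hypothetical):

```python
# Sketch: one bucket, keys namespaced by instrument and run.
# Bucket and instrument names are hypothetical.
import boto3

s3 = boto3.client("s3")
BUCKET = "our-core-sequencing-archive"  # hypothetical bucket name

# Keys would look like: "M00001/170101_M00001_0001_000000000-ABCDE/Data/..."
# List the runs for one instrument by treating "/" as a directory separator:
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix="M00001/", Delimiter="/")
for prefix in resp.get("CommonPrefixes", []):
    print(prefix["Prefix"])  # one line per run folder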
