Forum: NGS data storage solutions for small organisations or big labs
13 months ago
bioinfo • 700
New Zealand

Hi all,

I’m trying to compile information on how big labs (e.g. 40+ people) or small organisations that produce large amounts of NGS data manage data storage. As a small organisation, we produce 10-15 TB of HTS data every year from large genomics and metagenomics projects. We often depend on an external HPC cluster service provider to store this data while we analyse it there, and we keep a few more TB on a local server. We have both "active" and "cold" (no longer actively used) datasets from our research projects, and we try to retain them for at least 7 years.
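For a rough sense of scale, the figures above imply a steady-state footprint that can be sketched as follows. This is only a back-of-the-envelope estimate: it assumes a constant yearly output and a single copy of the raw data, with no backups or intermediate analysis files, and the helper name `retained_volume_tb` is made up for illustration.

```python
def retained_volume_tb(tb_per_year: float, retention_years: int) -> float:
    """Steady-state volume when each year's data is kept for `retention_years`.

    Assumption: constant yearly output, one copy, no backups or derived files.
    """
    return tb_per_year * retention_years


# Figures from the post: 10-15 TB per year, kept for at least 7 years.
low = retained_volume_tb(10, 7)
high = retained_volume_tb(15, 7)
print(f"Steady-state footprint: {low:.0f}-{high:.0f} TB")
```

Any backup copy multiplies this figure accordingly.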

Since many of us deal with NGS projects that are terabytes in size, I was wondering:

  • How and where do you store your "active" and "cold" HTS data? Do you have an in-house server with TBs of storage, cloud storage (e.g. Amazon), or something else?
  • Is it cost-effective to use cloud-based servers (e.g. Amazon, ASIA) for "cold" data storage and build a small local server for working with "active" data? Any ideas on the cost of building a small cluster/server with TBs of storage?

I have talked to a few colleagues, and it sounds like everybody manages somehow but is looking for better options. Since many of us struggle with HTS data storage and backup, I was wondering whether there are any cost-effective solutions. Many organisations are investing quite a lot in eResearch (i.e. data science), and many of us already know that genomics data storage is a big issue for researchers across organisations and needs more attention.

We might share our ideas and see if we can follow others’ approaches.


For us, active data is always on high-performance local cluster storage. We are a bigger organization/sequencing center and have access to plenty of storage (not infinite, but hundreds of TB, adequate for ~6 months). We also use a large Quantum tape library that is presented as a storage partition; data copied there automatically goes onto tape. We keep them for 3 years.

You can consider cold storage on Google or AWS. Cold cloud storage is cheap, but you will incur a cost to retrieve the data, and that can be expensive. To save space in general, you can also consider converting data to uBAM, or to CRAM if a reference is available.
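The trade-off between cheap cold storage and retrieval fees can be sketched as a simple yearly cost estimate. Every price below is a made-up placeholder, not a quote from Google or AWS, and the function name and parameters are hypothetical; plug in your provider's current rates before drawing conclusions.

```python
def annual_cold_storage_cost(
    stored_tb: float,
    price_per_tb_month: float,
    retrieved_tb_per_year: float,
    retrieval_price_per_tb: float,
) -> float:
    """Yearly cost = storage fee for everything kept + fee for what is pulled back.

    All prices are caller-supplied assumptions in the same currency.
    """
    storage = stored_tb * price_per_tb_month * 12
    retrieval = retrieved_tb_per_year * retrieval_price_per_tb
    return storage + retrieval


# Placeholder numbers only: 100 TB stored, 5 TB retrieved per year.
cost = annual_cold_storage_cost(
    stored_tb=100,
    price_per_tb_month=1.0,       # hypothetical storage price
    retrieved_tb_per_year=5,
    retrieval_price_per_tb=30.0,  # hypothetical retrieval price
)
print(f"Estimated yearly cost: {cost:.2f}")
```

The more of the archive you expect to pull back each year, the more the retrieval term dominates, which is why cold tiers suit data that is rarely touched.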

If the data is going to be published, you will eventually want to submit it to SRA/ENA/DDBJ, so you can store a copy there. There is a facility to embargo it until publication (or at least 1 year, I think), so you are covered.

14 months ago
United States

My decidedly heterodox position (probably due to my foundational training in molecular biology) is to store the 'cold' data as library DNA in a -80 °C freezer. DNA is a technologically stable and incredibly information-dense platform: a small freezer could easily hold the equivalent of petabytes to exabytes of data at a fraction of the price of digital media. Plus, storage costs for most cold datasets are wasted, in the sense that the data will never be reanalyzed, which makes resequencing the few reusable ones cost-effective.

But I've found that most users of our sequencing facility are strongly opposed to this suggestion - I would be interested to hear feedback from the Biostars community.
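The cost argument above can be sketched as an expected-cost comparison per dataset. This is only an illustration: the function name is hypothetical, all numbers are placeholders, and it ignores freezer running costs, sample degradation, and any library re-preparation.

```python
def resequencing_is_cheaper(
    reuse_fraction: float,
    reseq_cost_per_dataset: float,
    storage_cost_per_dataset: float,
) -> bool:
    """Compare expected resequencing cost with digital storage cost.

    reuse_fraction: share of cold datasets that are ever needed again (0-1).
    storage_cost_per_dataset: digital storage cost over the whole retention period.
    """
    expected_reseq_cost = reuse_fraction * reseq_cost_per_dataset
    return expected_reseq_cost < storage_cost_per_dataset


# Placeholder numbers: rare reuse favours freezing, frequent reuse does not.
print(resequencing_is_cheaper(0.05, 1000.0, 200.0))
print(resequencing_is_cheaper(0.50, 1000.0, 200.0))
```

The comparison only tips toward the freezer when reuse really is rare, so the reuse rate is the number worth estimating first.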


I agree in principle with your solution, but it may only be viable for an individual lab. Sequencing facilities deal with tens of thousands of samples, and storing libraries at -80 °C for years quickly becomes unwieldy.


I think that DNA should be stable long term at room temperature (RT) under proper storage conditions.


You'd be surprised how easy it is. 40K+ clones (two whole-genome RNAi libraries for C. elegans) fit in a couple freezer racks, and we retrieve samples from those regularly (much more frequently than cold datasets).
