Hi all,
I’m trying to compile information on how big labs (e.g. 40+ people) or small organisations that produce large amounts of NGS data manage data storage. As a small organisation, we produce 10-15 TB of HTS data every year from large genomics and metagenomics projects. We often depend on an external HPC cluster service provider to store this data alongside the analyses we run there, and we keep a few more TB on a local server. We have both "active" and "cold" (no longer actively used) datasets from our research projects, and we try to retain them for at least 7 years.
Since many of us are dealing with NGS projects that run to TBs in size, I was wondering:
- How and where do you store your "active" and "cold" HTS data? Do you have an in-house server with TBs of storage, use cloud storage (e.g. Amazon), or something else?
- Is it cost-effective to use cloud-based servers (e.g. Amazon, ASIA) for "cold" data storage and build a small local server for working with "active" data? Any ideas on the cost of building a small cluster/server with TBs of storage?
I have talked to a few colleagues and it sounds like everybody is managing somehow but looking for better options. Since many of us are struggling with HTS data storage and backup, I was wondering whether there are any cost-effective solutions. Many organisations are investing heavily in eResearch (i.e. data science), and many of us already know that genomics data storage is a big issue for researchers across organisations and needs more attention.
We might share our ideas and see if we can follow others’ approaches.
For us, active data is always on high-performance local cluster storage. We are a bigger organization/sequencing center and have access to plenty of storage (not infinite, but hundreds of TB, adequate for ~6 months). We also use a large Quantum tape library that is presented as a storage partition; data copied there automatically goes onto tape. We keep those copies for 3 years.
You can consider cold storage on Google or AWS. Cold cloud storage is cheap while the data just sits there, but you are charged to retrieve it, and that can be expensive. You can also convert data to uBAM or CRAM (the latter if a reference is available) to save space in general.
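To make the storage-versus-retrieval trade-off concrete, here is a rough Python sketch. The prices in it are placeholders I made up, not quotes from any provider, and the function and file names are hypothetical; plug in current rates from your provider's pricing page. It assumes a fixed yearly data volume and a fixed retention period (15 TB/year and 7 years, as in the question). The last function only builds a `samtools view` command for BAM-to-CRAM conversion; it does not run anything.

```python
from dataclasses import dataclass


@dataclass
class ColdStoragePrices:
    """Placeholder prices (assumptions): replace with your provider's real rates."""
    storage_per_tb_month: float  # USD per TB stored per month
    retrieval_per_tb: float      # USD per TB read back out (retrieval plus egress)


def retained_tb(tb_per_year: float, retention_years: int, year: int) -> float:
    """TB held at the end of `year` (1-based) if each year's data is kept
    for `retention_years` and then deleted."""
    return tb_per_year * min(year, retention_years)


def yearly_cost(prices: ColdStoragePrices, stored_tb: float, retrieved_tb: float) -> float:
    """Cost of holding `stored_tb` for 12 months plus pulling `retrieved_tb` back.
    Uses a single volume for the whole year, so it overestimates a little
    while the archive is still growing."""
    storage = stored_tb * prices.storage_per_tb_month * 12
    retrieval = retrieved_tb * prices.retrieval_per_tb
    return storage + retrieval


def cram_command(bam_path: str, reference_fasta: str, cram_path: str) -> list:
    """Build (not run) a samtools command that writes a reference-based CRAM."""
    return ["samtools", "view", "-C", "-T", reference_fasta, "-o", cram_path, bam_path]


if __name__ == "__main__":
    # Made-up example prices, for illustration only.
    prices = ColdStoragePrices(storage_per_tb_month=1.0, retrieval_per_tb=20.0)
    for year in range(1, 9):
        stored = retained_tb(tb_per_year=15, retention_years=7, year=year)
        quiet = yearly_cost(prices, stored, retrieved_tb=0)
        busy = yearly_cost(prices, stored, retrieved_tb=stored)  # restore everything once
        print(f"year {year}: {stored:.0f} TB, no retrieval ${quiet:.0f}, full restore ${busy:.0f}")
    # File names below are hypothetical.
    print(" ".join(cram_command("sample.bam", "ref.fa", "sample.cram")))
```

The point of comparing the two cost columns is that the "cheap" option depends heavily on how often you expect to pull data back. With 7-year retention the archive stops growing at 7 x the yearly volume, so that is the figure to budget for.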
If the data is going to be published, you will eventually want to submit it to SRA/ENA/DDBJ, so you can store a copy there. There is a facility to embargo it until publication (or for at least 1 year, I think), so you are covered.