Question

Tips To Build A Data Storage For Bioinformatics

20

14.1 years ago

Jarretinha 3.4k

Storing large amounts of data will become a problem for bioinformatics sooner or later. I faced this problem recently, and a lot of questions that I had never thought about before surfaced. The most obvious are: How do you choose a filesystem? How do you partition a large (TB range) HD? When is a cheap solution (e.g. a bunch of low-end HDs) inappropriate?

These are pressing issues here in the Brazilian medical community. Everyone wants to buy an NGS machine, mass spec or microarray, but no one anticipates the forthcoming data flood.

In practical terms, how do you store your data? A good reason for a given decision would be great too.

Edit:

I asked this question not so long ago and things got HOT here. They have just finished building a whole facility to deal with cancer. A lot of people have acquired NGS machines, and TB scale seems to be a thing of the past. Now we are discussing what to keep and how to manage the process of data triage/filtering. So I really do need new tips from the community. Is anyone facing a similar problem (too much data)?

Another edit:

Well, things are pretty fast-paced these days. 4TB HDDs are the standard, SSDs are common, and servers with onboard InfiniBand abound. There are also projects with huge throughput (e.g. Genomics England and its presumed 300GB per tumour sample). Annotation has got way too many layers. Outsourcing sequencing is rather common. This question seems a bit obsolete at the moment.

data • 9.9k views

ADD COMMENT • link updated 5 months ago by Ram 43k • written 14.1 years ago by Jarretinha 3.4k

Ram · Answer 1 · 2010-03-26

13

14.1 years ago

Mndoci ★ 1.2k

The number one tenet of storage at scale is "things fail". When you scale up, you will find that 2-5% of your disks are going to fail, and when you have a lot of spindles that's a pretty large number. You have to manage against such failures, so it isn't just about buying TBs of disk. You need to design your systems to fail gracefully. Depending on your applications/goals you might need to make any storage solution highly available, which means you need redundancy, and to scale reads you will almost certainly need to partition your data.
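To put rough numbers on "things fail", here is a minimal sketch. It assumes the 2-5% figure is an annual failure rate and that disks fail independently of each other; neither assumption is stated above, and real failures tend to cluster (same batch, same enclosure), so treat the output as a lower bound on how much trouble to expect.

```python
def expected_failures(n_disks: int, annual_failure_rate: float) -> float:
    """Expected number of failed disks per year (assumes a fixed annual rate)."""
    return n_disks * annual_failure_rate


def prob_at_least_one_failure(n_disks: int, annual_failure_rate: float) -> float:
    """Probability that at least one disk fails within a year.

    Assumes independent failures, which is an idealisation.
    """
    return 1.0 - (1.0 - annual_failure_rate) ** n_disks


if __name__ == "__main__":
    # Array sizes are illustrative, not taken from the thread.
    for n_disks in (12, 100, 1000):
        for rate in (0.02, 0.05):
            print(
                f"{n_disks:5d} disks at {rate:.0%}/year: "
                f"expect {expected_failures(n_disks, rate):5.1f} failures, "
                f"P(at least one) = {prob_at_least_one_failure(n_disks, rate):.2f}"
            )
```

Even a modest array is likely to lose a disk every year under these assumptions, which is why the failure handling has to be designed in rather than bolted on.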

I recommend checking out some of the presentations by Chris Dagdigian, e.g. this one

ADD COMMENT • link updated 5 months ago by Ram 43k • written 14.1 years ago by Mndoci ★ 1.2k

2

When I read through the slides, it struck me how complex storage solutions are (maybe he is exaggerating a bit because they want to sell their own competence?). Anyway, I believe the most crucial part of storage is not the vendor or technology but the competence of the people planning and running it, with full-time sysadmins. The bioinformatician's role is to understand and specify the requirements. Disaster is guaranteed if only one poor bioinformatician is hired both to do the research and to build up the infrastructure.

ADD REPLY • link 14.1 years ago by Michael 54k

0

Wow!!! This Bioteam is very nice. Thank ya, mndoci! I really appreciate case studies. BTW, I'm the poor bioinformatician. Not alone, as we have good IT infra/people. But NGS/arrays/related will hit the diagnostic barrier hard this year. You can imagine what a very large/rich reference hospital will do. Anyway, storage solutions are complex at our scale and for our needs. None of us has the required experience. Our cardio division uses a completely proprietary solution with a proprietary database and still suffers from problems regularly. They didn't get the specific needs. So any tip is handy!

ADD REPLY • link 14.1 years ago by Jarretinha 3.4k

0

Michael, storage solutions are extremely complex and very finicky. There is a reason some of the big storage vendors can charge as much as they do: they are essentially selling performance and reliability as a brand. At scale, though, that starts breaking down and you are better served by commodity hardware with the software layer handling failure. And yes, you can't live in a world with a non-expert handling your infrastructure needs.

ADD REPLY • link 14.1 years ago by Mndoci ★ 1.2k

Ram · Answer 2 · 2010-03-24

3

14.1 years ago

Andrew Su 4.9k

You might check out this blog post about using Amazon Web Services for analysis of NGS data.

Not directly about data storage per se, but certainly your NGS analysis strategy affects your data storage needs...

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 14.1 years ago by Andrew Su 4.9k

Ram · Answer 3 · 2010-03-24

3

14.1 years ago

Michael 54k

You are very much right, and secure and reliable storage is already a problem. I'm not a sysadmin person but got some insight from my work. Here is a little overview of how it might have looked in 2007 for a medium-size setup (this is a bit dated, but I am sure it has just become much bigger ;) ).

BTW: I don't believe there is much difference in requirements between bioinformatics storage and any other large scientific/business data. So, to get recommendations about top brands in storage, better ask on a sysadmin board; I guess there are some.

Maybe most important about storage:

Storage is nothing without a proper backup solution. The backup systems are often much more expensive than the disks, because you need some overhead for incremental backups. And that needs to be taken care of by some admin.

A tape archive system could also be used to store rarely used data.

If possible, use an extensible solution; at that time that was a hot-swappable RAID (5?? or so) disk array. Why hot-swappable? Because disks fail, and then it's nice to be able to replace them.

For data transfer of terabytes, fast connections are needed, and at that time that meant Fibre Channel. Redundant file servers are also nice to have. As you are working in a hospital, there might even be sensitive personal data, so you might have to think about cryptographic file systems.
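As a back-of-the-envelope illustration of why the connection matters, here is a minimal sketch that estimates transfer time for a given data size and link speed. The link speeds in the loop and the 70% efficiency factor are illustrative assumptions, not measurements; real throughput depends on protocol overhead, disks and concurrent load.

```python
def transfer_hours(size_tb: float, link_gbit_per_s: float, efficiency: float = 0.7) -> float:
    """Hours needed to move size_tb terabytes over a link.

    Uses decimal units: 1 TB = 8e12 bits, 1 Gbit/s = 1e9 bits/s.
    `efficiency` is the assumed fraction of nominal bandwidth actually achieved.
    """
    if link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    bits = size_tb * 8e12
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600.0


if __name__ == "__main__":
    # Nominal link speeds in Gbit/s, chosen only for illustration.
    for speed in (0.1, 1.0, 4.0, 10.0):
        print(f"1 TB over {speed:4.1f} Gbit/s: about {transfer_hours(1.0, speed):.1f} h")
```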

Of course this is sort of a maximum scenario and one can work with less. On my small Linux machines I am using ext3 on a few TB without many problems; UFS has also been very robust on FreeBSD and Mac.

But as you say, your institute bought one or more highly expensive machines, so the follow-up costs must be considered. If there is a research grant application for a new -omics machine (you can call it a Ferrari too), then you have to be willing to pay for the fuel, sorry, the infrastructure and sysadmins.

ADD COMMENT • link updated 5 months ago by Ram 43k • written 14.1 years ago by Michael 54k

1

A site like BioStar and StackOverflow for sysadmins is http://serverfault.com/

ADD REPLY • link updated 5 months ago by Ram 43k • written 14.1 years ago by Jeroen Van Goey 2.3k

0

This is a great and nice answer! Thanks Michael! But top brands are underrepresented in Brazil. Most solution specialists are friends of mine from grad/undergrad times. For instance, IBM regularly hires people like myself for solution deployment, validation and maintenance. There are very few (official) HPC vendors, too. Our connection is already fibre (nice!!!). It's good to hear that in the end bioinformatics data are not so different. I'll talk with other sysadmins right now. The system accepts hot-swap already. Using ext4. I'm not very comfortable with cryptography. Any recommendation?

ADD REPLY • link 14.1 years ago by Jarretinha 3.4k

0

Hmm, maybe not the worst idea to buy a ready-made rack from e.g. HAL, Sun/Oracle, but of course it may be hard to find a retailer you trust. Regarding bioinformatics data, maybe the main difference in the future (high-throughput sequencing, -omics) is that one has to decide which data one can afford to store and which one can afford to throw away. That's the bioinformatics decision. Also, transferring a TB of data from sequencer to storage is a problem. Some sequencers (e.g. 454 Titanium) have compute clusters attached that do the processing and (temporary) storage.

ADD REPLY • link 14.1 years ago by Michael 54k

0

Oh, and btw, no experience with cryptographic filesystems except existence proof. I have nothing to hide ;)

ADD REPLY • link 14.1 years ago by Michael 54k

0

I really prefer the cluster way to solve the question, i.e., to build the solution around the data. As I said, HPC vendors in Brazil are lame. It's much easier to deploy your own solution, even with abundant funding. Data triage and other biocuration tasks will be addressed soon. I think that transfer will not be much of a problem. But, long term storage for legal reasons will be a pain in the ass, certainly.

ADD REPLY • link 14.1 years ago by Jarretinha 3.4k

Ram · Answer 4 · 2010-05-19

2

13.9 years ago

Sean Davis 26k

Here is what we have put together for dealing with four Illumina GAII machines.

http://watson.nci.nih.gov/infrastructure.html

In addition, we mirror hot data off-site via rsync and backup to tape (single tape drive).
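For readers who want to try the rsync mirroring idea, here is a minimal sketch, not the actual setup described above. The paths and the host name are hypothetical placeholders; the rsync flags used (`-a`, `--delete`, `--partial`, `--dry-run`) are standard options. The sketch defaults to a dry run so nothing is copied or deleted until you switch that off.

```python
import subprocess
from typing import List


def build_rsync_command(src_dir: str, dest: str, dry_run: bool = True) -> List[str]:
    """Build an rsync command that mirrors src_dir to dest.

    -a         archive mode (recursive, preserves times, permissions, links)
    --delete   remove files at the destination that no longer exist at the source
    --partial  keep partially transferred files so large transfers can resume
    """
    cmd = ["rsync", "-a", "--delete", "--partial"]
    if dry_run:
        cmd.append("--dry-run")  # only report what would change
    # Trailing slash: copy the contents of src_dir, not the directory itself.
    cmd += [src_dir.rstrip("/") + "/", dest]
    return cmd


def mirror(src_dir: str, dest: str, dry_run: bool = True) -> None:
    """Run the mirror; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(src_dir, dest, dry_run), check=True)


if __name__ == "__main__":
    # Hypothetical source directory and off-site destination.
    print(" ".join(build_rsync_command("/data/hot", "backup@offsite.example.org:/mirror/hot")))
```

Note that `--delete` makes the destination a true mirror, so a file deleted by mistake at the source also disappears off-site on the next run. That is one reason a mirror complements a tape backup rather than replacing it.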

ADD COMMENT • link updated 5 months ago by Ram 43k • written 13.9 years ago by Sean Davis 26k

0

Very nice example!!! I'm studying it right now. By the way, thanks for the disown tip, I didn't know about it :)

ADD REPLY • link 13.9 years ago by Jarretinha 3.4k

score 1 · Answer 5 · 2010-03-24

I'm not an expert sysadmin, but here's something to consider: If you're connecting a file server to a cluster of compute nodes, don't forget to provide scratch space on each compute machine. This will allow users to do I/O intensive operations locally, rather than saturating the network and your disks with requests.
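The scratch-space pattern can be sketched roughly as follows: copy the input from the shared file server to node-local disk once, do all the heavy I/O there, and copy only the final result back. The function name `run_on_scratch`, the `work` callback and the `.out` suffix are hypothetical choices for this illustration; by default the scratch directory is whatever `tempfile` picks (e.g. `$TMPDIR`), which is assumed to be on local disk.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_on_scratch(
    shared_input: Path,
    shared_output_dir: Path,
    work: Callable[[Path, Path], None],
    scratch_root: Optional[Path] = None,
) -> Path:
    """Stage input to local scratch, run work(local_in, local_out), copy result back."""
    shared_output_dir.mkdir(parents=True, exist_ok=True)
    # The temporary directory is removed automatically, even if work() raises.
    with tempfile.TemporaryDirectory(dir=scratch_root) as tmp:
        tmp_dir = Path(tmp)
        local_in = tmp_dir / shared_input.name
        local_out = tmp_dir / (shared_input.name + ".out")
        shutil.copy2(shared_input, local_in)  # one sequential read from the file server
        work(local_in, local_out)             # intensive I/O stays on the local disk
        final = shared_output_dir / local_out.name
        shutil.copy2(local_out, final)        # one sequential write back
    return final


if __name__ == "__main__":
    # Self-contained demo using temporary directories in place of a real file server.
    with tempfile.TemporaryDirectory() as demo:
        demo_dir = Path(demo)
        reads = demo_dir / "reads.txt"
        reads.write_text("acgt\n")

        def to_upper(src: Path, dst: Path) -> None:
            dst.write_text(src.read_text().upper())

        result = run_on_scratch(reads, demo_dir / "results", to_upper)
        print(result.read_text(), end="")
```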