Question

NGS data storage and management

1

6.6 years ago

olavur ▴ 150

I'm looking for a system to store human NGS data and metadata, and to retrieve data. We have a storage server with a proper distributed filesystem (Isilon OneFS).

There are some other posts discussing this topic, for example:

But I wanted to make a new post because (1) those posts are several years old, and I imagine practices are different today, and (2) they discuss file formats and distributed file systems a lot, while I'm more interested in ways to access data.

I would like to have a system, preferably with a GUI (browser is also fine), where I can search for an individual (pseudonym ID), and retrieve their data:

Raw NGS data (FASTQ)
Aligned reads (BAM)
Variants (VCF)
Metadata, for example whether the individual is part of a trio, whether they were sequenced more than once, how they were sequenced, etc.

I also want to be able to retrieve data (VCF or BAM or whatever is specified) from a list of individual IDs.
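To make the lookup described above concrete, here is a minimal sketch of a file index, assuming Python with the standard-library `sqlite3` module. It is not an existing tool: the table names, columns and function names (`build_index`, `files_for`, `times_sequenced`, and so on) are all hypothetical, and the index stores only paths, with the actual FASTQ/BAM/VCF files staying on the storage server.

```python
import sqlite3

def build_index(conn):
    """Create two hypothetical tables: one row per individual, one row per file."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS individual (
            pseudonym_id TEXT PRIMARY KEY,
            trio_id      TEXT,   -- NULL if not part of a trio
            platform     TEXT    -- how the individual was sequenced
        );
        CREATE TABLE IF NOT EXISTS data_file (
            pseudonym_id TEXT NOT NULL REFERENCES individual(pseudonym_id),
            file_type    TEXT NOT NULL CHECK (file_type IN ('FASTQ', 'BAM', 'VCF')),
            run_id       TEXT NOT NULL,
            path         TEXT NOT NULL
        );
    """)

def add_individual(conn, pseudonym_id, trio_id=None, platform=None):
    conn.execute("INSERT INTO individual VALUES (?, ?, ?)",
                 (pseudonym_id, trio_id, platform))

def add_file(conn, pseudonym_id, file_type, run_id, path):
    conn.execute("INSERT INTO data_file VALUES (?, ?, ?, ?)",
                 (pseudonym_id, file_type, run_id, path))

def files_for(conn, ids, file_type):
    """Return (pseudonym_id, path) pairs of one file type for a list of IDs."""
    ids = list(ids)
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        "SELECT pseudonym_id, path FROM data_file "
        f"WHERE file_type = ? AND pseudonym_id IN ({placeholders}) "
        "ORDER BY pseudonym_id, path",
        [file_type, *ids],
    )
    return rows.fetchall()

def times_sequenced(conn, pseudonym_id):
    """Count distinct sequencing runs recorded for one individual."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT run_id) FROM data_file WHERE pseudonym_id = ?",
        (pseudonym_id,),
    ).fetchone()
    return row[0]
```

With this layout, something like `files_for(conn, ["P001", "P002"], "VCF")` would return the VCF paths for a list of IDs, and a GUI or browser front end could sit on top of the same queries.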

Some nice-to-haves:

Retrieve variants from individual lists in specified gene(s), loci or type of variation.
Incorporating genome browsers such as ExAC.
Or a different kind of genome browser like IGV.
Familial relationships, for example as in Family Genome Browser (FGB)
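The first nice-to-have (variants within a locus, or of a given type) can be sketched as a plain filter over VCF text lines. This is only an illustration of the logic in standard-library Python; the function names are hypothetical, it assumes uncompressed, tab-separated VCF records with a single ALT allele, and it ignores symbolic alleles. In practice an indexed VCF and an established tool would be the more sensible route.

```python
def variants_in_region(vcf_lines, chrom, start, end):
    """Yield (chrom, pos, ref, alt) for records with POS in [start, end].

    Coordinates are 1-based and inclusive, as in the VCF POS column.
    Header lines (starting with '#') and blank lines are skipped.
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        rec_chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        if rec_chrom == chrom and start <= pos <= end:
            yield rec_chrom, pos, ref, alt

def variant_type(ref, alt):
    """Crude classification, assuming one ALT allele and no symbolic alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) != len(alt):
        return "indel"
    return "MNV"  # same length, more than one base
```

Passing an open file handle as `vcf_lines` works, since the function only iterates over lines.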

Some examples of software I am unsure of:

BASE
HDF/BioHDF

Any input on this topic would be greatly appreciated.

next-gen NGS databases genome browsers • 4.2k views

ADD COMMENT • link 6.6 years ago by olavur ▴ 150

1

Are you willing to purchase or looking for a free solution?

ADD REPLY • link 6.6 years ago by GenoMax 141k

0

Purchasing is an option.

ADD REPLY • link 6.6 years ago by olavur ▴ 150

1

While a system of this type sounds simple, getting freeware or an off-the-shelf commercial solution to fit your internal business practices can easily become a huge pain in the you know what. Most times this is because of unwillingness of locals to change their business practices/inability to map existing practices onto a ready-made solution. This is guaranteed to cause pain for many unless you have plenty of resources (i.e. developers) to throw at this.

Looking at your user profile you seem to be at an institution that is in this for the long term. So if you have internal developer resources, then putting a solution together that fits your needs (keeping very simple/realistic goals, which is extremely important) may prove to be the best solution.

Also take a look at this old thread: Is there a Lims that doesn't suck? Issues mentioned in that thread (unfortunately) remain current. But it does have useful information about various packages.

ADD REPLY • link 6.6 years ago by GenoMax 141k

0

Most times this is because of unwillingness of locals to change their business practices/inability to map existing practices onto a ready-made solution.

We are sort of building everything from the ground up, so I don't know how much we have to adapt to existing practices.

This is guaranteed to cause pain for many unless you have plenty of resources (i.e. developers) to throw at this.

We don't have a lot of resources, so we can't expect to develop a complex system for ourselves.

So if you have internal developer resources, then putting a solution together that fits your needs (keeping very simple/realistic goals, which is extremely important) may prove to be the best solution.

If there is no suitable system we can buy, then perhaps we need to consider developing our own. But most likely I will have to do it myself, so it will have to be very simple.

ADD REPLY • link 6.6 years ago by olavur ▴ 150

1

I think you should separate raw data from accessible information. The raw data can be stored, for instance as BAM files, on a slow file system. The accessible data would be the SNPs, coverage, etc. It should be pre-computed, or computed on request, and then loaded into the database. In my opinion there is no reason to keep BAM files readily accessible. You obviously either narrow your analysis results down to pre-defined questions or have to wait some time for the relevant data to be generated, but the saving in fast storage is huge. If you are looking for a commercial solution you can check out SQREAM; they have (or at least had) dedicated solutions for systems like the one you described. Good luck.
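One way to picture the "pre-compute, then load into the database" idea is the sketch below, again assuming Python's standard-library `sqlite3`. The `variant` table, its columns and the function names are hypothetical, and a real deployment would likely need genotypes, annotations and a more capable database; the point is only that queries hit a small indexed table while the BAM files stay on slow storage.

```python
import sqlite3

def create_variant_table(conn):
    """Hypothetical table of pre-computed variant calls, indexed by locus."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS variant (
            pseudonym_id TEXT    NOT NULL,
            chrom        TEXT    NOT NULL,
            pos          INTEGER NOT NULL,
            ref          TEXT    NOT NULL,
            alt          TEXT    NOT NULL
        );
        CREATE INDEX IF NOT EXISTS variant_locus ON variant (chrom, pos);
    """)

def load_variants(conn, pseudonym_id, records):
    """Insert (chrom, pos, ref, alt) tuples for one individual."""
    conn.executemany(
        "INSERT INTO variant VALUES (?, ?, ?, ?, ?)",
        [(pseudonym_id, chrom, pos, ref, alt) for chrom, pos, ref, alt in records],
    )
    conn.commit()

def carriers(conn, chrom, start, end):
    """Return sorted IDs of individuals with any variant in [start, end]."""
    rows = conn.execute(
        "SELECT DISTINCT pseudonym_id FROM variant "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? "
        "ORDER BY pseudonym_id",
        (chrom, start, end),
    )
    return [row[0] for row in rows]
```

The trade-off is the one described above: only questions the table was designed for can be answered quickly, and anything else means going back to the raw files.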

ADD REPLY • link 6.6 years ago by Asaf 10k

0

Good point, accessing sequence formats such as FASTQ and BAM will be rare, but not non-existent.

ADD REPLY • link 6.6 years ago by olavur ▴ 150

0

PathOS has some of the functionality you are looking for. You can search for patients (maybe their metadata too?), VCFs are displayed, and IGV is incorporated for displaying aligned reads. Their paper here.

Also, Molgenis and its NGS modules might be of use.

ADD REPLY • link 6.6 years ago by Robert Sicko ▴ 630

0

Thanks, Molgenis is exactly the kind of thing I need. It seems to have very advanced data management features, and is also geared towards biobanks.

ADD REPLY • link 6.6 years ago by olavur ▴ 150

0

However, it seems very complex, and since it is an open-source and most likely government-funded project, I'm not sure I can expect much in terms of stability and long-term support.

ADD REPLY • link 6.6 years ago by olavur ▴ 150

0

I'm not sure I can expect much in terms of stability and long-term support.

That is a given for pretty much all software. That is one of the reasons one is expected to pay for the value-add that a supporting entity guarantees, even though the software itself may be free.

ADD REPLY • link 6.6 years ago by GenoMax 141k