Why aren't relational DBs used more for genomic data?
9.2 years ago
brentp 24k

A non-biologist asked me why this field doesn't use relational DBs more. They have built-in indexing, nice query support, and are widely used. In addition, a database like Postgres allows custom data types and indexes, so interval queries could be optimized.
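For instance, here's a rough sketch of what that might look like with Postgres range types and a GiST index (table and column names are invented for illustration):

    -- Hypothetical feature table; the built-in int4range type plus a
    -- GiST index handles interval-overlap queries.
    CREATE EXTENSION IF NOT EXISTS btree_gist;  -- lets the index cover chrom too

    CREATE TABLE features (
        id    serial    PRIMARY KEY,
        chrom text      NOT NULL,
        span  int4range NOT NULL          -- [start, end)
    );
    CREATE INDEX features_span_idx ON features USING gist (chrom, span);

    -- "What overlaps chr1:1,000,000-2,000,000?"
    SELECT *
    FROM   features
    WHERE  chrom = 'chr1'
    AND    span && int4range(1000000, 2000000);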

My only thought was the sheer number of records: there could be billions even for a single-sample BAM. Once you get to hundreds of samples, that becomes a lot of rows. I don't know the practical limit for an RDB.

What are the other reasons?

Why couldn't we be using a relational database instead of BAM?

data sql database
9.2 years ago
Ram 43k

These are the reasons I see:

  • Normalization is impractical
  • Quite a few non-trivial datatypes
  • Computed fields need to be stored, because computation is more expensive than storage
  • Biological data has extremely flexible schemas
  • Data is more transient than the transaction data typically stored in RDBs
  • We rarely need granular security on the underlying files
  • An RDB is a layer added on top of files, and in our field that makes processes slower: we would spend longer getting data into relational shape than actually using it
  • Concurrent per-record write access (and locking hierarchies) is rarely necessary
  • Most of our operations involve at least a GROUP BY, which makes structured stores with map-reduce frameworks more relevant (see the sketch after this list)
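
For a sense of what that aggregation step looks like, here is a hypothetical per-gene roll-up over a read-count table (all names invented):

    -- Hypothetical per-(sample, gene) read-count table; almost every
    -- downstream step starts with an aggregation like this.
    SELECT   gene_id,
             count(DISTINCT sample_id) AS n_samples,
             avg(read_count)           AS mean_count
    FROM     gene_counts
    GROUP BY gene_id
    HAVING   avg(read_count) > 10
    ORDER BY mean_count DESC;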

That being said, relational DBs can be really useful for reference datasets such as 1000 Genomes or ESP. Storing commonly accessed values (such as population-specific allele counts and frequencies) and indexing them makes lookups easy and automation-friendly.
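As a sketch of that kind of lookup table (schema and example values invented):

    -- Minimal allele-frequency lookup table for a reference call set.
    CREATE TABLE allele_freq (
        chrom text    NOT NULL,
        pos   integer NOT NULL,
        ref   text    NOT NULL,
        alt   text    NOT NULL,
        pop   text    NOT NULL,   -- e.g. EUR, AFR, EAS
        ac    integer NOT NULL,   -- allele count
        an    integer NOT NULL,   -- total alleles observed
        af    real    NOT NULL,   -- allele frequency
        PRIMARY KEY (chrom, pos, ref, alt, pop)
    );

    -- Point lookups become trivial to automate:
    SELECT pop, ac, an, af
    FROM   allele_freq
    WHERE  chrom = '17' AND pos = 41245466 AND ref = 'G' AND alt = 'A';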

Addressing your example: BAM is binary. I'd either store it in a BLOB field or convert it to SAM and store it in a normal relation, compromising both size and access speed. And I use a BAM perhaps a few times, for variant calling, coverage analysis and expression analyses. I don't see a huge gain in a relational BAM format.

Another point is the difference between structured and relational data. BAM is structured, as most formats in bioinformatics are, but I have yet to see truly relational data in the field.


In the case of BAM, it does seem like you could get a lot of the benefit of the query language and engine by storing the data. And it wouldn't have to be text format: you'd store the start, end, etc. as ints, and you could store the seq as bytes...

You could also refer to another alignment in the same table to associate pairs of reads.
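
Something like this hypothetical schema, with a self-reference for mate pairs (names invented):

    -- Hypothetical per-read alignment table; mate_id points back into the
    -- same table so read pairs can be associated with a self-join.
    CREATE TABLE alignments (
        id      bigserial PRIMARY KEY,
        chrom   text,
        pos     integer,            -- 0-based start, as in BAM
        end_pos integer,
        flag    integer,
        mapq    smallint,
        seq     bytea,              -- packed bases rather than text
        mate_id bigint REFERENCES alignments (id)
    );

    -- Pair up reads via a self-join:
    SELECT a.id, a.pos AS read_pos, b.pos AS mate_pos
    FROM   alignments a
    JOIN   alignments b ON b.id = a.mate_id;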

But yeah, I agree for the most part.


Traditional RDB solutions are not efficient enough. Traversing a BAM is as fast as gzip decompression, while reading through an RDB table can be much slower (see leveldb vs. sqlite). Random access to a BAM mostly requires a single seek; doing the same with an RDB incurs many seeks, which can be very slow on some filesystems. A BAM can be streamed; an RDB cannot. A BAM can be moved around easily; heavyweight RDB implementations cannot (sqlite is an exception). BAM is also much simpler. The major advantage of an RDB over BAM is that the data is mutable, but modifying a table frequently may make performance even worse. In the very early days of NGS, people tried to put alignments into an RDBMS in exactly the way you describe (it is quite natural to think of this approach). If it had been a success, there wouldn't be a BAM format.


Interesting. That could indeed be done. But I largely think of how we store genotypes and loci in VCF, and then dream of a relational database per VCF file: not one table, but a set of tables, because VCF is 3D.
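
A rough sketch of that decomposition, assuming the three axes are sites, samples and per-genotype fields (all names invented):

    -- One VCF as three relations: sites x samples x genotype fields.
    CREATE TABLE variants (
        variant_id serial PRIMARY KEY,
        chrom      text    NOT NULL,
        pos        integer NOT NULL,
        ref        text    NOT NULL,
        alt        text    NOT NULL
    );

    CREATE TABLE samples (
        sample_id serial PRIMARY KEY,
        name      text UNIQUE NOT NULL
    );

    CREATE TABLE genotypes (
        variant_id integer REFERENCES variants,
        sample_id  integer REFERENCES samples,
        gt         text,       -- e.g. '0/1'
        dp         integer,    -- per-sample depth
        gq         integer,    -- genotype quality
        PRIMARY KEY (variant_id, sample_id)
    );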

9.2 years ago
Mitch Skinner ▴ 660

Aside from what you and RamRS mentioned, the fact that you can transfer BAMs without an ETL process is convenient. Also, there are some hybrid approaches: at one place where I worked, they built a postgres foreign data wrapper for BAMs, so you could query them with an SQL select (and do joins between a BAM and other tables in the database). It was slow, though, partly because it was written in python and partly due to limitations on predicate pushdown with postgres FDWs. And it was kind of insecure, because you had to be a superuser to use the BAM FDW (only superusers are allowed to read arbitrary files on the postgres server).
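
The wrapper, server and column names below are invented, but the idea was queries along these lines:

    -- Hypothetical foreign table exposing one BAM file through the FDW.
    CREATE FOREIGN TABLE sample1_bam (
        qname text,
        flag  integer,
        chrom text,
        pos   integer,
        mapq  integer
    ) SERVER bam_server OPTIONS (filename '/data/sample1.bam');

    -- Join alignments against an ordinary table of regions of interest.
    SELECT   r.gene, count(*) AS n_reads
    FROM     sample1_bam b
    JOIN     regions r
             ON b.chrom = r.chrom AND b.pos BETWEEN r.start AND r.stop
    GROUP BY r.gene;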

Re: optimizing interval queries using a custom postgres data type, there's bioseg.

