Why aren't relational DBs used more for genomic data?
9.2 years ago
brentp 24k

A non-biologist asked me why this field doesn't use relational DBs more. They have built-in indexing, nice query support, and are widely used. In addition, a database like Postgres allows custom data types and indexes, so interval queries could be optimized.
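For instance, here's a rough sketch of what that might look like with Postgres range types and a GiST index (table and column names are invented for illustration):

    -- Hypothetical feature table; the built-in int4range type plus a
    -- GiST index handles interval-overlap queries.
    CREATE EXTENSION IF NOT EXISTS btree_gist;  -- lets the index cover chrom too

    CREATE TABLE features (
        id    serial    PRIMARY KEY,
        chrom text      NOT NULL,
        span  int4range NOT NULL          -- [start, end)
    );
    CREATE INDEX features_span_idx ON features USING gist (chrom, span);

    -- "What overlaps chr1:1,000,000-2,000,000?"
    SELECT *
    FROM   features
    WHERE  chrom = 'chr1'
    AND    span && int4range(1000000, 2000000);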

My only thought was the sheer number of records: there could be billions even for a single-sample BAM. Once you get to hundreds of samples, that becomes a lot of rows. I don't know the practical limit for an RDB.

What are the other reasons?

Why couldn't we be using a relational database instead of BAM?

data sql database
9.2 years ago
Ram 43k

These are the reasons I see:

  • Normalization is impractical
  • Quite a few non-trivial datatypes
  • Computed fields need to be stored, because computation is more expensive than storage
  • Biological data has extremely flexible schemas
  • Data is more transient than the transaction data typically stored in RDBs
  • We rarely need granular security on the underlying files
  • An RDB is a layer added on top of files, and in our field that makes processes slower: we would spend longer getting data into relational shape than actually using it
  • Concurrent per-record write access (and locking hierarchies) is rarely necessary
  • Most of our operations involve at least a GROUP BY, which makes structured stores with map-reduce frameworks more relevant (see the sketch after this list)
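
For a sense of what that aggregation step looks like, here is a hypothetical per-gene roll-up over a read-count table (all names invented):

    -- Hypothetical per-(sample, gene) read-count table; almost every
    -- downstream step starts with an aggregation like this.
    SELECT   gene_id,
             count(DISTINCT sample_id) AS n_samples,
             avg(read_count)           AS mean_count
    FROM     gene_counts
    GROUP BY gene_id
    HAVING   avg(read_count) > 10
    ORDER BY mean_count DESC;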

That being said, relational DBs can be really useful for reference datasets such as 1000 Genomes or ESP. Storing commonly accessed values (such as population-specific allele counts and frequencies) and indexing them makes lookups easy and automation-friendly.
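As a sketch of that kind of lookup table (schema and example values invented):

    -- Minimal allele-frequency lookup table for a reference call set.
    CREATE TABLE allele_freq (
        chrom text    NOT NULL,
        pos   integer NOT NULL,
        ref   text    NOT NULL,
        alt   text    NOT NULL,
        pop   text    NOT NULL,   -- e.g. EUR, AFR, EAS
        ac    integer NOT NULL,   -- allele count
        an    integer NOT NULL,   -- total alleles observed
        af    real    NOT NULL,   -- allele frequency
        PRIMARY KEY (chrom, pos, ref, alt, pop)
    );

    -- Point lookups become trivial to automate:
    SELECT pop, ac, an, af
    FROM   allele_freq
    WHERE  chrom = '17' AND pos = 41245466 AND ref = 'G' AND alt = 'A';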

Addressing your example: BAM is binary. I'd either store it in a BLOB field or convert it to SAM and store it in a normal relation, compromising both size and access speed. And I use a BAM perhaps a few times, for variant calling, coverage analysis and expression analyses. I don't see a huge gain in a relational BAM format.

Another point is the difference between structured and relational data. BAM is structured, as most formats in bioinformatics are, but I have yet to see truly relational data in the field.


In the case of BAM, it does seem like you could get a lot of the benefit of the query language and engine by storing the data. And it wouldn't have to be text format: you'd store the start, end, etc. as ints, and you could store the seq as bytes...

You could also refer to another alignment in the same table to associate pairs of reads.
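
Something like this hypothetical schema, with a self-reference for mate pairs (names invented):

    -- Hypothetical per-read alignment table; mate_id points back into the
    -- same table so read pairs can be associated with a self-join.
    CREATE TABLE alignments (
        id      bigserial PRIMARY KEY,
        chrom   text,
        pos     integer,            -- 0-based start, as in BAM
        end_pos integer,
        flag    integer,
        mapq    smallint,
        seq     bytea,              -- packed bases rather than text
        mate_id bigint REFERENCES alignments (id)
    );

    -- Pair up reads via a self-join:
    SELECT a.id, a.pos AS read_pos, b.pos AS mate_pos
    FROM   alignments a
    JOIN   alignments b ON b.id = a.mate_id;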

But yeah, I agree for the most part.


Traditional RDB solutions are not efficient enough. Traversing a BAM is as fast as gzip decompression, while reading through an RDB table can be much slower (see leveldb vs. sqlite). Random access to a BAM mostly requires a single seek; doing the same with an RDB incurs many seeks, which can be very slow on some filesystems. A BAM can be streamed; an RDB cannot. A BAM can be moved around easily; heavyweight RDB implementations cannot (sqlite is an exception). BAM is also much simpler. The major advantage of an RDB over BAM is that the data is mutable, but modifying a table frequently may make performance even worse. In the very early days of NGS, people tried to put alignments into an RDBMS in exactly the way you describe (it is quite natural to think of this approach). If it had been a success, there wouldn't be a BAM format.


Interesting. That could indeed be done. But I largely think of how we store genotypes and loci in VCF, and then dream of a relational database per VCF file: not one table, but a set of tables, because VCF is 3D.
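
A rough sketch of that decomposition, assuming the three axes are sites, samples and per-genotype fields (all names invented):

    -- One VCF as three relations: sites x samples x genotype fields.
    CREATE TABLE variants (
        variant_id serial PRIMARY KEY,
        chrom      text    NOT NULL,
        pos        integer NOT NULL,
        ref        text    NOT NULL,
        alt        text    NOT NULL
    );

    CREATE TABLE samples (
        sample_id serial PRIMARY KEY,
        name      text UNIQUE NOT NULL
    );

    CREATE TABLE genotypes (
        variant_id integer REFERENCES variants,
        sample_id  integer REFERENCES samples,
        gt         text,       -- e.g. '0/1'
        dp         integer,    -- per-sample depth
        gq         integer,    -- genotype quality
        PRIMARY KEY (variant_id, sample_id)
    );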

9.2 years ago
Mitch Skinner ▴ 660

Aside from what you and RamRS mentioned, the fact that you can transfer BAMs without an ETL process is convenient. Also, there are some hybrid approaches: at one place where I worked, they built a postgres foreign data wrapper for BAMs, so you could query them with an SQL select (and do joins between a BAM and other tables in the database). It was slow, though, partly because it was written in python and partly due to limitations on predicate pushdown with postgres FDWs. And it was kind of insecure, because you had to be a superuser to use the BAM FDW (only superusers are allowed to read arbitrary files on the postgres server).
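
The wrapper, server and column names below are invented, but the idea was queries along these lines:

    -- Hypothetical foreign table exposing one BAM file through the FDW.
    CREATE FOREIGN TABLE sample1_bam (
        qname text,
        flag  integer,
        chrom text,
        pos   integer,
        mapq  integer
    ) SERVER bam_server OPTIONS (filename '/data/sample1.bam');

    -- Join alignments against an ordinary table of regions of interest.
    SELECT   r.gene, count(*) AS n_reads
    FROM     sample1_bam b
    JOIN     regions r
             ON b.chrom = r.chrom AND b.pos BETWEEN r.start AND r.stop
    GROUP BY r.gene;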

Re: optimizing interval queries using a custom postgres data type, there's bioseg.

