Forum: Fundamentals Of Designing Biological Databases?
10
12
13.4 years ago
anon1234z ▴ 130

I'm currently a postdoc with a solid background in bioinformatics, and I've gotten by for the last few years by storing everything I use in flat files. So far it hasn't been a problem, but I can't help feeling like I should know more about databases so that I can use them where appropriate.

Sure, I've used databases. I can write basic SQL statements (select * from blah) and insert or delete rows, and I've used that to build uber-simple web apps in the past. Where I'm deficient is in understanding how to design databases well. Which columns should be indexed? What's all this stuff about normalization? Why would I prefer one database system over another (MySQL, SQLite, MongoDB, etc.)? Which are most commonly used for bioinformatics work, and in which subdomains (genomics, proteomics, MD, etc.)?

I'd like the community's help in finding some good tutorials that will get an experienced coder and bioinformatician up to speed in the minimum amount of time possible. Bonus points if they specifically address the types of big-data problems that we're all facing in the new high-throughput world of bioinformatics.

database • 16k views
7

Your question would be a lot easier to answer (and research) if you reduced its scope. The expression "biological databases" is too generic; one would almost certainly use different techniques for storing intervals vs. storing reads or microarray data, etc.

0

The most important question is what problems you want to solve; learning purely for the sake of learning will not teach you much. Are you going to design a database targeting the NAR database issue? If not, do not waste your time on SQL.

8
13.4 years ago
Gww ★ 2.7k

There are some outstanding tips in this article on Stack Overflow. It's a nice summary of common mistakes people make when designing databases.

7
13.4 years ago
Andrea_Bio ★ 2.8k

Here is a link to a wikibook called Bioinformatics data management, which explains ER theory and normalisation and has some exercises.

As for indexes, that really does depend on your database. As a basic example, suppose you have a database storing millions of SNPs, with a SNPs table whose fields include chromosome and locus representing the location of each SNP, and you want to run a query like 'select * from SNPs where chromosome=10 and locus > 1000000'. Here you want to index those fields so the DBMS can go straight to the SNPs on chromosome 10 and find those after base 1 million, rather than wading through all the SNPs 'before' that point. As a rule of thumb: if you imagine doing the query manually and think you would find it easier with the field indexed, then index it. Some people say you should index every primary key, every foreign key, and every column that appears in a where clause.
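
For illustration, here's a minimal sketch of that setup (table and column definitions are made up, MySQL-flavoured syntax):

    -- Hypothetical SNP table; names and types are illustrative only.
    CREATE TABLE snps (
        snp_id     INT UNSIGNED PRIMARY KEY,
        chromosome TINYINT UNSIGNED NOT NULL,
        locus      INT UNSIGNED NOT NULL
    );

    -- Composite index matching the WHERE clause: the equality column
    -- (chromosome) first, then the range column (locus), so the engine can
    -- seek straight to chromosome 10 and scan forward from locus 1,000,000.
    CREATE INDEX idx_snps_chrom_locus ON snps (chromosome, locus);

    SELECT * FROM snps WHERE chromosome = 10 AND locus > 1000000;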

0

Indexing primary keys is mandatory, at least in the databases I'm familiar with. Indexing all foreign keys is just wrong advice: indexes should be added only if they improve performance, and databases usually have a rich set of tools for measuring performance.
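
For example, most engines can show whether a query would actually use an index; a quick sketch against the hypothetical snps table from the answer above (MySQL syntax; other engines have EXPLAIN or EXPLAIN PLAN variants):

    -- Shows the chosen access path, e.g. whether idx_snps_chrom_locus is used
    -- or the engine falls back to a full table scan.
    EXPLAIN SELECT * FROM snps WHERE chromosome = 10 AND locus > 1000000;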

0

"Indexing all foreign keys is just wrong advice". Yes. Especially, if one works with MyISAM table (the most popular mySQL engine) where a foreign key has no meaning for the engine.

0

That depends on your database; I never said it applied to every database. From the MySQL reference manual:

"InnoDB requires indexes on foreign keys and referenced keys so that foreign key checks can be fast and not require a table scan. "

0

I'm exaggerating a bit here, but it depends on the application, not the database engine. For example, it makes no sense to index foreign keys in a data warehouse application just to speed up foreign key checks, because the data is mostly read-only. And when the data does change, a full scan may be the fastest way to check referential integrity.

0

I didn't think the question author would be writing data warehouse applications, given their introductory question on databases. I apologise for being too simplistic. I expect the author will have learnt a bit more about indexes from our discussion, so all is not lost.

7
13.4 years ago
Neilfws 49k

I would start by reading about the design of some successful databases used in biology. For example:

Most resources currently employ SQL; it's very early days for so-called "NoSQL" deployment in biology. Search Google for "bioinformatics mongodb", for example: you will find a few blog posts, some by BioStar members, exploring the topic.

4
13.4 years ago

Each database design is a new problem and there is no universal way to design it. I guess StackOverflow would be a better place to answer your question:

4
13.4 years ago
Jussi ▴ 180

Database applications fall into two main categories: online transaction processing (OLTP) and data warehousing (DW). Most of the advice given here relates to OLTP only and should not be applied to DW. I'm not touching on OLTP and modeling since other people have already covered those.

Designing a DW is quite different from designing for OLTP. If you have "big data" problems, you are probably building a DW. In that case you should:

  • Choose the database engine based on your requirements. "Big data" often calls for hash-based algorithms: hash join, hash group-by, index hash joins, etc.
  • Design the physical structure based on your requirements: rows that are queried together should be stored close to each other. Use clustered indexes (MS SQL), index-organized tables, cluster-organized tables (Oracle), etc. to control that.
  • Do not try to make every query use an index; a full table scan is by far the most efficient access method if you process a large portion of the rows.
  • Limit the amount of data processed by queries: use partitioning (see the sketch after this list).
  • Use parallel full scans if you have enough disk throughput.
  • Use materialized views to speed up large queries (example).
  • Try to minimize the size of the storage: use table compression, index key compression, bitmap indexes, function-based indexes (example), etc.
  • Understand the database engine you are using and use the features it provides.
  • Validate the tips you read on the internet; they are often wrong or out of context. Including this one.
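
As a sketch of the partitioning point (hypothetical table, MySQL-flavoured syntax; Oracle, MS SQL, etc. differ), range-partitioning by chromosome lets a query that touches one chromosome scan only one partition:

    CREATE TABLE expression_calls (
        sample_id  INT NOT NULL,
        chromosome TINYINT NOT NULL,
        locus      INT NOT NULL,
        value      DOUBLE
    )
    PARTITION BY RANGE (chromosome) (
        PARTITION p01  VALUES LESS THAN (2),
        PARTITION p02  VALUES LESS THAN (3),
        -- ...one partition per chromosome...
        PARTITION pmax VALUES LESS THAN MAXVALUE
    );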

3
13.4 years ago
D. Puthier ▴ 350

Concerning database modeling, the easiest way to get something with little redundancy (that is, "normalized") and reasonable performance is the MERISE method (I think this is the French equivalent of E/R modeling). You first draw a conceptual data model (MCD), a diagram of entities and associations that also includes the cardinalities between those entities (one-to-one, one-to-many, many-to-many). From that diagram you derive the MLD (logical data model), which is almost the definitive database schema. There are a few rules for creating an MLD; the three most important, to my knowledge, are:

  • If a many-to-many relationship exists between entities A and B, a new table C is created containing the two PKs a and b of A and B (so they become foreign keys there). The PK of C is the composite of a and b (see the sketch after this list).
  • If a one-to-many relationship exists between entities A and B (one B relating to many A), the PK of B is copied into A and becomes an FK.
  • If a one-to-one relationship exists, A and B should be part of the same entity.
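
A minimal sketch of the many-to-many rule (protein and publication are made-up example entities):

    CREATE TABLE protein (
        protein_id INT PRIMARY KEY,
        name       VARCHAR(100) NOT NULL
    );

    CREATE TABLE publication (
        pub_id INT PRIMARY KEY,
        title  VARCHAR(255) NOT NULL
    );

    -- The new table C: both PKs come in as foreign keys,
    -- and together they form C's composite primary key.
    CREATE TABLE protein_publication (
        protein_id INT NOT NULL,
        pub_id     INT NOT NULL,
        PRIMARY KEY (protein_id, pub_id),
        FOREIGN KEY (protein_id) REFERENCES protein (protein_id),
        FOREIGN KEY (pub_id)     REFERENCES publication (pub_id)
    );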

From the MLD, the next step is to adapt it to your own database system (MySQL, PostgreSQL, Oracle, DB2). This last step yields the MPD (physical data model).

You can use AnalyseSI or Power AMC as a tool to go from MCD > MLD > MPD. However, for a more sophisticated MPD you should derive the MPD from the MLD using a dedicated tool such as mysql-workbench (if working with MySQL). I discovered the MERISE method about 5 years ago, and it has since become extremely simple for me to model almost anything. Keep in mind, however, that these are the canonical rules; sometimes you may decide not to follow them because the resulting database schema would not fit your data structure (you would have to write many Perl scripts to make the data fit the database model).

Hope this will give you some clues.

Sorry, but some terms (MLD, MCD, ...) are translated from French; I'm not sure about their English equivalents.

0

Just one more thing: in some cases you will be tempted to de-normalize because of performance issues. Keep in mind that there are several other ways of improving the overall performance of a query (better where clauses, joins on multiple keys, temporary tables, avoiding nested selects, stored procedures, ...); see the sketch below.
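
For instance, a temporary table can often replace a nested select (hypothetical tables, MySQL syntax):

    -- Materialize the intermediate result once...
    CREATE TEMPORARY TABLE hot_genes AS
        SELECT gene_id FROM expression WHERE value > 100;

    -- ...then join against it instead of repeating a nested SELECT.
    SELECT g.symbol
    FROM gene g
    JOIN hot_genes h ON h.gene_id = g.gene_id;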

0

As far as I know, MERISE is an old methodology(?) used only in France. People looking for a modeling language should learn UML.

0

You are probably right. The method is a little old-fashioned, and one would do better to learn a more universal language. However, MERISE is very straightforward and fits most conventional database modeling problems.

3
13.4 years ago
Shigeta ▴ 470

Just to reiterate some of the comments - scope is everything for a bioinformatics database.

If the DB does everything, it's likely that it won't do anything.

Normalization is useful, but like most programming maxims you have to apply it only where appropriate and break it where it kills performance. The truth about any relational data system is that you can have too many joins. If you have more than 20 tables in your database, you are probably going to have performance problems; if you use more than maybe 4-5 tables in any given query, it could take hours to get a response back.

Bioinformatics people are often 5-7 years behind on these ideas. MongoDB and other new data-retrieval systems are great - they will be useful for NGS applications, and no doubt better ones will come along soon.

But one of their best performance features is that they do not let developers have many tables. After that, giving up ACID compliance in exchange for speed is usually the second bump.

MySQL may give the same performance as MongoDB when you don't use joins.

For starters, learn what a star schema is and its variants, and read up a little on RDBMS data mining; that will give you an idea of how data-retrieval performance works. I also think MongoDB is a great place to look next because of its auto-scaling abilities. I'm not sure it's a great place to store sequence data you want to search, though.
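
A bare-bones star schema sketch (hypothetical expression-analysis tables): one central fact table holds the measurements, with foreign keys out to small descriptive dimension tables.

    CREATE TABLE dim_gene (
        gene_id    INT PRIMARY KEY,
        symbol     VARCHAR(32),
        chromosome VARCHAR(8)
    );

    CREATE TABLE dim_sample (
        sample_id INT PRIMARY KEY,
        tissue    VARCHAR(64),
        treatment VARCHAR(64)
    );

    -- Fact table: one row per measurement; queries typically filter on the
    -- dimension tables and aggregate over the facts.
    CREATE TABLE fact_expression (
        gene_id    INT NOT NULL,
        sample_id  INT NOT NULL,
        expression DOUBLE,
        PRIMARY KEY (gene_id, sample_id),
        FOREIGN KEY (gene_id)   REFERENCES dim_gene (gene_id),
        FOREIGN KEY (sample_id) REFERENCES dim_sample (sample_id)
    );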

0

I had 1500 tables in one application and it caused no performance problems. The 20-table limit is based on nothing. Also, I can give you example queries with more than 5 joins and sub-second performance. Databases are born to join.
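
Something like this made-up 5-way join over a typical annotation layout runs in well under a second once the join keys are indexed:

    SELECT g.symbol, t.transcript_id, e.exon_id, p.protein_id, d.domain_name
    FROM gene g
    JOIN transcript t     ON t.gene_id       = g.gene_id
    JOIN exon e           ON e.transcript_id = t.transcript_id
    JOIN protein p        ON p.transcript_id = t.transcript_id
    JOIN protein_domain d ON d.protein_id    = p.protein_id
    WHERE g.symbol = 'TP53';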

1
13.1 years ago
Melanie ▴ 660

You've gotten some good pointers on where to find more info. I work in industry, so my DB design problems and constraints are probably a little different than yours. But my general advice is:

  1. Learn from the general database design resources out there. Biology data is different, but not THAT different. I have watched many people struggle to solve database design problems that have been solved in other domains, just because they were too arrogant to admit that they could learn something from a non-scientific application. I've also watched people develop databases with staggeringly stupid designs because they were just ignorant of relational theory. C.J. Date has an O'Reilly book that is a pretty good and brief intro to relational theory.

  2. Be suspicious of anyone who tells you that you "always" have to break normalization because of performance issues. That just isn't true. It depends on the size of your tables, the details of what you are doing with them, the details of your hardware, and the options provided by your database software. I tend to think that people who make these sorts of blanket statements are either lazy or unaware of the risks associated with denormalizing.

  3. If you do decide to denormalize, make sure you aren't hiding ignorance about the structure of the data. I generally capture the normalized data model first, and then document denormalizations and why they were done. It makes the database easier to maintain. There are tools out there that will help you keep track of this stuff. Some of them are even free (e.g., Oracle Data Modeler).
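
As a sketch of point 3, the documentation can live right in the schema (hypothetical tables; the comment is the important part):

    CREATE TABLE gene (
        gene_id INT PRIMARY KEY,
        symbol  VARCHAR(32) NOT NULL
    );

    CREATE TABLE variant (
        variant_id  INT PRIMARY KEY,
        gene_id     INT NOT NULL,
        -- DENORMALIZATION: duplicates gene.symbol to avoid a join on a hot
        -- query path; must be refreshed whenever gene.symbol changes.
        gene_symbol VARCHAR(32) NOT NULL,
        FOREIGN KEY (gene_id) REFERENCES gene (gene_id)
    );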

0
13.4 years ago
Anon1234Y ▴ 160

Why don't you start by using Google? Or why don't you take a course? At every university you will find courses that teach the basics - for instance, that MySQL etc. are database management systems, not "types of databases"... Are you even sure you need a relational database?

2

+1. Giving so many downvotes is not a good way to encourage a newcomer, even if (s)he is using a temporary ID. I actually think this answer is good: at least (s)he questioned the necessity of using SQL, which is better than letting anon1234z pursue a wrong direction.

1

The advice is sound, but formulating it in a more 'diplomatic' way would avoid it coming across as a little antagonistic.

0

+1 for speaking the truth.

0

Needlessly inflammatory, and it doesn't address the question at all. Also, the respondent should spend 30 seconds learning the difference between MySQL, SQLite, and MongoDB to find out that they represent different approaches to data storage. Maybe Google would help?

0

I might go against the downvoting trend here for a second and +1 this :) I agree the "google" suggestion is less than useful, but I think the basic point is sound - when coming at databases from a very novice level, finding a basic course at a university might be a good place to start. I think jumping straight into articles about normalisation, etc., without some context will be a bit daunting. As Istvan implies, the question is too broad to be treated as a standard Q&A.

0
6.2 years ago
bioinfo8 ▴ 230

Hi all,

I am looking for ways to create a biological database. I have some data in MS Access and am somewhat confused about which approach I should follow to create the database. I would like an updated opinion from the Biostars community on the different ways of creating a database: front- and back-end options, using R or Python, and publishing on an institute website/GitHub.

Please guide.

Thanks.

0

You should post your own question, making sure that you define the scope better: e.g., what kind of data do you want to store? What would be the access pattern?

0

Jean-Karim Heriche I would like to store information on a protein family across different organisms, including names, gene information, sequence information, structure, etc. Would you please elaborate on "access pattern"?

0

What JKH said: "You should post your own question". Furthermore, this sounds like a very basic database problem, not really related to bioinformatics... see my comments at A: Fundamentals Of Designing Biological Databases?

