How do you store and share your bioinformatics data? (FASTA, FASTQ, SFF, etc.)
5 · 11.2 years ago

I would like to find a more efficient way of storing and sharing data in our lab. Every week, new sequencing data are generated, so our collection of genomic data, RNA-Seq, and small RNA datasets keeps growing. After some time, it becomes inconvenient to track down, say, the genomic sequences from October that have already been trimmed.

How do you deal with this problem?

Any experience is appreciated. Thanks.

EDIT: I would also welcome direct suggestions for databases you use for your own data. For example, do you have any experience with Chado? Can it answer questions like "give me all the RNA-Seq data we have for this species"? It would also be good if datasets could be associated with one another (e.g. raw data linked to the corresponding post-processed files).
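Purely as a hypothetical illustration of the kind of query this EDIT asks about, here is a minimal sketch of a small metadata catalog in SQLite (not Chado; every table and column name here is made up for the example):

    import sqlite3

    # Minimal, hypothetical metadata catalog that links raw files to their
    # post-processed derivatives and supports species-level queries.
    con = sqlite3.connect("lab_datasets.db")
    con.executescript("""
    CREATE TABLE IF NOT EXISTS dataset (
        id INTEGER PRIMARY KEY,
        species TEXT NOT NULL,       -- e.g. 'Arabidopsis thaliana'
        assay TEXT NOT NULL,         -- e.g. 'RNA-Seq', 'small RNA'
        path TEXT NOT NULL,          -- location on shared storage
        derived_from INTEGER REFERENCES dataset(id)  -- NULL for raw data
    );
    """)

    con.execute(
        "INSERT INTO dataset (species, assay, path) VALUES (?, ?, ?)",
        ("Arabidopsis thaliana", "RNA-Seq", "/data/2012-10/athal_rnaseq_trimmed.fq.gz"),
    )

    # "Give me all RNA-Seq data we have for this species":
    rows = con.execute(
        "SELECT path FROM dataset WHERE species = ? AND assay = ?",
        ("Arabidopsis thaliana", "RNA-Seq"),
    ).fetchall()
    for (path,) in rows:
        print(path)

The derived_from column is one simple way to associate raw data with its post-processed files, as the EDIT asks.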

data • 8.8k views
Comment:

Out of curiosity, what did you end up implementing, if anything?

7 · 11.2 years ago

I'm going to tell you a bad thing.

While working with microarrays, I made a considerable effort to keep everything, and to keep everything "very" organized: I had installed the LIMS BASE with a lot of plugins, which I had tweaked to fit our kind of data and analyses. It was a full-time job for several months, and a part-time job to maintain afterwards.

And, 7 years later, my conclusion is: why did I do that? I was the only one who cared about it; nobody has ever asked me to go back to those data. So now I just back up my data and analyses, in whatever state they are in my folders, to a tape-based archiving system (= cheap). When I have less than 2 TB free on my server, I remove the folders I'm not sure I'll need in the following 2 weeks.

So, my "bad" advise is (unless you work for a sequencing facility): just back them up as they are, and don't lose time on any other consideration.

3 · 11.2 years ago

Compression with indexing is one way to handle lots of data. Generally, you want good compression performance combined with random access, which lets you retrieve a slice of the data far more quickly than decompressing a whole dataset that was naively compressed in one shot with gzip, bzip2, etc.

In our lab, for example, we work a lot with BED data. As we collected increasing amounts of it, we developed a compression algorithm for BED, along with the starch/unstarch tools, to provide greater efficiency than naive gzip etc. as well as fast per-chromosome access ("indexing"), which makes it much easier to split analyses of BED files across computational cluster nodes. The BAM format does similar indexing of compressed sequencing read data. For FASTQ data, there was a contest recently to come up with the best lossless-compression-with-indexing scheme, but I'm not sure what the result of that was (take a look at the site).
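To make the random-access idea concrete, here is a minimal sketch using bgzip/tabix-style indexing through pysam — my choice of tooling for illustration, not something this answer prescribes, and the file name and region are made up:

    import pysam

    # Assumes regions.bed was compressed with bgzip and indexed with tabix,
    # e.g.:  bgzip regions.bed && tabix -p bed regions.bed.gz
    bed = pysam.TabixFile("regions.bed.gz")

    # Retrieve only the records overlapping chr1:1,000,000-2,000,000 --
    # thanks to the index, the rest of the file is never decompressed.
    for line in bed.fetch("chr1", 1_000_000, 2_000_000):
        print(line)

    bed.close()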

Other data formats or storage options (databases, for example) may have their own compression and methods for logically breaking up data into more manageable chunks (e.g. by chromosome or some other logical unit).

3 · 11.2 years ago · SES 8.6k

The appropriate solution will depend on how much data you envision generating and what type of data you need to share. The best advice I have heard from experts in the field is to estimate what you will generate in the next couple of years and multiply that by a factor of at least 5 (we can't predict the future, so that may be way off).

We use a RAID 50 storage array (not for computing against), and we just keep the raw FASTQs and BAMs (and, of course, back up analyses). I've tried to follow the development of compression/storage solutions like CRAM, but I'm not convinced any of them will become common enough that the formats will be easy to share and use in the future, so we just keep raw data, alignments, etc. The high-volume, faster disks for computing against are considerably more expensive, and we have less storage on them. You want to minimize moving large amounts of data around, so try to design a system where your storage is connected to where you will analyze the data.

For sharing data, make it available for shipping on hard drives via regular mail. Your IT people probably won't want you to tie up the bandwidth with large or numerous downloads, and your collaborators won't want to spend >1000 hours downloading the raw data. Of course, if you only need to share assemblies or annotations, those can easily be placed on your server for download.
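One practical step when shipping drives — my addition, not something this answer mentions — is to write a checksum manifest before the drive goes out, so the recipient can verify the copy (e.g. with md5sum -c). A minimal sketch, with a made-up mount point:

    import hashlib
    from pathlib import Path

    DRIVE = Path("/mnt/shipping_drive")  # hypothetical mount point of the drive

    # Write an md5 manifest in the format that `md5sum -c MANIFEST.md5` checks.
    with open(DRIVE / "MANIFEST.md5", "w") as manifest:
        for path in sorted(DRIVE.rglob("*.fastq.gz")):
            h = hashlib.md5()
            with open(path, "rb") as f:
                # Hash in 1 MiB chunks so large FASTQs never sit in memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            manifest.write(f"{h.hexdigest()}  {path.relative_to(DRIVE)}\n")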

I'm not aware of any journals that want/require you to deposit the raw data anywhere for large sequencing projects these days, so it's probably up to you to maintain copies of your data. This is a very timely question, and I'd be interested to hear what other labs/institutes are doing about setting per-member privileges on particular data sets, and about how the data are stored, analyzed, and shared.

2 · 11.2 years ago · matted 7.8k

An efficient way is to store the reads in a CRAM file. As with BAMs, you can recover the original FASTQ from the file if you include the unmapped reads. CRAM goes one step further and doesn't store sequence unless it needs to (so, e.g., perfectly aligned reads are stored only as coordinates in the reference genome). Plus, you gain fast random access by genome position. If you're okay with compressing quality scores in a lossy way, you can get even bigger space savings.
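A minimal sketch of a BAM-to-CRAM conversion via pysam — my choice of tool, since the answer doesn't name one; the file names and reference are made up. Note that CRAM needs the reference FASTA both to write and to read back the sequences:

    import pysam

    # Convert an existing BAM (keep the unmapped reads!) to CRAM.
    # Aligned bases are stored as differences against the reference,
    # so the same genome.fa is required to decode the CRAM later.
    with pysam.AlignmentFile("sample.bam", "rb") as bam, \
         pysam.AlignmentFile("sample.cram", "wc", template=bam,
                             reference_filename="genome.fa") as cram:
        for read in bam:
            cram.write(read)

From there, something like samtools fastq sample.cram should recover the original reads, provided the unmapped ones were retained.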

1 · 11.2 years ago · diltsjeri ▴ 470

We use a shared Unix server. How exactly are you currently storing your data, and what type of hardware are you using? (These logistics will help us answer your question better.)

Comment:

Yes, we use a shared Unix server. However, the sequencing data are organized by the dates they came off the sequencing machines, so that layout can't answer questions like "give me all RNA-Seq data for this species".

