Indexing Compressed Fastq-Like Data
7
6
Entering edit mode
13.7 years ago

Hi,

I would like to ask what is the best option to have FASTQ-like information indexed and compressed at the same time. I have seen cdbfasta/cdbyank can index fastq files but not if they are compressed, and I've also seen there are fastq2bam conversion tools that can create unaligned bam files, which I suppose can be indexed.

What is the best option that is a good compromise between speed/space taken? Does anybody know of any tool that will allow retrieving indexed sequences by sequence itself as well as by sequence id?

Cheers

fastq • 8.4k views
ADD COMMENT
4
Entering edit mode
13.7 years ago
Aaron Statham ★ 1.1k

Doesn't look like BAM will be able to index unaligned reads, quote from the paper Pierre linked to:

A position-sorted BAM file can be indexed.

You could try putting your reads into a database (eg I use sqlite3 for some projects) and index the table by the read ID and sequence

ADD COMMENT
4
Entering edit mode
13.7 years ago

As mentioned in other answers, BAM only fulfills part of your requirement (compression and random access), but not indexing. However, you can easily roll your own index using the BGZF API and your key-value store of choice e.g. Berkeley DB.

Here's an example using my cl-sam API, but you can substitute e.g. the C API that comes with Samtools or the Java API from Picard (or a Swig wrapper). I'm just writing data to a text file, instead of BDB, but you get the idea...

(defun bamdex (bam-file index-file)
  (with-open-file (index index-file :direction :output)
    (with-bgzf (bgzf bam-file :direction :input)
      (read-bam-meta bgzf)
      (loop
         for offset = (bgzf-tell bgzf)
         for record = (read-alignment bgzf)
         while record
         do (format index "~s ~d~%" (read-name record) offset)))))

Keys and values:

"ENSDART00000000005_480_670_12d" 2269928771

The key is the read name, the number is the virtual offset into the uncompressed data (see the SAM spec). Use with a BGZF seek to reach the record:

(with-bgzf (bgzf "test.bam")
  (bgzf-seek bgzf 2269928771)
  (read-name (read-alignment bgzf)))

gives

"ENSDART00000000005_480_670_12d"

If the reads are very long, you might consider using a sequence checksum as a key instead of the actual sequence.

ADD COMMENT
0
Entering edit mode

Good info. I know samtools has a "TO DO" list, but I wonder if anyone has already done an implementation of an indexed BAM file...

ADD REPLY
0
Entering edit mode

Might be added to Biopython shortly, see here

ADD REPLY
4
Entering edit mode
13.7 years ago

If you care about the actual implementation - there is an excellent post by brentp on how to index files in general. Formats like FASTA, FASTQ and SAM were used as examples.

ADD COMMENT
0
Entering edit mode

This doesn't work on compressed data. Also, it uses TokyoCabinet BDB which I've found doesn't scale to the number of records found in real data (>10^7). That's why I suggested Berkeley DB.

ADD REPLY
0
Entering edit mode

@Keith, have you found that problem with BDB? I tested with 20million records and saw no slow-down? HDB, on the other hand slows down at 10^6. and tokyo-cabinet can compress the data.

ADD REPLY
0
Entering edit mode

Yes, BDB. See timings here using the C API directly. Each time is for the insertion of 10^6 records. You can see that it takes 3 sec to start, with, but takes 180 sec once there are 10^7 records. Compression helps somewhat at this point, but it's still not good. The time taken increases exponentially. What tuning parameters do you use?

ADD REPLY
0
Entering edit mode

Would anyone know where that blog post is hosted these days?

ADD REPLY
0
Entering edit mode

web archive link

Look for fileindex on April 10, 2010

ADD REPLY
3
Entering edit mode
13.7 years ago

I've no answer to that question but I know that the BAM format uses a special version of the zlib (BGZF) that divide the compressed file into some indexed chunk of data:

The advantage of BGZF over conventional gzip is that BGZF allows for seeking without having to scan through the entire file up to the position being sought.

See this article

ADD COMMENT
0
Entering edit mode

Doesn't look like BAM will be able to index unaligned reads, quote from the paper Pierre linked to:

A position-sorted BAM file can be indexed.

You could possibly try putting your reads into a database (eg I use sqlite3 for some projects) and index the table by the read ID and sequence

ADD REPLY
2
Entering edit mode
13.7 years ago
Casbon ★ 3.3k

Doesn't screed do something like this?

Docs

Shameless self-promoting link to my influence on this project

ADD COMMENT
0
Entering edit mode

This looks interesting but I don't see compression mentioned by quickly reading the document. Ideally, I would like to have both compression and fast access by seqID (and also sequence if possible).

ADD REPLY
2
Entering edit mode
13.7 years ago
Michael 54k

Sorry if this is more an answer to: "What will, probably, be a good solution in the future". But as I saw, this problem is not really solved yet I put this in anyway.

The BioHDF project is possibly on the way to accomplish this, however it is still pretty much a prototype. It relies on the HDF5 library which natively features compression.

This README shows how a projected work-flow could look.

From their site:

New Features (April 2010)

  • NCList-based indexing for fast and accurate queries (reference)
  • SAM/BAM import and export
  • Better large dataset support (with more later in the month)

I am not sure if they are really already supporting indexing of FastQ/X sequences, but I guess in their ISMB 2009 presentation they said they would. The project has potential but is progressing rather slowly.

ADD COMMENT
1
Entering edit mode
12.4 years ago
Peter 6.0k

How about using BGZF (Blocked GNU Zip Format), the same mechanism as BAM files? This is a variant of GZIP but using 64KB blocks of data. See this blog post.

In principle you could also use BZIP2, which is also block based compression, but its better compression comes at the cost of higher CPU both for compression and decompression (see here).

With either of the above, you'd need a separate index file mapping the keys (e.g. record ID) to the block double offset, as explained in the above blog posts, which links to a proof of principle implementation of this on one of my Biopython branches of github, building on Biopython's existing sequencing indexing infrastructure (e.g. in memory mapping, or using SQLite3).

ADD COMMENT
0
Entering edit mode

I see now Pierre had also suggest BGZF

ADD REPLY
0
Entering edit mode

Looking back over the other answers, Keith James also suggested BGZF.

ADD REPLY

Login before adding your answer.

Traffic: 2385 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6