Biostar Beta. Not for public use.
Question: Indexing Compressed Fastq-Like Data

Hi,

I would like to ask what the best option is for keeping FASTQ-like information indexed and compressed at the same time. I have seen that cdbfasta/cdbyank can index FASTQ files, but not compressed ones, and I've also seen fastq-to-BAM conversion tools that can create unaligned BAM files, which I suppose can be indexed.

What is the best option offering a good compromise between speed and space? Does anybody know of a tool that allows retrieving indexed sequences by the sequence itself as well as by sequence ID?

Cheers

2184687-1231-83-, 9.6 years ago • updated 9.1 years ago by Peter

Doesn't look like BAM will be able to index unaligned reads; quoting from the paper Pierre linked to:

A position-sorted BAM file can be indexed.

You could try putting your reads into a database (e.g. I use sqlite3 for some projects) and indexing the table by both the read ID and the sequence.
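A minimal sketch of that sqlite3 approach (the table and column names here are illustrative, not from any existing tool):

```python
import sqlite3

# Sketch only: store reads in SQLite and index by both read ID and
# sequence, so lookups work either way.
conn = sqlite3.connect(":memory:")  # use a file path for persistence
conn.execute("""
    CREATE TABLE reads (
        read_id  TEXT PRIMARY KEY,
        sequence TEXT NOT NULL,
        quality  TEXT NOT NULL
    )
""")
# A second index makes lookup by sequence fast as well.
conn.execute("CREATE INDEX idx_sequence ON reads (sequence)")

reads = [
    ("read_001", "ACGTACGT", "IIIIIIII"),
    ("read_002", "TTGGCCAA", "IIIIHHHH"),
]
conn.executemany("INSERT INTO reads VALUES (?, ?, ?)", reads)

# Retrieve by ID...
row = conn.execute(
    "SELECT sequence FROM reads WHERE read_id = ?", ("read_001",)
).fetchone()
print(row[0])  # ACGTACGT

# ...or by the sequence itself.
row = conn.execute(
    "SELECT read_id FROM reads WHERE sequence = ?", ("TTGGCCAA",)
).fetchone()
print(row[0])  # read_002
```

Note this gives you indexing but not compression on its own; SQLite stores the text uncompressed unless you compress the sequence column yourself.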

Aaron Statham, 9.6 years ago • updated 10 months ago by RamRS

As mentioned in other answers, BAM only fulfills part of your requirement (compression and random access), but not indexing. However, you can easily roll your own index using the BGZF API and your key-value store of choice e.g. Berkeley DB.

Here's an example using my cl-sam API, but you can substitute e.g. the C API that comes with SAMtools or the Java API from Picard (or a SWIG wrapper). I'm just writing data to a text file instead of BDB, but you get the idea...

(defun bamdex (bam-file index-file)
  "Write a read-name -> virtual-offset index for BAM-FILE to INDEX-FILE."
  (with-open-file (index index-file :direction :output)
    (with-bgzf (bgzf bam-file :direction :input)
      (read-bam-meta bgzf)              ; skip the header and reference metadata
      (loop
         for offset = (bgzf-tell bgzf)  ; virtual offset before each record
         for record = (read-alignment bgzf)
         while record
         do (format index "~s ~d~%" (read-name record) offset)))))

Keys and values:

"ENSDART00000000005_480_670_12d" 2269928771

The key is the read name; the number is the virtual offset into the uncompressed data (see the SAM spec). Use bgzf-seek with that offset to reach the record:

(with-bgzf (bgzf "test.bam")
  (bgzf-seek bgzf 2269928771)
  (read-name (read-alignment bgzf)))

gives

"ENSDART00000000005_480_670_12d"

If the reads are very long, you might consider using a sequence checksum as a key instead of the actual sequence.
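The same idea carries over to other languages. Here's a hedged Python sketch (not part of cl-sam; the helper names are illustrative) that indexes a plain, uncompressed FASTQ file by a SHA-1 checksum of each sequence; with BGZF you would store virtual offsets from a bgzf-tell equivalent instead of plain byte offsets:

```python
import hashlib

# Sketch only: build a {sequence-checksum -> byte offset} index over an
# uncompressed FASTQ file. Using a checksum as the key keeps the index
# small even when reads are long.
def build_index(fastq_path):
    index = {}
    with open(fastq_path, "rb") as fh:
        while True:
            offset = fh.tell()          # offset of this record's @-line
            header = fh.readline()
            if not header:
                break
            seq = fh.readline().strip()
            fh.readline()               # '+' separator line
            fh.readline()               # quality line
            index[hashlib.sha1(seq).hexdigest()] = offset
    return index

def fetch(fastq_path, index, sequence):
    """Seek to the record for SEQUENCE and return its name line."""
    with open(fastq_path, "rb") as fh:
        fh.seek(index[hashlib.sha1(sequence).hexdigest()])
        return fh.readline().strip().decode()
```

A usage example: `idx = build_index("reads.fastq")` then `fetch("reads.fastq", idx, b"ACGT...")` returns the matching record's `@` line.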

iw9oel_ad, 9.6 years ago • updated 10 months ago by RamRS

Good info. I know samtools has a "TO DO" list, but I wonder if anyone has already done an implementation of an indexed BAM file...

2184687-1231-83-, 9.6 years ago • updated 10 months ago by RamRS

Might be added to Biopython shortly; see here.

Peter, 8.3 years ago • updated 10 months ago by RamRS

If you care about the actual implementation, there is an excellent post by brentp on how to index files in general, using formats like FASTA, FASTQ and SAM as examples.

Haibao Tang, 9.6 years ago • updated 10 months ago by RamRS

This doesn't work on compressed data. Also, it uses the Tokyo Cabinet BDB, which I've found doesn't scale to the number of records found in real data (>10^7). That's why I suggested Berkeley DB.
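For what it's worth, the same key-to-offset pattern works with any dbm-style store. A stdlib sketch (the offsets below are placeholder numbers; the `berkeleydb`/`bsddb3` bindings would be a drop-in replacement for a real BDB backend at larger scales):

```python
import dbm
import os
import tempfile

# Sketch only: a persistent read-name -> virtual-offset store using
# Python's stdlib dbm module. dbm.open picks whichever backend is
# available (gdbm, ndbm, or the pure-Python dbm.dumb fallback).
path = os.path.join(tempfile.mkdtemp(), "reads.index")

with dbm.open(path, "c") as db:          # "c": create if missing
    db[b"read_001"] = b"2269928771"      # keys and values are bytes
    db[b"read_002"] = b"2269930112"

with dbm.open(path, "r") as db:          # reopen read-only
    print(int(db[b"read_001"]))          # 2269928771
```

Whether this scales past 10^7 records depends entirely on the backend dbm selects, which is the same caveat being discussed above.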

iw9oel_ad, 9.6 years ago • updated 10 months ago by RamRS

@Keith, have you found that problem with BDB? I tested with 20 million records and saw no slow-down. HDB, on the other hand, slows down at 10^6, and Tokyo Cabinet can compress the data.

brentp, 9.5 years ago • updated 10 months ago by RamRS

Yes, BDB. See the timings here, using the C API directly. Each time is for the insertion of 10^6 records. You can see that it starts at 3 sec but takes 180 sec once there are 10^7 records. Compression helps somewhat at that point, but it's still not good; the time taken increases exponentially. What tuning parameters do you use?

iw9oel_ad, 9.5 years ago • updated 10 months ago by RamRS

Would anyone know where that blog post is hosted these days?

Andreas, 2.9 years ago • updated 10 months ago by RamRS

Web archive link. Look for fileindex on April 10, 2010.

samesense, 2.7 years ago • updated 10 months ago by RamRS

I've no answer to that question, but I know that the BAM format uses a special variant of zlib (BGZF) that divides the compressed file into indexed chunks of data:

The advantage of BGZF over conventional gzip is that BGZF allows for seeking without having to scan through the entire file up to the position being sought.

See this article
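The block idea can be illustrated with nothing but the Python standard library (a sketch of the principle only; real BGZF additionally records each block's compressed size in a gzip extra field). A BGZF file is a series of concatenated gzip members, so if you record the byte offset where a member starts, you can seek straight to it and decompress just that block:

```python
import gzip
import io

# Sketch only: write two independent gzip members into one "file" and
# remember the byte offset where each one begins.
buf = io.BytesIO()
offsets = []
for chunk in [b"@read1\nACGT\n", b"@read2\nTTGG\n"]:
    offsets.append(buf.tell())          # start offset of this block
    buf.write(gzip.compress(chunk))

# Jump straight to the second block, skipping the first entirely --
# this is the seeking that plain single-stream gzip cannot do.
buf.seek(offsets[1])
with gzip.GzipFile(fileobj=buf) as gz:
    print(gz.read())  # b'@read2\nTTGG\n'
```

Reading from offset 0 instead would decompress both members in sequence, which is why a BGZF file is still a valid gzip file to ordinary tools.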

Pierre Lindenbaum, 9.6 years ago • updated 10 months ago by RamRS

Doesn't screed do something like this?

Docs

Shameless self-promoting link to my influence on this project

Casbon, 9.6 years ago • updated 10 months ago by RamRS

This looks interesting, but I don't see compression mentioned from a quick read of the documentation. Ideally, I would like both compression and fast access by sequence ID (and by sequence too, if possible).

2184687-1231-83-, 9.6 years ago

Sorry if this is more an answer to "what will probably be a good solution in the future", but since this problem is not really solved yet, I'll put it in anyway.

The BioHDF project may be on the way to accomplishing this, although it is still very much a prototype. It relies on the HDF5 library, which natively supports compression.

This README shows how a projected work-flow could look.

From their site:

New Features (April 2010)

  • NCList-based indexing for fast and accurate queries (reference)
  • SAM/BAM import and export
  • Better large dataset support (with more later in the month)

I am not sure whether they already support indexing of FASTQ/FASTX sequences, but I gather from their ISMB 2009 presentation that they intend to. The project has potential but is progressing rather slowly.

Michael Dondrup, 9.5 years ago • updated 10 months ago by RamRS

How about using BGZF (Blocked GNU Zip Format), the same mechanism BAM files use? This is a variant of gzip that compresses data in blocks of up to 64 KB. See this blog post.

In principle you could also use BZIP2, which is likewise block-based compression, but its better compression ratio comes at the cost of higher CPU usage for both compression and decompression (see here).

With either of the above, you'd need a separate index file mapping the keys (e.g. record ID) to the block double offset, as explained in the above blog posts. They link to a proof-of-principle implementation on one of my Biopython branches on GitHub, building on Biopython's existing sequence indexing infrastructure (e.g. an in-memory mapping, or SQLite3).
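That "double offset" is conventionally packed into a single 64-bit virtual offset, as described in the SAM spec. A quick sketch of the packing:

```python
# Virtual offsets as defined in the SAM spec: the top 48 bits hold the
# compressed byte offset of a BGZF block's start, and the low 16 bits
# hold the position within that block's uncompressed data (< 64 KB).
def make_virtual_offset(block_start, within_block):
    assert 0 <= within_block < 65536
    return (block_start << 16) | within_block

def split_virtual_offset(voffset):
    return voffset >> 16, voffset & 0xFFFF

# The virtual offset from the cl-sam example earlier in this thread
# decomposes into a block start and an in-block position:
block, within = split_virtual_offset(2269928771)
print(block, within)
assert make_virtual_offset(block, within) == 2269928771
```

This is why a BGZF seek needs no scan: the top bits say where to start decompressing and the low bits say how far to skip within the decompressed block.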

ADD COMMENTlink 8.2 years ago Peter 5.8k • updated 10 months ago RamRS 21k

I see now that Pierre had also suggested BGZF.

Peter, 8.2 years ago • updated 10 months ago by RamRS

Looking back over the other answers, Keith James also suggested BGZF.

Peter, 8.2 years ago • updated 10 months ago by RamRS
