Cases where indexing is efficient for read retrieval
8.8 years ago

There's a nice discussion about how to efficiently retrieve a selection of reads from a fastq file here: How To Efficiently Parse A Huge Fastq File?

The efficient solutions proposed in that discussion are based on parsing the fastq file read by read and testing whether each read belongs to the set of selected reads.

I tried to implement an index-based solution using Biopython (see SeqIO.index on http://biopython.org/wiki/SeqIO#Sequence_Input) as follows:

import sys
from Bio import SeqIO

in_fastq = sys.argv[1]   # fastq file to extract reads from
in_names = sys.argv[2]   # file with one read ID per line

# Scan the fastq once and map each read ID to its file offset.
record_dict = SeqIO.index(in_fastq, "fastq")

with open(in_names) as names:
    for line in names:
        # Seek to and parse one record per requested ID.
        SeqIO.write(record_dict[line.rstrip()], sys.stdout, "fastq")

On the example I tested (extracting a list of about 300,000 reads from a fastq file containing about 3,000,000 reads), this was much slower than the other type of solution.

Are there cases where an index-based solution might be appropriate?

Maybe when the list of reads is not known beforehand and reads have to be extracted on demand, a persistent index would be useful. What would be a good way to make the index persistent? Creating a read database?
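
(A sketch of one option: Biopython's SeqIO.index_db() works like SeqIO.index() but keeps the ID-to-offset mapping in an SQLite file on disk, so it may be exactly such a read database. The index file name reads.idx below is arbitrary.)

import sys
from Bio import SeqIO

# Persistent variant of SeqIO.index(): offsets are stored in SQLite, so
# later runs re-open the index instead of re-scanning the fastq.
record_dict = SeqIO.index_db("reads.idx", sys.argv[1], "fastq")

# On-demand retrieval of a single read ID passed on the command line.
SeqIO.write(record_dict[sys.argv[2]], sys.stdout, "fastq")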


The fastest way to test which members of a very large set belong to a small set is to hash (or do some other expensive operation on) the small set, and then do a hash lookup (or other cheap operation) on each member of the large set - not to do an expensive operation on the large set. This is true inside and outside of bioinformatics. Indexing is not useful unless the large set needs to be accessed many times or randomly; in other words, it is never useful if the procedure can be implemented in a single pass.
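
(A minimal sketch of that pattern in Python - it assumes standard four-line fastq records and a name file with one bare ID per line, no leading '@':)

import sys

# Expensive step on the small set: hash all wanted read IDs.
with open(sys.argv[2]) as f:
    wanted = {line.rstrip() for line in f}

# Single pass over the large set, one cheap hash lookup per read.
with open(sys.argv[1]) as fastq:
    for header in fastq:
        seq = next(fastq)
        plus = next(fastq)
        qual = next(fastq)
        if header[1:].split()[0] in wanted:   # ID = first word after '@'
            sys.stdout.write(header + seq + plus + qual)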

IGV, a genome visualizer, is a good example of something that cannot be implemented in one pass, since it's interactive; indexing is thus very useful.

8.8 years ago
thackl ★ 3.0k

Generally, I think there are two problems with indexing and retrieving NGS reads by ID (or similar). A) Short reads obviously are short, and the ID already represents a considerable portion of each record's information. If the IDs are stored together with positional information for retrieval, the index is by design quite large compared to the read file itself - for a typical ~250-byte Illumina record, a ~40-byte ID plus an 8-byte offset is already close to 20% of the data. B) Index-based retrieval makes more sense the smaller the number of items to be retrieved, since it minimizes IO and seek operations. For most read-based applications you will still need a lot of reads (>1000) to work with. The scenario changes if you consider longer reads, e.g. PacBio or Nanopore, and of course assembled data (contigs, genomes, etc.)

Have a look at samtools faidx for an efficient implementation of ID-based indexing and retrieval of sequence data. It's for FASTA, but I found the code to be quite educational.
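
(For a flavour of how it works: each .fai line stores the sequence name, its length, the byte offset of its first base, bases per line, and bytes per line. A simplified Python sketch of that lookup, ignoring region queries and error handling:)

def load_fai(path):
    # name -> (length, offset, bases_per_line, bytes_per_line)
    index = {}
    with open(path) as f:
        for line in f:
            name, length, offset, lb, lw = line.split()[:5]
            index[name] = (int(length), int(offset), int(lb), int(lw))
    return index

def fetch(fasta, index, name):
    length, offset, lb, lw = index[name]
    full_lines = length // lb
    # sequence bytes plus the newline bytes embedded in full lines
    nbytes = length + full_lines * (lw - lb)
    with open(fasta, "rb") as f:
        f.seek(offset)                    # jump straight to the sequence
        return f.read(nbytes).replace(b"\n", b"").replace(b"\r", b"").decode()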

8.8 years ago

One of the advances that BEDOPS brought to processing BED files was the use of sorting to gain performance. You now see other toolkits starting to use sorted input to get similar improvements.

You can query a sorted input with bedextract, which reduces a linear search to a binary search — you split the file into query-able halves, decide which half you're interested in, and progressively repeat until you get a final answer. No index is required, just sorted input.
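
(The same no-index binary search is easy to sketch in Python for any lexicographically sorted, line-oriented file - simplified here to compare raw bytes and to skip edge-case handling:)

def line_at(f, pos):
    # Return the first complete line starting at or after byte offset pos.
    f.seek(pos)
    if pos:
        f.readline()                      # discard the partial line we hit
    return f.readline()

def search_sorted_file(path, key):
    # Binary-search a sorted text file for a line starting with `key`.
    key = key.encode()
    with open(path, "rb") as f:
        lo, hi = 0, f.seek(0, 2)          # hi starts at the file size
        while lo < hi:
            mid = (lo + hi) // 2
            line = line_at(f, mid)
            if line and line < key:
                lo = mid + 1              # target is in the second half
            else:
                hi = mid                  # target is in the first half
        line = line_at(f, lo)
        return line.decode() if line.startswith(key) else None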

The gain in performance from binary searching is considerable. Retrieving a listing of chromosomes from a multi-GB BED input could take 1200 seconds (20 minutes) by linearly walking through the elements in series. Using a binary search with bedextract reduced this same operation to 2 seconds.

Getting a list of chromosomes seems trivial, but this is an invaluable speedup that allows easy per-chromosome parallelization of analysis tasks on arbitrary input, where you don't know ahead of time what chromosomes are represented in the BED file.

Also, bedextract uses this binary search to offer bedops -e 1-like operations in itself, and the same approach is available in the bedops and bedmap binaries via the --faster option. It's a pretty dramatic speed-up.

So if your FASTQ data can be sorted by genomic position, you could write a Perl or Python script to do much the same thing. Or you could linearize your FASTQ data, convert it to BED intervals, and use bedextract to query for a range or ranges of interest.

To go even faster, you could set up a hash table on some key, as suggested in another comment. Once you decide what to make your key, this offers a constant-time lookup per key - usually at the expense of storage or memory for the hash table, and potentially some storage and maintenance of a file if you are serializing this index somewhere for later re-use.
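
(For example, a sketch of serializing such an index for re-use - the file names and some_read_id are made up, and records are assumed to span exactly four lines:)

import pickle

# Build once: map each read ID to the byte offset of its record.
offsets = {}
with open("reads.fastq") as fastq:
    while True:
        pos = fastq.tell()
        header = fastq.readline()
        if not header:
            break
        offsets[header[1:].split()[0]] = pos
        for _ in range(3):                # skip sequence, '+', quality
            fastq.readline()

with open("reads.fastq.idx", "wb") as out:
    pickle.dump(offsets, out)             # serialize for later runs

# Re-use: constant-time lookup, then a single seek per requested read.
with open("reads.fastq.idx", "rb") as f:
    offsets = pickle.load(f)
with open("reads.fastq") as fastq:
    fastq.seek(offsets["some_read_id"])   # hypothetical read ID
    print("".join(fastq.readline() for _ in range(4)), end="")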

The advantage of sorted input is that binary searches are still very quick, and the index is "baked into" the data, because the sort order is effectively doing the indexing for you.

8.8 years ago

Typically one would only use indexed data if random access to elements is necessary - for example, when you must access reads in random order and preloading them into memory is not feasible.

Run a test to time the efficiency of the best and worst case scenarios: accessing just the first read or just the last read many times. That will give you a sense of how well it works.
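
(Something along these lines would do it, using the Biopython index from the question - a sketch only, and the 10000 repetitions are an arbitrary choice:)

import sys, time
from Bio import SeqIO

record_dict = SeqIO.index(sys.argv[1], "fastq")
keys = list(record_dict)                  # read IDs in file order
first, last = keys[0], keys[-1]

for label, key in (("first", first), ("last", last)):
    t0 = time.time()
    for _ in range(10000):
        _ = record_dict[key]              # seek + parse one record
    print(label, "read: %.2f s for 10000 lookups" % (time.time() - t0))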

8.8 years ago
John 13k

The BAI file is, in my opinion, not an index - at least not by the strictest definition.

BAI files made via samtools:
- only "index" the POS column: you cannot use the index to jump to, say, a given QNAME, nor to the first record with a given sequence, flag, or length.
- require the data to be pre-sorted on POS!

Any database engineer would be totally puzzled by this. They would say: "if you can sort the data on POS, why do you need an index when you can just search the data directly - either with a key:value store, or with a binary intersect?"
And the answer would be "because bioinformatics - deal with it."


This is a very interesting point I failed to consider before - indeed, why is there a separate file if the data is already sorted? Perhaps the compression makes records compress to different sizes... I thought that's what the fixed block size was for, but I admit I don't understand the details.


I think that's exactly it, Istvan - the BAM format is a really complicated format designed to get all the read data into as compressed a form as possible. Personally, and this is just a guess, I think the indexing was an afterthought.

And in all honesty, I think it goes against the purpose of BAM to try and make it some sort of random-access database anyway. Far better to just extract the data you need from the BAM, and then use that in a random-access way.

For example, a 6 GB BAM file is only about 400 MB as a structured numpy array of read start/end positions in RAM.

From there you can sum the signal for all genes in the genome at 1 bp resolution in less than a second - not to mention store hundreds of such arrays in the RAM of a moderate desktop PC. Just using the data you need is the way to go if you want to be fast - not adding more complexity to shoehorn a feature into a file format that was never designed for it. A sketch of the idea follows.
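
(A minimal sketch of that workflow with pysam and numpy - the file name, chromosome, and gene coordinates are made up, and real code would need error handling:)

import numpy as np
import pysam

# Pull just the start/end of each mapped read on one chromosome into a
# compact structured array: two int64 fields = 16 bytes per read.
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    reads = np.array(
        [(r.reference_start, r.reference_end)
         for r in bam
         if not r.is_unmapped
         and r.reference_end is not None
         and r.reference_name == "chr1"],
        dtype=[("start", np.int64), ("end", np.int64)],
    )

# Per-base read depth in one vectorized pass over the array.
cov = np.zeros(reads["end"].max() + 1, dtype=np.int32)
np.add.at(cov, reads["start"], 1)       # depth goes up where a read starts
np.subtract.at(cov, reads["end"], 1)    # and down again where it ends
cov = np.cumsum(cov)

# Summing the signal over a gene is then just a slice (made-up coordinates).
gene_signal = cov[10000:20000].sum()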
