Question

Do Aligners Tend To Scale With Number Of Bases Or With Number Of Reads?

5

12.2 years ago

Ian ▴ 50

I would like to find an efficient way of aligning large FASTQ files to the human reference genome by first splitting each file into smaller pieces that can be aligned in parallel. I can think of two ways of doing this: splitting the FASTQ into files with a fixed number of bases (e.g. a billion bases per file) or into files with a fixed number of reads (e.g. 10 million reads per file). Does anyone know which approach should be more efficient in terms of run time? This question is particularly relevant when different FASTQ files have different read lengths.

I suppose another way of asking the question is: do aligners tend to scale with the number of bases or with the number of reads (in terms of run time)? The aligners I am most interested in are BWA, BFAST and Stampy.

Many thanks,

Ian

alignment bwa fastq • 2.1k views

score 7 · Answer 1 · 2012-02-10

12.2 years ago

lh3 33k

BWA roughly scales with the number of bases. Nonetheless, I do not think the choice matters for data splitting: the total CPU time is roughly fixed however you split the data, and the wall-clock time depends on how many CPU cores you use at once.
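To make the CPU-time versus wall-clock point concrete, here is a rough sketch, not a benchmark. It assumes run time is strictly proportional to bases, and `seconds_per_base` is a hypothetical constant that you would have to measure for your own aligner and hardware. Chunks are assigned greedily, largest first, to the least-loaded core.

```python
import heapq

def estimate_times(chunk_bases, n_cores, seconds_per_base):
    """Return (total_cpu_seconds, wall_clock_seconds) for a set of chunks.

    Assumption: cost of a chunk = bases * seconds_per_base (hypothetical constant).
    Chunks are handed out largest-first to whichever core is currently least loaded.
    """
    loads = [0.0] * n_cores          # accumulated seconds per core
    heapq.heapify(loads)
    for bases in sorted(chunk_bases, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + bases * seconds_per_base)
    total_cpu = sum(chunk_bases) * seconds_per_base
    wall_clock = max(loads)          # the busiest core finishes last
    return total_cpu, wall_clock
```

Under these assumptions the total CPU time is identical for any split of the same data. Only the wall-clock time changes, and it is bounded below by the largest single chunk, which is the practical argument for keeping chunks similar in base count.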

score 1 · Answer 2 · 2012-02-10

12.2 years ago

Manu Prestat 4.1k

I think the best way to split your file is to generate files of the same size, which is roughly equivalent to the same number of residues per file. GenomeTools (very, very fast) is your friend.

gt splitfasta -targetsize 50 file.fasta
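The command above works on FASTA input. For FASTQ, the same idea (chunks with roughly equal numbers of bases) can be sketched in a few lines of Python. This is only a minimal illustration: the function name `split_fastq_by_bases` and the output naming scheme are made up for this example, and it assumes uncompressed FASTQ with exactly four lines per record (no wrapped sequences).

```python
import itertools

def split_fastq_by_bases(path, bases_per_chunk, prefix="chunk"):
    """Split a 4-line-per-record FASTQ into files of roughly bases_per_chunk bases.

    A new output file is started once the current one holds at least
    bases_per_chunk bases, so records are never split. Returns the output paths.
    """
    out_paths = []
    out = None
    bases_in_chunk = 0
    with open(path) as handle:
        while True:
            record = list(itertools.islice(handle, 4))  # header, sequence, '+', qualities
            if not record:
                break
            if len(record) < 4:
                raise ValueError("truncated FASTQ record at end of file")
            if out is None or bases_in_chunk >= bases_per_chunk:
                if out is not None:
                    out.close()
                out_path = f"{prefix}_{len(out_paths):04d}.fastq"
                out_paths.append(out_path)
                out = open(out_path, "w")
                bases_in_chunk = 0
            out.writelines(record)
            bases_in_chunk += len(record[1].strip())    # count bases on the sequence line
    if out is not None:
        out.close()
    return out_paths
```

Each chunk can then be passed to the aligner as a separate job. Because chunks are cut on record boundaries, a chunk may exceed the target by up to one read length.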
