Do Aligners Tend To Scale With Number Of Bases Or With Number Of Reads?
12.2 years ago
Ian ▴ 50

I would like to find an efficient way of aligning large fastq files (to the human reference genome) by first splitting each fastq into smaller pieces so that they can be aligned in parallel. I can think of two ways of doing this: splitting the fastq either into files with a fixed number of bases (e.g. a billion bases per file) or into files with a fixed number of reads (e.g. 10 million reads per file). Does anyone know which approach should be more efficient in terms of run time? This question is particularly relevant when different fastq files have different read lengths.

I suppose another way of asking the question is: do aligners tend to scale with the number of bases or with the number of reads (in terms of run time)? The aligners I am most interested in are BWA, BFAST and stampy.

Many thanks,

Ian

alignment bwa fastq • 2.1k views
12.2 years ago
lh3 33k

BWA roughly scales with the number of bases. Nonetheless, I do not think it matters at all with data splitting. The total CPU time is roughly fixed. The wall-clock time depends on how many CPU cores you use at once.
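To illustrate the point about CPU time versus wall-clock time, here is a toy model, not a benchmark. It assumes run time is strictly proportional to the number of bases in a chunk, and the names `estimate_wall_clock` and `seconds_per_base` are made up for this sketch. Chunks are handed greedily to whichever core is least loaded.

```python
import heapq

def estimate_wall_clock(chunk_bases, n_cores, seconds_per_base=1.0):
    """Estimate wall-clock time for aligning chunks on n_cores.

    Toy assumption: each chunk costs bases * seconds_per_base of CPU time.
    Chunks are assigned largest-first to the least-loaded core.
    """
    loads = [0.0] * n_cores          # accumulated CPU time per core
    heapq.heapify(loads)
    for bases in sorted(chunk_bases, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + bases * seconds_per_base)
    return max(loads)                # the busiest core sets the wall-clock time

# Same total work (16 units of bases), split two different ways, on 2 cores.
even = [4, 4, 4, 4]
uneven = [10, 2, 2, 2]
print(sum(even), estimate_wall_clock(even, 2))
print(sum(uneven), estimate_wall_clock(uneven, 2))
```

Under this model the total CPU time is the same for both splits; only the balance across cores changes how long you wait.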

12.2 years ago

I think the best way to split your file is to generate files of the same size, which is roughly equivalent to the same number of residues. Genometools (very, very fast) is your friend.

gt splitfasta -targetsize 50 file.fasta
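The command above works on fasta input. For fastq, the same idea (chunks with roughly equal numbers of bases) could be sketched in Python as below. This is only a minimal sketch: it assumes plain 4-line fastq records with no wrapped sequence lines, and `read_fastq` and `split_by_bases` are hypothetical helper names, not part of any tool mentioned in this thread.

```python
from itertools import islice

def read_fastq(lines):
    """Yield (header, seq, plus, qual) tuples from an iterable of lines.

    Assumes the simple 4-lines-per-record layout (no wrapped sequences).
    """
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record

def split_by_bases(records, target_bases):
    """Group records into chunks of at least target_bases bases each.

    A chunk is closed as soon as it reaches the target, so the last
    chunk may be smaller than the others.
    """
    chunk, n_bases = [], 0
    for rec in records:
        chunk.append(rec)
        n_bases += len(rec[1])       # rec[1] is the sequence line
        if n_bases >= target_bases:
            yield chunk
            chunk, n_bases = [], 0
    if chunk:
        yield chunk

# Tiny in-memory example: three reads of length 4, 4 and 2.
lines = ["@r1\n", "ACGT\n", "+\n", "IIII\n",
         "@r2\n", "ACGT\n", "+\n", "IIII\n",
         "@r3\n", "AC\n", "+\n", "II\n"]
for i, chunk in enumerate(split_by_bases(read_fastq(lines), target_bases=6)):
    print(i, [rec[0] for rec in chunk])
```

In practice you would pass an open file handle instead of the list and write each chunk to its own file before launching the aligner on it.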