Best Practices For Genome Indexing With Bfast
13.8 years ago

The bfast short read aligner has been receiving a number of positive reviews.

We're giving it an initial try and are currently stuck on indexing the genomes. For mm9, we are using the 10 masks suggested in the documentation and running each with:

bfast index -d 1 -n 4 -f mm9.fa -A 0 -w 14 -m <mask> -i <num>

The full Python fabric script with all the commands is here.
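As a rough sketch of how the per-mask commands can be generated (this is not the linked fabric script), the snippet below builds one command line per mask using only the options shown in the command above. The function name `build_index_commands` and the mask strings in the usage example are placeholders for illustration; substitute the masks from the bfast documentation.

```python
import shlex

# Fixed options copied verbatim from the command in the question.
BASE_ARGS = ["bfast", "index", "-d", "1", "-n", "4",
             "-f", "mm9.fa", "-A", "0", "-w", "14"]


def build_index_commands(masks):
    """Return one `bfast index` command string per mask.

    The index number passed to -i simply counts the masks from 1.
    """
    commands = []
    for num, mask in enumerate(masks, start=1):
        args = BASE_ARGS + ["-m", mask, "-i", str(num)]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # Placeholder masks, NOT the ones recommended in the documentation.
    placeholder_masks = ["1111111111", "1101101101"]
    for command in build_index_commands(placeholder_masks):
        print(command)
```

Each printed line can then be run by hand or handed to whatever job runner you use, since the masks are independent of one another.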

Each mask takes about 5 hours to process and uses 12 GB of disk space, so generating one index set will take about 50 hours of processing time and 120 GB. Ideally, we'd index the 10+ genomes we use and share the results on Amazon, but that comes to more than a TB of space and about 3 weeks of processing time, and double that if we make colorspace indexes available as well.
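The arithmetic behind those estimates can be sketched as follows. It assumes every genome costs about the same per mask as mm9 (5 hours and 12 GB) and that the masks run one after another; both are simplifications, and the function name `index_cost` is made up for this example.

```python
def index_cost(masks, genomes, hours_per_mask=5, gb_per_mask=12, colorspace=False):
    """Estimate total (hours, gigabytes) for building all indexes sequentially.

    Assumes every genome costs the same per mask as mm9, which is only
    a rough approximation for genomes of different sizes.
    """
    # Building colorspace indexes as well doubles the number of indexes.
    factor = 2 if colorspace else 1
    total_indexes = masks * genomes * factor
    return total_indexes * hours_per_mask, total_indexes * gb_per_mask


if __name__ == "__main__":
    hours, gigabytes = index_cost(masks=10, genomes=10)
    print(f"{hours} hours (~{hours / 24 / 7:.1f} weeks), {gigabytes / 1000:.1f} TB")
```

Because the masks are independent, running several at once on separate machines would cut the wall-clock time, though not the total CPU time or the disk space.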

Is there a best practice everyone is using for improving the bfast space/time constraints? Can I get reasonable results with a smaller subset of masks? Am I missing parameters to improve processing time and compression? Any other tips from experienced bfast users?

alignment short aligner • 4.2k views
13.8 years ago

A run time of 5 hours per index sounds about right.

The bfast method appears to be strongly optimized for alignment speed, at the cost of an expensive one-time indexing step. That suits use cases where there is a single genome and a lot of data to map against it.

Your use case does not seem to play to the strengths of bfast, so a different method may be better suited.

For example, SHRiMP is a tool that does not need to create an index. It works equally well for color-space and letter-space data, although it is a lot slower than bfast. But if you have access to the cloud, you have access to lots of CPUs, so you could split your problem into hundreds of pieces, and for that SHRiMP might work out very well.
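One way to split the problem into pieces is to cut the read file into chunks and align each chunk on a separate CPU. The sketch below is a generic splitter, not part of SHRiMP; it assumes plain 4-line FASTQ records with no wrapped sequence lines, and the name `fastq_chunks` and the file names in the usage example are hypothetical.

```python
from itertools import islice


def fastq_chunks(lines, reads_per_chunk):
    """Yield lists of FASTQ lines holding up to reads_per_chunk records each.

    Assumes each record is exactly 4 lines (header, sequence, '+', qualities).
    """
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, 4 * reads_per_chunk))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Hypothetical input file; each chunk is written to its own numbered file.
    with open("reads.fastq") as handle:
        for number, chunk in enumerate(fastq_chunks(handle, 100000), start=1):
            with open(f"reads.part{number}.fastq", "w") as out:
                out.writelines(chunk)
```

Each output file can then be aligned independently and the results merged afterwards.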


Thanks Istvan for the confirmation that this is the expected behavior. You're right that it's probably not the right application for bfast. I'll give SHRiMP a try.

13.4 years ago
brentp 24k

BFast is very flexible, so you can tailor the indexes to your use case. If your reads are long, try specifying your own, longer seeds. A few long seeds will take less time and space than a larger number of shorter seeds.

There is a utility in the butil/ directory of the source tree called btestindexes that will find the "optimal" set of indexes to use given specified constraints for accuracy, mismatches, and key size/width.

That said, I agree with @Istvan that it's always good to consider other options such as bwa, bowtie (if you don't care about indels), or gsnap.

Actually, if your 10+ genomes are just different mm9 individuals and can be represented as a SNP table, then they can be saved in a single gsnap index (see the section titled "SNP-tolerant alignment in GSNAP" in the README).
