Question

bwa mem: which parameters to tune to improve alignment?

0

Entering edit mode

7.3 years ago

Lina F ▴ 200

Hi all,

I have a handful of genes of interest that I am trying to identify in a plant transcriptome dataset.

I used bwa mem to map the paired-end transcriptome reads against a reference dataset generated from the genes of interest.

The gene sequences I used as leads are not the same organism as the transcriptome, so I am expecting mismatches. I got pretty good coverage for most of the gene sequences, but the reads seem to cluster in chunks, and there is a lot of soft clipping going on in between the chunks.

enter image description here

I am wondering how to start playing with bwa mem parameters to try and retrieve more of the pieces in between the chunks.

Which parameters are good to modify? Matching score, mismatch penalty, or gap penalties?

I appreciate any suggestions!

bwa parameters tuning • 6.0k views

ADD COMMENT • link 7.3 years ago by Lina F ▴ 200

1

Entering edit mode

Are you mapping the reads to the genome? Because in that case you definitely need a splice-aware aligner such as STAR or HISAT2.

As a rule of thumb: if you don't know what you are doing, stick to the default parameters. Definitely for a tool such as bwa mem the defaults are sensible and most optimal under common conditions. If there is a reason to map to the transcriptome you could also have a look at the "newer" generation alignment-free quantification tools such as kallisto and salmon.

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

1

Entering edit mode

Thanks for the suggestion! I mapped with STAR, and it got rid of the sharp cutoffs and peaks which were probably splice sites:

enter image description here

ADD REPLY • link 7.3 years ago by Lina F ▴ 200

0

Entering edit mode

@WouterDeCoster, my understanding is that the pseudoalignment methods like Kallisto do not work well when the mismatch rate is high (such as this example, where the OP is using reference from a related species). Am I incorrect?

ADD REPLY • link 7.3 years ago by harold.smith.tarheel ★ 4.9k

0

Entering edit mode

Right, I was probably too quick. It will definitely depend on how different/how many mismatches. Although, from the kallisto website:

In fact, because the pseudoalignment procedure is robust to errors in the reads, in many benchmarks kallisto significantly outperforms existing tools.

And for salmon:

Quasi-mapping is typically much faster to compute than traditional (or full) alignments, and can sometimes provide superior accuracy by being more robust to errors in the read or genomic variation from the reference sequence. More details about quasi-mappings, and how they are computed, can be found

That said, I'm not aware of a study comparing pseudo-alignment to a different species, but can definitely understand it wouldn't be optimal. Nevertheless, even with mismatches you would get matching kmers and therefore correct assignment. For sure your transcriptome database needs to be complete! But since I'm one of those ignorant bastards who only sequences human samples I'm not really familiar with the struggles of 'plant people' as we would call them...

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

1

Entering edit mode

Thanks for the feedback. I was hoping you had a reference for comparative benchmarking, but it looks like I'll have to do the analysis myself.

ADD REPLY • link 7.3 years ago by harold.smith.tarheel ★ 4.9k

1

Entering edit mode

You could try adjusting the mismatch penalty (-B) and clipping penalty (-L). That said, in this case, I'd suggest trying BBMap, which is a global aligner and does not do any clipping by default and allows low-identity alignments. With the -slow flag, it will allow even more mismatches. Also, if you are mapping to genomic sequence (containing introns) then you need a splice-aware aligner such as BBMap rather than bwa; in that case, the gaps would be explained by introns.

ADD REPLY • link 7.3 years ago by Brian Bushnell 20k