Forum: Feedback on Read Alignment Programs
5.2 years ago

Hey everyone! I'm a microbiology undergrad in the middle of completing a work term with a bioinformatics team. My project for the semester involves researching different read alignment / mapping programs for the development of a future training course. One point that's been emphasized is that I should look into the effects of using the default parameters vs custom settings.

So far, I've been looking into Bowtie2, BWA and BWA-MEM, SMALT, segemehl, and BBMap: reading the manuals and literature, trying to get a few test datasets, and figuring out the different options for each program. I've also just added Minimap2 to my list, and will be reading up on that one over the coming days.

If you perform read alignments as part of your research, or have any other unique expertise on the subject, I'd be grateful for whatever insight you can offer! Here's an idea of the type of questions I'd like feedback on:

  • What sequencing technology do you use to obtain your reads? What is your preferred tool for performing read alignments, and why? What advantage does it have over other available programs?
  • Are there any unique challenges presented by the organisms you study or the sequencing technology you use? What adjustments do you make to your workflow to combat this?
  • What factors do you place the most weight on when performing read alignments? Speed, adjustable parameters, low memory requirements?
  • Do you use default settings, or custom options? If you use custom settings, what options do you change? Why is this necessary? How does it affect the outcome of your alignment?
  • Any other info you think is valuable and would like to share with me!

Thanks in advance!


This really boils down to personal preference. Most aligners (as long as you use them for their intended purpose, e.g. you can't use bowtie v.1 with spliced reads) should produce more or less similar answers for 90+% of the data. They will likely perform differently (badly or not at all) on edge cases, but those may or may not be critical to the overall success of your analysis. If you are doing something non-standard to begin with, then you need to research your options for that specific application.

Research many aligners, identify one you like, become intimately familiar with its options, and stick with it: that is the best take-home I can offer.

What sequencing technology do you use to obtain your reads?

For bulk (read volume) sequencing there is no alternative to Illumina. For long reads, both Oxford Nanopore and PacBio will do. It depends on how much money you have to spend and what kind of error you are willing to tolerate.


"Accuracy" is often just a red herring; there are usually far more fundamental differences between aligners than the percentage of reads they appear to be able to align. For example, the default behavior of bwa mem can be achieved with bowtie2 if you run it with the --very-sensitive-local parameter. At the same time, bowtie2 gives users far more options to filter their results (which alignments make it into the BAM file).

In addition, different aligners may fill in different optional SAM tags that may be necessary for downstream analysis. Alas, the differences between aligners are not well documented, so this is more of a gray area for most.
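One way to see which optional tags a given aligner actually writes is to tally the tag names in its SAM output. The following is a minimal sketch of that idea; the function names `list_sam_tags` and `sample_sam` and the two sample records are invented for illustration and are not real aligner output.

```shell
# Tally optional SAM tags (column 12 onward) in a SAM stream on stdin.
# Header lines (starting with '@') are skipped.
list_sam_tags() {
  awk -F'\t' '
    /^@/ { next }
    {
      for (i = 12; i <= NF; i++) {
        split($i, parts, ":")      # TAG:TYPE:VALUE, keep only TAG
        count[parts[1]]++
      }
    }
    END { for (tag in count) print tag "\t" count[tag] }
  ' | sort
}

# Hypothetical two-record SAM input, made up for this example only.
sample_sam() {
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'r1\t0\tchr1\t10\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0\tAS:i:5\tXS:i:0\n'
  printf 'r2\t0\tchr1\t20\t42\t5M\t*\t0\t0\tACGTT\tIIIII\tNM:i:1\tAS:i:3\n'
}

# Prints one line per tag: the tag name, a tab, and how many records carry it.
sample_sam | list_sam_tags
```

With real data, the same function could be fed from something like `samtools view aln.bam | list_sam_tags`, run once per aligner, and the resulting lists compared.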

5.2 years ago
ATpoint 81k
What sequencing technology do you use to obtain your reads? What is your preferred tool for performing read alignments, and why? What advantage does it have over other available programs?

=> Illumina. BWA mem for everything > 70bp, bowtie for everything < 70bp and Hisat2 for RNA-seq unless it is about pure quantification, then it is salmon (but that is not an aligner). Reasons: well-maintained and tested tools. BWA is the go-to for most downstream tools when it comes to variant calling and structural variation calling, and I do not feel the need to have scripts for more than one aligner for a given read length.
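The rule of thumb above could be written down roughly as follows. This is only a sketch: the function names and the assay labels (`dna`, `rna`, `rna-quant`) are hypothetical, and sending reads of exactly 70 bp to BWA-MEM is an assumption, since the answer only covers lengths above and below 70.

```shell
# Suggest a tool following the rule of thumb in this answer.
# Arguments: assay ("dna", "rna" or "rna-quant") and read length in bp.
# Assumption: a read of exactly 70 bp goes to BWA-MEM.
suggest_tool() {
  local assay="$1" read_len="$2"
  case "$assay" in
    rna-quant) echo "salmon" ;;   # pure quantification, not an aligner
    rna)       echo "hisat2" ;;
    *)
      if [ "$read_len" -ge 70 ]; then
        echo "bwa mem"
      else
        echo "bowtie"
      fi
      ;;
  esac
}

# Length of the first read in an uncompressed FASTQ stream on stdin
# (the sequence is line 2 of the first record).
first_read_length() {
  awk 'NR == 2 { print length($0); exit }'
}

# Tiny made-up FASTQ record with a 10 bp read.
len="$(printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n' | first_read_length)"
suggest_tool dna "$len"
```

Taking the length of the first read is a shortcut that only makes sense for untrimmed, fixed-length Illumina data; for trimmed reads a mean or median length would be a better input.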

Are there any unique challenges presented by the organisms you study or the sequencing technology you use? What adjustments do you make to your workflow to combat this?

=> No. Mouse/human only, default settings always perform well.

What factors do you place the most weight on when performing read alignments? Speed, adjustable parameters, low memory requirements?

=> None of these. Tools (in general) must be well-accepted in the literature/community, well-tested and maintained, and downstream tools must be able to deal with their output. Speed can typically be increased by smart usage of Unix pipes (avoiding unnecessary intermediate files, indexing/flagstatting on the fly, see the example at the bottom) and GNU parallel (given you have enough CPUs for parallelization and enough memory). All common aligners support multithreading anyway.
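The caveat about CPUs and memory can be made concrete with a little arithmetic: the number of samples to align at once is capped by whichever resource runs out first. A minimal sketch, where the function name and all numbers are assumptions chosen for illustration:

```shell
# Number of aligner jobs to run at once, limited by both CPU and memory.
# Arguments: total CPUs, threads per job, total memory (GB), memory per job (GB).
max_parallel_jobs() {
  local cpus="$1" threads="$2" mem_gb="$3" mem_per_job_gb="$4"
  local by_cpu=$(( cpus / threads ))
  local by_mem=$(( mem_gb / mem_per_job_gb ))
  local jobs="$by_cpu"
  if [ "$by_mem" -lt "$jobs" ]; then jobs="$by_mem"; fi
  if [ "$jobs" -lt 1 ]; then jobs=1; fi   # always allow at least one job
  echo "$jobs"
}

# Made-up machine: 32 CPUs and 64 GB RAM; each job uses 8 threads and ~20 GB.
max_parallel_jobs 32 8 64 20

# The result could then be handed to GNU parallel, e.g. (align_one.sh is a
# hypothetical per-sample script, not part of any tool):
#   parallel -j "$(max_parallel_jobs 32 8 64 20)" ./align_one.sh {} ::: sampleA sampleB sampleC
```

In this example memory is the limit (three jobs), even though the CPU count alone would allow four.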

Do you use default settings, or custom options? If you use custom settings, what options do you change? Why is this necessary? How does it affect the outcome of your alignment?

=> Default for standard tasks. Might set mismatch and gap open penalties very high if exact matches are desired.
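For the exact-match case, in bwa mem the mismatch and gap open penalties are the `-B` and `-O` options. Whatever values are chosen, it can be worth checking afterwards that the surviving alignments really are exact. The sketch below is such a post-hoc check on SAM text, not the penalty setting itself; it assumes the aligner writes the `NM` (edit distance) tag, and the function names and sample records are invented for illustration.

```shell
# Keep only SAM records that are exact, end-to-end matches:
# a CIGAR consisting of a single M (or =) operation plus an NM:i:0 tag.
# Header lines pass through unchanged.
exact_matches_only() {
  awk -F'\t' '
    /^@/ { print; next }
    $6 ~ /^[0-9]+[M=]$/ {
      for (i = 12; i <= NF; i++)
        if ($i == "NM:i:0") { print; next }
    }
  '
}

# Hypothetical records: r1 is exact, r2 has a mismatch, r3 is soft-clipped.
demo_sam() {
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'r1\t0\tchr1\t10\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0\tAS:i:5\n'
  printf 'r2\t0\tchr1\t20\t60\t5M\t*\t0\t0\tACGTT\tIIIII\tNM:i:1\tAS:i:1\n'
  printf 'r3\t0\tchr1\t30\t60\t2S3M\t*\t0\t0\tTTACG\tIIIII\tNM:i:0\tAS:i:3\n'
}

# Only the header and r1 are printed.
demo_sam | exact_matches_only
```

The soft-clipped record r3 shows why high mismatch and gap penalties alone may not be enough in local alignment mode: a read can still align partially with zero edits in the aligned portion.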

Any other info you think is valuable and would like to share with me!

=> There is a really extensive body of material (papers, forum discussions, blogs) on alignment tool comparisons, pros/cons, personal preferences, and benchmarks. Use Google and the search function. You'll find plenty of stuff.

Example of a Unix pipe for trimming, alignment, duplicate marking, sorting, indexing, and flagstatting, assuming we start with a pair of FASTQ files:

seqtk mergepe in_1.fastq.gz in_2.fastq.gz | \
  cutadapt (options...) --interleaved - | \
  bwa mem -p $IDX /dev/stdin | \
  samblaster --ignoreUnmated | \
  samtools sort -O BAM - | \
  tee >(samtools flagstat - > out.flagstat) | \
  tee out_sorted.bam | \
  samtools index - out_sorted.bam.bai