Question

What are the cut-offs during read quality filtering in Bowtie/TopHat before mapping?

3

Entering edit mode

9.1 years ago

glados ▴ 40

Before mapping RNAseq reads, TopHat always perform a quality filtering step on the reads when preparing them. I would like to know what the cut-off is for discarding a read? It's difficult to find information about this since most posts about quality and tophat/bowtie relates to mapping quality (naturally) and not read quality.

Cannot find much in the TopHat manual but in the Bowtie manual it says:

Some reads are skipped or "filtered out" by Bowtie 2. For example, reads may be filtered out because they are extremely short or have a high proportion of ambiguous nucleotides. Bowtie 2 will still print a SAM record for such a read, but no alignment will be reported and and the YF:i SAM optional field will be set to indicate the reason the read was filtered.

YF:Z:LN: the read was filtered becuase it had length less than or equal to the number of seed mismatches set with the -N option.

YF:Z:NS: the read was filtered because it contains a number of ambiguous characters (usually N or .) greater than the ceiling specified with --n-ceil.

YF:Z:SC: the read was filtered because the read length and the match bonus (set with --ma) are such that the read can't possibly earn an alignment score greater than or equal to the threshold set with --score-min

YF:Z:QC: the read was filtered because it was marked as failing quality control and the user specified the --qc-filter option. This only happens when the input is in Illumina's QSEQ format (i.e. when --qseq is specified) and the last (11th) field of the read's QSEQ record contains 1.

My read length is definitely not shorter than the seed mismatches so this first option can be ruled out.

Regarding the second option about ambiguous characters, this is what --n-ceil says:

Sets a function governing the maximum number of ambiguous characters (usually Ns and/or .s) allowed in a read as a function of read length. For instance, specifying -L,0,0.15 sets the N-ceiling function f to f(x) = 0 + 0.15 * x, where x is the read length. See also: [setting function options]. Reads exceeding this ceiling are [filtered out]. Default: L,0,0.15.

So default read length * 0.15. Pretty straightforward. No questions here.

The third option regard --ma I would assume does not apply to the default settings since the default setting is to use end-to-end alignment? --ma always equals 0 in this default mode.

The fourth option applies to users specifying the --qseq options so for someone like me who uses fastq, I would assume it's not relevant.

Does that mean with default settings, bowtie/tophat only takes into consideration ambiguous characters? What about fastq base quality/read quality, is this not taken into consideration when filtering?

Appreciate any help and thoughts.

TopHat filtering quality RNA-Seq Bowtie • 5.2k views

ADD COMMENT • link updated 23 months ago by Ram 43k • written 9.1 years ago by glados ▴ 40

score 1 · Answer 1 · 2016-01-15

I am wondering the same thing about Bowtie2. I think it must include the Phred scores in the calculation of the Alignment Score and the corresponding MAPQ, considering that you can specify either --phred33 or --phred64. I haven't found a whole lot of resources explaining this, but this is what I've found so far:

http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#scores-higher-more-similar

http://biofinysics.blogspot.com/2014/05/how-does-bowtie2-assign-mapq-scores.html

http://www.acgt.me/blog/2014/12/16/understanding-mapq-scores-in-sam-files-does-37-42

It was in fact my search for this information that brought me across your Biostars thread! Maybe this post will boost visibility and someone who really knows Bowtie can edify us.

Edit: Actually, upon closer inspection of the second link above, the biofinysics blog post by John Urban, there is a great bunch of information about how Bowtie2 considers base Phred quality. See Experiment 2 in the "Experiments, Results, and Conclusions" section.

It seems like Bowtie2 is more strongly influenced by multiple alignments than anything else. When a read maps to multiple places, the MAPQ never goes above 1. But, we should also notice that the next best alignment (XS flag) is the same as the "best" alignment score (AS flag) for each of the alignments in the experiment. In the Bowtie2 manual it states that the gap between the alignment score of the "best" mapping position and the alignment scores of any secondary alignments will significantly influence the MAPQ. So maybe those MAPQs would be greater than 1 if the second alignment score was substantially lower than the primary alignment score.

The base quality is taken into account when assessing the mismatch penalties. Higher quality base calls result in stronger penalties for mismatches, whereas lower quality bases result in weaker penalties for mismatches. In a way that makes sense, because you don't want a strong mismatch penalty if the mismatch is due to a sequencing error. But it is also kind of ironic, because if you have a read with lots of poor quality bases, it won't really be penalized if it aligns to only one place, even if there are a ton of mismatches. It could be a totally spurious alignment because it's a junk sequence, but it would still have a high MAPQ.

Maybe Bowtie2 could implement a quality filter similar to the one that exists in ABySS (q: minimum base quality). For example, if you set q=20, then all bases lower than Q20 are trimmed from the end of the read.