Is there a correct order in which to filter BAM files?
2
0
Entering edit mode
9.0 years ago
James Ashmore ★ 3.4k

Say I have a BAM file which I want to filter, I want to apply the following processing steps: Remove reads aligned to blacklist regions, remove mitochondrial mapping reads, remove unmapped reads, remove pcr duplicates, remove reads with MAPQ score lower than 30.

Is there a 'correct' order to do these in, or does it not matter? I think problems could arise where certain filtering steps create orphans, resulting in incorrect SAM flags, which later software require to be accurate.

BAM Filtering • 3.5k views
ADD COMMENT
0
Entering edit mode

Most of the variant callers don't consider reads that are flagged as pcr duplicates, secondary alignments, unmapped reads (no brainer). So you need not to worry about their presence in the BAM file provided that these reads have been correctly flagged. MAPQ > 30 can be chosen if you really want to be stringent but as Brian suggested that it may incur some bias. There will be lots of reads with MAPQ < 30 but still align uniquely to the reference genome. So I would not use MAPQ to filter reads as variant caller will be smart enough to only use uniquely aligned reads. But these decisions are subjective and differ between people.

ADD REPLY
2
Entering edit mode
9.0 years ago

I think removing PCR duplicates first would be good as this step takes paired-end information into account. Then perform rest of the steps in any order ( can be a single command with pipes).

ADD COMMENT
0
Entering edit mode
9.0 years ago

Where did the fad of removing reads with mapq below 30 come from? If you are doing variant-calling, that can incur a serious reference bias.

ADD COMMENT
0
Entering edit mode

This is interesting. Do you mean it's too stringent ? Or mapq based filtering is not at all a good idea ?

ADD REPLY
0
Entering edit mode

mapq-based filtering is useful when you set a threshold of "above 3", meaning reads that the mapper think are uniquely/unambiguously aligned. But the more a read differs from the reference, the lower the mapq will be. So filtering with a high threshold will exclude reads that indicate variants. It's much better to use all the reads (that were aligned unambiguously) the generate a consensus and figure out things from there, rather than throwing away all the data that did not align near-perfectly on a per-read basis.

mapq is useful for a variant-caller to weight the values of different reads covering a given location, but I don't suggest using it for generic filtering in most cases (other than to screen ambiguously-mapped reads).

ADD REPLY
0
Entering edit mode

I'm not doing any variant calling, and the MAPQ value helped remove regions with extremely large coverage in my dataset which I did not want for my specific analysis.

ADD REPLY
0
Entering edit mode

Well IMHO just removing the extra reads genome wide to make the bam file smaller doesn't sound a systematic approach. You should some other approach if you want to do a focused analysis on certain genomic regions.

ADD REPLY

Login before adding your answer.

Traffic: 2369 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6