Greetings!
I'm currently pre-processing some whole-genome and exome sequencing data and could use some tips on my pipeline. I use GATK for most of the steps and have read their recommended best practices, but I'm still unsure about a few things.
My current order of steps is:
1. [Picard]: Merge BAM files if the sample has been run on several lanes.
2. [GATK]: Realign around indels using RealignerTargetCreator followed by IndelRealigner. These walkers are supplied with VCFs containing known indels, provided in the GATK bundle.
3. [GATK]: Recalibrate base quality scores using BaseRecalibrator followed by PrintReads with the -BQSR option. BaseRecalibrator is supplied with a VCF containing known sites, provided in the GATK bundle.
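To make the order concrete, here is a rough sketch of the steps above as command lines, built in Python but not executed. It assumes GATK 3-style syntax (`-T <walker>`) and a single `picard.jar` with `MergeSamFiles`; the jar paths and all file names (`merged.bam`, `targets.intervals`, `recal.table`, and so on) are placeholders I made up, not anything from my actual setup.

```python
from typing import List

# Hypothetical jar locations; adjust to your installation.
GATK_JAR = "GenomeAnalysisTK.jar"
PICARD_JAR = "picard.jar"


def merge_cmd(lane_bams: List[str], out_bam: str) -> List[str]:
    """Picard MergeSamFiles: one I= argument per lane-level BAM."""
    cmd = ["java", "-jar", PICARD_JAR, "MergeSamFiles"]
    cmd += [f"I={bam}" for bam in lane_bams]
    cmd.append(f"O={out_bam}")
    return cmd


def gatk_cmd(walker: str, ref: str, in_bam: str, extra: List[str]) -> List[str]:
    """Common GATK 3 prefix (-T walker, -R reference, -I input) plus walker-specific args."""
    return ["java", "-jar", GATK_JAR, "-T", walker, "-R", ref, "-I", in_bam] + extra


def pipeline(lane_bams: List[str], ref: str, known_indels: str, known_sites: str) -> List[List[str]]:
    """Return the commands in the order described: merge, realign, recalibrate."""
    return [
        merge_cmd(lane_bams, "merged.bam"),
        gatk_cmd("RealignerTargetCreator", ref, "merged.bam",
                 ["-known", known_indels, "-o", "targets.intervals"]),
        gatk_cmd("IndelRealigner", ref, "merged.bam",
                 ["-known", known_indels, "-targetIntervals", "targets.intervals",
                  "-o", "realigned.bam"]),
        gatk_cmd("BaseRecalibrator", ref, "realigned.bam",
                 ["-knownSites", known_sites, "-o", "recal.table"]),
        gatk_cmd("PrintReads", ref, "realigned.bam",
                 ["-BQSR", "recal.table", "-o", "recalibrated.bam"]),
    ]


if __name__ == "__main__":
    for cmd in pipeline(["lane1.bam", "lane2.bam"], "ref.fasta", "indels.vcf", "dbsnp.vcf"):
        print(" ".join(cmd))
```

Printing the commands instead of running them makes it easy to review the order before committing compute time to it.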
In step 2 I use the following filters:
- MappingQualityFilter -mmq 40 (Require mapping quality 40 or higher)
- DuplicateReadFilter
- FailsVendorQualityCheckFilter
- UnmappedRead
- MappingQualityUnavailableFilter
- MappingQualityZero
- BadMateFilter
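For illustration, a small sketch of how read filters like these could be attached to any one GATK command, so the same list can be moved between step 2 and step 3. It assumes GATK 3's `-rf` (read filter) and `-mmq` arguments; I've written the filter names without the `Filter` suffix, which is how I believe `-rf` expects them, but that is an assumption worth checking against your GATK version. The helper `with_read_filters` is a made-up name.

```python
from typing import List, Optional


def with_read_filters(cmd: List[str], filters: List[str],
                      min_mapq: Optional[int] = None) -> List[str]:
    """Return a copy of cmd with one '-rf NAME' pair per filter appended.

    If min_mapq is given, also add the mapping-quality filter with its
    -mmq threshold (assumed GATK 3 syntax).
    """
    out = list(cmd)  # copy so the caller's command is left untouched
    for name in filters:
        out += ["-rf", name]
    if min_mapq is not None:
        out += ["-rf", "MappingQuality", "-mmq", str(min_mapq)]
    return out


# Filter names are assumptions about the spelling -rf accepts.
FILTERS = ["DuplicateRead", "FailsVendorQualityCheck", "UnmappedRead",
           "MappingQualityUnavailable", "MappingQualityZero", "BadMate"]

if __name__ == "__main__":
    realign = ["java", "-jar", "GenomeAnalysisTK.jar", "-T", "IndelRealigner"]
    print_reads = ["java", "-jar", "GenomeAnalysisTK.jar", "-T", "PrintReads"]
    # Variant A: filter during realignment (step 2).
    print(" ".join(with_read_filters(realign, FILTERS, min_mapq=40)))
    # Variant B: filter only when writing recalibrated reads (step 3).
    print(" ".join(with_read_filters(print_reads, FILTERS, min_mapq=40)))
```

Keeping the filter list in one place makes it simple to try both variants and compare the resulting BAMs.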
What I'm unsure of is whether I should do that much filtering in step 2, since the recalibrated base qualities affect the mapping quality (right?). I'm thinking it might be better to skip the filters altogether in step 2 and instead filter in step 3, after the base recalibration is done. That way the filtering would be based on more accurate scores and the end result should be more reliable.
What do you guys think? Any tips are greatly appreciated!
Indeed, the BaseRecalibrator adjusts base scores, but wouldn't that also affect the mapping quality of the reads since it's dependent on the base score? This is assuming mapping quality is calculated as described here: http://genome.sph.umich.edu/wiki/Mapping_Quality_Scores
Anyway, I'll try the filters in different ways as you suggested. There's probably more than one way to do this. Thank you so much for your answer!
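As a side note on the scale involved: mapping qualities, like base qualities, are reported as Phred-scaled values, Q = -10 * log10(P(error)). A minimal sketch of the conversion (the function names are my own), which shows what a threshold such as -mmq 40 means in terms of error probability:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality to the probability that the call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: P(error) = {phred_to_error_prob(q):.0e}")
```

So requiring a mapping quality of 40 or higher keeps only reads whose estimated probability of being mapped to the wrong place is at most about 1 in 10,000.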
Yes, you are right, the recalibration of the base quality also affects the mapping quality (though I'm not sure to what extent). Anyway, the mapping quality should not affect the IndelRealigner, so the results should be the same regardless of the order in which you perform these corrections.
And I'm pretty sure there is more than one way to do it, that's the challenge! ;-)