I have whole-exome sequencing data from cell-free DNA (at 500X or 1000X coverage) and have called mutations with Mutect2 from the GATK pipeline. However, the results include mutations outside the exons, alongside all the exonic mutations we expect.
Examining the original BAM file from read mapping, we see many off-target reads from introns and intergenic regions. The depth of coverage at some off-target positions is comparable to the coverage at some exons, so a hard depth filter would remove the true positives at those exons along with the false positives from introns and intergenic regions.
I'm not entirely au fait with Mutect2, so I don't know whether the algorithm includes a normalisation step; if it does, its results might change if the off-target reads were removed before calling variants (the effective library size would change, though coverage at exons would still be >200X). That would argue for filtering the final generated VCF instead, but I'm not sure what best practice is in this case. As mentioned above, a hard depth filter would remove some of the mutations we want to detect, since they are likely to be at low allele frequency.
The question is: at which stage of the pipeline should the off-target reads be removed, before variant calling or once the VCF is produced?
I agree: you should generally leave everything that has aligned in place until the end, and then filter only the final output you obtain (e.g. the VCF).
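In practice the post-hoc filtering step is usually done with existing tools, e.g. `bedtools intersect -header -a calls.vcf -b targets.bed` or GATK's `SelectVariants` with `-L targets.interval_list`, rather than hand-rolled code. But the underlying logic is just an interval-containment test, which a minimal stdlib-only sketch can illustrate (the BED and VCF lines are assumed to be already read in; coordinate conventions are the usual 0-based half-open BED vs 1-based VCF POS):

```python
from collections import defaultdict
import bisect

def load_targets(bed_lines):
    """Build per-chromosome sorted interval lists from BED records (0-based, half-open)."""
    targets = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        targets[chrom].append((int(start), int(end)))
    for ivals in targets.values():
        ivals.sort()
    return targets

def on_target(targets, chrom, pos):
    """Return True if the 1-based VCF POS falls inside any target interval."""
    p = pos - 1  # convert to 0-based coordinate
    ivals = targets.get(chrom, [])
    # rightmost interval whose start is <= p (assumes non-overlapping intervals)
    i = bisect.bisect_right(ivals, (p, float("inf"))) - 1
    return i >= 0 and ivals[i][1] > p

def filter_vcf(vcf_lines, targets):
    """Yield header lines unchanged and only those records inside the targets."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
        else:
            chrom, pos = line.split("\t")[:2]
            if on_target(targets, chrom, int(pos)):
                yield line
```

This keeps the BAM and the calling step untouched, so Mutect2 sees the same effective library it was run on, and only the reporting is restricted to the capture targets. Note that real tools also handle multi-base variants spanning an interval edge, overlapping intervals, and padding around targets, which this sketch ignores.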