Hello!
I have made a raw (unfiltered) variant call set following GATK best practices (a VCF file with ~16 million SNPs produced by GenotypeGVCFs). The original WGS data correspond to 60 samples sequenced at an average coverage of 20x.
We want to identify a small subset of really good SNPs and another subset of really bad SNPs, which we could use for validation.
How can I construct filters that keep the SNPs most likely to be true positives and those most likely to be false positives, respectively?
A first choice would be to rank by QUAL and pick the SNPs at the top and the bottom of the list, but I am sure there is a more sophisticated way to do this.
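For example, here is a minimal sketch of what I mean by ranking on QUAL (plain Python over an uncompressed VCF read from stdin; restricting to biallelic SNPs and the subset size of 1000 are arbitrary choices on my part):

```python
#!/usr/bin/env python3
"""Rank biallelic SNP sites by QUAL and take the top/bottom N as candidate
good/bad subsets. Assumes an uncompressed VCF on stdin."""
import sys

def read_sites(stream):
    for line in stream:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, qual = fields[0], fields[1], fields[3], fields[4], fields[5]
        if qual == ".":
            continue
        # keep simple biallelic SNPs only
        if len(ref) == 1 and len(alt) == 1 and alt in "ACGT":
            yield (chrom, pos, ref, alt, float(qual))

sites = sorted(read_sites(sys.stdin), key=lambda s: s[4])
n = 1000                                   # size of each validation subset
worst, best = sites[:n], sites[-n:]        # lowest and highest QUAL
for chrom, pos, ref, alt, qual in best + worst:
    print(chrom, pos, ref, alt, qual, sep="\t")
```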
Also, since the VCF contains multiple samples, would it be better to filter by site or by genotype?
Thanks and I appreciate your feedback!
Thanks, Istvan.
OK, I will also call variants with SAMtools/BCFtools on the same BAMs. Then I can subset both raw call sets by depth and allele frequency, and consider the SNPs common to both call sets as the good ones.
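Something like this is what I have in mind for the intersection, matching sites on CHROM/POS/REF/ALT (the file names are placeholders and I assume uncompressed VCFs; presumably bcftools isec could do the same thing):

```python
#!/usr/bin/env python3
"""Keep sites called identically (CHROM, POS, REF, ALT) by both callers.
File names are placeholders; assumes uncompressed VCFs."""

def site_keys(path):
    keys = set()
    with open(path) as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t", 5)[:5]
            keys.add((chrom, pos, ref, alt))
    return keys

gatk = site_keys("gatk_raw.vcf")          # GenotypeGVCFs output
bcf = site_keys("bcftools_raw.vcf")       # bcftools call output
shared = gatk & bcf
print(f"{len(shared)} SNPs called by both pipelines")
```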
To filter by depth, I guess I could keep only the SNPs where every sample has a depth >= 30x, as per this white paper.
To filter by allele frequency (provided that the SNPs have the required depth), I was thinking of keeping SNPs where all homozygous samples have an allele frequency of 1 and all heterozygous samples have an allele frequency of ~0.5, as stated in this review.
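To make sure I understand my own two genotype-level checks, here is a rough sketch of the combined filter I have in mind. I am assuming the FORMAT column carries GT, DP and AD (GATK emits these; I would still need to check the bcftools output), that sites are biallelic, and that a ±0.1 tolerance around 1.0 and 0.5 is acceptable; all of that is my guess rather than anything from the papers:

```python
#!/usr/bin/env python3
"""Site-level filter: every sample must have DP >= 30 and an allele balance
consistent with its genotype (near 1 for hom-alt, near 0.5 for het).
Assumes biallelic sites and GT/DP/AD in FORMAT; reads a VCF from stdin."""
import sys

MIN_DP = 30
TOL = 0.1  # allowed deviation from 1.0 (hom-alt) or 0.5 (het); arbitrary

def passes(fields):
    fmt = fields[8].split(":")
    try:
        gt_i, dp_i, ad_i = fmt.index("GT"), fmt.index("DP"), fmt.index("AD")
    except ValueError:
        return False                        # required FORMAT tags missing
    for sample in fields[9:]:
        vals = sample.split(":")
        if max(gt_i, dp_i, ad_i) >= len(vals):
            return False                    # trailing fields dropped
        gt = vals[gt_i].replace("|", "/")
        if gt in (".", "./."):
            return False                    # missing genotype
        if vals[dp_i] == "." or int(vals[dp_i]) < MIN_DP:
            return False                    # depth requirement
        ref_reads, alt_reads = (int(x) for x in vals[ad_i].split(",")[:2])
        total = ref_reads + alt_reads
        if total == 0:
            return False
        ab = alt_reads / total              # per-sample allele balance
        if gt == "0/0":
            ok = ab <= TOL                  # hom-ref: almost no ALT reads
        elif gt == "1/1":
            ok = ab >= 1 - TOL              # hom-alt: balance near 1
        else:
            ok = abs(ab - 0.5) <= TOL       # het: balance near 0.5
        if not ok:
            return False
    return True

for line in sys.stdin:
    if line.startswith("#"):
        sys.stdout.write(line)
        continue
    if passes(line.rstrip("\n").split("\t")):
        sys.stdout.write(line)
```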
Did I get this right?