Hello all,
I would like to know how to filter a VCF by number of individuals with genotype data (i.e. without missing data).
First, some background:
I have a large VCF file containing SNP data for 95 individuals called from bcftools mpileup output. BAMs passed to mpileup contained data for alignments of RNAseq reads to a de novo transcriptome. Thus, the VCF lists SNPs within each contig defined in the reference de novo transcriptome.
The 95 individuals were collected from 6 different populations. For population genetic analysis, I would like to produce a VCF that only contains SNPs at which there are data for at least 75% of individuals in each population (11 or 12).
I have a decent familiarity with bcftools, and know how to filter missing data, but I am not sure how to approach this. As expected, filtering on the N_SAMPLES variable alone doesn't seem to modify the VCF. Perhaps my best course of action would be to first split the VCF into population-specific files?
Any help with this would be greatly appreciated! Please let me know if I've failed to include any useful information.
Thanks for the answer, Brice. This worked well -- I split the master .vcf by population, filtered using vcftools --recode, and merged filtered population .vcf's using bcftools isec. It may also be useful to note that the --max-missing option excludes sites based on proportion of missing data, where a 0 indicates no filter and a 1 only passes sites with no missing data (i.e. 0.75 filters out sites missing over 25% of data).
Yep, you're correct. I assumed people would peek at the manual for specifics, but I've gone ahead and made the clarification in my post in case others take the call verbatim. Thanks!