Biostar Beta. Not for public use.
Filter VCF on Number of genotyped individuals
Entering edit mode
3.5 years ago
mek362 • 10

Hello all,

I would like to know how to filter a VCF by number of individuals with genotype data (i.e. without missing data).

First, some background:

I have a large VCF file containing SNP data for 95 individuals called from bcftools mpileup output. BAMs passed to mpileup contained data for alignments of RNAseq reads to a de novo transcriptome. Thus, the VCF lists SNPs within each contig defined in the reference de novo transcriptome.

The 95 individuals were collected from 6 different populations. For population genetic analysis, I would like to produce a VCF that only contains SNPs at which there are data for at least 75% of individuals in each population (11 or 12).

I have a decent familiarity with bcftools, and know how to filter missing data, but I am not sure how to approach this. As expected, filtering on the N_SAMPLES variable alone doesn't seem to modify the VCF. Perhaps my best course of action would be to first split the VCF into population-specific files?

Any help with this would be greatly appreciated! Please let me know if I've failed to include any useful information.

Entering edit mode
16 months ago
Brice Sarver ♦ 2.6k
United States

VCFtools will do this:

vcftools --recode --vcf your_input.vcf --max-missing (proportion missing above which to filter: [0-1]) --out your_output.vcf

Edit: if you want to use this command per population, you're right: split first.

Entering edit mode

Thanks for the answer, Brice. This worked well -- I split the master .vcf by population, filtered using vcftools --recode, and merged filtered population .vcf's using bcftools isec. It may also be useful to note that the --max-missing option excludes sites based on proportion of missing data, where a 0 indicates no filter and a 1 only passes sites with no missing data (i.e. 0.75 filters out sites missing over 25% of data).

Entering edit mode

Yep, you're correct. I assumed people would peek at the manual for specifics, but I've gone ahead and made the clarification in my post in case others take the call verbatim. Thanks!


Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3.1