Filter VCF on Number of genotyped individuals
1
1
Entering edit mode
7.7 years ago
mek362 ▴ 10

Hello all,

I would like to know how to filter a VCF by number of individuals with genotype data (i.e. without missing data).

First, some background:

I have a large VCF file containing SNP data for 95 individuals called from bcftools mpileup output. BAMs passed to mpileup contained data for alignments of RNAseq reads to a de novo transcriptome. Thus, the VCF lists SNPs within each contig defined in the reference de novo transcriptome.

The 95 individuals were collected from 6 different populations. For population genetic analysis, I would like to produce a VCF that only contains SNPs at which there are data for at least 75% of individuals in each population (11 or 12).

I have a decent familiarity with bcftools, and know how to filter missing data, but I am not sure how to approach this. As expected, filtering on the N_SAMPLES variable alone doesn't seem to modify the VCF. Perhaps my best course of action would be to first split the VCF into population-specific files?

Any help with this would be greatly appreciated! Please let me know if I've failed to include any useful information.

SNP VCF Filtering Transcriptome • 4.8k views
ADD COMMENT
5
Entering edit mode
7.7 years ago
Brice Sarver ★ 3.8k

VCFtools will do this:

vcftools --recode --vcf your_input.vcf --max-missing (proportion missing above which to filter: [0-1]) --out your_output.vcf

Edit: if you want to use this command per population, you're right: split first.

ADD COMMENT
0
Entering edit mode

Thanks for the answer, Brice. This worked well -- I split the master .vcf by population, filtered using vcftools --recode, and merged filtered population .vcf's using bcftools isec. It may also be useful to note that the --max-missing option excludes sites based on proportion of missing data, where a 0 indicates no filter and a 1 only passes sites with no missing data (i.e. 0.75 filters out sites missing over 25% of data).

ADD REPLY
0
Entering edit mode

Yep, you're correct. I assumed people would peek at the manual for specifics, but I've gone ahead and made the clarification in my post in case others take the call verbatim. Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 2699 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6