I am working with a multisample VCF file for a non-model organism that contains only biallelic single nucleotide variants (SNVs). The samples belong to several different breeds of the organism. I want to identify SNPs (single nucleotide polymorphisms), which is a population-based term: as I understand it, the SNV should be present in at least 1 % or 5 % of the population, which here means all the samples in my VCF file. To get SNPs from my multisample SNV file, is removing variants with MAF <= 1 % or <= 5 % the correct approach?
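To make the first question concrete, here is a minimal sketch of the filter I have in mind, written in plain Python rather than with any particular VCF tool. The function names `minor_allele_frequency` and `passes_maf` are my own, and I am assuming the GT strings have already been pulled out of the sample columns of one biallelic site.

```python
def minor_allele_frequency(genotypes):
    """Return the MAF of one biallelic site, or None if no allele was called.

    genotypes: list of GT strings such as "0/1", "1|1" or "./." (missing).
    """
    ref = alt = 0
    for gt in genotypes:
        # Treat phased ("|") and unphased ("/") genotypes the same way.
        for allele in gt.replace("|", "/").split("/"):
            if allele == "0":
                ref += 1
            elif allele == "1":
                alt += 1
            # "." (missing) alleles are not counted.
    called = ref + alt
    if called == 0:
        return None
    return min(ref, alt) / called


def passes_maf(genotypes, threshold=0.05):
    """Keep the site only if its MAF is strictly above the threshold."""
    maf = minor_allele_frequency(genotypes)
    return maf is not None and maf > threshold
```

For example, a site with one heterozygote among 20 samples has MAF = 1/40 = 0.025, so it would be removed at a 5 % threshold but kept at 1 %.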
My second question: if I want to find breed-specific SNPs and variants shared by more than one breed, how should I go about it? This is the approach I have in mind; please let me know if I am on the right track:
1) Split the base multisample VCF file into smaller VCF files, one per breed, each containing only the samples of that breed
2) Set a genotype quality (GQ) threshold and set genotypes to missing if their GQ is below the threshold
3) Remove monomorphic loci (where all genotypes are the same, or only one allele is present)
4) Apply a MAF threshold, removing variants with MAF <= threshold, to get the SNPs
5) Use Venny or another tool to compare variant IDs and find those that are specific to a breed or shared among the breeds.
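The steps above can be sketched roughly as follows. This is only an illustration of the logic, not a tested pipeline: the input layout (`variants` as a dict of variant ID to per-sample `(GT, GQ)` tuples, `sample_breed` as a sample-to-breed map), the function names, and the default thresholds of GQ 20 and MAF 0.05 are all assumptions of mine, and set operations stand in for Venny.

```python
from collections import defaultdict


def allele_counts(genotypes):
    """Count REF ("0") and ALT ("1") alleles, ignoring missing (".") ones."""
    ref = alt = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == "0":
                ref += 1
            elif allele == "1":
                alt += 1
    return ref, alt


def snps_per_breed(variants, sample_breed, min_gq=20, min_maf=0.05):
    """Return {breed: set of variant IDs} that pass steps 1-4 within that breed."""
    kept = defaultdict(set)
    for variant_id, calls in variants.items():
        # Step 1: group the genotypes of this site by breed.
        by_breed = defaultdict(list)
        for sample, (gt, gq) in calls.items():
            # Step 2: genotypes below the GQ threshold become missing.
            by_breed[sample_breed[sample]].append(gt if gq >= min_gq else "./.")
        for breed, genotypes in by_breed.items():
            ref, alt = allele_counts(genotypes)
            called = ref + alt
            # Step 3: skip sites that are monomorphic (or uncalled) in this breed.
            if called == 0 or min(ref, alt) == 0:
                continue
            # Step 4: keep only sites whose within-breed MAF exceeds the threshold.
            if min(ref, alt) / called > min_maf:
                kept[breed].add(variant_id)
    return dict(kept)


def specific_and_shared(kept):
    """Step 5: split variant IDs into breed-specific sets and one shared set."""
    specific = {}
    shared = set()
    for breed, ids in kept.items():
        others = set().union(*(v for b, v in kept.items() if b != breed))
        specific[breed] = ids - others
        shared |= ids & others
    return specific, shared
```

One consequence of filtering within each breed in this sketch: a variant that is fixed for the alternate allele in a breed (all genotypes 1/1) is monomorphic there and gets dropped for that breed, even though the breed carries the variant.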