I have been trying to phase 66 genomes that are all contained in chromosome-specific VCF files using the software ShapeIt. I have a working pipeline (it works if I use the --force option to override the error I discuss below).
I get the following error:
ERROR: 15611 SNPs with high rates of missing data (>10%). These sites should be removed.
First I tried to use Plink to remove these SNPs, but the resulting VCF had seemingly lost a lot of information. I've since deleted the script, but I could probably figure out what I did if necessary.
Second, I found that VCFtools can remove the SNPs too. I used the following command:
vcftools --vcf $file --max-missing 0.1 --recode --recode-INFO-all --out $OUTDIR/"$newname"
This step only removes a few hundred SNPs, and the error message from ShapeIt indicates that 15461 of the missing data SNPs are still present. Have I misinterpreted the VCFtools manual, missed a parameter, or approached the problem incorrectly?
Thank you in advance for your help. I am still learning a lot as I go, and bioinformatics is certainly not my forte.
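Yes, that is correct. Setting --max-missing to 0.9 means that only variants with genotype calls in at least 90% of your samples will be included; the value is the minimum fraction of non-missing data, where 0 allows sites that are completely missing and 1 allows no missing data at all. The name of this parameter does not do justice to its actual usage. Please feel free to accept your own answer (I have already up-voted it).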
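To check a filtered VCF before handing it to ShapeIt, something like the following could count how many sites still exceed a given missing-data rate. This is only a rough sketch: count_high_missing is a made-up helper name, it assumes an uncompressed, tab-delimited VCF with GT as the first FORMAT subfield, and it treats any genotype containing "." as missing, which may not match exactly how ShapeIt or VCFtools count missing calls.

```shell
# Hypothetical helper: count sites whose fraction of missing genotypes
# is strictly greater than a threshold.
# Usage: count_high_missing file.vcf 0.10
count_high_missing() {
    awk -v max_missing="$2" '
        BEGIN { FS = "\t" }
        /^#/ { next }                        # skip meta and header lines
        {
            n = NF - 9                       # sample columns start at field 10
            if (n <= 0) next                 # no genotype columns on this line
            missing = 0
            for (i = 10; i <= NF; i++) {
                split($i, parts, ":")        # GT is assumed to be the first subfield
                if (parts[1] ~ /\./) missing++   # any "." allele counts as missing
            }
            if (missing / n > max_missing) bad++
        }
        END { print bad + 0 }                # print 0 when no site exceeds the threshold
    ' "$1"
}
```

Running count_high_missing on the recoded VCF with 0.10 should print 0 if the file was produced with --max-missing 0.9; a large number suggests the threshold was applied the other way round, as with --max-missing 0.1.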