Dear all,
I'm using VCFtools 0.1.17 to filter out loci with low coverage, this is the process:
- Use --site-mean-depth to get a "table.ldepth.mean" with the mean values per site.
- Calculate manually how many loci there are above different mean depth values, and choose a minimum depth value.
- Use --min-meanDP to filter out the loci below the chosen value.
Problem is:
The final vcf file after pruning low depth sites always has fewer loci than expected, when calculated from the table.
Example: If I check how many loci there are with MEAN_DEPTH above 4 in the table there is 5926; but if I prune loci using "--min-meanDP 4" there are only 2157 loci left in the output file. And there should be more, because it should not only include loci above 4, but also equal to 4.
Tried this with many different data, and combinations of values for --min-meanDP and --max-meanDP, and I always get way fewer loci in the output file than if I calculate their number manually from the table.
After some digging I found out, the problem are the missing values, looks like ----site-mean-depth ignores missing values, while --min-meanDP considers that they have DP=0 (which makes more sense).
If the vcf file has no missing genotypes, the numbers match.
Someone else has this problem?
Is there an alternative to calculate mean depth per locus that will consider missing as 0?
Thanks!
Thanks for the heads up on "missing", Kevin, you are right. Yes, before seeing your answer we were discussing already that it will be better to write our own script, in perl in my case. Cheers!
Óscar