I have a VCF file which contains information on allele depth (i.e. the number of reads mapping to the reference and alternate alleles):
1 752566 . G A 68 . GG=0,69,68,69,849,556,849,490,556,849;DP=274;AC=1;AN=2 GT:AD:DP:GQ:PL:GG 0/1:1,2:3:19:41,0,19:19,25,0,25,95,50,95,41,50,95
I was wondering whether there is a way (using bcftools for example, rather than some home-made script) to downsample the VCF to a given coverage by removing reference and alternate reads, i.e. take a file with a mean coverage of 40x and reduce it to a mean coverage of 3x. Obviously I want the PL scores in the genotype (FORMAT) field to be adjusted accordingly (hence why I'd rather something like bcftools did it than a home-made script).
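To make concrete what I mean, here is a rough sketch of the kind of home-made approach I'd rather avoid: binomially thin the AD counts and recompute PL under a naive fixed per-read error rate. It assumes a biallelic site and ignores base qualities, so the numbers would not match what a real caller produces.

import math
import random

def thin_and_rescore(ref_reads, alt_reads, keep_fraction, err=0.01):
    # Keep each read independently with probability keep_fraction.
    new_ref = sum(random.random() < keep_fraction for _ in range(ref_reads))
    new_alt = sum(random.random() < keep_fraction for _ in range(alt_reads))

    # Naive genotype likelihoods for 0/0, 0/1 and 1/1 (fixed error rate,
    # no base qualities -- illustration only).
    lik = [
        (1 - err) ** new_ref * err ** new_alt,   # hom ref
        0.5 ** (new_ref + new_alt),              # het
        err ** new_ref * (1 - err) ** new_alt,   # hom alt
    ]
    raw = [-10 * math.log10(l) if l > 0 else 9999 for l in lik]
    pl = [round(x - min(raw)) for x in raw]      # normalise so the best genotype is 0
    return (new_ref, new_alt), pl

# Example: a 40x site thinned to roughly 3x (keep_fraction = 3/40).
print(thin_and_rescore(20, 20, 3 / 40))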
EDIT: just to say, I don't have access to the original SAM/BAM files, so this has to be done on the VCF.
Thanks.
Just a thought: use Picard DownsampleSam to keep a random 5%, 10% or 20% of the reads at the BAM stage, and then re-call variants.
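Something along these lines (check the Picard documentation for the full set of options; P is the probability of keeping any given read):

java -jar picard.jar DownsampleSam I=input.bam O=downsampled.bam P=0.10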
Hi, unfortunately I don't have access to the original BAM files, otherwise this would be a good idea! Thanks.
When you go from BAM to VCF, you lose a lot of information. I am not sure how you can simply downsample at the VCF stage - you no longer have the underlying read information.
Why do you want to do this?
The idea is to take a high-coverage individual, downsample it so that some SNPs are not covered by any reads, impute those missing markers, and then compare the imputed calls to the original full-coverage calls in order to test the accuracy of imputation.
I am not sure that you can do this with just the VCF...
You may try a different approach: rather than thinning reads, randomly set a fraction of the genotypes in the VCF to missing (./.), impute those sites, and compare the imputed genotypes back to the original calls.
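A minimal sketch of that idea, assuming a standard tab-delimited VCF in which GT is one of the FORMAT keys (the script name and the 10% masking fraction are just placeholders):

import random
import sys

MASK_FRACTION = 0.10   # fraction of genotypes to set to missing

for line in sys.stdin:
    if line.startswith('#'):        # pass header lines through unchanged
        sys.stdout.write(line)
        continue
    fields = line.rstrip('\n').split('\t')
    fmt_keys = fields[8].split(':')
    gt_idx = fmt_keys.index('GT')   # position of GT within the FORMAT string
    for i in range(9, len(fields)): # one column per sample
        if random.random() < MASK_FRACTION:
            sample = fields[i].split(':')
            sample[gt_idx] = './.'
            fields[i] = ':'.join(sample)
    sys.stdout.write('\t'.join(fields) + '\n')

You would run it as something like python mask_genotypes.py < full.vcf > masked.vcf, impute masked.vcf, and then compare the imputed genotypes at the masked sites to the calls in full.vcf.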