This might be an XY question, so I'll explain my premise:
- I have 3 VCF files, `f1`, `f2` and `f3`.
  - `f1` is an annotated VCF covering 50 samples
  - `f2` is an annotated VCF covering 5 samples, but only sites that are not in `f1`
  - `f3` is an un-annotated VCF covering 5 samples across sites in `f1` as well as sites not in `f1`
- All annotations are site-level
I now wish to get this as one VCF file with all sites annotated and all sample-level information present.
When I merge `f1` and `f2`, I get a VCF with all annotated sites and all samples, but for those sites overlapping with `f3`, the `GT/AD/...` fields are empty, because that information is in `f3`. How do I merge these three datasets?
Question:
In essence, can I do an operation to update genotype fields in one VCF file based on a sample+site match in another VCF file? If they were 2 `data.frame`s, the operation would be something like `vcf1[site, sample] <- vcf2[site, sample]`.
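To make the intended operation concrete, here is a minimal Python sketch of that update on two tiny in-memory VCFs. It is only an illustration under assumptions: uncompressed VCF text, a site identified by CHROM, POS, REF and ALT, and identical FORMAT keys in both files. The function names (`parse_vcf`, `update_genotypes`) and the toy records are hypothetical, and this is not how bcftools works internally.

```python
def parse_vcf(text):
    """Split VCF text into (meta lines, sample names, {site key: fields})."""
    meta, samples, records = [], [], {}
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line)
        elif line.startswith("#CHROM"):
            samples = line.split("\t")[9:]  # sample columns start at index 9
        else:
            fields = line.split("\t")
            # Assumption: CHROM, POS, REF, ALT together identify a site.
            key = (fields[0], fields[1], fields[3], fields[4])
            records[key] = fields
    return meta, samples, records


def update_genotypes(target_text, source_text):
    """Overwrite sample columns in target with those from source
    wherever both the site and the sample name match."""
    _, t_samples, t_records = parse_vcf(target_text)
    _, s_samples, s_records = parse_vcf(source_text)
    shared = [name for name in s_samples if name in t_samples]
    for key, s_fields in s_records.items():
        if key not in t_records:
            continue  # site only in source: nothing to update
        if t_records[key][8] != s_fields[8]:
            continue  # FORMAT keys differ: skipped in this sketch
        for name in shared:
            t_idx = 9 + t_samples.index(name)
            s_idx = 9 + s_samples.index(name)
            t_records[key][t_idx] = s_fields[s_idx]
    return t_samples, t_records


COLUMNS = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"

# Toy stand-in for the annotated VCF; sample N1 is missing at chr1:100.
vcf1 = "\n".join([
    "##fileformat=VCFv4.2",
    COLUMNS + "\tS1\tN1",
    "chr1\t100\t.\tA\tG\t.\tPASS\tANN=x\tGT:AD\t0/1:5,4\t./.:.",
    "chr1\t200\t.\tC\tT\t.\tPASS\tANN=y\tGT:AD\t./.:.\t0/1:3,3",
])

# Toy stand-in for the un-annotated VCF holding N1's genotypes.
vcf2 = "\n".join([
    "##fileformat=VCFv4.2",
    COLUMNS + "\tN1",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT:AD\t1/1:0,9",
    "chr1\t300\t.\tG\tA\t.\tPASS\t.\tGT:AD\t0/1:2,2",
])

samples, updated = update_genotypes(vcf1, vcf2)
for fields in updated.values():
    print("\t".join(fields))
```

In this sketch the site-level INFO column of the target is left untouched and only the matching sample column is overwritten; sites present only in the source are ignored rather than added.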
Current solution:
The way I see it, I might have to subset `f3` to `f1`-sites only, then `bcftools merge <f1> <f3_subset> > <f1_f3_subset>` - that way I do not add any sites, only samples. Then I `bcftools concat <f1_f3_subset> <f2> > <final_vcf>`, so this time I add only sites, no samples. Any other solution will be appreciated.

That solution does not work, as `bcftools concat` cannot work on VCFs with different samples in them.
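The mismatch can be spotted before running anything by comparing the sample columns of the two inputs. The following is a small Python sketch, assuming uncompressed VCF text; the helper names (`sample_names`, `same_samples`) and the toy headers are hypothetical, and the check only reads the `#CHROM` header line rather than calling bcftools.

```python
def sample_names(vcf_text):
    """Return the sample names from the #CHROM header line of VCF text."""
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM"):
            return line.split("\t")[9:]  # sample columns start at index 9
    raise ValueError("no #CHROM header line found")


def same_samples(vcf_a, vcf_b):
    """True only if both VCFs list the same samples in the same order."""
    return sample_names(vcf_a) == sample_names(vcf_b)


FIXED = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"

# Toy headers: the merged file carries old and new samples,
# while the second file carries only the new sample.
merged_header = "##fileformat=VCFv4.2\n" + FIXED + "\tS1\tS2\tN1"
f2_header = "##fileformat=VCFv4.2\n" + FIXED + "\tN1"

print(same_samples(merged_header, f2_header))
```

A result of `False` here corresponds to the situation described above, where the two files cannot simply be stacked row-wise because their sample columns differ.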