I have 50 VCF files, each corresponding to a patient sample. I want to merge all these files, extract the genotype information by chromosome position / SNP ID, and convert it into a 012 matrix as quickly and efficiently as possible. VCFtools and BCFtools are capable of doing so, but I'm trying to automate this, so I'd like to script it in Python or possibly R.
I wouldn't want duplicated SNPs across different samples (files) either, so the idea is to get a SNP array with sample IDs (extracted from the file names) as column names and chromosome positions / SNP IDs as row names.
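For illustration, the 012 encoding and the SNP-by-sample layout described above could be sketched in plain Python as follows. This is only a minimal sketch, not a replacement for VCFtools or BCFtools: the function names, the (chromosome, position) key, and the nested-dict input are hypothetical, and it assumes that a site absent from a sample's file is treated as missing (-1) rather than homozygous reference.

```python
import re

def gt_to_012(gt):
    """Count non-reference alleles in a VCF GT string; return -1 if any allele is missing."""
    alleles = re.split(r"[/|]", gt)  # handles unphased (0/1) and phased (0|1) calls
    if any(a in (".", "") for a in alleles):
        return -1
    return sum(1 for a in alleles if a != "0")

def build_matrix(calls):
    """Build a SNP-by-sample matrix.

    calls: {sample_id: {(chrom, pos): GT string}} (hypothetical input layout).
    Returns (samples, matrix), where matrix maps each SNP key to a list of
    0/1/2/-1 codes ordered like `samples`. Each SNP key appears exactly once.
    """
    samples = sorted(calls)
    snps = sorted({key for per_sample in calls.values() for key in per_sample})
    matrix = {
        # Assumption: a site not present in a sample's file is coded as missing.
        snp: [gt_to_012(calls[s].get(snp, "./.")) for s in samples]
        for snp in snps
    }
    return samples, matrix
```

Taking the union of SNP keys across samples is what keeps each SNP to a single row, regardless of how many files it appears in.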
What you want just sounds like a multisample VCF file without the metadata headers. Why not just call the necessary vcftools command from within Python or R?
A "SNP array" is usually an oligonucleotide microarray for calling millions of SNPs. Probably not the same as what you have in mind, but confusing nonetheless.
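The suggestion to call the command-line tools from a script could look roughly like the sketch below. It assumes bcftools and vcftools are installed and on the PATH, and that the input VCFs are bgzip-compressed and indexed (which bcftools merge requires); the function names, file names, and output prefix are hypothetical.

```python
import subprocess

def merge_cmd(vcf_paths, out_path):
    """Command to merge single-sample VCFs into one bgzipped multisample VCF."""
    return ["bcftools", "merge", "-O", "z", "-o", out_path, *vcf_paths]

def matrix_cmd(merged_path, out_prefix):
    """Command to write the genotypes as a 012 matrix with vcftools."""
    return ["vcftools", "--gzvcf", merged_path, "--012", "--out", out_prefix]

def run_pipeline(vcf_paths, merged_path="merged.vcf.gz", out_prefix="genotypes"):
    """Run both steps; check=True raises CalledProcessError if a tool fails."""
    subprocess.run(merge_cmd(vcf_paths, merged_path), check=True)
    subprocess.run(matrix_cmd(merged_path, out_prefix), check=True)
```

With --012, vcftools writes three files next to the chosen prefix (.012, .012.indv and .012.pos). As far as I recall, the .012 file has one individual per line, so it would need transposing to get SNPs as rows.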
All things considered, what would be the fastest way to merge gVCF files and VCF files efficiently? BCFtools is faster than VCFtools but doesn't work with gVCF files.
Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your reaction but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.
The CombineGVCFs walker from GATK (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_variantutils_CombineGVCFs.php) allows combining gVCFs.
PS: please move this post to a comment on the OP or make it a new post.
GATK doesn't allow merging of VCF and gVCF files unfortunately. My aim is to obtain a single VCF file from the entire set,
Did you try bcftools merge with the -g option?
Tried it, but on merge BCFtools treats <NON_REF> as a literal allele call instead of ignoring it, so <NON_REF> contributes to the genotype call.