Dear all,
I have multiple vcf file and want to extract allele frequency and the hetero- and homozygous call with their count. Could you please let me know if I should merge all vcf file to get one vcf file, and then extract the required information? If yes, I read CombineVariants from GATK do this job, but never use it, please kindly tell me is it OK for my task in your opinion?
For extracting the required information (allele frequency and the hetero- and homozygous call with their count), I found that convert2annovar.pl from annovar package help to get this information. However, when I tested on the example vcf file containing 3 for getting zygosity information by the below command:
perl convert2annovar.pl ex2.vcf --format vcf4 --allsample --withzyg --outfile file1.vcf
It gave me 3 output files corresponded to 3 samples with hetero and homo information, could you please kindly tell me how I can have all information in one file? Assuming multi-sample vcf file containing 100 samples, so 100 output files will be generated, how we could handle them?
Also, I tried vcftools to obtain the required information, but sounds that it give us just frequency, is any experience with vcftools for getting zygosity information? Any alternative tools and commands suggestion would be highly appreciated.
Thanks
Hi Kevin,
Thank you so much for your helpful response. First, I normalized vcf file by bcftools norm –m-any and tried your command on the very small vcf file containing just 3 samples and found a strange thing. One of rows in my vcf file is:
Your suggested command calculated the hetero and homo count as: H
As it is obvious, there is 2 AltHet and 1 AltHom, but based on the above output of the command, 1 genotype was considered as RefHom, which is not correct. Could you please kindly tell me how we can interpret this issue?
For obtaining AF and MAF from vcf file, I tried several ways, but none of them useful. I’m so grateful if you have any command /tool to calculate AF and MAF, please kindly share with me.
Thanks in advance
Hey, oh, that's because your organism is non-diploid. Your genotypes (GTs) are non-standard. My script is for diploid organisms.
I could modify the script, or you could just try the BCFtools plugin that calculates AF and MAF: A: How to use bcftools to calculate AF INFO field from AC and AN in VCF?
No, my organism is human, and it is a multiallelic site. so, your script is fit for me, but how we can handle such sites? Thanks a lot for the link, I try it for AF and MAF.
Ah, I see. The way around that is to split these into separate variant calls with
bcftools norm -m-any
, and then my script will work.Actually, I run the norm command before running the script and as you can find below, the variants splited
However, the count in the second row is correct. however, we couldn't check this issue in the vcf file with thousands of variant sits. Have you any suggestions/advice for this issue, please.
What are the individual records in your VCF after you have split the multi-allelic calls?
Here, it's the original vcf file at multi-allelic site:
As you can find, we have not RefHom in the original vcf file.
it's normalized vcf file with individual records for multi-allelic site:
and I posted the output of your command on the normalized vcf file in my previous post.
Thank you very much for following the post.
Thanks for posting. So, my function has calculated it correctly:
No problem. I also try to get this information from gatk (variants to table function), and it returned me 2 AltHet and 1AltHom, like what we can see in the vcf file (original vcf file). Yes, your script is correct based on normalized vcf file. however, I'm confused with real (original) vcf file as homRef really isn't in this file? sorry, but it will be great if you please tell me how we can interpret this discrepancy and how we have homRef in the normalized vcf file?
Thanks
Homozygous reference is inferred by the
0/0
genotype callThanks Kevin, of course, I know it. so sorry, My question is: we have not any ref allele in the real (original) vcf file (all calls are as 1|2, 2|1 and 2/2). So, how it is generated in the normalized vcf file during splitting and counted by your script, what information used for calling this homRef in the normalized vcf file?
I'm totally confused as all genotype calls in normalized vcf file differs with the original, could you please kindly tell me if we can say these call (in normalized vcf file) actually exist in the sample?
sorry for further questions and many thanks for clearing me
If we look at the variant call before the normalisation:
This says that, in your 3 samples, there were 2 different variants called at this position:
It follows, then:
So, your sample genotypes are:
---------------------------------------------
Your concern, then, about the normalisation is quite valid and reflects a limitation of the VCF format. For example, for the first sample,
1|2
is split into1|0
and0|1
, which is not reflective of the actual genotype. This is partly why we sometimes just exclude these sites from analyses; indeed, many programs do not even support multi-allelic sites. There is another discussion here about it: https://gatkforums.broadinstitute.org/gatk/discussion/3148/how-to-deal-with-multiallelic-sites-in-the-vcfThank you, Kevin for your explanation and link. so, the multi-allelic site has a long story. However, as I mentioned the gatk (variants to table) correctly computed hetero and homo at these positions, however, this tool doesn't say which samples are homo and which are hetero, it may be a good idea to modify your script for multiallelic sites, I think.
Thank you. I may modify it later. Today I am working on CPU parallelisation and also have to go to the supermarket.
May I ask, seta, if you definitively require my help with this?, that is, are you waiting for me to modify my script?
Hi Kevin,
Many thanks for your kind message. I'm going to use GATK for this purpose, however, as I mentioned before, this tool doesn't tell us which samples are homo and which are hetero, unlike your script. So, I'm happy if you please let me know when you modify your script.
Kind regards,
Seta
Hi seta, you could try this modified script, if you wish. It is for mono- and bi-allelic records (if you have tri- or quad-allelic, it won't print anything):
Whilst I have tested this, you should also test it before you are sure that it functions correctly.
Hi Kevin,
Thanks for backing to me. I tested it on an example vcf file and it worked well on the vcf file without normalization.
Best,
Seta