Getting allele frequency and zygosity information from several vcf file
1
1
Entering edit mode
5.6 years ago
seta ★ 1.9k

Dear all,

I have multiple vcf file and want to extract allele frequency and the hetero- and homozygous call with their count. Could you please let me know if I should merge all vcf file to get one vcf file, and then extract the required information? If yes, I read CombineVariants from GATK do this job, but never use it, please kindly tell me is it OK for my task in your opinion?

For extracting the required information (allele frequency and the hetero- and homozygous call with their count), I found that convert2annovar.pl from annovar package help to get this information. However, when I tested on the example vcf file containing 3 for getting zygosity information by the below command:

perl convert2annovar.pl ex2.vcf --format vcf4 --allsample --withzyg --outfile file1.vcf

It gave me 3 output files corresponded to 3 samples with hetero and homo information, could you please kindly tell me how I can have all information in one file? Assuming multi-sample vcf file containing 100 samples, so 100 output files will be generated, how we could handle them?

Also, I tried vcftools to obtain the required information, but sounds that it give us just frequency, is any experience with vcftools for getting zygosity information? Any alternative tools and commands suggestion would be highly appreciated.

Thanks

vcf allele frequency zygocity • 4.5k views
ADD COMMENT
3
Entering edit mode
5.6 years ago

This may help, tested on 1000 Genomes Phase 3 variants (total sample n=2504):

bcftools view /Kev/CollegeWork/ReferenceMaterial/1000Genomes/ImputedGenotypes/1000Genomes.Norm.bcf | \
awk -F"\t" 'BEGIN {
    print "CHR\tPOS\tID\tREF\tALT\tAltHetCount\tAltHomCount\tRefHomCount"
}
!/^#/ {
  print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t" gsub(/0\|1|1\|0|0\/1|1\/0/,"") "\t" gsub(/1\/1|1\|1/,"") "\t" gsub(/0\/0|0\|0/,"")
}' | \
head -18

CHR POS     ID          REF ALT AltHetCount AltHomCount    RefHomCount
1   10177   rs367896724 A   AC  1490        320            694
1   10235   rs540431307 T   TA  6           0              2498
1   10352   rs555500075 T   TA  2025        83             396
1   10505   rs548419688 A   T   1           0              2503
1   10506   rs568405545 C   G   1           0              2503
1   10511   rs534229142 G   A   1           0              2503
1   10539   rs537182016 C   A   3           0              2501
1   10542   rs572818783 C   T   1           0              2503
1   10579   rs538322974 C   A   1           0              2503
1   10616   rs376342519 CCGCCGTTGCAAAGGCGCGCCG  C   35  2469    0
1   10642   rs558604819 G   A   21          0              2483
1   11008   rs575272151 C   G   403         19             2082
1   11012   rs544419019 C   G   403         19             2082
1   11063   rs561109771 T   G   15          0              2489
1   13011   rs574746232 T   G   3           0              2501
1   13110   rs540538026 G   A   134         0              2370
1   13116   rs62635286  T   G   414         36             2054
1   13118   rs200579949 A   G   414         36             2054

It counts AltHet as either of:

  • 0/1
  • 1/0
  • 1|0
  • 0|1

It counts AltHom as either of:

  • 1/1
  • 1|1

It counts RefHom as either of:

  • 0/0
  • 0|0

Other scripts:

Kevin

ADD COMMENT
0
Entering edit mode

Hi Kevin,

Thank you so much for your helpful response. First, I normalized vcf file by bcftools norm –m-any and tried your command on the very small vcf file containing just 3 samples and found a strange thing. One of rows in my vcf file is:

20  1110696 rs6040355   A   G,T 67  PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB   GT:GQ:DP:HQ 1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4

Your suggested command calculated the hetero and homo count as: H

R   POS ID  REF ALT AltHetCount AltHomCount RefHomCount 
20  1110696 rs6040355   A   G   2   0   1   
20  1110696 rs6040355   A   T   2   1   0

As it is obvious, there is 2 AltHet and 1 AltHom, but based on the above output of the command, 1 genotype was considered as RefHom, which is not correct. Could you please kindly tell me how we can interpret this issue?

For obtaining AF and MAF from vcf file, I tried several ways, but none of them useful. I’m so grateful if you have any command /tool to calculate AF and MAF, please kindly share with me.

Thanks in advance

ADD REPLY
0
Entering edit mode

Hey, oh, that's because your organism is non-diploid. Your genotypes (GTs) are non-standard. My script is for diploid organisms.

I could modify the script, or you could just try the BCFtools plugin that calculates AF and MAF: A: How to use bcftools to calculate AF INFO field from AC and AN in VCF?

ADD REPLY
0
Entering edit mode

No, my organism is human, and it is a multiallelic site. so, your script is fit for me, but how we can handle such sites? Thanks a lot for the link, I try it for AF and MAF.

ADD REPLY
0
Entering edit mode

Ah, I see. The way around that is to split these into separate variant calls with bcftools norm -m-any, and then my script will work.

ADD REPLY
0
Entering edit mode

Actually, I run the norm command before running the script and as you can find below, the variants splited

HR  POS ID  REF ALT AltHetCount AltHomCount RefHomCount
20  1110696 rs6040355   A   G   2   0   1
20  1110696 rs6040355   A   T   2   1   0

However, the count in the second row is correct. however, we couldn't check this issue in the vcf file with thousands of variant sits. Have you any suggestions/advice for this issue, please.

ADD REPLY
0
Entering edit mode

What are the individual records in your VCF after you have split the multi-allelic calls?

ADD REPLY
0
Entering edit mode

Here, it's the original vcf file at multi-allelic site:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
20  1110696 rs6040355   A   G,T 67  PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB   GT:GQ:DP:HQ 1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4

As you can find, we have not RefHom in the original vcf file.

it's normalized vcf file with individual records for multi-allelic site:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
20  1110696 rs6040355   A   G   67  PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB   GT:GQ:DP:HQ 1|0:21:6:23,27  0|1:2:0:18,2    0/0:35:4:.
20  1110696 rs6040355   A   T   67  PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB   GT:GQ:DP:HQ 0|1:21:6:23,27  1|0:2:0:18,2    1/1:35:4:.

and I posted the output of your command on the normalized vcf file in my previous post.

Thank you very much for following the post.

ADD REPLY
0
Entering edit mode

Thanks for posting. So, my function has calculated it correctly:

  • For A>G, there are 2 het variants, 0 hom variants, and 1 hom reference
  • For A>T, there are 2 het variants, 1 hom variant, and 0 hom reference
ADD REPLY
0
Entering edit mode

No problem. I also try to get this information from gatk (variants to table function), and it returned me 2 AltHet and 1AltHom, like what we can see in the vcf file (original vcf file). Yes, your script is correct based on normalized vcf file. however, I'm confused with real (original) vcf file as homRef really isn't in this file? sorry, but it will be great if you please tell me how we can interpret this discrepancy and how we have homRef in the normalized vcf file?

Thanks

ADD REPLY
0
Entering edit mode

Homozygous reference is inferred by the 0/0 genotype call

ADD REPLY
0
Entering edit mode

Thanks Kevin, of course, I know it. so sorry, My question is: we have not any ref allele in the real (original) vcf file (all calls are as 1|2, 2|1 and 2/2). So, how it is generated in the normalized vcf file during splitting and counted by your script, what information used for calling this homRef in the normalized vcf file?

I'm totally confused as all genotype calls in normalized vcf file differs with the original, could you please kindly tell me if we can say these call (in normalized vcf file) actually exist in the sample?

sorry for further questions and many thanks for clearing me

ADD REPLY
0
Entering edit mode

If we look at the variant call before the normalisation:

20  1110696 rs6040355   A   G,T 67  AF=0.333,0.667 1|2  2|1    2/2

This says that, in your 3 samples, there were 2 different variants called at this position:

  • A>G, with AF=0.333
  • A>T, with AF=0.667

It follows, then:

  • A>G is encoded as GT 1
  • A>T is encoded as GT 2

So, your sample genotypes are:

  • 1|2 = GT
  • 2|1 = TG
  • 2/2 = TT

---------------------------------------------

Your concern, then, about the normalisation is quite valid and reflects a limitation of the VCF format. For example, for the first sample, 1|2 is split into 1|0 and 0|1, which is not reflective of the actual genotype. This is partly why we sometimes just exclude these sites from analyses; indeed, many programs do not even support multi-allelic sites. There is another discussion here about it: https://gatkforums.broadinstitute.org/gatk/discussion/3148/how-to-deal-with-multiallelic-sites-in-the-vcf

ADD REPLY
0
Entering edit mode

Thank you, Kevin for your explanation and link. so, the multi-allelic site has a long story. However, as I mentioned the gatk (variants to table) correctly computed hetero and homo at these positions, however, this tool doesn't say which samples are homo and which are hetero, it may be a good idea to modify your script for multiallelic sites, I think.

ADD REPLY
0
Entering edit mode

Thank you. I may modify it later. Today I am working on CPU parallelisation and also have to go to the supermarket.

ADD REPLY
0
Entering edit mode

May I ask, seta, if you definitively require my help with this?, that is, are you waiting for me to modify my script?

ADD REPLY
0
Entering edit mode

Hi Kevin,

Many thanks for your kind message. I'm going to use GATK for this purpose, however, as I mentioned before, this tool doesn't tell us which samples are homo and which are hetero, unlike your script. So, I'm happy if you please let me know when you modify your script.

Kind regards,

Seta

ADD REPLY
0
Entering edit mode

Hi seta, you could try this modified script, if you wish. It is for mono- and bi-allelic records (if you have tri- or quad-allelic, it won't print anything):

bcftools view test.vcf.gz | \
awk -F"\t" '{line=$0} BEGIN {
        print "CHR\tPOS\tID\tREF\tALT\tAltHetCount\tAltHomCount\tRefHomCount"
    } !/^#/ {
        if (gsub(/,/, ",", $5)==0) {
            print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t" gsub(/0\|1|1\|0|0\/1|1\/0/,"") "\t" gsub(/1\/1|1\|1/,"") "\t" gsub(/0\/0|0\|0/,"")
        } else if (gsub(/,/, ",", $5)==1) {
            print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t" gsub(/1\/0|0\/1|1\|0|0\|1|1\|2|2\|1|1\/2|2\/1/,"")","gsub(/2\/0|0\/2|2\|0|0\|2|1\|2|2\|1|1\/2|2\/1/,"",line) "\t" gsub(/1\/1|1\|1/,"")","gsub(/2\/2|2\|2/,"") "\t" gsub(/0\/0|0\|0/,"")
        }
    }'

CHR POS         ID          REF ALT  AltHetCount    AltHomCount RefHomCount
16  56380495    rs78560610  G   A    1              0           10
16  56380502    rs3811361   G   A    3              0           8
16  56380570    rs3811360   A   G    6              0           5
16  56381544    rs11297695  CT  C    6              2           3
16  56381677    .           A   C    1              0           10
16  56381699    rs11640263  C   T    7              2           2
16  56388821    rs148650019 A   G    1              0           10
16  56389119    rs143245214 T   TTA  3              3           5
16  56389827    rs5817056   CA  C    5              1           5
16  56390685    .           CTT C,CT 4,9            0,0         1
16  56390968    rs34097380  GA  G    4              2           5
16  56396430    rs3748401   C   A    1              0           10
16  56396486    rs4924      C   T    5              4           2
16  56397153    rs2617846   G   A    2              5           3
16  56397174    rs2617847   C   G    2              4           3
16  56921684    .           T   TA,A 3,8            1,1         1

Whilst I have tested this, you should also test it before you are sure that it functions correctly.

ADD REPLY
2
Entering edit mode

Hi Kevin,

Thanks for backing to me. I tested it on an example vcf file and it worked well on the vcf file without normalization.

Best,

Seta

ADD REPLY

Login before adding your answer.

Traffic: 2329 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6