VCF file PL values for more than one alternative allele
7.0 years ago
bharata1803 ▴ 560

Hello,

In a VCF file, there is a GT:PL column for the genotype and its likelihood values. If 2 alleles are possible (the reference allele and one alternative allele), the column value looks like this:

0/1:56,0,80

The score 56 corresponds to homozygous reference, 0 to heterozygous, and 80 to homozygous alternative.

My question is, if there are more than 2 alleles (let's say 0 for the reference and 1, 2 for the alternate alleles), the PL field will consist of 6 scores, which correspond to:

  1. reference homozygous (0/0)
  2. alt 1 homozygous (1/1)
  3. alt 2 homozygous (2/2)
  4. ref and alt 1 heterozygous (0/1)
  5. ref and alt 2 heterozygous (0/2)
  6. alt 1 and alt 2 heterozygous (1/2)

My question is what is the order in the actual VCF file? I just don't know which score corresponds to which genotype. Below is an actual line from my VCF data.

1 226548932 . ACGGCGGCGGCGGCGGCGGCGGTGGCGGCGGCGG ACGGCGGCGGCGGTGGCGGCGGCGG,ACGGCGGCGGCGGCGGCGGTGGCGGCGGCGG 39.049 . INDEL;IDV=1;IMF=1;DP=9;VDB=0.0225004;SGB=-1.15236;MQSB=0.900802;MQ0F=0;ICB=0.153846;HOB=0.0555556;AC=1,1;AN=12;DP4=4,2,1,1;MQ=60 GT:PL ./.:0,0,0,0,0,0 0/0:0,3,60,3,60,60 0/0:0,3,60,3,60,60 ./.:0,0,0,0,0,0 0/1:60,3,0,60,3,60 0/0:0,3,60,3,60,60 0/0:0,3,60,3,60,60 0/2:50,56,132,0,81,78

Look at the GT/PL list below (I have 8 samples):

  1. Sample 1 : ./.:0,0,0,0,0,0
  2. Sample 2 : 0/0:0,3,60,3,60,60
  3. Sample 3 : 0/0:0,3,60,3,60,60
  4. Sample 4 : ./.:0,0,0,0,0,0
  5. Sample 5 : 0/1:60,3,0,60,3,60
  6. Sample 6 : 0/0:0,3,60,3,60,60
  7. Sample 7 : 0/0:0,3,60,3,60,60
  8. Sample 8 : 0/2:50,56,132,0,81,78

Here are some more interesting results:

  1. Sample 1: 1/1:26,12,9,26,12,26
  2. Sample 2: 0/1:0,3,5,3,5,5
  3. Sample 3: 1/1:26,12,9,26,12,26
  4. Sample 4: 1/2:45,45,45,6,6,0
  5. Sample 5: 1/1:20,3,0,20,3,20
  6. Sample 6: ./.:0,0,0,0,0,0
  7. Sample 7: ./.:0,0,0,0,0,0
  8. Sample 8: 1/1:26,12,9,26,12,26

So, if anyone knows how to interpret the score, please teach me and if it is possible, maybe you can explain the general concept. I tried reading the VCF documentation, but I don't think it is written there.

7.0 years ago

My question is what is the order in the actual VCF file?

This information is in the VCF specification, though it is not easy to find. See Section 1.4.2:

PL : the phred-scaled genotype likelihoods rounded to the closest integer (and otherwise defined precisely as the GL field) (Integers)

GL : genotype likelihoods comprised of comma separated floating point log10-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields. In presence of the GT field the same ploidy is expected and the canonical order is used; without GT field, diploidy is assumed. If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc. For example: GT:GL 0/1:-323.03,-99.29,-802.53 (Floats)

So the order in PL is the same as in GL: AA,AB,BB,AC,BC,CC for tri-allelic sites. With your allele numbering, that is 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.
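To make the ordering concrete, here is a minimal Python sketch of the formula F(j/k) = (k*(k+1)/2)+j from the spec. The function names `pl_index` and `genotype_order` are just illustrative names of mine, not part of any library, and the sketch assumes diploid genotypes. It labels the PL values of your Sample 8 with their genotypes.

```python
def pl_index(j, k):
    """Position of diploid genotype j/k in the PL/GL list, per the VCF spec
    formula F(j/k) = k*(k+1)/2 + j (with j <= k)."""
    j, k = sorted((j, k))
    return k * (k + 1) // 2 + j


def genotype_order(n_alleles):
    """All diploid genotypes in canonical PL/GL order for n_alleles alleles
    (REF counts as allele 0)."""
    return [f"{j}/{k}" for k in range(n_alleles) for j in range(k + 1)]


# Sample 8 from the question: GT 0/2, PL 50,56,132,0,81,78 (REF + 2 ALT = 3 alleles)
pl = [50, 56, 132, 0, 81, 78]
labelled = dict(zip(genotype_order(3), pl))

for genotype, score in labelled.items():
    print(genotype, score)
```

For this sample the PL of 0 (the most likely genotype) lands on 0/2, which is the genotype that was called.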

So, if anyone knows how to interpret the score, please teach me and if it is possible, maybe you can explain the general concept.

This concept is explained very well in the following GATK document:

http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it

If you understand the Phred scale, it should be easy to follow. If you have any difficulty, let us know.
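As a quick illustration of the Phred scale: a PL value is -10 * log10 of the genotype likelihood, shifted so that the most likely genotype gets 0. Each 10 points therefore means a 10-fold drop in likelihood relative to the best genotype. The sketch below converts PLs back to relative likelihoods. The helper names are hypothetical, and rescaling the values to sum to 1 assumes a flat prior over genotypes, which is my simplification and not necessarily what your caller does.

```python
def pl_to_relative_likelihoods(pl):
    """Convert Phred-scaled PLs to likelihoods relative to the best genotype
    (PL 0 -> 1.0, PL 10 -> 0.1, PL 20 -> 0.01, ...)."""
    return [10 ** (-p / 10) for p in pl]


def normalise(values):
    """Rescale values so they sum to 1 (assumes a flat prior over genotypes)."""
    total = sum(values)
    return [v / total for v in values]


# Sample 8 from the question, in the order 0/0, 0/1, 1/1, 0/2, 1/2, 2/2
pl = [50, 56, 132, 0, 81, 78]
rel = pl_to_relative_likelihoods(pl)
probs = normalise(rel)

for genotype, r, p in zip(["0/0", "0/1", "1/1", "0/2", "1/2", "2/2"], rel, probs):
    print(f"{genotype}\trelative likelihood {r:.3g}\tnormalised {p:.6f}")
```

So a PL of 50 for 0/0 means that genotype is about 100,000 times less likely than 0/2 given the reads.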
