VCF file PL values for more than one alternative allele
14 months ago
bharata1803 • 420
Japan

Hello,

In a VCF file, there is a GT:PL column for the genotype and its likelihood values. If two alleles are possible (the reference allele and one alternative allele), the column value looks like this:

0/1:56,0,80

The score 56 corresponds to homozygous reference, 0 to heterozygous, and 80 to homozygous alternate.

My question is: if there are more than two alleles (say 0 for the reference and 1, 2 for the alternate alleles), there will be six scores, corresponding to:

  1. reference homozygous (0/0)
  2. alt 1 homozygous (1/1)
  3. alt 2 homozygous (2/2)
  4. ref and alt 1 heterozygous (0/1)
  5. ref and alt 2 heterozygous (0/2)
  6. alt 1 and alt 2 heterozygous (1/2)

What is the order in the actual VCF file? I don't know the order of the scores or which genotype each one corresponds to. Below is an actual line from my VCF data.

1 226548932 . ACGGCGGCGGCGGCGGCGGCGGTGGCGGCGGCGG ACGGCGGCGGCGGTGGCGGCGGCGG,ACGGCGGCGGCGGCGGCGGTGGCGGCGGCGG 39.049 . INDEL;IDV=1;IMF=1;DP=9;VDB=0.0225004;SGB=-1.15236;MQSB=0.900802;MQ0F=0;ICB=0.153846;HOB=0.0555556;AC=1,1;AN=12;DP4=4,2,1,1;MQ=60 GT:PL ./.:0,0,0,0,0,0 0/0:0,3,60,3,60,60 0/0:0,3,60,3,60,60 ./.:0,0,0,0,0,0 0/1:60,3,0,60,3,60 0/0:0,3,60,3,60,60 0/0:0,3,60,3,60,60 0/2:50,56,132,0,81,78

Here are the GT:PL values for each sample (I have 8 samples):

  1. Sample 1 : ./.:0,0,0,0,0,0
  2. Sample 2 : 0/0:0,3,60,3,60,60
  3. Sample 3 : 0/0:0,3,60,3,60,60
  4. Sample 4 : ./.:0,0,0,0,0,0
  5. Sample 5 : 0/1:60,3,0,60,3,60
  6. Sample 6 : 0/0:0,3,60,3,60,60
  7. Sample 7 : 0/0:0,3,60,3,60,60
  8. Sample 8 : 0/2:50,56,132,0,81,78

Here is another, more interesting result:

  1. Sample 1: 1/1:26,12,9,26,12,26
  2. Sample 2: 0/1:0,3,5,3,5,5
  3. Sample 3: 1/1:26,12,9,26,12,26
  4. Sample 4: 1/2:45,45,45,6,6,0
  5. Sample 5: 1/1:20,3,0,20,3,20
  6. Sample 6: ./.:0,0,0,0,0,0
  7. Sample 7: ./.:0,0,0,0,0,0
  8. Sample 8: 1/1:26,12,9,26,12,26

So, if anyone knows how to interpret the scores, please teach me, and if possible, maybe you can explain the general concept. I tried reading the VCF documentation, but I don't think it is written there.

SNP indel vcf • 3.2k views
14 months ago

My question is what is the order in the actual VCF file?

This information is in the VCF specification, although it is not easy to find. See Section 1.4.2:

PL : the phred-scaled genotype likelihoods rounded to the closest integer (and otherwise defined precisely as the GL field) (Integers)

GL : genotype likelihoods comprised of comma separated floating point log10-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields. In presence of the GT field the same ploidy is expected and the canonical order is used; without GT field, diploidy is assumed. If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc. For example: GT:GL 0/1:-323.03,-99.29,-802.53 (Floats)

So the order in PL is the same as in GL. For tri-allelic sites it is AA,AB,BB,AC,BC,CC, which in GT notation is 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.
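
To check the ordering formula quoted from the specification, here is a minimal Python sketch. The function names (genotype_index, genotype_order, label_pl) are made up for illustration and are not part of any VCF library. It assumes diploid genotypes and a sample field that contains only GT:PL, as in the question.

```python
def genotype_index(j, k):
    """Position of diploid genotype j/k in PL/GL, using F(j/k) = k*(k+1)/2 + j."""
    if j > k:
        # The formula expects j <= k, so 2/1 is treated as 1/2.
        j, k = k, j
    return k * (k + 1) // 2 + j


def genotype_order(n_alleles):
    """All diploid genotypes in canonical order; n_alleles counts REF plus ALTs."""
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]


def label_pl(sample_field, n_alleles):
    """Pair each PL value of a 'GT:PL' sample field with its genotype."""
    _gt, pl = sample_field.split(":")
    values = [int(v) for v in pl.split(",")]
    labels = [f"{j}/{k}" for j, k in genotype_order(n_alleles)]
    return dict(zip(labels, values))


# Sample 8 from the question: one REF allele and two ALT alleles.
for genotype, score in label_pl("0/2:50,56,132,0,81,78", 3).items():
    print(genotype, score)
```

For sample 8 this pairs the PL value 0 with genotype 0/2, which agrees with the GT that was called for that sample.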

So, if anyone knows how to interpret the score, please teach me and if it is possible, maybe you can explain the general concept.

This concept is explained very well in the following GATK document.

http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it

If you understand the Phred scale, it should be easy to follow. If you have any difficulty, let us know.
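
As a rough illustration of the Phred scale, a PL value corresponds to a likelihood of 10^(-PL/10) relative to the most likely genotype, which has PL 0. The sketch below converts a list of PL values back to relative likelihoods. The function name pl_to_relative_likelihoods is hypothetical, and rescaling the values to sum to 1 is only a convenience for reading them, not something the VCF specification defines.

```python
def pl_to_relative_likelihoods(pls):
    """Convert Phred-scaled PLs to likelihoods, rescaled to sum to 1.

    The rescaling is an illustrative choice, not part of the VCF specification.
    """
    raw = [10 ** (-pl / 10) for pl in pls]
    total = sum(raw)
    return [value / total for value in raw]


# Biallelic example from the question, in the order 0/0, 0/1, 1/1.
for genotype, share in zip(["0/0", "0/1", "1/1"], pl_to_relative_likelihoods([56, 0, 80])):
    print(genotype, share)
```

A PL of 0 marks the most likely genotype, and every extra 10 units makes a genotype ten times less likely. A sample whose PL values are all 0, like the ./. samples in the question, gives equal weight to every genotype.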
