Extract allele frequencies from VCF
1
0
Entering edit mode
7.6 years ago
mufernando ▴ 10

I am trying to do the variant calling on pooled samples. For that, I am using CRISP. I finished the variant calling and obtained a VCF with the following structure:

##fileformat=VCFv4.0
##fileDate=Fri Sep  9 10:20:58 2016
##source=CRISP_V0.1
##reference=dmel-all-chromosome-r5.39.fasta
##options: poolsize=2,bamfiles=14,qvoffset=33,bedfile=None,min_base_quality=13
##INFO=<ID=NP,Number=1,Type=Integer,Description="Number of Pools With Data">
##INFO=<ID=DP,Number=2,Type=Integer,Description="Total number of reads (+strand,-strand) across all pools (filtered reads only)">
##INFO=<ID=CT,Number=.,Type=Float,Description="contingency table p-value for each variant allele in same order as listed in column 5">
##INFO=<ID=VP,Number=.,Type=Integer,Description="Number of Pools with variant allele(s)">
##INFO=<ID=HP,Number=.,Type=Integer,Description="Ambiguity in positioning of indel (homopolymer run length or microsatellite length)">
##INFO=<ID=MQ,Number=.,Type=Integer,Description="# of reads with mapping qualities 0-9,10-19,20-39,40-255">
##FILTER=<ID=StrandBias,Description="strand bias in distribution of reads between reference and variant alleles on two strands">
##FILTER=<ID=LowDepth,Description="low average coverage: less than 1 (filtered) read per haplotype across all samples ">
##FILTER=<ID=LowMQ20,Description=" >20 percent of reads have mapping quality score less than 20">
##FILTER=<ID=LowMQ10,Description=" >10 percent of reads have mapping quality score less than 20">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MLAC,Number=.,Type=Integer,Description="Maximum likelihood estimate for the allele counts for the ALT allele(s), in the same order as listed in column 5">
##FORMAT=<ID=AF,Number=.,Type=Float,Description="variant allele frequency in pool, one value for each variant allele listed in column 5">
##FORMAT=<ID=ADf,Number=.,Type=Integer,Description="Number of reads aligned to the forward strand of the genome supporting reference allele and the alternate alleles in the order listed">
##FORMAT=<ID=ADr,Number=.,Type=Integer,Description="Number of reads aligned to the reverse strand of the genome supporting reference allele and the alternate alleles in the order listed">
##FORMAT=<ID=ADb,Number=.,Type=Integer,Description="Number of overlapping paired-end reads (read from both strands) supporting reference allele and the alternate alleles in the order listed">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SRR1525694  SRR1525697  SRR1525768  SRR1525771  SRR1525774  SRR1525695  SRR1525698  SRR1525769  SRR1525772  SRR1525685  SRR1525696  SRR1525699  SRR1525770  SRR1525773
YHet    49  .   C   G   4074    LowDepth    NP=14;DP=67,43,1;VT=SNV;CT=-0.0;VP=11;VF=EMpass;AC=1790;AF=0.99940;EMstats=407.46:-154.93;HWEstats=-0.0;MQS=0,2,83,29;FLANKSEQ=cattcatgtt:C:ccgctcttgc  MLAC:GQ:DP:ADf:ADr  .:12:1:0,1:0,0  .:9:8:0,4:0,4   .:13:20:0,10:0,10   .:14:7:0,2:0,4  .:12:5:0,4:0,1  .:13:15:0,9:0,5 .:10:8:0,5:0,3  .:11:29:0,18:0,11   .:10:2:0,2:0,0  .:13:5:0,1:0,3  .:13:7:0,5:0,1  .:10:0:0,0:0,0  .:9:4:0,4:0,0   .:13:3:0,2:0,1  
YHet    50  .   C   T   4216    LowDepth    NP=14;DP=69,43,1;VT=SNV;CT=-0.0;VP=11;VF=EMpass;AC=1790;AF=0.99945;EMstats=421.67:-162.93;HWEstats=-0.0;MQS=0,2,83,30;FLANKSEQ=attcatgttc:C:cgctcttgct  MLAC:GQ:DP:ADf:ADr  .:13:1:0,1:0,0  .:10:8:0,4:0,4  .:13:20:0,10:0,10   .:15:7:0,2:0,4  .:13:5:0,4:0,1  .:13:15:0,10:0,5    .:10:8:0,5:0,3  .:12:30:0,19:0,11   .:11:2:0,2:0,0  .:14:5:0,1:0,3  .:13:7:0,5:0,1  .:11:0:0,0:0,0  .:9:4:0,4:0,0   .:13:3:0,2:0,1  
YHet    51  .   C   T   4248    LowDepth    NP=14;DP=68,45,1;VT=SNV;CT=-0.0;VP=12;VF=EMpass;AC=1790;AF=0.99890;EMstats=424.84:-166.56;HWEstats=-0.0;MQS=0,2,86,30;FLANKSEQ=ttcatgttcc:C:gctcttgctt  MLAC:GQ:DP:ADf:ADr  .:10:1:0,1:0,0  .:7:8:0,4:0,4   .:10:20:0,10:0,10   .:12:7:0,2:0,4  .:10:5:0,4:0,1  .:10:15:0,9:0,5 .:7:8:0,5:0,3   .:9:31:0,19:0,11    .:8:3:0,2:0,1   .:11:6:0,1:0,4  .:10:7:0,5:0,1  .:8:0:0,0:0,0   .:6:4:0,4:0,0   .:10:3:0,2:0,1  
YHet    195 .   T   A   12816   LowDepth    NP=14;DP=143,194,5;VT=SNV;CT=-0.6;VP=13;VF=EMpass;AC=1789;AF=0.99645;EMstats=1281.69:-20.31;HWEstats=-0.0;MQS=0,0,203,143;FLANKSEQ=ccatttccta:T:taagagtaat  MLAC:GQ:DP:ADf:ADr:ADb  .:5:4:1,0:0,3:0,0   .:2:14:0,5:0,7:0,0  .:6:62:0,24:0,37:0,0    .:8:23:0,7:0,15:0,1 .:5:10:0,1:0,9:0,0  .:6:44:0,20:0,24:0,0    .:3:16:0,8:0,7:0,1  .:6:96:0,38:0,58:0,0    .:3:3:0,2:0,1:0,0   .:6:14:0,12:0,2:0,0 .:6:23:0,6:0,16:0,1 .:3:4:0,4:0,0:0,0   .:1:12:0,5:0,6:0,0  .:6:21:0,10:0,9:0,2 
YHet    196 .   T   A   12627   LowDepth    NP=14;DP=143,189,4;VT=SNV;CT=-0.0;VP=14;VF=EMpass;AC=1790;AF=0.99992;EMstats=1262.76:-548.38;HWEstats=-0.0;MQS=0,1,198,143;FLANKSEQ=catttcctat:T:aagagtaatt MLAC:GQ:DP:ADf:ADr:ADb  .:22:4:0,1:0,3:0,0  .:18:14:0,5:0,7:0,0 .:23:62:0,24:0,36:0,0   .:24:23:0,7:0,15:0,1    .:22:10:0,1:0,9:0,0 .:23:43:0,20:0,23:0,0   .:19:17:0,8:0,7:0,1 .:22:94:0,38:0,56:0,0   .:19:3:0,2:0,1:0,0  .:23:14:0,12:0,2:0,0    .:22:22:0,6:0,15:0,1    .:20:4:0,4:0,0:0,0  .:18:12:0,5:0,6:0,0 .:22:20:0,10:0,9:0,1    
YHet    1544    .   A   T   8596    LowDepth    NP=14;DP=85,148,3;VT=SNV;CT=-0.9;VP=13;VF=EMpass;AC=1769;AF=0.98609;EMstats=859.60:-328.72;HWEstats=-0.0;MQS=0,4,84,156;FLANKSEQ=ttcaaatgat:A:aaaaaaaaca    MLAC:GQ:DP:ADf:ADr:ADb  .:1:11:0,6:0,4:0,0  .:1:11:0,4:0,5:0,0  .:1:40:0,9:4,25:0,1 .:1:9:0,3:0,6:0,0   .:1:13:0,4:0,9:0,0  .:1:25:0,9:0,16:0,0 .:0:19:0,11:0,7:0,0 .:2:57:0,15:0,42:0,0    .:0:3:0,0:0,3:0,0   .:0:10:1,1:0,7:0,0  .:1:15:0,8:0,7:0,0  .:0:7:0,4:0,3:0,0   .:0:8:0,2:0,5:0,1   .:1:16:0,8:0,5:0,1  
YHet    1835    .   C   A   8903    LowDepth    NP=14;DP=135,103,0;VT=SNV;CT=-0.1;VP=13;VF=EMpass;AC=1790;AF=0.99929;EMstats=890.31:-386.89;HWEstats=-0.0;MQS=0,5,181,63;FLANKSEQ=gaagagtaca:C:cagcatatcc   MLAC:GQ:DP:ADf:ADr  .:12:9:0,5:0,3  .:9:14:0,6:0,4  .:12:34:0,19:0,15   .:14:14:0,14:0,0    .:12:7:0,2:0,5  .:12:21:0,13:0,8    .:9:14:0,5:0,8  .:12:63:0,29:0,33   .:10:3:0,3:0,0  .:13:9:0,5:0,4  .:13:27:0,15:0,11   .:10:4:0,1:0,2  .:8:7:0,5:0,2   .:13:23:0,13:0,8    
YHet    1844    .   C   A   8752    LowDepth    NP=14;DP=141,100,0;VT=SNV;CT=-0.0;VP=14;VF=EMpass;AC=1790;AF=0.99990;EMstats=875.25:-343.76;HWEstats=-0.0;MQS=0,3,162,85;FLANKSEQ=accagcatat:C:ccgtgagtgg   MLAC:GQ:DP:ADf:ADr  .:21:9:0,5:0,4  .:17:10:0,5:0,2 .:21:35:0,22:0,13   .:23:13:0,12:0,0    .:20:6:0,2:0,4  .:21:21:0,13:0,8    .:18:14:0,4:0,8 .:20:70:0,34:0,35   .:18:4:0,3:0,1  .:22:9:0,6:0,3  .:21:28:0,16:0,11   .:18:3:0,1:0,2  .:17:6:0,4:0,2  .:21:22:0,14:0,7

Now I am going to write a script to transform this into a VCF with allele frequencies (because I haven't found one that does that), but I am not sure what I should use for that.

Should I sum the ADf and ADr of one of the alleles and divide by DP? What should I do when there are reads overlapping (ADb)?

Thanks!

variant-calling crisp vcf • 3.4k views
ADD COMMENT
3
Entering edit mode
7.6 years ago
Noushin N ▴ 600

Assuming that you are looking for alternate allele frequency in each of the samples represented in this VCF file (and not the cohort level frequency of alleles), you would like to extract the number of reads supporting alternate allele and divide by depth of coverage.

Based on the VCF header:

DP="Total number of reads (+strand,-strand) across all pools (filtered reads only)>"

ADf="Number of reads aligned to the forward strand of the genome supporting reference allele and the alternate alleles in the order listed"

AD="Number of reads aligned to the reverse strand of the genome supporting reference allele and the alternate alleles in the order listed"

Example entry: MLAC:GQ:DP:ADf:ADr .:10:8:0,4:0,4

DP: 8

ADf: 0,4 --> 0 ref and 4 alt allele supporting reads mapped to forward strand

ADr: 0,4 --> 0 ref and 4 alt allele supporting reads mapped to reverse strand

Total number of reads supporting alternate allele: 8 Alternate allele frequency: 1.0 (8/8)

ADD COMMENT
0
Entering edit mode

Thanks for your answer! What about ADb?

ADD REPLY

Login before adding your answer.

Traffic: 1988 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6