How To Find The Most Frequent Allel Snp In Vcf File
1
1
Entering edit mode
10.2 years ago
Chen Sun ★ 1.1k

When I use samtools mpileup, there are usually several allel SNP in ALT column according to REF column. just as follows:

##fileformat=VCFv4.1
##samtoolsVersion=0.1.19-44428cd
##reference=file:///gpfs/home/cxs1031/backup/ref/chr22.fa
##contig=<ID=chr22,length=51304566>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=IS,Number=2,Type=Float,Description="Maximum number of reads supporting an indel and fraction of indel reads">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##INFO=<ID=QBD,Number=1,Type=Float,Description="Quality by Depth: QUAL/#reads">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Read Position Bias">
##INFO=<ID=MDV,Number=1,Type=Integer,Description="Maximum number of high-quality nonRef reads in samples">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data. Note: this version may be broken.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality non-reference bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    chr22_10x_sort.bam
chr22    16159158    .    A    G,C    123    .    DP=24;VDB=1.728111e-01;AF1=1;AC1=2;DP4=0,0,12,12;MQ=49;FQ=-96    GT:PL:DP:GQ    1/1:156,69,0,154,54,151:24:99

In the last row, A has 2 alts: G and C, how can I fingure out which alt is most frequent one?

vcf snp • 5.3k views
ADD COMMENT
0
Entering edit mode

samtools mpileup generates phased genotypes ('1|2') ? how did you get this ? Which version of samtools are you using ?

ADD REPLY
0
Entering edit mode

Last time, I use the sample data in vcf user guider: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 , this time, I paste my own result.

ADD REPLY
0
Entering edit mode

And my running command is : samtools mpileup -uDf ref.fa xxx.bam> xxx.bcf; bcftools view -gv xxx.bcf >xxx.vcf

ADD REPLY
1
Entering edit mode
10.2 years ago

The allele frequency is in your INFO field. For the variant in line 3 (20 1110696 rs6040355 A G,T), there are frequencies of 0.333 and 0.6667 for the two alleles. I'm not sure why samtools doesn't provide that for the last line as well - perhaps it doesn't calculate this number for insertions and deletions?

ADD COMMENT

Login before adding your answer.

Traffic: 2223 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6