Hi,
I have some questions in the output of mpileup2snp and mpileup2indel. For substitutions obtained from mpileup2snp when viewed in VCF format, whether they have genotype 0/1 (HET=1) or genotype 1/1 (HOM=1) all of them have only allele mentioned in ALT column. When I pulled out the %age distribution of bases in there subsitution sites they do not correspond the the cutoff frequency of 0.75 to be called as homozygote. For example, in the following variant:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
Sample1 chr7 142460462 . C T . PASS ADP=6;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:2:33:6:1:5
To obtain the above variant the --min-freq-for-hom value taken was default (i.e. 0.75). When I looked at the allele distribution for the above putative variant position using samtools mpileup, it came out to be the following:
%A %T %G %C %Ins %Del
0 75.75757576 0 24.24242424 0 0
Here %T is > 75% so it should have been called Homozygote but VarScan calls it Heterozygote mentions only one of the ALT allele, instead of both. The mpileup output for this position is:
chr7 142460462 N 33
tttttt$ttccctTT+2ACttctttctttctttctctt IIIIIIIIH9GIIIIIGIIIHICIHIIIHIGII
Another example is:
chr7_gl000195_random 1 . G A . PASS
ADP=5;WT=0;HET=1;HOM=0;NC=0
GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR
0/1:2:5:5:0:5:50%:3.9683E-3:0:40:0:0:5:0
Base % distribution is:
%A %T %G %C %Ins %Del
100 0 0 0 0 0
and mpileup is:
chr7_gl000195_random 1 N 5
^vA+2GA^~A+2GA^~A+4AAGA^bA+15CAGATGTGACAAAGA^~A+2GA IIIII
which clearly shows homozygotic substitution, unlike what is shown in VarScan output (HET=1; 0/1 genotype)
Regarding Indels:
VarScan VCF format output:
chr1 121107133 . T +T . PASS
ADP=4;WT=0;HET=0;HOM=1;NC=0
GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR
1/1:1:483:4:0:4:100%:1.4286E-2:0:33:0:0:0:4
%base distribution:
%A %T %G %C %Ins %Del
0 0 0 99.18864097 0.811359026 0
the FREQ value in the above VarScan output is 100% which contradicts the %base distribution (%Ins = 0.811359026). The value of --min-var-freq that I took for the above run is 0.05 and yet it not filtered from the output. The mpileup for the above is:
chr1 121107133 N 483
T$TTtTTTTTTTTTTTTTTTTTTTTTTtTTTTTTTTTTTTTTTTtTtTTTttttTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTtTTTTTTtttttTTTTttttTTTTTTTTtttttTTTTTTTTTTTTTTTTTTTTTTTTTTTttTTTtttTTTTTtttttttttttttTTTTTTTttTTTTTTTTttTTTTTTTTTTTTTTTttTTTTTTTTTTTttttTTTTTTTTTtttTTtTTTTTTTTTTTTTtTTTTTttttTTTTTTTTTttttTTTTTTtttttTTTTttttttttTttttTTtTTTTTTTTTTTTTTTTTttttTTTTTTTTTTTTTTTTTtttttttttTTTTTTTTTtttttttTTTTTTTTTTTTtTTTttttttTTTTTTTTTtttttttTTTTTTTTTtttttttttTTTTTTTTTttttTTTTttttttTTTTttttttttt^!T^!T^!T^!T^!T^!T^!T^!T^!T^!T^!T^!T^!T^!T^!T^!t^!t^!t^!t+1t^!t+1t^!t+1t^!t^!t+1t
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIBIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIBIIIIIIIIIIIDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIICIIIIIIIIIIIIIIIIIIIIIIII
chr1 121107134 N 489
C$CcCCCCCCCCCCCCCCCCCCCCCCcCCCCCCCCCCCCCCCCcCcC$CC$ccccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCcCCCCCCcccccCCCCccccCCCCCCCCcccccCCCCCCCCCCCCCCCCCCCCCCCCCCCccCCCcccCCCCCcccccccccccccCCCCCCCccCCCCCCCCccCCCCCCCCCCCCCCCccCCCCCCCCCCCccccCCCCCCCCCcccCCcCCCCCCCCCCCCCcCCCCCccccCCCCCCCCCccccCCCCCCcccccCCCCccccccccCccccCCcCCCCCCCCCCCCCCCCCccccCCCCCCCCCCCCCCCCCcccccccccCCCCCCCCCcccccccCCCCCCCCCCCCcCCCccccccCCCCCCCCCcccccccCCCCCCCCCcccccccccCCCCCCCCCccccCCCCccccccCCCCcccccccccCCCCCCCCCCCCCCCcccccccc^!C^!C^!C^!C^!c^!c^!c
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII<IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII-IIIIIIIBIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
chr1 121107135 N 489
A$a$AAAAAAAAAAAAAAAAAAAAAAaAAAAAAAAAAAAAAAAaAaAaaaaA$A$AA$A$A$AA$AA$A$A$AAAA$A$A$AA$A$AAAAA$A$A$AA$A$A$A$aAAAAAAaaaaaAAAAaaaaAAAAAAAAaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAaaaAAAAAaaaaaaaaaaaaaAAAAAAAaaAAAAAAAAaaAAAAAAAAAAAAAAAaaAAAAAAAAAAAaaaaAAAAAAAAAaaaAAaAAAAAAAAAAAAAaAAAAAaaaaAAAAAAAAAaaaaAAAAAAaaaaaAAAAaaaaaaaaAaaaaAAaAAAAAAAAAAAAAAAAAaaaaAAAAAAAAAAAAAAAAAaaaaaaaaaAAAAAAAAAaaaaaaaAAAAAAAAAAAAaAAAaaaaaaAAAAAAAAAaaaaaaaAAAAAAAAAaaaaaaaaaAAAAAAAAAaaaaAAAAaaaaaaAAAAaaaaaaaaaAAAAAAAAAAAAAAAaaaaaaaaAAAAaaa^!A^!A^!a
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIEIIII(EIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII@IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIICIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIICICIIIIIIIIIIIIIIIAICIIIIIIIIIIIBII/
Why there is discrepancy in the above cases? For %Ins (Insertion) and %Del (Deletion) I calculated for 1 base after the InDel position showed in VCF format because in standard VCF format In-Dels are shown in 1 base before the insertion.
I took the following parameter values to run VarScan mpileup2snp and mpileup2ins:
samtools mpileup -f ref bamfile.bam > output_mpileup
java -Xmx6g -Djava.io.tmpdir=temp -jar ~/VarScan_2_3_2/VarScan.v2.3.2.jar
mpileup2snp output_mpileup \
--min-coverage 4 \
--min-reads2 4 \
--min-avg-qual 20 \
--min-var-freq 0.05 \
--p-value 0.05 \
--output-vcf 1 > output_varscan_snp.vcf
java -Xmx6g -Djava.io.tmpdir=temp -jar ~/VarScan_2_3_2/VarScan.v2.3.2.jar
mpileup2indel output_mpileup \
--min-coverage 4 \
--min-reads2 4 \
--min-avg-qual 20 \
--min-var-freq 0.05 \
--p-value 0.05 \
--output-vcf 1 > output_varscan_indel.vcf
Thanks, Rahil
Hi Dan,
Thanks for the reply. Regarding your concerns:
"If you did provide a reference base when running mpileup for VarScan, but didn't here, please let me know" I did provide the reference sequence for mpileup running in default settings as explained in VarScan manual. Here I showed mpileup output without reference so it is easy to count bases instead of dealing with dots and commas. Otherwise mpileup output in above examples should be the same as used in VarScan. Regarding my follow-up questions:
Follow up on Question 1 (chr7 142460462): I still don't get it how VarScan is calculating those percentages. Can you please elaborate? According to mipleup there are 33 total reads (after passing through default filter) mapping at this position. So 8/33 for C = 0.2424 (24.24%) and remaining 25/33 for T = 0.7575 (75.75%). Moreover the %ages you showed 23.5 and 73.5 do not add up to 100%
Regarding Question 2 and 3: Please explain this if reference base was provided