Why Does Samtools/Bcftools Give Incorrect Genotypes And Innacurate Quality Scores?
1
9
Entering edit mode
12.6 years ago
Casbon ★ 3.3k

I have resequenced a target region in depth and called with samtools with default parameters (except depth). I find the calls to be pretty inaccurate.

For example, two of these samples are from the 1000 genomes pilot 2 YRI family. I have used the 1000 genomes calls to check the genotypes at rs73479087 and we have:

  • NA19238 is a heterozgote
  • NA19239 is a homozygous variant

Now consider samtools output:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    FORMAT    NA19238    NA19239
chr11    59855960    rs73479087    A    G    110    .    GT:PL:DP:SP:GQ    0/1:46,0,81:142:0:24    0/1:62,4,0:8:0:23

Both are called as hets with very similar genotype qualities (24 vs 23). But lets look closer at the PL column, we have:

  • NA19238 has 46, 0, 81 - pretty strong evidence that this is a het. It is actually sequenced at depth 142 with an allele balance of 55%
  • NA19239 has 62, 4, 0 - which looks like a borderline homozygous variant call. It is actually sequenced at depth 8 with a single reference read.

This seems to show that samtools generates pretty accurate PLs. However, the genotype column is wrong here: why choose a het when the PL for a variant is lower?. Also the genotype quality score bears no relation to the depth or difference between PLs.

So am I doing it wrong, or is samtools/bcftools not producing accurate calls and scores? Is there something between the PLs and the actual calls?

EDIT:

Program: samtools (Tools for alignments in the SAM format) Version: 0.1.18 (r982:295)

Program: bcftools (Tools for data in the VCF/BCF formats) Version: 0.1.17-dev (r973:277)

Command line:

 samtools mpileup -d 1000000 -SD -l $i -uf /data/reference/ucsc/hg19/ucsc.hg19.fasta data.bam | bcftools view -Ibvcg - > bcf.$i.bcf
samtools bcftools genotyping vcf • 9.7k views
ADD COMMENT
0
Entering edit mode

Might be worth adding the version of samtools and the command line switches used to get that output.

ADD REPLY
0
Entering edit mode
ADD REPLY
5
Entering edit mode
12.6 years ago
lh3 33k

Yes, this is a flaw of bcftools and that is exactly why bcftools does not output genotypes by default.

On the other hand, the whole idea of bcftools is to do inference without knowing genotypes. Bcftools never looks at the genotypes itself infers, but can pretty accurately estimate site allele frequency, compute the LD r^2, discovery very rare mutations and test association and HWE. Actually if you ascertain genotypes without imputation and then do things above, you frequently get much worse results.

So, my suggestion is not to look at genotypes. Bcftools has already computed many statistics you are interested in, only in the right way.

ADD COMMENT
0
Entering edit mode

Well, as I understood it, the point is to avoid genotype calling in low coverage data. Since this it high coverage, this makes bcftools inappropriate?

ADD REPLY
0
Entering edit mode

I am using SAMTools Version: 1.3 (using htslib 1.3) , How to get GQ score for my variants ? I have used this command. samtools mpileup -DguS -f PMC01-01.bam | bcftools call -vmO z -o PMC01-01.vcf.gz

ADD REPLY

Login before adding your answer.

Traffic: 2707 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6