Question

Problems When Calling Snps Using Samtools Mpileup

11.4 years ago

玄 ▴ 20

Hi there,

Recently I have been using samtools mpileup to call SNPs from low-coverage (average 1x) resequencing data of 600+ populations, but I have run into a few problems.

My command line is as follows:

samtools mpileup -E -C50 -Q20 -q20 -DSuf ref.fasta -r chr4:21000001-24000000 -b bam_list | bcftools view -Nbcvg - > partE-chr4:21000001-24000000.bcf

The sequenced individuals are inbred lines, so we want homozygous SNPs. Heterozygous SNPs should be miscalls.

1) The SNPs reported by samtools mpileup look odd:

SNP1: 0/1:26,3,0:1:0:10
SNP2: 0/1:26,3,0:1:0:9

Both SNP1 and SNP2 have PLs of (26,3,0), yet samtools/bcftools reports them as heterozygous.

lh3 has given an answer in the post: Why does samtools/bcftools give incorrect genotypes and innacurate quality scores?

But in the end, which genotype should I assign? My first thought is to compare the three PL values and choose the maximum value as the final genotype. If two of the values are very close, how should I assign the genotype? I'm not sure about this. Can someone offer some suggestions and explanations?

2) For a large population, how should I handle the QUAL column (the sixth column)? How do I choose the cutoff?

This question is also posted on SEQanswers (http://seqanswers.com/forums/showthread.php?t=25354).

samtools snp calling • 6.7k views

score 2 · Answer 1 · 2012-12-03

First of all, you do not really know that the heterozygotes are wrong. It is not guaranteed that every base in an inbred strain is homozygous. I once looked briefly at some Drosophila inbred strains. Except for a couple of the most widely used strains, they are all known to be heterozygous in a small fraction of regions. Furthermore, even if you know for sure that the sample is haploid (e.g. a bacterium), copy number changes may lead to spurious heterozygotes. For single-sample calling, I always recommend treating samples as diploid and filtering heterozygotes afterwards. I have made this argument to quite a few researchers, and I think they all agreed with me in the end.

For multi-sample calling, bcftools works with a mixture of haploid and diploid samples (e.g. for SNP calling on chrX). You need to provide a file like:

sample1 1
sample2 2
sample3 1

Pass it with bcftools view -s. The second column gives the ploidy, which should be either 1 or 2.

However, bcftools always assumes the samples are diploid when it generates GT. In your example, you only have one read. Because hets are more common, a heterozygous call is preferred. The bcftools call is what you would expect for diploid samples. For your haploid samples, you can parse PL and call the most likely base as the genotype. When you find strong evidence that a site is heterozygous, you should be careful with it or even filter it out.
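As a rough illustration of parsing PL for a haploid or inbred sample, here is a minimal Python sketch. It assumes a biallelic site with PL ordered as (ref/ref, ref/alt, alt/alt), where PL is phred-scaled so the smallest value marks the most likely genotype. The function name haploid_genotype and the min_gap cutoff of 20 are hypothetical choices for illustration, not anything bcftools provides.

```python
def haploid_genotype(pl, min_gap=20):
    """Pick a homozygous call from a biallelic PL triple.

    pl is (PL ref/ref, PL ref/alt, PL alt/alt), phred-scaled: lower means
    more likely. min_gap is an assumed, arbitrary phred margin.
    Returns "0" (ref), "1" (alt), "." (ambiguous) or "het" (flag the site).
    """
    hom_ref, het, hom_alt = pl
    best_hom = min(hom_ref, hom_alt)
    # Het fits much better than either homozygous state: strong evidence
    # of heterozygosity, so flag the site instead of forcing a call.
    if best_hom - het >= min_gap:
        return "het"
    # The two homozygous states are too close to separate.
    if abs(hom_ref - hom_alt) < min_gap:
        return "."
    return "0" if hom_ref < hom_alt else "1"
```

For the PLs in the question, haploid_genotype((26, 3, 0)) returns "1": the single read supports the alternate allele, and the homozygous-alternate state beats homozygous-reference by 26 phred units. With a looser triple such as (5, 3, 0) the sketch returns "." because the two homozygous states are within the chosen margin.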

Usually the default QUAL=3 is fine. If you want to make sure, you can plot ts/tv as a function of the QUAL threshold and choose a threshold at which ts/tv does not drop significantly. You may also derive the threshold by comparing against truth data. How exactly this can be done depends on the type of truth data you have.
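The ts/tv check could be tabulated along these lines. This is only a sketch: it assumes the biallelic SNP records have already been read from the VCF text into (QUAL, REF, ALT) tuples, and the names is_transition and tstv_by_threshold are made up for the example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for A<->G or C<->T changes; other SNPs are transversions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)

def tstv_by_threshold(sites, thresholds):
    """Compute ts/tv over the sites with QUAL >= each threshold.

    sites: iterable of (qual, ref, alt) for biallelic SNPs.
    Returns {threshold: ratio}, with None when no transversion passes.
    """
    sites = list(sites)
    ratios = {}
    for t in thresholds:
        ts = tv = 0
        for qual, ref, alt in sites:
            if qual < t:
                continue
            if is_transition(ref, alt):
                ts += 1
            else:
                tv += 1
        ratios[t] = ts / tv if tv else None
    return ratios
```

Plotting the returned ratios against the thresholds gives the curve described above; with very few sites per threshold the ratio will be noisy, so a region larger than a toy example is needed before reading anything into it.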

score 1 · Answer 2 · 2012-12-03

11.4 years ago

Erik Garrison ★ 2.4k

Have you tried other methods? They might provide workarounds. In freebayes, for instance, you can set the ploidy to 1 so that genotypes are modeled under the assumption that the sample is haploid, and only homozygous genotypes are reported.

Thanks for your reply. I have now tried GATK, and so far it reports many more SNPs than samtools. I'll look into it. I think I should also try freebayes. But I would still like to know how to deal with the samtools output.

Reply • 11.4 years ago by 玄 ▴ 20