Why Is The Variant File Produced By Samtools So Large? It Is About ~34G.
1
0
Entering edit mode
10.7 years ago
Jordan ★ 1.3k

Hi,

I have a bam file of size 62G. When I do the variant analysis by samtools, it gives a huge bcf file like 34G. It's quite unusual. It's not even a vcf file. It's a compressed vcf file.

I'm not sure if I'm doing it right. I used the following command:

samtools mpileup -uf ~/refs/human_g1k_v37.fasta normal.bam | bcftools view -bvcg - > normal.raw.bcf

When I looked at this bcf file, I found something rather strange.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Unknown
chr1    10114   .       N       T       5.45    .       DP=38;VDB=2.588829e-01;AF1=1;AC1=2;DP4=0,0,11,15;MQ=0;FQ=-105   GT:PL:GQ        1/1:37,78,0:45
chr1    10115   .       N       A       4.76    .       DP=26;VDB=5.001104e-03;AF1=1;AC1=2;DP4=0,0,8,17;MQ=0;FQ=-99     GT:PL:GQ        1/1:36,72,0:42
chr1    10116   .       N       A       9.51    .       DP=26;VDB=1.725300e-02;AF1=1;AC1=2;DP4=0,0,7,17;MQ=0;FQ=-99     GT:PL:GQ        1/1:42,72,0:60
chr1    10117   .       N       C       8.64    .       DP=25;VDB=8.678101e-02;AF1=1;AC1=2;DP4=0,0,6,16;MQ=0;FQ=-93     GT:PL:GQ        1/1:41,66,0:57
chr1    10118   .       N       C       9.51    .       DP=25;VDB=1.464226e-01;AF1=1;AC1=2;DP4=0,0,6,17;MQ=0;FQ=-96     GT:PL:GQ        1/1:42,69,0:60

As you can see all the REF alleles are labeled as N. I'm not sure why it shows that. Can anyone help?

samtools mpileup variant-calling vcf • 3.5k views
ADD COMMENT
0
Entering edit mode

Actually I think I figured out why it's so large. It's recognizing N's as nucleotides in Reference file and hence producing each allele as a variant. So basically each and every position is being labeled as an variant. Hence, the huge file size. But I'm not sure why this is occuring though.

ADD REPLY
2
Entering edit mode
10.7 years ago

human_g1k_v37.fasta looks like the reference of the 1KG project: chromosomes doesn't have the 'chr' prefix, while your bam seem to have been mapped on the UCSC reference (with 'chr' prefix).

ADD COMMENT

Login before adding your answer.

Traffic: 2550 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6