Hi everyone,
I have a bam file and a snps file for chr1 (hg19) which contains the chromosome number & position for a list of known snps.
$head chr1.snps
chr1 1211292
chr1 2069172
chr1 2069681
chr1 2513216
chr1 2553624
Now, I want to compute the genotype information (just like it is given in a vcf format) at all the positions given in my snps file, but I do not want to "call" any snps because these are already known variants. And I think VCF format is a good standard for getting genotype information so I tried using:
samtools mpileup -u -l chr1.snps -f hg19.fa sample1_filter_sort.bam | bcftools view - > test1.vcf
So here I feed my .snps file and .bam file to samtools mpileup and create an uncompressed .bcf file which I then pipe into bcftools to get a .vcf file. Here I got 245 entries in my .vcf file, so it found 245/513 snps. But as it is not calling any snps, the output is a bit weird:
$head test1.vcf
chr1 1211292 . C X 0 . DP=0;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0.000000,0.000000,0.000000,0.000000 PL 0,0,0
chr1 2069172 . C X 0 . DP=0;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0.000000,0.000000,0.000000,0.000000 PL 0,0,0
chr1 2069681 . C X 0 . DP=0;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0.000000,0.000000,0.000000,0.000000 PL 0,0,0
chr1 2513216 . C T,X 0 . DP=1;I16=0,0,1,0,0,0,34,1156,0,0,39,1521,0,0,4,16;QS=0.000000,1.000000,0.000000,0.000000 PL 34,3,0,34,3,34
chr1 2553624 . T X 0 . DP=0;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=-nan,0.000000,0.000000,0.000000 PL 0,0,0
In another test I use options to call the variants:
samtools mpileup -u -l chr1.snps -f hg19.fa sample1_filter_sort.bam | bcftools view -v -c -g -l chr1.snps - > test2.vcf
Here, I feed my .snps and .bam file to samtools mpileup and create an uncompressed .bcf file which I then pipe into bcftools to get a .vcf file but this time I am calling the variants using -vcg options. Here, I got 16 entries in my .vcf file so it "called" 16/513 snps. The output here is just like I want it to be:
$head test2.vcf
chr1 2513216 . C T 5.46 . DP=1;QS=0.000000,1.000000,0.000000,0.000000;AF1=1;AC1=2;DP4=0,0,1,0;MQ=39;FQ=-30 GT:PL:GQ 1/1:34,3,0:3
chr1 29519778 . T G 187 . DP=8;QS=0.000000,1.000000,0.000000,0.000000;VDB=4.580661e-02;AF1=1;AC1=2;DP4=0,0,4,3;MQ=40;FQ=-48 GT:PL:GQ 1/1:220,21,0:39
chr1 41204569 . C T 222 . DP=91;QS=0.000000,1.000000,0.000000,0.000000;VDB=4.337502e-01;AF1=1;AC1=2;DP4=0,0,47,34;MQ=40;FQ=-271 GT:PL:GQ 1/1:255,244,0:99
chr1 42880516 . T C 222 . DP=11;QS=0.000000,1.000000,0.000000,0.000000;VDB=1.273245e-01;AF1=1;AC1=2;DP4=0,0,5,5;MQ=40;FQ=-57 GT:PL:GQ 1/1:255,30,0:57
chr1 55319902 . A G 222 . DP=20;QS=0.000000,1.000000,0.000000,0.000000;VDB=1.803282e-01;AF1=1;AC1=2;DP4=0,0,12,7;MQ=40;FQ=-84 GT:PL:GQ 1/1:255,57,0:99
So, essentially you can see that there is no genotype information for the snps detected in test1 but there is genotype information in the last column of test2 (because we used variant calling). What I want is that the program should detect the 245 snps just like test1 (I dont want it to call it because I already have the list) but should report the output in a format like test2. How can that be done?
Thanks you are right, the question is that but I was also doubtful about my approach. So nothing is wrong with my approach of extracting genotype information for a list of snps from a bam file?
yes, it should work just fine, most of the time we call snps not even knowing where they are, if you knew them to be true getting the genotype is "easy"
I have edited my question so that it makes sense a bit
I have changed the title of your post your question does not really depend on the existance of the SNP file.