I have a question about the dbSNP VCF file and the RSPOS coordinate in the INFO field. I am working with GRCh37 and the dbSNP release 149 VCF.
$ head -n5 00-All.vcf
##fileformat=VCFv4.0
##fileDate=20161121
##source=dbSNP
##dbSNP_BUILD_ID=149
##reference=GRCh37.p13
The RSPOS entry in the INFO field is described here:
$ head -n9 00-All.vcf | tail -n1
##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Chr position reported in dbSNP">
There is a longer description given at https://www.ncbi.nlm.nih.gov/variation/docs/glossary :
Chromosome position based on dbSNP submission. Note that for insertion or deletion variants, the value in RSPOS is different from the value in VCF POS column. VCF POS is the leftmost position of the variant on the reference sequence identified in CHROM. In addition, VCF POS is the position of the base before the insertion or deletion, and therefore may be 5' upstream from the variant location in RSPOS and the location represented by HGVS notation, which uses the rightmost position. For that reason, VCF POS and alleles may look different from HGVS in position and sequence, as in the example rs398124296: "1 45974693 rs398124296 CAGA C .... CLNHGVS=NC_000001.10:g.45974696_45974698delAAG…"
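To make the glossary example concrete, here is a minimal Python sketch of how I read it. The function names and the six-base reference context are my own assumptions: the context CAGAAG at 1:45974693-45974698 is inferred from the VCF REF allele (CAGA) and the HGVS string (delAAG at 45974696_45974698), not read from the genome. It shows the left-anchored VCF deletion being slid right until it matches the HGVS coordinates.

```python
def deleted_interval(pos, ref, alt):
    """Deleted reference interval for a VCF deletion whose ALT is the single anchor base."""
    assert len(alt) == 1 and ref.startswith(alt)
    start = pos + 1              # first deleted base follows the anchor base at POS
    end = pos + len(ref) - 1     # last deleted base
    return start, end, ref[1:]


def shift_right(context_start, context, start, end):
    """Slide a deletion rightwards for as long as the resulting sequence is unchanged."""
    def base(p):
        return context[p - context_start]

    last = context_start + len(context) - 1
    # Moving the window by one base is neutral when the base leaving on the left
    # equals the base entering on the right. The loop also stops at the end of
    # the toy context, so it cannot say what happens beyond it.
    while end < last and base(start) == base(end + 1):
        start += 1
        end += 1
    bases = context[start - context_start:end - context_start + 1]
    return start, end, bases


# ASSUMPTION: reference bases at 1:45974693-45974698, inferred from the glossary example.
CONTEXT_START = 45974693
CONTEXT = "CAGAAG"

vcf_start, vcf_end, vcf_bases = deleted_interval(45974693, "CAGA", "C")
print(vcf_start, vcf_end, vcf_bases)    # 45974694 45974696 AGA
print(shift_right(CONTEXT_START, CONTEXT, vcf_start, vcf_end))
# (45974696, 45974698, 'AAG'), matching g.45974696_45974698delAAG
```

So for this rsid, POS and the HGVS start differ by 3 bases: one for the anchor base and two from right-shifting within the repeat.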
I am not sure how to interpret the POS and RSPOS coordinates in the VCF. Here are two examples:
1) Consider the indel with coordinates 7:100366706;REF=T;ALT=TTTTTG
$ bcftools view -H -r 7:100366706-100366706 00-All.vcf.gz
7 100366706 rs148527571 T TTTTTG . . RS=148527571;RSPOS=100366706;dbSNPBuildID=134;SSR=0;SAO=0;VP=0x05000008000517002e000200;GENEINFO=ZAN:7455;WGT=1;VC=DIV;INT;ASP;VLD;G5A;G5;KGPhase3;CAF=0.8007,0.1993;COMMON=1
7 100366706 rs77888755 T TTTTG . . RS=77888755;RSPOS=100366709;dbSNPBuildID=131;SSR=0;SAO=0;VP=0x050000080005000002000200;GENEINFO=ZAN:7455;WGT=1;VC=DIV;INT;ASP
7 100366706 rs199620324 T TTTTTG . . RS=199620324;RSPOS=100366728;dbSNPBuildID=137;SSR=0;SAO=0;VP=0x050000080005000002000200;GENEINFO=ZAN:7455;WGT=1;VC=DIV;INT;ASP;CAF=0.8007,0.1993;COMMON=1
Two rsids in the file are listed with this exact variant (same POS, REF and ALT): rs148527571 and rs199620324. The first has RSPOS equal to the VCF POS field; the second, however, is offset by 22 base pairs. Given that the insertion is only 5 base pairs long, this seems inconsistent with the suggestion in the glossary linked above that RSPOS uses the rightmost base pair position of the variant.
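For reference, this is roughly how I tabulated the offsets. It is a small Python sketch that pulls POS and RSPOS out of a VCF data line; the function name is mine, and the INFO fields in the sample records are trimmed to RS and RSPOS for brevity.

```python
def pos_and_rspos(vcf_line):
    """Return (rsid, POS, RSPOS, RSPOS - POS) for one VCF data line."""
    fields = vcf_line.split()
    pos, rsid, info = int(fields[1]), fields[2], fields[7]
    rspos = None
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "RSPOS":
            rspos = int(value)
    if rspos is None:
        raise ValueError("no RSPOS entry in INFO for " + rsid)
    return rsid, pos, rspos, rspos - pos


# Records from the bcftools output above, with INFO trimmed to RS and RSPOS.
records = [
    "7 100366706 rs148527571 T TTTTTG . . RS=148527571;RSPOS=100366706",
    "7 100366706 rs77888755 T TTTTG . . RS=77888755;RSPOS=100366709",
    "7 100366706 rs199620324 T TTTTTG . . RS=199620324;RSPOS=100366728",
]

for line in records:
    print(*pos_and_rspos(line))
# rs148527571 100366706 100366706 0
# rs77888755 100366706 100366709 3
# rs199620324 100366706 100366728 22
```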
The rsids are shown at different positions in the NCBI Variation Viewer, apparently at positions consistent with their RSPOS values. However, the alternative allele shown in the viewer for rs199620324 differs from the one in the VCF: it is an insertion of TTGTT rather than TTTTG. See a viewer screenshot here:
Is rs148527571 the only valid rsid for this variant?
Are the VCF POS/ALT fields wrong for rs199620324?
2) A second example. The indel with coordinates 1:3758650;REF=TTTTTTA;ALT=T
$ bcftools view -H -r 1:3758650-3758650 00-All.vcf.gz
1 3758650 rs796470771 TTTTTTA T . . RS=796470771;RSPOS=3758651;dbSNPBuildID=146;SSR=0;SAO=0;VP=0x05000008000501003e000200;GENEINFO=CEP104:9731;WGT=1;VC=DIV;INT;ASP;G5;KGPhase1;KGPhase3;CAF=0.8706,0.1294;COMMON=1
1 3758650 rs140734282 TTTTTTA T . . RS=140734282;RSPOS=3758657;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000080005140002000200;GENEINFO=CEP104:9731;WGT=1;VC=DIV;INT;ASP;VLD;CAF=0.8706,0.1294;COMMON=1
1 3758650 rs374457800 TTTTTTA T . . RS=374457800;RSPOS=3758662;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000080005000002000200;GENEINFO=CEP104:9731;WGT=1;VC=DIV;INT;ASP;CAF=0.8706,0.1294;COMMON=1
For this variant there are three lines in the VCF, each with a distinct rsid and a different RSPOS value. Again, the rsids appear at different locations in the viewer. The deleted sequence highlighted in the viewer for rs374457800 (ATTTTT) seems to be inconsistent with the deleted sequence implied by the VCF REF and ALT alleles (TTTTTA).
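One thing I did check: if the same indel is simply placed at a different position inside a repeat, the inserted or deleted sequence should be a cyclic rotation of the original one. The Python sketch below (helper names are mine) tests that for both examples. This is only a necessary condition, not proof that the records are equivalent; confirming it would need the actual reference sequence around each RSPOS.

```python
def indel_bases(ref, alt):
    """Inserted or deleted bases of a VCF indel, i.e. the longer allele minus the shared anchor base."""
    assert ref[0] == alt[0]
    longer = ref if len(ref) > len(alt) else alt
    return longer[1:]


def is_rotation(a, b):
    """True if b is a cyclic rotation of a."""
    return len(a) == len(b) and b in a + a


# Example 1: VCF insertion vs. the allele shown in the viewer for rs199620324.
inserted = indel_bases("T", "TTTTTG")
print(inserted, is_rotation(inserted, "TTGTT"))    # TTTTG True

# Example 2: VCF deletion vs. the sequence highlighted for rs374457800.
deleted = indel_bases("TTTTTTA", "T")
print(deleted, is_rotation(deleted, "ATTTTT"))     # TTTTTA True
```

Both viewer alleles pass the rotation check, so the differing sequences alone do not rule out that these rsids describe the same change at shifted positions.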
Again, which rsid is valid here? Assuming the VCF is valid and only one rsid corresponds to the variant, why do the other two rsids appear with these coordinates in the VCF?
Thanks very much for any assistance.