Why Does A Query For Single Position Yield Neighboring Positions In Tabix?
1
0
Entering edit mode
11.0 years ago
obfuskater ▴ 20

I have found a major bug in tabix that appears both in its command line and perl API contexts.

Tabix seems to retrieve more than what is queried for when dealing with single positions in a vcf file (for example dbSNP). The problems seems to occur if the neighboring positions are too close to the target position. Tabix cannot retrieve a single position if you supply a start position that is equal to the end position, so one must define a region small enough to include a only single position. The way I had been doing this is to supply a start position that is 0.5 less than my target position an an end point that is 0.5 more. This works most of the time, except for cases where a positions has neighbors within a few units of it.

Below are a few examples of what I am seeing. The grep command shows the target postion plus 6 lines of context in dbSNP.

The obvious solution would be to discard the results that don't match the target position. However, it does create some doubt about the reliability of tabix.

Has anybody else seen this? Could anybody explain why this occurs?


This example shows 1:900526's neighbor (900525) getting picked up

*Target: Chr 1 Pos 900526

$ `grep -P -A 3 -B 3 '1\t900526' 00-All.vcf`
1       900519  rs140019196     A       G
1       900522  rs79133198      T       C
1       900525  rs201438052     C       
*1       900526  rs146302352     G       A
1       900528  rs150067459     C       T
1       900529  rs148779938     G       A
1       900544  rs200240051     G       A
1       900526  rs146302352     G       A


$ `tabix 00-All.vcf.gz 1:900525.5-900526.5`
1       900525  rs201438052     C       T
1       900526  rs146302352     G       A

Perl: `$tabix_dbSNP->query(1, 900525.5, 900526.5)`
1       900526  rs146302352     G       A
--------------------------------------------------------------

This case shows 1686296's neighbor (16862293) getting picked up.

*Target: Chr 1 Pos 16862296

$ grep -P -A 3 -B 3 '1\t16862296' 00-All.vcf`
1       16862273        rs76517011      C       A
1       16862282        rs11260784      G       A
1       16862293        rs71259516      GACG    GACC,GGCG
*1       16862296        rs1759227       G       C
1       16862340        rs200564364     G       A
1       16862432        rs188066837     A       G
1       16862454        rs71248115      T       G

$ tabix 00-All.vcf.gz 1:16862295.5-16862296.5
1       16862293        rs71259516      GACG    GACC,GGCG
1       16862296        rs1759227       G       C

Perl: $tabix_dbSNP->query(1, 16862295.5, 16862296.5)

1       16862293        rs71259516      GACG    GACC,GGCG
1       16862296        rs1759227       G       C

This case works fine--returns on the target position; the neighbors are more than a few units away from the target position.

*Target: Chr 1 Pos 16952481

$ `grep -P -A 3 -B 3 '1\t16952481' 00-All.vcf`
1       16952371        rs57507062      C       T
1       16952377        rs9658928       C       T
1       16952382        rs744025        C       T
*1       16952481        rs915311        A       G
1       16952510        rs1765551       G       A
1       16952565        rs71259460      TC      TCCTC,TTCTT
1       16952577        rs11799496      C       A 

$ `tabix 00-All.vcf.gz 1:16952480.5-16952481.5`
1       16952481        rs915311        A       G

Perl: `$tabix_dbSNP->query(1, 16952480.5, 16952481.5)`
1       16952481        rs915311        A       G
tabix perl • 3.9k views
ADD COMMENT
3
Entering edit mode

Please try not to use words like "major bug" in the title unless you are 100% sure. It is very irritating, especially when you use floats for genomic coordinates.

ADD REPLY
1
Entering edit mode

I agree, this is poor form. I've edited the title to change to "possible bug" so as not to alarm too many people (like me when I saw the post).

ADD REPLY
0
Entering edit mode

My apologies. I won't make that mistake again.

ADD REPLY
6
Entering edit mode
11.0 years ago
lh3 33k

These are NOT bugs. Positions are integers. There are no positions like 1234.5. Your first example is equivalent to tabix 00-All.vcf.gz 1:16862295-16862296. Tabix gives the right results. Furthermore, tabix gives overlapping records which both the paper and the the manpage have mentioned. This is a crucial feature rather than a bug. In your second example, the MNP at 16862293 spans position from 16862293 to 16862296 inclusive. Tabix gives the right output again.

ADD COMMENT
2
Entering edit mode

My apologies for the poor wording of the original post. Thank you for the correction.

ADD REPLY
0
Entering edit mode

Thanks. I've edited the title from "Major Bug" to "Possible Bug", though even that is still incorrect. Not sure what the right moderator etiquette is for major surgery of the post title.

ADD REPLY
0
Entering edit mode

Thank you for fixing my mistake.

ADD REPLY
0
Entering edit mode

I usually have few qualms about rewriting titles to be more closely representative of the content. After all it helps no one if the title is misleading or incorrect, just adds to later confusion.

ADD REPLY

Login before adding your answer.

Traffic: 1596 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6