Disagreement in the coordinates of VCF and internal formats for Pindel
6.7 years ago
jvel • 0

I am observing strange discrepancies between the information present in the VCF created by Pindel and the internal file format. Here is what I did:

  • I downloaded Pindel from GitHub 2 days ago with the command: git clone https://github.com/genome/pindel.git
  • I ran Pindel on a sample BAM file, using the human GRCh37 g1k_v37 decoy genome sequence as the reference.
  • I noticed that the VCF file does not contain the support information, score, or quality measures, so I decided to recover them by merging the VCF file with the corresponding lines from the internal format. I extracted those lines with: grep ChrID internal_file
  • However, not all the coordinates matched. For one particular case I had in the VCF:

    1 10290621 . ATAGCTGGGATTACAGGTGTGTGCCACCACACCTGGTTAATTTTTGTATTTTTAATAGAGACGGGGTTTCACCGTGTTGGCTAGGCTGGTCTTGAT GTACTTGGGATTACTGGCGTACGCCACCACGCCCAGCTAATTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTCAACCAGGCTGGTCTCGAA . PASS END=10290715;HOMLEN=0;SVLEN=-96;SVTYPE=RPL;NTLEN=96 GT:AD 0/0:0,1

and the internal format:

3305 D 96 NT 96 "GTACTTGGGATTACTGGCGTACGCCACCACGCCCAGCTAATTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTCAACCAGGCTGGTCTCGAA" ChrID 1 BP 10290620 10290717 BP_range 10290620 10290717 Supports 1 1 + 0 0 - 1 1 S1 2 SUM_MS 99 1 NumSupSamples 1 1 pFDA_simTruth_76x_0.4_FEMALE 0 0 0 0 1 1

  • Notice that in the VCF the position of the variant is 10290621, while the internal file says 10290620. I checked in the UCSC Genome Browser that the REF sequence is 96 bp long, starting at 10290621 and ending at 10290716.
  • So I now have the following discrepancies in the (begin, end) coordinates: reference (10290621, 10290716), VCF (10290621, 10290715), internal format (10290620, 10290717).
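To make the size and direction of each mismatch explicit, here is a minimal sketch of the arithmetic. The helper name `offsets` is made up for illustration; the numbers are copied from the first record, and the reference span is the one checked in the UCSC Genome Browser.

```python
def offsets(ref_span, other_span):
    """Return (start_offset, end_offset) of other_span relative to ref_span."""
    return (other_span[0] - ref_span[0], other_span[1] - ref_span[1])

# Coordinates copied from the first record in this post.
reference = (10290621, 10290716)  # REF span as checked in the UCSC Genome Browser
vcf = (10290621, 10290715)        # POS and END from the VCF line
internal = (10290620, 10290717)   # the two BP values from the internal format

# Inclusive length of the reference span.
print("reference length:", reference[1] - reference[0] + 1)  # 96
print("VCF offsets:", offsets(reference, vcf))                # (0, -1)
print("internal offsets:", offsets(reference, internal))      # (-1, 1)
```

So in this record the VCF END is one base short of the reference end, while the internal BP values sit one base outside the reference span on each side.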

  • I repeated the exercise with a line where the start position matches:

1 10289908 . GGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAAAATTAGGGGCCAGACGTGGTGGCTCACACCTATAATCCCAGC GGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAAAATTAGGGGCCAGACGTGGTGGCTCACACCTATAATCCCAGCTATTCAGGAGGCTGAGGCAGGAGAATCACTTGAACCCAGGAGGTGGAGGTTGCAGTGAGCTGAGATCGCACCACTGCACTCCAGCCTGGGTCACAGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAAAATTAGGGGCCAGACGTGGTGGCTCACACCTATAATCCCAGC . PASS END=10290003;HOMLEN=0;SVLEN=95;SVTYPE=DUP:TANDEM;NTLEN=95 GT:AD 0/0:0,1

168 TD 95 NT 95 "TATTCAGGAGGCTGAGGCAGGAGAATCACTTGAACCCAGGAGGTGGAGGTTGCAGTGAGCTGAGATCGCACCACTGCACTCCAGCCTGGGTCACA" ChrID 1 BP 10289908 10290004 BP_range 10289908 10290004 Supports 1 1 + 0 0 - 1 1 S1 2 SUM_MS 99 1 NumSupSamples 1 1 pFDA_simTruth_76x_0.4_FEMALE 0 0 0 0 1 1

In this case the (begin, end) coordinates are: reference (10289908, 10290003), VCF (10289908, 10290003), and internal format (10289908, 10290004).
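For comparing many records rather than checking them by eye, the internal-format line can be parsed with a regular expression. This is only a sketch: the function name `parse_pindel_header` and the field names are hypothetical, and the field layout is inferred solely from the two internal-format lines quoted in this post, not from the Pindel documentation.

```python
import re

# Pattern inferred only from the two internal-format lines quoted in this post;
# the layout is an assumption, not taken from the Pindel documentation.
HEADER_RE = re.compile(
    r'^(?P<index>\d+)\s+(?P<sv_type>\S+)\s+(?P<sv_len>\d+)\s+'
    r'NT\s+(?P<nt_len>\d+)\s+"(?P<nt_seq>[^"]*)"\s+'
    r'ChrID\s+(?P<chrom>\S+)\s+BP\s+(?P<bp_start>\d+)\s+(?P<bp_end>\d+)'
)

def parse_pindel_header(line):
    """Extract type, lengths, chromosome and BP coordinates from one line."""
    m = HEADER_RE.match(line)
    if m is None:
        raise ValueError("line does not look like a Pindel record header")
    rec = m.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("index", "sv_len", "nt_len", "bp_start", "bp_end"):
        rec[key] = int(rec[key])
    return rec

# The second internal-format line from this post.
line = (
    '168 TD 95 NT 95 '
    '"TATTCAGGAGGCTGAGGCAGGAGAATCACTTGAACCCAGGAGGTGGAGGTTGCAGTGAGCTGAGATCG'
    'CACCACTGCACTCCAGCCTGGGTCACA" '
    'ChrID 1 BP 10289908 10290004 BP_range 10289908 10290004 '
    'Supports 1 1 + 0 0 - 1 1 S1 2 SUM_MS 99 1 NumSupSamples 1 1 '
    'pFDA_simTruth_76x_0.4_FEMALE 0 0 0 0 1 1'
)
rec = parse_pindel_header(line)

# POS and END from the matching VCF line.
vcf_pos, vcf_end = 10289908, 10290003
print(rec["sv_type"], rec["chrom"], rec["bp_start"], rec["bp_end"])
print("start offset:", rec["bp_start"] - vcf_pos)  # 0
print("end offset:", rec["bp_end"] - vcf_end)      # 1
```

Running the same comparison over all records would show whether the start offset of 0 versus -1 depends on the variant type (D versus TD here) or on something else.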

  • I cannot make sense of it.
    • In the first case the VCF coordinates are wrong and the internal format coordinates seem to be flanking.
    • In the second case the VCF coordinates are correct but the internal format coordinates are not flanking.

The user manual does not explain any of this, so I am at a loss. Any help is appreciated.

pindel variant calling • 1.3k views