Samtools: variant calling for INDELs
8.1 years ago
user230613 ▴ 360

Hi all, I've been using samtools for a while to do variant calling. I have some questions about the indel VCF output.

Ex.1:

ch12 11542 . GTTTTTTTTT GTTTTTTTTTT 38.5 . INDEL;IS=11,0.647059;DP=17;VDB=6.858248e-02;AF1=1;AC1=2;DP4=0,0,5,6;MQ=60;FQ=-67.5 GT:PL:DP:GQ 1/1:79,33,0:11:62

In this example, the "real" insertion is only one "T", isn't it? Where is that "T" inserted? At the end of the variant (GTTTTTTTTTT), so that the insertion position would be 11542+length(GTTTTTTTTTT)-1? Or at the beginning of the variant (GTTTTTTTTTT, right after the G)? Looking at IGV, it seems that option 2 is the correct one. But what is the point of reporting the whole region GTTTTTTTTT->GTTTTTTTTTT instead of G->GT?

On the other hand, take a look at this other variant:

Ex.2:

ch12 13971470 . TTTATTATTATTATTATTATTATTATTATTATTATTTCTATTATTATTATTATTATTATTTTTATTATTACTATTATTATTATTATTATTATTAT TTTTTATTATTATTATTATTATTATTATTATTATTATTATTTCTATTATTATTATTATTATTATTTTTATTATTACTATTATTATTATTATTATTATTAT 214.0 . INDEL;IS=9,0.450000;DP=20;VDB=1.562057e-01;AF1=1;AC1=2;DP4=0,0,3,15;MQ=55;FQ=-82.5 GT:PL:DP:GQ 1/1:255,48,0:18:93

What is going on here? Which sequence is the "inserted" one? I find this format very confusing, or I do not quite see the point of it.

Any help would be really appreciated.

8.1 years ago

But, what's the point about putting all the region GTTTTTTTTT->GTTTTTTTTTT, instead of G->GT?

See "Unified representation of genetic variants": http://bioinformatics.oxfordjournals.org/content/31/13/2202


So that page states:

A variant is normalized if and only if

  1. it has no superfluous nucleotides on the left side and
  2. each allele does not end with the same type of nucleotide.

That seems to indicate that this variant call by samtools is not normalized, and that is why it reports the long form.
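To make the two conditions concrete, here is a minimal Python sketch of the trimming they imply. The function name `trim_variant` is made up for illustration, and it only looks at a single REF/ALT pair; it does not consult the reference genome, so it cannot shift a variant further left when an allele would become empty. Tools that take a reference FASTA (for example `bcftools norm -f`) handle that general case.

```python
def trim_variant(pos, ref, alt):
    """Trim shared bases from one REF/ALT pair (hypothetical helper, no reference lookup)."""
    # Condition 2: drop the last base while both alleles end with the same
    # nucleotide, stopping before either allele would become empty.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Condition 1: drop superfluous leading bases, moving POS right each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Ex.1 from the question: the long form collapses to G -> GT at the same POS.
print(trim_variant(11542, "GTTTTTTTTT", "GTTTTTTTTTT"))  # (11542, 'G', 'GT')
```

Under these assumptions, the long record and the short G->GT record describe the same change; only the amount of shared context differs.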


I cannot read the paper, sorry, but thank you, Pierre and Istvan. The samtools format is a bit confusing, but I think I've got it now.

The "correct" interpretation is this: in GTTTTTTTTTT, the insertion happens just after the first base. To find the nucleotide at which the insertion ends, I think one possible way is to compute length(ALT)-length(REF) and add this number to the variant position.
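As a rough sketch of that calculation in Python (the helper name `insertion_after_anchor` is hypothetical): the insertion length is length(ALT)-length(REF), and under the left-aligned reading the extra bases sit right after the first (anchor) base. Note that inserted bases have no reference coordinates of their own; they fall between POS and POS+1.

```python
def insertion_after_anchor(pos, ref, alt):
    """Return (anchor_pos, insertion_length, inserted_bases) for a simple insertion.

    Assumes the left-aligned reading: the extra bases follow the first base.
    """
    n = len(alt) - len(ref)
    if n <= 0:
        raise ValueError("ALT is not longer than REF, so this is not an insertion")
    inserted = alt[1:1 + n]
    # Sanity check: putting the extra bases right after the anchor must rebuild ALT.
    if ref[0] + inserted + ref[1:] != alt:
        raise ValueError("record is not a simple insertion after the first base")
    return pos, n, inserted


# Ex.1 from the question: one T inserted after the G at position 11542.
print(insertion_after_anchor(11542, "GTTTTTTTTT", "GTTTTTTTTTT"))  # (11542, 1, 'T')
```

The sanity check is there because not every record with a longer ALT is a plain insertion after the anchor base; in that case the function raises instead of guessing.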


I just want to note here that, based on sequence alone, we can't tell where the actual insertion took place - technically it could have been anywhere between the Ts. It is a little thing, but it is important to note that "correct" in this sense means "normalized".
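A quick way to see this, using the REF and ALT strings from Ex.1 in a few lines of Python: inserting one T after any base of REF always produces the same ALT sequence, so the sequence alone cannot distinguish the placements.

```python
ref = "GTTTTTTTTT"
alt = "GTTTTTTTTTT"

# Insert a single T after each base of REF in turn (offsets 1..len(ref)).
candidates = {ref[:i] + "T" + ref[i:] for i in range(1, len(ref) + 1)}

# Every placement yields the same sequence, which is exactly ALT.
print(candidates == {alt})  # True
```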
