I noticed that GATK sometimes calls two consecutive indels like the two below. The first, at position 3479486, is a deletion from AAG to A. The second, at 3479487, is a deletion from AG to A. Both indels survived fairly strict quality filtering, both are homozygous, and both are supported by 54 reads. The two VCF lines are:
chr13 3479486 . AAG A 1640.73 PASS AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=53.87;MQ0=0;QD=29.30;SOR=0.767 GT:AD:DP:GQ:PL 1/1:0,54:56:99:1678,151,0
chr13 3479487 . AG A 1448.73 PASS AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=53.87;MQ0=0;QD=25.87;SOR=0.767 GT:AD:DP:GQ:PL 1/1:0,54:56:99:1486,160,0
My reference in the region is as follows:
>chr13:3479484-3479488
AAAAG
This convinced me that GATK somehow got confused and is calling two different variants for the same event. Realignment around indels has already been performed.
For downstream analysis I want to find a general way of dealing with this issue by removing one of the two calls.
Are you aware of any solution for this?
EDIT Sept 2nd:
I found that the solution provided in a Biostars post might work for me (so maybe my question is a duplicate?):
bcftools filter --IndelGap 3 infile.vcf > outfile.vcf
I will stick to it, but two minor improvements would be great: 1) I would like to remove indels that overlap, irrespective of the distance between them; 2) I would like to choose which indel to remove based on some quality information (it looks like bcftools always removes the second instance).
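To make the two wishes concrete, here is a minimal Python sketch of the behaviour I have in mind. It is not a bcftools feature, and the function names `indel_span` and `drop_overlapping_indels` are made up for illustration. It assumes an uncompressed, tab-delimited VCF sorted by position, treats any record whose REF and ALT lengths differ as an indel, defines "overlap" as sharing at least one reference base (POS to POS + len(REF) - 1), and uses the QUAL column as the quality criterion. Each indel is compared only with the last indel that was kept, which is a simplification.

```python
def indel_span(fields):
    """Return (chrom, start, end) of the reference bases covered by an indel.

    Returns None for records where every ALT has the same length as REF
    (SNPs/MNPs). Assumption: symbolic ALTs are not handled specially.
    """
    chrom, pos, ref = fields[0], int(fields[1]), fields[3]
    alts = fields[4].split(",")
    if all(len(alt) == len(ref) for alt in alts):
        return None
    return chrom, pos, pos + len(ref) - 1


def drop_overlapping_indels(lines):
    """Keep only the higher-QUAL record of any two overlapping indels.

    `lines` is an iterable of VCF lines (headers included), sorted by position.
    Headers and non-indel records pass through unchanged. On a QUAL tie the
    earlier record is kept.
    """
    lines = [line.rstrip("\n") for line in lines]
    keep = [True] * len(lines)
    last = None  # (index, chrom, end, qual) of the last indel we kept

    for i, line in enumerate(lines):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        span = indel_span(fields)
        if span is None:
            continue
        chrom, start, end = span
        # Assumption: a missing QUAL (".") counts as the lowest quality.
        qual = float(fields[5]) if fields[5] != "." else 0.0

        overlaps = last is not None and last[1] == chrom and start <= last[2]
        if overlaps and qual <= last[3]:
            keep[i] = False            # current call loses, drop it
        else:
            if overlaps:
                keep[last[0]] = False  # current call wins, drop the earlier one
            last = (i, chrom, end, qual)

    return [line for line, kept in zip(lines, keep) if kept]
```

On the two records above, the spans are 3479486-3479488 and 3479487-3479488, so they overlap and only the call at 3479486 (QUAL 1640.73) would be kept. Something like `print("\n".join(drop_overlapping_indels(open("infile.vcf"))))` would apply it to a whole file. Swapping QUAL for another field such as QD would only require changing the line that sets `qual`.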