Biostar Beta. Not for public use.
CIGAR inconsistency BWA/GATK
0
Entering edit mode
17 months ago
ncoca • 0

Hi everyone,

I have a pair end sample that mapped to the human genome (hg19) with bwa, and used GATK IndelRealigner.

The point is that I find that the CIGAR expression is not consistent for some reads.

An example:

NS500644:35:H5WFMAFXX:1:11106:23458:8358    99  9   33797899    40  3S148M  =   33797902    154 GGTCAAGATCATCCGCCACCCCAAATACAACAGCTGGACTCTGGACAATGACATCCTGCTGATCAAGCTCTCCACACCTGCCATCATCAATGCCCATGTGTCCACCATCTCTCTGCCCACCACCCCTCCAGCTGCTGGCACTGAGTGCCTC ??>ABAB@@AA@@@;A@AA@@@AAA@?AAAAABABCBAAAAACA@BBA@CAABABABBB?C?ABBBCBBB?BABB-BABCBABABBABBBACBAABAA@C@BABBA-?BBBBABCBAA-BA@BA@=?B@B?BBCABCBABBBB/B@@@>@9 XA:Z:7,+142470615,3S138M10S,9;7,+142481199,3S148M,12;   MC:Z:27M2I2M2D120M  BD:Z:KKLMNKKKLMMLMKKKLLKLKKLKEKJLJKLJMMMLLLLLKKLLLLJKKMLLJLMKLLMMLLLMMKKMMKKKKLKJKLLLMLLLMMLMMKKMMLKLLMKIKMKLKLLLMKKKKKLMLKLKLLKLKKLKKLMMMLMMLLKMKLLLKMIMLLK    MD:Z:18T11G0G0A0C18A10A6T8G12G0C54  PG:Z:MarkDuplicates RG:Z:1  BI:Z:NNNNONNNOOOOOMNNOOLONNONIONONNONOOPPNNNONNPNMNNNONONNOOMOPOPPOOOONNOPNNNMOLNLOOPOOOOOOOOONONOONOONNNNNMOLOOOONNNNNPOONOLOOLONNONMOOOPPOPPNPPLOPONONOOON    NM:i:11 MQ:i:43 AS:i:95 XS:i:93

NS500644:35:H5WFMAFXX:1:11106:23458:8358    147 9   33797902    43  27M2I2M2D120M   =   33797899    -154    GATCATCCGCCACCCCAAATACAACAGCTGGACTCTGGACAATGACATCCTGCTGATCAAGCTCTCCACACCTGCCATCATCAATGCCCATGTGTCCACCATCTCTCTGCCCACCACCCCTCCAGCTGCTGGCACTGAGTGCCTCATCTCT ?=>A@@A<B1C@B1@B@BA@?B?@CBBCBAB@CACB@B@CBABB@CAABCBBCBBAAC@BBC.CABA@C@BC@BBCAACAACB?BABBBABAAB.BC@BC@@BAC@BBBBBB?AB?ABABAABAABAABAABB?BAAAAAAAB?A@AA=@> XA:Z:7,-142470618,151M,12;7,-142481202,151M,13; MC:Z:3S148M BD:Z:MLLMLLLKKLILJJLKEKKLKKKKLMMMLKMLKKMLKMKKKLMMKMLLMMMMMMMLLKKMMKKKLLIKILMMMKLMLLMLLKKLMKJLMLJKJLLLILLMLKKKKKMMKJLILLILJJMKLLLMMMMMMLLMILMMKLKMKMKLMMLKKKK    MD:Z:15T13^AC18A10A6T8G12G0C59C0    PG:Z:MarkDuplicates RG:Z:1  BI:Z:OOONONNNPNNNMMNPJNNONPNNPPOOOMNONNOOMNNPNOONNNONNOPOOOOOOPOPONNNNNNNNNNOPPNNOONOOPNOPPMNNONLNNNNNNNNONNNNNOPPMNNNNNNMMNNNNPPOOPOOOOONOOONOLPPNNONONNNNN    NM:i:12 MQ:i:40 AS:i:97 XS:i:91


I can see that both sequences are overlapping (and are a pair) with the second one starting 6 bases after fist one (3 are soft clipped). I also notice that both have exactly the same sequence.

Both have the following sequence:

GATCATCCGCCACCCCAAATACAACAGCTGGACTCTGGACAATGACATCCTGCTGATCAAGCTCTCCACACCTGCCATCATCAATGCCCATGTGTCCACCATCTCTCTGCCCACCACCCCTCCAGCTGCTGGCACTGAGTGCCTC


The first CIGAR expression is 3S148M, so I have 148 matches after the first 3 soft clipped bases.

The second CIGAR expression is 27M2I2M2D120M, so I have 2 bases inserted and 2 deleted after 27 matches.

The reference in this region (hg19) is

CAAGATCAATCCGCCACCCCCAATACAACAGGGACACTCTGGACAATGACATCATGCTGATCAAACTCTCCTCACCTGCCGTCATCAATGCCCGCGTGTCCACCATCTCTCTGCCCACCACCCCTCCAGCTGCTGGCACTGAGT


If I map to the reference I get:

REF:    CAAGATCAATCCGCCACCCCCAATACAACAG  GGACACTCTGGACAATGACATCATGCTGATCAAACTCTCCTC...
1st:  GGTCAAGATCATCCGCCACCCCAAATACAACAGCTGG  ACTCTGGACAATGACATCCTGCTGATCAAGCTCTCCAC...
2nd:        GATCATCCGCCACCCCAAATACAACAGCTGG  ACTCTGGACAATGACATCCTGCTGATCAAGCTCTCCAC...


This way the second CIGAR code is correct (CT inserted, GA deleted, assuming that there is no A delecion but ATCAA > GATCA in the 5 first 5 bases), but not the first one...

How could I explain this? Do I get this because GATK IndelRealigner is recalculating sequence and CIGAR in a wrong way? Does it has something to do with the fact that I am in a multimapped region?

In the other hand, they are multimapped sequences and the alternative mapped regions are (XA tag): 7,142470615,3S138M10S, and 7,142481199,3S148M for first read and 7,142470618,151M and 7,142481202,151M for the 2nd. In both cases the 2nd one starts 3 bases after the 1st one.

If I have the sequence fully matched if mapping to chr7...

Why bwa assigned them to chr9 as primary?

Thank you very much

EDIT: I mapped again this sample without doing IndelRealigner and I still get the same thing, so the problem is with bwa

0
Entering edit mode

your post is too long - too many unrelated details - makes too difficult to read and parse. What you should post instead is just the issue at hand (and don't post examples where there is no problem), for example post the CIGAR string the alignment itself like so:

2M2D2M


alignment:

AATTAA
AA--AA


and then ask the question of why it does not match.