Hi everyone,
I have a pair end sample that mapped to the human genome (hg19) with bwa, and used GATK IndelRealigner.
The point is that I find that the CIGAR expression is not consistent for some reads.
An example:
NS500644:35:H5WFMAFXX:1:11106:23458:8358 99 9 33797899 40 3S148M = 33797902 154 GGTCAAGATCATCCGCCACCCCAAATACAACAGCTGGACTCTGGACAATGACATCCTGCTGATCAAGCTCTCCACACCTGCCATCATCAATGCCCATGTGTCCACCATCTCTCTGCCCACCACCCCTCCAGCTGCTGGCACTGAGTGCCTC ??>ABAB@@AA@@@;A@AA@@@AAA@?AAAAABABCBAAAAACA@BBA@CAABABABBB?C?ABBBCBBB?BABB-BABCBABABBABBBACBAABAA@C@BABBA-?BBBBABCBAA-BA@BA@=?B@B?BBCABCBABBBB/B@@@>@9 XA:Z:7,+142470615,3S138M10S,9;7,+142481199,3S148M,12; MC:Z:27M2I2M2D120M BD:Z:KKLMNKKKLMMLMKKKLLKLKKLKEKJLJKLJMMMLLLLLKKLLLLJKKMLLJLMKLLMMLLLMMKKMMKKKKLKJKLLLMLLLMMLMMKKMMLKLLMKIKMKLKLLLMKKKKKLMLKLKLLKLKKLKKLMMMLMMLLKMKLLLKMIMLLK MD:Z:18T11G0G0A0C18A10A6T8G12G0C54 PG:Z:MarkDuplicates RG:Z:1 BI:Z:NNNNONNNOOOOOMNNOOLONNONIONONNONOOPPNNNONNPNMNNNONONNOOMOPOPPOOOONNOPNNNMOLNLOOPOOOOOOOOONONOONOONNNNNMOLOOOONNNNNPOONOLOOLONNONMOOOPPOPPNPPLOPONONOOON NM:i:11 MQ:i:43 AS:i:95 XS:i:93
NS500644:35:H5WFMAFXX:1:11106:23458:8358 147 9 33797902 43 27M2I2M2D120M = 33797899 -154 GATCATCCGCCACCCCAAATACAACAGCTGGACTCTGGACAATGACATCCTGCTGATCAAGCTCTCCACACCTGCCATCATCAATGCCCATGTGTCCACCATCTCTCTGCCCACCACCCCTCCAGCTGCTGGCACTGAGTGCCTCATCTCT ?=>A@@A<B1C@B1@B@BA@?B?@CBBCBAB@CACB@B@CBABB@CAABCBBCBBAAC@BBC.CABA@C@BC@BBCAACAACB?BABBBABAAB.BC@BC@@BAC@BBBBBB?AB?ABABAABAABAABAABB?BAAAAAAAB?A@AA=@> XA:Z:7,-142470618,151M,12;7,-142481202,151M,13; MC:Z:3S148M BD:Z:MLLMLLLKKLILJJLKEKKLKKKKLMMMLKMLKKMLKMKKKLMMKMLLMMMMMMMLLKKMMKKKLLIKILMMMKLMLLMLLKKLMKJLMLJKJLLLILLMLKKKKKMMKJLILLILJJMKLLLMMMMMMLLMILMMKLKMKMKLMMLKKKK MD:Z:15T13^AC18A10A6T8G12G0C59C0 PG:Z:MarkDuplicates RG:Z:1 BI:Z:OOONONNNPNNNMMNPJNNONPNNPPOOOMNONNOOMNNPNOONNNONNOPOOOOOOPOPONNNNNNNNNNOPPNNOONOOPNOPPMNNONLNNNNNNNNONNNNNOPPMNNNNNNMMNNNNPPOOPOOOOONOOONOLPPNNONONNNNN NM:i:12 MQ:i:40 AS:i:97 XS:i:91
I can see that both sequences are overlapping (and are a pair) with the second one starting 6 bases after fist one (3 are soft clipped). I also notice that both have exactly the same sequence.
Both have the following sequence:
GATCATCCGCCACCCCAAATACAACAGCTGGACTCTGGACAATGACATCCTGCTGATCAAGCTCTCCACACCTGCCATCATCAATGCCCATGTGTCCACCATCTCTCTGCCCACCACCCCTCCAGCTGCTGGCACTGAGTGCCTC
The first CIGAR expression is 3S148M, so I have 148 matches after the first 3 soft clipped bases.
The second CIGAR expression is 27M2I2M2D120M, so I have 2 bases inserted and 2 deleted after 27 matches.
The reference in this region (hg19) is
CAAGATCAATCCGCCACCCCCAATACAACAGGGACACTCTGGACAATGACATCATGCTGATCAAACTCTCCTCACCTGCCGTCATCAATGCCCGCGTGTCCACCATCTCTCTGCCCACCACCCCTCCAGCTGCTGGCACTGAGT
If I map to the reference I get:
REF: CAAGATCAATCCGCCACCCCCAATACAACAG GGACACTCTGGACAATGACATCATGCTGATCAAACTCTCCTC...
1st: GGTCAAGATCATCCGCCACCCCAAATACAACAGCTGG ACTCTGGACAATGACATCCTGCTGATCAAGCTCTCCAC...
2nd: GATCATCCGCCACCCCAAATACAACAGCTGG ACTCTGGACAATGACATCCTGCTGATCAAGCTCTCCAC...
This way the second CIGAR code is correct (CT inserted, GA deleted, assuming that there is no A delecion but ATCAA > GATCA in the 5 first 5 bases), but not the first one...
How could I explain this? Do I get this because GATK IndelRealigner is recalculating sequence and CIGAR in a wrong way? Does it has something to do with the fact that I am in a multimapped region?
In the other hand, they are multimapped sequences and the alternative mapped regions are (XA tag):
7,142470615,3S138M10S
, and 7,142481199,3S148M
for first read and 7,142470618,151M
and 7,142481202,151M
for the 2nd. In both cases the 2nd one starts 3 bases after the 1st one.
If I have the sequence fully matched if mapping to chr7...
Why bwa assigned them to chr9 as primary?
Thank you very much
EDIT: I mapped again this sample without doing IndelRealigner and I still get the same thing, so the problem is with bwa
your post is too long - too many unrelated details - makes too difficult to read and parse. What you should post instead is just the issue at hand (and don't post examples where there is no problem), for example post the CIGAR string the alignment itself like so:
alignment:
and then ask the question of why it does not match.