Hello all,
Recently, I started to work with alignment data. At the moment I have alignments for Ribo and RNA seq data that I can visualize with IGV tool. I am trying to understand the sam files of these alignments to then to try to get some information from them.
Currently, I have two questions: The first one is to understand the values of MAPping Quality.
I observed this case:
In my reference genome I have this sequence from chromosome 1.
>chr1:10538-10568
CCACCGAAATCTGTGCAGAGGACAACGCAGC
Into the SAM file of the alignment of RiboSeq (single end) reads, I got this:
SRR4293695.162674600 272 chr1 10538 0 30M1S * 0 0 CCACCGAAATCTGTGCAGAGGAGAACGCAGA FFFFAFFFFFFF7<FFFFAFFFFFFF<AAAA NH:i:8 HI:i:2 AS:i:27 nM:i:1
I looked for the SRR4293695.162674600
read into my Ribo Seq reads and I found that this read has a length of 31:
@SRR4293695.162674600
TCTGCGTTCTCCTCTGCACAGATTTCGGTGG
+
AAAA<FFFFFFFAFFFF<7FFFFFFFAFFFF
Evidently, the alignment has been made for the reverse complement of SRR4293695.162674600
.
Since, it seems to be a very good alignment I was thinking it would have a good quality alignment, but it's not the case. It got a Mapping quality of 0. Where −10 log10 Pr{mapping position is wrong}
meaning that the 'mapping position is wrong' = 1.
So, the mapping position is considered to be wrong because the alignment is made from the reverse complement?
The second question that I have is about the CIGAR string 30M1S
:
Why the alignment has 30 Match and one single substitution when actually I see two differences position 22 and position 31.
>chr1:10538-10569
CCACCGAAATCTGTGCAGAGGACAACGCAGC --> Sequence in genome
CCACCGAAATCTGTGCAGAGGAGAACGCAGA --> Alignment sequence
Why this happened? Or why one difference is counted as Match?
Thank you in advance for your answers!
Thank you very much @Istvan Albert. This is so much clear for me now. I had noticed that there are several alignments for the same read from
NH:i:8
tag, and because I greped the name of that read into the sam file to find the 8 times. But, I didn't think that this could affect the quality for the alignment since I thought the alignment as the individual one.. hum.. a very Important thing. Thanks again for this.Said this, I would like to know your point of view. It's better to not consider this read for further analysis, since that I am not sure from which region from the genome the read really came on, or on the contrary I have to consider those regions taking into account the deep of reads in those regions?
On the other hand, I was expecting to have the
256
flag standing forsecondary alignment
as is said in the format specification https://samtools.github.io/hts-specs/SAMv1.pdf. But it's good to know now what272
flag means.Thank you very much in advance.
MarVi : Use this tool to understand what the flags mean.
This is so useful, thank you!
Binary flags add up. Secondary alignment + read aligned reverse = 256 + 16 = 272
Good to know. I am feeing really novice in this. Thank you!