Question

Understanding MAPping Quality in SAM file / Star

1

Entering edit mode

6.0 years ago

MarVi ▴ 30

Hello all,

Recently, I started to work with alignment data. At the moment I have alignments for Ribo and RNA seq data that I can visualize with IGV tool. I am trying to understand the sam files of these alignments to then to try to get some information from them.

Currently, I have two questions: The first one is to understand the values of MAPping Quality.

I observed this case:

In my reference genome I have this sequence from chromosome 1.

>chr1:10538-10568
CCACCGAAATCTGTGCAGAGGACAACGCAGC

Into the SAM file of the alignment of RiboSeq (single end) reads, I got this:

SRR4293695.162674600    272 chr1    10538   0   30M1S   *   0   0 CCACCGAAATCTGTGCAGAGGAGAACGCAGA FFFFAFFFFFFF7<FFFFAFFFFFFF<AAAA   NH:i:8  HI:i:2  AS:i:27 nM:i:1

I looked for the SRR4293695.162674600 read into my Ribo Seq reads and I found that this read has a length of 31:

@SRR4293695.162674600
TCTGCGTTCTCCTCTGCACAGATTTCGGTGG
+
AAAA<FFFFFFFAFFFF<7FFFFFFFAFFFF

Evidently, the alignment has been made for the reverse complement of SRR4293695.162674600. Since, it seems to be a very good alignment I was thinking it would have a good quality alignment, but it's not the case. It got a Mapping quality of 0. Where −10 log10 Pr{mapping position is wrong} meaning that the 'mapping position is wrong' = 1.

So, the mapping position is considered to be wrong because the alignment is made from the reverse complement?

The second question that I have is about the CIGAR string 30M1S: Why the alignment has 30 Match and one single substitution when actually I see two differences position 22 and position 31.

>chr1:10538-10569
CCACCGAAATCTGTGCAGAGGACAACGCAGC -->  Sequence in genome
CCACCGAAATCTGTGCAGAGGAGAACGCAGA --> Alignment sequence

Why this happened? Or why one difference is counted as Match?

Thank you in advance for your answers!

RNA-Seq alignment • 7.7k views

ADD COMMENT • link updated 6.0 years ago by Istvan Albert 100k • written 6.0 years ago by MarVi ▴ 30

score 4 · Accepted Answer · 2018-04-11

4

Entering edit mode

6.0 years ago

Istvan Albert 100k

A mapping quality of zero usually indicates a multiply mapped read (one that aligns equally well in multiple locations). The alignments often indicate the other locations, but in this case, they do not. The flag 272 has more information

0x110   272 REVERSE,SECONDARY

See how this alignment is a secondary alignment.

The sequence is aligned with 30M1S with 30 matches and one soft clip. Th last base is chopped off and is not part of the 30M aligned region. This works this way since the end clip penalty is smaller than a mismatch penalty. There are specific instances where using this type of scoring is advantageous.

That being said, the 30M would also include mismatches if you actuall had them as the M stands for match or mismatch.

It is quite misleading actually, the mismatching alignments would be identified from the MD and NM tags of the alignment. But since both of these tags are optional it is very much possible to have an alignment where you can't easily tell by eye if the alignment has mismatches. Umm ... yeah ... I agree that's just plain ridiculous.

The samtools calmd command will fill in missing MD tags.

ADD COMMENT • link 6.0 years ago by Istvan Albert 100k

0

Entering edit mode

Thank you very much @Istvan Albert. This is so much clear for me now. I had noticed that there are several alignments for the same read from NH:i:8 tag, and because I greped the name of that read into the sam file to find the 8 times. But, I didn't think that this could affect the quality for the alignment since I thought the alignment as the individual one.. hum.. a very Important thing. Thanks again for this.

Said this, I would like to know your point of view. It's better to not consider this read for further analysis, since that I am not sure from which region from the genome the read really came on, or on the contrary I have to consider those regions taking into account the deep of reads in those regions?

On the other hand, I was expecting to have the 256 flag standing for secondary alignment as is said in the format specification https://samtools.github.io/hts-specs/SAMv1.pdf. But it's good to know now what 272 flag means.

Thank you very much in advance.

ADD REPLY • link 6.0 years ago by MarVi ▴ 30

1

Entering edit mode

MarVi : Use this tool to understand what the flags mean.

ADD REPLY • link 6.0 years ago by GenoMax 141k

0

Entering edit mode

This is so useful, thank you!

ADD REPLY • link 6.0 years ago by MarVi ▴ 30

0

Entering edit mode

Binary flags add up. Secondary alignment + read aligned reverse = 256 + 16 = 272