How To Define Percent-Identity From A Cigar String/ A Bam/Sam File?
3
11
Entering edit mode
12.2 years ago
Michael 54k

Hi, I would like to calculate the percent-identity from a CIGAR string from a BAM/SAM file containing alignments. I want to calculate the PID only for the aligned region, ignoring clipped ends ("H","S"). I can parse the CIGAR in R and get the sums for each letter in the CIGAR, so it's just a conceptual question about the definition and whether or not it is correct: so given the CIGAR contains characters: "M" (match),"N" (skip),"D" (Deletion),"I"(Insertion), "S" (soft-clip),"H" (hard-clip), of which I ignore "S", "H", and "M" is the total sum of M, N the total of N, etc.:

e.g.: 10S 20M5I5D20M 10S

Is this a good way of defining the formula?

pid1 := 100 * M / (M+N+I)

or maybe

pid2 := 100* M / (M+N+I-D)

the example: pid1 = ~88% but then pid2 = 100% (which wouldn't make much sense).

Related: http://biostar.stackexchange.com/questions/9358/is-there-any-r-package-to-parse-cigar-element-of-sam-files/17031#17031

Thank you very much

sam identity bam cigar • 17k views
ADD COMMENT
0
Entering edit mode

Computationally, is there really a difference between skipped and deletion?

ADD REPLY
0
Entering edit mode

I thought that N was indicating a mismatch, or not?

ADD REPLY
0
Entering edit mode

None.

%identity = match-(NM)/length

S=20 M=40 I=5 D=5, length=20+40+5+5=70, match=40 ... so 57% (unless there were NM's).

Soft clipping = mismatch in most situations. Inserts & deletions are also mismatch.... in that they don't exist in the reference. Only way to get 100% is if it's ##M with no NM.

ADD REPLY
6
Entering edit mode
12.2 years ago

I would say pid1 would be correct. Computationally, I don't think there is any difference between N (skipped) and D (deletion).

Both skipped and deletion should produce something like this:

read       ACGTACGT--ACGTACGT
reference  ACGTACGTAAACGTACGT

So if you add N + D, you are adding the gap twice.

edit **

The SAM specs is a bit confusing on the matter:

M alignment match (can be a sequence match or mismatch)
I insertion to the reference
D deletion from the reference
N skipped region from the reference
S soft clipping (clipped sequences present in SEQ)
H hard clipping (clipped sequences NOT present in SEQ)
P padding (silent deletion from padded reference)
= sequence match
X sequence mismatch

M can be a match or mismatch??

Further down in the sam specs:

  • For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
  • Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ
ADD COMMENT
3
Entering edit mode

looks like M is both match and mismatch from the SAM specs. I think bowtie outputs M, H, and Ns. Looking at output from one of my runs using ABI LifeScope, I have CIGAR string of: 20H2S28M. Maybe the various mappers just don't conform to the standards?

ADD REPLY
3
Entering edit mode

The sam specification designates MD tag to inform of where the mismatches occur. Most aligners will fill this in correctly in the SAM file. Otherwise you can use calmd command of samtools to fill in this tag.

ADD REPLY
2
Entering edit mode

wait that's fishy, then there is no way of determining if it is a match or mismatch, like in 10M could mean anything? Then that would mean there is no way to calculate the identity, I can't believe it...

ADD REPLY
0
Entering edit mode

and the mismatches?

ADD REPLY
0
Entering edit mode

I guess it depends on whether you are doing a mRNA reads-to-genome alignment or genome reads-to-genome alignment? Ns are intronic in mRNA reads to genome. = and Xs are used in genome reads-to-genome?

ADD REPLY
0
Entering edit mode

Thank you DK, i will check that a bit further, but now believe I have to re-align the sequences using smith-waterman or so to get this data. If one can't differentiate a match from a mismatch (facepalm)

ADD REPLY
0
Entering edit mode

I have tried to run samtools calmd on my bam file but receive a segmentation fault. My bam file is the result of BWA-SW (long read algorithm). Will write to the samtools mailing list.

ADD REPLY
0
Entering edit mode

So I accepted this, because it led to the right path. From the CIGAR string alone it is impossible, and from a whole BAM file only when using the correct MD tags which must be set by the aligner or samtools calmd which crashes.

ADD REPLY
0
Entering edit mode

Maybe I should try to run calmd again?

ADD REPLY
5
Entering edit mode
12.2 years ago
Wen.Huang ★ 1.2k

No one seems to have mentioned it, but how about the "NM" tag ("edit distance" in SAM format specification)? I think this is the closest to "mismatches" that can be obtained in a relatively fast way by parsing bam files. One caveat, indels are always going to be a problem when defining identity and I don't know how they are counted in the "NM" tag, perhaps different programs have different ways of counting them.

ADD COMMENT
0
Entering edit mode

Actually that was what I was looking for. Unfortunately, these tags are optional, aren't they, at least my file doesn't have them, but if most tools support these, this is obviously the way to go. In any case, my guess is that Levenshtein distance makes most sense, because that handles indels.

ADD REPLY
2
Entering edit mode
12.2 years ago
Michael 54k

Thanks to the answer from DK and Istvan's comment I conclude that it is impossible to compute percent identity from the CIGAR string alone. This is because the CIGAR string does not contain enough information to capture Levenshtein-distance. Imo this is actually a misguided design of the string.

It might be feasible by obtaining proper tag annotation, but this way is blocked for me at the moment due to a gefault in samtools calmd.

Therefore, I will describe a way of doing the computation by re-aligning the sequences in the bam-file using R functions pairwiseAlignment and stringDist. I will then define the PID as the quotient of Levenshtein-distance and the length of the maximum length of query, subject.

I will put my code here as a go along.

bam = scanBam("reads.bam")
genome = read.DNAStringSet("genome.fna")
pids = numeric(0)
# calculate coordinates to cut out reference sequences
# this might not work if there are large gaps in the alignment
# on the other hand query width should capture this
start = bam[[1]]$pos - bam[[1]]$qwidth
end = bam[[1]]$pos + 2 * bam[[1]]$qwidth
start[start < 1] = 1
end[end>length(genome[[1]])] = length(genome[[1]])

for (i in 1:length(bam[[1]]$seq))  { 
  # do pairwise alignment against the matched sequence +- 1 query length
  # according to Jeremy's comment
  aln = pairwiseAlignment(bam[[1]]$seq[i], subseq(genome[[1]], start= start[i], end=end[i]), type="local")
  d = stringDist(c(as.character(aln@pattern), as.character(aln@subject)), m="l", i=F) 
  l = max(c(width(aln@pattern),(width(aln@subject) )))
  aln = NULL
  cat (i);
  pid = as.numeric((l-d)/l) 
  print(pid)
  pids = c(pids,pid)
  gc()
}

Edit: Use only portions of the sequence as suggested by Jeremy, that speeds it up quite a bit.

This will take some time as pairwiseDistance using Smith-Waterman is quite slow and is therefore only feasible with small bam files. For larger files, other alignment tools might be more appropriate. But this demonstrates how it works. (You have to use a loop here, this is intentional, because using lapply with pairwiseAlignment opens a memory-leak, process killed at >20GB mem usage)

ADD COMMENT
1
Entering edit mode

Can't you extract some reference sequence (a window) around the reported hit? That would be a lot faster than redoing s-w vs the genome.

ADD REPLY
0
Entering edit mode

Good idea, I'll try that

ADD REPLY

Login before adding your answer.

Traffic: 1474 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6