How to calculate percent identity from sam file?
0
0
Entering edit mode
5.1 years ago
O.rka ▴ 710

I'm using pysam to calculate the percent identity from each mapped read in a samfile. I tried to get some of the more complicated CIGAR strings so I did len(CIGAR) > 4. I can't figure out how to calculate the percent identity from this. It would make sense to get the number of matches - the number of mismatches / alignment length but how can I get this from the CIGAR string?

How will indels affect the percent identity? Is there a method to do this straight from pysam ?

In [39]: import pysam
In [40]: samfile = pysam.AlignmentFile("mapped.sam")
In [41]: i = 0
    ...: for read in samfile.fetch():
    ...:     if not read.is_unmapped:
    ...:         cigar = read.cigarstring
    ...:         if len(cigar) > 4:
    ...:             print(read.cigarstring)
    ...:             print(read.get_cigar_stats())
    ...:             print(read.cigartuples)
    ...:             print(read)
    ...:             print()
    ...:             i += 1
    ...:             if i == 20:
    ...:                 break

5S55M
(array('I', [55, 0, 0, 0, 5, 0, 0, 0, 0, 0, 1]), array('I', [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]))
[(4, 5), (0, 55)]
HJGKNBGX5:1:11101:10032:11438   0   37  1411003 60  5S55M   -1  -1  55  ACGTGAATTCCTGCACACTGTTGAAATCGTGGACAGTCAGACTGCCTGCTCAAATTGAAC    array('B', [32, 32, 32, 32, 32, 36, 14, 27, 36, 14, 36, 32, 14, 36, 36, 14, 14, 36, 36, 32, 14, 32, 27, 36, 14, 36, 36, 36, 14, 36, 27, 32, 36, 36, 27, 14, 32, 27, 32, 32, 27, 36, 27, 14, 36, 36, 32, 27, 36, 36, 36, 36, 36, 36, 27, 36, 32, 36, 36, 32])    [('AS', -8), ('XN', 0), ('XM', 1), ('XO', 0), ('XG', 0), ('NM', 1), ('MD', '7C47'), ('YT', 'UU'), ('NH', 1)]

54M6S
(array('I', [54, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0]), array('I', [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]))
[(0, 54), (4, 6)]
HJGKNBGX5:1:11101:10044:3447    16  204 64576   60  54M6S   -1  -1  54  ATCGAGAGACTGTAGCCGAAAGGGAAAACAGTGGTCGAGTTGCATAACGTCTGCTCATTC    array('B', [36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 14, 36, 36, 32, 14, 36, 36, 36, 36, 36, 14, 36, 36, 36, 32, 14, 36, 36, 36, 27, 14, 36, 36, 36, 27, 36, 36, 32, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 32, 32, 32, 32, 32])    [('AS', -6), ('XN', 0), ('XM', 0), ('XO', 0), ('XG', 0), ('NM', 0), ('MD', '54'), ('YT', 'UU'), ('NH', 1)]

46M79N14M
(array('I', [60, 0, 0, 79, 0, 0, 0, 0, 0, 0, 3]), array('I', [2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]))
[(0, 46), (3, 79), (0, 14)]
HJGKNBGX5:1:11101:10029:1903    0   144 1949847 60  46M79N14M   -1  -1  60  GAATAATTCGGTTACGTTGGCCCAGTTGACATCACAAAGCTCAAATGGAGAAAAGAAGCC    array('B', [14, 21, 14, 14, 32, 14, 32, 36, 36, 14, 14, 14, 14, 14, 14, 27, 32, 36, 36, 32, 14, 14, 14, 14, 36, 14, 27, 32, 14, 14, 14, 36, 32, 14, 32, 36, 14, 14, 36, 36, 36, 36, 36, 27, 14, 14, 36, 36, 36, 14, 32, 14, 32, 36, 32, 14, 32, 36, 36, 27])    [('AS', -9), ('XN', 0), ('XM', 3), ('XO', 0), ('XG', 0), ('NM', 3), ('MD', '12C0C15A30'), ('YT', 'UU'), ('XS', '+'), ('NH', 1)]
mapping assembly sam samtools cigar • 4.2k views
ADD COMMENT
2
Entering edit mode

you cannot use such cigar strings: The M operator can be a match(=) or a mismatch(X). See also the sam attribute 'NM:i:' ( "Edit distance to the reference" )

ADD REPLY
1
Entering edit mode

The blog post The history the MD tag and the CIGAR X operator explains why the CIGAR shouldn't be used to calculate percent identity, but a TL;DR is:

First, CIGAR describes alignment, but sequence matches and mismatches are not indispensable properties of alignment.

ADD REPLY
0
Entering edit mode

I've changed the question to how to calculate from sam file. Is there a straight forward way to do this from the sam record?

ADD REPLY
0
Entering edit mode

maybe you could use a combination of the MD field and cigar string, explained here https://zombieprocess.wordpress.com/2013/05/21/calculating-percent-identity-from-sam-files/

ADD REPLY
0
Entering edit mode

How should indels be handled?

ADD REPLY
0
Entering edit mode

Just out of curiosity, why do you need this?

ADD REPLY
0
Entering edit mode

This pipeline I use relies on CLC mapper which has a percent identity parameter. It seems really useful and I want to make a public version that only uses open source tools so I can throw it on my GitHub. I want to make it so you can use any mapping program and then you can filter out the reads that don't match a certain percent identity post hoc. For example, with environmental metagenomics you might want a lower threshold but with single cell stuff maybe you want it to be higher percent identity (maybe those are poor examples).

ADD REPLY
0
Entering edit mode

That would not be very simple to do with large BAM files. It may just be easier/faster to run alignments at two (or more) thresholds and let people pick one they like. Not sure how much difference this will make on final number of alignments. If you go through that exercise post the results here.

With BBMap for example you can set it while aligning:

minid=0.76              Approximate minimum alignment identity to look for. 
                        Higher is faster and less sensitive.
ADD REPLY
0
Entering edit mode

Yea BBMap is definitely my go for 99% of my mapping jobs mostly because of this feature (and it's super fast, easy to use, and reliable). One thing about BBMap is it can't take in GTF/GFF3 files for exon mapping like HISAT2 & STAR.

ADD REPLY
0
Entering edit mode

You could post-filter your BBMap alignments using samtools view regions with bed files for the exons.

ADD REPLY
0
Entering edit mode

Nice, I didn't know you could do that! So you're suggesting for eukaryotic organisms with lots of splice sites to map directly to transcript, then filter with samtools view regions?

ADD REPLY

Login before adding your answer.

Traffic: 2026 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6