What cigar to use to count reads with last base mismatch in Rsamtools?
2
0
Entering edit mode
6.1 years ago
MAPK ★ 2.1k

I have been struggling with this problem for long time and would appreciate if somebody answer some questions I have below. I need to count the number of 21 bases long reads in my bam file with the last base mismatched. First, I tried the following code:

bamfiles <- "aligned.bam"
params <-
  ScanBamParam(
    which = which,
    flag = scanBamFlag(
      isUnmappedQuery = FALSE,
      isDuplicate = FALSE,
      isNotPassingQualityControls = FALSE
    ),
    simpleCigar = FALSE,
    reverseComplement = FALSE,
    what = c(
      "qname",
      "flag",
      "rname",
      "seq",
      "strand",
      "pos",
      "qwidth",
      "cigar",
      "qual",
      "mapq"
    )
  )  


res <- scanBam(bamfiles, param=params)
res

I can count the cigar like this:

sum(res$`NC_013116:0-2166`$cigar =="21M1S")

I want to know what cigar I should use to get the number of reads with 21 bases long and last read mismatched. Shouldn't I be using "21M1S" cigar? How about the the minus strand reads? Does it take both plus and minus reads into account while counting?

rsamtools • 2.2k views
ADD COMMENT
1
Entering edit mode

how is it different from your previous question ? How to use scanbamparam and CIGAR from Rsamtools to count specific reads?

ADD REPLY
0
Entering edit mode

It is not, but want to use rsamtools. I also want to know if I ak using right cigar.

ADD REPLY
0
0
Entering edit mode

So using 21M1S accounts for both + and - strands?

ADD REPLY
0
Entering edit mode

It does not, and you can't count mismatches from the CIGAR alone if the CIGAR alignment contains any 'M` operators.

ADD REPLY
0
Entering edit mode
6.1 years ago
d-cameron ★ 2.9k

If your aligner emits CIGARs containing the M operator, you cannot solve your problem from the CIGAR string alone. You need to write code to compare the actual read sequence with the reference sequence.

The SAM specs define M as alignment match (can be a sequence match or mismatch). Note that aligners also have a soft clipping penalty to prevent CIGARs like 21M1S for single base mismatches (as soft clipping is generally considered to be indicative of structural variation). If your aligner did not report using the M operator, the CIGAR strings you would be looking for would be one of: 21=1X, 21=1S, 21=1H. It's theoretically possible to have P operators added in anywhere but that would be highly unusual.

Note that the above assumes "end" is the last aligned base. When the read is aligned to the negative strand, the "last" read base is the first aligned base and you would be looking for alignments looking like 1X21=, 1S21= or 1H21=

ADD COMMENT
0
Entering edit mode

In this specific instance, you can avoid comparing all read bases by filtering to reads with NM=1 (if your aligner writes an NM tag at all) then checking the last base.

ADD REPLY
0
Entering edit mode

Thank you. So how do you separate reads with tag NM =1? What tool can I use?

ADD REPLY
0
Entering edit mode

You could tell Rsamtools to output the NM tag then filter $NM==1

From the command-line you canuse a simple samtools view | grep "NM:i:1" or use a tool that supporting explicit tag filter such as sambamba and picard tools FilterSamReads.

Edit: you'll need to check first first base if your read is aligned to the negative strand.

ADD REPLY

Login before adding your answer.

Traffic: 2514 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6