How do we find the number of sequences with terminal mismatches?
1
0
Entering edit mode
6.2 years ago
MAPK ★ 2.1k

I have created a bowtie aligned .sam file with this command: bowtie -S -p 18 /media/owner/ref.fa test1.fastq test1.sam

which I then converted to .bam file like this: samtools view -bS -o test1.bam test1.sam

From this test1.bam, I was able to get the total number of plus and minus strand of reads aligned to the reference loci using bedtools coverage maping (for positives reads and negative reads) like this: bedtools coverage -a my_region.bed -b test1.bam -bed -d -s | gzip > test1_region_pos.tsv.gz

Since I used default bowtie option which allows for three mismatches, I want to know how many reads are aligned with terminal mismatches (i.e,last three, last two and last one base mismatched at 3' region of the aligned reads at their loci position). How can I get this information? Thank for your help in advance.

bowtie bam • 2.1k views
ADD COMMENT
0
Entering edit mode
6.1 years ago

You should be able to obtain that information by parsing the end of the MD tag for three, two, or one 'GATC' characters. The alternative would be to parse the end of the CIGAR string for X characters (mismatch), but I believe Bowtie predates that feature of the SAM standard.

ADD COMMENT
0
Entering edit mode

Could you please elaborate a bit? Do you mean MD tag within samtools? Thanks for your help.

ADD REPLY
1
Entering edit mode

The MD tag is one of the predefined standards in SAM/BAM file specification, used to represent mismatches. More info can be found here.

EDIT for more information: The MD tag contains the REFERENCE nucleotide at a position if there is a mismatch, so any read that contains a terminal mismatch with contain G/A/T/C in the last position of this tag.

ADD REPLY
0
Entering edit mode

I need positional mismatch. For example, I need total number of reads of length 18 bases where with total number of mismatch at 18th position. Same for 19bases long sequences position and so on.

ADD REPLY
0
Entering edit mode

For that you could check the CIGAR flag ans check for reads with the last base as mismatch. For 18 bp reads it should be 17M1S

ADD REPLY
0
Entering edit mode

I believe 'S' represents soft clipping in the CIGAR. 'X' represents mismatch, but was not used in earlier versions of SAM.

ADD REPLY
0
Entering edit mode

As it's the last bp I guess it will be associated to soft clipping

ADD REPLY
0
Entering edit mode

Bowtie didn't support clipping, only end-to-end alignment (caveat: relying on memory of old software I haven't used in a long time).

ADD REPLY

Login before adding your answer.

Traffic: 2239 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6