BBMAP outputs multimapped reads in ambiguous=toss
0
0
Entering edit mode
7.5 years ago

Hi,

I aligned some dna-seq paired-end data using bbmap. I want to filter out the multi-mapped reads thus I used the ambiguous=toss parameter. However when I checked my alignment file in IGV I found some reads aligning on regions that are found multiple times in the genome (hg19). Here's the command I used :

bbmap.sh path=$bbmapHost local=t ambiguous=toss in1=R1.fastq in2=R2.fastq threads=1 minid=0.90 maxindel=1 mappedonly=t overwrite=t pairedonly=t sam=1.3 out=out.sam

Is there something wrong ?

Here's an example :

BLAT on the highlighted sequence in IGV shows multiple identical regions in the human genome :
http://grch37.ensembl.org/Homo_sapiens/Tools/Blast/Results?db=core;tl=gyu9eR8dd5ffOAcp-2260351

I extracted one of the read of this region of interest and align it again using ambig=toss and also ambig=all . Here are the results for ambig=toss.

M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59    99  chr14   63006989    42  98M =   63007016    156 AATTTACGGATTCAATGCCATCCCCATCAAGCTACCAATGACTTTCTTCACAGAATTGGAAAAAACTACTTGAAAGTTCATATGGAACCAAAAAAGAG  GHAAEHFGGFCEFHHHHG3GFHGAE2GHHHBGFGHBGFHHHHHHHHFFGHDGGHH@DFHFBGHHGGEEGHGDFBGHHHHHGFGHBFGHEGCC/FC//@  NM:i:1  AM:i:42
M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59    147 chr14   63007016    44  129M    =   63006989    -156    CAAGCTACCAATGACTTTCTTCACAGAATTGGAAAAAACTACTTGAAAGTTCATATGGAACCAAAAAAGAGCCCGCATCGCCAAGTCAATCCTAAGCCAAAAGAACAAAGCTGGAGGCATCACACTACC   FHGCF11HHHHHD>GF0GG2BFHHHHHGDDFFGFFHGEGFHFFBHHGEBHGHFFFFHGGHGGEHHHHHHEE?CGE?ACGHHFHHGFFHHDHFCCGG?HFHGAF02HHFBABEAFHHHHHHFHGGA30HH   NM:i:0  AM:i:44

and here for ambig=all. You can see the results from ambig=toss are also there in addition to other region. Thus the read should be flagged as multimapped and not be reported in the ambig=toss mode.. no ?

M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59    99  chr14   63006989    42  98M =   63007016    156 AATTTACGGATTCAATGCCATCCCCATCAAGCTACCAATGACTTTCTTCACAGAATTGGAAAAAACTACTTGAAAGTTCATATGGAACCAAAAAAGAG  GHAAEHFGGFCEFHHHHG3GFHGAE2GHHHBGFGHBGFHHHHHHHHFFGHDGGHH@DFHFBGHHGGEEGHGDFBGHHHHHGFGHBFGHEGCC/FC//@  NM:i:1  AM:i:42 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59    147 chr14   63007016    44  129M    =   63006989    -156    CAAGCTACCAATGACTTTCTTCACAGAATTGGAAAAAACTACTTGAAAGTTCATATGGAACCAAAAAAGAGCCCGCATCGCCAAGTCAATCCTAAGCCAAAAGAACAAAGCTGGAGGCATCACACTACC   FHGCF11HHHHHD>GF0GG2BFHHHHHGDDFFGFFHGEGFHFFBHHGEBHGHFFFFHGGHGGEHHHHHHEE?CGE?ACGHHFHHGFFHHDHFCCGG?HFHGAF02HHFBABEAFHHHHHHFHGGA30HH   NM:i:0  AM:i:44 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59    369 chr10   7101478 40  98M chr14   63007016    0   *   *   NM:i:2  AM:i:40 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 2:N:0:59    321 chr10   7101420 43  129M    chr14   63006989    0   *   *   NM:i:1  AM:i:43 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59    369 chr10   11774921    40  98M chr14   63007016    0   *   *   NM:i:2  AM:i:40 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 2:N:0:59    321 chr10   11774863    43  129M    chr14   63006989    0   *   *   NM:i:1  AM:i:43 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59    369 chr10   15959217    40  98M chr14   63007016    0   *   *   NM:i:2  AM:i:40 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 2:N:0:59    321 chr10   15959159    43  129M    chr14   63006989    0   *   *   NM:i:1  AM:i:43 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59    369 chr10   21042081    40  98M chr14   63007016    0   *   *   NM:i:2  AM:i:40 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 2:N:0:59    321 chr10   21042023    43  129M    chr14   63006989    0   *   *   NM:i:1  AM:i:43 NH:i:5

enter image description here

edit : check the comment for an additional example

bbmap • 3.3k views
ADD COMMENT
0
Entering edit mode

Are those PCR duplicates (i.e. independent reads that are mapping in the same position)? If you used ambig=toss then one read that maps in several positions would not be included.

ADD REPLY
0
Entering edit mode

Looking at the MAPQ and NM tag, the primary alignment is slightly better than the secondary alignments - read 1 has an edit distance of 1 and read 2 has an edit distance of 0 for the primary alignment, while read 1 has an edit distance of 2 and read 2 has an edit distance of 1 for the secondary alignment. That's more obvious with the "sam=1.4" cigar strings, but the fact that the pair's combined edit distance was 2 less for the primary than secondary alignments is why it was declared unambiguous.

ADD REPLY
0
Entering edit mode

I checked a little bit more in detail. Here's an alignment for a pair of reads in ambiguous=all

M00991:78:000000000-AP8FW:1:1112:20035:11465 1:N:0:59   99  chr14   63006989    42  98M =   63006989    125 AATTTACAGATTCAATGCCATCCCCATCAAGCTACCAATGACTTTCTTCACAGAATTGGCAAAAACTACTTGAAAGTTCATATGGAACCAAAAAAGAG  CEGBGHHH3A5ADAAF5BGHFHGGGHFGFFFFHG3FCCGHHHHGFHGHHGD@BF@@4F33F3ECGFGFC3?GGHHF4?FHBFFHGFDB/F/FCG@A/0  NM:i:1  AM:i:42 NH:i:5
M00991:78:000000000-AP8FW:1:1112:20035:11465 1:N:0:59   369 chr10   7101478 40  98M chr14   63006989    0   *   *   NM:i:2  AM:i:40 NH:i:5
M00991:78:000000000-AP8FW:1:1112:20035:11465 1:N:0:59   369 chr10   11774921    40  98M chr14   63006989    0   *   *   NM:i:2  AM:i:40 NH:i:5
M00991:78:000000000-AP8FW:1:1112:20035:11465 1:N:0:59   369 chr10   15959217    40  98M chr14   63006989    0   *   *   NM:i:2  AM:i:40 NH:i:5
M00991:78:000000000-AP8FW:1:1112:20035:11465 1:N:0:59   369 chr10   21042081    40  98M chr14   63006989    0   *   *   NM:i:2  AM:i:40 NH:i:5
M00991:78:000000000-AP8FW:1:1112:20035:11465 2:N:0:59   147 chr14   63006989    44  125M    =   63006989    -125    AATTTACAGATTCAATGCCATCCCCATCAAGCTACCAATGACTTTCTTCACAGAATTGGAAAAAACTACTTGAAAGTTCATATGGAACCAAAAAAGAGCCCGCATCGCCAAGTCAATCCTAAGCC   D1GBHFF<?11EHG<00C<?CEF2@2ACDB23DF2?3F3BFHHEHF4FF?GFHHEFFDGGEHDFB?4BGEBHHHEFHFHHHHHGDEHFE/G5G3F?0>CEGEAEAEFHDBBFEFEBDBDFCFFHH   NM:i:0  AM:i:44 NH:i:5
M00991:78:000000000-AP8FW:1:1112:20035:11465 2:N:0:59   321 chr10   7101451 43  125M    chr14   63006989    0   *   *   NM:i:1  AM:i:43 NH:i:5
M00991:78:000000000-AP8FW:1:1112:20035:11465 2:N:0:59   321 chr10   11774894    43  125M    chr14   63006989    0   *   *   NM:i:1  AM:i:43 NH:i:5
M00991:78:000000000-AP8FW:1:1112:20035:11465 2:N:0:59   321 chr10   15959190    43  125M    chr14   63006989    0   *   *   NM:i:1  AM:i:43 NH:i:5
M00991:78:000000000-AP8FW:1:1112:20035:11465 2:N:0:59   321 chr10   21042054    43  125M    chr14   63006989    0   *   *   NM:i:1  AM:i:43 NH:i:5

So chr14:63006989 is the primary alignment as the AM score is the highest. To confirm I ran BBmap using ambiguous=toss and it gives me the right alignment :

M00991:78:000000000-AP8FW:1:1112:20035:11465 1:N:0:59   99  chr14   63006989    42  98M =   63006989    125 AATTTACAGATTCAATGCCATCCCCATCAAGCTACCAATGACTTTCTTCACAGAATTGGCAAAAACTACTTGAAAGTTCATATGGAACCAAAAAAGAG  CEGBGHHH3A5ADAAF5BGHFHGGGHFGFFFFHG3FCCGHHHHGFHGHHGD@BF@@4F33F3ECGFGFC3?GGHHF4?FHBFFHGFDB/F/FCG@A/0  NM:i:1  AM:i:42 NH:i:1
M00991:78:000000000-AP8FW:1:1112:20035:11465 2:N:0:59   147 chr14   63006989    44  125M    =   63006989    -125    AATTTACAGATTCAATGCCATCCCCATCAAGCTACCAATGACTTTCTTCACAGAATTGGAAAAAACTACTTGAAAGTTCATATGGAACCAAAAAAGAGCCCGC

However when I go to the location of the primary alignment (chr14:63006989) and copy the exact sequence where the reads align (chr14:63006989-63007113) : AATTTACAGATTCAATGCCATCCCCATCAAGCTACCAATGACTTTCTTCACAGAATTGGAAAAAACTACTTGAAAGTTCATATGGAACCAAAAAAGAGCCCGCATCGCCAAGTCAATCCTAAGCC

and align it on hg19 using blat I found two perfect locations :

  • chr14:63006989-63007113 : ok !
  • chr8:94422483-94422607 ??????

Here's the blat output : http://grch37.ensembl.org/Homo_sapiens/Tools/Blast/Results?db=core;tl=yRMn9fQ8AuuM68UD-2286190

The second alignment is not reported in the ambiguous=all mode ? But here this read is clearly a multimap one ..

Brian could you clarify this point ?

Thank you for your time

ADD REPLY
0
Entering edit mode

For paired reads, the mapping score also takes into account the insert size; the score is higher the closer it is to the mean insert size. If you map those reads as single-ended rather than paired, they will all be considered ambiguous. Possibly, I should reduce the effect of insert size when determining whether a mapping is ambiguous.

ADD REPLY

Login before adding your answer.

Traffic: 1510 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6