NGS mapper that supports IUPAC reference
8.8 years ago
Marvin ▴ 220

I am looking for a mapper that handles the whole IUPAC base set properly.

So no turning N's into A's etc. ...

If the mapper sees a Y in the reference, then it should treat C and T as matches and G and A as mismatches. And so on for the other codes.
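To make the requested behaviour concrete, here is a minimal Python sketch of the matching rule. The table is the standard IUPAC nucleotide code; the function name `ref_matches` is made up for illustration and does not come from any mapper.

```python
# IUPAC nucleotide codes and the concrete bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def ref_matches(ref_code, read_base):
    """Return True if read_base is one of the bases allowed by ref_code.

    ref_matches is a hypothetical helper name, used only for this sketch.
    """
    return read_base.upper() in IUPAC[ref_code.upper()]


print(ref_matches("Y", "C"), ref_matches("Y", "T"))  # both allowed by Y
print(ref_matches("Y", "G"), ref_matches("Y", "A"))  # neither allowed by Y
```

In this sketch an N in the read is not treated as a match to anything; a real mapper would have to pick a policy for that case.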

I have found MOSAIK, but that mapper doesn't seem like a good choice. According to the paper it can handle IUPAC bases in the reference, but the author himself says in a forum '1.1.0014 cannot handle IUPAC codes in the reference' (https://code.google.com/p/mosaik-aligner/issues/detail?id=2). Many issues and errors have been posted in that same forum without any response. I tried to email the author but got a mail delivery failure. So that project seems kind of dead to me.

I have also found the GEM mapper, which seems to map every IUPAC base to N. So it only supports N, and not even that seems to work: here is a thread where someone reports that it doesn't work, and the authors have not replied:

RNA-seq short-read mapping with wildcards (IUPAC codes) using GEM mapper

Seems like another dead mapper project.

So which mapper (from a project that isn't dead) can I use for my purpose? Again: I want to map reads against a reference that contains IUPAC codes. If the mapper sees a Y in the reference, then it should treat C and T as matches and G and A as mismatches. Same for the other IUPAC bases.

I hope you can help me, I really need such a mapper.

map iupac • 3.9k views

Did you look at the Mosaik GitHub page? Most projects are moving away from Google Code and it is set to be shut down in the near future. Besides, even on the Google Code page version 1.1.0014 is old, you should at least try the newest one before claiming it doesn't work.


My discarding of the Mosaik mapper was also motivated by this thread

The user points out that Mosaik does not map any reads when only reads with 0 mismatches should be matched (although because of the IUPAC bases in the references, there should be matches).

So, this thread + the many unsolved issues in the link of my first post + the author himself saying that 1.1.0014 cannot handle IUPAC bases (although in the PDF documentation version 1.0 it is explicitly stated that Mosaik can do it) + the mail delivery failure when trying to email the author made me think that the project is dead.

I'll give it a try then, although I feel like it's not promising (that 'Issues' tab on the Google Code page looks really frustrating: compilation errors without a solution, 'same problem here' posts, etc.).

8.8 years ago
Marvin ▴ 220

All right, now I have tested Mosaik, BBMap and BWBBLE.

BWBBLE does not soft-clip reads, which makes it impossible to detect reads that overhang the reference.

MOSAIK only accepts reads of length 100, otherwise I get a "Segmentation fault (core dumped)" error message. But I have reads of variable lengths and shorter than 100 bp (aDNA reads).

BBMap does not support the full IUPAC code.

Any other suggestions?


Why do you need support for the full IUPAC code? I have never heard of a situation where fully processing degenerate symbols would be worthwhile other than when designing primers or finding their binding locations. Granted, it's possible to marginally improve accuracy when mapping to a diploid genome by fully supporting degenerate symbols, but reads never contain them, official references never contain them, and really, they're just a small step on the path toward mapping to a graph rather than a linear reference. So I'm not really sure what the point is. Have you tried the existing solutions and found them to be inadequate, or are you just speculating that they might be?

If you really need it, though, it sounds like the easiest solution would be to post-process and clip bwbble's output reads that go off the ends of scaffolds. Alternatively, you might be able to use an amino-acid-space aligner with your own custom matrix that defines alignments between degenerate symbols as receiving full match points. You might not even need to translate the sequences.
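The clipping suggestion could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the overhang is at the right end of the scaffold, the CIGAR string contains no hard clips or padding, and the position is 1-based as in SAM. The function name `clip_right_overhang` is hypothetical, and how BWBBLE actually reports overhanging reads has not been checked here.

```python
import re

# Assumption: only these CIGAR operations occur (no H or P).
CIGAR_RE = re.compile(r"(\d+)([MIDNS=X])")
REF_OPS = set("MDN=X")    # operations that consume reference bases
QUERY_OPS = set("MIS=X")  # operations that consume read bases


def clip_right_overhang(pos, cigar, ref_len):
    """Soft-clip the part of an alignment that runs past the scaffold end.

    pos is the 1-based leftmost mapping position and ref_len the scaffold
    length. Returns a new CIGAR string; pos itself does not change.
    """
    ref_cursor = pos - 1  # 0-based index of the next reference base
    kept = []             # operations that stay inside the scaffold
    clipped = 0           # read bases that end up in the soft clip
    for count, op in CIGAR_RE.findall(cigar):
        length = int(count)
        room = ref_len - ref_cursor
        if room <= 0:
            # Already past the scaffold end: read bases become soft clip,
            # reference-only operations (D, N) are dropped.
            if op in QUERY_OPS:
                clipped += length
            continue
        if op in REF_OPS:
            inside = min(length, room)
            kept.append((inside, op))
            if op in QUERY_OPS:
                clipped += length - inside
            ref_cursor += length
        else:
            kept.append((length, op))  # I or S inside the scaffold
    if clipped and kept and kept[-1][1] == "S":
        clipped += kept.pop()[0]       # merge with an existing soft clip
    tail = f"{clipped}S" if clipped else ""
    return "".join(f"{n}{o}" for n, o in kept) + tail


print(clip_right_overhang(96, "10M", 100))
```

After clipping, tags such as NM or MD would no longer describe the alignment and would need to be recomputed or dropped.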


I agree with Brian Bushnell: if you really want help, it would be better if you described why you need this. I am afraid this may be an XY problem.

Another solution is to post-process the alignment, recalculating alignment scores according to IUPAC at positions with ambiguity codes. I am not sure this would work, and even less sure it is worth the trouble. My last suggestion, if you are working with human samples, is to try mrsFAST-Ultra, which according to its paper can take SNPs into account. From the paper it doesn't sound straightforward, but I have not actually tried it.
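As a rough idea of what such a post-processing step might look like, the Python sketch below recounts mismatches for one alignment while honouring ambiguity codes in the reference. It assumes the alignment has already been expanded into two equal-length strings with "-" for gaps; the name `iupac_mismatches` is invented for this example and is not part of any tool.

```python
# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_mismatches(ref_aln, read_aln):
    """Count mismatching columns, treating reference IUPAC codes as sets.

    Both arguments are aligned strings of equal length; gap columns ("-")
    are skipped here and would have to be scored separately.
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have the same length")
    mismatches = 0
    for ref_code, read_base in zip(ref_aln.upper(), read_aln.upper()):
        if ref_code == "-" or read_base == "-":
            continue
        # Unknown reference symbols allow nothing, so they count as mismatches.
        if read_base not in IUPAC.get(ref_code, ""):
            mismatches += 1
    return mismatches


print(iupac_mismatches("ACYRT", "ACTGT"))
print(iupac_mismatches("ACYRT", "ACGCT"))
```

This only rescores alignments the mapper already reported. Reads that were rejected because of apparent mismatches at ambiguous positions would not be recovered this way.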


I have recently come to realize that support for full IUPAC code would be greatly helpful while doing Allele Specific Expression analyses.

After calling variants from/between two populations there will be sites (nucleotides) that are fully fixed between populations. In those cases substituting a regular base with another regular base works.

But when a polymorphism exists at a site that is shared between the two populations (but not fully), it would help if an ambiguity code could be introduced. Say population #1 has an A/G polymorphism at a site; then the reference base for this population could be set to R. If population #2 has an A/T polymorphism at a site, then the reference base for that population could be set to W. During competitive alignment of RNA-seq data against the two references, we are then more likely to capture the true alignment if the F1 hybrid was A/T, where A came from population #1 and T came from population #2. It would be even more helpful if the ambiguity codes could be used in structuring the haplotypes, thus eliminating ambiguous mapping.
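The substitution step described above (A/G becomes R, A/T becomes W) could be sketched in Python like this. It is a minimal illustration that handles SNPs only; the names `iupac_code` and `apply_variants` and the input format (0-based position mapped to the alleles seen in one population) are assumptions made for the example.

```python
# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: set of alleles -> single IUPAC code.
CODE_FOR_ALLELES = {frozenset(bases): code for code, bases in IUPAC.items()}


def iupac_code(alleles):
    """Return the IUPAC code covering exactly the given alleles."""
    return CODE_FOR_ALLELES[frozenset(a.upper() for a in alleles)]


def apply_variants(reference, variants):
    """Build a population-specific reference with ambiguity codes.

    variants maps a 0-based position to the alleles observed there
    (hypothetical input format; indels are not handled).
    """
    seq = list(reference)
    for pos, alleles in variants.items():
        seq[pos] = iupac_code(alleles)
    return "".join(seq)


print(apply_variants("CCACC", {2: "AG"}))  # population #1: A/G -> R
print(apply_variants("CCACC", {2: "AT"}))  # population #2: A/T -> W
```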

Please let me know if I am not saying the right things. I always appreciate feedback. If only I were a programmer I would have done this, but it's beyond my capacity.

Thanks!

8.8 years ago
h.mon 35k

Try BWBBLE (seems to be exactly what you want) and BBMap (has a "semiperfectmode" where ambiguous bases in the reference are not counted as mismatches).


Yes, BWBBLE seems like what I need at first glance, thank you very much! I'll try that out first (then Mosaik). Although BBMap is very actively maintained, it is unfortunately not suitable for my special purpose. This is a quote from the author himself (I contacted him on SEQanswers):

Regarding IUPAC symbols - unfortunately, no. BBMap uses a 5-letter alphabet of A, C, G, T, N. U gets converted to T, and all other symbols get converted to N. This means that unlike many aligners, BBMap will not penalize an A aligning to N as much as A aligning to C; but it will consider A aligning to M as no different than A aligning to K. So: A-A is positive; A-N or A-(any degenerate symbol) is neutral; A-C, A-T, A-G are negative.
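The behaviour described in the quote can be paraphrased as a small Python sketch. This is only an illustration of the description above, not BBMap's actual code; the function names are invented, and the real scoring values are not given in the quote, so only the sign of each case is modelled.

```python
FIVE_LETTERS = set("ACGTN")


def to_five_letter(symbol):
    """Collapse any nucleotide symbol to A, C, G, T or N as the quote describes."""
    symbol = symbol.upper()
    if symbol == "U":
        return "T"
    return symbol if symbol in FIVE_LETTERS else "N"


def pair_class(ref_symbol, read_base):
    """Classify an aligned pair as 'positive', 'neutral' or 'negative'."""
    ref = to_five_letter(ref_symbol)
    read = to_five_letter(read_base)
    if ref == "N" or read == "N":
        return "neutral"   # N or any degenerate symbol on either side
    return "positive" if ref == read else "negative"


# M (A or C) and K (G or T) are both collapsed to N, so A is scored the
# same against either, which is exactly the limitation described above.
print(pair_class("M", "A"), pair_class("K", "A"))
```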


So ... unfortunately BWBBLE is useless to me because it cannot soft-clip the reads if they overhang the reference :-/
