Question

Tophat2/Bowtie failing to align some examples of partial intron retention / exon loss

5.3 years ago

Travis ★ 2.8k

Hi all,

We have discovered a couple of examples where variants were predicted to affect normal splicing but Tophat2/Bowtie failed to identify the effect on splicing within the RNA-Seq alignments.

Specifically, we have two recent examples: one resulted in a 2 bp loss at the 5' end of a known exon, and in the other the 5' end of the known exon was extended by several bases through partial intron retention. In neither case was the variant present in the spliced transcript, so the variants themselves should not confuse the alignment.

In each case, the aligner produced no alignment for the affected reads: one member of the read pair is aligned, but the one covering the affected splice junction is not aligned at all. Choosing to BLAT the unaligned read in IGV reveals the new splice junction as expected.
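To list every pair that shows this pattern (one mate aligned, the other not), the SAM flag bits can be checked directly. This is only a minimal sketch: the function name `orphaned_mates` and the example records are made up for illustration, and it assumes the aligner sets the mate-unmapped bit (0x8) on the aligned mate as the SAM specification describes.

```python
# SAM flag bits, as defined in the SAM specification.
FLAG_PAIRED = 0x1
FLAG_UNMAPPED = 0x4
FLAG_MATE_UNMAPPED = 0x8


def orphaned_mates(sam_lines):
    """Return names of mapped reads whose mate is flagged as unmapped.

    Hypothetical helper: takes an iterable of SAM text lines.
    """
    names = []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        is_paired = bool(flag & FLAG_PAIRED)
        is_mapped = not (flag & FLAG_UNMAPPED)
        mate_unmapped = bool(flag & FLAG_MATE_UNMAPPED)
        if is_paired and is_mapped and mate_unmapped:
            names.append(name)
    return names


# Made-up records: flag 73 = paired + mate unmapped + first in pair,
# flag 99 = paired + proper pair + mate reverse strand + first in pair.
example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "pairA\t73\tchr1\t1000\t50\t50M\t*\t0\t0\t*\t*",
    "pairB\t99\tchr1\t2000\t50\t50M\t=\t2200\t250\t*\t*",
]
print(orphaned_mates(example))
```

On real data the lines could come from `samtools view` run on the TopHat `accepted_hits.bam`; the read names returned are then the ones worth pulling out of the unmapped reads and BLATing.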

We have successfully detected many similar examples in the past and are not sure what may be different about these cases. We subsequently ran STAR aligner and it detected the aberrant splicing without issue.

I am wondering what settings might be adjustable in order to enable Bowtie/Tophat2 to correctly identify the aberrant splicing.

Does anyone have suggestions for which parameters we might look at first? More details are available on request.

Best,

Gavin

RNA-Seq alignment • 1.9k views

ADD COMMENT • link updated 5.3 years ago by Charles Warden 8.2k • written 5.3 years ago by Travis ★ 2.8k

You probably know this already, but TopHat2 has been superseded by HISAT2 (or STAR if you prefer). Did you try these more recent tools?

ADD REPLY • link 5.3 years ago by ATpoint 81k

We tried STAR aligner and it detected both events. Problem is that Bowtie is ingrained in some of our pipelines so I was hoping to figure out whether it could be parameterized to make the calls.

ADD REPLY • link 5.3 years ago by Travis ★ 2.8k

score 0 · Answer 1 · 2019-01-03

5.3 years ago

Charles Warden 8.2k

If you have a solution with STAR, that is probably what matters most.

I would expect that you need to take some time to critically assess the results for every project, which may sometimes involve changing the alignment.

If you are trying to figure out what might be going on with TopHat, I think I have sometimes noticed something a little strange about the junctions.bed file. However, if you are quantifying the junctions from the .bam alignment (using a program like QoRTs, for example), it is possible that you might have better luck.
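To make the BAM-based check concrete, here is a minimal sketch of counting junctions straight from the alignments by reading the `N` (skipped region) operations in each CIGAR string. It is not how QoRTs does it; the function names and the example records are made up for illustration, and the input is assumed to be SAM text lines.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance the reference position


def junctions_from_alignment(chrom, pos, cigar):
    """Yield (chrom, intron_start, intron_end) for each N operation.

    pos is the 1-based leftmost position (SAM POS field); the intron
    coordinates returned are 1-based and inclusive.
    """
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            yield (chrom, ref, ref + length - 1)
        if op in REF_CONSUMING:
            ref += length


def count_junctions(sam_lines):
    """Count reads supporting each junction in an iterable of SAM lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 0x4 or cigar == "*":  # unmapped read or no alignment
            continue
        counts.update(junctions_from_alignment(chrom, pos, cigar))
    return counts


# Made-up records: two reads support one junction, a third supports an
# acceptor site shifted by 2 bp.
example = [
    "r1\t0\tchr1\t100\t50\t20M500N30M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t110\t50\t10M500N40M\t*\t0\t0\t*\t*",
    "r3\t0\tchr1\t100\t50\t20M502N30M\t*\t0\t0\t*\t*",
]
for junction, n in sorted(count_junctions(example).items()):
    print(junction, n)
```

Comparing these counts with the entries in junctions.bed would show whether a junction is missing from the alignments themselves or only from the summary file.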

ADD COMMENT • link 5.3 years ago by Charles Warden 8.2k

As mentioned to the previous user, Tophat is currently ingrained in several of our pipelines and switching up aligners completely is non-trivial, hence the desire to attempt fine-tuning the Tophat alignments or at least understand the reason behind the missed event. Critically assessing alignment results for each project is similarly non-trivial based on throughputs and resources.

The missed junction is entirely absent from the Tophat-generated BAM, as checked in IGV, so is not an artifact of the junctions file.

Thanks.

ADD REPLY • link 5.3 years ago by Travis ★ 2.8k

While I think you kind of have to try to automate the analysis before you can realize the problem (speaking as someone who developed COHCAP, with "pipeline" in the name), I have shifted away from that more recently. I am still trying to figure out exactly what to recommend for analysis support / training, but I am concerned that people can underestimate the amount of time needed for analysis (and overestimate the total number of projects that can be handled) if they don't expect a need for what I vaguely called "critical assessment."

Also, I think false negatives are something you will periodically encounter if you try to lock down all the analysis steps ahead of time. So, regardless of what you use for the "initial" analysis, I would expect you to run into something like an "unrecovered intron retention event" every so often. What I don't know is whether this is an issue for 1/3, 1/10, or 1/20 of projects, or how severe each individual false negative is. However, from the perspective of the lab, you need to question results that would affect your subsequent years of work if you happen to be among the 33%, 10%, or 5% of projects where a particular "pipeline" wasn't appropriate.

Plus, part of being able to publish higher-impact papers (improving the chances of future funding) requires doing something novel and/or questioning prior assumptions. So, that also requires an ability to change analysis strategies for each project.

I apologize for the long response that may have seemed like a bit of a tangent, but that is what I thought I could better address in terms of general forum support (and it seemed like that was an important part of your question / answer).

Getting back to the alignment question, it is hard for me to say what to do with your particular dataset. If I had to guess, STAR can be more tolerant to mutations than TopHat: do the (STAR-aligned) reads for the retained intron have good similarity to the genome reference? Is it possible that there could be an issue with the reference sequence / annotations?

Assuming it is a retained intron in a "treatment" sample (relative to a control), I assume there is a lack of coverage in the intron region in your control sample, even with the STAR alignment? Otherwise, if the divergent "intron" sequence is always there (but never aligns with TopHat), maybe the intron reference sequence is not correct (and/or reads from a different region are being aligned to that location)?

ADD REPLY • link 5.3 years ago by Charles Warden 8.2k

score 0 · Answer 2 · 2019-01-03

Working in a core facility with high throughput, reassessing the underlying aligners, etc., on a per-project basis is an impossibility. Furthermore, we have recently observed Tophat to be much better than STAR at detecting fusion transcripts in our samples, so I wouldn't replace it without extended testing, which is not currently feasible. Yes, some type of ensemble approach might make sense, as with any algorithmic task, but that is a consideration for another day.

This is a human sequence problem. As detailed in the original post, the exon loss/intron retention events are only a few bases each, and the reads match the underlying reference genome exactly because the causative mutations do not occur within the read sequence. We have generally seen Tophat detect these types of events quite regularly.

My question was specifically whether people are aware of why Tophat may show artifacts like this, or of any parameters that may enable fine-tuning of this behavior. If not, then so be it, but I am hopeful that someone out there has had some experience of parameter tweaking.
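As a starting point for that kind of tweaking, the sketch below assembles a TopHat2 command line with the options from the TopHat2 manual that govern junction discovery and read realignment. Everything here is an assumption to be tested, not a known fix: the helper name `build_tophat2_command`, the file paths, and the option values are illustrative only, and whether any of these settings recovers the two events described above is untested.

```python
import shlex


def build_tophat2_command(index, reads1, reads2, out_dir, gtf=None):
    """Return a TopHat2 command as an argument list (hypothetical helper).

    The option values are illustrative guesses, not validated settings.
    """
    cmd = ["tophat2", "-o", out_dir]
    if gtf:
        cmd += ["-G", gtf]  # supply known transcripts
    cmd += [
        # Re-align every read in all mapping steps instead of keeping an
        # earlier unspliced alignment.
        "--read-realign-edit-dist", "0",
        # Allow shorter anchors on either side of a junction.
        "--min-anchor-length", "6",
        # Allow a mismatch in the anchor region of a spliced alignment.
        "--splice-mismatches", "1",
        # Extra junction searches that are off unless requested.
        "--microexon-search",
        "--coverage-search",
        # Most sensitive Bowtie2 preset.
        "--b2-very-sensitive",
    ]
    cmd += [index, reads1, reads2]
    return cmd


# Placeholder paths; nothing is executed here.
command = build_tophat2_command(
    "genome_index", "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "tophat_out", gtf="genes.gtf",
)
print(" ".join(shlex.quote(part) for part in command))
```

Changing one option at a time on a small subset of reads around the affected junctions would show which setting, if any, makes a difference. The list can be passed to `subprocess.run` once the paths point at real files.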