Question

Tool:Oncofuse: Prediction Of Driver Gene Fusions From Ngs Data

12

Entering edit mode

10.7 years ago

mikhail.shugay 3.5k

Dear fellow bioinformaticians,

I would like to announce a new tool dedicated to post-processing of NGS-based fusion prediction data. The main feature of this tool is the ranking of gene fusions based on their probability to be 'driver' rather than 'passenger' events. Top ranked fusions are promising candidates for validation and further study.

The input format is rather straightforward: either breakpoint coordinates in fusion partner genes or raw input from TopHat-Fusion/FusionCatcher software. Alongside the tool provide some other useful information on fusion gene composition such as the list of retained domains, etc

Please find the tool itself and documentation here.

Thanks in advance for your feedback,

Best regards,

Mikhail Shugay

UPDATE 22 November 2014

There were several major fixes and improvements since the original post, so everyone is highly recommended to use the latest version of Oncofuse. Up-to-date source code and binaries bundled with necessary genomic data are available at GitHub.

UPDATE Dec 2014 - Jan 2015

Some fixes, improved reporting and filtering of spanning/encompassing reads and a filter for discordant fusion partner gene direction.

annotation fusion rna-seq ngs • 7.6k views

ADD COMMENT • link updated 9 months ago by Ram 43k • written 10.7 years ago by mikhail.shugay 3.5k

Ram · Answer 1 · 2013-09-06

0

Entering edit mode

10.6 years ago

ericliaowei • 0

Hi, Mikhail,

It was great to find out that oncofuse can be used for down-stream analysis after Tophat-fusion, it provides more information on selecting potential fusion genes.

When using "fusions.out" after Tophat, oncofuse runs.

However, I have a problem running it with "pontential fusions.txt" after tophat-fusion post filtering. I prefer "pontential fusions.txt" after running tophat-fusion-post, because it filters out a lot of false-positive fusions on this step. The program can't recognize the input. When I checked the format between "fusions.out" and "pontential fusions.txt", the difference is the added sample name at the beginning as highlighted here:

(Sample name) chr1-chr1    139410    139504    rr    1    0    0    0    51    49    1.000000    @    4 16 23 31 41     @    TGACCTCACCAGGCCCAGCTCATGCTTCTTTGCAGCCTCTCCAGGCCCAG CTCCTGCATCTTGGTGGCCCCTCCAGGCCCAGCCTCTGCCTCCCGTCAGC    @    CAGCCTCCACAGGCACAGCTCCATCGTTACAATGGCCTCTTTAGACCCAG CTCCTGCCTCCCAGCCTTCTCTCCAGGCCCTGAACTTTCTCAAGTTGACC    @    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1     @    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

Is there a way to solve this issue with Oncofuse?

Best regards,

Wei

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.6 years ago by ericliaowei • 0

1

Entering edit mode

Hello, Mike. I've been using Oncofuse on Tophat-fusion-post output and I'm excited about the results. I had a couple of questions about the criteria for judging reported fusion events and sifting out false positives, and I was wondering if you could help or point me in the right direction.

I'm finding that a really high percentage (easily 75% or more) of the reported events have one or both partners 'backwards', resulting in a head-to-head or tail-to-tail fusion transcript that's probably riddled with stop codons and often even missing a start:
```
------> <-----
<----  ------>
```
Oncofuse (and for that matter Tophat) doesn't look for or report 'backwards' reading when scoring functional domains/driver probability/etc of fusion partners. I know of one example of a head-to-head fusion being functional, but it seems like it would be extremely rare for this kind of event to even give a productive transcript. Am I wrong in that impression? Have you thought about adding something to Oncofuse to filter or flag these by comparing orientation to strand and order of the partners?
Kind of similarly, most of my list has one or both partners with an intronic breakpoint, often 100s or 1000s of BP away from the exon boundaries. Again, it seems like even if these events are real readthroughs or something, they would almost never produce a functional transcript. I've been checking them anyways, but I'm wondering if it would be more appropriate to just filter out using Oncofuse?

Thanks!

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.3 years ago by mayars169 ▴ 20

0

Entering edit mode

Hello!

Sorry for quite late reply. Those are very interesting suggestions and here is what I think on those questions:

Could you please clarify if those are genes are from the same chromosome looking head-to-hand/tail-to-tail, or this is seen reads containing fusion junction? If this is the former case that this could easily happen with a process called chromosomal inversion which is relatively common mutation. As for the latter case, I've implemented this kind of filtering, see latest pre-release here. Please let me know if it works fine for your task.
As a follower of a theory that breakpoints happen at random (e.g. see this paper) and then oncogenic ones are selected based on their driver potential, I think that one would expect lots of them in introns. And as even with best RNA-seq one gets lot of intron coverage, those intronic breakpoints could actually be true events, present in unspliced mRNA. So I don't think filtering them would be a very good idea, and if they manage to get in the final list of fusion candidates some manual verification would be needed.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.3 years ago by mikhail.shugay 3.5k

1

Entering edit mode

Hey!

You're quite right, and I should have specified that it was the latter. I tried the new version and the results look excellent - all of the ones I've checked have had both partners in 'correct' orientation. I was off a bit, the 'backwards' candidates were 2/3s rather than 3/4ths of the reported events. Thank you so much for writing this, it's a massive time-saver over checking them manually and I'm going to recommend this to some people.
I see your point. I'd wondered if prespliced mRNA contributed significantly to reads for reported fusions, so that's good to know. Rechecking the candidates with intronic breakpoints that I was filtering before, I've already found a couple with promising partners. I've noticed in some papers (like this one) they do filter to purely exonic reads - do you think this is just time-saving since they're handling so much data?

Thanks again :)

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.3 years ago by mayars169 ▴ 20

0

Entering edit mode

Great!

As for the Nat Biotech paper I agree with you. I don't its possible to validate data from 675 cell lines, so they've taken this as a precaution step, together with only selecting discordant paired alignments that map to different chromosomes, etc. Btw thanks for sharing this paper, the data is great for benchmarking tools like Oncofuse based on fusion recurrence criteria.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.3 years ago by mikhail.shugay 3.5k

0

Entering edit mode

Hello!

Finally managed to pack and upload a new version that supports Tophat-post. Please check the website mentioned above.

Regards,
Mike

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.6 years ago by mikhail.shugay 3.5k

Ram · Answer 2 · 2015-05-04

0

Entering edit mode

9.0 years ago

ethan.kaufman ▴ 380

(Note that this post originated here, and is now being moved over to the dedicated Oncofuse thread.) I've pasted below a case where Oncofuse (1.0.9b1) seems to get confused between a fusion and its reciprocal:

fusions.out    3378    EPI    5    0    chr1:16299533>chr1:16456083    ZBTB17    No    Intron    14    24541    804    0    EPHA2    Yes    Exon    2    155    891    1    1    0.005662748906153429    0.9999657760410134    -0.1151861020059215    BTB/POZ-like[Domain];BTB/POZ[Domain];Zinc finger C2H2-type/integrase DNA-binding domain[Domain];BTB/POZ fold[Domain];Zinc finger, C2H2-like[Domain];Zinc finger, C2H2[Domain]    Serine-threonine/tyrosine-protein kinase catalytic domain[Domain];Sterile alpha motif, type 1[Domain];Immunoglobulin-like fold[Domain];Protein kinase, ATP binding site[Binding_site];Sterile alpha motif/pointed domain[Domain];Ephrin receptor ligand binding domain[Domain];Ephrin receptor, transmembrane domain[Domain];Tyrosine-protein kinase, active site[Active_site];Insulin-like growth factor binding protein, N-terminal[Domain];Protein kinase-like domain[Domain];Sterile alpha motif domain[Domain];Galactose-binding domain-like[Domain];Protein kinase domain[Domain];Tyrosine-protein kinase, catalytic domain[Domain];Ephrin receptor type-A /type-B[Family];Tyrosine-protein kinase, receptor class V, conserved site[Conserved_site];Fibronectin, type III[Domain]                EPHA2;PIK3R1;SLA;EFNA1;EFNA4    0.017305156526071477    0.0    3.7249030480169165E-5    0.08791482785307511    0.0187250351050042    0.09058808633418107

In this sample output, ZBTB17 is given as the 5' partner and EPHA2 as the 3' partner, with the following breakpoint: chr1:16299533>chr1:16456083 (I have confirmed that this orientation is correct based on manual inspection of the alignment). The first problem is that the 5_Segment_ID (14) and 3_Segment_ID (2) do not seem to be aware that both genes are on the minus strand and hence the exon number should rather be 2 (ZBTB17) and 16 (EPHA2). This incorrect choice of exons seems to be reflected in the list of protein domains retained for the two fusion partners. Since the 5' breakpoint is in the 5'UTR of ZBTB17, it shouldn't retain any domains, and a quick inspection of the domain architecture of EPHA2 indicates that only possibly the Sterile-Alpha Motif should be retained given that only the last ~100 amino acids are retained in the fusion. That this record reports such a confident prediction of "driver" (>0.9999) seems to be attributable to the list of retained domains, which now seems highly questionable. This might be reasonable for the reciprocal fusion, in which EPHA2 is the 5' partner and ZBTB17 is the 3', but that should not be the case for this record. Have I misinterpreted the output?

Note that I am running Oncofuse with input_type "tophat".

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 9.0 years ago by ethan.kaufman ▴ 380

0

Entering edit mode

Should work fine in 1.0.9b, but there is some problem for 1.0.9b1. I'll investigate this issue further and let you know.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.0 years ago by mikhail.shugay 3.5k

0

Entering edit mode

Thanks for looking into this. After analyzing a few more records, this really seems to be an issue of Oncofuse not considering the strand of the gene, in particular the 5'FPG. It seems that if the 5'FPG is on the plus strand then the data will be correct; if its on the minus strand then the reported segment_ids and retained domains list seem be counting from the wrong end.

Something else I noticed that is quite odd is that column in the output for '3_DOMAINS_BROKEN' is completely empty for every record in my output - which contains over 2 million records! I'm not sure how the list of broken domains for the 3'FPG influences the algorithm but this is probably also causing problems.

I downloaded and ran 1.0.9b but the results are identical to 1.0.9b1, as far as I can tell. Please let me know if there's any data files I can give you that may help. Best, Ethan

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.0 years ago by ethan.kaufman ▴ 380

0

Entering edit mode

Sorry if I have confused you, I've actually shared wrong binaries for 1.0.9b which were identical to 1.0.9b1, the 1.0.9b version seems to behave correctly, ZBTB17 is having 0 aa retained and it is exon 16 for EPHA2. Just in case, you can get correct 1.0.9b binaries by compiling corresponding sources from https://github.com/mikessh/oncofuse/archive/1.0.9b.tar.gz with Maven (mvn clean install command).

I've fixed this bug, see details here: https://github.com/mikessh/oncofuse/issues/5, please let me know if everything works. Here are the correct results:

SAMPLE_ID    FUSION_ID    TISSUE    SPANNING_READS    ENCOMPASSING_READS    GENOMIC    5_FPG_GENE_NAME    5_IN_CDS?    5_SEGMENT_TYPE    5_SEGMENT_ID    5_COORD_IN_SEGMENT    5_FULL_AA    5_FRAME    3_FPG_GENE_NAME    3_IN_CDS?    3_SEGMENT_TYPE    3_SEGMENT_ID    3_COORD_IN_SEGMENT    3_FULL_AA    3_FRAME    FPG_FRAME_DIFFERENCE    P_VAL_CORR    DRIVER_PROB    EXPRESSION_GAIN    5_DOMAINS_RETAINED    3_DOMAINS_RETAINED    5_DOMAINS_BROKEN    3_DOMAINS_BROKEN    5_PII_RETAINED    3_PII_RETAINED    CTF    G    H    K    P    TF
example_coord.txt    3    EPI    0    0    chr1:16299533>chr1:16456083    ZBTB17    No    Intron    2    0    0    0    EPHA2    Yes    Exon    16    1    88    1    1    0.28299984949789564    0.7877501128765783    -0.1151861020059215        Sterile alpha motif, type 1[Domain];Protein kinase-like domain[Domain];Sterile alpha motif/pointed domain[Domain];Sterile alpha motif domain[Domain];Ephrin receptor type-A /type-B[Family]                EPHA2    4.097106747166429E-4    0.0    0.0    0.02083983541682123    0.0    0.03164895176258025

As for domains broken, first I'm not sure where the 2 million records come from. To get a broken domain one needs to have a finely mapped breakpoint in a specific exon. If those are 2 million records from say 10 samples, this means that there are 200 000 fusions in a sample, looks not really biologically meaningful, right? Most of those are read-through, mapping artifacts (say two homologous genes that are separated by several kb on the same chromosome), etc. Tophat output yields many of those.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.0 years ago by mikhail.shugay 3.5k

1

Entering edit mode

Hi Mikhail, many thanks for this. Yes, the first problem is now fixed- the 5'FPG appears oriented correctly and the segment_ids and list of retained domains now make sense. Cool.

There is one more problem still remaining for the 3'FPG. I'm sorry for mentioning the 2 million number (from 100+ samples)- yes, this is tophat fusion output which predicts tons of false positive fusions, so not unexpected. I was just trying to point out that despite this large number, according to Oncofuse none of the fusions lead to a broken domain in the 3' partner, (hence the column is totally empty in the output), which is highly unlikely. I think there must be a bug in the code which populates this list of broken domains for the 3'FPG and mis-attributes them as retained domains. Consider the following record:

coord.txt    2549    EPI    0    0    chr3:196613564&gt;chr3:196554066    SENP5    Yes    Exon    2    1543    503    2    PAK2    Yes    Intron    13    6628    75    0    2    0.016082029992940436    0.999875814440209    -0.10897960459536826        Protein kinase-like domain[Domain];Protein kinase domain[Domain];Serine/threonine- / dual specificity protein kinase, catalytic  domain[Domain]                PAK2;RAC1;SRC;CDC42;MYLK    0.0017086487735901768    0.0    0.0    0.06178887370595963    0.0    0.0

This was an initially promising fusion in the Oncofuse output that doesn't hold up to scrutiny, in my opinion. I looked up the reported breakpoint in PAK2 with Interpro and found that it occurs in amino acid 450, which leaves 75aa in the fusion gene (consistent with the Oncofuse output). Looking at the Interpro PAK2 entry, the position of the breakpoint should break the kinase domain. However, Oncofuse reports that the kinase domain is retained in the fusion, which seems to account for the significant P-value in this case. This is a common occurrence- I can provide several other examples.

Thanks again for your help. Cheers, Ethan

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.9 years ago by ethan.kaufman ▴ 380

1

Entering edit mode

Hi Mikhail,

thanks for the tool and help. I'm having exactly the same issue as Ethan. The 3_DOMAINS_BROKEN appears always empty in a dataset containing half million of fusion genes from hundreds of cancer samples. I found some examples manually where the 3' domain should be broken but is not reported. Is there any update on this issue?

Cheers,

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by garal ▴ 10

0

Entering edit mode

Hello!

I'm really sorry for a delayed reply, this has been fixed in 1.1.1

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by mikhail.shugay 3.5k

Ram · Answer 3 · 2015-12-22

Hello,

Since I have fusions from star-fusion,I ran it using the input of star fusion.I don't get any output but the script seems to be fine.

more oncofuse_pm575.sh.o515938.1

Dec 22, 2015 12:20:12 PM org.codehaus.groovy.runtime.m12n.MetaInfExtensionModule newModule
WARNING: Module [groovy-all] - Unable to load extension class [org.codehaus.groovy.runtime.NioGroovyMethods]
[Tue Dec 22 12:20:13 EST 2015] Loading RefSeq data, assuming hg19
[Tue Dec 22 12:20:13 EST 2015] Reading input file, assuming STARFUSION format
Ignoring line #fusion_name JunctionReads SpanningFrags Splice_type LeftGene LeftBreakpoint RightGene RightBreakpoint
Ignoring line NCOA4--RET 57 35 ONLY_REF_SPLICE NCOA4^ENSG00000138293.15 chr10:51582939:+ RET^ENSG00000165731.13 chr10:43612032:+
Ignoring line DHRS3--CCDC27 28 14 ONLY_REF_SPLICE DHRS3^ENSG00000162496.4 chr1:12638746:- CCDC27^ENSG00000162592.4 chr1:3683099:+
Ignoring line RET--NCOA4 9 6 ONLY_REF_SPLICE RET^ENSG00000165731.13 chr10:43610184:+ NCOA4^ENSG00000138293.15 chr10:51584616:+
Ignoring line RP11-680G10.1--GSE1 4 2 ONLY_REF_SPLICE RP11-680G10.1^ENSG00000261567.1 chr16:85391249:+ GSE1^ENSG00000131149.13 chr16:85682158:+
Ignoring line RP11-680G10.1--GSE1 1 2 ONLY_REF_SPLICE RP11-680G10.1^ENSG00000261567.1 chr16:85391249:+ GSE1^ENSG00000131149.13 chr16:85667520:+
[WARNING] No valid fusions in input file!
[Tue Dec 22 12:20:13 EST 2015] Mapping breakpoints to known genes (this is going to filter a lot)
[Tue Dec 22 12:20:13 EST 2015] 0 fusions mapped out of 0. Dropped fusion candidates: 0 - 5'FPG not mapped, 0 - 3'FPG not mapped, 0 - mapped to same gene, 0 - discordant FPG directions
[Tue Dec 22 12:20:13 EST 2015] Getting features for FPG parts
[Tue Dec 22 12:20:13 EST 2015] Loading expression-related data
[Tue Dec 22 12:20:14 EST 2015] Loading domain and protein interaction interface-related data
[Tue Dec 22 12:20:15 EST 2015] Loading gene ontology data (time-consuming)
[Tue Dec 22 12:20:17 EST 2015] Annotating FPG parts
[Tue Dec 22 12:20:17 EST 2015] =Stage #1: raw data
[Tue Dec 22 12:20:17 EST 2015] ==Domain & PII related
[Tue Dec 22 12:20:17 EST 2015] ==Tissue specific
[Tue Dec 22 12:20:17 EST 2015] =Stage #2: some computations
[Tue Dec 22 12:20:17 EST 2015] ==Domain & PII related (time-consuming)
[Tue Dec 22 12:20:17 EST 2015] ==Tissue specific
[Tue Dec 22 12:20:17 EST 2015] Initializing classifier
[Tue Dec 22 12:20:17 EST 2015] Classifying fusions
[Tue Dec 22 12:20:17 EST 2015] Writing output

Command:

java -Xmx1G -jar $onco_dir/Oncofuse.jar $fusionresults_dir/star-fusion.fusion_candidates.final.abridged starfusion AVG $onco_dir/out_onco.txt

Please let me know what could be the error here.

Thanks,
Ron

Ram · Answer 4 · 2016-11-16

Hi Mikhail,

I have 2 questions about ONCOFUSE:

Is the annotation file in ONCOFUSE from UCSC refseq? I use NCBI reference/annotation files for the reads alignment and breakpoint annotation, and it seems there are a few inconsistencies between the ONCOFUSE annotation and NCBI annotation. i.e., this breakpoint 'chr1:246998194' is mapped to intergenic region using ONCOFUSE annotation, but is mapped to genic region using NCBI annotation. How can I still utilize ONCOFUSE when the annotation files in my pipeline are different from ONCOFUSE's?
I use tophat to call fusions, so fusions.out is the input file for ONCOFUSE, and ONCOFUSE filters out a lot of the tophat fusions. According to the out log, there are many tophat fusions filtered out due to 'discordant FPG directions'. I wonder what this 'discordant FPG directions' means? how is the FPG direction concordance determined?

Thanks,