Biostar Beta. Not for public use.
How can I easily remove overlapping transcripts, keeping only longest transcript, in a GFF file.
0
Entering edit mode
14 months ago
a.rex • 190

I have recently annotated a genome using mikado. However, the resultant gff file contains loci that are overlapping but not annotated as isoforms: i.e.

chr17   Mikado_loci     ncRNA_gene      1014098 1014976 17      -       .       ID=Fly1;Name=Fly1;multiexonic=True;superlocus=Mikado_superlocus:chr17-:1014059-1015028
chr17   Mikado_loci     ncRNA   1014098 1014976 17      -       .       ID=Fly1.1;Parent=Fly1;Name=all_MSTRG.6472.4;alias=all_MSTRG.6472.4;canonical_junctions=1,2,3;canonical_number=3;canonical_proportion=1.0;p$
chr17   Mikado_loci     exon    1014098 1014296 .       -       .       ID=Fly1.1.exon1;Parent=Fly1.1
chr17   Mikado_loci     exon    1014343 1014675 .       -       .       ID=Fly1.1.exon2;Parent=Fly1.1
chr17   Mikado_loci     exon    1014720 1014871 .       -       .       ID=Fly1.1.exon3;Parent=Fly1.1
chr17   Mikado_loci     exon    1014919 1014976 .       -       .       ID=Fly1.1.exon4;Parent=Fly1.1
chr17   Mikado_loci     ncRNA_gene      1014165 1015048 17      +       .       ID=Fly2;Name=Fly2;multiexonic=True;superlocus=Mikado_superlocus:chr17+:1014165-1015048
chr17   Mikado_loci     ncRNA   1014165 1015048 17      +       .       ID=Fly2.1;Parent=Fly2;Name=aaa_MSTRG.5952.1;alias=aaa_MSTRG.5952.1;canonical_number=0;canonical_on_reverse_strand=True;canonical_proportio$
chr17   Mikado_loci     exon    1014165 1014296 .       +       .       ID=Fly2.1.exon1;Parent=Fly2.1
chr17   Mikado_loci     exon    1014343 1015048 .       +       .       ID=Fly2.1.exon2;Parent=Fly2.1

You see that they are both nested, but annotated apart. How can I resolve this, keeping only the longest transcript? If I keep both, this will play with my protein predictions - giving me a result of two proteins instead of 1.

gff parse • 592 views
ADD COMMENTlink
0
Entering edit mode

Expected output would help better understand the issue

ADD REPLYlink
2
Entering edit mode
15 months ago
Malcolm.Cook ♦ 1.0k
kansas, usa

For starters, ncRNA probably stands for non-coding RNA so you should expect zero proteins from that sample of GFF

But even if there were coding... what would be your problem? Presumably you chose mikado for its predictions.... why do you want to throw any out? (FWIW: I don't know mikado other than as a comic opera)

Sigh, assuming you have your reason... (and you won't be alone)

If you program R, you could:

  • use the GenomicFeatures library, load your GFF into a transcript database, txdb
  • query txdb for all transcripts
  • use findOverlaps to build a table representing the relationship of overlapping
  • compute the Transitive closure of this relationship (table) using an implementation of Floyd–Warshall algorithm - I think there is one in igraph and RGBL librarys. This will give you the overlapping groups of transcripts.
  • Take the longest in each group.

I would advise you to plan ahead a little further before taking this route. What is your end game? Why exactly do you want to find the "longest".

ADD COMMENTlink
0
Entering edit mode

The example I gave above is for ncRNA, but that same is observed for genes. My problem is that as they are annotated as separate loci, extracting sequences and running Transdecoder to find proteins will result in 2 potential genes/proteins instead of 1.

ADD REPLYlink
1
Entering edit mode

Well, perhaps you want to include a real example of your problematic case.

In any case, why would you want to discard one of the predictions when from your two predictions of overlapping genes on opposite strands?

I just first time now looked at Mikado which I now understand to be a highly configurable tool to select "best transcripts" from among many alternate versions produced by sequencing/prediction/assembly etc. By my reading, if mikado outputs overlapping predictions on opposite strands, then, that is its considered opinion, based on your configuration, of a reasonable interpretation of the input data.

But wait... looking more closely at your results, I think we are ignoring something that mikado has already highlighted. Look at the annotation in column 9 : "canonical_on_reverse_strand". It is already telling you that this transcript is "suspicious". I'm guessing you are not using mikado to its fullest extent, or are sharing one of its intermediate results with us, but that in fact there is a way to get mikado to remove that from what you take further in your analysis, or at least for you to filter further yourself based on that annotation.

If you provide true example of the problem from a coding gene we might be able to help or comment further. As it stands, I think you have a problem of not understanding your tool fully, and/or presenting misleading examples of your problem.

ADD REPLYlink
0
Entering edit mode

If there is a non R implementation that you are aware of, that would be helpful. With GenomicFeatures, the strandedness seems to be important, and fails if there is a '.' as opposed to a +/-

ADD REPLYlink
1
Entering edit mode

I have Perl approach (two scripts to run) if you are interested in. But you need to have bioperl and few Perl modules.

ADD REPLYlink
0
Entering edit mode

That would be very useful, thanks

ADD REPLYlink
1
Entering edit mode

You need to git clone this repository: GAAS And install the prerequisites.

Then you launch the script gff3_sp_keep_longest_isoform.pl.

Actually only this script should do what you want.

--EDIT-- This will work only if you modify the line 89 of the following file: GAAS/annotation/BILS/Handler/GXFhandler.pm

my $overlapCheck = 0; # option to activate the check for overlapping locus

You should put 1. I forgot that I had deactivated it by default.

ADD REPLYlink
0
Entering edit mode

Thank you - I have started it, hoping the output resolves my issue. Should it work for my case where IDs are different (i.e. not isoforms) and delete the shorter sequence?

ADD REPLYlink
0
Entering edit mode

Yes. before to select the longest isoform per locus, (if you have set the $overlapCheck to 1), all the features are checked for overlaps and gathered into a same locus when there is an overlap of 2 features of the same type. By the same type I mean as example an overlap between an mRNA and a tRNA will not be considered.

ADD REPLYlink
0
Entering edit mode

even if they are on different strands?

ADD REPLYlink
0
Entering edit mode

No they have to be on the same strand too.

ADD REPLYlink
0
Entering edit mode

No they have to be on the same strand too.

ADD REPLYlink
0
Entering edit mode

GenomicFeatures is fully configurable as to treatment of strand. Please advise in what way it "fails". I have found otherwise.

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3