Parser For Genscan And Fgenesh
13.3 years ago
Gvj ▴ 470

Has anyone succeeded in converting the Genscan and Fgenesh output formats to GFF or GTF? I have found a few parsers on the net, but none of them works. If you have a parser, please share it.

conversion • 7.0k views
13.3 years ago

Have you tried BioPerl's Bio::Tools::Genscan and Bio::Tools::Fgenesh parsers? Both implement Bio::SeqAnalysisParserI, so you can obtain Bio::SeqFeatureI objects from them. Combined with Bio::Tools::GFF, you can write them out as GFF2 or GTF.

13.3 years ago
Malcolm.Cook ★ 1.5k

For instance, you can produce a version of GFF (GFF2 here) from fgenesh output with this script:

#!/usr/bin/env perl

# PURPOSE: parse fgenesh output into gff
# USAGE: fgenesh fish somefish.dna | fgenesh2gff > somefish.dna.fgenesh.gff

use strict;
use warnings;

use Bio::Tools::Fgenesh;    # to parse output into feature
use Bio::Tools::GFF;

# Remaining options should name files to process, but if none, process
# standard input:

@ARGV = ('-') unless @ARGV;
my $fgenesh    = Bio::Tools::Fgenesh->new(-fh => \*ARGV);
my $featureout = Bio::Tools::GFF->new(-gff_version => 2);
my $IDNUM      = 0;

# Each prediction is one gene: give it a unique ID, then write its
# sub-features (exons, poly-A site) with a Parent tag pointing back to it.
while (my $gene = $fgenesh->next_prediction()) {
  my $ID = $gene->seq_id . "_fgenesh_" . ++$IDNUM;
  $gene->add_tag_value('ID', $ID);
  foreach my $feature ($gene->features) {
    $feature->add_tag_value('Parent', $ID);
    $feature->seq_id($gene->seq_id);
    $featureout->write_feature($feature);
  }
}
$fgenesh->close();
exit 0;

... which will give you output like:

LanFP_DNA34 Fgenesh Poly_A_site 1224    1224    1.26    -   .   Parent LanFP_DNA34_fgenesh_1 ; score "1.26" 
LanFP_DNA34 Fgenesh TerminalExon    1844    2024    26.02   -   2   Parent LanFP_DNA34_fgenesh_1 ; score "26.02" 
LanFP_DNA34 Fgenesh InternalExon    2492    2622    20.19   -   0   Parent LanFP_DNA34_fgenesh_1 ; score "20.19" 
LanFP_DNA34 Fgenesh InternalExon    3243    3342    25.15   -   2   Parent LanFP_DNA34_fgenesh_1 ; score "25.15" 
LanFP_DNA34 Fgenesh InternalExon    3517    3668    18.92   -   0   Parent LanFP_DNA34_fgenesh_1 ; score "18.92" 
LanFP_DNA34 Fgenesh InternalExon    4184    4276    11.42   -   0   Parent LanFP_DNA34_fgenesh_1 ; score "11.42" 
LanFP_DNA34 Fgenesh InternalExon    4569    4694    14.86   -   0   Parent LanFP_DNA34_fgenesh_1 ; score "14.86" 
LanFP_DNA34 Fgenesh InitialExon 5384    5566    2.09    -   0   Parent LanFP_DNA34_fgenesh_1 ; score "2.09"

.... when run on input like:

FGENESH 2.4 Prediction of potential genes in Fish genomic DNA
 Time    :   Mon Jul 10 14:18:02 2006
 Seq name: LanFP_DNA34 Clipped to 31-5694 
 Length of sequence: 5663 
 Number of predicted genes 1 in +chain 0 in -chain 1
 Number of predicted exons 7 in +chain 0 in -chain 7
 Positions of predicted genes and exons: Variant   1 from   1, Score:105.654358 
   G Str   Feature   Start        End    Score           ORF           Len

   1 -      PolA      1224                1.26
   1 -    1 CDSl      1844 -      2024   26.02      1844 -      2023    180
   1 -    2 CDSi      2492 -      2622   20.19      2494 -      2622    129
   1 -    3 CDSi      3243 -      3342   25.15      3243 -      3341     99
   1 -    4 CDSi      3517 -      3668   18.92      3519 -      3668    150
   1 -    5 CDSi      4184 -      4276   11.42      4184 -      4276     93
   1 -    6 CDSi      4569 -      4694   14.86      4569 -      4694    126
   1 -    7 CDSf      5384 -      5566    2.09      5384 -      5566    183

Predicted protein(s):
>FGENESH:   1   7 exon (s)   1844  -   5566   321 aa, chain -
MIHPTKICFTALGSKCADIGTVVHRIRVLFCPLKTDSSGQWPSGWSVRLTYTYCRFDSIT
FETPPTRYTRERHKKALPGTAPHFPNKLSSRVHPRPAKIRATMPLPATHDIHLHGSINGH
EFDMVGGGKGDPNAGSLVTTAKSTKGALKFSPYLMIPHLGYGYYQYLPYPDGPSPFQTSM
LEGSGYAVYRVFDFEDGGKLTTEFKYSYEGSHIKADMKLMGSGFPDDGPVMTSQIVDQDG
CVSKKTYLNNNTIVDSFDWSYNLQNGKRYRARVSSHYIFDKPFSADLMKKQPVFVYRKCH
VKASKTEVTLDEREKAFYELA

Tweak and repeat
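
If BioPerl is not an option, the table in the input above is regular enough to parse directly. What follows is only a rough, dependency-free Python sketch and not the BioPerl implementation: the function name `fgenesh_to_gff`, the regular expressions, the `fgenesh` source label, and the Parent naming (built from the gene number in the `G` column rather than a running counter) are all my own assumptions. It leaves the frame column as `.` and keeps the raw Fgenesh feature labels (CDSf, CDSi, CDSl, CDSo, PolA, TSS) instead of renaming them.

```python
import re

# Exon rows look like:  "1 -    1 CDSl      1844 -      2024   26.02 ..."
EXON_ROW = re.compile(
    r"^\s*(\d+)\s+([+-])\s+\d+\s+(CDS[filo])\s+(\d+)\s+-\s+(\d+)\s+(-?\d+(?:\.\d+)?)"
)
# Single-position rows look like:  "1 -      PolA      1224      1.26"
SITE_ROW = re.compile(r"^\s*(\d+)\s+([+-])\s+(PolA|TSS)\s+(\d+)\s+(-?\d+(?:\.\d+)?)")
SEQ_NAME = re.compile(r"^\s*Seq name:\s*(\S+)")


def fgenesh_to_gff(lines, source="fgenesh"):
    """Return GFF2-style lines for the feature rows of one Fgenesh report."""
    seq_id = "unknown"  # placeholder until a 'Seq name:' line is seen
    out = []
    for line in lines:
        m = SEQ_NAME.match(line)
        if m:
            seq_id = m.group(1)
            continue
        m = EXON_ROW.match(line)
        if m:
            gene, strand, ftype, start, end, score = m.groups()
        else:
            m = SITE_ROW.match(line)
            if not m:
                continue  # header, protein, or blank line
            gene, strand, ftype, start, score = m.groups()
            end = start  # single-position feature
        parent = f"{seq_id}_{source}_{gene}"  # assumed naming scheme
        out.append("\t".join(
            [seq_id, source, ftype, start, end, score, strand, ".",
             f'Parent "{parent}"']
        ))
    return out


# Three rows copied from the sample input in this post.
SAMPLE = """\
 Seq name: LanFP_DNA34 Clipped to 31-5694
   G Str   Feature   Start        End    Score           ORF           Len
   1 -      PolA      1224                1.26
   1 -    1 CDSl      1844 -      2024   26.02      1844 -      2023    180
   1 -    7 CDSf      5384 -      5566    2.09      5384 -      5566    183
"""

for gff_line in fgenesh_to_gff(SAMPLE.splitlines()):
    print(gff_line)
```

Because the Parent value comes from the gene number on each row, several genes on one contig end up with distinct parents, as long as the rows are actually present in the report.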

When I try it with multiple genes in the same contig, it only gives output for the first gene. Have you run into that?

Hmm, I don't recall that being an issue at all. I do recall something related, namely that fgenesh only processes the first sequence in a multi-FASTA file, but that is not what you are experiencing. Good luck.
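
One way to narrow down where genes go missing is to compare the gene count that fgenesh declares in its header with the gene numbers that actually appear in the table. This is a hedged Python sketch for a single report, not part of BioPerl: `gene_counts` and both patterns are names I made up, and the gene 2 row in the sample is hypothetical (only the gene 1 rows come from the input above).

```python
import re

DECLARED = re.compile(r"Number of predicted genes\s+(\d+)")
# Matches exon rows ("1 -    1 CDSl ...") and site rows ("1 -      PolA ...").
ROW = re.compile(r"^\s*(\d+)\s+[+-]\s+(?:\d+\s+CDS[filo]|PolA|TSS)\s")


def gene_counts(lines):
    """Return (declared_gene_count, distinct_gene_numbers_seen_in_table)."""
    declared = None
    seen = set()
    for line in lines:
        m = DECLARED.search(line)
        if m:
            declared = int(m.group(1))
            continue
        m = ROW.match(line)
        if m:
            seen.add(int(m.group(1)))
    return declared, len(seen)


# Gene 1 rows are from the post; the gene 2 row is invented for illustration.
SAMPLE = """\
 Number of predicted genes 2 in +chain 1 in -chain 1
   1 -      PolA      1224                1.26
   1 -    1 CDSl      1844 -      2024   26.02      1844 -      2023    180
   2 +    1 CDSo      7001 -      7300   15.00      7001 -      7300    300
"""

declared, found = gene_counts(SAMPLE.splitlines())
print(f"declared={declared} found_in_table={found}")
```

If the two numbers agree on the raw report but the GFF contains only one gene, the loss is happening in the conversion step rather than in fgenesh itself.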

Is it possible to parse the CDS and the protein sequence with this module?
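
For the protein part at least, the "Predicted protein(s):" block in the sample input above can be read without any module. Below is a minimal Python sketch under my own assumptions: the function name `predicted_proteins` is made up, it expects only protein entries (a `>` header followed by purely alphabetic sequence lines), and it stops at the first other non-blank line. The sample report prints only the protein, so this sketch does not recover the CDS nucleotides.

```python
def predicted_proteins(lines):
    """Return a list of (header, sequence) pairs from one Fgenesh report."""
    proteins = []
    header, chunks = None, []
    in_block = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("Predicted protein"):
            in_block = True
            continue
        if not in_block or not line:
            continue
        if line.startswith(">"):
            if header is not None:
                proteins.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None and line.isalpha():
            chunks.append(line)
        else:
            break  # anything else is assumed to end the protein block
    if header is not None:
        proteins.append((header, "".join(chunks)))
    return proteins


# Shortened copy of the protein block from the post (first and last lines).
SAMPLE = """\
Predicted protein(s):
>FGENESH:   1   7 exon (s)   1844  -   5566   321 aa, chain -
MIHPTKICFTALGSKCADIGTVVHRIRVLFCPLKTDSSGQWPSGWSVRLTYTYCRFDSIT
VKASKTEVTLDEREKAFYELA
"""

for header, seq in predicted_proteins(SAMPLE.splitlines()):
    print(header)
    print(seq)
```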

13.3 years ago

DAWGPAWS does what you want. Besides supporting gene prediction programs, it also has parsers for transposable element predictions. Most of the annotation files it generates are in GFF format.

If I run it on fgenesh output without the FASTA sequence in it, the program uses the first contig as the seq_id for all genes. With the FASTA sequence in the fgenesh output, it throws this error:

------------- EXCEPTION -------------
MSG: Attempting to set the sequence to BLABLA which does not look healthy.
STACK Bio::PrimarySeq::seq /perl/5.8.8/Bio/PrimarySeq.pm:283
STACK Bio::Tools::Fgenesh::next_prediction /perl/5.8.8/Bio/Tools/Fgenesh.pm:247
STACK DAWGPAWS::fgenesh2gff /opt/dawgpaws-1.1/scripts/cnv_fgenesh2gff.pl:286
STACK toplevel /opt/dawgpaws-1.1/scripts/cnv_fgenesh2gff.pl:216

Any ideas?
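
A possible workaround for the shared seq_id, untested against DAWGPAWS or BioPerl, is to split a concatenated fgenesh output into one chunk per report before converting, so that each chunk carries its own "Seq name:" line. The Python sketch below assumes every report starts with a banner line like "FGENESH 2.4 Prediction of ..." as in the sample input earlier in this thread. `split_records` and `record_seq_id` are hypothetical helper names, and `contig_B` is an invented second contig.

```python
import re

# Assumed start-of-report banner, e.g. "FGENESH 2.4 Prediction of potential genes ..."
BANNER = re.compile(r"^\s*FGENESH\s+\S+\s+Prediction")
SEQ_NAME = re.compile(r"^\s*Seq name:\s*(\S+)")


def split_records(lines):
    """Split concatenated Fgenesh output into one list of lines per report."""
    records, current = [], []
    for line in lines:
        if BANNER.match(line):
            if current:
                records.append(current)
            current = [line]
        elif current:  # lines before the first banner are ignored
            current.append(line)
    if current:
        records.append(current)
    return records


def record_seq_id(record):
    """Return the name from the report's 'Seq name:' line, or None."""
    for line in record:
        m = SEQ_NAME.match(line)
        if m:
            return m.group(1)
    return None


# First report header is from the post; the second contig name is invented.
SAMPLE = """\
FGENESH 2.4 Prediction of potential genes in Fish genomic DNA
 Seq name: LanFP_DNA34 Clipped to 31-5694
   1 -      PolA      1224                1.26
FGENESH 2.4 Prediction of potential genes in Fish genomic DNA
 Seq name: contig_B
   1 +      PolA      9000                1.00
"""

for record in split_records(SAMPLE.splitlines()):
    print(record_seq_id(record), len(record))
```

Each chunk could then be written to its own file and passed to the converter separately. Whether that also avoids the PrimarySeq exception is something I have not checked.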
