Is It True That There Is No Parser For InterProScan 5 XML?
10.1 years ago
Michael 54k

I have the output of an InterProScan 5 RC7 run in XML format, but I was unable to find a suitable parser. The BioPerl parser (http://search.cpan.org/~cjfields/BioPerl/Bio/SeqIO/interpro.pm) does not understand it; it seems to support only up to version 4. The error message is:

no element found at line 206, column 0, byte 14045 at /opt/local/lib/perl5/site_perl/5.12.3/darwin-thread-multi-2level/XML/Parser.pm line 187

Also, this seems to be a known issue: https://redmine.open-bio.org/issues/3452

I searched for a while but was unable to find:

  • a parser in a Bio* library in any language (apart from generic XML parsers, of course; please do not recommend generic XML parsing)
  • an XSLT stylesheet to convert InterProScan 5 RC7 output to, e.g., InterProScan 4 format

The schema definitions (there are two of them, one for RC1-6 and one for RC7) are here: http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5

If you are among the people responsible for this project, perhaps you could also explain:

  • why this drastic change to a completely different format was made
  • without providing a parser or conversion solution
  • or without informing the BioPerl/Python communities?

Edit: Thank you for proving me wrong; the developers have taken care of the conversion from the beginning. So we are only lacking native BioPerl/Python support.

This is how my file begins. I think this is also a mistake, because the schema is not referenced correctly: nothing points to a correct schema version.


<protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5">
<protein>
    <sequence md5="alksjfojfjkjhaiuy948iued">MCCXXX ...
10.1 years ago
hpmcwill ★ 1.2k

InterProScan 5 RC7 is rather old now and uses an obsolete version of the InterPro data, so I suggest you upgrade to the current version if possible (see https://code.google.com/p/interproscan/).

You can use the conversion mode (see https://code.google.com/p/interproscan/wiki/InterProScan5ConvertMode) to reformat InterProScan 5 XML into the other InterProScan 5 formats, as well as the old InterProScan 4 raw format (tab-delimited text), which is used by many tools that supported InterProScan 4. Alternatively, since InterProScan 5 can produce GFF3, you should be able to use modules such as Bio::Tools::GFF to parse the InterProScan 5 GFF3 output and get most of the information that is available in the InterProScan 5 XML.
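
As a rough illustration of the GFF3 route, in case you would rather not depend on BioPerl: the nine tab-separated GFF3 columns can be read with a few lines of Python. This is only a sketch. `parse_gff3` is a hypothetical helper, not part of InterProScan or any Bio* library, and it assumes the file follows the standard GFF3 layout (comment lines starting with `#`, an optional `##FASTA` section at the end, and `key=value` attributes separated by `;`). Which attribute keys InterProScan actually writes is not assumed here; check your own output.

```python
from urllib.parse import unquote


def parse_gff3(lines):
    """Yield one dict per feature line of a GFF3 stream (hypothetical helper)."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # embedded sequences follow; there are no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and other directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        seqid, source, ftype, start, end, score, strand, phase, attrs = cols
        attributes = {}
        for pair in attrs.split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                # GFF3 percent-encodes reserved characters in attribute values
                attributes[key.strip()] = unquote(value.strip())
        yield {
            "seqid": seqid,
            "source": source,
            "type": ftype,
            "start": int(start),
            "end": int(end),
            "score": score,
            "strand": strand,
            "phase": phase,
            "attributes": attributes,
        }
```

You would call it as `for feature in parse_gff3(open("results.gff3")):`, where `results.gff3` is a placeholder file name, and then filter on `feature["source"]` or `feature["type"]` as needed.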

If you have feedback, comments, suggestions or issues for InterProScan 5 please direct them to the authors as detailed in the InterProScan documentation: https://code.google.com/p/interproscan/wiki/InterProScan5Feedback. This will ensure they are aware of the issue, and can prioritise any related work based on feedback from the user community.

For what it is worth... I am aware of support for InterProScan 5 output in:

So they may be useful options to investigate, depending on what you are attempting to do.


Thanks a lot (and sorry for raging ;)
