How To Convert Compound Names Encoded In HTML To UTF-8 (Or Similar)?
10.4 years ago

I am converting data from LycoCyc 3.3 into a BridgeDb file, and LycoCyc's compounds.dat file contains names with HTML markup, such as:

<i>D</i>-glycerate

However, the BridgeDb file should contain names without HTML, e.g. in UTF-8, though I would be happy to have the italics simply removed. Is there a Java library that can convert an HTML string like the one above into a non-HTML string, for example in ASCII, or even in UTF-8 with Unicode characters for superscripted and subscripted digits?


Using XSLT?


A possibility...


Forget it, I thought you only wanted to grab the HTML pages.

10.4 years ago

What you are looking for are tools that strip HTML tags. This can be done with regular expressions, for example with sed:

sed 's/<[^>]*>//g' index.html  | more

or with various libraries, as described here:

http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python

Converting the superscript/subscript markup into superscript and subscript characters, however, will require custom HTML parsing and formatting.
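Since the question asks about Java, here is a minimal sketch of the same regular expression as the sed command, using only java.util.regex. The class name TagStripper and the method stripTags are made up for illustration. Like the sed version, it only removes tags and leaves HTML entities untouched.

```java
import java.util.regex.Pattern;

class TagStripper {
    // Same pattern as the sed command: "<", any run of characters that are not ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes everything that looks like an HTML tag; entities are not decoded.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<i>D</i>-glycerate")); // prints "D-glycerate"
    }
}
```

A regex like this is fine for simple inline markup such as the compound names in the question, but it is not a real HTML parser and will misbehave on input where ">" appears inside an attribute value.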


Actually, this does not take care of HTML entities like &alpha; (α). Other than that, I was hoping for a Java library, though I can obviously use regexes there too...

10.4 years ago

I found Jsoup, which may serve my needs, or at least some of them. The library is small and the code is simple (Groovy in this snippet):

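  import org.jsoup.Jsoup

  println "val: " + value
  // parse the fragment and keep only its text: tags are dropped, entities decoded
  value = Jsoup.parseBodyFragment(value).text()
  println "val: " + value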

This removes the HTML tags and decodes the HTML entities, but it does not handle superscripted or subscripted digits: their tags are dropped and plain digits remain. Example output:

val: trans-&Delta;<SUP>2</SUP>, cis-&Delta;<SUP>4</SUP>-decadienoyl-CoA
val: trans-Δ2, cis-Δ4-decadienoyl-CoA

I welcome other answers very much, particularly if they are smart about sub- and superscript!
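One possible way to handle the sub- and superscripts is a pre-processing step that runs before the tags are stripped. This is a rough sketch, not a tested solution for all of LycoCyc. It assumes the markup is always a plain run of digits directly inside a SUP or SUB element without attributes, and it replaces each such element with the matching Unicode superscript or subscript digits. The class name ScriptDigits and the method convertScripts are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptDigits {
    // Unicode superscript and subscript digits, indexed by digit value 0-9.
    private static final String SUPERSCRIPTS =
        "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079";
    private static final String SUBSCRIPTS =
        "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089";

    // Matches <sup>digits</sup> or <sub>digits</sub> in any letter case.
    // The backreference makes sure the closing tag has the same name as the opening tag.
    private static final Pattern SCRIPT =
        Pattern.compile("<(su[pb])>(\\d+)</\\1>", Pattern.CASE_INSENSITIVE);

    // Replaces digit-only SUP/SUB elements with Unicode digits; everything else is left as-is.
    static String convertScripts(String html) {
        Matcher m = SCRIPT.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String table = m.group(1).equalsIgnoreCase("sup") ? SUPERSCRIPTS : SUBSCRIPTS;
            StringBuilder digits = new StringBuilder();
            for (char c : m.group(2).toCharArray()) {
                digits.append(table.charAt(c - '0')); // \d matches only ASCII 0-9 here
            }
            m.appendReplacement(out, Matcher.quoteReplacement(digits.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String name = "trans-&Delta;<SUP>2</SUP>, cis-&Delta;<SUP>4</SUP>-decadienoyl-CoA";
        System.out.println(convertScripts(name));
    }
}
```

The entities and any remaining tags are left in place, so the result could then be passed through Jsoup.parseBodyFragment(...).text() as in the snippet above. Superscripts or subscripts that contain anything other than digits (letters, signs) are not converted by this sketch and would still lose their formatting when the tags are stripped.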
