parse nxml file from pubmed
6.5 years ago
Quak ▴ 490

I have downloaded PubMed articles from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/. The files come in nxml format, and I would like to go through each paper and do some NLP on it.

I have already tried two packages:

1) xml2

library(xml2)
tt <- read_xml("BMC_Cell_Biol/PMC1079802.nxml")

2) XML

library(XML)
paper1 <- xmlParse("BMC_Cell_Biol/PMC1079802.nxml")
xml_data <- xmlToList(paper1)

but neither of them really parses the whole file well. For example, you can't get into the Introduction section! I was wondering if someone could share some scripts (preferably in R) for this; otherwise, I am planning to do it in Python...

nxml parsing pubmed • 3.5k views
6.5 years ago

Using xslt:

example:

$ curl -s "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/comm_use.A-B.xml.tar.gz" | tar xvz 3_Biotech/PMC4624140.nxml --to-command 'xsltproc --novalid transform.xsl -'
3_Biotech/PMC4624140.nxml
Microorganisms are one of the tools used to detoxify toxic compounds present in the environment. Free suspended or immobilized microbial cells can be used for this purpose. However, the immobilized microbial cells have many advantages over free suspended cells under different conditions. For instance, the immobilization of whole cells increases degradation rate owing to increased cell population density, cell wall permeability, and extracellular microbial enzymes stability are improved, cells can be easily removed from the reaction mixture, higher operational stability and storage stability, reuse of immobilized cells in continuous reactors, and allows the bioreactors to operate at flow rates different from the growth rate of the microorganisms (Bettmann and Rehm 1984; Hall and Rao 1989; Cassidy et al. 1996; Ha et al. 2009; Zheng et al. 2009). In addition, the immobilized cell systems act as a protective cover in the presence of toxic compounds and are more resistant to pH or temperature changes. However, free suspended cells have better mass transfer aspects compared to immobilized bacterial or fungal cells (Trevors et al. 1992; Zheng et al. 2009).
In the last two decades, there have been intensive researches on the use of immobilized microbial cells as biocatalysts, using numerous reactors like fed batch, semi-continuous fed batch, and continuous packed bed reactor. Each reactor type possesses its disadvantages and advantages, and the choice of a particular type of a reactor may depend on the operational conditions, and inexpensive and non-toxic support inert material for microbial cell immobilization, etc., (Zheng et al. 2009). Bacterial cells immobilized on various matrices have been used extensively for biodegradation of various toxic nitroaromatics such as trinitrotoluene (TNT) (Rho et al. 2001; Ullah et al. 2010), nitrobenzene (Zheng et al. 2009; Qi et al. 2012), 2-nitrotoluene (Mulla et al. 2013), and 3-nitrobenzoate (Mulla et al. 2012).
Pendimethalin [N-(1-ethyl propyl) 2,6-dinitro-3,4-xylidine], a common water and soil contaminant, herbicide of dinitroaniline group, is used to control weeds in various crop plants. The use of pendimethalin may adversely affect endangered species of terrestrial and semi-aquatic plants and invertebrates (Kole et al. 1994). One of the best strategies to degrade the hazardous compounds (including pendimethalin) is to use microorganisms. There are few reports on the degradation of pendimethalin by free cells of Fusarium oxysporum and Paecilomyces variotii (Singh and Kulshrestha 1991), Azotobacter chroococcum (Kole et al. 1994), Bacillus circulans (Megadi et al. 2010), and fungus Lecanicillium saksenae (Pinto et al. 2012). However, there is no report on the degradation of pendimethalin by immobilized bacterial or fungal cells. The aim of the present investigation was therefore to compare the pendimethalin degradation by freely suspended and immobilized cells of Bacillus lehensis XJU on various matrices in batch and semi-continuous degradation, and to evaluate the effect of pH, temperature, and storage stability of pendimethalin degradation rate by polyurethane foam (PUF)-immobilized bacterial cells.
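
For anyone who would rather stay in Python, the same kind of extraction can be sketched with the standard library. This is only a rough equivalent under stated assumptions, not the stylesheet used above: the helper name `section_paragraphs` and the inline `SAMPLE` document are made up for illustration, and the sketch assumes the article uses plain `<sec>`, `<title>` and `<p>` elements.

```python
import xml.etree.ElementTree as ET

# Tiny made-up stand-in for an .nxml article (assumption, not real PMC data).
SAMPLE = """<article><body>
<sec><title>Introduction</title><p>First <italic>intro</italic> paragraph.</p><p>Second paragraph.</p></sec>
<sec><title>Results</title><p>Some results.</p></sec>
</body></article>"""


def section_paragraphs(root, title):
    """Return the text of every <p> inside each <sec> whose <title> equals `title`."""
    paragraphs = []
    # ElementTree's limited XPath supports the [child='text'] predicate.
    for sec in root.findall(f".//sec[title='{title}']"):
        for p in sec.iter("p"):
            # itertext() also collects text inside inline markup such as <italic>.
            paragraphs.append("".join(p.itertext()).strip())
    return paragraphs


root = ET.fromstring(SAMPLE)
intro = section_paragraphs(root, "Introduction")
print("\n\n".join(intro))
```

For a file on disk, `ET.parse("3_Biotech/PMC4624140.nxml").getroot()` would take the place of `ET.fromstring(SAMPLE)`.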

Thanks. If I understood correctly, I created the transform.xsl file and then used it as

cat PMC4729119.nxml | xsltproc --novalid transform.xsl

but then I got:

transform.xsl:1: namespace error : xmlns:xsl: '<a href=' is not a valid URI
<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" 
                                       ^
transform.xsl:1: parser error : error parsing attribute name
<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" 
                                        ^
transform.xsl:1: parser error : attributes construct error
<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" 
                                        ^
transform.xsl:1: parser error : Couldn't find end of Start Tag stylesheet line 1
<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" 
                                        ^
transform.xsl:1: parser error : Extra content at the end of the document
<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" 
                                        ^

Ah, Biostars mangled my code; I'll replace it with a gist...


Thanks. Just to update you: it now runs, but it doesn't catch anything... Maybe I should play with <xsl:apply-templates select="//sec[title/text() = 'Introduction']"/>. I wish there were a way to parse the file into a structured format and then work with the different segments.
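
One way to get such a structured form is sketched below in Python, using only the standard library. It is a minimal sketch under assumptions: the names `text_of` and `article_to_dict` and the inline `SAMPLE` are hypothetical, and it assumes the article keeps its summary in `<abstract>` and its main sections as `<sec>` children of `<body>`.

```python
import xml.etree.ElementTree as ET

# Made-up miniature article used only to exercise the sketch (assumption).
SAMPLE = """<article>
<front><article-meta><abstract><p>Short summary.</p></abstract></article-meta></front>
<body>
<sec><title>Background</title><p>Why it matters.</p></sec>
<sec><title>Methods</title><sec><title>Cells</title><p>Cell culture.</p></sec><p>Statistics.</p></sec>
</body></article>"""


def text_of(elem):
    """Concatenate all text under `elem` and collapse runs of whitespace."""
    return " ".join("".join(elem.itertext()).split())


def article_to_dict(root):
    """Map each top-level section title to the list of its paragraph texts."""
    doc = {"abstract": "", "sections": {}}
    abstract = root.find(".//abstract")
    if abstract is not None:
        doc["abstract"] = text_of(abstract)
    body = root.find(".//body")
    if body is not None:
        for sec in body.findall("sec"):  # direct children only: top-level sections
            title = sec.find("title")
            name = text_of(title) if title is not None else "(untitled)"
            # iter() walks nested subsections too, in document order.
            doc["sections"][name] = [text_of(p) for p in sec.iter("p")]
    return doc


doc = article_to_dict(ET.fromstring(SAMPLE))
for name, paragraphs in doc["sections"].items():
    print(name, len(paragraphs))
```

Sections that share a title would overwrite each other in this dictionary, so a list of (title, paragraphs) pairs may be the safer shape for real articles.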


It worked with PMC4624140; I was just searching for

<sec>
<title>Introduction</title>

Maybe PMC4729119 has a different structure; the section could be titled Abstract or Background instead.
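
A quick way to check what a given file actually contains is to list its top-level section titles before deciding what to match. The Python sketch below does that and then falls back across a few candidate titles; the function names, the candidate list and the inline `SAMPLE` are illustrative assumptions, not something taken from PMC4729119.

```python
import xml.etree.ElementTree as ET

# Made-up article whose first section is titled "Background" (assumption).
SAMPLE = """<article><body>
<sec><title>Background</title><p>Context.</p></sec>
<sec sec-type="materials|methods"><title>Methods</title><p>Protocol.</p></sec>
</body></article>"""


def top_level_sections(root):
    """Return (sec-type, title) pairs for each <sec> directly under <body>."""
    found = []
    for sec in root.findall(".//body/sec"):
        title = sec.find("title")
        label = "".join(title.itertext()).strip() if title is not None else None
        found.append((sec.get("sec-type"), label))
    return found


def find_intro(root, names=("introduction", "background")):
    """Return the first top-level <sec> whose title matches one of `names`, ignoring case."""
    for sec in root.findall(".//body/sec"):
        title = sec.find("title")
        if title is None:
            continue
        if "".join(title.itertext()).strip().lower() in names:
            return sec
    return None


root = ET.fromstring(SAMPLE)
print(top_level_sections(root))
intro = find_intro(root)
print(intro.find("title").text if intro is not None else "no match")
```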


Thanks so much, I can see that. Could you also point me to where or how I can learn to tweak the pattern matching? For example, I would like to capture the methods section, which in the original XML looks like <sec sec-type="materials|methods"><title>Methods</title><sec>


Search for an XPath tutorial.

It could be something like:

//sec[title/text() = 'materials' or title/text() = 'methods' ]
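
One thing to watch when tweaking such expressions: XPath string comparison is case-sensitive, so a title of "Methods" does not equal 'methods', and selecting on the sec-type attribute is an alternative. The Python sketch below illustrates both on a made-up `SAMPLE` document (an assumption, not a real article). The standard-library ElementTree only implements a subset of XPath and has no `or` inside predicates, so each test is run separately here; a full XPath 1.0 engine such as lxml's would be needed to combine them in one expression.

```python
import xml.etree.ElementTree as ET

# Made-up fragment mirroring the markup quoted in the question (assumption).
SAMPLE = """<article><body>
<sec><title>Background</title><p>Context.</p></sec>
<sec sec-type="materials|methods"><title>Methods</title>
  <sec><title>Cell culture</title><p>Cells were grown.</p></sec>
</sec>
</body></article>"""

root = ET.fromstring(SAMPLE)

# Match on the attribute value exactly as it appears in the file.
by_attr = root.findall(".//sec[@sec-type='materials|methods']")

# Match on the title text; the comparison is case-sensitive.
by_title = root.findall(".//sec[title='Methods']")
lowercase = root.findall(".//sec[title='methods']")

print(len(by_attr), len(by_title), len(lowercase))
```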
6.5 years ago
shoujun.gu ▴ 380

Have you tried xtract?

https://dataguide.nlm.nih.gov/edirect/xtract.html
