Question

Accessing Geoprofiles Data Via Entrez

11.4 years ago

viraptor ▴ 10

How can I get data from the geoprofiles database parsed into a sane structure? For example, after a search I get a result with a couple of IDs. Let's say I want to download 64663643 (http://www.ncbi.nlm.nih.gov/geoprofiles/64663643). Specifically, I'd like to get the GDS summary from it.

But after doing the standard:

Bio.Entrez.read(Bio.Entrez.esummary(db='geoprofiles', id='64663643'))

I get DTD errors (missing tag ENTREZ_GENE_ID). If I try without validation, I get a lot of data without a proper structure:

{u'DocumentSummarySet': ListElement([ListElement(['3682', '2896', 'fKTC', 'zFJA', 'Thiamine supplementation effect on non-insulin-dependent diabetes model: liver', 'Rattus norvegicus', '476602p1p1p1', 'Expression profiling by array', 'count', '46103', 'Gja7', 'gap junction membrane channel protein alpha 7', '', '', 'Rattus norvegicus gap junction channel protein connexin 45 mRNA, partial cds', 'AF536559.4', '', '', '', '', '476602p1;476604p1', '9;9', '5.231620', '346.305270', '', '22500', '0', '88', '30'], attributes={u'uid': u'64663643'})], attributes={u'status': u'OK'})}

What should I do differently to get a proper parsed result?

biopython entrez • 2.8k views

updated 11.4 years ago by Chris Maloney ▴ 360 • written 11.4 years ago by viraptor ▴ 10

I don't have an answer to your question, but may I ask: if you want to retrieve data from a GDS, why don't you use the GDS ID (GDS3682)? You may also find this website and its related SQLite DB useful: http://gbnci.abcc.ncifcrf.gov/geo/index.php. Julien

11.4 years ago by Julien Textoris ▴ 430

I'm going to do that, but first I need to get the GDS id from the geoprofiles entry.

11.4 years ago by viraptor ▴ 10

score 0 · Answer 1 · 2012-12-27

Regarding the missing tag error message: if you look at the raw XML you'll see that the ENTREZ_GENE_ID element is there but empty. Apparently that is counter to what the XML file's DTD says to expect (and if so, that is an NCBI bug):

>>> from Bio import Entrez
>>> print(Entrez.esummary(db='geoprofiles', id='64663643').read())
...
<ENTREZ_GENE_ID></ENTREZ_GENE_ID>
....

What exactly are you trying to get from the XML? You might prefer to use one of the Python XML parsing libraries directly, e.g. ElementTree, which depends neither on the DTD file nor on the XML actually conforming to it.
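As a rough sketch of the ElementTree route, the snippet below parses an ESummary-style document into a dict keyed by uid, without consulting any DTD. The embedded sample is a hand-written stand-in, not the real response: the eSummaryResult root name is an assumption, and only the DocumentSummarySet/DocumentSummary nesting, the uid attribute, and the ENTREZ_GENE_ID and geneDesc tags come from this thread. The function name summaries_by_uid is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the ESummary response (NOT the real payload).
# The root element name is an assumption; the code below does not rely on it.
SAMPLE_XML = """<eSummaryResult>
  <DocumentSummarySet status="OK">
    <DocumentSummary uid="64663643">
      <ENTREZ_GENE_ID></ENTREZ_GENE_ID>
      <geneDesc>gap junction membrane channel protein alpha 7</geneDesc>
    </DocumentSummary>
  </DocumentSummarySet>
</eSummaryResult>"""


def summaries_by_uid(xml_text):
    """Map each DocumentSummary uid to a {tag: text} dict of its children."""
    root = ET.fromstring(xml_text)
    result = {}
    for doc in root.iter("DocumentSummary"):
        # Empty elements have text None; normalise them to "".
        result[doc.get("uid")] = {child.tag: (child.text or "") for child in doc}
    return result


records = summaries_by_uid(SAMPLE_XML)
print(records["64663643"]["geneDesc"])
```

With a live query, the text passed to summaries_by_uid would be whatever Entrez.esummary(...).read() returns; because fields are looked up by tag name, an empty or undeclared element does not get in the way.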

score 0 · Answer 2 · 2012-12-27

I am not familiar with Biopython, but you can see the raw XML results from ESummary here (just open in your browser): http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?retmode=xml&version=2.0&db=geoprofiles&id=64663643.

Reverse engineering a bit, it looks like Biopython is turning the <DocumentSummary> element into a "ListElement" and then giving a list of string values under that, one for each child element. I think that the geoprofiles esummary always returns every possible child element, so you can access the individual data elements by their numeric position. For example, geneDesc would be string #11 (zero-based), 'gap junction membrane channel protein alpha 7'.
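A minimal sketch of that positional lookup, using the values from the question copied into a plain Python list (Biopython's ListElement behaves like a list for indexing). The constant names are made up for illustration, and treating position 0 as the GDS number is only an inference from the GDS3682 mentioned in the comments, not something NCBI documents here.

```python
# Values copied from the unvalidated Entrez.read() output shown in the question.
values = ['3682', '2896', 'fKTC', 'zFJA',
          'Thiamine supplementation effect on non-insulin-dependent diabetes model: liver',
          'Rattus norvegicus', '476602p1p1p1', 'Expression profiling by array',
          'count', '46103', 'Gja7', 'gap junction membrane channel protein alpha 7',
          '', '', 'Rattus norvegicus gap junction channel protein connexin 45 mRNA, partial cds',
          'AF536559.4', '', '', '', '', '476602p1;476604p1', '9;9', '5.231620',
          '346.305270', '', '22500', '0', '88', '30']

GENE_DESC_INDEX = 11  # zero-based position of geneDesc, per the answer above
GDS_INDEX = 0         # ASSUMPTION: first value looks like the GDS number

gene_desc = values[GENE_DESC_INDEX]
gds_id = "GDS" + values[GDS_INDEX]
print(gene_desc)
print(gds_id)
```

This only holds as long as the service keeps returning every child element in the same order, so it is more fragile than looking fields up by tag name.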

You could also fix the DTD by adding the requisite elements that are missing. Here's a diff between the existing DTD on the NCBI site and a fixed one:

$ diff eSummary_geoprofiles.dtd eSummary_geoprofiles.fixed.dtd 
23a24
> <!ELEMENT ENTREZ_GENE_ID %T_string;>
38a40,41
> <!ELEMENT groups %T_string;>
> <!ELEMENT abscall %T_string;>
62a66
>       | ENTREZ_GENE_ID
77a82,83
>       | groups
>       | abscall
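To find which declarations a DTD lacks in the first place, one option is to compare the element names it declares against the tags that actually occur in the XML. The sketch below does this with a regular expression and ElementTree; the DTD and XML fragments are tiny hand-written stand-ins (not the real NCBI files), and undeclared_tags is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written stand-ins, NOT the real NCBI DTD or response.
DTD_TEXT = """<!ELEMENT DocumentSummary (geneDesc)*>
<!ELEMENT geneDesc %T_string;>"""

XML_TEXT = """<DocumentSummary uid="64663643">
  <ENTREZ_GENE_ID></ENTREZ_GENE_ID>
  <geneDesc>gap junction membrane channel protein alpha 7</geneDesc>
  <groups></groups>
  <abscall></abscall>
</DocumentSummary>"""


def undeclared_tags(dtd_text, xml_text):
    """Return, sorted, the XML tags that have no <!ELEMENT ...> declaration."""
    declared = set(re.findall(r"<!ELEMENT\s+([^\s>]+)", dtd_text))
    used = {element.tag for element in ET.fromstring(xml_text).iter()}
    return sorted(used - declared)


missing = undeclared_tags(DTD_TEXT, XML_TEXT)
print(missing)
```

This is a text-level check only: it does not expand parameter entities or validate content models, so it is a quick way to spot candidates for a patch like the one above rather than a substitute for a validating parser.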