Question

Parsing SRA summary using Biopython Entrez

3.6 years ago

kmyers2 ▴ 80

I am trying to parse the descriptions of SRA records and compile them into a table that I can export to a TXT file. I'm using Biopython's Entrez module for this. Here is my code:

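from Bio import Entrez

# Collect the SRA UIDs that match the search term
handle = Entrez.esearch(db="sra", term=searchTerm, retmax=100000)
result = Entrez.read(handle)
handle.close()
sraList = list(result['IdList'])

# Fetch each summary and append it to the output file
with open(outFile, 'a') as f:
    for each in sraList:
        test = Entrez.esummary(db="sra", id=each)
        record = Entrez.read(test)
        test.close()
        for entry in record:
            # placeholder: writes the raw ExpXml string until it can be parsed
            f.write(entry["ExpXml"] + "\n")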

The issue arises after I run record = Entrez.read(test). The record is a list of dictionaries, but the entry with the experimental metadata I need (ExpXml) is a string of XML:

for each in record:
    print(each.keys()) 
dict_keys(['Item', 'Id', 'ExpXml', 'Runs', 'ExtLinks', 'CreateDate', 'UpdateDate'])

for each in record:
    print(each["ExpXml"])

<Summary><Title>Zymomonas mobilis subsp. mobilis ZM4 ATCC 31821 - transcriptome</Title><Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform><Statistics total_runs="1" total_spots="26824861" total_bases="5364972200" total_size="2357069488" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA623091" center_name="JGI" contact_name="JGI SRA" lab_name=""/><Experiment acc="SRX3316534" ver="1" status="public" name="Zymomonas mobilis subsp. mobilis ZM4 ATCC 31821 - transcriptome"/><Study acc="SRP121239" name="Zymomonas mobilis mobilis ZM4 transcriptome - GS-26"/><Organism taxid="264203" ScientificName="Zymomonas mobilis subsp. mobilis ZM4 = ATCC 31821"/><Sample acc="SRS2622106" name=""/><Instrument ILLUMINA="Illumina HiSeq 2500"/><Library_descriptor><LIBRARY_NAME>ANSWP</LIBRARY_NAME><LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY><LIBRARY_SOURCE>TRANSCRIPTOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>other</LIBRARY_SELECTION><LIBRARY_LAYOUT> <PAIRED/> </LIBRARY_LAYOUT><LIBRARY_CONSTRUCTION_PROTOCOL>Low Input (RNA)</LIBRARY_CONSTRUCTION_PROTOCOL></Library_descriptor><Bioproject>PRJNA409960</Bioproject><Biosample>SAMN07686944</Biosample>

I have tried to parse this with xmltodict, but I get an error:

for entry in record:
    summary = entry["ExpXml"]
    parsed = xmltodict.parse(summary, xml_attribs=False)
    print(parsed)

ExpatError: junk after document element: line 1, column 304

I don't have much experience with XML files, but from what I can tell this suggests there is a problem with the XML formatting from NCBI. If that's the case, I don't have the experience to know how to fix it.

Does anyone have any suggestions on how to solve this problem?

python biopython entrez sra xml • 2.2k views

updated 3.6 years ago by Istvan Albert 100k • written 3.6 years ago by kmyers2 ▴ 80

score 1 · Answer 1 · 2020-10-15

I have seen many problems with XML from NCBI; it often doesn't work well with tools that require well-formed XML.
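
For this particular record, the ExpatError looks like it comes from the ExpXml string holding several sibling elements (Summary, Submitter, Experiment, ...) with no single enclosing root, which is what "junk after document element" usually indicates. If you want to stay in Python, here is a minimal sketch of one workaround, assuming the missing root is the only problem: wrap the fragment in a made-up root tag and parse it with the standard library. The `<Root>` tag, the `parse_expxml` and `_attr` helpers, and the choice of fields are my own illustrative assumptions, not part of Biopython or of any NCBI schema, and the sample string is abbreviated from the question.

```python
import xml.etree.ElementTree as ET


def _attr(root, path, name):
    """Return attribute `name` of the element at `path`, or None if absent."""
    node = root.find(path)
    return node.get(name) if node is not None else None


def parse_expxml(fragment):
    """Parse an ExpXml fragment that has no single root element."""
    # Wrap the sibling elements in a hypothetical <Root> tag so the parser
    # sees one well-formed document instead of several top-level elements.
    root = ET.fromstring("<Root>" + fragment + "</Root>")
    layout = root.find("Library_descriptor/LIBRARY_LAYOUT")
    return {
        "title": root.findtext("Summary/Title"),
        "experiment": _attr(root, "Experiment", "acc"),
        "study": _attr(root, "Study", "acc"),
        "strategy": root.findtext("Library_descriptor/LIBRARY_STRATEGY"),
        # The layout is encoded as an empty child tag such as <PAIRED/>
        "layout": layout[0].tag if layout is not None and len(layout) else None,
        "bioproject": root.findtext("Bioproject"),
        "biosample": root.findtext("Biosample"),
    }


# Abbreviated copy of the ExpXml string shown in the question
sample = (
    '<Summary><Title>Zymomonas mobilis subsp. mobilis ZM4 ATCC 31821 - transcriptome</Title>'
    '<Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform></Summary>'
    '<Experiment acc="SRX3316534" ver="1" status="public"/>'
    '<Study acc="SRP121239" name="Zymomonas mobilis mobilis ZM4 transcriptome - GS-26"/>'
    '<Library_descriptor><LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY>'
    '<LIBRARY_LAYOUT> <PAIRED/> </LIBRARY_LAYOUT></Library_descriptor>'
    '<Bioproject>PRJNA409960</Bioproject><Biosample>SAMN07686944</Biosample>'
)

fields = parse_expxml(sample)
print(fields)
```

In the loop from the question you would call `parse_expxml(entry["ExpXml"])`. Fields that are missing from a record come back as None rather than raising, and a fragment that is broken in some other way (for example an unescaped `&`) will still raise a parse error.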

To parse this kind of output, I would recommend using the command-line Entrez Direct tools, chained together like so:

esearch -db genome -query "22954[uid]" | \
elink -target bioproject | \
efetch -format xml | \
xtract -pattern DocumentSummary -element Salinity OxygenReq OptimumTemperature TemperatureRange Habitat

will print:

eMesophilic     eAerobic        85      eHyperthermophilic      eAquatic

Example taken from: https://github.com/NCBI-Hackathons/EDirectCookbook

The Entrez Direct manual has many more examples: https://www.ncbi.nlm.nih.gov/books/NBK179288/
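
If you would rather do the table export from the question in Python, one option is the standard csv module with a tab delimiter. This is only a sketch, assuming each record has already been reduced to a dictionary of fields by whatever extraction you settle on; the `write_table` helper, the column names, and the in-memory buffer are hypothetical choices for illustration, and the single row reuses accessions from the question.

```python
import csv
import io


def write_table(rows, columns, handle):
    """Write dictionaries as a tab-delimited table with a header row."""
    writer = csv.DictWriter(
        handle,
        fieldnames=columns,
        delimiter="\t",
        extrasaction="ignore",  # skip keys that are not in `columns`
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        # Keys missing from a row are written as empty cells
        writer.writerow(row)


# One row built from values in the question's record
rows = [
    {"experiment": "SRX3316534", "strategy": "RNA-Seq", "bioproject": "PRJNA409960"},
]
columns = ["experiment", "strategy", "bioproject"]

buffer = io.StringIO()  # stands in for open(outFile, "w", newline="")
write_table(rows, columns, buffer)
table_text = buffer.getvalue()
print(table_text)
```

Opening the output file once and writing all rows through a single writer avoids reopening the file for every record.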