Parsing SRA summary using Biopython Entrez
3.6 years ago
kmyers2 ▴ 80

I am trying to parse the descriptions of SRA records so I can compile them into a table and export it to a TXT file. I'm using the Biopython Entrez module for this. Here is my code:

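from Bio import Entrez

# searchTerm and outFile are defined elsewhere
handle = Entrez.esearch(db="sra", term=searchTerm, retmax=100000)
result = Entrez.read(handle)
handle.close()
sraList = list(result["IdList"])

# Open the output file once instead of re-opening it for every entry
with open(outFile, "a") as f:
    for each in sraList:
        test = Entrez.esummary(db="sra", id=each)
        record = Entrez.read(test)
        test.close()
        for entry in record:
            # f.write() needs a string; the raw ExpXml is written until it can be parsed
            f.write(entry["ExpXml"] + "\n")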

The issue arises after I run record = Entrez.read(test). Each entry in the record is a dictionary, but the value holding the experimental metadata I need is in an XML format:

for each in record:
    print(each.keys()) 
dict_keys(['Item', 'Id', 'ExpXml', 'Runs', 'ExtLinks', 'CreateDate', 'UpdateDate'])

for each in record:
    print(each["ExpXml"])

<Summary><Title>Zymomonas mobilis subsp. mobilis ZM4 ATCC 31821 - transcriptome</Title><Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform><Statistics total_runs="1" total_spots="26824861" total_bases="5364972200" total_size="2357069488" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA623091" center_name="JGI" contact_name="JGI SRA" lab_name=""/><Experiment acc="SRX3316534" ver="1" status="public" name="Zymomonas mobilis subsp. mobilis ZM4 ATCC 31821 - transcriptome"/><Study acc="SRP121239" name="Zymomonas mobilis mobilis ZM4 transcriptome - GS-26"/><Organism taxid="264203" ScientificName="Zymomonas mobilis subsp. mobilis ZM4 = ATCC 31821"/><Sample acc="SRS2622106" name=""/><Instrument ILLUMINA="Illumina HiSeq 2500"/><Library_descriptor><LIBRARY_NAME>ANSWP</LIBRARY_NAME><LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY><LIBRARY_SOURCE>TRANSCRIPTOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>other</LIBRARY_SELECTION><LIBRARY_LAYOUT> <PAIRED/> </LIBRARY_LAYOUT><LIBRARY_CONSTRUCTION_PROTOCOL>Low Input (RNA)</LIBRARY_CONSTRUCTION_PROTOCOL></Library_descriptor><Bioproject>PRJNA409960</Bioproject><Biosample>SAMN07686944</Biosample>

I have tried to parse this with xmltodict, but I get an error:

for entry in record:
    summary = entry["ExpXml"]
    parsed = xmltodict.parse(summary, xml_attribs=False)
    print(parsed)

ExpatError: junk after document element: line 1, column 304

I don't have much experience with XML files, but from what I can tell this suggests there is a problem with the XML formatting from NCBI. If that's the case, I don't have the experience to know how to fix it.

Does anyone have any suggestions on how to solve this problem?

python biopython entrez sra xml • 2.2k views
3.6 years ago

I have seen many problems with XML from NCBI. It often doesn't work well with tools that require well-formed XML.

To parse this kind of output, I would recommend using the command-line Entrez Direct tools, chained together like so:

esearch -db genome -query "22954[uid]" | \
elink -target bioproject | \
efetch -format xml | \
xtract -pattern DocumentSummary -element Salinity OxygenReq OptimumTemperature TemperatureRange Habitat

This will print:

eMesophilic     eAerobic        85      eHyperthermophilic      eAquatic

Example taken from: https://github.com/NCBI-Hackathons/EDirectCookbook

The Entrez Direct manual has many more examples: https://www.ncbi.nlm.nih.gov/books/NBK179288/
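
If you would rather stay in Python: "junk after document element" is what the parser reports when it finds more content after the first top-level element has closed. The ExpXml string is a sequence of sibling elements (Summary, Submitter, Experiment, ...) with no single root, so wrapping it in a dummy root element may be enough. Below is a minimal sketch using the standard library, not a tested solution for every SRA record. The wrapper name "root", the helper names and the choice of fields are my own assumptions, and the example string is a shortened copy of the one in the question.

```python
import xml.etree.ElementTree as ET


def _attr(root, path, name):
    """Return attribute `name` of the element at `path`, or None if it is missing."""
    node = root.find(path)
    return node.get(name) if node is not None else None


def parse_exp_xml(fragment):
    """Wrap an ExpXml fragment in a dummy root element and pull out a few fields."""
    # "root" is an arbitrary wrapper tag (an assumption), not something NCBI defines.
    root = ET.fromstring("<root>" + fragment + "</root>")
    return {
        "title": root.findtext("Summary/Title"),
        "platform": _attr(root, "Summary/Platform", "instrument_model"),
        "experiment": _attr(root, "Experiment", "acc"),
        "study": _attr(root, "Study", "acc"),
        "organism": _attr(root, "Organism", "ScientificName"),
        "strategy": root.findtext("Library_descriptor/LIBRARY_STRATEGY"),
        "bioproject": root.findtext("Bioproject"),
        "biosample": root.findtext("Biosample"),
    }


# Shortened version of the ExpXml value shown in the question
example = (
    "<Summary><Title>Zymomonas mobilis subsp. mobilis ZM4 ATCC 31821 - transcriptome</Title>"
    '<Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform></Summary>'
    '<Experiment acc="SRX3316534" ver="1" status="public"/>'
    '<Study acc="SRP121239"/>'
    '<Organism taxid="264203" ScientificName="Zymomonas mobilis subsp. mobilis ZM4 = ATCC 31821"/>'
    "<Library_descriptor><LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY></Library_descriptor>"
    "<Bioproject>PRJNA409960</Bioproject><Biosample>SAMN07686944</Biosample>"
)

row = parse_exp_xml(example)
# One tab-separated line per record, ready to write to the TXT file
print("\t".join(str(value) for value in row.values()))
```

The same wrapping should also work with xmltodict, e.g. xmltodict.parse("<root>" + summary + "</root>"), although xml_attribs=False would drop the accession attributes. If a record contains characters that are not valid XML once wrapped, this will still fail, so treat it as a starting point.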
