Avoiding empty columns when using EDirect's xtract
9.2 years ago
glwinsor • 0

Hi,

I'm evaluating NCBI's EDirect command-line tool for the first time (as an alternative to using E-utilities and parsing the XML). It looks pretty good so far, but I'm having trouble with cases where I extract individual fields from a docsum and one of the fields is empty. For example, when I run a very simple query of the bioproject database for all "Pseudomonas aeruginosa PAO1" records:

esearch -db bioproject -query "Pseudomonas aeruginosa PAO1" | efetch -format docsum  | xtract -pattern DocumentSummary -element  Organism_Strain Project_Acc

some of the returned records have no Organism_Strain value, so the Project_Acc value gets shifted into the first column:

MPAO1    PRJNA273663
PRJEB8227
PRJNA268347
PAO1    PRJNA266474
PRJNA265367
PA01    PRJNA264943
PAO1    PRJNA258237
PRJNA256959
PRJNA252740
PRJNA252560

I would like xtract to return an empty string, or even "-", when an Organism_Strain is missing, so that every line has two columns instead of a mix of one-column and two-column lines:

PA01    PRJNA264943
PAO1    PRJNA258237
-    PRJNA256959
-    PRJNA252740
-    PRJNA252560

I've looked over the official NCBI documentation and tried to assign each field to an initialized variable doing different combinations of the following:

esearch -db bioproject -query "Pseudomonas aeruginosa PAO1" | efetch -format docsum | xtract -pattern DocumentSummary -element "&STRAIN" Organism_Strain  -STRAIN "(-)" -element Bioproject_acc

but I can't seem to find the correct syntax to make things work the way I want. Does anyone have any experience with this kind of problem?

Thanks

9.2 years ago

Use an XSLT stylesheet (saved here as stylesheet.xsl) and apply it to the efetch XML with xsltproc:

curl -s 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=bioproject&id=273663,268347,266474,265367,264943,258237,256959,253371,252560,246577,246187,243443,243442,243441,243440,243439,243438,238552,234392,231042,229797,229796,228966,227545,225027,225026,222984,213568,213564,212862,212861,210078,209744,209743,206372,204979,203313,202063,201067,201024,197381,197270,186945,178535,169508,163909,153781,153301,152899,152815,152481,151585,150221,149213,149177,149097,148359,146229,144393,143701,141069,140783,140437,139731,139627,139393,134617,128339,127841,126779,125979,119669,117261,111043,109535,108523,108169,106267,103929,102875,102387,101467,100923,100723,99427,98977,98025,97617,96113,95321,95131,94055,93093,92577,91591,83447,66135,57945,331' | xsltproc stylesheet.xsl -
MPAO1    PRJNA273663
-    PRJNA268347
PAO1    PRJNA266474
-    PRJNA265367
PA01    PRJNA264943
PAO1    PRJNA258237
-    PRJNA256959
-    PRJNA253371
-    PRJNA252560
PA01    PRJNA246577
-    PRJNA246187
PAO1    PRJNA243443
PAO1    PRJNA243442
PAO1    PRJNA243441
PAO1    PRJNA243440
PAO1    PRJNA243439
PAO1    PRJNA243438
-    PRJNA238552
-    PRJNA234392
PAO1    PRJNA231042
PA01    PRJNA229797
PA01    PRJNA229796
-    PRJNA228966
PAO1-GFP    PRJNA227545
PAO1-VE13    PRJNA225027
PAO1-VE2    PRJNA225026
-    PRJNA222984
PAO1    PRJNA213568
PA01    PRJNA213564
PAO1-VE13    PRJNA212862
PAO1-VE2    PRJNA212861
-    PRJNA210078
PAO1    PRJNA209744
PAO1-CipR    PRJNA209743
PAK    PRJNA206372
-    PRJNA204979
-    PRJNA203313
PAO1-CipR    PRJNA202063
-    PRJNA201067
PAO1    PRJNA201024
PAO1    PRJNA197381
-    PRJNA197270
-    PRJNA186945
-    PRJNA178535
PAO1    PRJNA169508
-    PRJNA163909
-    PRJNA153781
-    PRJNA153301
-    PRJNA152899
-    PRJNA152815
-    PRJNA152481
-    PRJNA151585
-    PRJNA150221
-    PRJNA149213
-    PRJNA149177
-    PRJNA149097
-    PRJNA148359
-    PRJNA146229
-    PRJNA144393
-    PRJNA143701
-    PRJNA141069
-    PRJNA140783
-    PRJNA140437
-    PRJNA139731
-    PRJNA139627
-    PRJNA139393
-    PRJNA134617
-    PRJNA128339
-    PRJNA127841
-    PRJNA126779
-    PRJNA125979
-    PRJNA119669
-    PRJNA117261
-    PRJNA111043
-    PRJNA109535
-    PRJNA108523
-    PRJNA108169
-    PRJNA106267
-    PRJNA103929
-    PRJNA102875
-    PRJNA102387
-    PRJNA101467
-    PRJNA100923
-    PRJNA100723
-    PRJNA99427
-    PRJNA98977
-    PRJNA98025
-    PRJNA97617
-    PRJNA96113
-    PRJNA95321
-    PRJNA95131
-    PRJNA94055
-    PRJNA93093
-    PRJNA92577
-    PRJNA91591
PAO1H2O    PRJNA83447
PAK    PRJNA66135
PAO1    PRJNA57945
PAO1    PRJNA331
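If you would rather keep the xtract pipeline, a post-processing step is another option. The sketch below is not the stylesheet approach above; it is a minimal alternative that assumes the output is tab-separated, has exactly two columns, and that only the first column (Organism_Strain) can be missing. The helper name pad_first_column is hypothetical and not part of EDirect.

```shell
#!/bin/sh
# pad_first_column: hypothetical helper, not part of EDirect.
# Reads tab-separated lines on stdin. A line with a single field is
# assumed to be missing its first column, so "-" is prepended.
# Lines with two or more fields (and empty lines) pass through unchanged.
pad_first_column() {
    awk -F '\t' 'BEGIN { OFS = "\t" }
        NF == 1 { print "-", $1; next }
        { print }'
}

# Sample input imitating the xtract output shown in the question.
printf 'MPAO1\tPRJNA273663\nPRJEB8227\nPAO1\tPRJNA266474\n' | pad_first_column
```

In practice you would append `| pad_first_column` to the esearch/efetch/xtract command from the question. This breaks down as soon as more than one field can be empty, because the position of the gap can no longer be inferred from the field count. Recent EDirect releases also list a `-def` argument in `xtract -help` for a missing-field placeholder; check whether your installed version has it before relying on it.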

Thanks for this. I oversimplified my example and requirements for the sake of clarity, but it looks like the XSLT approach will handle them very well!
