How to retrieve information on BioSample isolation sources from NCBI?
3.9 years ago

Hi,

I saw the following post, which targets Pseudomonas genome sequences. I would like to extract all BioSample IDs and their corresponding isolation sources from NCBI. Is that possible in bash using esearch, esummary and/or xtract? Does anyone know of a script for this purpose?

How to retrieve information on bacterial source from NCBI?

Best regards

sequencing next-gen • 3.6k views
ADD COMMENT

extract all Biosample IDs and their corresponding isolation sources from NCBI.

While it may be possible, it is likely not practical for all IDs. You may want to check the BioSample file that NCBI makes available here (large file!) to see if you can pare it down to a smaller set of IDs, and then use the answer from the thread you linked above.

ADD REPLY

Dear genomax,

Thank you for your quick comment! The file turns out to be very large, so extracting everything from it is not realistic. Sorry to bother you again, but do you know how to get the BioSample isolation sources for all environmental metagenomes? I still don't know how to specify the db and can't get any data. If possible, I would appreciate an example script. I am sorry that I am not familiar with Linux-based analysis.

Best regards

ADD REPLY

Using EntrezDirect. The following is a rough start; a query like "metagenome" will be challenging to handle, since it returns 1537782 hits as of today.

$ esearch -db biosample -query "metagenome" | efetch -format docsum | grep SampleData | head -2
    <SampleData><BioSample access="public" publication_date="2020-06-21T00:00:00.000" last_update="2020-06-21T12:28:15.543" submission_date="2020-06-21T12:00:07.830" id="15337269" accession="SAMN15337269">   <Ids>     <Id db="BioSample" is_primary="1">SAMN15337269</Id>     <Id db_label="Sample name">19_6</Id>     <Id db="SRA">SRS6881635</Id>   </Ids>   <Description>     <Title>MIMS Environmental/Metagenome sample from mouse gut metagenome</Title>     <Organism taxonomy_id="410661" taxonomy_name="mouse gut metagenome">       <OrganismName>mouse gut metagenome</OrganismName>     </Organism>     <Comment>       <Paragraph>Keywords: GSC:MIxS;MIMS:5.0</Paragraph>     </Comment>   </Description>   <Owner>     <Name>Kyung Hee University</Name>     <Contacts>       <Contact email="dhkim311@gmail.com">         <Name>           <First>Dong-Hyun</First>           <Last>Kim</Last>         </Name>       </Contact>     </Contacts>   </Owner>   <Models>     <Model>MIMS.me</Model>     <Model>MIGS/MIMS/MIMARKS.host-associated</Model>   </Models>   <Package display_name="MIMS: metagenome/environmental, host-associated; version 5.0">MIMS.me.host-associated.5.0</Package>   <Attributes>     <Attribute attribute_name="collection_date" harmonized_name="collection_date" display_name="collection date">2017</Attribute>     <Attribute attribute_name="env_broad_scale" harmonized_name="env_broad_scale" display_name="broad-scale environmental context">mouse gut</Attribute>     <Attribute attribute_name="env_local_scale" harmonized_name="env_local_scale" display_name="local-scale environmental context">C57BL/6 mouse gut</Attribute>     <Attribute attribute_name="env_medium" harmonized_name="env_medium" display_name="environmental medium">feces</Attribute>     <Attribute attribute_name="geo_loc_name" harmonized_name="geo_loc_name" display_name="geographic location">Korea:Seoul</Attribute>     <Attribute attribute_name="host" harmonized_name="host" display_name="host">Mus musculus</Attribute>     
<Attribute attribute_name="lat_lon" harmonized_name="lat_lon" display_name="latitude and longitude">not applicable</Attribute>     <Attribute attribute_name="treat">ginkgo leaf</Attribute>     <Attribute attribute_name="replicate">replicated 6</Attribute>   </Attributes>   <Links>     <Link type="entrez" target="bioproject" label="PRJNA449459">449459</Link>   </Links>   <Status status="live" when="2020-06-21T12:00:07.829"/> </BioSample> </SampleData>
ADD REPLY
3.9 years ago
GenoMax 142k
$ esearch -db biosample -query "metagenome" | esummary | xtract -pattern DocumentSummary -first Title -element Accession -group Attribute -if Attribute@harmonized_name -equals "isolation_source" -element Attribute
MIMS Environmental/Metagenome sample from mouse gut metagenome  SAMN15337269
MIMS Environmental/Metagenome sample from mouse gut metagenome  SAMN15337268
Metagenome or environmental sample from metagenome  SAMN15337148
Metagenome or environmental sample from metagenome  SAMN15337147
Metagenome or environmental sample from metagenome  SAMN15337146
Metagenome or environmental sample from metagenome  SAMN15337145
Metagenome or environmental sample from metagenome  SAMN15337144
ADD COMMENT

Dear genomax,

I really appreciate your great help! In fact, the script worked fine on my computer, but when I tried to get the data for 169,159 BioSamples, I got the following messages:

(1st time) 500 Can't connect to eutils.be-md.ncbi.nlm.nih.gov:443 (Operation timed out) No do_post output returned from 'https://eutils.be-md.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=biosample&query_key=.......

(2nd time) 502 Bad Gateway

I think either my computer caused an error or NCBI stopped my download, but normally we can get all the data, right?

Best regards

ADD REPLY

Great. You should sign up for NCBI's API_KEY as described here.

ADD REPLY

Thank you for your continued help! I understood that I have to create an API key. Sorry to bother you again, but does this mean I need to insert "-api_key ???" in the script above?

ADD REPLY

To use that key, set and export a variable like this: export NCBI_API_KEY=your_API_key
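
A minimal sketch of that setup, assuming a bash-like shell. `your_API_key` is a placeholder, and `require_api_key` is a hypothetical helper (not part of EDirect) that simply refuses to start a long job when the variable is missing.

```shell
# Placeholder value: replace with the key from your NCBI account settings.
export NCBI_API_KEY="your_API_key"

# Hypothetical helper, not an EDirect command: fail early if the key is unset
# or empty, so a large download does not start without it.
require_api_key() {
    if [ -z "${NCBI_API_KEY:-}" ]; then
        echo "NCBI_API_KEY is not set; export it first" >&2
        return 1
    fi
}

# Example use before a long query (needs EDirect and network access):
# require_api_key && esearch -db biosample -query "metagenome" | esummary > summary.xml
```

The export only lasts for the current terminal session; adding the export line to a shell startup file such as ~/.bashrc is one way to make it persistent.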

ADD REPLY

Dear genomax,

Thank you for all your comments and great help. I tried the following script, but my job stopped, probably because of the large amount of data:

esearch -db biosample -query "metagenome" | esummary | xtract -pattern DocumentSummary -first Title -element Accession -group Attribute -if Attribute@harmonized_name -equals "isolation_source" -element Attribute export NCBI_API_KEY="my_api_key" > result.list

So, in order to retrieve only the BioSample information I need, I tried the following script and test file, but it didn't work either (there was only one result...). Is this also a problem with my script? I would appreciate it if you could check it when you have time.

Test file - biosample.id.list

SAMN08886454
SAMN02941851
SAMN02436271

Test script

$ while read line; do title=$(esearch -db biosample -query $line | esummary | xtract -pattern DocumentSummary -element Accession -group Attribute -if Attribute@harmonized_name -equals "isolation_source" -element Attribute); echo "$line $title"; done<biosample.id.list > result.list

Test result - result.list

x x SAMN08886454 single cell amplified by WGA-X; seawater

Best regards.

ADD REPLY

Use the epost method:

$ cat biosample.id.list |  epost -db biosample -format acc | esummary | xtract -pattern DocumentSummary -first Title -element Accession -group Attribute -if Attribute@harmonized_name -equals "isolation_source" -element Attribute 
Microbe sample from SAR116 cluster bacterium AG-426-B08      SAMN08886454    single cell amplified by WGA-X; seawater
MIMS Environmental/Metagenome sample from SAR11 cluster bacterium PRT-SC02      SAMN02941851    single cell amplified by MDA; hadopelagic water column of the Puerto Rico Trench at 8200 m depth
Generic sample from Roseobacter sp. GAI101      SAMN02436271    seawater off the coast of Georgia

Note: You don't include NCBI_API_KEY in the actual command. Just export that variable once in the terminal you are running the search from.
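
As for why the while-read loop returned a single line: a likely cause, assuming esearch reads from standard input the way EDirect commands accept piped data, is that the first call swallowed the rest of biosample.id.list. The sketch below reproduces that effect without contacting NCBI; `fake_lookup` is a hypothetical stand-in, not an EDirect command.

```shell
# Hypothetical stand-in for an EDirect call: like a command that accepts
# piped input, it reads everything available on stdin before answering.
fake_lookup() {
    cat > /dev/null
    echo "looked up $1"
}

ids=$(mktemp)
printf 'SAMN08886454\nSAMN02941851\nSAMN02436271\n' > "$ids"

# Loop shaped like the test script: the first call drains the ID file,
# so `read` finds nothing left and the loop stops after one pass.
broken=$(while read -r id; do fake_lookup "$id"; done < "$ids" | wc -l | tr -d ' ')

# Same loop, but the command gets an empty stdin and the loop keeps its input.
fixed=$(while read -r id; do fake_lookup "$id" < /dev/null; done < "$ids" | wc -l | tr -d ' ')

echo "without redirect: $broken line(s); with </dev/null: $fixed line(s)"
rm -f "$ids"
```

In the real loop the equivalent change would be adding `< /dev/null` right after the esearch arguments. That has not been re-run against NCBI here, so treat it as a probable explanation rather than a confirmed one. epost remains the better choice for a list of IDs, since it sends them in one request instead of one query per accession.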

ADD REPLY

Dear genomax,

I'm sorry for the late reply. Everything went well according to your comments and suggestions. Thank you very much for your kind help!!

Best regards

ADD REPLY

Can you accept the original answer (green checkmark) to provide closure to this thread?

ADD REPLY

Any idea how to combine different attributes in one single command? I would like to generate a table with "isolation_source", "host" and "collection_date" from ~1000 BioSample IDs. Thanks

ADD REPLY
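
One possible approach, extending the epost command from the answer above: repeat the attribute filter once per harmonized name. This is an untested sketch. `biosample_table` is a hypothetical wrapper name, and the use of one `-block` per attribute is an assumption modelled on the `-group Attribute -if ... -equals ...` pattern already shown in this thread.

```shell
# Hypothetical helper: print accession, isolation_source, host and
# collection_date for the BioSample accessions listed in file $1.
# Assumes EDirect (epost, esummary, xtract) is installed and on PATH.
biosample_table() {
    epost -db biosample -format acc < "$1" |
        esummary |
        xtract -pattern DocumentSummary -element Accession \
            -block Attribute -if Attribute@harmonized_name -equals isolation_source -element Attribute \
            -block Attribute -if Attribute@harmonized_name -equals host -element Attribute \
            -block Attribute -if Attribute@harmonized_name -equals collection_date -element Attribute
}

# Example call (needs network access):
# biosample_table biosample.id.list > attributes.tsv
```

Samples that lack one of the attributes will probably produce shorter rows, so the columns may not line up for every record; it is worth checking the column count per line before treating the output as a clean table.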
