Biostar Beta. Not for public use.
Question: How to retrieve information on bacterial source from NCBI?

Hi,

I want to get the isolation source (clinical/environmental) information for all RefSeq Pseudomonas aeruginosa genomes. As far as I know, roughly 2000 sequenced Pseudomonas aeruginosa genomes are available in NCBI. Sometimes the isolation source is mentioned in the BioSample record, e.g. https://www.ncbi.nlm.nih.gov/biosample/SAMN02732279/ . Since I want this information for all 2000 genomes at once, how can I retrieve it using bash? Is there a known script for this purpose?

Cheers

saadleeshehreen, 15 months ago (updated 15 months ago by vkkodali)

You can use Entrez Direct for this. As you noted, not all BioSample entries have all of the information you want, and even when they do, it is not always under the same attribute. You may want to look at the XML output of esummary and come up with a suitable xtract command that fetches all of the fields you want. As an example, the following query fetches the sample title, BioSample accession and isolation source in a three-column tab-delimited format:

## WARNING: returns >3000 results; only first five are shown here
esearch -db assembly -query '"Pseudomonas aeruginosa"[Organism] AND latest_refseq[filter]' \
    | elink -db assembly -target biosample -name assembly_biosample \
    | esummary \
    | xtract -pattern DocumentSummary -first Title -element Accession \
        -group Attribute -if Attribute@harmonized_name -equals "isolation_source" -element Attribute

Pseudomonas aeruginosa CLJ1     SAMN07372049    lungs (tracheal aspirate)
Pseudomonas aeruginosa CLJ3     SAMN07372048    lungs (tracheal aspirate)
Pathogen: clinical or host-associated sample from Pseudomonas aeruginosa        SAMN10374626    skin
Pathogen: clinical or host-associated sample from Pseudomonas aeruginosa        SAMN10374625    Bronchial aspirate
Pathogen: clinical or host-associated sample from Pseudomonas aeruginosa        SAMN10374624    Biopsy
vkkodali, 15 months ago

Thanks for that. It works for me, but I got only 1000 results. I have a table with the assembly IDs (e.g. GCF_000006765) of all Pseudomonas aeruginosa genomes, so I need to map the results back to this table. How can I map each assembly ID back to its BioSample accession?

saadleeshehreen, 15 months ago

> but I got only 1000 results

Could this be because a large number of the BioSample entries lack isolation_source information? If you run the command as shown above, you should see >3000 rows in the results, but the records lacking isolation source information will have only two columns of data instead of three. You can pick out a few of those and dig around in the BioSample DocumentSummary XML for other attributes that may be of use to you.
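
One way to check this is to split the saved output by column count. The following is a minimal sketch, assuming the three-column output of the command above was saved to a tab-delimited file; the function names and the file name isolation_source.tsv are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: separate records with and without an isolation source.
# Assumed input: tab-delimited rows of title, BioSample accession and
# (optionally) isolation source, as printed by the xtract command above.

# Print only the rows that have a non-empty third column.
with_source() {
    awk -F'\t' 'NF >= 3 && $3 != ""' "$1"
}

# Print the BioSample accession (column 2) of rows lacking an isolation source.
missing_source() {
    awk -F'\t' 'NF < 3 || $3 == "" { print $2 }' "$1"
}

# Example usage (file names are hypothetical):
#   with_source isolation_source.tsv > with_source.tsv
#   missing_source isolation_source.tsv > missing_accessions.txt
#   wc -l with_source.tsv missing_accessions.txt
```

The accessions written by missing_source are the ones worth inspecting by hand for alternative attributes.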

> How can I map each assembly ID back to its BioSample accession?

You can use Entrez Direct for this as shown below. Once you have this table for all of your data, you can join it to the one with isolation source results on column 2.

esearch -db assembly -query 'GCF_000006765' | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,BioSampleAccn
GCF_000006765.1 SAMN02603714
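
For a whole table of assembly IDs, the lookup and the join could be sketched as follows. This assumes one assembly accession per line in a file, tab-delimited tables throughout, and Entrez Direct available on the PATH. The function names, file names and the NA placeholder are illustrative choices, not part of Entrez Direct. Issuing one query per accession is slow for thousands of genomes, so treat this as a starting point rather than a tuned solution.

```shell
#!/usr/bin/env bash
# Sketch: map assembly accessions to BioSample accessions, then attach the
# isolation source from the three-column table produced earlier.

# $1 = file with one assembly accession per line (e.g. GCF_000006765).
# Prints: assembly accession <TAB> BioSample accession.
map_assemblies() {
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue   # skip blank lines
        # Redirect stdin as a precaution so esearch cannot consume the
        # remaining lines of the loop input.
        esearch -db assembly -query "$acc" < /dev/null \
            | esummary \
            | xtract -pattern DocumentSummary -element AssemblyAccession,BioSampleAccn
    done < "$1"
}

# $1 = mapping file (assembly accession, BioSample accession)
# $2 = source file  (title, BioSample accession, optional isolation source)
# Prints: assembly accession, BioSample accession, isolation source (or NA).
join_on_biosample() {
    awk -F'\t' -v OFS='\t' '
        # Source file: remember the isolation source for each BioSample accession.
        FILENAME == ARGV[1] { if (NF >= 3 && $3 != "") src[$2] = $3; next }
        # Mapping file: print one row per assembly, NA when no source is known.
        { print $1, $2, (($2 in src) ? src[$2] : "NA") }
    ' "$2" "$1"
}

# Example usage (file names are hypothetical):
#   map_assemblies assembly_ids.txt > assembly_biosample.tsv
#   join_on_biosample assembly_biosample.tsv isolation_source.tsv > annotated.tsv
```

Doing the join in awk avoids having to sort both tables first, which the join utility would require, and it keeps one output row per assembly even if several assemblies share a BioSample.
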
vkkodali, 15 months ago

Hi vkkodali! Could you please post a tutorial on how to annotate a bacterial assembly using NCBI eutils? If possible, it would cover both online and offline annotation. This would help many visitors here.

cpad0112, 15 months ago

One solution I have just found:

esearch -db assembly -query GCF_000647595.2 | elink -related -cmd neighbor -name assembly_biosample | xml2 | grep "/eLinkResult/LinkSet/LinkSetDb/Link/Id=" | awk 'BEGIN{FS="="} {print $2}'

You just need to install xml2 for this to work.

saadleeshehreen, 15 months ago
