Extracting strain name from several assemblies
5.3 years ago

Hey guys,

Another question: some of the outputs don't have the strain name. I guess the reason is that the organism name doesn't include that information, as in this example: https://www.ncbi.nlm.nih.gov/assembly/GCF_003290365.1/. If I use

for f in GCF* ; do term=$(echo "$f" | cut -f1,2 -d'_') ; esearch -db assembly -q "$term" | esummary | xtract -pattern DocumentSummary -sep ' ' -element Organism,Strain,AssemblyAccession | sed 's/ /_/g' ; done > filenames.txt

The strain name doesn't appear in filenames.txt. Could you please let me know what I'm doing wrong?

Cheers

assembly genome • 730 views

If I run just the example you posted, it works but does not print any strain info:

$ esearch -db assembly -q GCF_003290365.1 | esummary | xtract -pattern DocumentSummary -sep ' ' -element Organism,Strain,AssemblyAccession
Pseudomonas putida (g-proteobacteria) GCF_003290365.1

It looks like the strain name is stored in a different field (Sub_value), which you may need to include:

$ esearch -db assembly -q GCF_003290365.1 | esummary | xtract -pattern DocumentSummary -sep ' ' -element Organism,Sub_value,AssemblyAccession
Pseudomonas putida (g-proteobacteria) NX-1 GCF_003290365.1

You can try this and let us know if this works for other items on your list.
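
Applied to the loop from the question, the suggestion above might look like the sketch below. It assumes EDirect (esearch, esummary, xtract) is on the PATH and that every file name starts with the assembly accession. `strain_table` is a hypothetical function name, and `-query` is the spelled-out form of the `-q` option used in this thread. Sub_value may not hold a strain for every record, so it is worth checking a few output lines by hand.

```shell
# Sketch only: strain_table is a hypothetical helper, not part of EDirect.
strain_table() {
    for f in "$@"; do
        # Skip the unexpanded pattern when no file matches GCF*.
        [ -e "$f" ] || continue
        # Keep the first two underscore-separated fields, e.g. GCF_003290365.1
        term=$(printf '%s\n' "$f" | cut -f1,2 -d'_')
        esearch -db assembly -query "$term" |
            esummary |
            xtract -pattern DocumentSummary -sep ' ' \
                -element Organism,Sub_value,AssemblyAccession |
            sed 's/ /_/g'
    done
}

strain_table GCF* > filenames.txt
```

Quoting "$f" and "$term" keeps the loop safe if a file name ever contains spaces or glob characters.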

Edit: Re-reading your post, it seems that you are not getting any strain name at all. In that case, check what term=$(echo $f | cut -f1,2 -d'_') actually produces for each file. Add an echo $term inside your loop (temporarily removing the esearch command, if needed) and confirm that each value is the accession you expect.
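
A minimal way to run that check without touching NCBI is sketched here. The file names are made up for illustration (only the accession GCF_003290365.1 comes from this thread), and `accession_of` is a hypothetical helper name; replace the list with `GCF*` to test your real files.

```shell
# Hypothetical helper: keep the first two underscore-separated fields.
accession_of() {
    printf '%s\n' "$1" | cut -f1,2 -d'_'
}

# Made-up example file names; use GCF* here for your own directory.
for f in GCF_003290365.1_example_genomic.fna GCF_003290365.1.fna; do
    term=$(accession_of "$f")
    echo "$f -> $term"
done
```

If a printed term is not a bare accession such as GCF_003290365.1, the file names probably do not follow the layout that the cut command assumes.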
