Fetch relevant metadata from accession numbers
4.4 years ago
lucslapping ▴ 20

I was wondering what ways are available for getting metadata from accession numbers. I have seen other tools, such as Nextstrain, use a so-called "metadata" file to describe the sequences they use. The file looks something like this:

https://imgur.com/a/uL3m7T5

It shows various data from NCBI for the accession numbers, such as virus strain, country, date, URL, etc. For me the most important ones are strain, country and date. Is there a way to download such data automatically when you have a list of accession numbers?

Any help is appreciated.

R ncbi metadata accession number nextstrain • 1.8k views
4.4 years ago
GenoMax 141k

Using EntrezDirect:

$ esearch -db nuccore -query "KY317939" | esummary | xtract -pattern DocumentSummary -element SubName
ZIKV/Homo_sapiens/Colombia/2016/ZC204Se|Homo sapiens|Colombia|serum|06-Jan-2016|Antibody Systems Inc

The fields you are getting above are (separated by |):

isolate|host|country|isolation_source|collection_date|collected_by

Thank you, this brings up some of the fields I mentioned. However, is there a way I can submit a list of accession numbers and save the output to a csv, tsv or txt file?


Use epost with your accession numbers of interest in a file, one per line (the file is named acc in the command below).

$ epost -db nuccore -format acc -input acc| esummary | xtract -pattern DocumentSummary -element Caption,SubName | sed 's/|/\,/g'
MF574578        ZIKV/Homo sapiens/COL/PRV_00028/2015,Homo sapiens,C6/36 cell-derived; 5 passages in Vero followed by one passage in C6/36; passage history: Vero (5), C6/36 (1),Asian,Colombia: Barranquilla,Dec-2016
MF574562        ZIKV/Homo sapiens/COL/FLR_00008/2015,Homo sapiens,Vero cell-derived; 3 passages in C6/36 followed by 4 pasages in Vero; passage history: C6/36 (3), Vero (4),Asian,Colombia: Barranquilla,Dec-2015
KY558989        ZIKV/Homo_sapiens/Brazil/2015/ZBRA105,Homo sapiens,Asian,Brazil: Joao Camara, Rio Grande do Norte,23-Feb-2015,ZiBRA team
KY317939        ZIKV/Homo_sapiens/Colombia/2016/ZC204Se,Homo sapiens,Colombia,serum,06-Jan-2016,Antibody Systems Inc
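
One caveat with the plain sed replacement: some values contain commas themselves (for example the country field of KY558989 above, "Brazil: Joao Camara, Rio Grande do Norte"), so a CSV reader will see extra columns. Here is a minimal sketch of a quoting step that could take the place of the sed call. The function name to_csv and the output file name metadata.csv are my own choices for illustration, not part of EntrezDirect, and the sketch assumes the xtract output uses only tabs and | as separators.

```shell
#!/bin/sh
# to_csv: read xtract output on stdin (columns separated by tabs, SubName
# parts separated by |) and write CSV with every field double-quoted, so
# commas inside a value are not mistaken for column breaks.
to_csv() {
    tr '\t' '|' | awk -F'[|]' '{
        for (i = 1; i <= NF; i++) {
            f = $i
            gsub(/"/, "\"\"", f)          # CSV convention: double any embedded quote
            printf "%s\"%s\"", (i > 1 ? "," : ""), f
        }
        printf "\n"
    }'
}

# Usage (needs EntrezDirect and network access; acc has one accession per line):
#   epost -db nuccore -format acc -input acc | esummary |
#     xtract -pattern DocumentSummary -element Caption,SubName |
#     to_csv > metadata.csv
```

Redirecting with > is all that is needed to save the result to a file. Quoting does not fix records that have fewer fields than others; it only stops embedded commas from creating extra columns.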

Thanks again, this worked for me. However, some records appear to be in the wrong order for my case. Could this be due to mistakes in the database?


What do you mean by wrong order? Can you provide an example? We are doing a direct database query, so the information should be what is in the db.


I have an input text file with accession numbers, and here is what the first few lines look like:

MK419834.1
MK230890.1
MK230891.1
MK230892.1
MK230893.1

In the output CSV file I see that some entries don't have all 6 fields that you specified:

isolate|host|country|isolation_source|collection_date|collected_by

Some entries only have 4 of those 6 fields, for example. For certain entries in the CSV output the country is in the second column and the host is in the third column, which is a different order from most entries in the output file. Basically, I would like each value to end up in the right column.


Unfortunately, it is possible that blank or missing fields in some of those records are messing up the output. You could leave the output as is, bring the data into Excel (breaking records on |) and then check whether the fields stay aligned.
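
Another option, as a rough sketch rather than a tested pipeline: each DocumentSummary also has a SubType element that lists the field names in the same |-separated order as the values in SubName (that is where the isolate|host|country|... list above comes from). If you extract both, you can match every value to its name and write fixed columns. The function name align_fields, the column list in want and the output file name metadata.tsv are my own choices for illustration, and the sketch assumes every record has both SubType and SubName.

```shell
#!/bin/sh
# align_fields: read "accession<TAB>SubType<TAB>SubName" lines on stdin and
# print one TSV row per record with a fixed column order. Values are looked
# up by field name, so a record with fewer fields cannot shift the columns.
align_fields() {
    awk -F'\t' -v want="strain,isolate,host,country,collection_date" '
    BEGIN {
        OFS = "\t"
        n = split(want, cols, ",")
        printf "accession"                       # header row
        for (i = 1; i <= n; i++) printf "%s%s", OFS, cols[i]
        printf "\n"
    }
    {
        split("", val)                           # clear the lookup table for this record
        nk = split($2, keys, "[|]")              # field names from SubType
        nv = split($3, vals, "[|]")              # field values from SubName
        for (i = 1; i <= nk && i <= nv; i++) val[keys[i]] = vals[i]
        printf "%s", $1
        for (i = 1; i <= n; i++) {
            c = cols[i]
            # NA marks a field that is absent or empty in this record
            printf "%s%s", OFS, ((c in val) && val[c] != "" ? val[c] : "NA")
        }
        printf "\n"
    }'
}

# Usage (needs EntrezDirect and network access; acc has one accession per line):
#   epost -db nuccore -format acc -input acc | esummary |
#     xtract -pattern DocumentSummary -element Caption SubType SubName |
#     align_fields > metadata.tsv
```

Both strain and isolate are in the column list because the example records above use one or the other for the virus name. Edit want to pick different fields.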

4.4 years ago
JC 13k

Use Entrez Direct tools
