Extracting the country from a GenBank file with >200 sequences
14 months ago
tpaisie • 70
University of Florida

Hey guys, I do phylogenetics of viruses and I'm currently working on an outbreak analysis, so I'm doing some phylogeography too. If a sequence has no country of origin or collection date, I have to take it out of my dataset. I have >200 sequences per dataset and I really don't want to waste my precious time by going through GenBank manually. I haven't been successful writing or editing any Biopython scripts to extract the country from the GenBank file. Any help would be appreciated! Thanks!

Can you give a couple of examples of GenBank entries (accession numbers) and the field that contains the country annotation? Also, are you looking for a Python-only solution?

Here are a couple of accession numbers: KT279761, KC692509, KC692496

The country annotation is in the FEATURES table, under source. For example:

FEATURES             Location/Qualifiers
     source          1..10735
                     /organism="Dengue virus 1"
                     /mol_type="genomic RNA"
                     /serotype="1"
                     /isolate="HNRG14635"
                     /isolation_source="serum"
                     /host="Homo sapiens"
                     /db_xref="taxon:11053"
                     /country="Argentina: Buenos Aires"
                     /collection_date="05-May-2009"

I'm looking for any solution, but I thought Python was my best bet, with Biopython and all.

19 months ago
Buenos Aires, Argentina

I'm adding a Python solution that you may later modify to include more data. You just need to specify the location of a file with the accessions (one per line), where it says accessions.txt:

The output is separated by commas, so you can later read it as a CSV. Using your IDs, I got:

KT279761,Haiti
KC692509,Argentina: Buenos Aires
KC692496,Argentina: Buenos Aires
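
The script this answer refers to is not reproduced in the thread. As a rough sketch of the same idea, and assuming the records are already saved as one multi-record GenBank flat file (rather than fetched by accession, as the answer does), the country can be pulled out with the standard library alone. The function names (`split_records`, `country_rows`, `rows_to_csv`) are made up for illustration, and accepting `/geo_loc_name` as an alternative to `/country` is an assumption about how some records may be labelled.

```python
import csv
import io
import re

# Qualifier as shown in the question, e.g. /country="Argentina: Buenos Aires".
# Assumption: some records may use /geo_loc_name instead, so both are accepted.
COUNTRY_RE = re.compile(r'/(?:country|geo_loc_name)="([^"]*)"')
ACCESSION_RE = re.compile(r"^ACCESSION\s+(\S+)", re.MULTILINE)


def split_records(genbank_text):
    """Split a multi-record flat file on its '//' terminator lines."""
    chunks = re.split(r"^//[ \t]*$", genbank_text, flags=re.MULTILINE)
    return [chunk for chunk in chunks if chunk.strip()]


def country_rows(genbank_text):
    """Return [(accession, country)]; country is '' when the qualifier is missing."""
    rows = []
    for record in split_records(genbank_text):
        accession = ACCESSION_RE.search(record)
        if accession is None:
            continue  # not a complete record, skip it
        country = COUNTRY_RE.search(record)
        # Long values wrap across lines; collapse the wrap and its indentation.
        value = " ".join(country.group(1).split()) if country else ""
        rows.append((accession.group(1), value))
    return rows


def rows_to_csv(rows):
    """Format the rows as comma-separated text, one record per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

To try it on a real file, read the file into a string and pass it along, for example `print(rows_to_csv(country_rows(open("sequences.gb").read())))`, where `sequences.gb` is a placeholder name. Records that come back with an empty country are the ones to drop from the dataset. A proper parser such as Biopython's `SeqIO.parse(handle, "genbank")` is likely more robust than a regular expression for unusual records.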

What can I add to get the 'date'?
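
One possible way to get the date, sketched here under the assumption that it is stored in the `/collection_date` qualifier shown in the question, is to look that qualifier up the same way as `/country`. The helper names (`source_qualifier`, `country_and_date`, `has_required_metadata`) are hypothetical, and the sketch works on the plain text of a single GenBank record rather than on the script from the answer.

```python
import re


def source_qualifier(record_text, name):
    """Return the value of /name="..." in one GenBank record, or None."""
    match = re.search(r"/" + re.escape(name) + r'="([^"]*)"', record_text)
    if match is None:
        return None
    # Long values wrap across lines; collapse the wrap and its indentation.
    return " ".join(match.group(1).split())


def country_and_date(record_text):
    """Return (country, collection_date); either may be None if missing."""
    return (
        source_qualifier(record_text, "country"),
        source_qualifier(record_text, "collection_date"),
    )


def has_required_metadata(record_text):
    """True only when both the country and the collection date are present."""
    country, date = country_and_date(record_text)
    return country is not None and date is not None
```

`has_required_metadata` mirrors the rule from the question: a sequence without a country or a collection date is left out of the dataset. The date is returned exactly as written in the record (for example `05-May-2009`), so any conversion to another format is a separate step.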

15 months ago
France/Nantes/Institut du Thorax - INSE…

Using this simple XSLT stylesheet:

run:

$ curl -s  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=KT279761,KC692509,KC692496&retmode=xml" |\
xsltproc --novalid transform.xsl -

KT279761 Haiti
KC692509 Argentina: Buenos Aires
KC692496 Argentina: Buenos Aires
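
The stylesheet itself is not shown in the thread. For readers who would rather stay in Python, a rough equivalent can be sketched with the standard library's `xml.etree.ElementTree`. It assumes the XML follows the GBSet layout that efetch returns with `retmode=xml` (`GBSeq`, `GBSeq_primary-accession`, `GBQualifier_name`, `GBQualifier_value`); check these element names against your own file. The function name `qualifier_from_gbxml` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def qualifier_from_gbxml(xml_text, name="country"):
    """Return [(accession, value)] for one qualifier in GBSet XML text.

    Assumes the GBSet/GBSeq/GBQualifier element names described above.
    The value is '' when a record has no such qualifier.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for seq in root.iter("GBSeq"):
        accession = seq.findtext("GBSeq_primary-accession", default="")
        value = ""
        # Take the first qualifier with the requested name in this record.
        for qualifier in seq.iter("GBQualifier"):
            if qualifier.findtext("GBQualifier_name") == name:
                value = qualifier.findtext("GBQualifier_value", default="")
                break
        rows.append((accession, value))
    return rows
```

Because the function takes the XML as text, it works the same on a saved file (read it with `open(path).read()`) as on a fresh download. Passing `name="collection_date"` returns the dates instead of the countries.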

Thank you! Instead of a link, can I use the XML file I already have?

Also I am getting this error when I try to run it:

"transform.xsl:1: namespace error : xmlns:xsl: '

Yes, I've replaced my text with a gist on GitHub.

Thank you for your help!
