Question

Extracting the country from a GenBank file with >200 sequences

0

7.4 years ago

tpaisie ▴ 80

Hey guys, I work on viral phylogenetics and I'm currently doing an outbreak analysis that includes some phylogeography. If a sequence has no country of origin or collection date, I have to drop it from my dataset. I have >200 sequences per dataset and I really don't want to go through GenBank manually. I haven't managed to write or adapt a Biopython script that extracts the country from the GenBank file. Any help would be appreciated! Thanks!

phylogenetics python genbank extract bipython • 6.1k views

ADD COMMENT • link updated 7.4 years ago by Juan Manuel Berros ▴ 120 • written 7.4 years ago by tpaisie ▴ 80

1

Can you give a couple of examples of GenBank entries (accession numbers) and the field that contains the country annotation? Also, are you looking for a Python-only solution?

ADD REPLY • link 7.4 years ago by Santosh Anand 5.7k

0

Here are a couple accession numbers: KT279761 KC692509 KC692496

The country annotation is under FEATURES, in the source feature, for example:

FEATURES             Location/Qualifiers
     source          1..10735
                     /organism="Dengue virus 1"
                     /mol_type="genomic RNA"
                     /serotype="1"
                     /isolate="HNRG14635"
                     /isolation_source="serum"
                     /host="Homo sapiens"
                     /db_xref="taxon:11053"
                     /country="Argentina: Buenos Aires"
                     /collection_date="05-May-2009"

I'm looking for any solution, but I thought Python was my best bet, with Biopython and all.

ADD REPLY • link 7.4 years ago by tpaisie ▴ 80

score 4 · Answer 1 · 2016-11-22

4

7.4 years ago

Juan Manuel Berros ▴ 120

I'm adding a Python solution that you can later modify to extract more fields. You just need to set the path of a file containing the accessions (one per line) where it says accessions.txt:

The output is separated by commas, so you can later read it as a CSV. Using your IDs, I got:

KT279761,Haiti
KC692509,Argentina: Buenos Aires
KC692496,Argentina: Buenos Aires

ADD COMMENT • link 7.4 years ago by Juan Manuel Berros ▴ 120
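
A script along these lines can be sketched as follows, modelled on the adapted version quoted later in this thread. It is only a sketch under stated assumptions: the function names (extract_qualifier, read_accessions, fetch_entries, main) and the placeholder email are made up for illustration, and accessions.txt is the file name mentioned in the answer. Entrez.efetch and Entrez.read are the Biopython calls used in the thread's own code.

```python
def extract_qualifier(entry, name):
    """Yield every value of the `name` qualifier on an entry's source features."""
    for feature in entry.get('GBSeq_feature-table', []):
        if feature['GBFeature_key'] != 'source':
            continue
        for qual in feature.get('GBFeature_quals', []):
            if qual['GBQualifier_name'] == name:
                yield qual['GBQualifier_value']


def read_accessions(path):
    """Return the non-empty lines of an accession list file."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def fetch_entries(ids, email):
    """Fetch GenBank records as parsed XML (needs Biopython and network access)."""
    # Imported here so the parsing helper above can be used without Biopython.
    from Bio import Entrez
    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.efetch(db='nuccore', id=ids, retmode='xml')
    try:
        return Entrez.read(handle)
    finally:
        handle.close()


def main(accessions_file='accessions.txt', email='you@example.org'):
    """Print one 'accession,country' line per country qualifier found."""
    ids = read_accessions(accessions_file)
    for entry in fetch_entries(ids, email):
        accession = entry['GBSeq_primary-accession']
        for country in extract_qualifier(entry, 'country'):
            print(accession, country, sep=',')
```

Calling main() with the default arguments reads accessions.txt from the current directory. Passing a different qualifier name to extract_qualifier is the only change needed to pull another field from the source feature.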

0

What can I add to get the 'date'?

ADD REPLY • link 6.8 years ago by l.souza ▴ 80

0

Hi,

Like tpaisie, I also want to extract specific information ("collection_date" in my case) from a GenBank file. I tried to slightly modify your code to get the "collection_date", but it doesn't seem to work. Any idea? Here is my code:

from Bio import Entrez

# Read the accessions from a file
accessions_file = 'accession_nuber_list.txt'
with open(accessions_file) as f:
    ids = f.read().split('\n')

# Fetch the entries from Entrez
Entrez.email = 'emailadress'  # Insert your email here
handle = Entrez.efetch('nuccore', id=ids, retmode='xml')
response = Entrez.read(handle)

# Parse the entries to get the collection date
def extract_date(entry):
    sources = [feature for feature in entry['GBSeq_feature-table']
               if feature['GBFeature_key'] == 'source']

    for source in sources:
        qualifiers = [qual for qual in source['GBFeature_quals']
                      if qual['GBQualifier_name'] == 'collection']

        for qualifier in qualifiers:
            yield qualifier['GBQualifier_value']

for entry in response:
    accession = entry['GBSeq_primary-accession']
    for date in extract_date(entry):
        print(accession, date, sep=',')

ADD REPLY • link 3.7 years ago by gabrieldupre28 • 0

0

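The qualifier name for the date is collection_date, not collection (see the /collection_date line in the example record above). Change the string compared against GBQualifier_name accordingly.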

ADD REPLY • link 3.7 years ago by Juan Manuel Berros ▴ 120
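
Since the original goal was to drop sequences that lack a country or a collection date, both qualifiers can be collected in one pass and incomplete records set aside. The following is a minimal sketch, assuming the entries have already been parsed by Entrez.read as in the code above; the names REQUIRED, source_qualifiers, split_by_metadata and write_csv are hypothetical.

```python
import csv
import sys

# Qualifiers every record must carry to stay in the dataset.
REQUIRED = ('country', 'collection_date')


def source_qualifiers(entry):
    """Collect the qualifiers of an entry's source feature(s) into a dict."""
    quals = {}
    for feature in entry.get('GBSeq_feature-table', []):
        if feature.get('GBFeature_key') != 'source':
            continue
        for qual in feature.get('GBFeature_quals', []):
            # Keep the first value seen for each qualifier name.
            quals.setdefault(qual['GBQualifier_name'],
                             qual.get('GBQualifier_value', ''))
    return quals


def split_by_metadata(entries, required=REQUIRED):
    """Return (rows, missing): complete rows, and accessions lacking a field."""
    rows, missing = [], []
    for entry in entries:
        accession = entry['GBSeq_primary-accession']
        quals = source_qualifiers(entry)
        values = [quals.get(name, '') for name in required]
        if all(values):
            rows.append([accession, *values])
        else:
            missing.append(accession)
    return rows, missing


def write_csv(rows, out=sys.stdout, header=('accession',) + REQUIRED):
    """Write the complete rows as CSV, with a header line."""
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)
```

With response being the parsed result from the code above, rows, missing = split_by_metadata(response) followed by write_csv(rows) prints the usable records, and missing lists the accessions to remove. Newer GenBank records may carry the location under geo_loc_name rather than country; if that applies to your data, adjust REQUIRED.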

score 0 · Answer 2 · 2016-11-11

0

7.4 years ago

Pierre Lindenbaum 161k

Using this simple XSLT stylesheet (saved as transform.xsl), run:

$ curl -s  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=KT279761,KC692509,KC692496&retmode=xml" |\
xsltproc --novalid transform.xsl -

KT279761 Haiti
KC692509 Argentina: Buenos Aires
KC692496 Argentina: Buenos Aires

ADD COMMENT • link 7.4 years ago by Pierre Lindenbaum 161k

0

Thank you! Instead of the URL, can I use the XML file I already have?

Also I am getting this error when I try to run it:

"transform.xsl:1: namespace error : xmlns:xsl: '

ADD REPLY • link 7.4 years ago by tpaisie ▴ 80

1

Yes, I've replaced my text with a gist on GitHub.

ADD REPLY • link 7.4 years ago by Pierre Lindenbaum 161k
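
For an XML file that is already on disk, the same extraction can also be done in Python with the standard library. This is not the stylesheet from the answer, only a hedged alternative sketch: the function name countries_from_xml is hypothetical, and it assumes the file is GenBank XML as returned by efetch with retmode=xml (a GBSet of GBSeq records).

```python
import xml.etree.ElementTree as ET


def countries_from_xml(source):
    """Yield (accession, country) pairs from GenBank XML (path or file object)."""
    root = ET.parse(source).getroot()
    for seq in root.iter('GBSeq'):
        accession = seq.findtext('GBSeq_primary-accession')
        for feature in seq.iter('GBFeature'):
            # Only the source feature carries the country annotation.
            if feature.findtext('GBFeature_key') != 'source':
                continue
            for qual in feature.iter('GBQualifier'):
                if qual.findtext('GBQualifier_name') == 'country':
                    yield accession, qual.findtext('GBQualifier_value')
```

Iterating over countries_from_xml('sequences.xml'), where sequences.xml stands in for your own file name, and printing each pair gives the same accession and country columns as the output above.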

0

Thank you for your help!

ADD REPLY • link 7.4 years ago by tpaisie ▴ 80