Biopython error parsing feature locations in GenBank file
8.4 years ago
Christelle ▴ 20

Hi there,

I am trying to parse an old GenBank file (see URL below) to extract various features, which I then want to write out in GFF format. I encountered an error (please see below) when parsing the file with Biopython (version 1.64), using the SeqIO.parse method to access the records.

GenBank file: ftp://ftp.ensembl.org/pub/release-22/human-22.34d/data/flatfiles/genbank/Homo_sapiens.3000.dat.gz

Biopython error: /opt/apps/python/2.7.3/lib/python2.7/site-packages/Bio/GenBank/__init__.py:1108: BiopythonParserWarning: Couldn't parse feature location: 'AL358792.24.1.166931:3274..3461'
 % (location_line)))

I looked at the Bio/GenBank/__init__.py file and found many regular expressions that check the format of feature locations. These regexps seem to cover the format of the location I encountered, i.e. 'AL358792.24.1.166931:3274..3461' (see the example regexp for complex locations below, taken from __init__.py). So I am not sure why the code raises the BiopythonParserWarning error.

Regexp in Bio/GenBank/__init__.py: _complex_location = r"([a-zA-z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+)?\:)?(%s|%s|%s|%s|%s)" % (_pair_location, _solo_location, _between_location, _within_location, _oneof_location)

Could anybody please help me solve this parsing issue?

Thank you very much for your help.

GenBank Parser Biopython • 3.9k views

Thank you, Peter, for investigating this and reporting the issue on the Biopython bug tracker. I very much appreciate your help with this.


You described a warning which I can reproduce (well, lots and lots of warnings about problematic locations, probably due to the period/dots in the sequence reference name), but what is the error? I don't see any exception and traceback in your question - or do you mean how can we fix the warning?
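The guess about the dots can be checked against the reference-name part of the regexp quoted in the question. This is only a rough sketch: the prefix is copied from the quoted `_complex_location` line, but `SIMPLE_PAIR` and `old_pattern_accepts` are made-up names, and the plain `start..end` pattern is an assumed stand-in for the real `_pair_location`, which is not shown in this thread.

```python
import re

# Reference-name prefix copied from the _complex_location pattern quoted in
# the question: a name, at most ONE ".suffix" group, then a colon.
OLD_PREFIX = r"([a-zA-z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+)?\:)?"

# Assumption: a plain "start..end" pattern stands in for _pair_location.
SIMPLE_PAIR = r"\d+\.\.\d+"

OLD_LOCATION = re.compile("^" + OLD_PREFIX + SIMPLE_PAIR + "$")


def old_pattern_accepts(location):
    """Return True if the simplified old-style pattern matches the whole string."""
    return OLD_LOCATION.match(location) is not None


for loc in (
    "3274..3461",                       # no external reference
    "AL358792.24:3274..3461",           # reference with one dot
    "AL358792.24.1.166931:3274..3461",  # reference with several dots
):
    print(loc, old_pattern_accepts(loc))
```

Under these assumptions only the last string is rejected, because the `(\.[a-zA-Z0-9]+)?` group allows a single dotted suffix before the colon.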


You're right: no error is triggered as such, but a warning is raised because the problematic feature locations cannot be parsed. So yes, I meant how can we fix the BiopythonParserWarning issue so that we can retrieve the locations of those "problematic" features? Thank you very much for your help.

8.4 years ago
Peter 6.0k

I can't find anything in http://www.insdc.org/files/feature_table.html#3.4 to say these locations are invalid, nor in section 3.4.12.2 (Feature Location) of ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt. On the other hand, neither document explicitly describes the valid form of an external reference within a feature location. So this does look like a well-defined corner case where the Biopython parser needs to be extended:

https://github.com/biopython/biopython/issues/677

Update: This has been fixed by relaxing the regular expression to allow multiple dots within the external reference name of an INSDC feature location, and the fix will be included in Biopython 1.67 onwards.
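To illustrate what "relaxing" means here, below is a minimal sketch. It is not the regex from the Biopython commit: `RELAXED_LOCATION` and `split_location` are hypothetical names, and the pattern only handles a simple `reference:start..end` location, with the reference name allowed to carry any number of dotted suffixes.

```python
import re

# Hypothetical relaxed pattern (NOT the regex from the Biopython commit):
# the reference name may contain any number of ".suffix" groups.
RELAXED_LOCATION = re.compile(
    r"^(?:(?P<ref>[a-zA-Z][a-zA-Z0-9_]*(?:\.[a-zA-Z0-9]+)*):)?"
    r"(?P<start>\d+)\.\.(?P<end>\d+)$"
)


def split_location(location):
    """Return (reference, start, end) for a simple location string, or None.

    Coordinates are returned exactly as written in the file (no offset applied).
    """
    match = RELAXED_LOCATION.match(location)
    if match is None:
        return None
    return match.group("ref"), int(match.group("start")), int(match.group("end"))


print(split_location("AL358792.24.1.166931:3274..3461"))
```

Something like this could also serve as a stopgap for pulling coordinates out of the location strings reported in the warnings, though it ignores joins, complements and fuzzy positions.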

Christelle: Until Biopython 1.67 is released, I would recommend installing Biopython from source via git to get this fix. However, if compiling would be a hassle (e.g. on Windows), you would be safe updating to Biopython 1.66 and then manually updating just Bio/GenBank/__init__.py with the new regex from the commit linked from the Biopython issue.

Note that you should update from Biopython 1.64 anyway, because of a problem with compound locations from EMBL/ENSEMBL-style joins that was fixed in Biopython 1.66.
