Problems Parsing GenBank Flatfiles Generated By Ensembl
13.1 years ago

Hi BioStar people,

Here I go again in my odyssey to master EnsEMBL. After learning how to use the API and the site, I'm having difficulty parsing the flatfiles generated by the site. I tried BioPerl and Biopython without success. A common error from Python, just to illustrate:

/usr/local/lib/python2.6/dist-packages/Bio/GenBank/Scanner.py:950: UserWarning: Malformed LOCUS line found - is this correct?
LOCUS       11 7208 bp DNA HTG 25-MAR-2011

If I edit the LOCUS line by hand, other parsing errors appear elsewhere. Am I wrong to assume that these EnsEMBL flatfiles can be parsed by the Bio* toolkits? My Biopython was built from the latest source release and my BioPerl came from an Ubuntu package.

ensembl genbank parsing • 6.0k views

Can you include a small code snippet which will retrieve a flatfile, and/or link to an example of a retrieved flatfile? It will help diagnosis if we can look at the file and try parsing it.


I just loaded SeqIO in a Python/Perl shell and tried to parse the file. I was only exploring the flatfile when I noticed this problem.


@Jarretinha, do you still remember (or have you perhaps posted somewhere) the issues you had with the non-standard features/annotations of EnsEMBL GenBank files, and your efforts to work around them? I'm asking in the context of the LOCUS header line handling fixed recently in Biopython: https://github.com/biopython/biopython/pull/16

13.1 years ago
Neilfws 49k

The NCBI has a link to an annotated GenBank sample record. The LOCUS line is supposed to contain (after the word LOCUS):

  1. Locus name
  2. Sequence length
  3. Molecule type
  4. GenBank division
  5. Modification date

In your example, it looks as though locus name = 11. This is not a valid accession. In addition, it may be that the parser is reading "11" and "7208" as "11 7208" (sequence length). If you look at the Biopython code, all it does is split the LOCUS line on space and the warning is generated if the resulting array length looks wrong.
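
As a rough illustration of that whitespace-splitting idea, here is a minimal sketch. It is not Biopython's actual implementation; the function name `split_locus_line`, the field labels, and the assumed token order (name, length, `bp`, molecule type, optional topology, division, date) are made up for this example:

```python
def split_locus_line(line):
    """Split a GenBank LOCUS line on runs of whitespace (hypothetical helper).

    Assumed token order: LOCUS name length 'bp' molecule [topology] division date.
    """
    tokens = line.split()
    if len(tokens) not in (7, 8) or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("Unexpected LOCUS line: %r" % line)
    fields = {
        "name": tokens[1],
        "length": int(tokens[2]),
        "molecule": tokens[4],
        "division": tokens[-2],
        "date": tokens[-1],
    }
    # An optional topology token (linear/circular) may sit between
    # the molecule type and the division.
    fields["topology"] = tokens[5] if len(tokens) == 8 else None
    return fields


# The LOCUS line quoted in the question.
locus = split_locus_line("LOCUS       11 7208 bp DNA HTG 25-MAR-2011")
print(locus["name"], locus["length"], locus["division"])
```

Split this way, the line from the question gives "11" as the locus name and 7208 as the sequence length.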

I'm not sure what the best solution is, other than to try and get an accession/identifier from Ensembl, then use that to get valid GenBank format from another source (e.g. NCBI).


I noticed this inconsistency too, but this is just how the file was generated. You can reproduce it yourself. My steps in the EnsEMBL genome browser: search for MEN1 -> enter the location -> Export data -> Flat File (GenBank) with everything selected. This generates a lot of non-standard annotations/features. With a little effort the features are parseable, but the annotations are another story...

13.1 years ago

I used the following script to parse a few files generated by Ensembl, without any problems:

use strict;
use warnings;
use Bio::SeqIO;

# Open the Ensembl export; name the format explicitly instead of relying on the extension
my $seqio_object = Bio::SeqIO->new(-file => "MEN1.gb", -format => "genbank");

# Loop over every record in the file
while (my $seq_object = $seqio_object->next_seq) {
   # Basic identifiers and sequence length
   my $accession = $seq_object->accession();
   print "$accession\n";
   my $display_id = $seq_object->display_id();
   print "$display_id\n";
   my $length = $seq_object->length();
   print "$length\n";

   print "Print sequence object annotation:\n";
   my $anno_collection = $seq_object->annotation;

   # With no key given, get_Annotations returns every annotation
   my @annotations = $anno_collection->get_Annotations();
   for my $value (@annotations) {
      print "tagname : ", $value->tagname, "\n";
      print "  annotation value: ", $value->as_text, "\n";
   }

   print "Print all the data in the features of a Seq object:\n";
   for my $feat_object ($seq_object->get_SeqFeatures) {
      print "primary tag: ", $feat_object->primary_tag, "\n";
      for my $tag ($feat_object->get_all_tags) {
         print "  tag: ", $tag, "\n";
         for my $value ($feat_object->get_tag_values($tag)) {
            print "    value: ", $value, "\n";
         }
      }
   }
}

If you're still encountering issues with parsing, please email the Ensembl Helpdesk at helpdesk@ensembl.org, attaching your script and a URL to the file you're trying to parse.

Hope this helps

Monika Komorowska
Ensembl Developer


Please use the code formatting (4 spaces at the beginning of each line) so it is easier to read.


I was able to parse from the API just as you demonstrated, but I find that strange. Why should I construct a Seq/SeqFeature object from scratch when (in theory) I could request one already built? I still don't understand why EnsEMBL generates non-conforming fields/annotations, as illustrated by the LOCUS case.


I am not 100% sure what you mean by "request one already built". As for the non-conforming records, this is due to the Ensembl dumping code not using offset (fixed-column) fields. The only way to code around this is to treat any run of spaces as a field separator, which looks like the way BioPerl manages to parse the records. We will look into these flat-file dumps soon, so your feedback is useful for seeing how we can improve.
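
To make the offset problem concrete, here is a small sketch. It is neither Ensembl nor BioPerl code: `slice_locus_columns` is a hypothetical helper, and the column positions for the name and length fields are assumptions based on the fixed-width LOCUS layout described in the GenBank release notes.

```python
# Assumed fixed-width layout (1-based, inclusive): locus name in columns
# 13-28, sequence length right-justified in columns 30-40.
NAME_COLS = slice(12, 28)
LENGTH_COLS = slice(29, 40)


def slice_locus_columns(line):
    """Read name and length by column offset, as a strict parser might."""
    return line[NAME_COLS].strip(), line[LENGTH_COLS].strip()


# The LOCUS line from the question: fields separated by single spaces
# rather than padded out to fixed columns.
ensembl_line = "LOCUS" + " " * 7 + "11 7208 bp DNA HTG 25-MAR-2011"

# Offset-based reading picks up pieces of several fields at once.
by_offset = slice_locus_columns(ensembl_line)
print(by_offset)

# Treating any run of spaces as a separator recovers name and length.
by_whitespace = ensembl_line.split()[1:3]
print(by_whitespace)
```

Under those assumed offsets the name column swallows several neighbouring fields, while the whitespace split returns the name and length as separate tokens.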
