Problems Parsing Genbank Flatfiles Generated By Ensembl
9.8 years ago
@Jarretinha148

Hi BioStar people,

Here I go again in my odyssey to master EnsEMBL. After learning how to use the API and the site, I'm having difficulty parsing site-generated flatfiles. I tried BioPerl and Biopython without success. A common error from Python, just to illustrate:

/usr/local/lib/python2.6/dist-packages/Bio/GenBank/Scanner.py:950: UserWarning: Malformed LOCUS line found - is this correct?
LOCUS       11 7208 bp DNA HTG 25-MAR-2011

If I edit the LOCUS line by hand, other parsing errors appear elsewhere. Am I wrong in supposing that these EnsEMBL flatfiles are parseable by the Bio* toolkits? My Biopython was built from the latest source version and my BioPerl came from an Ubuntu package.

ensembl genbank parsing • 3.1k views

Can you include a small code snippet which will retrieve a flatfile, and/or link to an example of a retrieved flatfile? It will help diagnosis if we can look at the file and try parsing it.


I just loaded SeqIO in a Python/Perl shell and tried to parse the file. I was only exploring the flatfile when I noticed this problem.


@Jarretinha, do you still remember (or have you posted somewhere) the issues you had with the non-standard features/annotations in EnsEMBL GenBank files, and your efforts to work around them? I'm asking in the context of the LOCUS header line handling fixed recently in Biopython: https://github.com/biopython/biopython/pull/16

9.8 years ago
Neilfws 48k
@Neilfws66

The NCBI has a link to an annotated GenBank sample record. The LOCUS line is supposed to contain (after the word LOCUS):

  1. Locus name
  2. Sequence length
  3. Molecule type
  4. GenBank division
  5. Modification date

In your example, it looks as though locus name = 11. This is not a valid accession. In addition, it may be that the parser is reading "11" and "7208" as "11 7208" (sequence length). If you look at the Biopython code, all it does is split the LOCUS line on space and the warning is generated if the resulting array length looks wrong.
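To make the whitespace-splitting idea concrete, here is a minimal sketch in plain Python. It is not Biopython's actual code: `split_locus_line` and the `FIELDS` tuple are hypothetical names, and the six-field layout is an assumption taken from the Ensembl example line (a standard GenBank record may carry extra tokens, such as a topology).

```python
import warnings

# Assumed field order, following the list above plus the "bp"/"aa" unit
# token that sits between the length and the molecule type.
FIELDS = ("name", "length", "unit", "molecule", "division", "date")


def split_locus_line(line):
    """Split a LOCUS line on runs of whitespace (hypothetical helper)."""
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)
    values = tokens[1:]
    if len(values) != len(FIELDS):
        # Same spirit as the parser warning, not Biopython's real check.
        warnings.warn("unexpected number of LOCUS fields: %d" % len(values))
    # Pair each token with its assumed field name.
    return dict(zip(FIELDS, values))


# The LOCUS line from the question, rebuilt with its seven padding spaces.
example = "LOCUS" + " " * 7 + "11 7208 bp DNA HTG 25-MAR-2011"
record = split_locus_line(example)
print(record["name"], record["length"], record["division"])
```

Read this way, the example line splits cleanly, with "11" as the locus name and "7208" as the length, so any trouble would come from a parser that expects those values at particular columns.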

I'm not sure what the best solution is, other than to try to get an accession/identifier from Ensembl, then use that to get valid GenBank format from another source (e.g. NCBI).


I've noticed this incongruence, but this is just how the file was generated. You can reproduce it yourself. My steps in the EnsEMBL genome browser: search for MEN1 -> enter the location -> Export data -> Flat File (GenBank) with everything selected. It generates a lot of non-standard annotations/features. With a little effort the features are parseable, but the annotations are another story...

9.8 years ago
@Monika Komorowska1581

I used the following script to parse a few files generated by Ensembl, without any problems:

use strict;
use warnings;
use Bio::SeqIO;

# Open the Ensembl export, naming the format explicitly instead of relying on guessing.
my $seqio_object = Bio::SeqIO->new(-file => "MEN1.gb", -format => "genbank");

# Loop over every record in the flatfile.
while (my $seq_object = $seqio_object->next_seq) {
   # Basic identifiers and sequence length.
   my $accession = $seq_object->accession();
   print "$accession\n";
   my $display_id = $seq_object->display_id();
   print "$display_id\n";
   my $length = $seq_object->length();
   print "$length\n";

   # Record-level annotations (header section of the flatfile).
   print "Print sequence object annotation:\n";
   my $anno_collection = $seq_object->annotation;
   my @annotations = $anno_collection->get_Annotations();
   for my $value (@annotations) {
      print "tagname : ", $value->tagname, "\n";
      print "  annotation value: ", $value->as_text, "\n";
   }

   # Every feature, with all of its tags and tag values.
   print "Print all the data in the features of a Seq object:\n";
   for my $feat_object ($seq_object->get_SeqFeatures) {
      print "primary tag: ", $feat_object->primary_tag, "\n";
      for my $tag ($feat_object->get_all_tags) {
         print "  tag: ", $tag, "\n";
         for my $value ($feat_object->get_tag_values($tag)) {
            print "    value: ", $value, "\n";
         }
      }
   }
}

If you're still encountering issues with parsing, please email the Ensembl Helpdesk at helpdesk@ensembl.org, attaching your script and a URL to the file you're trying to parse. Hope this helps.

Monika Komorowska Ensembl Developer


Please use the code formatting (4 spaces at the beginning of each line) so it is easier to read.


I was able to parse from the API just as you demonstrated. But I think that is strange. Why should I construct a Seq/SeqFeature object from scratch when I could (in theory) request one already built? I wasn't able to understand why EnsEMBL generates non-conforming fields/annotations, as illustrated by the LOCUS case.


I am not 100% sure what you mean by "request one already built". As for the non-conforming records, this is due to the Ensembl dumping code not using offset fields. The only way to code around this is to treat any run of spaces as a field separator, which looks like the way BioPerl manages to parse the records. We will look into these flat-file dumps soon, so your feedback is useful for seeing how we can improve.
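As a rough illustration of the difference between offset-based and separator-based reading, the sketch below pulls the locus name out of the LOCUS line from the question in both ways. The function names are hypothetical, and the column window used for the name (columns 13 to 28) is an assumption about the fixed-width layout, so treat it as a sketch and not as any parser's real logic.

```python
def name_by_offset(line):
    """Read the locus name from a fixed column window (assumed: columns 13-28)."""
    return line[12:28].strip()


def name_by_whitespace(line):
    """Read the locus name as the second whitespace-separated token."""
    return line.split()[1]


# The LOCUS line from the question, rebuilt with its seven padding spaces.
ensembl_line = "LOCUS" + " " * 7 + "11 7208 bp DNA HTG 25-MAR-2011"

offset_name = name_by_offset(ensembl_line)
token_name = name_by_whitespace(ensembl_line)

# The fixed window swallows several neighbouring fields, while the
# whitespace split isolates the name on its own.
print(repr(offset_name))
print(repr(token_name))
```

Because the dump does not pad each field out to its column, the fixed window picks up the length and part of the following fields along with the name, which is why a strictly column-based reader complains while a whitespace-based one copes.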
