Question

Must there be spaces after "ORIGIN" in GenBank records ?

0

Entering edit mode

6.5 years ago

Charles Plessy ★ 2.9k

I found that the virus genomes distributed in GenBank format by the PaVE site are not parsed correctly by the genbankr Bioconductor package, because there are no spaces after the "ORIGIN" field. Either the package is too strict, or the files are missing the spaces, but I could not find a specification of the GenBank format that would tell me which is right and which is wrong. Is that part standardised ?

Edited (26/10/2017):

Here is a minimal example:

LOCUS       TEST                240 bp    DNA     linear   VRL 13-Sep-2017
DEFINITION  Human papillomavirus 16 (HPV16), complete genome
ACCESSION   TEST
VERSION     TEST.1  GI:123456
KEYWORDS    .
SOURCE      TEST
  ORGANISM  TEST
            .
FEATURES             Location/Qualifiers
     source          1..240
     gene            42..56
                     /gene="L2"
ORIGIN
        1 actacaataa ttcatgtata aaactaaggg cgtaaccgaa atcggttgaa ccgaaaccgg
       61 ttagtataaa agcagacatt ttatgcacca aaagagaact gcaatgtttc aggacccaca
      121 ggagcgaccc agaaagttac cacagttatg cacagagctg caaacaacta tacatgatat
      181 aatattagaa tgtgtgtact gcaagcaaca gttactgcga cgtgaggtat atgactttgc

Here is what happens when importing it (after loading the rtracklayer and genbankr packages).

> genes(import("test.gbk"))
Annotations don't have 'locus_tag' label, using 'gene' as gene_id column
No exons read from genbank file. Assuming sections of CDS are full exons
GRanges object with 1 range and 4 metadata columns:
      seqnames    ranges strand |        type
         <rle> <iranges>  <rle> | <character>
  [1]      unk  [42, 56]      + |        gene
                                                                                                                                                                                                                                                                                           gene
                                                                                                                                                                                                                                                                                    <character>
  [1] L2ORIGIN1 actacaataa ttcatgtata aaactaaggg cgtaaccgaa atcggttgaa ccgaaaccgg61 ttagtataaa agcagacatt ttatgcacca aaagagaact gcaatgtttc aggacccaca121 ggagcgaccc agaaagttac cacagttatg cacagagctg caaacaacta tacatgatat181 aatattagaa tgtgtgtact gcaagcaaca gttactgcga cgtgaggtat atgactttgc
          loctype
      <character>
  [1]      normal
                                                                                                                                                                                                                                                                                        gene_id
                                                                                                                                                                                                                                                                                    <character>
  [1] L2ORIGIN1 actacaataa ttcatgtata aaactaaggg cgtaaccgaa atcggttgaa ccgaaaccgg61 ttagtataaa agcagacatt ttatgcacca aaagagaact gcaatgtttc aggacccaca121 ggagcgaccc agaaagttac cacagttatg cacagagctg caaacaacta tacatgatat181 aatattagaa tgtgtgtact gcaagcaaca gttactgcga cgtgaggtat atgactttgc
  -------
  seqinfo: 1 sequence from TEST.1 genome

Just sed 's/ORIGIN.*/ORIGIN/' and the problem will be solved.

> genes(import("test2.gbk"))
Annotations don't have 'locus_tag' label, using 'gene' as gene_id column
No exons read from genbank file. Assuming sections of CDS are full exons
GRanges object with 1 range and 4 metadata columns:
      seqnames    ranges strand |        type        gene     loctype
         <rle> <iranges>  <rle> | <character> <character> <character>
  [1]      unk  [42, 56]      + |        gene          L2      normal
          gene_id
      <character>
  [1]          L2
  -------
  seqinfo: 1 sequence from TEST.1 genome

Not shown above, but it looks like the symptoms only appear when the last feature before ORIGIN is a gene.

GenBank • 1.4k views

ADD COMMENT • link 6.5 years ago by Charles Plessy ★ 2.9k

0

Entering edit mode

Hi Charles,

Can you provide a working example? Perhaps we can edit the genbankr function so that it can read these files correctly.

Kevin

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Thanks Kevin for your kind offer (my gut feeling is that PaVE is wrong and genbankr is right) I added example data to the original post.

ADD REPLY • link 6.5 years ago by Charles Plessy ★ 2.9k

0

Entering edit mode

Yes, it looks like /gene="L2" is causing the issue, perhaps the forward-slash in particular.

I reckon that it will still be easier to modify the function genbankr and create a workaround, as opposed to getting PaVE to change their submissions. I typically report these things to the developers too (in this case, whoever wrote genbankr). As systems get updated, programs accessing them typically have to update too!

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k