Understanding the ENA accession numbers for assembled/annotated sequences
5.7 years ago

I have a list of ENA accession numbers pointing to sequences that are sometimes contigs, sometimes scaffolds. I want to group together sequences that come from the same genome, if possible by exploiting the format of the accession number.

The EBI-ENA website gives some rules about the format of accession numbers. In particular:

Assembled/Annotated sequences
[A-Z]{1}\d{5}.\d+
[A-Z]{2}\d{6}.\d+
[A-Z]{4}S?\d{8,9}.\d+
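To see which of the three rules a given accession falls under, the patterns can be tried in turn. This is only a sketch: it assumes the dot is meant as a literal "." and that the version suffix may be missing (as in JQEY01000011), neither of which is stated in the rules as quoted. The function name `classify` and the labels are made up for illustration.

```python
import re

# The three quoted patterns, with two assumptions: the dot is escaped so it
# matches a literal ".", and the ".version" suffix is made optional.
PATTERNS = {
    "1 letter + 5 digits": re.compile(r"[A-Z]\d{5}(\.\d+)?"),
    "2 letters + 6 digits": re.compile(r"[A-Z]{2}\d{6}(\.\d+)?"),
    "4 letters + 8-9 digits": re.compile(r"[A-Z]{4}S?\d{8,9}(\.\d+)?"),
}

def classify(accession):
    """Return the label of the first pattern that fully matches, else None."""
    for label, pattern in PATTERNS.items():
        if pattern.fullmatch(accession):
            return label
    return None
```

With these assumptions, a contig id such as JQEY01000011 falls under the third rule and a scaffold id such as KI912157 under the second.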

However, the rules do not explain:

  • why there are 3 possibilities
  • whether some parts of the accession number are specific to a genome/set of sequences

For example, I have noticed so far that the third rule seems to be used for contigs (example: JQEY01000011), that the first 4 letters seem to be genome-specific, and that the following digits seem to be contig-specific. However, I am not sure whether I can rely on this to group my list of accession numbers. Furthermore, this breaks down for scaffold accession numbers, since they do not seem to display any such simple pattern.
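The heuristic described here could be sketched as follows. It relies entirely on the unconfirmed observation that the first 4 letters are genome-specific; anything that does not match the third rule (scaffold ids, for instance) is returned separately instead of being guessed at. The function name `group_by_prefix` is hypothetical, and the optional version suffix is an assumption.

```python
import re
from collections import defaultdict

# Third rule only, capturing the four leading letters.
# Assumption: the ".version" suffix is optional.
CONTIG_RE = re.compile(r"([A-Z]{4})S?\d{8,9}(?:\.\d+)?")

def group_by_prefix(accessions):
    """Group contig-style ids by their 4-letter prefix.

    Returns (groups, ungrouped): ids that do not match the third rule
    are left in `ungrouped` because the prefix heuristic says nothing
    about them.
    """
    groups = defaultdict(list)
    ungrouped = []
    for acc in accessions:
        match = CONTIG_RE.fullmatch(acc)
        if match:
            groups[match.group(1)].append(acc)
        else:
            ungrouped.append(acc)
    return dict(groups), ungrouped
```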

Are there rules for the formatting of accession numbers in EBI-ENA that can be used reliably to group together assembled/annotated sequences that come from the same genome?

ENA Assembly sequence • 1.7k views

Contacting datasubs at ebi.ac.uk may be a good place to start.

The rules you are referring to only specify what an identifier looks like. The link between an identifier and the organism it denotes may not be inferable without a name-key file: for example, Maribius sp. MOLA 401 corresponds to JQEY01000000, but it is not immediately apparent how Maribius and JQEY are related.


I do not really need the organism name since I don't mind working with ids. What I do need however is to be able to identify sequences (contigs or scaffolds) that belong to the same set (WGS project, ...).

While it looks straightforward for contig ids (JQEY01000001-JQEY01000033 all belong to the set JQEY01000000), it is less clear for scaffold ids. For instance, the scaffold ids KI912157, KI912158, KI912155, KI912156 and KI912577 look very similar, but they do not all come from the same organism. What I would like is a rule for determining whether a set of scaffold ids comes from the same organism.
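For the contig case, the set id can be derived mechanically from a contig id. The sketch below assumes, based only on the JQEY example above, that the set id keeps the 4 letters and the next 2 digits and replaces the remaining digits with zeros. The helper name `set_accession` is hypothetical. Scaffold ids do not match and yield `None`, which mirrors the problem described: their format alone does not reveal the set.

```python
import re

# Assumption from the JQEY example: 4 letters + 2 digits identify the set,
# and the remaining 6-7 digits number the contigs within it.
CONTIG_RE = re.compile(r"([A-Z]{4}S?\d{2})(\d{6,7})(?:\.\d+)?")

def set_accession(contig_id):
    """Return the set id for a contig id, or None if the format differs."""
    match = CONTIG_RE.fullmatch(contig_id)
    if match is None:
        return None
    head, serial = match.groups()
    # Zero out the contig-specific digits to obtain the set id.
    return head + "0" * len(serial)
```

For example, `set_accession("JQEY01000011")` gives `"JQEY01000000"`, while `set_accession("KI912157")` gives `None`.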

I will probably try contacting datasubs like you suggest.
