Question

GFF3 ID field: "Official" nomenclature

2

8.1 years ago

user230613 ▴ 360

Hi there, I'm trying to "manually" add an ID field to the features of a gff3 file. The lines containing gene and mRNA features already have IDs, but the UTR, exon and CDS lines do not. I've read several websites and documentation about the gff3 format, but I'm not sure about the "official" way to annotate these features. I guess the ID field is not mandatory, but I need to add it. So, for each feature I've mentioned:

exon: If an exon is present in more than one isoform, my gff3 has two lines for the same exon (one for transcript1 and the other for transcript2). Should the ID for these two lines be the same?
CDS: Please correct me if I'm wrong. All CDS lines belonging to a given protein must share the same ID, right?
UTR: ?

Finally, can I put a "random" number in the ID, like exon1, exon2, exon3... and so on, or is it better to mention the parent feature, like transcript1.exon1, transcript1.exon2 ... etc.? I know that the ID must be unique across the gff3, but I'd like to know the official way to annotate.

Thanks in advance.

gff3 • 3.9k views

updated 7.7 years ago by Daniel Standage 4.1k • written 8.1 years ago by user230613 ▴ 360

2

8.1 years ago

Ryan Dale 5.0k

Even though specs exist (as in mastal511's answer) it is rare for providers to actually follow them! The gffutils docs have a bunch of real-world examples which differ from the spec in different ways.

I would make your decisions based on how the file is to be used in downstream analysis. If it will be used alongside other GFF files, then try to match the format of those; if not then go with whatever's convenient.

0

8.1 years ago

mastal511 ★ 2.1k

This webpage describes the GFF3 format and shows a couple of examples with exons that belong to more than one transcript.

http://www.sequenceontology.org/gff3.shtml

0

7.7 years ago

Daniel Standage 4.1k

Michael's answer is excellent. I want to reiterate a few of his points before I add another of my own.

exons appearing in multiple isoforms: In the "canonical gene" shown in the official GFF3 specification (recently moved from the Sequence Ontology website to GitHub), exons appearing in multiple isoforms are listed once with multiple Parent values. Some bioinformatics software expects data structured this way, some expects such exons to be listed once for each isoform in which they appear, and some can handle both cases. Ryan Dale is right: you have to determine what the downstream software packages expect, and hope they have documented this!
CDS features: typically each isoform has a single coding sequence; if there are 5 CDS features, that does not mean that the isoform has 5 coding sequences, but one sequence split across 5 exons. Since it is a single feature, each CDS entry belonging to the same isoform should have the same ID. This is sometimes referred to as a multi-feature, but the term is not universal.
UTR features: typically these do not require IDs, but again this depends on downstream software.
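To make the first two points concrete, here is a minimal Python sketch. The three GFF3 lines are made-up illustration data (the coordinates and the names `exon1`, `cds1`, `mRNA1`, `mRNA2` are assumptions, not taken from any real annotation), and `parse_attributes` is a hypothetical helper that ignores URL-escaping of attribute values.

```python
def parse_attributes(column9):
    """Parse the ninth GFF3 column into a dict.

    Parent is split on commas because a feature may have several parents.
    Simplification: percent-encoded characters are not decoded.
    """
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attrs[key] = value.split(",") if key == "Parent" else value
    return attrs


# Illustration data: one exon shared by two isoforms (listed once, two
# parents) and one coding sequence split over two lines (same ID).
GFF3_LINES = [
    "chr1\t.\texon\t100\t200\t.\t+\t.\tID=exon1;Parent=mRNA1,mRNA2",
    "chr1\t.\tCDS\t150\t200\t.\t+\t0\tID=cds1;Parent=mRNA1",
    "chr1\t.\tCDS\t300\t400\t.\t+\t0\tID=cds1;Parent=mRNA1",
]

records = [parse_attributes(line.split("\t")[8]) for line in GFF3_LINES]
for record in records:
    print(record)
```

Whether a given tool accepts the shared-exon line or wants one exon line per isoform is exactly the thing to check in its documentation.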

And last of all:

GFF3 IDs are not intended to be used as globally unique identifiers!

They are often treated this way, but IDs are only required to be unique within a single GFF3 file. If data in one GFF3 file gets processed and spit out into another GFF3 file, there is no requirement that ID=gene1 from the first file corresponds to ID=gene1 from the second file. IDs are ONLY intended to resolve parent/child relationships within a single file. So the structure or format of your IDs shouldn't matter too much, as long as they are unique for the features they describe. And there is certainly no "official" nomenclature, only conventions used by particular tools and communities of practice.
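If you want to sanity-check your own file, a rough sketch along these lines may help. It assumes one heuristic that is mine, not the spec's wording: an ID may repeat only when every line carrying it has the same feature type and the same Parent value (the multi-line CDS case). The function name `find_conflicting_ids` and the sample lines are hypothetical.

```python
from collections import defaultdict


def find_conflicting_ids(lines):
    """Return IDs reused by lines that differ in type or Parent.

    Heuristic assumption: repeated IDs are acceptable only for a
    discontinuous feature, i.e. same type and same Parent on every line.
    """
    seen = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        if "ID" in attrs:
            seen[attrs["ID"]].add((cols[2], attrs.get("Parent", "")))
    return sorted(fid for fid, uses in seen.items() if len(uses) > 1)


sample = [
    "##gff-version 3",
    "chr1\t.\tgene\t100\t400\t.\t+\t.\tID=gene1",
    "chr1\t.\tmRNA\t100\t400\t.\t+\t.\tID=mRNA1;Parent=gene1",
    "chr1\t.\tmRNA\t100\t400\t.\t+\t.\tID=mRNA2;Parent=gene1",
    "chr1\t.\texon\t100\t200\t.\t+\t.\tID=exon1;Parent=mRNA1",
    "chr1\t.\texon\t100\t200\t.\t+\t.\tID=exon1;Parent=mRNA2",
    "chr1\t.\tCDS\t150\t200\t.\t+\t0\tID=cds1;Parent=mRNA1",
    "chr1\t.\tCDS\t300\t400\t.\t+\t0\tID=cds1;Parent=mRNA1",
]
conflicts = find_conflicting_ids(sample)
print(conflicts)
```

Here `exon1` is flagged because the same ID is attached to two lines with different parents, while the repeated `cds1` passes as a single split feature.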

score 2 · Accepted Answer · 2016-03-12

There is indeed not always a single way of encoding the same structure in a GFF file, but I have worked a lot with GFFs and with Tripal, GBrowse (and BioPerl) imports. Sometimes one needs a slightly different structure for different applications, so it is mostly a matter of adapting the GFF so that it works best with the desired application.

Whenever a feature already has an ID it should not be changed, but the Name attribute can be used to add a human-readable name to it. An ID field is required by some or most applications to map features to their parent feature, which allows gene models to be rebuilt (e.g.: CDS -> mRNA -> gene).

exon: If an exon is present in more than one isoform, my gff3 has two lines for the same exon (one for transcript1 and the other for transcript2). Should the ID for these two lines be the same?

In general, different entities should also have different IDs (exception below), but see the thread "Using Multiple Parent Values In Gff3 Format?". If the exons are fully identical, they can be encoded in a single line with multiple transcripts as parents.
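As an illustration of the single-line encoding, the sketch below collapses duplicated exon rows into one line per distinct exon with a comma-separated Parent list. The input tuples, the `exonN` naming and the function name `merge_shared_exons` are all assumptions made for the example, and "identical" is taken to mean same seqid, start, end and strand.

```python
def merge_shared_exons(rows):
    """Collapse identical exons into one GFF3 line with several parents.

    rows: iterable of (seqid, start, end, strand, parent) tuples,
    one per exon line in the per-isoform encoding.
    """
    merged = {}  # insertion-ordered in Python 3.7+
    for seqid, start, end, strand, parent in rows:
        merged.setdefault((seqid, start, end, strand), []).append(parent)

    lines = []
    for n, ((seqid, start, end, strand), parents) in enumerate(merged.items(), 1):
        attrs = f"ID=exon{n};Parent={','.join(parents)}"
        lines.append("\t".join(
            [seqid, ".", "exon", str(start), str(end), ".", strand, ".", attrs]
        ))
    return lines


exon_rows = [
    ("chr1", 100, 200, "+", "transcript1"),
    ("chr1", 100, 200, "+", "transcript2"),  # same exon, second isoform
    ("chr1", 300, 400, "+", "transcript1"),
]
merged_lines = merge_shared_exons(exon_rows)
for line in merged_lines:
    print(line)
```

If your downstream tool wants one exon line per isoform instead, keep the duplicated lines and give each its own ID.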

CDS: Please correct me if I'm wrong. All CDS lines belonging to a given protein must share the same ID, right?

Yes, the CDS lines can be seen as an aggregate of the pieces that form one coding sequence per transcript, so they are a single entity. GBrowse, for example, will be able to reconstruct the correct coding sequence from its segments only if they have the same ID.
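The idea of treating same-ID CDS lines as one entity can be sketched in a few lines of Python. This is not how GBrowse implements it; it is only an illustration with made-up coordinates and hypothetical function names (`group_cds_segments`, `coding_length`).

```python
from collections import defaultdict


def group_cds_segments(cds_rows):
    """Group CDS segments by shared ID and sort them by start coordinate.

    cds_rows: iterable of (cds_id, start, end) with 1-based inclusive
    coordinates, as in GFF3.
    """
    by_id = defaultdict(list)
    for cds_id, start, end in cds_rows:
        by_id[cds_id].append((start, end))
    return {cds_id: sorted(segments) for cds_id, segments in by_id.items()}


def coding_length(segments):
    """Total number of coding bases over all segments of one CDS."""
    return sum(end - start + 1 for start, end in segments)


cds_rows = [
    ("cds1", 300, 400),
    ("cds1", 150, 200),
    ("cds2", 150, 180),
]
grouped = group_cds_segments(cds_rows)
print(grouped["cds1"], coding_length(grouped["cds1"]))
```

Without the shared ID there would be no reliable way to tell that the two `cds1` segments belong to the same coding sequence rather than to two separate ones.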

UTR: ?

You can give them IDs or not; what matters is that the coding sequence is annotated correctly. In GBrowse, for example, if UTRs and exons are annotated, the CDS can be inferred, so CDS lines are not needed in the import.

Finally, in the ID can I put a "random" number, like exon1, exon2, exon3...

Yes, that is probably how most files encode it. Ensembl IDs are an example: they just assign a running number with a prefix ("EMLSAE000000000001") to make stable IDs.

or it is better to mention the parent feature, like, transcript1.exon1, transcript1.exon2 .

That is more readable, but not required; some cDNA aligners (like GMAP) use a similar format. The downside is that such IDs are less suited as stable IDs. Imagine the parent assignment changes: either you live with a misleading ID that implies a relation that is no longer true, or you have to change the ID and all references to it, which makes these IDs "un-stable". That is probably why Ensembl uses running numbers.