Difference between Ensembl annotation GTF and GFF3 files
8.7 years ago
colin.kern ★ 1.1k

When downloading the annotation for a genome from Ensembl, there's a GTF and a GFF3 file available. From the README files for these two, I'm having trouble determining whether they contain exactly the same information in different formats, or whether the actual annotations differ between the two. The wording makes it sound like the GFF3 file might include some non-gene features that aren't included in the GTF, and that the two files might have different evidence requirements for including a gene. Does anyone know exactly what the differences are?

ensembl genome annotation • 7.4k views

Did you check this? It will give you a basic idea of what these file types contain.

That's describing the differences in the formats, which I'm very familiar with. What I'm asking about is whether the Ensembl genome annotations in the two formats contain the exact same gene and feature sets, or if there's a difference in what is included in each. It's a question about Ensembl's data procedures, not the file formats.

Isn't this answerable from downloading relevant pairs of files and comparing?

It's not trivial, but doable, to determine whether they're the same. If they're different, I'm not sure how I'd discern what criteria Ensembl used to generate the two.
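As a rough sketch of such a comparison, the functions below collect gene IDs from each file and report the ones found in only one of them. The function names are made up for illustration. The sketch assumes that GTF gene lines carry a gene_id "..." attribute and that GFF3 gene-level lines start their 9th column with ID=gene:..., which should be checked against the actual files.

```shell
# Gene IDs from a GTF: the gene_id "..." attribute on lines whose type is "gene".
gtf_gene_ids() {
    awk -F'\t' '!/^#/ && $3 == "gene" {
        if (match($9, /gene_id "[^"]+"/))
            print substr($9, RSTART + 9, RLENGTH - 10)
    }' "$1" | sort -u
}

# Gene IDs from a GFF3: any line whose attributes start with ID=gene:<id>
# (assumed layout; this also covers types such as ncRNA_gene and pseudogene).
gff3_gene_ids() {
    awk -F'\t' '!/^#/ {
        if (match($9, /^ID=gene:[^;]+/))
            print substr($9, RSTART + 8, RLENGTH - 8)
    }' "$1" | sort -u
}

# Print the IDs present in only one of the two files.
compare_gene_ids() {
    cmp_gtf_ids=$(mktemp)
    cmp_gff_ids=$(mktemp)
    gtf_gene_ids "$1" > "$cmp_gtf_ids"
    gff3_gene_ids "$2" > "$cmp_gff_ids"
    comm -23 "$cmp_gtf_ids" "$cmp_gff_ids" | awk '{ print "GTF_only\t" $0 }'
    comm -13 "$cmp_gtf_ids" "$cmp_gff_ids" | awk '{ print "GFF3_only\t" $0 }'
    rm -f "$cmp_gtf_ids" "$cmp_gff_ids"
}

# Usage: compare_gene_ids annotation.gtf annotation.gff3
```

An empty result would only show that the gene sets agree. It says nothing about transcripts or other features, and nothing about the criteria Ensembl used.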

I don't know if this case is the norm or the exception, but while working with the S. cerevisiae annotation I noticed that the GFF3 file had entries for whole chromosomes as features, while the GTF file did not.

4.0 years ago
Juke34 8.5k

They do not contain the same feature types (3rd column).
For example, in Homo_sapiens.GRCh38.98.chr.gtf (awk '{print $3}' Homo_sapiens.GRCh38.98.chr.gtf | sort -u):

CDS
Selenocysteine
exon
five_prime_utr
gene
start_codon
stop_codon
three_prime_utr
transcript

Homo_sapiens.GRCh38.98.chr.gff:

CDS
C_gene_segment
D_gene_segment
GRCh38.p13
J_gene_segment
V_gene_segment
biological_region
chromosome
exon
five_prime_UTR
gene
lnc_RNA
mRNA
miRNA
ncRNA
ncRNA_gene
pseudogene
pseudogenic_transcript
rRNA
scRNA
scaffold
snRNA
snoRNA
tRNA
three_prime_UTR
unconfirmed_transcript
vaultRNA_primary_transcript
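
A variant of the awk one-liner above, wrapped as a shell function. This is only a sketch: the name feature_types is made up for illustration, and it assumes tab-separated files whose header lines start with #. Skipping those lines keeps header text out of the list of feature types.

```shell
# List the distinct feature types (3rd column) of a GTF or GFF3 file,
# ignoring comment/header lines and anything without 9 tab-separated columns.
feature_types() {
    awk -F'\t' '!/^#/ && NF >= 9 { print $3 }' "$1" | sort -u
}

# Usage (file names as in the answer; adjust to your own download):
# feature_types Homo_sapiens.GRCh38.98.chr.gtf
# feature_types Homo_sapiens.GRCh38.98.chr.gff
```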

In the GTF file you have to inspect the biotype attribute in the 9th column to see whether a transcript is an mRNA, miRNA, ncRNA, etc., whereas in the GFF file this is stated directly in the feature type column (3rd column).
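
A minimal sketch of reading that attribute, assuming it is named transcript_biotype and written as transcript_biotype "value" on the transcript lines of the GTF. The function name is hypothetical. It counts transcripts per biotype, which gives a breakdown roughly comparable to the transcript-level feature types in the GFF file.

```shell
# Count GTF transcript lines per transcript_biotype (9th column attribute).
transcript_biotypes() {
    awk -F'\t' '
        !/^#/ && $3 == "transcript" {
            # Extract the quoted value that follows transcript_biotype
            if (match($9, /transcript_biotype "[^"]+"/)) {
                s = substr($9, RSTART, RLENGTH)
                sub(/^transcript_biotype "/, "", s)
                sub(/"$/, "", s)
                count[s]++
            }
        }
        END { for (b in count) print count[b] "\t" b }
    ' "$1" | sort -k2,2
}

# Usage: transcript_biotypes Homo_sapiens.GRCh38.98.chr.gtf
```

Each output line is a count followed by a tab and the biotype, sorted by biotype.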

The GTF format is much more constrained when it comes to feature types. See here for more details about the two formats.
