'CDS' but not 'exon' in GFF
1
0
Entering edit mode
5.4 years ago
NielQC ▴ 10

Hi,

I'm working with RNAseq data from A. thaliana. Reads were mapped with STAR 2.6 using the TAIR10.1 genome from NCBI and its annotation in GFF3 format (link).

I have noticed some of the genes have not any 'exon' feature line and go directly from the 'gene' line to the 'CDS' line, and this is a problem since I would like to count the 'exon' features. I have always thought that a 'CDS' has to have a related 'exon'.

In lines below, gene38184 is a tRNA and has 'exon' features, however gene38185 and gene38186 (protein codificant) have not.

NC_000932.1 RefSeq  gene    1717    4347    .   -   .   ID=gene38184;Dbxref=GeneID:1466272;Name=trnK;gbkey=Gene;gene=trnK;gene_biotype=tRNA;locus_tag=ArthCt089
NC_000932.1 RefSeq  tRNA    1717    4347    .   -   .   ID=rna60172;Parent=gene38184;Dbxref=GeneID:1466272;gbkey=tRNA;gene=trnK;product=tRNA-Lys
NC_000932.1 RefSeq  exon    4311    4347    .   -   .   ID=id325622;Parent=rna60172;Dbxref=GeneID:1466272;gbkey=tRNA;gene=trnK;product=tRNA-Lys
NC_000932.1 RefSeq  exon    1717    1751    .   -   .   ID=id325623;Parent=rna60172;Dbxref=GeneID:1466272;gbkey=tRNA;gene=trnK;product=tRNA-Lys
NC_000932.1 RefSeq  gene    2056    3570    .   -   .   ID=gene38185;Dbxref=GeneID:844797;Name=matK;gbkey=Gene;gene=matK;gene_biotype=protein_coding;locus_tag=ArthCp003
NC_000932.1 RefSeq  CDS 2056    3570    .   -   0   ID=cds48182;Parent=gene38185;Dbxref=Genbank:NP_051040.2,GeneID:844797;Name=NP_051040.2;gbkey=CDS;gene=matK;product=maturase K;protein_id=NP_051040.2;transl_table=11
NC_000932.1 RefSeq  gene    5084    6188    .   -   .   ID=gene38186;Dbxref=GeneID:844798;Name=rps16;gbkey=Gene;gene=rps16;gene_biotype=protein_coding;locus_tag=ArthCp004
NC_000932.1 RefSeq  CDS 6149    6188    .   -   0   ID=cds48183;Parent=gene38186;Dbxref=Genbank:NP_051041.1,GeneID:844798;Name=NP_051041.1;gbkey=CDS;gene=rps16;product=ribosomal protein S16;protein_id=NP_051041.1;transl_table=11
NC_000932.1 RefSeq  CDS 5084    5283    .   -   2   ID=cds48183;Parent=gene38186;Dbxref=Genbank:NP_051041.1,GeneID:844798;Name=NP_051041.1;gbkey=CDS;gene=rps16;product=ribosomal protein S16;protein_id=NP_051041.1;transl_table=11

This is a problem since I would like to use 'exon' as a feature to count.

Is this normal? Should I use other feature to obtain the counts?

Thanks in advance.

RNA-Seq GFF exon cds • 3.8k views
ADD COMMENT
0
Entering edit mode

I am not a plant guy, but checking these two examples, both are chloroplast genes, so not on the nuclear genome, e.g. gene38186.

ADD REPLY
0
Entering edit mode

You are right, NC_000932.1 is the chloroplast chromosome. The case I describe also happens in the mitochondrial chromosome, NC_037304.1. With the usual proceeding of counting 'exon' features these genes would be lost. In fact I'm not sure if people usually include these chromosomes in the analysis.

ADD REPLY
0
Entering edit mode

chrM is typically not included in gene expression analysis (at least I do not do it for human and mouse). The reason that you do not have "exon" is probably the prokaryotic origin of these organelles which have operon rather than intron/exon structures.

ADD REPLY
5
Entering edit mode
5.3 years ago
tdmurphy ▴ 190

Organelles and prokaryote genomes in both GenBank and RefSeq are typically annotated with only CDS and no corresponding mRNA features, so the GFF3 has no exon features. This is mentioned briefly under "ANNOTATION DATA MODEL" in: ftp://ftp.ncbi.nlm.nih.gov/genomes/README_GFF3.txt

ADD COMMENT

Login before adding your answer.

Traffic: 2076 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6