Question

Intronless genes in the human genome annotation

3.5 years ago

Sergio Martínez Cuesta ▴ 230

Hi everyone,

I was wondering if anyone is familiar with an annotation term in the human genome annotation, e.g. from GENCODE or Ensembl, that can be used to extract intronless genes and separate them from genes containing introns.

There is probably an automated way to extract intronless genes from the exon annotations in the GTF files. Any initial thoughts on this would be much appreciated.

Thanks in advance,

Sergio

intronless genes hg38 GRCh38 annotation • 1.2k views

score 8 · Accepted Answer · 2020-11-09

A simple Perl one-liner that counts exon lines per gene_id and prints the genes with exactly one:

perl -lne 'if (/.+\texon\t.+gene_id "([^"]+)/) { $g{$1}++ }
END { foreach $i (sort keys %g) { print $i unless $g{$i} > 1 } }
' gencode.v26.annotation.gtf \
> intronless.txt

If you want to make sure that the code works, you can write both the intermediate results (exoncounts.txt) and the final results (intronless.txt), and check the exon counts manually:

perl -lne 'if (/.+\texon\t.+gene_id "([^"]+)/) { $g{$1}++ }
END { foreach $i (sort keys %g) { print "$i\t$g{$i}" } }
' gencode.v26.annotation.gtf | tee exoncounts.txt \
| perl -lane 'print $F[0] if $F[1]==1' > intronless.txt

Applying the same exon-counting rationale with a grep | cut | sort | uniq | perl pipeline is even faster:

grep exon gencode.v26.annotation.gtf | cut -d'"' -f2 | sort | uniq -c \
| tee exoncounts.txt | perl -lane '$F[0] == 1 and print $F[1]' > intronless.txt

Important note: I just realized that CDS lines also get through the previous grep, because their attribute column contains exon_number and exon_id. In addition, HAVANA and ENSEMBL annotations may be redundant, so the same exon could be counted twice. The code should handle both issues in order to generate the proper output:

awk '{ if ($3=="exon") print $1, $4, $5, $10 }' gencode.v26.annotation.gtf | sort -u \
| cut -d'"' -f2 | sort | uniq -c | perl -lane '$F[0] == 1 and print $F[1]' > intronless.txt

Explanation: awk selects exon lines and prints only the chromosome, start, end and gene ID; sort -u collapses redundant exons; cut -d'"' -f2 reduces the output to gene IDs only; sort | uniq -c collapses identical gene IDs while counting them; and perl prints the gene IDs that have exactly one exon.
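
For anyone who prefers a single step, the same logic can probably be folded into one awk program. This is only a sketch of the pipeline explained above, not a replacement for it: the function name intronless_genes is made up for illustration, and it assumes a tab-separated GTF with the attributes in column 9 and a gene_id "..." entry on every exon line.

```shell
#!/bin/sh
# Sketch: list gene IDs that have exactly one distinct exon interval.
# intronless_genes is a hypothetical helper name, not part of any tool.
# Usage: intronless_genes gencode.v26.annotation.gtf > intronless.txt
intronless_genes() {
    awk -F'\t' '
        # Keep only features whose type (column 3) is exactly "exon",
        # so CDS and other lines that merely mention exon_number are skipped.
        $3 == "exon" {
            # Extract the gene_id value from the attribute column.
            if (match($9, /gene_id "[^"]+"/)) {
                gene = substr($9, RSTART + 9, RLENGTH - 10)
                # Count each (chromosome, start, end) interval once per gene,
                # so exons listed by both HAVANA and ENSEMBL collapse into one.
                key = $1 SUBSEP $4 SUBSEP $5 SUBSEP gene
                if (!(key in seen)) {
                    seen[key] = 1
                    exons[gene]++
                }
            }
        }
        # Print the genes with a single distinct exon.
        END {
            for (g in exons)
                if (exons[g] == 1) print g
        }
    ' "$1" | sort
}
```

On a toy file where gene G1 has one exon listed twice (once per source) plus a CDS line, and gene G2 has two different exons, this prints only G1. Like the pipeline above, it counts distinct exon intervals per gene, so a gene whose single-exon transcripts have different boundaries would be reported as having more than one exon.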