Why is the sum of all exons so much longer than the CDS?
5.1 years ago

Is this because of non-coding RNA? I thought these would have been at least comparable.

library(GenomicFeatures)
txdb <- makeTxDbFromEnsembl("Homo Sapiens", server = "useastdb.ensembl.org")
gr <- cds(txdb)
sum(width(reduce(gr)))
[1] 41901692

gr <- exons(txdb)
sum(width(reduce(gr)))
[1] 153094341

Yes, there are a bunch of non-coding RNA families (rRNA, tRNA, miRNA, snRNA, lncRNA, ...) that can be in your exon list but not in your CDS.

5.1 years ago
Eric Lim ★ 2.1k

The 3' and 5' UTRs, as well as non-coding species.
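
To make that concrete, here is a minimal sketch on a made-up locus (the coordinates and object names are hypothetical, not taken from Ensembl). It shows the exon total splitting into CDS, UTR, and non-coding exon sequence:

```r
library(GenomicRanges)

# Hypothetical coding transcript: two exons, with the CDS covering only part of each.
coding_exons <- GRanges("chr1", IRanges(start = c(101, 501), end = c(300, 1000)), strand = "+")
cds_parts    <- GRanges("chr1", IRanges(start = c(201, 501), end = c(300, 600)),  strand = "+")

# Hypothetical single-exon non-coding transcript (e.g. a lncRNA) at the same locus.
noncoding_exon <- GRanges("chr1", IRanges(start = 2001, end = 2500), strand = "+")

all_exons <- c(coding_exons, noncoding_exon)

exon_bp      <- sum(width(reduce(all_exons)))                # every exonic base
cds_bp       <- sum(width(reduce(cds_parts)))                # coding bases only
utr_bp       <- sum(width(setdiff(coding_exons, cds_parts))) # exonic but not coding
noncoding_bp <- sum(width(noncoding_exon))

c(exons = exon_bp, cds = cds_bp, utr = utr_bp, noncoding = noncoding_bp)
```

In this toy case the exons add up to 1200 bp while the CDS is only 200 bp; the other 1000 bp are UTR and non-coding exon sequence.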


how about that!

sum(width(unlist(fiveUTRsByTranscript(txdb))))
[1] 21299710
sum(width(unlist(threeUTRsByTranscript(txdb))))
[1] 86927764
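
As a rough check, the totals posted in this thread can be added up directly. This is only approximate, because the two UTR sums above were taken per transcript without reduce(), so bases shared between transcripts are counted more than once and the leftover should not be read as an exact non-coding total. The variable names are just for illustration:

```r
# Totals copied from the outputs posted in this thread.
cds_bp  <- 41901692
exon_bp <- 153094341
utr5_bp <- 21299710
utr3_bp <- 86927764

# CDS plus both UTR totals, and what is left of the reduced exon total.
accounted_bp <- cds_bp + utr5_bp + utr3_bp
remainder_bp <- exon_bp - accounted_bp

format(c(accounted = accounted_bp, remainder = remainder_bp), big.mark = ",")

# Ratios referred to in the discussion.
exon_to_cds  <- exon_bp / cds_bp
utr3_to_utr5 <- utr3_bp / utr5_bp
round(c(exon_to_cds = exon_to_cds, utr3_to_utr5 = utr3_to_utr5), 2)
```

CDS plus UTRs comes to 150,129,166 bp, leaving about 3 Mb of the reduced exon total unaccounted for by this crude sum.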

Yeah, the mean 3' UTR is around 40% of the length of a transcript, and a not insubstantial number of UTRs make up more than 75% of the transcript.
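
One way to check figures like these is to summarise per-transcript lengths. The sketch below assumes a data frame shaped like the output of transcriptLengths() from GenomicFeatures (columns tx_len, cds_len, utr3_len); the helper name and the toy numbers are hypothetical:

```r
# With a real TxDb the input could come from:
# tx_lens <- transcriptLengths(txdb, with.cds_len = TRUE,
#                              with.utr5_len = TRUE, with.utr3_len = TRUE)

# Hypothetical helper: fraction of each coding transcript that is 3' UTR.
utr3_fraction_summary <- function(tx_lens) {
  coding <- tx_lens[tx_lens$cds_len > 0, ]   # drop non-coding transcripts
  frac <- coding$utr3_len / coding$tx_len
  c(mean_fraction = mean(frac), share_over_75pct = mean(frac > 0.75))
}

# Toy data, not real transcripts; the last row is non-coding and is ignored.
toy <- data.frame(
  tx_len   = c(1000, 2000, 4000, 500),
  cds_len  = c(600, 1000, 600, 0),
  utr5_len = c(100, 200, 200, 0),
  utr3_len = c(300, 800, 3200, 0)
)

res <- utr3_fraction_summary(toy)
res
```

Only transcripts with a CDS are kept, since UTR fractions are not meaningful for non-coding transcripts.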


A 4-5x difference between the 5' and 3' totals is right in the neighborhood of the averages they report. :)

Table 1 in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC139023/

5.1 years ago

Also, Ensembl counts an exon as unique even if it overlaps other exons, so a lot of sequence gets counted over and over again because it belongs to multiple slightly different exons.


That’s why I used reduce
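
For anyone following along, a small sketch of what reduce() does to overlapping exons, using made-up coordinates. Note that reduce() on a GRanges is strand-aware by default, so exons overlapping on opposite strands are still counted separately unless ignore.strand = TRUE is passed:

```r
library(GenomicRanges)

# Hypothetical exons: two overlapping "+" exons from different isoforms,
# one separate "+" exon, and one "-" exon overlapping the last one.
exons_toy <- GRanges(
  "chr1",
  IRanges(start = c(100, 150, 400, 450), end = c(299, 349, 499, 549)),
  strand = c("+", "+", "+", "-")
)

raw_bp        <- sum(width(exons_toy))                               # overlaps counted repeatedly
reduced_bp    <- sum(width(reduce(exons_toy)))                       # merged within each strand
unstranded_bp <- sum(width(reduce(exons_toy, ignore.strand = TRUE))) # merged across strands

c(raw = raw_bp, reduced = reduced_bp, unstranded = unstranded_bp)
```

Here the raw sum is 600 bp, the strand-aware reduce gives 450 bp, and ignoring strand gives 400 bp, so which one is "the" exon total depends on whether each strand should be counted separately.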
