Biostar Beta. Not for public use.
Question: Why Total Exon Length Of 2/3Rds Of Ucsc Knowngenes Is Non-Divisible By 3?
6
Entering edit mode

When examining the divisibility of total exon length of all UCSC knownGene transcripts (77614), 66.9% of them are not divisible, and even if I adjust for both UTRs and ignore all transcripts with cdsStart = cdsEnd (presumably non-coding), I still get 50% of non-divisible cDNAs.

Here's my script to evaluate divisibility:

``````#!/usr/bin/python
import fileinput

# subset of columns: name, txStart, txEnd, cdsStart, cdsEnd, exonsStart, exonsEnd
for line in fileinput.input("7_columns.txt"):
name, txs, txe, cds, cde, es, ee = line.strip("\n").split("\t")
# exon starts and ends
exons = es.strip(",").split(",")
exone = ee.strip(",").split(",")

total_length = 0

if cds != cde:
# adjust = min(delta(txEnd, cdsEnd), delta(txEnd, last_exon_start) +
# min(delta(cdsStart, txStart), delta(first_exon_end, txEnd)) ;
# without these min()'s, applying adjust yields negative lengths
adjust = min(int(txe) - int(cde), int(txe) - int(exons[-1])) + min(int(cds) - int(txs), int(exone[0]) - int(txe))
else:
# this N helps filter output for non-coding entries
print "N",

for e_index in range(len(exons)):
# sum up all exon lengths
total_length += int(exone[e_index]) - int(exons[e_index])
if adjusted % 3 != 0:
fileinput.close()
``````

(I counted transcripts by piping output to `grep -v N | wc -l`.)

I'd appreciate any hints, including things to try and change in the script.

As a directly related question - how can cdsStart be within the _second_ exon? What is the meaning of the 1st exon in this case? Here's an example (cdsStart is 3bps up from the end of 2nd exon):

``````select name, strand, txstart, txend, cdsstart, cdsend, exonstarts, exonends from knowngene where name='uc001aau.2';
name       | strand | txstart | txend  | cdsstart | cdsend |      exonstarts       |       exonends
------------+--------+---------+--------+----------+--------+-----------------------+-----------------------
uc001aau.2 | +      |  323891 | 328580 |   324342 | 325605 | 323891,324287,324438, | 324060,324345,328580,
``````
Entering edit mode
0

this would only be alarming if you could find an example where no splicing isoforms exist that are divisible by 3

Jeremy Leipzig
18k
Entering edit mode
0

Based on what Pierre has written below, it looks like UCSC transcripts themselves represent alternative splicing isoforms. How do I treat the "extra" nucleotides in non-divisible isoforms - just discard them? That doesn't seem right.

Chronos
• 570
9
Entering edit mode

I think there is a bug in your program: the translation can start in any exon, and not necessarily in the first exon. In that case, the 5' UTR is the result of the concatenation of several exons.

edit:

using the following awk script, I got only length%3==0:

``````BEGIN   {
FS="[\t]";
}

{
cdsStart=int(\$6);
cdsEnd=int(\$7);
if(cdsStart>=cdsEnd) next;
exonCount=int(\$8);
split(\$9,exonStarts,"[,]");
split(\$10,exonEnds,"[,]");
len=0;

for(i=1;i<= exonCount;i++)
{
if(cdsStart>=exonEnds[i])
{
continue;
}
if(cdsEnd<exonStarts[i])
{
continue;
}
beg=exonStarts[i];
if(cdsStart>=exonStarts[i] && cdsStart<exonEnds[i])
{
beg=cdsStart;
}
end=exonEnds[i];
if(cdsEnd>=exonStarts[i] && cdsEnd<exonEnds[i])
{
end=cdsEnd;
}
len+=(end-beg);
}
printf("%s\t%d\t%d\n",\$3,len,len%3);
}
``````

usage:

``````mysql -N  (...) -D hg19 -e 'select * from knownGene  ' | awk -f script.awk
``````
Entering edit mode
0

Basically you're saying that to get cDNA, I have to glue together all exons, discarding everything before cdsStart and after cdsEnd, correct? I'll try that and will report back.

If translation starts in e.g. the 2nd exon - shouldn't everything before it be considered either an intron or a UTR or both? Or do UCSC transcripts also include alternative transcripts? (This hasn't occurred to me before, I'm more used to Ensembl's gene model with sub-transcripts.)

Chronos
• 570
Entering edit mode
0

I've updated my answer. See above.

Pierre Lindenbaum
120k
Entering edit mode
0

ok, I think the notion of all transcripts also including alternative splicing isoforms helps a lot - thanks! Now the only question is what I do with any "extra" nucleotides beyond 3*n - if there any after the recalculation.

Chronos
• 570
Entering edit mode
0

Thanks for the code!

Chronos
• 570
Entering edit mode
0

Thanks again, your interpretation does make all cDNAs divisible.

Chronos
• 570
2
Entering edit mode

Hi chronos

it might also be worth thinking about the nature of exons, cDNA and CDS. Some of the difficulty might arise from the terminology? When you say " _Basically you're saying that to get cDNA, I have to glue together all exons, discarding everything before cdsStart and after cdsEnd, correct?_ " I think you might mean CDS not cDNA?

Exons can occur in the 5'UTR, the coding region (CDS), or the 3'UTR. All these three regions are found in the mRNA, and therefore the cDNA, but only the CDS is translated into protein in the cell. (I'm assuming that you have filtered to deal only with protein-coding transcripts, there can be a number of RNA-encoding genes with exons in the genome too). You first need to find which exons are found exclusively within the CDS and which aren't, N.B. exons can also span the boundary between the CDS and either UTR. The sum of the intra-CDS exons should be divisable by 3 (although I guess there could be some weird exceptions with RNA-editing and the like), but if you include UTR-exons there is no reason your total should be divisible by 3 since they don't code for a protein.

A single 5'UTR exons is therefore why CDS-start can be in the second exon.

I am not sure how many exons in human map to each of these 3 regions but in zebrafish it is 2% 5'UTR, 0.5% 3'UTR and 97.5% CDS.

It sounds like you might have script issues sorted out now, but I thought this description might help (a bit) too.