Question

Why Total Exon Length Of 2/3Rds Of Ucsc Knowngenes Is Non-Divisible By 3?

6

Entering edit mode

12.5 years ago

Chronos ▴ 610

When examining the divisibility of total exon length of all UCSC knownGene transcripts (77614), 66.9% of them are not divisible, and even if I adjust for both UTRs and ignore all transcripts with cdsStart = cdsEnd (presumably non-coding), I still get 50% of non-divisible cDNAs.

Here's my script to evaluate divisibility:

#!/usr/bin/python
import fileinput
print "name", "length", "%3", "adj.", "adj%3"

# subset of columns: name, txStart, txEnd, cdsStart, cdsEnd, exonsStart, exonsEnd
for line in fileinput.input("7_columns.txt"):
    name, txs, txe, cds, cde, es, ee = line.strip("\n").split("\t")
    # exon starts and ends
    exons = es.strip(",").split(",")
    exone = ee.strip(",").split(",")

    total_length = 0

    if cds != cde:
        # adjust = min(delta(txEnd, cdsEnd), delta(txEnd, last_exon_start) +
        # min(delta(cdsStart, txStart), delta(first_exon_end, txEnd)) ; 
        # without these min()'s, applying adjust yields negative lengths
        adjust = min(int(txe) - int(cde), int(txe) - int(exons[-1])) + min(int(cds) - int(txs), int(exone[0]) - int(txe))
    else:
        # this N helps filter output for non-coding entries
        print "N",
        adjust = 0

    for e_index in range(len(exons)):
        # sum up all exon lengths
        total_length += int(exone[e_index]) - int(exons[e_index])
        # adjust for UTRs
        adjusted = total_length - adjust
    if adjusted % 3 != 0:
        print name, total_length, total_length % 3, adjusted, adjusted % 3
fileinput.close()

(I counted transcripts by piping output to grep -v N | wc -l.)

I did read http://biostar.stackexchange.com/questions/7042/are-all-exon-nucleotide-sequence-counts-divisible-by-3-in-the-human-genome first.

I'd appreciate any hints, including things to try and change in the script.

As a directly related question - how can cdsStart be within the second exon? What is the meaning of the 1st exon in this case? Here's an example (cdsStart is 3bps up from the end of 2nd exon):

select name, strand, txstart, txend, cdsstart, cdsend, exonstarts, exonends from knowngene where name='uc001aau.2';
name       | strand | txstart | txend  | cdsstart | cdsend |      exonstarts       |       exonends        
------------+--------+---------+--------+----------+--------+-----------------------+-----------------------
uc001aau.2 | +      |  323891 | 328580 |   324342 | 325605 | 323891,324287,324438, | 324060,324345,328580,

ucsc exon length codon • 3.8k views

ADD COMMENT • link updated 12.5 years ago by Dave Lunt ★ 2.0k • written 12.5 years ago by Chronos ▴ 610

0

Entering edit mode

this would only be alarming if you could find an example where no splicing isoforms exist that are divisible by 3

ADD REPLY • link 12.5 years ago by Jeremy Leipzig 22k

0

Entering edit mode

Based on what Pierre has written below, it looks like UCSC transcripts themselves represent alternative splicing isoforms. How do I treat the "extra" nucleotides in non-divisible isoforms - just discard them? That doesn't seem right.

ADD REPLY • link 12.5 years ago by Chronos ▴ 610

score 9 · Answer 1 · 2011-11-08

9

Entering edit mode

12.5 years ago

Pierre Lindenbaum 161k

I think there is a bug in your program: the translation can start in any exon, and not necessarily in the first exon. In that case, the 5' UTR is the result of the concatenation of several exons.

edit:

using the following awk script, I got only length%3==0:

BEGIN   {
    FS="[\t]";
    }

    {
    cdsStart=int($6);
    cdsEnd=int($7);
    if(cdsStart>=cdsEnd) next;
    exonCount=int($8);
    split($9,exonStarts,"[,]");
    split($10,exonEnds,"[,]");
    len=0;

    for(i=1;i<= exonCount;i++)
        {
        if(cdsStart>=exonEnds[i])
            {
            continue;
            }
        if(cdsEnd<exonStarts[i])
            {
            continue;
            }
        beg=exonStarts[i];
        if(cdsStart>=exonStarts[i] && cdsStart<exonEnds[i])
            {
            beg=cdsStart;
            }
        end=exonEnds[i];
        if(cdsEnd>=exonStarts[i] && cdsEnd<exonEnds[i])
            {
            end=cdsEnd;
            } 
        len+=(end-beg);
        }
    printf("%s\t%d\t%d\n",$3,len,len%3);
    }

usage:

mysql -N  (...) -D hg19 -e 'select * from knownGene  ' | awk -f script.awk

ADD COMMENT • link 12.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Basically you're saying that to get cDNA, I have to glue together all exons, discarding everything before cdsStart and after cdsEnd, correct? I'll try that and will report back.

If translation starts in e.g. the 2nd exon - shouldn't everything before it be considered either an intron or a UTR or both? Or do UCSC transcripts also include alternative transcripts? (This hasn't occurred to me before, I'm more used to Ensembl's gene model with sub-transcripts.)

ADD REPLY • link 12.5 years ago by Chronos ▴ 610

0

Entering edit mode

I've updated my answer. See above.

ADD REPLY • link 12.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

ok, I think the notion of all transcripts also including alternative splicing isoforms helps a lot - thanks! Now the only question is what I do with any "extra" nucleotides beyond 3*n - if there any after the recalculation.

ADD REPLY • link 12.5 years ago by Chronos ▴ 610

0

Entering edit mode

Thanks for the code!

ADD REPLY • link 12.5 years ago by Chronos ▴ 610

0

Entering edit mode

Thanks again, your interpretation does make all cDNAs divisible.

ADD REPLY • link 12.5 years ago by Chronos ▴ 610

score 2 · Answer 2 · 2011-11-08

Hi chronos

it might also be worth thinking about the nature of exons, cDNA and CDS. Some of the difficulty might arise from the terminology? When you say "Basically you're saying that to get cDNA, I have to glue together all exons, discarding everything before cdsStart and after cdsEnd, correct?" I think you might mean CDS not cDNA?

Exons can occur in the 5'UTR, the coding region (CDS), or the 3'UTR. All these three regions are found in the mRNA, and therefore the cDNA, but only the CDS is translated into protein in the cell. (I'm assuming that you have filtered to deal only with protein-coding transcripts, there can be a number of RNA-encoding genes with exons in the genome too). You first need to find which exons are found exclusively within the CDS and which aren't, N.B. exons can also span the boundary between the CDS and either UTR. The sum of the intra-CDS exons should be divisable by 3 (although I guess there could be some weird exceptions with RNA-editing and the like), but if you include UTR-exons there is no reason your total should be divisible by 3 since they don't code for a protein.

A single 5'UTR exons is therefore why CDS-start can be in the second exon.

I am not sure how many exons in human map to each of these 3 regions but in zebrafish it is 2% 5'UTR, 0.5% 3'UTR and 97.5% CDS.

It sounds like you might have script issues sorted out now, but I thought this description might help (a bit) too.