What determines the Intron/Exon boundary?
9.5 years ago
danvdk ▴ 80

I'm looking at p53 in IGV using hg38. My view window is chr17:7,648,987-7,708,368:

Introns and Exons for p53 on chr17

I understand that the thin lines are introns and the darker rectangles are exons. The short parts of the rectangles are non-coding regions and the tall parts are coding. The coding region begins with an ATG start codon (read from right to left):

Proteins within an exon

What I can't figure out is what defines the intron/exon boundaries. According to Genomes 2, the boundaries look like:

  • 5' splice site 5'-AG↓GTAAGT-3'
  • 3' splice site 5'-PyPyPyPyPyPyNCAG↓-3'

(I changed U -> T, since this is DNA; N = any base, Py = T or C)

Reading the left end of the intron on the right above (from right to left), I see CCTCTTGCAG, i.e. PyPyPyPyPyPyNCAG! So far so good.
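The comparison against the textbook consensus can be sketched as a pattern match. This is only an illustration: the regular expressions below are a literal transcription of the two motifs quoted above, and the function names are made up for this example. Real splice sites follow the consensus loosely, so a strict match like this will reject many genuine sites.

```python
import re

# Literal transcription of the consensus motifs quoted in the question
# (DNA alphabet; Py = C or T, N = any base). Assumption: exact matching,
# which is stricter than what real splice sites obey.
DONOR = re.compile(r"AGGTAAGT")             # 5' splice site, cut between AG and GT
ACCEPTOR = re.compile(r"[CT]{6}[ACGT]CAG")  # 3' splice site, cut after CAG


def matches_donor(seq: str) -> bool:
    """True if seq is exactly the strict donor consensus."""
    return DONOR.fullmatch(seq.upper()) is not None


def matches_acceptor(seq: str) -> bool:
    """True if seq is exactly six pyrimidines, any base, then CAG."""
    return ACCEPTOR.fullmatch(seq.upper()) is not None


print(matches_acceptor("CCTCTTGCAG"))  # the sequence read off the intron end
```

A looser check (for example, only requiring the terminal AG and a pyrimidine-rich stretch before it) would be closer to how the motif behaves in practice.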

But looking at the right end of the left intron, I don't see anything resembling AGGTAAGT or TGAATGGA.

Base pair sequences at the left and right edges of each intron in p53

The PyPyPyPyPyPyNCAG pattern seems to be loosely followed on the left edges: the GA is very consistent. The next base is often a C but sometimes a T. Pyrimidines (C/T) do seem to be more common in the next six slots, although As and Gs occur in these positions in 6/10 introns.

There seems to be no structure to the right edges beyond the consistent TG.

So what defines the intron boundaries? Am I reading these incorrectly?

intron gene • 13k views

Also, I forgot to mention that your question was well researched, and you clearly made a lot of effort to make sense of the data. That is pretty cool and we like that a lot here!

9.5 years ago

Your problem is that you've got this flipped around. That transcript runs backwards: TP53 is on the minus strand. You've got the reversed DNA sequence there, but what you want to be looking at is the reverse complement of the original.
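The difference between merely reversing a sequence and reverse-complementing it can be sketched in a few lines. This is a minimal illustration with hypothetical helper names, not tied to any particular library:

```python
# Translation table for complementing DNA bases (case preserved).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq: str) -> str:
    """Read the same strand backwards (NOT what a minus-strand gene needs)."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


donor = "AGGTAAGT"                # donor consensus from the question
print(reverse(donor))             # TGAATGGA
print(reverse_complement(donor))  # ACTTACCT
```

For a gene on the minus strand, the transcript-oriented sequence is the reverse complement of what the plus-strand reference shows, so that is the form to compare against the consensus motifs.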

Looking at it the right way, I'll put exon in uppercase, and intron in lowercase

ATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTgtgagtggatccattggaagggcag.....ctgactttctgctcttgtctttcagACTTCCTGAAAACAACGTTCTG

Those are fine sequences for splice donor sites and splice acceptor sites.

http://uswest.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000141510;r=17:7661779-7687550;t=ENST00000413465

That's the link where I got those sequences. Note how the first two letters of every intron are GT and the last three are CAG or TAG. Those are the most conserved letters.
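As a rough sketch, the uppercase/lowercase convention used above can be split programmatically and the intron ends checked. The sequence is copied from this answer, with the "....." elision kept as-is and treated as part of the intron; the function name is made up for this example.

```python
import re

# Exon in uppercase, intron in lowercase; the dots mark the elided middle
# of the intron exactly as written in the answer.
seq = (
    "ATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACT"
    "gtgagtggatccattggaagggcag.....ctgactttctgctcttgtctttcag"
    "ACTTCCTGAAAACAACGTTCTG"
)


def split_exons_introns(s: str):
    """Return a list of (kind, text) runs based on letter case."""
    parts = []
    for m in re.finditer(r"[A-Z]+|[a-z.]+", s):
        text = m.group()
        kind = "exon" if text[0].isupper() else "intron"
        parts.append((kind, text))
    return parts


parts = split_exons_introns(seq)
for kind, text in parts:
    if kind == "intron":
        # First two and last three bases of the intron.
        print(text[:2], text[-3:])
```

For this intron the check prints `gt cag`, matching the pattern described in the answer.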

9.5 years ago

A note: using left and right plus qualifiers such as "left end of the intron on the right above (from right to left)" is very confusing. I was unable to follow. You should call it start (or end) and show the sequences in the direction in which they exist and get transcribed. Those who don't understand the directionality of the strands won't be able to comment anyhow.

I downloaded the introns for TP53 as per A: How To Download All The Introns From Ucsc, then printed the beginning and end of each sequence (the middle is cut out) to get the list below. Some introns start with gtaag(c or t) (some entries are duplicated), but I agree that many do not. The signal is really the GT-AG; the rest may only be needed in some situations (long introns, etc.).

>hg38_knownGene_uc002gig.1 range=chr17:7662015-7676520 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagtggatccattggaagggcaggcccaccacccccaccccaacccca
--
ctccaccctggcgacaaagtgagactccgtctctctctctctctttag
>hg38_knownGene_uc002gih.4 range=chr17:7666245-7676520 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagtggatccattggaagggcaggcccaccacccccaccccaacccca
--
tatttag
>hg38_knownGene_uc010cne.1 range=chr17:7669691-7673534 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtactaagtcttgggacctcttatcaagtggaaagtttccagtctaacac
--
actcatgtgatgtcatctctcctccctgcttctgtctcctacag
>hg38_knownGene_uc010cnf.2 range=chr17:7669691-7675052 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagcagctggggctggagagacgacagggctggttgcccagggtcccc
--
gtctcctacag
>hg38_knownGene_uc010cng.2 range=chr17:7669691-7675052 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagcagctggggctggagagacgacagggctggttgcccagggtcccc
--
gtgatgtcatctctcctccctgcttctgtctcctacag
>hg38_knownGene_uc002gii.2 range=chr17:7669691-7675052 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagcagctggggctggagagacgacagggctggttgcccagggtcccc
--
ccctgcttctgtctcctacag
>hg38_knownGene_uc031qyp.1 range=chr17:7669691-7676520 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagtggatccattggaagggcaggcccaccacccccaccccaacccca
--
ctctcactcatgtgatgtcatctctcctccctgcttctgtctcctacag
>hg38_knownGene_uc010cnh.3 range=chr17:7669691-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
tctcactcatgtgatgtcatctctcctccctgcttctgtctcctacag
>hg38_knownGene_uc010cni.3 range=chr17:7669691-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
tcctccctgcttctgtctcctacag
>hg38_knownGene_uc002gim.4 range=chr17:7669691-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
gtctcctacag
>hg38_knownGene_uc002gij.3 range=chr17:7669691-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
tcctacag
>hg38_knownGene_uc031qyq.1 range=chr17:7669691-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
catgtgatgtcatctctcctccctgcttctgtctcctacag
>hg38_knownGene_uc010cnj.1 range=chr17:7674291-7675052 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagcagctggggctggagagacgacagggctggttgcccagggtcccc
--
gttatctcctag
>hg38_knownGene_uc002gin.3 range=chr17:7674291-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
tgtgttatctcctag
>hg38_knownGene_uc002gio.3 range=chr17:7674291-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
ggcgcactggcctcatcttgggcctgtgttatctcctag
>hg38_knownGene_uc010vug.3 range=chr17:7674972-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
g
>hg38_knownGene_uc010cnk.2 range=chr17:7676273-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
ctgactgctcttttcacccatctacag
>hg38_knownGene_uc010vuh.2 range=chr17:7686225-7703242 5'pad=0 3'pad=0 strand=+ repeatMasking=none
gtagattgtttttccgacaaattatcaaacgacccatcattgcactcttt
--
catctctccctcag
>hg38_knownGene_uc010vui.2 range=chr17:7687604-7703242 5'pad=0 3'pad=0 strand=+ repeatMasking=none
gcaagtaatccgcctgccggaggaagcaaaggaaatggagttggggagga
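A listing in this shape can be tallied with a short script. The sketch below assumes the layout shown above (a header line, the intron start, a `--` separator, then the intron end) and uses only three records copied from the list as sample input; the function name is hypothetical.

```python
from collections import Counter

# Three records copied from the listing above; the full list would be
# read from a file in practice.
sample = """\
>hg38_knownGene_uc002gig.1 range=chr17:7662015-7676520 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagtggatccattggaagggcaggcccaccacccccaccccaacccca
--
ctccaccctggcgacaaagtgagactccgtctctctctctctctttag
>hg38_knownGene_uc010cne.1 range=chr17:7669691-7673534 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtactaagtcttgggacctcttatcaagtggaaagtttccagtctaacac
--
actcatgtgatgtcatctctcctccctgcttctgtctcctacag
>hg38_knownGene_uc010vuh.2 range=chr17:7686225-7703242 5'pad=0 3'pad=0 strand=+ repeatMasking=none
gtagattgtttttccgacaaattatcaaacgacccatcattgcactcttt
--
catctctccctcag
"""


def intron_ends(text: str):
    """Yield (name, start_fragment, end_fragment) for each record."""
    for chunk in text.split(">")[1:]:
        lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
        name, body = lines[0].split()[0], lines[1:]
        if "--" in body:
            cut = body.index("--")
            start, end = "".join(body[:cut]), "".join(body[cut + 1:])
        else:
            # Record without a separator: only the start is available.
            start, end = "".join(body), ""
        yield name, start, end


donors, acceptors = Counter(), Counter()
for name, start, end in intron_ends(sample):
    donors[start[:2]] += 1
    if len(end) >= 2:
        acceptors[end[-2:]] += 1

print(donors, acceptors)
```

Run over the whole list, the same tally would show how often the first two and last two bases deviate from GT and AG.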

How can the GT-AG alone be sufficient? In many of the introns you've printed, there are several occurrences of ag before the terminal one. For example, in ctccaccctggcgacaaagtgagactccgtctctctctctctctttag, there are two occurrences of ag before the last.


It is a necessary but not sufficient requirement: there may be many other patterns that all contribute. Some are tabulated in the reference that you link to, others are not.

One thing that I learned from biology is that there are no absolute rules and axioms (unlike other sciences where there are ground truths to fall back on).
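To make the "necessary but not sufficient" point concrete, here is a small sketch that finds every ag in the intron end quoted in the comment above and scores the stretch before each one for pyrimidine content. The 10-base window and the scoring are arbitrary choices for illustration, not an established splice-site model.

```python
# Intron end quoted in the comment above.
intron_end = "ctccaccctggcgacaaagtgagactccgtctctctctctctctttag"

WINDOW = 10  # assumption: arbitrary window size, chosen for illustration


def ag_positions(seq: str):
    """Start indices of every 'ag' dinucleotide in seq."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "ag"]


def pyrimidine_fraction(seq: str, pos: int, window: int = WINDOW) -> float:
    """Fraction of c/t in the bases immediately preceding index pos."""
    upstream = seq[max(0, pos - window):pos]
    if not upstream:
        return 0.0
    return sum(base in "ct" for base in upstream) / len(upstream)


positions = ag_positions(intron_end)
scores = {p: pyrimidine_fraction(intron_end, p) for p in positions}
best = max(scores, key=scores.get)

for p in positions:
    print(p, round(scores[p], 2))
```

In this one example the terminal ag is the only candidate preceded by a pyrimidine-rich stretch, which is consistent with the polypyrimidine tract being one of the additional signals that contribute.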
