Question

What determines the Intron/Exon boundary?

9.5 years ago

danvdk ▴ 80

I'm looking at p53 in IGV using hg38. My view window is chr17:7,648,987-7,708,368:

Introns and Exons for p53 on chr17

I understand that the thin lines are introns and the darker rectangles are exons. The short parts of the rectangles are non-coding regions and the tall parts are coding. The coding region begins with an ATG start codon (read from right to left):

Proteins within an exon

What I can't figure out is what defines the intron/exon boundaries. According to Genomes 2, the boundaries look like:

5' splice site 5'-AG↓GTAAGT-3'
3' splice site 5'-PyPyPyPyPyPyNCAG↓-3'

(I changed U -> T, since this is DNA; N = any base, Py = T or C)
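To make the two consensus patterns concrete, here is a minimal sketch that writes them as regular expressions, assuming the motifs exactly as quoted above. The names `DONOR`, `ACCEPTOR`, `matches_donor` and `matches_acceptor` are made up for illustration; real splice sites are far looser than these strict patterns.

```python
import re

# Consensus motifs as quoted in the question (DNA alphabet).
# Py = C or T, N = any base. Strict matching is an assumption made
# for illustration; real sites tolerate many mismatches.
DONOR = re.compile(r"AGGTAAGT")              # 5' splice site: AG | GTAAGT
ACCEPTOR = re.compile(r"[CT]{6}[ACGT]CAG")   # 3' splice site: Py x6, N, CAG |


def matches_donor(seq: str) -> bool:
    """True if seq is exactly the strict 5' splice site consensus."""
    return DONOR.fullmatch(seq.upper()) is not None


def matches_acceptor(seq: str) -> bool:
    """True if seq is exactly the strict 3' splice site consensus."""
    return ACCEPTOR.fullmatch(seq.upper()) is not None


if __name__ == "__main__":
    print(matches_acceptor("CCTCTTGCAG"))  # the 10-mer read at the intron end
    print(matches_donor("TGAATGGA"))       # the reversed donor string
```

Only the strict consensus passes, which is why most real boundaries fail such a test even though they are perfectly functional.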

Reading the left end of the intron on the right above (from right to left), I see CCTCTTGCAG, i.e. PyPyPyPyPyPyNCAG! So far so good.

But looking at the right end of the left intron, I don't see anything resembling AGGTAAGT or TGAATGGA.

Base pair sequences at the left and right edges of each intron in p53

The PyPyPyPyPyPyNCAG pattern seems to be loosely followed on the left edges: The GA is very consistent. The next base is often a C but sometimes a T. And the Pyrimidines (C/T) do seem to be more common in the next six slots, although As and Gs occur in these positions in 6/10 introns.

There seems to be no structure to the right edges beyond the consistent TG.

So what defines the intron boundaries? Am I reading these incorrectly?

intron gene • 13k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.5 years ago by danvdk ▴ 80

Also, I forgot to mention that your question was well researched, and you clearly made a lot of effort to make sense of the data. That is pretty cool, and we like that a lot here!

ADD REPLY • link 9.5 years ago by Istvan Albert 100k

Ram · Answer 1 · 2014-10-09

Your problem is that you've got this flipped around. That transcript runs backwards, i.e. it is on the reverse strand. You've got the reversed DNA sequence there, but you want to be looking at the reverse complement of the original.
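The difference between reversing and reverse-complementing can be sketched in a few lines. The helper name `reverse_complement` and the example string are assumptions for illustration: `ACTCAC` is simply what the forward strand would show across an intron that starts with `gtgagt` on the reverse strand.

```python
# Translation table for complementing DNA, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


# Hypothetical forward-strand readout covering the first six intron bases
# of a reverse-strand transcript.
forward = "ACTCAC"

print(forward[::-1])                # CACTCA  (reversed only: no GT in sight)
print(reverse_complement(forward))  # GTGAGT  (the donor GT appears)
```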

Looking at it the right way, with exons in uppercase and introns in lowercase:

ATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTgtgagtggatccattggaagggcag.....ctgactttctgctcttgtctttcagACTTCCTGAAAACAACGTTCTG

Those are fine sequences for splice donor sites and splice acceptor sites.

http://uswest.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000141510;r=17:7661779-7687550;t=ENST00000413465

That's the link where I got those sequences. Note how the first two letters of every intron are GT and the last three are CAG or TAG. Those are the most conserved letters.
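As a rough check of that observation, the sketch below pulls the lowercase (intron) run out of an excerpt of the sequence quoted in this answer and reports its first two and last three bases. The function name `intron_ends` is hypothetical, and the dots standing in for the omitted middle of the intron are treated as part of the run.

```python
import re

# Excerpt of the sequence quoted in this answer: exon in uppercase,
# intron in lowercase, dots marking the omitted middle of the intron.
excerpt = (
    "GGAAACTgtgagtggatccattggaagggcag....."
    "ctgactttctgctcttgtctttcagACTTCCTG"
)


def intron_ends(seq: str) -> list[tuple[str, str]]:
    """Return (first two bases, last three bases) for each lowercase run."""
    runs = re.findall(r"[a-z.]+", seq)
    return [(run[:2], run[-3:]) for run in runs]


print(intron_ends(excerpt))  # [('gt', 'cag')]
```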

Ram · Answer 2 · 2014-10-09

A note: using left and right plus qualifiers such as "left end of the intron on the right above (from right to left)" is very confusing. I was unable to follow. You should call it start (or end) and show the sequences in the direction in which they exist and get transcribed. Those who don't understand the directionality of the strands won't be able to comment anyhow.

I downloaded the introns for TP53 as per "A: How To Download All The Introns From Ucsc", then printed the beginning and end of each sequence (the middle is cut out; the first example is in bold). I got the list below. Some introns start with gtaag(c or t) (there are some duplicated entries), but I agree that many do not. The signal is really the GT-AG; the rest may only be needed in some situations (long introns, etc.).
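For anyone who wants to tally the intron ends rather than eyeball them, here is a minimal sketch, assuming the same layout as the listing in this answer (a `>` header, the start of the intron, a `--` separator, then the end of the intron). The names `parse_listing` and `tally_ends` are hypothetical, and the sample holds only three records copied from the listing.

```python
from collections import Counter

# Three records copied from the listing in this answer.
SAMPLE = """\
>hg38_knownGene_uc002gih.4 range=chr17:7666245-7676520 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagtggatccattggaagggcaggcccaccacccccaccccaacccca
--
tatttag
>hg38_knownGene_uc010cnj.1 range=chr17:7674291-7675052 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagcagctggggctggagagacgacagggctggttgcccagggtcccc
--
gttatctcctag
>hg38_knownGene_uc010vuh.2 range=chr17:7686225-7703242 5'pad=0 3'pad=0 strand=+ repeatMasking=none
gtagattgtttttccgacaaattatcaaacgacccatcattgcactcttt
--
catctctccctcag
"""


def parse_listing(text: str) -> list[tuple[str, str, str]]:
    """Return (name, intron start, intron end) for each '>' record."""
    records = []
    for block in text.strip().split(">")[1:]:
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        name = lines[0].split()[0]
        seqs = [ln for ln in lines[1:] if ln != "--"]
        start = seqs[0] if seqs else ""
        end = seqs[1] if len(seqs) > 1 else ""  # tolerate a truncated record
        records.append((name, start, end))
    return records


def tally_ends(records):
    """Count the first two and last three bases across all records."""
    starts = Counter(start[:2] for _, start, _ in records if start)
    ends = Counter(end[-3:] for _, _, end in records if end)
    return starts, ends


starts, ends = tally_ends(parse_listing(SAMPLE))
print(dict(starts))  # {'gt': 3}
print(dict(ends))    # {'tag': 2, 'cag': 1}
```

Counting only the outermost bases keeps the focus on the GT-AG signal rather than the looser flanking consensus.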

>hg38_knownGene_uc002gig.1 range=chr17:7662015-7676520 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagtggatccattggaagggcaggcccaccacccccaccccaacccca
--
ctccaccctggcgacaaagtgagactccgtctctctctctctctttag
>hg38_knownGene_uc002gih.4 range=chr17:7666245-7676520 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagtggatccattggaagggcaggcccaccacccccaccccaacccca
--
tatttag
>hg38_knownGene_uc010cne.1 range=chr17:7669691-7673534 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtactaagtcttgggacctcttatcaagtggaaagtttccagtctaacac
--
actcatgtgatgtcatctctcctccctgcttctgtctcctacag
>hg38_knownGene_uc010cnf.2 range=chr17:7669691-7675052 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagcagctggggctggagagacgacagggctggttgcccagggtcccc
--
gtctcctacag
>hg38_knownGene_uc010cng.2 range=chr17:7669691-7675052 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagcagctggggctggagagacgacagggctggttgcccagggtcccc
--
gtgatgtcatctctcctccctgcttctgtctcctacag
>hg38_knownGene_uc002gii.2 range=chr17:7669691-7675052 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagcagctggggctggagagacgacagggctggttgcccagggtcccc
--
ccctgcttctgtctcctacag
>hg38_knownGene_uc031qyp.1 range=chr17:7669691-7676520 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagtggatccattggaagggcaggcccaccacccccaccccaacccca
--
ctctcactcatgtgatgtcatctctcctccctgcttctgtctcctacag
>hg38_knownGene_uc010cnh.3 range=chr17:7669691-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
tctcactcatgtgatgtcatctctcctccctgcttctgtctcctacag
>hg38_knownGene_uc010cni.3 range=chr17:7669691-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
tcctccctgcttctgtctcctacag
>hg38_knownGene_uc002gim.4 range=chr17:7669691-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
gtctcctacag
>hg38_knownGene_uc002gij.3 range=chr17:7669691-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
tcctacag
>hg38_knownGene_uc031qyq.1 range=chr17:7669691-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
catgtgatgtcatctctcctccctgcttctgtctcctacag
>hg38_knownGene_uc010cnj.1 range=chr17:7674291-7675052 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtgagcagctggggctggagagacgacagggctggttgcccagggtcccc
--
gttatctcctag
>hg38_knownGene_uc002gin.3 range=chr17:7674291-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
tgtgttatctcctag
>hg38_knownGene_uc002gio.3 range=chr17:7674291-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
ggcgcactggcctcatcttgggcctgtgttatctcctag
>hg38_knownGene_uc010vug.3 range=chr17:7674972-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
g
>hg38_knownGene_uc010cnk.2 range=chr17:7676273-7687376 5'pad=0 3'pad=0 strand=- repeatMasking=none
gtaagctcctgactgaacttgatgagtcctctctgagtcacgggctctcg
--
ctgactgctcttttcacccatctacag
>hg38_knownGene_uc010vuh.2 range=chr17:7686225-7703242 5'pad=0 3'pad=0 strand=+ repeatMasking=none
gtagattgtttttccgacaaattatcaaacgacccatcattgcactcttt
--
catctctccctcag
>hg38_knownGene_uc010vui.2 range=chr17:7687604-7703242 5'pad=0 3'pad=0 strand=+ repeatMasking=none
gcaagtaatccgcctgccggaggaagcaaaggaaatggagttggggagga