How to convert exon genomic coordinates to protein coordinates
8.6 years ago

I am trying to convert exon start/end positions from genomic coordinates to protein coordinates. I have been using the Ensembl BioMart exon attributes Genomic coding start/end. I assume these give the start and end of only the coding part of each exon. So if, for each gene, I start from exon 1 and count in threes from the Genomic coding start, I should get a whole number of amino acids (and could then convert easily from genomic to protein coordinates). In other words, (Genomic coding start - Genomic coding end) should be divisible by 3. However, it is not.

I have tried the CDS start/end and cDNA start/end attributes too, but the same thing happens. Any clue what is going on here? Thank you.

gene genome exon • 4.0k views

Start from exon 1 and I start counting by three from Genomic coding start I should have an exact number of aminoacids

You're wrong: exons may contain UTRs

8.6 years ago
Juke34 8.5k

First of all, not all exons are coding (some are entirely UTR), and some are only partially coding. So you have to work with CDS coordinates.

Secondly, to calculate a length you must do (end-start+1).
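
As a minimal sketch of that length rule, assuming 1-based coordinates where both start and end are included in the interval (the function name `segment_length` is just illustrative, and the second pair of coordinates is taken from the GATA3 example later in this thread):

```python
def segment_length(start: int, end: int) -> int:
    """Length in bases of a 1-based interval that includes both endpoints."""
    return end - start + 1


# A single codon spanning positions 100, 101 and 102 is 3 bases long,
# even though end - start is only 2.
print(segment_length(100, 102))

# One coding segment from the GATA3 example in this thread.
print(segment_length(8100268, 8100804))
```

Without the + 1 every segment comes out one base short, which is enough on its own to break a divisible-by-3 check.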

Best

I forgot to add the + 1 in my comment, sorry. About UTRs: as far as I know they are not considered coding, so they should be excluded from the Genomic coding start/end attributes, right? I have tried using CDS coordinates but the same thing happens (e.g. the GATA3 canonical transcript in Ensembl v70).

What about "start - end" instead of "end - start"? Did you notice that error?

Right, UTRs are outside the genomic coding start/end. The thing is, the gene has several CDS segments (5 in human), so you should compute the length of each of them, sum all the lengths, and then divide the total by 3. Did you do that?


The thing is, I am not interested in the length of the exons in amino acids... I was using this measure to check whether I could translate directly from exon start/end genomic coordinates to protein coordinates, as in this example for GATA3 (where EXON_START is the Genomic coding start of the exon):

ENSG             ENST             EXON_START  EXON_END  EXON  (EXON_END-EXON_START+1)/3
ENSG00000107485  ENST00000379328                        1
ENSG00000107485  ENST00000379328  8097619     8097859   2     80.3333333333
ENSG00000107485  ENST00000379328  8100268     8100804   3     179
ENSG00000107485  ENST00000379328  8105956     8106101   4     48.6666666667
ENSG00000107485  ENST00000379328  8111436     8111561   5     42
ENSG00000107485  ENST00000379328  8115702     8115986   6     95

My idea was that 8097619 is the first genomic coding position of GATA3, so 8097619-8097621 corresponds to protein position 1, 8097622-8097624 to protein position 2, etc. However, this cannot be true if the exon length according to the genomic coding start and end is not divisible by 3... (The same happens if you use CDS coordinates.)


Thanks for the example. It will make this easier to explain.

With the genomic coding start and end you are right that the length is divisible by 3, BUT ONLY THE TOTAL! Some codons are split across two exons, so the coding length of an individual exon does not have to be a multiple of 3.

Your total is 445 codons.

Total length = 241 + 537 + 146 + 126 + 285 = 1335

And 1335 / 3 = 445.

I hope you understand it now. :)
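
To make the split-codon point concrete, here is a rough sketch of the genomic-to-protein conversion that keeps a running offset of the coding bases seen in earlier segments. It assumes 1-based inclusive coordinates and a forward-strand transcript (consistent with the ascending coordinates in the table above; a reverse-strand gene would need the segments walked in the opposite direction). The names `CODING_SEGMENTS` and `genomic_to_protein` are made up for illustration and are not part of any Ensembl API.

```python
# Coding segments (start, end) copied from the GATA3 table in this thread.
# Assumption: 1-based, inclusive, listed in transcript order on the forward strand.
CODING_SEGMENTS = [
    (8097619, 8097859),
    (8100268, 8100804),
    (8105956, 8106101),
    (8111436, 8111561),
    (8115702, 8115986),
]


def genomic_to_protein(pos, segments):
    """Map a genomic position to (codon number, base within codon), both 1-based.

    Returns None if the position is not inside any coding segment.
    """
    offset = 0  # number of coding bases in the segments before the current one
    for start, end in segments:
        if start <= pos <= end:
            cds_index = offset + (pos - start)  # 0-based position along the CDS
            return cds_index // 3 + 1, cds_index % 3 + 1
        offset += end - start + 1
    return None


total = sum(end - start + 1 for start, end in CODING_SEGMENTS)
print(total, total // 3, total % 3)

# The last coding base of exon 2 and the first coding base of exon 3
# land in the same codon, at different positions within it.
print(genomic_to_protein(8097859, CODING_SEGMENTS))
print(genomic_to_protein(8100268, CODING_SEGMENTS))
```

Because the offset is carried across segments, each exon simply picks up the codon where the previous one stopped, so per-exon lengths never need to be multiples of 3.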

I see, I understand it now. Thanks!!!