Differences in exon number from Biopython Entrez Package and ANNOVAR
1
0
Entering edit mode
5.8 years ago
gero.knittel ▴ 10

Hi everybody,

I have a strange problem, which for sure has something to do with the version of some reference that I'm using.

I'm counting the number of exons of the transcript NM_003752 in the following way:

from Bio import Entrez
Entrez.email = 'xxxxxx'

handle = Entrez.efetch(db='Nucleotide', id='NM_003752', retmode='xml')
record = Entrez.read(handle)
feature_count = len(record[0]['GBSeq_feature-table'])
exoncount = 0
for i in range(0, feature_count):
    if record[0]['GBSeq_feature-table'][i]['GBFeature_key'] == 'exon':
        exoncount += 1
print(exoncount)

Output is 21. Also, if I count manually on the NCBI website (https://www.ncbi.nlm.nih.gov/nuccore/557129041/), I count 21 exons.

However, if I run ANNOVAR using the following input file:

16  28391160    28391160    G   A

with the following command

annotate_variation.pl -out results -build hg19 annovar.avinput humandb/ --separate

I get this result:

line1   nonsynonymous SNV EIF3CL:NM_001317857:exon21:c.C2704T:p.R902C,EIF3C:NM_001286478:exon22:c.C2671T:p.R891C,EIF3CL:NM_001099661:exon21:c.C2704T:p.R902C,EIF3C:NM_001267574:exon21:c.C2701T:p.R901C,EIF3C:NM_001037808:exon22:c.C2701T:p.R901C,EIF3CL:NM_001317856:exon21:c.C2704T:p.R902C,EIF3C:NM_001199142:exon22:c.C2701T:p.R901C,EIF3C:NM_003752:exon22:c.C2701T:p.R901C,  16  28391160    28391160    G   A

So it says that for transcript NM_003752, the resulting mutation will be in exon 22, which it shouldn't even have.

What's the problem here?

Thanks very much for helping out!

Best, Gero

annovar entrez package biopython • 1.6k views
ADD COMMENT
1
Entering edit mode

@gero: I removed your email address from the post.

ADD REPLY
0
Entering edit mode

Thank you, that was not very considerate of me.

ADD REPLY
0
Entering edit mode

gero.knittel : You should email ANNOVAR authors and let them know if this discrepancy.

ADD REPLY
0
Entering edit mode
5.8 years ago

Hello and good evening,

ANNOVAR is correct. For that particular transcript isoform, NM_001286478, it does indeed have 22 exons. The 'split' occurs at exon 9 of the gene. In the following screenshot from the UCSC Genome Browser, this transcript is highlighted in blue (at left) and you can see that, in exon 9, a few bases are not transcribed in that and other transcripts:

f

[source: https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg19&lastVirtM...]

Kevin

ADD COMMENT
0
Entering edit mode

Kevin Blighe : OP is asking about NM_003752:exon22:c.C2701T:p.R901C which is at the end of that long line.

It could be an error with ANNOVAR's annotation correct?

ADD REPLY
0
Entering edit mode

Indeed, I was looking at the incorrect transcript. NM_001286478 (and a few other transcripts) have 22 exons, whereas NM_003752, in which Gero is interested, only has 21 exons in all transcripts that I have observed.

ADD REPLY
0
Entering edit mode

I have just checked my own version of ANNOVAR (Version: $Date: 2015-06-17 21:43:51 -0700 (Wed, 17 Jun 2015)), and it gives a different result for that transcript. It says exon 21, correctly:

line1   nonsynonymous SNV   EIF3C:NM_001037808:exon21:c.C2704T:p.R902C,EIF3CL:NM_001099661:exon21:c.C2704T:p.R902C,EIF3C:NM_001199142:exon21:c.C2704T:p.R902C,EIF3C:NM_001267574:exon20:c.C2704T:p.R902C,EIF3C:NM_001286478:exon21:c.C2674T:p.R892C,EIF3C:NM_003752:exon21:c.C2704T:p.R902C,    16  28391160    28391160    G   A

Which version of ANNOVAR are you using?

ADD REPLY
0
Entering edit mode

Hi Kevin,

I'm using the current version (Version: $Date: 2018-04-16 00:43:31 -0400 (Mon, 16 Apr 2018)).

ADD REPLY
0
Entering edit mode

What is the source for the annotation by ANNOVAR? At ensembl I cannot find any transcript with 22 exons. They all have 21 or less.

I'v seen it at least once that the data from UCSC were wrong. This leads to a problem in a custom target resequencing kit we let design by Agilent. We gave them the transcript number we liked to capture and they design the probes. For one exon, in one gene the coordinates were wrong at the end, so we could not capture this one. Investigating on that problem it turned out that UCSC had the wrong coordinates for that exon.

fin swimmer

ADD REPLY
0
Entering edit mode

well, IGV (v 2.4.10) shows it's exon 22 for transcript (for both hg19 and b37) (screenshot attached). But UCSC table data for NM_003752 (for grch37/ hg19) lists only 21 exons. Probably, this discrepancy is due to incorrect transcript version in annotation, for that build.

Screenshot_at_2018_06_21_12_56_32

ADD REPLY
0
Entering edit mode

Hm, I'm pretty sure this is some kind of version problem of the annotation source.

In my IGV (v 2.4.10) it's exon 21.

NM_003752

ADD REPLY
0
Entering edit mode

I think it is the problem with annotation source. I deleted earlier hg19.genome from IGV (genomes folder) and downloaded new one. This time, the transcript has 21 exons.

Screenshot_at_2018_06_21_14_46_49

ADD REPLY
0
Entering edit mode

In the UCSC screenshot that I showed, some also have 22 exons (I believe that they were 'UCSC genes'). Gene annotation is a constantly evolving process, as far as I understand.

ADD REPLY
0
Entering edit mode

Ladies and Gentlemen: what we are witnessing here is the complexity of the genome at play... transcription is pervasive and constant, and performed in ways we don't fully grasp in our daily research activities.

ADD REPLY

Login before adding your answer.

Traffic: 1490 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6