Question

Refseq Transcript Version Numbers In Ucsc

8

11.7 years ago

Mahdi Sarmady ▴ 310

I found that the RefSeq transcript IDs in UCSC do not carry a version number. Below are all the rows for the "same" transcript (from the UCSC refGene table). The positions and exon counts are wildly different. According to NCBI there are seven versions of this particular transcript: http://www.ncbi.nlm.nih.gov/nuccore/NM_000500

  id   | strand |  start   |   end    | coding_start | coding_end | coding_start_status | coding_end_status | exon_count | refseq_id | gene_id
-------+--------+----------+----------+--------------+------------+---------------------+-------------------+------------+-----------+---------
  2438 | +      |  3355551 |  3356557 |      3355551 |    3356021 | incmpl              | cmpl              |          2 | NM_000500 |    5448
  2439 | +      |  3385938 |  3389289 |      3386045 |    3388753 | cmpl                | cmpl              |         10 | NM_000500 |    5448
  2440 | +      |  3267147 |  3270502 |      3267254 |    3269966 | cmpl                | cmpl              |         10 | NM_000500 |    5448
  2441 | +      |  3306069 |  3342156 |      3306176 |    3341620 | cmpl                | cmpl              |         10 | NM_000500 |    5448
  2442 | +      |  3306853 |  3309423 |      3306853 |    3308887 | incmpl              | cmpl              |          9 | NM_000500 |    5448
 16023 | +      | 31973359 | 31976713 |     31973466 |   31976177 | cmpl                | cmpl              |         11 | NM_000500 |    5448
 16024 | +      |  3476744 |  3480099 |      3476851 |    3479563 | cmpl                | cmpl              |         10 | NM_000500 |    5448
 16025 | +      |  3258942 |  3262296 |      3259049 |    3261760 | cmpl                | cmpl              |         11 | NM_000500 |    5448
 16026 | +      |  3285309 |  3288664 |      3285416 |    3288128 | cmpl                | cmpl              |         10 | NM_000500 |    5448
 16061 | +      | 32006093 | 32009448 |     32006200 |   32008912 | cmpl                | cmpl              |         10 | NM_000500 |    5448

Does UCSC keep track of the RefSeq versions for the refGene table?

refseq transcript ucsc annotation • 7.3k views

For clarity/completeness, can you please explain how you obtained this result from the UCSC refGene table?

11.6 years ago by Malachi Griffith 19k

We loaded the data into our own defined table with internal surrogate keys.

11.6 years ago by Byron • 0

score 9 · Answer 1 · 2012-08-31

Yes, UCSC does indeed track RefSeq versions for the refGene table. In the browser you can see this by clicking on your RefSeq transcript.

To get them through the Table Browser you will have to join another table, though. Specifically, you can join 'gbCdnaInfo' or 'gbStatus'. In both cases the join is via refGene.name, and both tables contain the 'version' field. See attached screenshot.

To get them by a mysql query at the command line you can do something like this:

mysql --user=genomep --password=password --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select distinct refGene.name,gbCdnaInfo.version from refGene,gbCdnaInfo WHERE refGene.name=gbCdnaInfo.acc AND refGene.name="NM_000500"'

+-----------+---------+
| name      | version |
+-----------+---------+
| NM_000500 |       7 |
+-----------+---------+

This reveals that only one version of this transcript is stored in the current UCSC database: NM_000500.7 (i.e. the latest version right now). The reason you are seeing so many rows becomes clear with a further refinement of the MySQL query above. Try this:

mysql --user=genomep --password=password --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select distinct refGene.name,gbCdnaInfo.version,refGene.chrom,refGene.txStart,refGene.txEnd from refGene,gbCdnaInfo WHERE refGene.name=gbCdnaInfo.acc AND refGene.name="NM_000500" order by refGene.chrom,refGene.txStart'

+-----------+---------+----------------+----------+----------+
| name      | version | chrom          | txStart  | txEnd    |
+-----------+---------+----------------+----------+----------+
| NM_000500 |       7 | chr6           | 31973358 | 31976712 |
| NM_000500 |       7 | chr6           | 32006092 | 32009447 |
| NM_000500 |       7 | chr6_cox_hap2  |  3476743 |  3480098 |
| NM_000500 |       7 | chr6_dbb_hap3  |  3258941 |  3262295 |
| NM_000500 |       7 | chr6_dbb_hap3  |  3285308 |  3288663 |
| NM_000500 |       7 | chr6_mcf_hap5  |  3355550 |  3356556 |
| NM_000500 |       7 | chr6_mcf_hap5  |  3385937 |  3389288 |
| NM_000500 |       7 | chr6_qbl_hap6  |  3267146 |  3270501 |
| NM_000500 |       7 | chr6_ssto_hap7 |  3306068 |  3342155 |
| NM_000500 |       7 | chr6_ssto_hap7 |  3306852 |  3309422 |
+-----------+---------+----------------+----------+----------+

As you can see, these entries do not correspond to multiple versions of the same transcript (from a transcript sequence perspective, only one is being considered) but rather to multiple alignments of that transcript to the genome. Your gene of interest happens to lie in a challenging region of the genome that has been the subject of haplotype investigation, and the resulting alternate haplotype sequences have been incorporated into the human reference assembly by the Genome Reference Consortium (GRC).
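To make the "one version, many placements" point concrete, here is a minimal Python sketch that tallies the rows from the query output above by sequence name. The rows are copied by hand from that output; the variable names are only illustrative.

```python
from collections import Counter

# (name, version, chrom, txStart, txEnd), copied from the refGene/gbCdnaInfo
# query output shown above.
rows = [
    ("NM_000500", 7, "chr6", 31973358, 31976712),
    ("NM_000500", 7, "chr6", 32006092, 32009447),
    ("NM_000500", 7, "chr6_cox_hap2", 3476743, 3480098),
    ("NM_000500", 7, "chr6_dbb_hap3", 3258941, 3262295),
    ("NM_000500", 7, "chr6_dbb_hap3", 3285308, 3288663),
    ("NM_000500", 7, "chr6_mcf_hap5", 3355550, 3356556),
    ("NM_000500", 7, "chr6_mcf_hap5", 3385937, 3389288),
    ("NM_000500", 7, "chr6_qbl_hap6", 3267146, 3270501),
    ("NM_000500", 7, "chr6_ssto_hap7", 3306068, 3342155),
    ("NM_000500", 7, "chr6_ssto_hap7", 3306852, 3309422),
]

# Distinct transcript versions across all rows.
versions = {version for _, version, _, _, _ in rows}

# Number of placements (alignments) on each sequence.
placements = Counter(chrom for _, _, chrom, _, _ in rows)

# Alternate haplotype sequences are everything other than the primary chr6.
alt_haplotypes = sorted(c for c in placements if c != "chr6")

print("distinct versions:", sorted(versions))
for chrom, count in sorted(placements.items()):
    print(f"{chrom}\t{count}")
print("alternate haplotypes:", len(alt_haplotypes))
```

The tally shows a single version spread over the primary chr6 and the alternate haplotype sequences, with some sequences carrying two placements and others only one.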

Try going to the UCSC genome browser (hg19) and entering these coordinates: chr6:31,973,359-32,009,447. You will see that the positioning of NM_000500 is ambiguous and corresponds to at least two places in hg19. It also maps to five alternate haplotype chromosomes of chr6. In some of those haplotypes, the apparent tandem duplication is maintained. In a few it is not.

As a final sanity check, retrieve the sequence of NM_000500 and perform a BLAT with that sequence:

NM_000500         2122     1  2131  2131 100.0%     6   +   32006093  32009447   3355
NM_000500         2116     1  2131  2131  99.9%  6_qbl_hap6   +    3267147   3270501   3355
NM_000500         2116     1  2131  2131  99.9%  6_cox_hap2   +    3476744   3480098   3355
NM_000500         2110     1  2131  2131  99.8%  6_ssto_hap7   +    3306069   3342155  36087
NM_000500         2108     1  2131  2131  99.8%  6_mcf_hap5   +    3385938   3389288   3351
NM_000500         2104     1  2131  2131  99.6%  6_dbb_hap3   +    3285309   3288663   3355
NM_000500         2064     1  2131  2131  98.9%  6_dbb_hap3   +    3258942   3262295   3354
NM_000500         2064     1  2131  2131  98.9%     6   +   31973359  31976712   3354
NM_000500         1675   400  2131  2131  98.8%  6_ssto_hap7   +    3306853   3309422   2570
NM_000500          894  1223  2131  2131  99.3%  6_mcf_hap5   +    3355551   3356556   1006

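I list the top 10 alignments from BLAT (ordered by BLAT score this time instead of chromosome position), all with a percent identity of at least 98.8%. The BLAT start coordinates are 1-based, so each is one greater than the corresponding 0-based txStart in refGene. With that offset taken into account, these alignments appear to have a 1-to-1 relationship with the records stored in the refGene table.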
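The correspondence between the BLAT hits and the refGene rows can be checked mechanically. The sketch below is only an illustration using the numbers quoted in this answer. It assumes that the BLAT listing reports 1-based start positions without the "chr" prefix, while refGene stores 0-based txStart values, which is consistent with the coordinates shown. The helper name blat_to_ucsc is made up for this example.

```python
# (chrom, txStart, txEnd) from the refGene query output earlier in this answer.
ref_gene = {
    ("chr6", 31973358, 31976712),
    ("chr6", 32006092, 32009447),
    ("chr6_cox_hap2", 3476743, 3480098),
    ("chr6_dbb_hap3", 3258941, 3262295),
    ("chr6_dbb_hap3", 3285308, 3288663),
    ("chr6_mcf_hap5", 3355550, 3356556),
    ("chr6_mcf_hap5", 3385937, 3389288),
    ("chr6_qbl_hap6", 3267146, 3270501),
    ("chr6_ssto_hap7", 3306068, 3342155),
    ("chr6_ssto_hap7", 3306852, 3309422),
}

# (chrom, start, end) from the BLAT listing above.
blat_hits = [
    ("6", 32006093, 32009447),
    ("6_qbl_hap6", 3267147, 3270501),
    ("6_cox_hap2", 3476744, 3480098),
    ("6_ssto_hap7", 3306069, 3342155),
    ("6_mcf_hap5", 3385938, 3389288),
    ("6_dbb_hap3", 3285309, 3288663),
    ("6_dbb_hap3", 3258942, 3262295),
    ("6", 31973359, 31976712),
    ("6_ssto_hap7", 3306853, 3309422),
    ("6_mcf_hap5", 3355551, 3356556),
]


def blat_to_ucsc(chrom, start, end):
    """Convert a BLAT hit to table-style coordinates (hypothetical helper).

    Adds the "chr" prefix and shifts the 1-based start to a 0-based start;
    the end coordinate is unchanged.
    """
    return ("chr" + chrom, start - 1, end)


converted = {blat_to_ucsc(*hit) for hit in blat_hits}

# Anything left in either difference would be an unmatched record.
print("BLAT hits without a refGene row:", sorted(converted - ref_gene))
print("refGene rows without a BLAT hit:", sorted(ref_gene - converted))
```

Both differences come out empty for the data in this thread, meaning every BLAT hit pairs with exactly one refGene row.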

score 0 · Answer 2 · 2016-07-20

7.8 years ago

Maximilian Haeussler ★ 1.6k

The fact that the RefSeq tables don't have the version right in them is often annoying. A typical MySQL query concatenates the version number directly onto the RefSeq accession:

hgsql hg38 -NBe 'select CONCAT(refseq, ".", version), kgId from kgXref, gbCdnaInfo where gbCdnaInfo.acc=kgXref.refseq' > refseqToUcscId.tab
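For readers without access to a UCSC database, the same join-and-concatenate pattern can be tried offline. The sketch below uses Python's built-in sqlite3 module with toy stand-ins for kgXref and gbCdnaInfo. The two-column schemas and the kgId value are hypothetical simplifications (the real tables have many more columns), and SQLite's || operator is used in place of MySQL's CONCAT. Only the NM_000500 / version 7 pair comes from this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy tables holding only the columns the query touches (hypothetical,
# simplified schemas; not the real UCSC table definitions).
conn.executescript(
    """
    CREATE TABLE kgXref (kgId TEXT, refseq TEXT);
    CREATE TABLE gbCdnaInfo (acc TEXT, version INTEGER);
    INSERT INTO kgXref VALUES ('uc_hypothetical.1', 'NM_000500');
    INSERT INTO gbCdnaInfo VALUES ('NM_000500', 7);
    """
)

# Join on the unversioned accession and append ".<version>" to it.
rows = conn.execute(
    """
    SELECT kgXref.refseq || '.' || gbCdnaInfo.version AS versioned_acc,
           kgXref.kgId
    FROM kgXref
    JOIN gbCdnaInfo ON gbCdnaInfo.acc = kgXref.refseq
    """
).fetchall()

for versioned_acc, kg_id in rows:
    print(f"{versioned_acc}\t{kg_id}")

conn.close()
```

The output is tab-separated, like the refseqToUcscId.tab file written by the hgsql command above.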
