Dear Biostars,
How can one separate noncoding from coding transcripts in the UCSC and Ensembl annotations? For RefSeq, I use the NR_* prefix to identify noncoding transcripts and NM_* to identify protein-coding ones.
Thanks in advance.
For UCSC knownGene, you can select the transcripts where cdsStart == cdsEnd, i.e. those with no annotated coding region:
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select name,chrom,cdsStart,cdsEnd from knownGene where cdsStart=cdsEnd limit 10'
+------------+-------+----------+--------+
| name | chrom | cdsStart | cdsEnd |
+------------+-------+----------+--------+
| uc001aaa.3 | chr1 | 11873 | 11873 |
| uc010nxr.1 | chr1 | 11873 | 11873 |
| uc009vis.3 | chr1 | 14361 | 14361 |
| uc009vit.3 | chr1 | 14361 | 14361 |
| uc009viu.3 | chr1 | 14361 | 14361 |
| uc001aae.4 | chr1 | 14361 | 14361 |
| uc001aah.4 | chr1 | 14361 | 14361 |
| uc009vir.3 | chr1 | 14361 | 14361 |
| uc009viq.3 | chr1 | 14361 | 14361 |
| uc001aac.4 | chr1 | 14361 | 14361 |
+------------+-------+----------+--------+
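If you work from a downloaded table dump rather than the public MySQL server, the same test can be sketched in Python. This is only a minimal sketch: it assumes a tab-separated knownGene.txt without a header line and with the usual knownGene column order, and the two example rows are made up for illustration (only the first name and its cdsStart/cdsEnd values come from the query output above; `ucHYPO.1` is hypothetical).

```python
import csv
import io

# Assumed column order of a knownGene.txt dump (tab-separated, no header).
KNOWN_GENE_COLUMNS = [
    "name", "chrom", "strand", "txStart", "txEnd",
    "cdsStart", "cdsEnd", "exonCount", "exonStarts", "exonEnds",
    "proteinID", "alignID",
]


def split_known_gene(handle):
    """Return (coding, noncoding) lists of transcript names.

    A transcript is treated as noncoding when cdsStart == cdsEnd.
    """
    coding, noncoding = [], []
    reader = csv.DictReader(handle, fieldnames=KNOWN_GENE_COLUMNS, delimiter="\t")
    for row in reader:
        if int(row["cdsStart"]) == int(row["cdsEnd"]):
            noncoding.append(row["name"])
        else:
            coding.append(row["name"])
    return coding, noncoding


# Illustrative rows only; fields other than name/chrom/cdsStart/cdsEnd
# in the first row are placeholders, and the second row is hypothetical.
example = io.StringIO(
    "uc001aaa.3\tchr1\t+\t11873\t14409\t11873\t11873\t1\t11873,\t14409,\t\tuc001aaa.3\n"
    "ucHYPO.1\tchr1\t+\t1000\t5000\t1200\t4800\t2\t1000,4000,\t2000,5000,\t\tucHYPO.1\n"
)
coding, noncoding = split_known_gene(example)
print("coding:", coding)
print("noncoding:", noncoding)
```

With a real file you would pass `open("knownGene.txt")` instead of the in-memory example.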
You can download the Ensembl annotation from BioMart (http://useast.ensembl.org/biomart/martview/). Select the Gene Biotype attribute, which tells you whether the gene a given transcript belongs to is annotated as protein-coding or non-coding.
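Once you have a BioMart export, splitting it by biotype is a few lines of Python. The sketch below assumes a tab-separated export with a header row; the column names `Transcript stable ID` and `Transcript type` are assumptions and should be changed to whatever headers your export actually contains. The `ENSMUST_HYPOTHETICAL` row is invented for illustration.

```python
import csv
import io


def split_by_biotype(handle,
                     id_column="Transcript stable ID",   # assumed header name
                     biotype_column="Transcript type"):  # assumed header name
    """Return (coding, other) lists of IDs from a tab-separated export.

    Everything whose biotype is not "protein_coding" goes into `other`,
    which can include pseudogenes as well as noncoding RNAs.
    """
    coding, other = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row[biotype_column] == "protein_coding":
            coding.append(row[id_column])
        else:
            other.append(row[id_column])
    return coding, other


# Small in-memory stand-in for a downloaded export.
example = io.StringIO(
    "Transcript stable ID\tTranscript type\n"
    "ENSMUST00000132101\tlincRNA\n"
    "ENSMUST00000180220\tlincRNA\n"
    "ENSMUST_HYPOTHETICAL\tprotein_coding\n"
)
coding_ids, other_ids = split_by_biotype(example)
print("protein coding:", coding_ids)
print("other biotypes:", other_ids)
```

If you need strictly noncoding RNAs rather than "everything that is not protein_coding", filter `other_ids` further on the specific biotype values present in your export.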
There's an online course here: http://www.ebi.ac.uk/training/online/course/ensembl-filmed-api-workshop
I think hundreds of Ensembl lincRNA annotations are wrong. lincRNAs should be intergenic, so in principle they should not overlap any known coding transcript, irrespective of strand.
For example:
chr8 33998976 34060498 NM_001177589_Gm3985 0 - chr8 33998977 34060498 lincRNA_ENSMUSG00000079070_ENSMUST00000132101_Gm3985 0 -
chr8 33998976 34060498 NM_001177589_Gm3985 0 - chr8 34000947 34052954 lincRNA_ENSMUSG00000079070_ENSMUST00000180220_Gm3985 0 -
chr8 48265402 48437702 proteinCoding_ENSMUSG00000038143_8_Stox2 0 - chr8 48379626 48531716 lincRNA_ENSMUSG00000097922_ENSMUST00000181417_AC102862.2 0 -
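A rough way to quantify such overlaps, ignoring strand, is sketched below in Python. It uses the coordinates from the third example line and treats them as half-open intervals; it compares whole transcript spans rather than exons, which is a simplifying assumption and may differ from how any particular annotation pipeline measures overlap.

```python
def overlap_bp(a, b):
    """Overlapping bases between two (chrom, start, end) intervals.

    Strand is deliberately ignored; intervals on different
    chromosomes never overlap.
    """
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


# Coordinates taken from the Stox2 / AC102862.2 example line.
coding_tx = ("chr8", 48265402, 48437702)
linc_tx = ("chr8", 48379626, 48531716)

shared = overlap_bp(coding_tx, linc_tx)
# Fraction of the lincRNA span covered by the coding transcript span.
fraction = shared / (linc_tx[2] - linc_tx[1])
print(shared, round(fraction, 3))
```

For genome-wide checks, an interval tool that reports pairwise overlaps is more practical than comparing every pair in a loop; the function above is only meant to make the arithmetic explicit.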
Whoops yes. I googled lincRNA for a definition and didn't notice that the wiki page wasn't actually called lincRNA.
The Ensembl definition can be found here:
http://www.ensembl.org/info/docs/genebuild/ncrna.html
We include RNAs that overlap other genes by <35%
The wiki is right. The original definition came from here: http://www.ncbi.nlm.nih.gov/pubmed/19182780. Maybe you Ensembl guys need to change the name from lincRNA to lncRNA. :)
What is your input? A list of knownGene identifiers? A list of ENSGxxxxxxx?
Yes, ENS* in the case of Ensembl and ucsc.* in the case of UCSC.
ucsc.*? Can you give one example, please?