Can't Make Sense Of Gene Symbols
19 months ago

Sorry in advance if I am doing something completely meaningless, as I am very new to Gene Ontology and bioinformatics.

I am using Gene Ontology for a small learning project, and my ultimate goal is to obtain a list of Entrez Gene IDs. However, it is not possible to obtain Entrez IDs from the database, so I have to work with gene symbols.

I am using the mygene Python library to get the Entrez IDs of some symbols. With mygene I can check the symbols and aliases and get Entrez IDs. However, I can't convert every gene symbol to an Entrez ID this way. Even when I set the scope and fields to "all", some symbols return no hits.
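
For reference, here is a minimal sketch of the kind of lookup described above. It assumes the `mygene` package and its `querymany` method called with `returnall=True`, which returns a dictionary whose `out` entry holds one record per hit. The helper name `symbols_to_entrez` and the choice of scopes are illustrative assumptions, not necessarily what was run for this question.

```python
def symbols_to_entrez(symbols, client, species="human"):
    """Map symbols to Entrez Gene IDs.

    `client` is any object with a mygene-style querymany() method
    (for example mygene.MyGeneInfo()). Returns (mapping, unmatched),
    where mapping is {symbol: set of Entrez IDs as strings}.
    """
    result = client.querymany(
        symbols,
        scopes="symbol,alias",  # search official symbols and aliases
        fields="entrezgene",    # only ask for the Entrez Gene ID
        species=species,
        returnall=True,         # return a dict with an "out" list
    )
    mapping = {}
    for hit in result["out"]:
        # Skip inputs that were not found or that have no Entrez ID.
        if hit.get("notfound") or "entrezgene" not in hit:
            continue
        mapping.setdefault(hit["query"], set()).add(str(hit["entrezgene"]))
    # Anything without at least one Entrez ID counts as unmatched.
    unmatched = sorted(set(symbols) - set(mapping))
    return mapping, unmatched
```

With the real service this would be called as `symbols_to_entrez(my_symbols, mygene.MyGeneInfo())`, where `my_symbols` is a hypothetical list of input strings. Keeping the unmatched inputs as a separate list makes it easy to inspect what kind of identifiers they are.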

Any suggestions on converting gene symbols to Entrez IDs are appreciated.

Or if anybody could help me make sense of these gene symbols, I would be thankful (I tried googling some, but some returned nothing while others led me to deeper confusion):

A2RUA4
A6ND21
A6NDM0
A6NDV5
A6NDX5
A6NE68
A6NEC0
A6NF35
A6NFC0
A6NFE1
A6NFK1
A6NFR3
A6NG73
A6NGH9
A6NGZ2
A6NH08
A6NHQ2
A6NJ58
A6NJ64
A6NJ87
A6NJD2
A6NK05
A6NK21
A6NK31
A6NK39
A6NKB0
A6NKX1
A6NLC8
A6NMX9
A6XGL8
A6XGM2
A6XGM3
A6XGM6
A6XGM8
A7LQ08
A8K014
A8K1J1
A8K3J7
A8K3M1
A8K3P5
A8K4M4
A8K549
A8K557
A8K581
A8K5A9
A8K5K4
A8K5N8
A8K607
A8K6A1
A8K6R9
A8K6Z7
A8K6Z9
A8K7L7
A8KAG7
A8MPP1
A8MS99
A8MSY0
A8MT22
A8MT96
A8MTE8
A8MTY0
A8MU69
A8MUF5
A8MUP4
A8MVD6
A8MVP4
A8MVW9
A8MVY7
A8MW26
A8MW45
A8MWD9
A8MXS6
A8MXV9
A8MYF4
A8MYH7
A8MYJ9
A8MYM2
A8MYN1
A8MYQ5
A8MZ49
A8MZF2
A8WDG6
A9LSF3
A9UK02
A9UL17
AVGR5
AVGR8
B2R768
B2R9H1
B3KM92
B3KNQ9
B3KNT8
B3KPC8
B3KPI7
B3KQ05
B3KQ61
B3KQP7
B3KR99
B3KRG6
B3KRR0
B3KSX8
B3KSZ2
B3KT27
B3KT62
B3KTH8
B3KUX4
B3KV47
B3KVB9
B3KWP0
B3KXF6
B3KXL9
B3KY84
B3KY92
CGI-117
CYCT1b
DKFZp313E1411
DKFZp434G222
DKFZp434K202
DKFZp434L0650
DKFZp434N1212
DKFZp547B0614
DKFZp571N1833
DKFZp572F2170
DKFZp586K2222
DKFZp666D035
DKFZp667L0117
DKFZp686A11164
DKFZp686A11192
DKFZp686B01123
DKFZp686B16128
DKFZp686C24207
DKFZp686D0662
DKFZp686G19178
DKFZp686G2045
DKFZp686H10254
DKFZp686H1726
DKFZp686I05275
DKFZp686J1732
DKFZp686K1352
DKFZp686L22104
DKFZp686L2367
DKFZp686M1483
DKFZp686N1815
DKFZp686N23123
DKFZp686O17118
DKFZp761C0417
DKFZp761O1618
DKFZp761P1714
DKFZp762B226
DKFZp779A1451
DKFZp779E2460
DKFZp779F1429
DKFZp779G118
DKFZp779O0162
DKFZp781H1755
DKFZp781L0540
DKFZp781L0674
ENSP00000253473
ENSP00000255320
ENSP00000268683
ENSP00000274028
ENSP00000292074
ENSP00000319224
ENSP00000323765
ENSP00000337140
ENSP00000337494
ENSP00000338360
ENSP00000340338
ENSP00000342189
ENSP00000353090
ENSP00000353260
ENSP00000360622
ENSP00000361081
ENSP00000363813
ENSP00000364268
ENSP00000365122
ENSP00000366629
ENSP00000368480
ENSP00000368985
ENSP00000373895
ENSP00000376005
ENSP00000376302
ENSP00000376693
ENSP00000377011
ENSP00000378490
ENSP00000381401
ENSP00000381873
ENSP00000382368
ENSP00000384847
FLJ00110
FLJ00320
FLJ00380
GIOT3
HIT000000767
HIT000017204
HIT000022092
HIT000045640
HIT000064221
HIT000075737
HIT000081263
HIT000090018
HIT000094289
HIT000193050
HIT000216273
HIT000220452
HIT000242561
HIT000265711
HIT000265751
HIT000275070
HIT000291414
HIT000321646
HIT000323430
HIT000324866
HOX2.8
HP8
HPX-42
HPX-5
HSPC114
HZF14
HZF40
KIAA0543
KIAA0581
KIAA0863
KIAA1627
KIAA1688
KIAA1829
LOC125893
LOC284688
LOC401327
LOC402257
LOC728395
N1_40
MORF/CBP
MSL1L1
MYT2
NOL5
NP_001008728
NP_001018854
NP_001028171
NP_001036148
NP_001074016
NP_001077365
NP_001092199
NP_001093632
NP_001094808
NP_001095147
NP_001098662
NP_001104516
NP_003299
NP_006783
NP_009230
NP_009233
NP_009234
NP_009236
NP_056958
NP_061815
NP_067016
NP_068780
NP_848643
NP_852124
NP_874393
NP_874394
NP_877571
NP_877573
NP_998827
O00366
O14889
O14913
O43582
O43831
O60428
O95480
OK/SW-cl.43
OTTHUMP00000010128
OTTHUMP00000012402
OTTHUMP00000012404
OTTHUMP00000012730
OTTHUMP00000012731
OTTHUMP00000015898
OTTHUMP00000016919
OTTHUMP00000016920
OTTHUMP00000017047
OTTHUMP00000019183
OTTHUMP00000019184
OTTHUMP00000022909
OTTHUMP00000023260
OTTHUMP00000023261
OTTHUMP00000073320
OTTHUMP00000077417
OTTHUMP00000077419
OTTHUMP00000077672
OTTHUMP00000077676
OTTHUMP00000077694
OTTHUMP00000077697
OTTHUMP00000078145
OTTHUMP00000078147
OTTHUMP00000078318
OTTHUMP00000078319
OTTHUMP00000078320
OTTHUMP00000078392
OTTHUMP00000080389
OTTHUMP00000080648
OTTHUMP00000080776
OTTHUMP00000081424
OTTHUMP00000081678
OTTHUMP00000081680
OTTHUMP00000081728
OTTHUMP00000101912
OTTHUMP00000108701
OTTHUMP00000108891
OTTHUMP00000108893
OTTHUMP00000108894
OTTHUMP00000108897
OTTHUMP00000115968
OTTHUMP00000115969
OTTHUMP00000160276
OTTHUMP00000162897
OTTHUMP00000163265
OTTHUMP00000164264
OTTHUMP00000164457
OTTHUMP00000166226
OTTHUMP00000166233
OTTHUMP00000166234
OTTHUMP00000166250
OTTHUMP00000166345
OTTHUMP00000166435
OTTHUMP00000166436
OTTHUMP00000166438
OTTHUMP00000166480
OTTHUMP00000166565
OTTHUMP00000166579
OTTHUMP00000166657
OTTHUMP00000167446
OTTHUMP00000167447
OTTHUMP00000167464
OTTHUMP00000167474
OTTHUMP00000167567
OTTHUMP00000167568
OTTHUMP00000167574
OTTHUMP00000168071
OTTHUMP00000168073
OTTHUMP00000168239
OTTHUMP00000168416
OTTHUMP00000168469
OTTHUMP00000168472
OTTHUMP00000168473
OTTHUMP00000168474
OTTHUMP00000168475
OTTHUMP00000168479
OTTHUMP00000169150
OTTHUMP00000169353
OTTHUMP00000169366
OTTHUMP00000169367
OTTHUMP00000169630
OTTHUMP00000169706
OTTHUMP00000170050
OTTHUMP00000170105
OTTHUMP00000170261
OTTHUMP00000170338
OTTHUMP00000170346
OTTHUMP00000170347
OTTHUMP00000170430
OTTHUMP00000170443
OTTHUMP00000170502
OTTHUMP00000170619
OTTHUMP00000170620
OTTHUMP00000170852
OTTHUMP00000171399
OTTHUMP00000171401
OTTHUMP00000171508
OTTHUMP00000171806
OTTHUMP00000171826
OTTHUMP00000171830
OTTHUMP00000171871
OTTHUMP00000172277
OTTHUMP00000172310
OTTHUMP00000172469
OTTHUMP00000172471
OTTHUMP00000172688
OTTHUMP00000172690
OTTHUMP00000172691
OTTHUMP00000172740
OTTHUMP00000172744
OTTHUMP00000172947
OTTHUMP00000173122
OTTHUMP00000173192
OTTHUMP00000173306
OTTHUMP00000173308
OTTHUMP00000173309
OTTHUMP00000173541
OTTHUMP00000173741
OTTHUMP00000173963
OTTHUMP00000173964
OTTHUMP00000174056
OTTHUMP00000175559
OTTHUMP00000176033
OTTHUMP00000176097
OTTHUMP00000176113
OTTHUMP00000176231
OTTHUMP00000177113
OTTHUMP00000177114
OTTHUMP00000177115
OTTHUMP00000177641
OTTHUMP00000177645
OTTHUMP00000178325
OTTHUMP00000178342
OTTHUMP00000178382
OTTHUMP00000178424
OTTHUMP00000179062
OTTHUMP00000180729
OTTHUMP00000180954
OTTHUMP00000181393
OTTHUMP00000181638
OTTHUMP00000181761
OTTHUMP00000182063
OTTHUMP00000182067
OTTHUMP00000182466
OTTHUMP00000182467
OTTHUMP00000182498
OTTHUMP00000183154
OTTHUMP00000183345
OTTHUMP00000183497
OTTHUMP00000183498
OTTHUMP00000194894
OTTHUMP00000195099
OTTHUMP00000195100
OTTHUMP00000196683
OTTHUMP00000196695
OTTHUMP00000196843
OTTHUMP00000196844
OTTHUMP00000196933
OTTHUMP00000197180
OTTHUMP00000197217
OTTHUMP00000197311
OTTHUMP00000197312
OTTHUMP00000197315
OTTHUMP00000197316
OTTHUMP00000197337
OTTHUMP00000197338
OTTHUMP00000197362
OTTHUMP00000197363
OTTHUMP00000197366
OTTHUMP00000197370
OTTHUMP00000197375
OTTHUMP00000197384
OTTHUMP00000197386
OTTHUMP00000197432
OTTHUMP00000197454
OTTHUMP00000197487
OTTHUMP00000197549
OTTHUMP00000198039
OTTHUMP00000198042
OTTHUMP00000198043
OTTHUMP00000198045
OTTHUMP00000198046
OTTHUMP00000198048
OTTHUMP00000198321
OTTHUMP00000198322
OTTHUMP00000198375
OTTHUMP00000198377
OTTHUMP00000198531
OTTHUMP00000198767
OTTHUMP00000198822
OTTHUMP00000199413
OTTHUMP00000199414
OTTHUMP00000199415
OTTHUMP00000199821
OTTHUMP00000200474
OTTHUMP00000200848
OTTHUMP00000200898
OTTHUMP00000201137
OTTHUMP00000201203
OTTHUMP00000201247
OTTHUMP00000201382
OTTHUMP00000201383
OTTHUMP00000201497
OTTHUMP00000201500
OTTHUMP00000201501
OTTHUMP00000201503
OTTHUMP00000201504
OTTHUMP00000201507
OTTHUMP00000201508
OTTHUMP00000201509
OTTHUMP00000201557
OTTHUMP00000201618
OTTHUMP00000201622
OTTHUMP00000201810
P0C6E5
P61571
P61572
P61574
P61575
P61576
P61578
P61580
P61581
P61582
PRDM7_V1
Q05D88
Q05DF4
Q06DQ3
Q07644
Q12771
Q12998
Q13060
Q13537
Q13771
Q14546
Q14547
Q15288
Q15625
Q15636
Q15736
Q15918
Q16247
Q16365
Q16464
Q16466
Q16624
Q2VXS7
Q2VXS8
Q4U0F0
Q504R3
Q53EW1
Q53FZ4
Q53GX1
Q53XZ7
Q53YE0
Q59EB5
Q59EG4
Q59EI0
Q59EI6
Q59EI7
Q59EL8
Q59EQ0
Q59F07
Q59F14
Q59F49
Q59F60
Q59FH0
Q59FH9
Q59FK5
Q59FT1
Q59FT8
Q59FW0
Q59FW1
Q59FW7
Q59G05
Q59G74
Q59G90
Q59G94
Q59GA3
Q59GA7
Q59GB9
Q59GC7
Q59GG3
Q59GI1
Q59GP6
Q59GR2
Q59GS3
Q59GS7
Q59GV3
Q59HD6
Q59HE9
Q59HF7
Q5C8S5
Q5DNC6
Q5JB52
Q5TG08
Q5TZP9
Q5UW39
Q6N045
Q6TPI8
Q6TXQ4
Q6V963
Q6WG68
Q6YL39
Q6ZMM0
Q6ZMN4
Q6ZN27
Q6ZNA9
Q6ZNB5
Q6ZNF9
Q6ZNQ5
Q6ZP10
Q6ZP45
Q6ZP78
Q6ZQN3
Q6ZQR7
Q6ZS47
Q6ZUT2
Q6ZV00
Q6ZV39
Q6ZW69
Q708E2
Q7KZU6
Q7Z4R1
Q7Z5G6
Q7Z637
Q86SS3
Q86T11
Q86TV8
Q86UQ2
Q86UQ3
Q86VD5
Q86WD1
Q8N1R8
Q8N2J5
Q8N692
Q8N7V9
Q8NAM2
Q8NAN6
Q8NCF9
Q8NEV5
Q8NHX1
Q8WTZ3
Q8WXT4
Q96HF8
Q96I83
Q96LE1
Q96LU8
Q96M79
Q96MC1
Q96MF3
Q96ML5
Q96MM8
Q99419
Q9BQ57
Q9BR93
Q9BY29
Q9BZD1
Q9BZI2
Q9H861
Q9H8R0
Q9H942
Q9HD74
Q9NRA4
Q9NS29
Q9NZ00
Q9P057
Q9UCY6
Q9UD04
Q9UD29
Q9UD61
Q9Y4A3
Q9Y655
Q9Y6S1
Q9Y6T0
RP11-165F24.7-005
RP11-248I9.4-003
RP11-30E16.3-002
RP11-32F11.3-003
RP11-485H8.1-003
RP11-486H9.2
RP11-549L6.1-008
RP11-69O16.1-001
RP11-787I22.2-003
RP11-80B9.3-005
RP3-337H4.4
RP4-747L4.2-003
RP4-811H13.1-004
RP5-1100H13.2
RP5-905H16.1-004
SSDP4
TTDN1
WUGSC:H_DJ0320J15.1
WUGSC:H_NH0244E06.1
XP_001126567
XP_001127004
XP_001128445
XP_001129414
XP_001130480
XP_001130734
XP_001130897
XP_001131061
XP_001131122
XP_001132114
XP_001132385
XP_001713895
XP_001713978
XP_001713987
XP_001714062
XP_001714096
XP_001714187
XP_001714189
XP_001714196
XP_001714200
XP_001714242
XP_001714394
XP_001714478
XP_001714681
XP_001714908
XP_001714950
XP_001714997
XP_001715032
XP_001715055
XP_001715063
XP_001715080
XP_001715084
XP_001715114
XP_001715203
XP_001715573
XP_001715834
XP_001715861
XP_001715999
XP_001716005
XP_001716126
XP_001716171
XP_001716221
XP_001716291
XP_001716320
XP_001716549
XP_001716799
XP_001716832
XP_001716940
XP_001716952
XP_001716965
XP_001716969
XP_001717028
XP_001717276
XP_001717412
XP_001717609
XP_001717611
XP_001717626
XP_001717754
XP_001717766
XP_001717777
XP_001717895
XP_001717966
XP_001718110
XP_001718301
XP_001718472
XP_001718571
XP_001718604
XP_001718960
XP_001718966
XP_001718995
XP_001719058
XP_001719085
XP_001719184
XP_001719195
XP_001719325
XP_001719354
XP_001719871
XP_001719896
XP_001719954
XP_001720130
XP_001720508
XP_001720824
XP_001720850
XP_001721048
XP_001721121
XP_001721439
XP_001721592
XP_001721807
XP_001721950
XP_001722581
XP_001723240
XP_001723503
XP_001724341
XP_001724522
XP_001724765
XP_001725424
XP_001727042
XP_001727057
XP_001732864
XP_001732865
XP_001732882
XP_001732908
XP_001732909
XP_001732922
XP_001732941
XP_001732942
XP_070619
XP_370909
XP_372730
XP_373058
XP_373061
XP_373075
XP_373076
XP_373077
XP_377879
XP_495855
XP_497547
XP_498379
XP_933089
XP_933608
XP_935058
XP_936926
XP_942752
XP_944049
XP_947091
XP_947417
XP_947602
XP_950073
XP_950277
ZNF705H
hCG_17955
hZNF1
tmp_locus_6
usf1-bd
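
Many entries in the list above look like database accessions rather than gene symbols. The sketch below groups identifiers by prefix pattern to show which source each one probably comes from. The UniProt pattern follows the published accession format. The other labels are a reading of the prefixes (`NP_`/`XP_` for RefSeq proteins, `ENSP` for Ensembl proteins, `OTTHUMP` for Havana/Vega proteins, `LOC` for NCBI placeholder names). The names `ID_PATTERNS`, `classify` and `summarise` are made up for this example.

```python
import re
from collections import Counter

# Patterns are checked in order; the first match wins.
ID_PATTERNS = [
    ("RefSeq protein (NCBI)", re.compile(r"^[NXY]P_\d+$")),
    ("Ensembl protein", re.compile(r"^ENSP\d+$")),
    ("Havana/Vega protein", re.compile(r"^OTTHUMP\d+$")),
    ("NCBI LOC placeholder", re.compile(r"^LOC\d+$")),
    ("UniProt accession", re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    ("cDNA clone name", re.compile(r"^(?:DKFZp|KIAA|FLJ)\w+$")),
]


def classify(identifier):
    """Return a label for the first pattern the identifier matches."""
    for label, pattern in ID_PATTERNS:
        if pattern.match(identifier):
            return label
    return "unrecognised"


def summarise(identifiers):
    """Count how many identifiers fall into each category."""
    return Counter(classify(i) for i in identifiers)


sample = ["A6NDV5", "Q6ZW69", "ENSP00000253473", "NP_003299", "XP_497547",
          "OTTHUMP00000010128", "LOC125893", "DKFZp434G222", "ZNF705H"]
for name in sample:
    print(name, "->", classify(name))
```

Knowing the source matters because a symbol lookup will not match a protein accession. Each group would need to be queried against its own database (or its own scope in a lookup service), and entries that have been withdrawn there will still return nothing.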

Many of these may be obsolete records that no longer exist. These things happen as new genome builds come out. Examples:
https://www.uniprot.org/uniprot/Q6ZW69
https://www.ncbi.nlm.nih.gov/protein/XP_497547.3?report=genpept


Thank you for the reply. The gene symbols being obsolete makes sense. So I went back to Gene Ontology to see whether I could get the obsolete genes from there and whether the result would be similar to my list. However, my SQL query for obsolete human genes only returns around 20 genes. So is Gene Ontology not up to date, or is there another problem?

My SQL code is:

-- Note: is_obsolete is a column of the term table, so this filter
-- selects annotations to obsolete GO terms, not obsolete gene records.
SELECT
 term.is_obsolete,
 gene_product.symbol AS gp_symbol,
 gene_product.full_name AS gp_full_name,
 dbxref.xref_dbname AS gp_dbname,
 dbxref.xref_key AS gp_acc,
 db.fullname
FROM term
 INNER JOIN association ON (term.id=association.term_id)
 INNER JOIN gene_product ON (association.gene_product_id=gene_product.id)
 INNER JOIN species ON (gene_product.species_id=species.id)
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
 INNER JOIN db ON (association.source_db_id=db.id)
WHERE
 species.ncbi_taxa_id = '9606'
AND
 term.is_obsolete = '1';

You can't query a current copy of the database (I assume that is what you did) and expect to find these. As genome builds get refined some gene predictions no longer make sense, there is no experimental evidence for some and some may just be identified as plain errors. As far as current versions of the database are concerned many of these are no longer real. If you look at the historic record for this entry you can see that it was always a computer prediction (no experimental evidence) and it was eventually dropped from the database.

What exactly are you trying to do?


Oh, so if I understand correctly, the latest version of Gene Ontology will not contain many obsolete genes, because any gene still present in the latest version would not really be obsolete.

"What exactly are you trying to do?"

Well, my main goal is to get the genes (as Entrez IDs) related to certain GO terms. The GO terms are defined by the user. The user also specifies the NCBI taxonomy ID, so that I get the genes for that species. To do this automatically, I connect to the ensembldb.ensembl.org SQL server from Python code and use the ensembl_go_54 database. I then execute an SQL query on this database to get the genes annotated to the given GO term and species. This query gives me gene symbols, and I am unable to convert some of these symbols to Entrez Gene IDs.

Initially I thought using an online mirror would be faster than setting up a local one. But now I suspect that the GO mirror I am using is an older version and that it gives me some obsolete gene symbols. I will now try setting up a local GO database to see if that leads to anything different. I will update this comment after I try that. Thanks.


This is a rather unusual use case (normally people want to go from genes to GO). Have you looked at http://geneontology.org/ and the downloads they provide?


"This is a rather unusual use case (normally people want to go from genes to GO)"

I agree. While I was googling my problem, I found many more posts about going in that direction.

"Have you looked at http://geneontology.org/ and the downloads they provide?"

Yes, I did, but I couldn't see anything that would help me.

My current situation: I set up a local Gene Ontology database on my computer using the SQL files they provide. I ran a test case on the database and obtained the gene symbols. Now I will see if I can convert these gene symbols, and I will keep you updated.


Update: The local server gave almost twice as many gene symbols as my other solutions. With this set of gene symbols I did two things.

  • I tried converting these symbols using the mygene Python library, which gave around 5k unique IDs.
  • I parsed the gene2accession file (an 8 GB file provided by NCBI) to get the Entrez IDs of these gene symbols. This also returned around 5k unique Entrez IDs.

When I compared the two sets, I saw that they shared around 4900 IDs and that 120 gene IDs were not shared. I will investigate this difference further, but I think it is an acceptable error margin.
I will also look into the thousands of gene symbols that weren't converted to anything; I am guessing they are proteins etc. and not really genes.
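
The comparison of the two result sets can be sketched with plain set operations. The function name `compare_id_sets` is hypothetical. The IDs are converted to strings first, on the assumption that one source may return integers and the other text.

```python
def compare_id_sets(first, second):
    """Compare two collections of Entrez IDs.

    IDs are normalised to strings so that 1017 and "1017" count as equal.
    """
    a = {str(i) for i in first}
    b = {str(i) for i in second}
    return {
        "shared": a & b,       # found by both methods
        "only_first": a - b,   # found only by the first method
        "only_second": b - a,  # found only by the second method
    }


result = compare_id_sets([1017, "7157"], ["1017", "672"])
for key, ids in result.items():
    print(key, sorted(ids))
```

Looking at `only_first` and `only_second` separately shows which method is responsible for each of the unshared IDs, which a single difference count cannot do.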


"I am guessing they are proteins etc. and not really genes."

At a minimum, one gene gives one protein (for protein-coding genes; there are also isoforms, which are variants produced from the same gene). What remains in your list is probably all defunct entries that you will have to ignore.

The gene2accession file reflects the most current state of things. It gets generated every day.
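
As an illustration, protein accessions such as the `NP_`/`XP_` entries can be looked up in gene2accession by streaming the file once. This is a sketch that assumes the tab-separated layout with a header line containing the columns `tax_id`, `GeneID` and `protein_accession.version`; check these names against the header of your own copy. The function name is made up for this example.

```python
import csv


def protein_accessions_to_geneids(handle, accessions, tax_id="9606"):
    """Map protein accessions to GeneIDs from a gene2accession-style file.

    `handle` is any iterable of text lines, e.g. an open file.
    Version suffixes (".2") are ignored on both sides.
    """
    wanted = {acc.split(".")[0] for acc in accessions}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # The header line starts with "#", so strip it before looking up names.
    header = [name.lstrip("#") for name in next(reader)]
    tax_col = header.index("tax_id")
    gene_col = header.index("GeneID")
    prot_col = header.index("protein_accession.version")

    found = {}
    for row in reader:
        if row[tax_col] != str(tax_id):
            continue
        accession = row[prot_col].split(".")[0]
        if accession in wanted:
            found.setdefault(accession, set()).add(row[gene_col])
    return found
```

For the compressed download, the handle could be opened with `gzip.open("gene2accession.gz", "rt")`. Accessions that are missing from the result are candidates for the defunct entries mentioned above.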


I also stumbled across gene2go, provided by NCBI. It seems useful too. Do you know how frequently that file is updated?


All gene2* files are updated nightly.


Ok thank you.

If the gene2go file is complete, and it seems to be, that might be the answer for me. It is updated nightly, as you say, so I don't have to worry about getting obsolete results. It is also a small file (145 MB), so it is easy to download and parse.

Compared to my other solution, which is to download an 8 GB file (gene2accession) and also download and set up the Gene Ontology SQL database on my computer, downloading a 145 MB file is much more sensible.

My main goal was to get the genes related to GO IDs, and the gene2go file seems to have that information along with species and evidence information. If gene2go is a complete file that contains all the gene-GO relations, I think I will continue with that. If you have any remarks, feel free to tell me.
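
A minimal sketch of that approach, assuming gene2go is a tab-separated file whose header line includes the columns `#tax_id`, `GeneID`, `GO_ID` and `Evidence` (verify against the header of the downloaded file). The function name `genes_for_go_terms` is hypothetical.

```python
import csv


def genes_for_go_terms(handle, go_ids, tax_id, evidence=None):
    """Collect Entrez Gene IDs annotated to the given GO IDs for one species.

    `handle` is any iterable of text lines, e.g. an open gene2go file.
    `evidence` optionally restricts the result to a set of evidence codes.
    Returns {GO ID: set of GeneIDs as strings}.
    """
    go_ids = set(go_ids)
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    genes = {}
    for row in reader:
        if row["#tax_id"] != str(tax_id) or row["GO_ID"] not in go_ids:
            continue
        if evidence and row["Evidence"] not in evidence:
            continue
        genes.setdefault(row["GO_ID"], set()).add(row["GeneID"])
    return genes
```

This matches only rows whose GO_ID is exactly one of the requested IDs. Whether genes annotated to child terms should also be included is a separate decision that this sketch does not handle.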

Thank you very much for caring and responding to my questions.


Yeah, I would personally not bother with them given that they won't come up in any other annotations really. On an unrelated note, submitting multiple questions for the same issue is unlikely to get you any further help. Try to limit yourself to a single question for a given issue. It makes it easier to track for those of us interested in helping.


Thank you for the suggestion. As you can probably tell, I am a beginner at all this stuff, so I thought that if I phrased my problem differently somebody would be able to help me.


It's okay. People here are pretty stubborn - if you pose an interesting, well-written question with proper background, they'll usually do their best to help once they get invested. It might take some time (a few days between comments/answers even), but we generally want to help you find the answer and learn as well. Biostars is odd in that it has a lot of users, but it's still a rather small community - you will grow to recognize/remember the most active members pretty quickly.
