No corresponding Gene Symbol for Affymetrix Probe Set ID
8.3 years ago

I've downloaded a breast cancer expression profile dataset from NCBI (platform GPL570), which has 54,675 rows. The rows are probes, which I want to convert to gene symbols so that I can pass the data to GENIE3. But I've run into these problems:

  1. 12,227 rows of this data don't have any corresponding gene symbol. How can I deal with them?
  2. As far as I know, the human genome has 20,000-25,000 genes, and this data, excluding the rows without corresponding gene symbols, has 21,025 rows with unique gene symbols/probe IDs. Isn't that outside the acceptable range?

I also had the problem of many probe IDs mapping to one gene symbol, but I think it would be OK to use the average expression value across all probes that share a gene symbol.
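A minimal sketch of that averaging step, in plain Python. The function name `collapse_probes_to_genes`, the probe IDs, the gene symbols, and the numbers are all made up for illustration, and dropping unannotated probes is only one possible choice, not a recommendation from this thread:

```python
from collections import defaultdict
from statistics import fmean


def collapse_probes_to_genes(expression, probe_to_symbol):
    """Average probe-level rows into one row per gene symbol.

    expression: dict of probe ID -> list of expression values (one per sample)
    probe_to_symbol: dict of probe ID -> gene symbol ('' or absent = unannotated)
    """
    grouped = defaultdict(list)
    for probe_id, values in expression.items():
        symbol = probe_to_symbol.get(probe_id, "")
        if not symbol:
            continue  # assumption: probes without a symbol are simply dropped
        grouped[symbol].append(values)
    # zip(*rows) walks sample by sample across all probes of one gene
    return {
        symbol: [fmean(sample) for sample in zip(*rows)]
        for symbol, rows in grouped.items()
    }


# Hypothetical toy data: two samples, four probes
expression = {
    "probe_a": [2.0, 4.0],
    "probe_b": [4.0, 8.0],
    "probe_c": [1.0, 1.0],
    "probe_d": [5.0, 5.0],
}
probe_to_symbol = {
    "probe_a": "GENE1",
    "probe_b": "GENE1",
    "probe_c": "GENE2",
    "probe_d": "",  # no gene symbol
}

gene_level = collapse_probes_to_genes(expression, probe_to_symbol)
print(gene_level)
```

Averaging is only one way to collapse probes; keeping the probe with the highest mean or highest variance per gene is another common option, and which is better depends on the analysis.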

Can anyone help me?

affymertix-probe gene-expression gene-symbol • 3.0k views

You should get the probe sequences and map them to a reference genome of your choice. Any probe annotation provided by vendors or study authors several years ago is very likely obsolete; don't rely on it for serious work.


So, are you suggesting that I replace the gene symbols in my dataset with the corresponding ones from the original "Human Genome U133 Plus 2.0 Array" annotation?
If so, there are some probe IDs with no gene symbols in the original annotation, too!
And for some probe IDs, the gene symbols in the two datasets don't match. Does that prove your point?


What I meant is that you should get the sequence of each probe and map it (e.g. with BLAST) to a current annotated genome reference to find out which current gene(s) each probe represents. Usually you have no idea how up-to-date and accurate the vendor-provided association between probe ID and gene symbol is. For probes designed some time ago, you'll always have discrepancies when mapping them to a more recent genome. For example, some probes that targeted a unique gene back when they were designed will now target nothing or several genes. Also, the notion of a gene depends on which reference annotation you're using, e.g. a gene in Ensembl is not the same as a gene in Entrez. What you do with probes that map to multiple genes is usually problem-dependent and up to you to decide.
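As a rough sketch of the bookkeeping after such a remapping, assuming you have already reduced the alignment results to (probe ID, gene ID) pairs: the function name `classify_probes`, the probe IDs, and the gene IDs below are hypothetical, and the alignment itself is not shown.

```python
from collections import defaultdict


def classify_probes(all_probe_ids, hits):
    """Split probes into unique, multi-gene, and unmapped groups.

    all_probe_ids: every probe ID on the array
    hits: iterable of (probe_id, gene_id) pairs, e.g. derived from
          alignment hits overlapped with a gene annotation (assumed input)
    """
    genes_per_probe = defaultdict(set)
    for probe_id, gene_id in hits:
        genes_per_probe[probe_id].add(gene_id)  # repeated hits collapse in the set

    classes = {"unique": {}, "multi": {}, "unmapped": []}
    for probe_id in all_probe_ids:
        genes = genes_per_probe.get(probe_id, set())
        if len(genes) == 1:
            classes["unique"][probe_id] = next(iter(genes))
        elif genes:
            classes["multi"][probe_id] = sorted(genes)
        else:
            classes["unmapped"].append(probe_id)
    return classes


# Hypothetical toy input: p1 hits one gene twice, p2 hits two genes, p3 hits nothing
probe_ids = ["p1", "p2", "p3"]
hits = [("p1", "G1"), ("p1", "G1"), ("p2", "G1"), ("p2", "G2")]

classes = classify_probes(probe_ids, hits)
print(classes)
```

The "multi" and "unmapped" groups are where the problem-dependent decision comes in: you can drop them, keep them under their probe IDs, or assign them by some rule of your own.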

Given that you mention the Human Genome U133 Plus 2.0 Array, you might want to have a look at the hgu133plus2.db Bioconductor package.


Actually I am working with plants, but did you try the NetAffx™ Analysis Center? It offers many options for human data, from ID conversion onward. For example, by entering a probe sequence you can retrieve the gene symbol.


Thank you so much! It helped to reduce the unknown probes from 12,000 to 8,000. But there are still some probe IDs with no gene symbols in the NetAffx database, too!
