Biostar Beta. Not for public use.
Question: Mapping Affymetrix IDs to GeneSymbols; Why so many NAs?

Hello, after performing a differential expression analysis on a set of .CEL files downloaded from GEO, I'm trying to map the Affymetrix probe IDs to gene symbols using the 'annotate' package with 'hgu133plus2.db'. However, of ~150 significant genes, about 50 can't be mapped ("NA"), which I think is quite a lot. Even more concerning, the top 5 genes can't be mapped.

I also tried using the probe IDs instead of the gene symbols when performing GO enrichment analysis with DAVID, but they don't seem to be mapped at all when choosing AFFYMETRIX_3PRIME_IVT_ID.

My questions are: Why is that? Shouldn't all IDs map? How am I supposed to proceed from here? It doesn't feel right to simply exclude all "NA" genes from further analysis.

Any help is greatly appreciated.

ADD COMMENTlink 3.3 years ago bi_Scholar • 0 • updated 3.3 years ago aln • 290

To answer your question precisely I would need to see a snippet of your code, especially the annotation step. In general, though, older Affymetrix arrays (including the chip you are using) have an ambiguous design: probes within one probeset can map to different genes or even to non-transcribed regions (according to current annotation), and some genes are represented by several probesets. So I would recommend using the custom CDF files from the Brainarray project with EntrezG IDs. As a result you will get ~19000 genes, whereas the hgu133plus2 chip initially has 54675 probesets.

For details on the Brainarray custom CDFs, read the following article -

How to use CDFs - Be aware that you will need to use a new annotation package, which you download and install from the same site.

ADD COMMENTlink 3.3 years ago aln • 290

Hello aln, thanks for your reply, it was really helpful.

The annotation is performed in a very basic manner:

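library(limma)           # provides topTable()
library(annotate)        # provides getSYMBOL()
library(hgu133plus2.db)  # annotation data for the chip

results <- topTable(...)                                # DE results, one row per probeset
symbols <- getSYMBOL(rownames(results), "hgu133plus2")  # NA where no symbol is annotated
anno_results <- cbind(results, symbols)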

I'll look into the above links and see if I can make some improvements. In general, is there a rule for handling probesets that couldn't be annotated? Are they simply removed from the result set? And how does one handle a case where one gene is reported as differentially expressed multiple times? (You mentioned that some genes are represented by multiple probesets.) Should I keep the most significant one and remove the others, or take the mean over all of them?

Again, many thanks for your help. Cheers!

ADD REPLYlink 3.3 years ago
• 0

If you use a Brainarray custom CDF you won't have multiple probesets per gene (read the links I posted earlier to understand why). If you want to apply a different solution, you should do it before the DEG analysis, so that no gene is reported as differentially expressed multiple times. First, I would eliminate all probesets that map to different genes and all probesets with NA. Second, there are several ways to deal with multiple probesets per gene. Indeed, as you said, you can take the mean over all of them, but that is not considered the best solution. For better solutions, read:

As for annotation, I usually use the select function; at least I'm sure that it reports all the entries for all the probeset IDs:

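library(hgu133plus2.db)  # also loads AnnotationDbi, which provides select()
probesetsID_to_EntrezID <- select(hgu133plus2.db, keys = probesetsID, columns = "ENTREZID", keytype = "PROBEID")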

where probesetsID is a character vector of your platform's probeset IDs.
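Because select returns one row per probeset/gene pair, a probeset that maps to several genes shows up on several rows, and an unannotated probeset shows up with NA. Below is a minimal base-R sketch of filtering such a table. The data frame is a made-up stand-in for real select output (the probeset and Entrez IDs are invented), and the variable names are only illustrative:

```r
# Toy stand-in for the data frame returned by select();
# all IDs below are invented for illustration.
mapping <- data.frame(
  PROBEID  = c("p1_at", "p2_at", "p2_at", "p3_at", "p4_at"),
  ENTREZID = c("101",   "202",   "303",   NA,      "101"),
  stringsAsFactors = FALSE
)

# Probesets without any gene annotation
na_probes <- unique(mapping$PROBEID[is.na(mapping$ENTREZID)])

# Number of distinct (non-NA) genes per probeset
genes_per_probe <- tapply(mapping$ENTREZID, mapping$PROBEID,
                          function(x) length(unique(x[!is.na(x)])))

# Probesets that map to more than one gene
multi_probes <- names(genes_per_probe)[genes_per_probe > 1]

# Keep only probesets annotated with exactly one gene
keep <- !(mapping$PROBEID %in% c(na_probes, multi_probes))
clean_mapping <- unique(mapping[keep, ])
```

In this toy table, p3_at is dropped for having no annotation and p2_at for mapping to two genes. p1_at and p4_at remain, both pointing to the same gene, which is the "multiple probesets per gene" case that still has to be handled separately.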

So, in my case I do the annotation step before the DEG analysis (whether I use a Brainarray CDF or the regular one), so that I can eliminate NA probesets and probesets mapped to different genes at the same time. I don't want them in my DEG list.
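For the remaining case raised earlier in the thread, several probesets for the same gene, here is a rough base-R sketch of one simple heuristic: keep, for each gene, the probeset with the highest average expression. This is an assumption chosen for illustration, not necessarily the approach recommended in the articles linked above. The expression values and IDs are invented:

```r
# Toy expression matrix: rows are probesets, columns are samples (invented values).
expr <- matrix(
  c(5.0, 5.2, 5.1,
    7.9, 8.1, 8.0,
    6.0, 6.1, 5.9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("p1_at", "p4_at", "p5_at"), c("s1", "s2", "s3"))
)

# One gene per probeset, i.e. NA and multi-gene probesets are already removed.
probe2gene <- c(p1_at = "101", p4_at = "101", p5_at = "404")

# Average expression of each probeset across samples
avg <- rowMeans(expr)

# For each gene, pick the probeset with the highest average expression
genes <- probe2gene[rownames(expr)]
best <- tapply(names(avg), genes, function(p) p[which.max(avg[p])])

# Reduce the matrix to one row per gene, named by gene ID
expr_by_gene <- expr[as.vector(best), , drop = FALSE]
rownames(expr_by_gene) <- names(best)
```

The resulting gene-level matrix would then go into the differential expression step in place of the probeset-level one, so each gene can appear only once in the DEG list.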

ADD REPLYlink 3.2 years ago
• 290
