Biostar Beta. Not for public use.
Annotate Affymetrix probesets to Gene symbols
Entering edit mode
22 months ago
bio94 • 40

I am quite unsure on how to annotate the probesets from a GEO dataset(in this case GSE14333) to gene symbols, in R. The Affymetrix Human Genome U133Plus 2.0 arrays, was used for this dataset.

Would appreciate any help in this regard.

Many thanks!

Entering edit mode
13 months ago
Republic of Ireland

You can use biomaRt. Here is a previous reproducible example using another dataset of the same array type, GSE12056:

First download the dataset from GEO:

gset <- getGEO("GSE12056", GSEMatrix =TRUE, getGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

We now have the ExpressionSet object:

[1] 54674    20

 [1] "1007_s_at"    "1053_at"      "117_at"       "121_at"       "1255_g_at"   
 [6] "1294_at"      "1316_at"      "1320_at"      "1405_i_at"    "1431_at"     
[11] "1438_at"      "1487_at"      "1494_f_at"    "1552256_a_at" "1552257_a_at"
[16] "1552258_at"   "1552261_at"   "1552263_at"   "1552264_a_at" "1552266_at"  
[21] "1552269_at"   "1552271_at"   "1552272_a_at" "1552274_at"   "1552275_s_at"
[26] "1552276_a_at" "1552277_a_at" "1552278_a_at" "1552279_a_at" "1552280_at"  
[31] "1552281_at"   "1552283_s_at" "1552286_at"   "1552287_s_at" "1552288_at"  
[36] "1552289_a_at" "1552291_at"   "1552293_at"   "1552295_a_at" "1552296_at"  
[41] "1552299_at"   "1552301_a_at" "1552302_at"   "1552303_a_at" "1552304_at"  
[46] "1552306_at"   "1552307_a_at" "1552309_a_at" "1552310_at"   "1552311_a_at"

Now annotate these first 50 and create a 'lookup' table of annotation that can be used to rename your Affy IDs to gene names (takes a long time to look up all IDs):

mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(mart=mart, attributes=c("affy_hg_u133_plus_2", "ensembl_gene_id", "gene_biotype", "external_gene_name"), filter="affy_hg_u133_plus_2", values=rownames(exprs(gset))[1:50], uniqueRows=TRUE)

head(annotLookup, 20)
affy_hg_u133_plus_2 ensembl_gene_id gene_biotype              external_gene_name
1294_at             ENSG00000283726 miRNA                     MIR5193
1316_at             ENSG00000126351 protein_coding            THRA
1552310_at          ENSG00000169609 protein_coding            C15orf40
1552286_at          ENSG00000250565 protein_coding            ATP6V1E2
1552291_at          ENSG00000163964 protein_coding            PIGX
1294_at             ENSG00000182179 protein_coding            UBA7
1552296_at          ENSG00000142959 protein_coding            BEST4
1438_at             ENSG00000182580 protein_coding            EPHB3
1552287_s_at        ENSG00000223959 transcribed_pseudogene    AFG3L1P
1007_s_at           ENSG00000234078 protein_coding            DDR1
1552280_at          ENSG00000145850 protein_coding            TIMD4
1552304_at          ENSG00000139133 protein_coding            ALG10
1552306_at          ENSG00000139133 protein_coding            ALG10
1320_at             ENSG00000070778 protein_coding            PTPN21
1552256_a_at        ENSG00000073060 protein_coding            SCARB1
1007_s_at           ENSG00000215522 protein_coding            DDR1
1552303_a_at        ENSG00000184988 protein_coding            TMEM106A
1552302_at          ENSG00000184988 protein_coding            TMEM106A
1552309_a_at        ENSG00000162614 protein_coding            NEXN
1552274_at          ENSG00000168297 protein_coding            PXK

If you want the RefSeq 'NM' and 'NR' identifiers, then add "refseq_mrna" and "refseq_ncrna" to attributes


Entering edit mode

Thank you so much, Kevin.

Entering edit mode

Hi Kevin, how do I map the probe id's to the gene symbols so that they are in the same order. I just noticed that the annotLookup table probe id's are in a different order when compared to the original dataset.

> rownames(exprs(gset))[1:10]
 [1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at" "1294_at"   "1316_at"   "1320_at"   "1405_i_at"
[10] "1431_at"  

> annotLookup <- getBM(mart=mart, attributes=c("affy_hg_u133_plus_2", "external_gene_name"), filter="affy_hg_u133_plus_2", values <- rownames(exprs(gset))[1:10] , uniqueRows=TRUE)

> head(annotLookup)
  affy_hg_u133_plus_2 external_gene_name
1             1294_at            MIR5193
2             1316_at               THRA
3             1294_at               UBA7
4           1007_s_at               DDR1
5             1320_at             PTPN21
6              117_at              HSPA6

Thanks again!

Entering edit mode

Yes, thanks for bringing this up. The table that is generated is just a 'lookup' table, which you can then use to modify your data's annotation, as follows:

We have:

[1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at" "1294_at"  

  affy_hg_u133_plus_2 ensembl_gene_id   gene_biotype external_gene_name
1             1294_at ENSG00000283726          miRNA            MIR5193
2          1552347_at ENSG00000205758 protein_coding             CRYZL1
3          1552620_at ENSG00000184148 protein_coding              SPRR4
4          1552340_at ENSG00000170374 protein_coding                SP7
5             1316_at ENSG00000126351 protein_coding               THRA
6        1552516_a_at ENSG00000163349 protein_coding              HIPK1

There are many ways to match across fields in objects, but I always go to match():

indicesLookup <- match(rownames(gset), annotLookup$affy_hg_u133_plus_2)

With the matching indices, we use these to extract the matching gene names:

head(annotLookup[indicesLookup, "external_gene_name"])
[1] "DDR1"    "RFC2"    "HSPA6"   "PAX8"    "GUCA1A"  "MIR5193"

Always double check:

dftmp <- data.frame(rownames(gset), annotLookup[indicesLookup, c("affy_hg_u133_plus_2", "external_gene_name")])
head(dftmp, 20)
    rownames.exprs.gset.. affy_hg_u133_plus_2 external_gene_name
117             1007_s_at           1007_s_at               DDR1
363               1053_at             1053_at               RFC2
262                117_at              117_at              HSPA6
529                121_at              121_at               PAX8
304             1255_g_at           1255_g_at             GUCA1A
1                 1294_at             1294_at            MIR5193
5                 1316_at             1316_at               THRA
177               1320_at             1320_at             PTPN21
300             1405_i_at           1405_i_at               CCL5
269               1431_at             1431_at             CYP2E1
99                1438_at             1438_at              EPHB3
359               1487_at             1487_at              ESRRA
473             1494_f_at           1494_f_at             CYP2A6
180          1552256_a_at        1552256_a_at             SCARB1
372          1552257_a_at        1552257_a_at             TTLL12
515            1552258_at          1552258_at        MIR4435-2HG
416            1552261_at          1552261_at              WFDC2
383            1552263_at          1552263_at              MAPK1
382          1552264_a_at        1552264_a_at              MAPK1
255            1552266_at          1552266_at             ADAM32

table(dftmp[,1] == dftmp[,2])


...seems to match (watch for the probe names matching on each row).

Now replace the rownames. Note that there will be some probes that don't match to any genes - these will be NA. For example, all Affy probes with a 'AFFX' prefix are control probes, and should be removed after normalisation.

Also, some probes will map to the same gene; so, gene names will not be unique. This will normally create an error if you try to set non-unique names as rownames to a data-frame. So, you can 'cheat' and make rownames unique for now by adding an extension:

rownames(gset) <- paste(annotLookup[indicesLookup, "external_gene_name"], c(1:length(indicesLookup)), sep="_")

 [1] "DDR1_1"         "RFC2_2"         "HSPA6_3"        "PAX8_4"        
 [5] "GUCA1A_5"       "MIR5193_6"      "THRA_7"         "PTPN21_8"      
 [9] "CCL5_9"         "CYP2E1_10"      "EPHB3_11"       "ESRRA_12"      
[13] "CYP2A6_13"      "SCARB1_14"      "TTLL12_15"      "MIR4435-2HG_16"
[17] "WFDC2_17"       "MAPK1_18"       "MAPK1_19"       "ADAM32_20"     

         GSM304303 GSM304304 GSM304479 GSM304480 GSM304481
DDR1_1      8222.3    6354.8    8361.6   10947.8    8385.2
RFC2_2     10033.3    8773.8   11313.5    7423.7   10522.6
HSPA6_3      221.4     323.5     459.3      67.4     458.5
PAX8_4      2832.2    3333.2    2952.5    3450.2    3194.3
GUCA1A_5     297.1     222.6      20.1      30.5     140.5

You can easily remove this extension for plots, etc, with:

gsub("_[0-9]*$", "", rownames(gset))

Note - rownames(exprs(gset)) and rownames(gset) are interchangeable, but, for modifying these, you have to use rownames(gset)

Entering edit mode

Thanks again, Kevin. Appreciate it.


Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1