Question

Merging expression data from multiple platforms

7.7 years ago

mforde84 ★ 1.4k

I'm looking to do a meta-analysis of expression arrays in GEO for a particular cell line. I've been able to determine which GSE series my samples belong to, but they span a wide variety of GPL platforms. At the moment I'm using GEOquery to retrieve the GSE ExpressionSets, and I'd like to know whether there is a way to match probe IDs across all of the ExpressionSets to a common gene identifier such as HGNC symbol, Entrez ID, or Ensembl ID.

I've tried merging the ExpressionSets with inSilicoMerging, but it merges on each object's featureNames, i.e. the probe IDs. Different platforms have different probe names, so merging only works for GSEs that share the same GPL. I have all of the GPL annotations and have mapped probe ID to gene name for each one, but I'm not sure how to use these mappings to change the featureNames in my ExpressionSet objects to the corresponding gene names.

Any suggestions are appreciated.

Marty

microarray expression array • 3.6k views

> library(GEOquery)
> library(inSilicoMerging)
> eset1 <- getGEO("GSE49962") #[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array
> eset2 <- getGEO("GSE53494") #[HuGene-1_0-st] Affymetrix Human Gene 1.0 ST Array [transcript (gene) version]
> eset1 = eset1[[1]]
> eset2 = eset2[[1]]
> eset1
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 6 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM1210881 GSM1210882 ... GSM1210886 (6 total)
  varLabels: title geo_accession ... data_row_count (31 total)
  varMetadata: labelDescription
featureData
  featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (54675 total)
  fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL570 
> eset2
ExpressionSet (storageMode: lockedEnvironment)
assayData: 32321 features, 24 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM1294905 GSM1294906 ... GSM1294928 (24 total)
  varLabels: title geo_accession ... data_row_count (34 total)
  varMetadata: labelDescription
featureData
  featureNames: 7892501 7892502 ... 8180418 (32321 total)
  fvarLabels: ID GB_LIST ... category (12 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL6244 
> linker <- list(eset1, eset2)
> merged_data <- merge(linker)
  INSILICOMERGING: Run with no additional merging technique...
  INSILICOMERGING:  ! WARNING ! Number of common genes < 1%
> merged_data
ExpressionSet (storageMode: lockedEnvironment)
assayData: 0 features, 30 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM1210881 GSM1210882 ... GSM1294928 (30 total)
  varLabels: channel_count characteristics_ch1 ... type (34 total)
  varMetadata: labelDescription
featureData
  featureNames:
  fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL570 GPL6244 
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] inSilicoMerging_1.15.0 GEOquery_2.38.4        Biobase_2.32.0        
[4] BiocGenerics_0.18.0   

loaded via a namespace (and not attached):
 [1] lattice_0.20-33      IRanges_2.6.1        XML_3.98-1.4        
 [4] bitops_1.0-6         R6_2.1.3             grid_3.3.1          
 [7] xtable_1.8-2         DBI_0.5              stats4_3.3.1        
[10] DESeq_1.24.0         RSQLite_1.0.0        httr_1.2.1          
[13] genefilter_1.54.2    annotate_1.50.0      S4Vectors_0.10.3    
[16] Matrix_1.2-6         splines_3.3.1        RColorBrewer_1.1-2  
[19] geneplotter_1.50.0   RCurl_1.95-4.8       survival_2.39-5     
[22] AnnotationDbi_1.34.4

7.7 years ago by mforde84 ★ 1.4k

score 1 · Answer 1 · 2016-08-25

7.7 years ago

Manvendra Singh ★ 2.2k

Maybe assign the mean of all probes that target a gene to that gene, then keep only the genes that are detectable on all platforms. Once the datasets have the same set of rows, you can easily merge them with the "merge" function in R.
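A minimal sketch of this suggestion in base R, using small made-up matrices in place of the expression matrices of two ExpressionSets. The helper name `collapse_to_genes`, the probe IDs, gene names, sample names, and all values are hypothetical; in practice the probe-to-gene vectors would come from each GPL annotation table.

```r
# Hypothetical helper: average all probes that map to the same gene.
# expr:  numeric matrix, probes in rows, samples in columns
# genes: character vector, one gene name per row of expr (NA or "" = unmapped)
collapse_to_genes <- function(expr, genes) {
  keep <- !is.na(genes) & genes != ""
  expr <- expr[keep, , drop = FALSE]
  genes <- genes[keep]
  sums <- rowsum(expr, group = genes)                # per-gene column sums
  counts <- as.vector(table(genes)[rownames(sums)])  # number of probes per gene
  sums / counts                                      # per-gene means
}

# Toy platform A: two probes for GENE1, one for GENE2, one unmapped probe.
platform_a <- matrix(c(2, 4, 10, 7,
                       4, 6, 12, 9),
                     nrow = 4,
                     dimnames = list(paste0("a_probe", 1:4), c("GSM_A1", "GSM_A2")))
genes_a <- c("GENE1", "GENE1", "GENE2", NA)

# Toy platform B: different probe IDs, partly overlapping genes.
platform_b <- matrix(c(1, 5, 8,
                       3, 7, 6),
                     nrow = 3,
                     dimnames = list(paste0("b_probe", 1:3), c("GSM_B1", "GSM_B2")))
genes_b <- c("GENE2", "GENE3", "GENE1")

collapsed_a <- collapse_to_genes(platform_a, genes_a)
collapsed_b <- collapse_to_genes(platform_b, genes_b)

# Keep only genes measured on both platforms, then bind samples side by side.
common <- intersect(rownames(collapsed_a), rownames(collapsed_b))
merged <- cbind(collapsed_a[common, , drop = FALSE],
                collapsed_b[common, , drop = FALSE])
print(merged)
```

With real data, `exprs(eset)` would supply the matrix and a gene column from `fData(eset)` the mapping. Note that values from different platforms sit on different scales, so the combined matrix would most likely still need normalization or batch correction before any downstream analysis.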

score 1 · Answer 2 · 2016-08-25

Figured it out:

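# Probe-to-gene table for this GPL: column 1 = probe ID, column 2 = gene name.
# header = FALSE is assumed, since the V1 column name only exists when the file has no header row.
gpl_annotation <- read.delim("~/gpl_annotation.txt", header = FALSE, stringsAsFactors = FALSE)
count <- 1
for (name in featureNames(eset1)) {
    lookup_index <- which(gpl_annotation$V1 == name)
    # Replace the probe ID at position count; try() silently skips probes where the assignment fails.
    try({featureNames(eset1)[[count]] <- as.character(gpl_annotation[lookup_index, 2])}, TRUE)
    count <- count + 1
}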

Then I'll take the mean of the probes per target gene and merge.
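The lookup can also be written without a counter by using `match()`. This is a sketch on a toy annotation table, assuming column 1 holds probe IDs and column 2 gene names as in the loop; the probe IDs and gene names here are hypothetical. It keeps the gene names in a separate vector rather than writing them into featureNames, because several probes often map to the same gene and row names generally need to be unique.

```r
# Toy stand-in for the GPL annotation table (hypothetical IDs and gene names):
# column 1 = probe ID, column 2 = gene name.
gpl_annotation <- data.frame(
  V1 = c("probe_3", "probe_1", "probe_2"),
  V2 = c("GENE2", "GENE1", "GENE1"),
  stringsAsFactors = FALSE
)

# Stand-in for featureNames(eset1); "probe_4" has no entry in the annotation.
probe_ids <- c("probe_1", "probe_2", "probe_3", "probe_4")

# match() gives, for each probe ID, its row in the annotation table (NA if absent).
idx <- match(probe_ids, gpl_annotation$V1)
gene_names <- gpl_annotation$V2[idx]
names(gene_names) <- probe_ids

print(gene_names)
```

The resulting vector is aligned with the rows of the expression matrix, so it can serve directly as the grouping variable when averaging probes per gene, with `NA` marking probes to drop.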