Question

Annotation in ExpressionSet object

0

Entering edit mode

7.9 years ago

Na Sed ▴ 310

I have a matrix of gene expression which label of each sample and symbol of each gene is known. I wanna perform gtBroad in globaltest R-package. It needs to convert expression matrix to an ExpressionSet object with known Annotation. How I can do that while I have only the name of genes?

EDIT: You can download sample data from this link: https://www.dropbox.com/s/d50tr958yvvcxwx/TEST.Rdata?dl=0

I wanna run the below code:

library(globaltest)
library(GSEABase)
broadset <- getBroadSets(asBroadUri(c('BIOCARTA_HDAC_PATHWAY', 'PID_HDAC_CLASSII_PATHWAY')))
pData <- data.frame(id=resp, row.names=colnames(data))
phenoData <- AnnotatedDataFrame(data=pData)
exprs <- ExpressionSet(data, phenoData=phenoData)
gtBroad (response=id, exprs, id='PID_HDAC_CLASSIII_PATHWAY', collection=broadset)

R ExpressionSet • 7.0k views

ADD COMMENT • link 7.9 years ago by Na Sed ▴ 310

score 3 · Accepted Answer · 2016-05-25

3

Entering edit mode

7.9 years ago

Selenocysteine ▴ 620

You can use this function: http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/Biobase/html/class.ExpressionSet.html

The gene annotations (featureData) are optional, so you can get an ExpressionSet without any annotations. If you want to get some information about your genes, you can use the biomaRt package which is based on the Ensembl database. Here is the guide: https://www.bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.pdf.

Basically, you can use your list of gene symbol to query the biomaRt database and obtain a set of data for these symbols (e.g. entrezID, chromosomal location, strand). The obtained dataset (featureData) will be an object of class R dataframe. According to the biobase guide to create ExpressionSet items, you then have to convert it to an AnnotatedDataFrame using the annotatedDataFrameFrom() function. Note that the number of rows in featureData must match the number of rows in assayData, and row names of featureData (gene symbols) must match row names of the matrix / matricies in assayData (gene symbols).

ADD COMMENT • link 7.9 years ago by Selenocysteine ▴ 620

0

Entering edit mode

Thanks, but the problem still is remained. I added sample data as well as my code to the question. It would be great if you take a look.

ADD REPLY • link 7.9 years ago by Na Sed ▴ 310

0

Entering edit mode

I tried fixing your code. The main issue is that to use the GSEA tool you need to know the EntrezIDs of your genes. First, you need to map your gene symbols to their EntrezIDs, and in your case it is not possible for around 2k genes, because they are not called with the official gene name, but with an alias. You have to find the official gene names for all of your genes and change the initial dataset, I think there are some tools on like the HGNC webpage. The rest is explained in the code.

library(globaltest)
library(GSEABase)
library(annotate)
library(org.Hs.eg.db) # I used this database because I assumed we are dealing with Homo sapiens

load("TEST.RData")

# Takes the gene symbols of your expression data
geneSymbols = rownames(data)

# Takes the EntrezIDs of your genes. It gives an error because some of your genes
# are not called with the official gene symbol, but with a synonym.
# You probably must fix this manually, I just removed the rows of your dataset
# for which the annotation package wasn't able to map an EntrezID.
annotations = select(org.Hs.eg.db, geneSymbols, "ENTREZID", "SYMBOL")

# Removes the annotations for those genes for which it didn't find the EntrezID
annotations = annotations[!is.na(annotations$ENTREZID),]

# Creates a named list to map the EntrezIDs of your genes and their symbols
annotationsVec = c(annotations$ENTREZID)
names(annotationsVec) = annotations$SYMBOL
annotationsVec = as.list(annotationsVec)

# Removes the rows corresponding to the genes without an EntrezID
data = data[rownames(data) %in% annotations$SYMBOL,]
nrow(data)
# [1] 5325  
# You lost around 2k genes because they are not called with the official symbols
# and the annotation cannot map their symbols to the IDs. You have to do this with some 
# other tools which allow you to look for gene symbol aliases.

broadset = getBroadSets(asBroadUri(c('BIOCARTA_HDAC_PATHWAY', 'PID_HDAC_CLASSII_PATHWAY')))
phenoData = data.frame(id = resp, row.names = colnames(data))
phenoData = AnnotatedDataFrame(data = phenoData)

exprs = ExpressionSet(data, phenoData = phenoData)

# Since you are using general annotations for Homo sapiens, add the name of the annotation database ("org.Hs.eg.db")
# and then the annotations list (annotationsVec) which allows it to map the rows of your ExpressionSet to their EntrezIDs.
# Note: in your sample code you wrote the pathway ID wrong, so I fixed it.
gtBroad(response = id, exprs, id = 'PID_HDAC_CLASSII_PATHWAY', annotation = "org.Hs.eg.db", collection = broadset, probe2entrez = annotationsVec)