Biostar Beta. Not for public use.
Collapsing probesets to genes on WGCNA
1
Entering edit mode
24 months ago
@Davide Chicco3592

Hi all

I need to do a weighted correlation network analysis on a dataset made of gene expression data of 169 samples (patients) that I downloaded from GEO, and I am trying to use the WGCNA package on R.

A colleague told me I have to collapse the probesets to genes. I have tried to do it through the collapseRows() function, but I'm experiencing some errors.

Here's my function call:

result <- collapseRows(datET=combat_edata,
rowGroup=unique(combat_edata_genes),
rowID=rownames(combat_edata),
method="function",
methodFunction=colMeans)


The combat_edata dataframe is my output from the ComBat() function that I used for batch correction, while the combat_edata_genes list is the result of the retrieval of the genes associated to "combat_edata in the hsapiens_gene_ensembl mart, through the getBM() function. As you can see, I used combat_edata as my main input data, and I set rowGroup as the list of unique genes. I set rowID as the row names of the dataframe (features of the GEO dataset).

My call to collapseRows() returns an error:

Error: rowGroup and rowID not the same length"

Which is true. Here are the lengths of the variables:

length(rownames(combat_edata)):       33,297
length(unique(combat_edata_genes)):   27,217
length(combat_edata_genes):           47,127


The number of rows of my dataframe (the features of the GEO dataset) and the number of genes are different. I don't know how to handle the situation... Do you have any idea on how to handle this issue?

How can I solve perform the collapsing of the probesets to genes correctly?

EDIT: I don't necessarily have to use collapseRows(); if you know another method and you can explain me how to use it easily, you're welcome to propose it. Thanks

gene expression R WGCNA collapseRows() • 452 views
1
Entering edit mode
24 months ago
aln • 290
@aln9248

rowGroup and rowID should be of the same length as the error reports. Namely, for each rowID entry you need to have corresponding group ID, it means that groups can be repeated in rowGroup. For microarrays you have cases when one probeset can be mapped to different genes, and when one gene is described by multiple probesets, that's why you need to "collapse" your matrix. To illustrate, your input vector might look like this:

rowGroup    rowID
geneA         probeset1
geneA         probeset2
geneB         probeset3
....


I would also suggest deleting probesets that map to different genes simultaneously.

0
Entering edit mode

Thanks, but how do I perform that operation in R?

0
Entering edit mode

Which operation exactly? If you need to annotate your probesets in order to get list of corresponding genes. You can do it like this:

library(hgu133plus2.db)
select(hgu133plus2.db, probesetsID, "ENTREZID")


It will give you 1 to multiple relations list. Then you just check, which probesets map to different genes at the same time using basic R operations with data.frame.

Or if that's not what you ask, please, elaborate.

P.S. What array you use? Because if Affy, then instead of WGCNA you must use different probesets annotations from BrainArray project (custom cheap definition files).

0
Entering edit mode

Hi Aln

Thanks for your help.

I tried to use your code but it gives me an error:

probesetsID <- c("7892501", "7892502", "7892503", "7892504", "7892505", "7892506") # first 6 probeset IDs

select(hgu133plus2.db, probesetsID, "ENTREZID")

Error in .testForValidKeys(x, keys, keytype, fks) :
None of the keys entered are valid keys for 'PROBEID'. Please use the keys method to see a listing of valid arguments.


How can I solve this error?

Also, I don't know about the BrainArray project. If you have suggestions on that, please share them. Thanks!

0
Entering edit mode

My example was for the particular Affymetrix array, your probesets ids do not look like typical affymetrix ids. What array do you use exactly?

0
Entering edit mode

Thanks for having replied. It's Affymetrix, as I read it on the GEO repository for the dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE59867

1
Entering edit mode

You could try using the annotation file from Gemma for this platform: https://gemma.msl.ubc.ca/arrays/showArrayDesign.html?id=380 , click the link that says "Basic"

First column is ProbeName, second one is a gene symbol. If you use it in your work, don't forget to cite Gemma (Zoubarev, A., et al., Gemma: A resource for the re-use, sharing and meta-analysis of expression profiling data. Bioinformatics, 2012.)

You could also download the annotation for the platform and parse it out yourself. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL6244 (The "Download full table..." button.)

1
Entering edit mode

Davide, I am having some déjà vu... I originally showed you how to convert Affy probe IDs to gene symbols, here: C: How to map Affymetrix probe ID's from a GEO dataset to corresponding gene symbol (follow the other link).

To then summarise expression over genes, use the aggregate() function in R.

Note that, with microarray data and starting with the raw data files, one can summarise expression over genes or exons via the target parameter that is passed to rma(). See here: C: Human Exon array probeset to gene-level expression

1
Entering edit mode

Affymetrix has different types of arrays, where each array has its own annotation. hgu133plus2.db is an annotation for a particular array, that's why it doesn't work with yours. Your array type is Affymetrix Human Gene 1.0 ST. And judging from your answers you didn't have experience with analyzing arrays in R. I would simply recommend reading this tutorial:

https://bioconductor.org/packages/release/bioc/vignettes/oligo/inst/doc/oug.pdf

Oligo package should work with your array type. And I would certainly recommend starting with raw data.

http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp

1
Entering edit mode

Just want to add one more comment. There are many ways how to analyze array, with R packages or web-services that were mentioned in this post already. And each analysis step (probes summarization, normalization, annotation etc) can be also done with different tools. You just need to try few of them and choose what you like the most. And the best way to help is if you post actual source code, so we can point you to the right direction, as now your questions are very broad.

The important thing one needs to understand about arrays is that their probe/probeset design is not perfect, and one needs to do additional steps to get one-to-one probe-gene correspondence. One way is to do some sort of "collapsing" or use re-annotated chip definition files (BrainArray etc). Or one can re-annotate probe sequences themself.

0
Entering edit mode

Hi Aln

Thanks for your reply and your help. As you might have understood, I'm a newbie with Bioconductor... It's difficult to me to get the right commands for my needs. My source code is all here on GitHub: https://github.com/davidechicco/heart-failure-gene-expression-analysis

Running it should be quite easy: you just need to change FIRST_RUN = FALSE to FIRST_RUN = TRUE on the first run in the retrieve_genes.r script, and then execute it.

Rscript retrieve_genes.r

Feel free to play with it! My goal now is to collapse probesets to genes on the combat_edata variable.

Thanks

EDIT: This is the head of the annotationLookup of the combat_edata variable, in case it helps understanding:

 thisAnnotLookup <- getBM(mart=mart, attributes=c("affy_hugene_1_0_st_v1", "ensembl_gene_id", "gene_biotype", "external_gene_name"), filter="affy_hugene_1_0_st_v1", values=rownames(combat_edata), uniqueRows=TRUE)

affy_hugene_1_0_st_v1 ensembl_gene_id   gene_biotype external_gene_name

1               7892640 ENSG00000199574         snoRNA           SNORD18C

2               7892536 ENSG00000200623         snoRNA           SNORD18A

3               7892648 ENSG00000230624 protein_coding             DDX39B

4               7892831 ENSG00000202093         snoRNA           SNORD58C

5               7892536 ENSG00000174444 protein_coding               RPL4

6               7892640 ENSG00000174444 protein_coding               RPL4