Analysis of a batch-normalized expression dataset
3.4 years ago
dave ▴ 20

Hello, I am a long-time lurker but new poster to Biostars, so apologies in advance for any improper forum etiquette! Thank you in advance for your patience; I have had some formal bioinformatics training through a research fellowship but am largely self-taught.

I have a question about the downstream analysis of publicly available data from GEO. I hope to use GSE124814 as a large dataset to confirm trends seen in a smaller RNA-seq dataset of tumor samples obtained by my group.

GSE124814 is a compiled expression set of 23 medulloblastoma datasets, comprising 1350 tumor samples and 291 control samples. I am working with a subset of 233 tumor samples and the 291 control samples. The authors applied the removal of unwanted variation (RUV) method to account for batch effects, and the entire matrix was quantile normalized.

I am interested in performing enrichment analysis (through GSEA) and weighted gene co-expression network analysis (WGCNA) on these data, but after consulting the literature I am not clear whether the normalization methods used lend themselves to the analyses I wish to perform. Obviously, I could avoid the confusion by obtaining the raw .CEL files and processing them myself, but I lack the expertise (and the hardware) to confidently reproduce an analysis of this size.

The resulting expression matrix is roughly centered at 0, with values ranging from about -7.7 to 9.4. Honestly, I am unsure how to interpret negative expression values like this; the data almost seem to be log scaled, though I could not find anything on this in the literature. I have been considering an exponential transformation of the data to work from a positive distribution; would this be reasonable? The density plot below shows the distribution.

[Figure: density plot of expression values]
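For reference, this is roughly how I loaded and inspected the matrix (a minimal sketch using GEOquery; it assumes the processed RUV/quantile-normalized values are in the series matrix, though in practice they may sit in a supplementary file, and the object names are just illustrative):

```r
library(GEOquery)
library(Biobase)

# Download the processed series matrix (returns a list of ExpressionSets)
gse  <- getGEO("GSE124814", GSEMatrix = TRUE)[[1]]
expr <- exprs(gse)   # probes/genes in rows, samples in columns

range(expr)                 # roughly -7.7 to 9.4 in my case
summary(as.vector(expr))    # centered near 0
plot(density(as.vector(expr)), main = "Expression density")  # the plot shown above
```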

Am I able to use the values as they are for analysis? I performed a very basic differential expression analysis with a Mann-Whitney U test for preliminary filtering, as I was unsure whether these data are suited for limma/edgeR. I have not yet attempted large-scale GSEA. I have run WGCNA, and the resulting network did not show robust connectivity unless I lowered the soft-threshold power to approximately 4, although the scale-free topology analysis (see below) suggests a power of 10-12 is more appropriate for constructing the adjacency matrix. Further, the WGCNA literature suggests a power of 12 for a signed network using data of this size.

[Figure: scale-free topology fit vs. soft-threshold power]
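For context, the soft-threshold scan behind the figure above was along these lines (a simplified sketch; `datExpr` is the expression matrix from the previous snippet transposed so samples are in rows, and the variable names are illustrative):

```r
library(WGCNA)

datExpr <- t(expr)   # WGCNA expects samples in rows, genes in columns

powers <- c(1:10, seq(12, 20, 2))
sft <- pickSoftThreshold(datExpr,
                         powerVector = powers,
                         networkType = "signed",
                         verbose     = 2)

# Scale-free topology fit index (R^2) and mean connectivity at each power
sft$fitIndices[, c("Power", "SFT.R.sq", "mean.k.")]
```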

tl;dr: How should I interpret data processed with RUV and quantile normalization, can I use them for WGCNA/GSEA, and which differential expression method is best suited to data of this type?

Thanks in advance.

R Normalization WGCNA GSEA • 1.2k views

Hi Dave! I am a med student working on a similar project. Could you please let me know how you went about this in the end? Thanks a lot!


I've moved your post to a comment - don't add answers unless you're answering the top-level question.

10 months ago
LChart 3.9k

It looks like the authors subtracted the mean expression value of each probe as part of their normalization, which is why you see negative values. It is perfectly appropriate to run something like limma on these data, and the relative log-fold changes will generally be of the same order as in other microarray datasets. The "mean expression" of every gene will necessarily be 0, but this is not a major issue except for sanity checks when comparing across datasets.
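For example, something along these lines would be reasonable (a minimal sketch; `expr` is the RUV/QN matrix with genes in rows and `group` is a tumor/control factor, both hypothetical names):

```r
library(limma)

# expr:  genes x samples matrix of the RUV/quantile-normalized values
# group: factor with levels "control" and "tumor", one entry per sample
design <- model.matrix(~ group)   # intercept = control, second column = tumor vs control

fit <- lmFit(expr, design)        # per-gene linear model
fit <- eBayes(fit)                # moderated t-statistics

res <- topTable(fit, coef = 2, number = Inf)  # log-fold changes and adjusted p-values
head(res)
```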

Batch and covariate correction is typically applied prior to WGCNA, and quantile normalization (to the empirical quantiles, which is what I assume has been done here) is also occasionally performed. It should not greatly affect the ability to run WGCNA and will only mildly influence results relative to no quantile normalization; it may, in fact, help reduce the impact of outliers.
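If you want to confirm that the matrix really was quantile normalized to the empirical quantiles, a quick check is that every sample shares the same set of sorted values (a sketch, again assuming `expr` is the downloaded genes x samples matrix):

```r
# After quantile normalization, each column's quantiles should be (near-)identical
q <- apply(expr, 2, quantile, probs = c(0.25, 0.5, 0.75))
q[, 1:5]   # quartiles should match across samples

# Maximum difference between the sorted values of two samples; should be ~0
max(abs(sort(expr[, 1]) - sort(expr[, 2])))
```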

I have run WGCNA, and the resulting network did not show robust connectivity unless I lowered the soft-threshold power to approximately 4, although the scale-free topology analysis (see below) suggests a power of 10-12 is more appropriate for constructing the adjacency matrix.

I'm not sure what you mean by "robust connectivity". It should be clear from the dendrogram whether it worked or not, but for such large sample sizes my recommendation is to set the soft-threshold power to 12 (https://support.bioconductor.org/p/124844/#124864) and only change it if things look totally weird.
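Concretely, for a signed network at power 12 I would just run the standard blockwise pipeline and judge the result from the dendrogram and module sizes (a sketch; `datExpr` is the samples x genes matrix as in your scan above, and the parameter choices below are common defaults rather than anything specific to this dataset):

```r
library(WGCNA)

net <- blockwiseModules(datExpr,
                        power          = 12,
                        networkType    = "signed",
                        TOMType        = "signed",
                        minModuleSize  = 30,
                        mergeCutHeight = 0.25,
                        numericLabels  = TRUE,
                        verbose        = 3)

table(net$colors)  # module sizes; one giant unassigned module (label 0) would look weird

# Dendrogram with module colors for the first block
plotDendroAndColors(net$dendrograms[[1]],
                    labels2colors(net$colors)[net$blockGenes[[1]]],
                    "Modules",
                    dendroLabels = FALSE)
```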

The one potential issue is that the RUV approach has necessarily removed degrees of freedom, so the p-values may not be perfectly calibrated. However, this should not be a major concern here, as n is quite large (smallest group n > 200).

You should be fine applying either method to the post-RUV and post-QN data.
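For the GSEA side, one common approach is to rank genes by the limma moderated t-statistic and run a preranked analysis, e.g. with fgsea (a sketch; `res` is the limma table from the earlier snippet and the .gmt path is just a placeholder for whichever MSigDB collection you want to test):

```r
library(fgsea)

# Rank genes by the moderated t-statistic; identifiers must match those in the .gmt file
ranks <- setNames(res$t, rownames(res))
ranks <- sort(ranks, decreasing = TRUE)

# Placeholder path: substitute the gene set collection you actually want to test
pathways <- gmtPathways("h.all.v7.5.1.symbols.gmt")

gsea_res <- fgsea(pathways = pathways, stats = ranks)
head(gsea_res[order(gsea_res$padj), ])
```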
