Differential gene expression analysis of real-valued expression data of genes and individuals
14 months ago


I have a gene expression dataset of 102 patients and 9 healthy controls. I downloaded it from GEO, applied several preprocessing steps (normalization, batch correction by date, etc.), and was finally able to generate a table containing:

  • the individuals on the rows

  • the genes on the columns

  • each entry ij containing a real value that indicates the expression of gene j in individual i

This first preprocessing phase took a lot of effort. Now I would like to perform a differential gene expression analysis, to see how gene expression differs between the patients and the healthy controls.

I looked at some packages online (such as DESeq2) and noticed that they all require input files containing raw counts. Unfortunately, I don't have raw counts.

I would like to perform the differential gene expression analysis myself, using R's biostatistics functions on my preprocessed table.

How can I do it? Any suggestions?
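As a rough illustration of what "doing it by hand" in base R could look like, here is a minimal sketch that runs a Welch t-test per gene and then adjusts the p-values for multiple testing. The data are simulated, the names `expr` and `group` are hypothetical, and the values are assumed to be on a log scale, so the difference in means can be read as a log fold change. Plain per-gene t-tests do not borrow information across genes the way limma does, so treat this as a baseline rather than a recommended workflow.

```r
# Simulated stand-in for the preprocessed table:
# individuals on rows, genes on columns (as in the question).
set.seed(1)
n_patients <- 102
n_controls <- 9
n_genes <- 200
expr <- matrix(rnorm((n_patients + n_controls) * n_genes, mean = 8, sd = 1),
               nrow = n_patients + n_controls,
               dimnames = list(NULL, paste0("gene", seq_len(n_genes))))
group <- factor(rep(c("patient", "control"), c(n_patients, n_controls)),
                levels = c("control", "patient"))

# Shift the first 10 genes upward in patients so there is something to detect.
expr[group == "patient", 1:10] <- expr[group == "patient", 1:10] + 2

# Welch t-test per gene (one column at a time): patients versus controls.
p_values <- apply(expr, 2, function(x) {
  t.test(x[group == "patient"], x[group == "control"])$p.value
})

# Difference in mean expression (patients minus controls);
# a log fold change only if the values are on a log scale (assumption).
log_fc <- colMeans(expr[group == "patient", ]) -
  colMeans(expr[group == "control", ])

# Benjamini-Hochberg adjustment, because many genes are tested at once.
results <- data.frame(gene = colnames(expr),
                      logFC = log_fc,
                      P.Value = p_values,
                      adj.P.Val = p.adjust(p_values, method = "BH"))
results <- results[order(results$P.Value), ]
head(results)
```

With only 9 controls the per-gene variance estimates are noisy, which is one reason the moderated statistics in limma are usually preferred for this kind of data.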


15 months ago
h.mon 25k

Look at Linear Models for Microarray Data, or limma. The User Guide is particularly helpful.


Adding to this: what you have is array data, which gives you intensity values rather than counts, so it is a relative measure of gene expression, not absolute counts as in RNA-seq. limma is pretty much the standard here, and following its workflow should get you the intended results. Be sure to read the manual thoroughly, and also look at this end-to-end workflow for Affymetrix microarrays.
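For orientation, a two-group limma analysis might look roughly like the sketch below. It assumes the limma package from Bioconductor is installed, uses simulated data in place of the real table, and the names `tab`, `expr`, and `group` are hypothetical. Note that limma expects genes on rows and samples on columns, so a table with individuals on rows has to be transposed first.

```r
library(limma)  # Bioconductor package; assumed to be installed

set.seed(1)
# Hypothetical stand-in for the preprocessed table:
# individuals on rows, genes on columns.
tab <- matrix(rnorm(111 * 50, mean = 8), nrow = 111,
              dimnames = list(paste0("sample", 1:111), paste0("gene", 1:50)))
group <- factor(rep(c("patient", "control"), c(102, 9)),
                levels = c("control", "patient"))

# limma expects genes on rows and samples on columns, so transpose.
expr <- t(tab)

# Design matrix: an intercept (control mean) plus a
# patient-versus-control coefficient named "grouppatient".
design <- model.matrix(~ group)

fit <- lmFit(expr, design)
fit <- eBayes(fit)

# coef = "grouppatient" selects the patient minus control difference.
topTable(fit, coef = "grouppatient", number = 10, sort.by = "P")
```

Here `logFC` is the estimated patient minus control difference for each gene, and the p-values test whether that difference is zero. The batch variable could be added to the design formula instead of being removed beforehand, but that is a separate modelling choice.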


Thank you guys for your replies. There's a lot of material online and I feel like I am drowning in it. I found this interesting question and answer, which I tried to adapt to my case. I used lmFit(table) and eBayes(fit), as explained there, without a design.

I was able to generate a table with the values of the fitted model for the patients, and another table for the healthy controls. This is the head of the topTable of the patients' fit:

head(topTable(fit, n=Inf, sort="p", p.value=0.05))

         logFC AveExpr   t  P.Value adj.P.Val    B
EEF1A1P5  13.2    13.2 362 5.03e-26  2.05e-22 45.2
MIR6891   13.2    13.2 358 5.77e-26  2.05e-22 45.2
HLA.G     12.8    12.8 356 6.34e-26  2.05e-22 45.1
RN7SK     13.6    13.6 355 6.41e-26  2.05e-22 45.1
MALAT1    12.8    12.8 354 6.65e-26  2.05e-22 45.1
HLA.J     12.9    12.9 352 7.31e-26  2.05e-22 45.1

Some questions:

1) What is the meaning of the p-values associated with each gene that I obtained this way?

2) Was it a good/useful idea to split the patients and healthy controls into two different tables and perform the analysis separately? Or should I keep them together and insert this information into the design parameter? If the latter, how?
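A note on both questions: when lmFit is called without a design, limma fits an intercept-only model, so the single coefficient per gene is its mean expression and the test asks whether that mean differs from zero. That would explain why logFC equals AveExpr in the table above and why the p-values are so small, since log-intensities around 13 are trivially different from zero. The base R sketch below reproduces the unmoderated version of that test on simulated values for one gene; the variable names are hypothetical, and eBayes additionally moderates the variances, so the numbers would not match limma's output exactly.

```r
set.seed(1)
# Simulated log-intensities for one gene across 102 samples (values near 13).
x <- rnorm(102, mean = 13, sd = 0.4)

# Intercept-only linear model: the single coefficient is the sample mean.
fit0 <- lm(x ~ 1)
coef_mean <- unname(coef(fit0))

# The corresponding t-test asks whether the mean differs from zero,
# not whether patients differ from controls.
one_sample <- t.test(x, mu = 0)

c(coefficient = coef_mean,
  mean = mean(x),
  t = unname(one_sample$statistic))
```

Fitting each group separately therefore gives no patient-versus-control comparison. Keeping all samples in one matrix and encoding the group in the design matrix is what yields a coefficient for the difference between groups.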


