published array data not replicable with limma analysis
TriS ★ 4.7k · 9.3 years ago

Hi all

I'm a bit baffled: I'm re-analyzing some published data from a collaborator and I cannot get more than 10-20% of my genes to overlap with theirs. I would like to make sure I'm not screwing it up, so I hope this is the best place to ask.

Example:

Analysis of 20 samples, 10 paired tumor-normal

My R code (based on limma) looks like:

library(affy)
library(limma)

# read all CEL files in the working directory
fileList <- list.files(".", pattern = "\\.CEL$", full.names = TRUE, include.dirs = FALSE, recursive = FALSE)
sample <- ReadAffy(filenames = fileList, verbose = TRUE)

# MAS background correction, quantile normalization, median-polish summarization
eset <- expresso(sample,
                 bgcorrect.method="mas",
                 normalize.method="quantiles",
                 pmcorrect.method="mas",
                 summary.method="medianpolish")
# check the sample (column) order before building the design
as.matrix(colnames(exprs(eset)))
# prepare design
target <- cbind(c(rep(5,2), rep(6,2), rep(7,2), rep(8,2), rep(9,2), rep(10,2), rep(31,2), rep(36,2), rep(37,2), rep(38,2)), rep(c("D","N"),10))
colnames(target) <- c("ID", "status")
rownames(target) <- colnames(exprs(eset))
target <- as.data.frame(target)
paired_samples <- factor(target$ID)
Treat <- factor(target$status, levels=c("D","N"))

# paired design: block on patient ID; coefficient "TreatN" tests N vs D
design <- model.matrix(~paired_samples+Treat)
fit <- lmFit(eset, design)
fit <- eBayes(fit)
topTable(fit, coef="TreatN")

But the gene list I get has 0 genes at FDR < 5% (the cutoff stated in the paper), and even if I use the raw p-value instead of the adjusted one, only 11 of their >400 genes overlap with mine.
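In case it helps, this is how I'm pulling my significant list and checking the overlap (a sketch; published_ids is just a stand-in for the probe-set IDs from their supplementary table):

# full results table for the N vs D coefficient
tt <- topTable(fit, coef="TreatN", number=Inf)
# probe-sets significant at 5% FDR
sig <- rownames(tt)[tt$adj.P.Val < 0.05]
# published_ids: stand-in for the paper's probe-set IDs
length(intersect(sig, published_ids))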

I even tried an unpaired analysis with this design:

# unpaired design: intercept plus an indicator for the D sample of each pair
design <- cbind(N=1, DvsN=rep(c(1,0),10))
rownames(design) <- colnames(exprs(eset))
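(For comparison, the same unpaired test can also be written with model.matrix, reusing the Treat factor from above; the sign of the coefficient flips relative to DvsN, but the test is the same. Just a sketch:)

# equivalent unpaired design built from the Treat factor defined earlier
design_unpaired <- model.matrix(~Treat)
fit_unpaired <- eBayes(lmFit(eset, design_unpaired))
topTable(fit_unpaired, coef="TreatN")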

The heatmap separates normal from tumor more cleanly, but I still have the same problem: my significant genes barely overlap with the published ones.

-- edit --

An example of my normalized data is in the image linked below:

https://drive.google.com/file/d/0BxzhXZ5eBptDRnd0SXpTNkNHbGs/view?usp=sharing

Any suggestion would be just awesome.

paired-analysis Microarray R limma

What did they use in the paper to do the analysis? I've seen more than a few papers that were analyzed in a completely incompetent manner. Also, did you do any sort of QC? That's the kind of thing that rarely makes it into a methods section but could have led to either excluding samples or tweaking methods.


For DEGs they say they used Benjamini-Hochberg (i.e. FDR) correction, but they don't say which test they actually ran to obtain the p-values that were corrected; they only mention a 5% FDR cutoff.
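(BH itself is a one-liner in R, so the missing detail really is the upstream test; a sketch with pvals as a stand-in for whatever raw p-values they computed:)

# Benjamini-Hochberg (FDR) adjustment of a vector of raw p-values
adj <- p.adjust(pvals, method="BH")
# genes passing their stated 5% FDR cutoff
sum(adj < 0.05)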

For QC I quickly checked the boxplots after normalizing (they also used MAS with median polish, etc.) and they look fine to me (I'll attach an image to the main post as soon as I reply to this). The paper mentions the "use of scatterplots, correlation and trees" to check quality, but it doesn't say whether any samples were excluded.
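The quick checks I ran look roughly like this (a sketch along the lines of the paper's "scatterplots, correlation and trees"):

# per-sample boxplots of the normalized intensities
boxplot(exprs(eset), las=2)
# pairwise sample-sample correlations
round(cor(exprs(eset)), 2)
# hierarchical clustering tree of the samples
plot(hclust(dist(t(exprs(eset)))))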


Those don't look terrible to me either, though they aren't quite as well quantile-normalized as I'd like to see. The thing with methods sections is that they almost always leave out important details. Your best bet is to ply one of the authors with booze and get the actual details that way. While I'm sure there are implementation differences between limma and the Affymetrix suite/dChip, it's difficult to believe that alone could lead to such a huge difference (though I'm no expert in this particular area, so I could be off there).


Do you happen to know what version of limma they used? Sometimes versions can make a *big* difference if some underlying parameter is tweaked under the hood between releases (e.g. as has happened with edgeR). Setting aside the specific p-values, are the trends preserved between your analysis and theirs? Or are you simply stuck with their gene lists (and unable to do more specific comparisons)? Your normalization method is listed as quantiles; is that what they used?
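(At least the versions on your end are easy to record, e.g.:)

# report the locally installed limma version and the full session details
packageVersion("limma")
sessionInfo()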


The paper says they used Affymetrix's MicroArray Suite and dChip with the median-polish normalization method; no limma. The ratio of up- to down-regulated genes is similar in both analyses, but the probe names are not.

However, what mainly concerns me is this: unless I missed something important in the code above, I would expect a good part of my results to overlap with theirs... but if my code is correct, and their analysis is correct... then what's the real answer?!

Gordon Smyth ★ 7.0k · 9.2 years ago

You have used very different pre-processing and statistical methods from those used in the published paper, so I am not surprised the results aren't the same.

The trouble with the mas background correction method is that it generates a lot of noise for low intensity probes. To see what I mean, try typing plotSA(fit). I think you will find there is a strong decreasing trend in the residual variances. You could improve your analysis with one or more of the following:

1. Replace eBayes(fit) with eBayes(fit, trend=TRUE)

2. Filter out about 25% or so of the lowest intensity probe-sets, for example:

keep <- fit$Amean > quantile(fit$Amean, prob=0.25)
fit2 <- eBayes(fit[keep,], trend=TRUE)

3. Switch to RMA or gcRMA background correction and normalization, for example:
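A minimal sketch of that switch, replacing the expresso() call above (rma() is in the affy package; gcrma() requires the gcrma package):

# RMA: background correction, quantile normalization and
# median-polish summarization in one step
eset <- rma(sample)

# or gcRMA:
# library(gcrma)
# eset <- gcrma(sample)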


Gordon, thanks for the insights; those are very good points. I did think about switching to RMA, but I hadn't thought about filtering out the probes in the lowest intensity quartile! That'll definitely help with the noise.
