Hi all
I'm a bit baffled: I'm re-analyzing some published data from a collaborator and I can't get more than 10-20% of my genes to overlap with theirs. I'd like to make sure I'm not screwing it up, so I hope this is the right place to ask.
Example:
Analysis of 20 samples, 10 paired tumor-normal
My R code (based on limma) looks like:
library(affy)   # ReadAffy(), expresso()
library(limma)  # lmFit(), eBayes(), topTable()

fileList <- list.files(".", pattern = "\\.CEL$", full.names = TRUE, include.dirs = FALSE, recursive = FALSE)
sample <- ReadAffy(filenames = fileList, verbose = TRUE)
eset <- expresso(sample,
                 bgcorrect.method = "mas",
                 normalize.method = "quantiles",
                 pmcorrect.method = "mas",
                 summary.method = "medianpolish")
as.matrix(colnames(exprs(eset)))
# prepare design
target <- cbind(c(rep(5,2), rep(6,2), rep(7,2), rep(8,2), rep(9,2), rep(10,2), rep(31,2), rep(36,2), rep(37,2), rep(38,2)), rep(c("D","N"), 10))
colnames(target) <- c("ID", "status")
rownames(target) <- colnames(exprs(eset))
target <- as.data.frame(target)
paired_samples <- factor(target$ID)
Treat <- factor(target$status, levels=c("D","N"))
design <- model.matrix(~paired_samples+Treat)
fit <- lmFit(eset, design)
fit <- eBayes(fit)
topTable(fit, coef="TreatN")
but the list of genes I get not only has 0 hits at FDR < 5% (the threshold stated in the paper), but only 11 of their >400 genes overlap with mine even when I use the raw p-value instead of the adjusted one.
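For reference, this is a sketch of how I'm computing the overlap; `published_genes` is a hypothetical placeholder for the probe-set IDs taken from the paper's supplementary table.

```r
# Hedged sketch of the overlap computation (published_genes is hypothetical,
# read in from the paper's supplementary gene list).
tt <- topTable(fit, coef = "TreatN", number = Inf)
my_top    <- rownames(tt)[tt$P.Value < 0.05]   # raw p-value cutoff
published <- published_genes                   # character vector of probe-set IDs
length(intersect(my_top, published))           # in my case only ~11 of >400
```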
I even tried a non-paired analysis with this design:
design <- cbind(N=1,DvsN=rep(c(1,0),10))
rownames(design) <- colnames(exprs(eset))
The heatmap separates normal from tumor more cleanly, but I still have the same poor overlap between my significant genes and the published ones.
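For what it's worth, the hand-built unpaired design above can also be written with `model.matrix()`, which I find harder to get wrong; it fits the same model, with the coefficient sign flipped (TreatN is N minus D, whereas the DvsN column above marks the D samples).

```r
# Equivalent unpaired design built with model.matrix() instead of cbind();
# "TreatN" is the N-vs-D comparison (opposite sign to the DvsN column above).
Treat  <- factor(target$status, levels = c("D", "N"))
design <- model.matrix(~Treat)
rownames(design) <- colnames(exprs(eset))
fit2 <- eBayes(lmFit(eset, design))
topTable(fit2, coef = "TreatN")
```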
-- edit --
An example of my normalized data is in the attached image:
https://drive.google.com/file/d/0BxzhXZ5eBptDRnd0SXpTNkNHbGs/view?usp=sharing
Any suggestion would be just awesome
What'd they use in the paper to do the analysis? I've seen more than a few papers that were analyzed in a completely incompetent manner. Also, did you do any sort of QC? That's not the sort of thing mentioned in the methods sections of papers, and it could have led them to either exclude samples or tweak methods.
For DEGs they say they used the Benjamini-Hochberg (FDR) correction, but they don't say which test they actually ran to obtain the p-values that were corrected; they only mention a 5% FDR cutoff.
For QC I quickly checked the boxplots after normalizing (they also used MAS background correction, medianpolish, etc.) and they look fine to me (I'll attach an image to the main post as soon as I reply to this). The paper mentions the use of "scatterplots, correlation and trees" to check quality, but it doesn't say whether any samples were excluded.
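In case it helps, this is a minimal sketch of the kind of QC the paper describes (boxplots, array-array correlations, and a correlation-based tree), assuming `sample` and `eset` from the code in the question:

```r
# Hedged QC sketch along the lines of the paper's "scatterplots,
# correlation and trees"; nothing here is their exact procedure.
boxplot(sample, main = "Raw probe intensities")        # per-array distributions
m <- exprs(eset)
round(cor(m), 2)                                       # array-array correlation matrix
plot(hclust(as.dist(1 - cor(m))),                      # tree on 1 - correlation
     main = "Clustering on 1 - correlation")
```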
Those don't look terrible to me either, though they aren't quite as well quantile normalized as I'd like to see. The thing with methods sections is that they almost always leave out important details. Your best bet is to ply one of the authors with booze and get the actual details that way. While I'm sure there are important implementational differences between limma and the Affymetrix suite/dChip, it's difficult to believe that that alone could lead to such a huge difference (though I'm no expert in this particular area, so I could be off there).
Do you happen to know what version of limma they used? Sometimes versions can make a *big* difference if some underlying parameter is tweaked under the hood between versions (e.g. as has happened with edgeR). Specific p-values aside, are the trends preserved between your analysis and theirs? Or are you simply stuck with gene lists from them (and unable to do more specific comparisons)? Your normalization method is listed as quantiles; is that what they used?
The paper says they used Affymetrix's MicroArray Suite and dChip with the medianpolish normalization method; no limma. The ratio of up-/down-regulated genes is similar in both analyses, but the probe names are not.
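Since they used MicroArray Suite rather than an expresso() mix, one thing I could try is affy's `mas5()`, which is meant to reproduce the full MAS 5.0 preprocessing; whether that actually matches their pipeline is a guess on my part, not something the paper confirms.

```r
# Hedged alternative preprocessing: mas5() implements the full MAS 5.0
# algorithm, which may be closer to what the paper used than the mixed
# expresso() call above (assumption, not a confirmed match).
eset_mas5 <- mas5(sample)                      # MAS 5.0 signal values (linear scale)
exprs(eset_mas5) <- log2(exprs(eset_mas5))     # log2 before feeding into lmFit()
```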
However, what mainly concerns me is this: unless I missed something important in the code above, I would expect a good part of my results to overlap with theirs... but if my code is correct, and their analysis is correct, then what's the real answer?!