Question

An idea to identify batch effects

9.7 years ago

mjarosz • 0

Hi All,

I would like to ask you to comment on my idea to identify batch effects in a set of Affymetrix arrays coming from different studies:

I am thinking about discovering batches by clustering Affymetrix control probes using k-means or SOM. However, I am afraid that the differences between batches might be so small that clustering output will not correspond to real batches. What do you think about it?
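A minimal sketch of this idea in base R, assuming simulated control-probe intensities rather than real arrays. The batch labels, the noise level and the helper names `simulate_controls` and `cluster_vs_batch` are all hypothetical. It cross-tabulates k-means clusters against known batches for a strong and a very weak batch shift, which is one way to see whether the worry about small differences is justified.

```r
set.seed(1)
n_per_batch <- 10
n_probes <- 62  # illustrative number of control probesets
batch <- factor(rep(c("A", "B", "C"), each = n_per_batch))

# Simulated log2 control-probe intensities: samples in rows, probes in columns.
# Each batch gets its own per-probe offset; shift_sd sets how strong it is.
simulate_controls <- function(shift_sd) {
  offsets <- matrix(rnorm(nlevels(batch) * n_probes, sd = shift_sd),
                    nrow = nlevels(batch))
  noise <- matrix(rnorm(length(batch) * n_probes, sd = 0.3),
                  nrow = length(batch))
  8 + offsets[as.integer(batch), ] + noise
}

# Cross-tabulate k-means clusters against the known batch labels.
cluster_vs_batch <- function(x, k = nlevels(batch)) {
  km <- kmeans(x, centers = k, nstart = 25)
  table(cluster = km$cluster, batch = batch)
}

strong <- cluster_vs_batch(simulate_controls(shift_sd = 1))
weak   <- cluster_vs_batch(simulate_controls(shift_sd = 0.02))
print(strong)
print(weak)
```

When the clusters track the batches, each column of the table has a single non-zero cell; when the batch shift is small relative to the noise, the counts tend to spread across clusters.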

microarray batch-effect • 4.9k views

ADD COMMENT • link updated 7 days ago by Ram 43k • written 9.7 years ago by mjarosz • 0

score 2 · Answer 1 · 2014-09-01

9.7 years ago

Irsan ★ 7.8k

Perform an analysis of variance (ANOVA) of all relevant sample attributes (batch included, of course, plus e.g. diseaseState, driverMutation, sex, tumorSize, histology, ...) on the normalized log expression values. This way you can see the relative/proportional influence of batch on the expression estimates. So if you have your expression matrix (in R/Bioconductor: yourExpressionMatrix <- exprs(yourExpressionSet)), melt the matrix (e.g. with the melt() function from the reshape package in R) so that sample, probe and expression estimate become columns, and add the sample attributes as further columns. Then perform the ANOVA (with aov(), which ships with R in the stats package).
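The steps above could look roughly like this. It is a sketch under stated assumptions, not a definitive workflow: the expression matrix is simulated instead of coming from exprs(yourExpressionSet), the melt step uses base R (as.data.frame(as.table())) so that no extra package is needed, and including probe as a term in the formula is my own choice so that probe-to-probe differences do not swamp the other terms.

```r
set.seed(42)
n_probes <- 200
samples <- data.frame(
  sample       = paste0("S", 1:12),
  batch        = factor(rep(c("b1", "b2", "b3"), each = 4)),
  diseaseState = factor(rep(c("tumor", "normal"), times = 6)),
  stringsAsFactors = FALSE
)

# Simulated normalized log2 expression matrix (probes x samples), standing in
# for exprs(yourExpressionSet); batch b2 is shifted upward by 0.5.
expr <- matrix(rnorm(n_probes * nrow(samples), mean = 8, sd = 0.5),
               nrow = n_probes,
               dimnames = list(paste0("probe", 1:n_probes), samples$sample))
expr[, samples$batch == "b2"] <- expr[, samples$batch == "b2"] + 0.5

# Long format: one row per probe/sample pair (base-R stand-in for melt()).
long <- as.data.frame(as.table(expr), stringsAsFactors = FALSE)
names(long) <- c("probe", "sample", "expression")
long <- merge(long, samples, by = "sample")
long$probe <- factor(long$probe)

# ANOVA with probe, batch and one biological attribute as terms.
fit <- aov(expression ~ probe + batch + diseaseState, data = long)
tab <- summary(fit)[[1]]

# Share of the total sum of squares attributed to each term.
prop_var <- setNames(tab[["Sum Sq"]] / sum(tab[["Sum Sq"]]),
                     trimws(rownames(tab)))
print(round(prop_var, 3))
```

Comparing the batch entry of prop_var with the entries for the biological attributes gives the proportional influence described above. In this simulation only batch carries a real effect, so it should dominate diseaseState.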

ADD COMMENT • link 9.7 years ago by Irsan ★ 7.8k

Irsan, thank you very much. I will give it a try.

ADD REPLY • link 9.7 years ago by mjarosz • 0

Hi Irsan,

I would like to ask you two questions about the details of preparing data for ANOVA. I have decided that all samples from the same study scanned on the same day make up a batch.
(1) Some batches contain only one or two samples. In my opinion, these batches should be discarded from further analysis. However, this way I am losing valuable data (49 out of 397 samples). How do you resolve such issues in your analyses?
(2) Some batches contain only tumors (or only controls), not both. What would you suggest I do with them?

Best regards,
Marcin.

ADD REPLY • link 9.6 years ago by mjarosz • 0

Unless I misunderstand your questions, they are not related to preparing the data for ANOVA. But still here are my answers:

(1) Instead of discarding samples, include batch as a covariate in your differential expression workflow: design <- model.matrix(~ Batch + isTumor). This way, the resulting fold changes are differences in gene expression between tumor and normal samples, corrected for the batch effect.

(2) Keep all your samples and if you are worried about batch effect use batch as a covariate in your design matrix as described above.
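A small sketch of what that design matrix does, assuming a made-up phenotype table and a single simulated gene. The column names Batch and isTumor follow the formula above; the effect sizes are invented. It uses lm.fit() from base R for one gene; in a limma workflow the same design would typically be passed to lmFit() together with the whole expression matrix.

```r
set.seed(7)
pheno <- data.frame(
  Batch   = factor(rep(c("b1", "b2", "b3"), each = 6)),
  isTumor = factor(rep(c("normal", "tumor"), times = 9),
                   levels = c("normal", "tumor"))
)

# Same formula as in the reply: batch as a covariate, then the tumor term.
design <- model.matrix(~ Batch + isTumor, data = pheno)

# One simulated gene: true tumor effect of 1, plus a shift of 2 in batch b3.
y <- 8 + 1 * (pheno$isTumor == "tumor") + 2 * (pheno$Batch == "b3") +
  rnorm(nrow(pheno), sd = 0.2)

# Least-squares fit; the tumor coefficient is the batch-adjusted difference.
fit <- lm.fit(design, y)
tumor_effect <- unname(fit$coefficients["isTumortumor"])
print(colnames(design))
print(tumor_effect)

# Check how the classes are spread over batches. In this additive model a
# batch containing a single class cannot contribute to the tumor estimate.
class_by_batch <- table(pheno$Batch, pheno$isTumor)
print(class_by_batch)
```

The cross-table at the end is worth running on real data before fitting: if batch and tumor status are completely confounded, the design matrix becomes rank deficient and the tumor effect cannot be separated from batch.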

ADD REPLY • link 9.6 years ago by Irsan ★ 7.8k

Irsan,

In fact, I was thinking about preparing data for discovering batch effects using the aov function as you suggested. What shall I do for (1) and (2) in this case?

ADD REPLY • link 9.6 years ago by mjarosz • 0

Ram · Answer 2 · 2014-09-01

1

9.7 years ago

Ann ★ 2.4k

Are you talking about using the mismatch probes because they measure background?

That sounds like a neat idea. Could be worth a try.

However, you should take a look at how people are using hierarchical clustering and PCA to discover bias in data. Look at the limma vignette in Bioconductor and also Bioconductor Case Studies (book by Gentleman and friends).

I use R/Bioconductor methods "hclust" and "plotMDS" to find out if samples got switched or if there are batch effects. Then, if there are batch effects, I try to account for them using linear modeling in limma (microarray) or edgeR (RNA-Seq). But there are many methods for doing this -- those just happen to be the ones I am most familiar with.
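A rough sketch of that kind of check, assuming a simulated expression matrix with two made-up batches. It uses hclust() plus base R's cmdscale() for classical MDS, so it runs without Bioconductor; limma's plotMDS() would be the closer match to what is described above.

```r
set.seed(3)
n_genes <- 500
batch <- factor(rep(c("A", "B"), each = 6))

# Simulated log2 expression (genes x samples), sample names carry the batch.
expr <- matrix(rnorm(n_genes * length(batch), mean = 8),
               nrow = n_genes,
               dimnames = list(NULL, paste0(batch, "_", seq_along(batch))))
# Batch B is shifted by 2 on the first 200 genes.
expr[1:200, batch == "B"] <- expr[1:200, batch == "B"] + 2

# Euclidean distances between samples, then average-linkage clustering.
d <- dist(t(expr))
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
print(table(cluster = groups, batch = batch))

# Classical MDS on the same distances: two coordinates per sample.
mds <- cmdscale(d, k = 2)

# Plots only when run interactively.
if (interactive()) {
  plot(hc, main = "Samples clustered by expression")
  plot(mds, col = as.integer(batch), pch = 19,
       xlab = "MDS 1", ylab = "MDS 2")
}
```

If samples group by batch (or by scan date) rather than by biology in the dendrogram or the MDS plot, that is a hint that a batch term belongs in the linear model.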

If you'd like code examples let me know and I will post a link.

Also, a tip: Bioconductor has a method that lets you check the scan date on arrays. It's amazing how often people scan their arrays on completely different dates, sometimes years apart! I bet that scan date is the biggest source of batch effects in microarray experiments. I'd be very interested to read a study of this. If you find one please let me know?
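As a sketch of how scan dates can be turned into batch labels: the sample sheet below is entirely hypothetical (sample names, study names and timestamps are made up), and the date format stored in real CEL files may differ. With arrays read through the affy package, the dates are usually available via protocolData(yourAffyBatch)$ScanDate, which I am assuming here rather than calling.

```r
# Hypothetical sample sheet; all values are invented for illustration.
sheet <- data.frame(
  sample   = paste0("S", 1:8),
  study    = rep(c("studyA", "studyB"), each = 4),
  scanDate = c("2009-03-02 10:15:00", "2009-03-02 11:40:00",
               "2009-03-02 14:05:00", "2009-03-05 09:30:00",
               "2009-03-02 16:20:00", "2009-03-02 16:55:00",
               "2011-06-20 08:45:00", "2011-06-20 09:10:00"),
  stringsAsFactors = FALSE
)

# Keep only the calendar day; one batch = same study scanned on the same day.
sheet$scanDay <- as.Date(substr(sheet$scanDate, 1, 10))
sheet$batch <- paste(sheet$study, format(sheet$scanDay), sep = "_")

# Batch sizes, and how many samples sit in batches below a chosen minimum.
sizes <- table(sheet$batch)
min_size <- 3
small_batches <- names(sizes)[sizes < min_size]
n_small_samples <- sum(sheet$batch %in% small_batches)

# Time between the first and last scan, in days.
span_days <- as.integer(diff(range(sheet$scanDay)))

print(sizes)
print(n_small_samples)
print(span_days)
```

The resulting batch column can then be used as a factor in a design matrix or an ANOVA, and the size table shows up front how many samples fall into very small batches.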

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by Ann ★ 2.4k

Hi Ann, I misread your post: I thought you were the person who asked the question. My mistake; I have changed my comment.

Yes, some people use unsupervised hierarchical clustering (UHC), multidimensional scaling (MDS) and/or principal component analysis (PCA) for a semi-quantitative analysis of how much influence each sample attribute has on the expression profile of a sample. However, if that is what you are interested in, you are better off with analysis of variance (ANOVA). This general statistical method gives you what you want in a fully quantitative way. As the OP intuitively sensed, when a batch effect is small (but truly present and affecting your analysis), it is likely to be overlooked by UHC/MDS/PCA.

BTW, if you are concerned that a non-biological factor like batch is contributing to the variance in your expression estimates, just use it as a covariate in your differential testing formula.

ADD REPLY • link 9.7 years ago by Irsan ★ 7.8k

Dear Ann,

Thank you for your insight.

I was thinking about Affymetrix control probes (there are, for example, 62 of them on the HG-U133 Plus 2.0 array), not the mismatch ones. I am not sure whether the difference in mismatch probe signal between batches would be detectable.

You are right about scan dates: in one of the studies I am analysing, each array has been scanned on a different day.

ADD REPLY • link 9.7 years ago by mjarosz • 0