Question

What values should I use for principal component analysis of microarray data?

5.7 years ago

reubenmcgregor88 ▴ 50

I have been doing a bit of reading about Principal Component Analysis (PCA) and have been using R to carry out PCA on some microarray data, starting with background-subtracted raw fluorescence values.

I can run PCA on any of the following datasets, each one giving different plots:

1) Unnormalised, log transformed data
2) Normalised data
3) Normalised, log transformed data

Am I right in saying that, depending on what question I am asking, I should run the PCA on 1) or 3)?

If I wanted to see if samples clustered specifically based on different microarray chips for example I would have to do it on 1) otherwise normalisation would mask this effect?

Once I have done this and ascertained that there is not specific clustering on PCA based on microarrays I can run another PCA on Normalised (for example quantile normalised) data to see if my samples cluster by treatment for example?

And it is my understanding that this type of data should really always be log transformed, to stop the PCs being weighted based on huge differences in fluorescence values?

Also I could not find a consensus on whether to scale the data or not. I presume as all the data is on the same scale here that scaling the data should have little effect on the outcome?

Thank you

microarray PCA R

updated 5.7 years ago by Kevin Blighe 87k • written 5.7 years ago by reubenmcgregor88 ▴ 50

score 3 · Accepted Answer · 2018-07-31

You should use the log (base 2) normalised expression values. These will typically be accessible via the exprs() function applied to your ExpressionSet object.
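As a rough sketch of what that looks like in practice: the matrix below is simulated and only stands in for what exprs() would return (probes in rows, samples in columns). The names expr_log2 and group are made up for illustration. prcomp() is base R and expects observations in rows, hence the transpose.

```r
# Simulated stand-in for log2 normalised expression values.
# In practice this matrix would come from exprs() on your ExpressionSet;
# 'expr_log2' and 'group' are hypothetical names used for illustration.
set.seed(1)
n_probes  <- 500
n_samples <- 8
expr_log2 <- matrix(rnorm(n_probes * n_samples, mean = 8, sd = 1),
                    nrow = n_probes,
                    dimnames = list(paste0("probe", seq_len(n_probes)),
                                    paste0("sample", seq_len(n_samples))))
group <- rep(c("control", "treated"), each = 4)

# Add a simulated treatment effect to the first 100 probes
expr_log2[1:100, group == "treated"] <- expr_log2[1:100, group == "treated"] + 3

# prcomp() treats rows as observations, so pass samples x probes
pca <- prcomp(t(expr_log2), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each PC
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Sample scores on PC1 and PC2, coloured by group
plot(pca$x[, 1], pca$x[, 2],
     col = ifelse(group == "treated", "red", "blue"), pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.1f%%)", 100 * var_explained[2]))
```

The same plot coloured by chip or batch instead of treatment is the usual way to look for technical clustering.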

If I wanted to see if samples clustered specifically based on different microarray chips for example I would have to do it on 1) otherwise normalisation would mask this effect?

It is not clear what you are asking. If you want to check the reproducibility of samples across multiple different chips, then process the different chips independently and generate separate PCA bi-plots using the log (base 2), i.e. log2, normalised values. It's not clear why you would ever want to use un-normalised data for PCA, let alone logged, un-normalised data.

Once I have done this and ascertained that there is not specific clustering on PCA based on microarrays I can run another PCA on Normalised (for example quantile normalised) data to see if my samples cluster by treatment for example?

You should always use log2 normalised data.
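For reference, here is a minimal base R sketch of the textbook quantile normalisation algorithm followed by a log2 transform. It is only an illustration on simulated intensities, it assumes no missing values, and quantile_normalise and raw are hypothetical names. In a real analysis you would normally rely on an established implementation such as normalizeQuantiles() in limma or normalize.quantiles() in preprocessCore.

```r
# Textbook quantile normalisation in base R (illustrative sketch only;
# assumes a numeric probes x samples matrix with no missing values).
quantile_normalise <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank of each value within its sample
  sorted <- apply(x, 2, sort)                         # each sample sorted ascending
  ref    <- rowMeans(sorted)                          # mean of the k-th smallest values
  out    <- apply(ranks, 2, function(r) ref[r])       # map ranks back to reference values
  dimnames(out) <- dimnames(x)
  out
}

set.seed(2)
# Simulated background-subtracted intensities: 1000 probes x 6 samples,
# each sample with a different overall brightness (a technical effect)
brightness <- c(1, 1.5, 0.7, 2, 1.2, 0.9)
raw <- sapply(brightness, function(b) b * rlnorm(1000, meanlog = 7, sdlog = 1))
colnames(raw) <- paste0("sample", 1:6)

expr_log2 <- log2(quantile_normalise(raw))

# After normalisation every sample has the same distribution of values
apply(expr_log2, 2, quantile)
```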

And it is my understanding that this type of data should really always be log transformed, to stop the PCs being weighted based on huge differences in fluorescence values?

Yes. Using un-normalised data would simply not be valid unless, by a very, very small chance, there was no experimental or technical bias in your samples. The chance of that happening may be smaller than that of winning the National Lottery, though.
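To illustrate the point about the log transform, here is a small simulation (not real array data; all names are hypothetical). It assumes probe baselines spanning several orders of magnitude with multiplicative noise, so that on the raw scale the variance of a probe grows with its intensity. Since PCA follows variance, the brightest probes would then drive the leading PCs.

```r
set.seed(3)
n_probes  <- 1000
n_samples <- 10

# Hypothetical simulated intensities: baselines from 2^4 to 2^15,
# with multiplicative noise of the same relative size for every probe
baseline <- 2^runif(n_probes, min = 4, max = 15)
raw <- baseline * matrix(2^rnorm(n_probes * n_samples, sd = 0.3), nrow = n_probes)

# Share of the total variance carried by the 10% most variable probes
top_share <- function(x) {
  v <- apply(x, 1, var)
  n_top <- ceiling(0.1 * length(v))
  sum(sort(v, decreasing = TRUE)[seq_len(n_top)]) / sum(v)
}

share_raw  <- top_share(raw)        # raw scale: concentrated in the brightest probes
share_log2 <- top_share(log2(raw))  # log2 scale: spread far more evenly
c(raw = share_raw, log2 = share_log2)
```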

Also I could not find a consensus on whether to scale the data or not. I presume as all the data is on the same scale here that scaling the data should have little effect on the outcome?

Yes, there is no consensus here. I recommend scaling the data prior to performing the PCA transformation because it helps to 'even out' the variances of your variables, and with them the spread of variance across your principal components (PCs). As PCA is fundamentally based on variance, this can be important. If you do not scale the data, even log2 normalised data, you may find that the first PC substantially dominates the variance in the dataset.
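A quick way to see the effect on your own data is to run prcomp() with and without scale. = TRUE and compare the proportion of variance per PC. The sketch below uses a simulated log2 matrix in which a few probes are far more variable than the rest. That is an assumption made purely to show the effect, and the names are hypothetical.

```r
set.seed(4)
n_probes  <- 300
n_samples <- 12

# Hypothetical log2 normalised matrix (probes x samples):
# 5 highly variable probes, the remainder with low variance
probe_sd  <- c(rep(4, 5), rep(0.3, n_probes - 5))
expr_log2 <- matrix(rnorm(n_probes * n_samples, mean = 8, sd = probe_sd),
                    nrow = n_probes)

# Proportion of total variance explained by each PC
prop_var <- function(p) p$sdev^2 / sum(p$sdev^2)

pca_unscaled <- prcomp(t(expr_log2), center = TRUE, scale. = FALSE)
pca_scaled   <- prcomp(t(expr_log2), center = TRUE, scale. = TRUE)

# Compare the first three PCs with and without scaling
round(prop_var(pca_unscaled)[1:3], 3)
round(prop_var(pca_scaled)[1:3], 3)
```

Note that scale. = TRUE fails if any probe is constant across samples, so such probes need to be removed first.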

Kevin