Question

How to perform PCA on gene expression?

1

Entering edit mode

5.5 years ago

Thomas B. ▴ 30

Hi,

I am interested in the identification of marker genes of infection by a parasite. I have several biological samples corresponding to different infection stages. I quantified the expression of genes that are though to be differentially expressed by qPCR. Now I would like to run a principal component analysis (PCA) to (1) cluster the samples based on gene expression and (2) identify which of the genes contribute the most to clustering.

From the qPCR experiment, I got data under three forms: Cq values, relative quantities and normalized expression against reference genes.

I read tutorials but most of them are focused on RNAseq data analysis, not qPCR. Some of my questions remain unanswered and I really hope to get some more explanations here.

Which data is it better to use for the PCA? Intuitively I would use normalized expression because the variance between samples has been taken into account. But I can see, for example in the HTqPCR package, that Cq values are rather plotted.
Should I apply a log transformation such as log(N+1)? My data have a logarithmic distribution. I read that normality is not an assumption of PCA; but the closer the data are to a normal distribution, the better the PCA performs.
Should I scale the data? The prcomp function in R (for example) offers this possibility. But I guess this is related to the type of data that is used. My feeling is that this is useless providing I use normalised gene expression, for the reason I provided in my first question.

I tried several combinations and they did not yield in the same output. I would like to find the proper, objective way - I don't want to just pick the one that suits best to my expectations.

PCA qPCR Clustering Marker gene Gene expression • 8.5k views

ADD COMMENT • link updated 5.5 years ago by ahmad mousavi ▴ 800 • written 5.5 years ago by Thomas B. ▴ 30

0

Entering edit mode

Hi you could try following on your raw reads count file:

library(ggplot2)
# gr is group name of your samples based on columns in counts matrix.
pc<-prcomp(log2(counts+1))
#pcr<- data.frame(pc$r[,1:2])
pcr <- data.frame(pc$r[,1:2], Group=gr)
barplot(pc$rotation)
ggplot(pcr, aes(PC1, PC2, color=Group))+geom_point(size=7)+theme_bw()

ADD REPLY • link 5.5 years ago by ahmad mousavi ▴ 800

1

Entering edit mode

I would not use raw read counts for PCA

ADD REPLY • link 3.9 years ago by Kevin Blighe 87k

score 6 · Answer 1 · 2018-11-02

My preference for input to PCA is definitely normalised data that follows a Gaussian / binomial distribution. In RNA-seq, normalised data follows a negative binomial distribution, so, it 'requires' a transformation to logged (e.g. logCPM (EdgeR) or regularised log (DESeq2)) data prior to running PCA. The additional scaling step can still be used, as it helps to 'iron out' any extra creases in the data.

As you have qPCR data, you should check its distribution and then make a decision. Logging it is certainly feasible. If you log it and additionally convert it to Z-scores outside of PCA, then you should switch off the further scaling step in prcomp().

For checking the the relationship of genes to each PC, you need to look at the rotation object that is returned by prcomp(), e.g.,

pca <- prcomp(t(x))
pca$rotation

Kevin