
Question

Principal component analysis: PC2 contribute more to variance than PC1?


5.2 years ago

chilifan ▴ 120

I am clustering data using Satija's Seurat package. Link: https://satijalab.org/seurat/pbmc3k_tutorial.html Now I'm preparing a presentation about principal component analysis (PCA). As I understand it, the dimensions that contribute the highest variance to the dataset are used for the first principal component (PC). The second principal component is perpendicular to the first, as explained in this video:

In Seurat they present a heat map of the genes (dimensions) represented in the first component, a heat map of the genes perpendicular to those, and so on. The heat map shows the expression of each gene included in the PC (yellow = high, black = normal and purple = low).

What I'm wondering is: if the first principal component is based on the dimensions with the most "extreme" values, contributing the most to the shape of the dataset, then how come the gene expression in the heatmap of PC1 is less "extreme" than that of the genes in the heatmap of PC2? If there is a bigger difference in gene expression in PC2, then it should by definition be PC1.

The link shows the heatmaps of PC 1-6 from Satija Seurat, PBMC example (sorry, I was not able to embed the picture this time)

PCA principal component analysis seurat • 4.5k views

updated 5.2 years ago by manuel.belmadani ★ 1.3k • written 5.2 years ago by chilifan ▴ 120

score 1 · Answer 1 · 2019-02-14

ProjectPCA scores each gene in the dataset (including genes not included in the PCA) based on their correlation with the calculated components.

So what you're seeing is the correlation of the gene to the PC. A principal component is just a vector, not a matrix.
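If it helps to see the gene-to-PC correlation idea in code, here is a minimal sketch in Python with NumPy. It is not Seurat's ProjectPCA implementation, only an illustration of the idea as described above. The data are synthetic, and the gene indices, the `factor` signal and all variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells = 200

# One hypothetical latent signal shared by some genes.
factor = rng.standard_normal(n_cells)

# 20 made-up genes: 0-9 and 15 follow the factor, the rest are pure noise.
X = rng.standard_normal((n_cells, 20))
driven = list(range(10)) + [15]
X[:, driven] = factor[:, None] + 0.5 * X[:, driven]

# Scale every gene to mean 0 and standard deviation 1.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Run the PCA (via SVD) on genes 0-14 only; genes 15-19 are left out.
U, S, Vt = np.linalg.svd(X[:, :15], full_matrices=False)
pc1_loadings = Vt[0]             # PC1 is a single vector: one weight per gene
cell_scores = U[:, :3] * S[:3]   # position of each cell along PC1-PC3

# Correlate all 20 genes, including the left-out ones, with the cell scores.
n_genes = X.shape[1]
gene_pc_corr = np.corrcoef(X.T, cell_scores.T)[:n_genes, n_genes:]

print(pc1_loadings.shape)                       # shape of the PC1 vector
print(gene_pc_corr.shape)                       # genes x PCs
print(np.round(gene_pc_corr[[15, 19], 0], 2))   # two left-out genes vs PC1
```

Gene 15 was not part of the PCA but follows the same signal, so it still gets a strong correlation with PC1, while the noise-only gene 19 does not. The sign of a correlation is arbitrary because a PC can point in either direction.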

You are right that the PCs are ordered by how much of the variability in the dataset each vector explains. A high correlation with PC2, however, only means that most of that gene's variance is captured by the second principal component; it does not mean that PC2 explains more of the total variance than PC1. You could imagine that the differences between samples/cell types are captured best by a certain component. If you're interested in learning more, I would recommend doing a tutorial using a standard PCA-only function (like prcomp in R), as Seurat seems to take some shortcuts (e.g. computing the correlation matrix instead of letting you do that on your own).
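To make the ordering point concrete, here is a small synthetic sketch in Python with NumPy (not Seurat, and not the PBMC data). It assumes two made-up, independent signals: 30 genes follow the first one loosely and 5 genes follow the second one tightly. All names and numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 500

# Two hypothetical, independent signals (think of two expression programs).
factor_a = rng.standard_normal(n_cells)
factor_b = rng.standard_normal(n_cells)

# 30 genes follow factor A loosely; 5 genes follow factor B tightly.
genes_a = 0.7 * factor_a[:, None] + 0.7 * rng.standard_normal((n_cells, 30))
genes_b = 0.95 * factor_b[:, None] + 0.3 * rng.standard_normal((n_cells, 5))
X = np.hstack([genes_a, genes_b])

# Scale every gene to mean 0 and standard deviation 1.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD; singular values come out sorted from largest to smallest.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
explained = S**2 / np.sum(S**2)   # fraction of total variance per PC
cell_scores = U[:, :2] * S[:2]    # position of each cell along PC1 and PC2

# Absolute correlation of every gene with PC1 and PC2.
n_genes = X.shape[1]
corr = np.abs(np.corrcoef(X.T, cell_scores.T)[:n_genes, n_genes:])
top_pc1 = corr[:, 0].max()   # strongest single-gene correlation with PC1
top_pc2 = corr[:, 1].max()   # strongest single-gene correlation with PC2

print(np.round(explained[:2], 3))
print(round(top_pc1, 2), round(top_pc2, 2))
```

In this setup PC1 explains more of the total variance because many genes share its signal, yet the individual genes that top PC2 track their component more tightly. A heat map of the top PC2 genes can therefore look more "extreme" than the PC1 one without contradicting the ordering of the components.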