PCA Analysis Direction
7.7 years ago
Steve ▴ 10

Hello All,

I am new to PCA and I was hoping someone could help steer me in the right direction. I performed a microarray experiment that profiles global gene expression for a few select cell lines. Now I want to see where these cell lines lie on the gene expression spectrum relative to NK and NKT cells, among a few others. However, I have been exposed only to bench work and would now like to move towards bioinformatics. Does anyone have a suggestion on how to start this analysis, or resources for it?

R gene Microarray Python PCA • 4.0k views

Hi Steve, could you please clarify what you mean by PCA? I assume you are referring to principal component analysis, but I wanted to make sure before writing an answer. (Is it possible to type LaTeX into answers?)

Yes, I am referring to principal component analysis. My apologies for the confusion.

7.7 years ago

This is quite a useful tutorial on using R for PCA: https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/

If you want to stick with R, I definitely recommend ggplot2, and this will put you on the right track: http://rpubs.com/sinhrks/plot_pca
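
As a minimal sketch of what those tutorials walk through (the matrix `expr` below is simulated stand-in data, not a real expression set):

```r
# Minimal PCA sketch with base R and ggplot2. `expr` is a made-up
# samples-by-genes matrix: one row per cell line, one column per gene.
library(ggplot2)

set.seed(1)
expr <- matrix(rnorm(6 * 100), nrow = 6,
               dimnames = list(paste0("cell_line_", 1:6), NULL))

# Center and scale the variables, then run PCA.
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Plot the samples in the space of the first two principal components.
scores <- data.frame(pca$x[, 1:2], sample = rownames(expr))
ggplot(scores, aes(PC1, PC2, label = sample)) +
  geom_point() +
  geom_text(vjust = -1)
```

Samples that cluster together in the PC1/PC2 plane have broadly similar expression profiles, which is exactly the kind of comparison the question asks about.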

But the underlying concept is harder to understand, and there are excellent explanations of it on YouTube.

Oh, and don't write or say "PCA analysis": that's "principal component analysis analysis", as redundant as "PCR reaction", "QTL locus" and "GWAS study" ;)

7.7 years ago
unksci ▴ 180

A very nice short introduction with a practical example can be found here:

http://www.mathworks.com/help/bioinfo/ug/example-analyzing-gene-expression-profiles.html

(While not necessary for understanding the concepts, you could also run it if you have access to MATLAB.)

Is there anything similar to this using R or Python?

7.7 years ago

Principal component analysis (PCA) is a method for finding "directions" in multivariate data that have maximal variance. The first principal component (PC) explains the most variance in the analyzed dataset, the second PC explains the second most, and so on. If you had just two variables and made a scatter plot of them, the first PC would point in the same direction as the major axis of an ellipse fitted around the data, and the second PC would point along its minor axis. In general, you can think of fitting an N-dimensional "ellipse" and looking for the directions of its axes.
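
To make the ellipse picture concrete, here is a small sketch (all data simulated) showing that the first PC of a correlated two-dimensional point cloud points along the long axis of the cloud:

```r
# Simulate a correlated 2-D Gaussian cloud and draw the direction of PC1.
set.seed(42)
n <- 500
x <- rnorm(n)
y <- 0.8 * x + rnorm(n, sd = 0.4)   # y is correlated with x
X <- cbind(x, y)

pca <- prcomp(X, center = TRUE)
pca$rotation[, "PC1"]   # direction of maximal variance (major axis)
pca$sdev^2              # variances along PC1 and PC2

# Plot the cloud and an arrow along the first PC from the data's center.
m <- colMeans(X)
plot(X, asp = 1, pch = 16, col = "grey")
arrows(m[1], m[2],
       m[1] + 2 * pca$rotation[1, 1],
       m[2] + 2 * pca$rotation[2, 1],
       col = "red", lwd = 2)
```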

Admittedly, it is a long way to reach the intuitive explanation above. To really understand what PCA does, one first has to understand what the eigenvalues and eigenvectors of a matrix are. If you have not studied linear algebra for a while, I would suggest watching the excellent lectures by professor Gilbert Strang on MIT OpenCourseWare (http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/video-lectures/).

The eigenvalues of some matrix A are the roots of its characteristic polynomial. Everyone knows how to find the roots of a quadratic equation, so studying 2 x 2 matrices is quite easy. I have written a brief example below.

For a general 2 x 2 matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, the defining equation is

$$A v = \lambda v \quad\Longleftrightarrow\quad (A - \lambda I)\, v = 0,$$

which has a nonzero solution $v$ exactly when

$$\det(A - \lambda I) = \det\begin{pmatrix} a - \lambda & b \\ c & d - \lambda \end{pmatrix} = \lambda^2 - (a + d)\,\lambda + (ad - bc) = 0.$$

Solving this quadratic gives the two eigenvalues $\lambda_1$ and $\lambda_2$.

Once the eigenvalues are known, the corresponding eigenvectors can be found using the first equation.
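
If you want to check such a hand computation numerically, base R's `eigen` function does exactly this; the symmetric matrix below is just an arbitrary example:

```r
# Numerical check of a 2 x 2 eigendecomposition with base R.
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)
e <- eigen(A)
e$values    # roots of the characteristic polynomial: 3 and 1
e$vectors   # corresponding eigenvectors (one per column)

# Verify the defining equation A v = lambda v for the first pair:
A %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1]   # ~ zero vector
```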

So, what is the connection between the eigenvalues and eigenvectors of a matrix and PCA? Since PCA is about finding directions of maximal variance, we should probably be analyzing some special matrix. That special matrix is the covariance matrix of your original dataset. By finding its eigenvalues and eigenvectors, you find the principal components: each eigenvector gives the direction of a PC, and the corresponding eigenvalue gives the variance explained along it.
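
This connection is easy to verify numerically. In the sketch below (simulated data again), the eigendecomposition of the covariance matrix reproduces what `prcomp` reports:

```r
# PCA by hand: the eigenvectors of cov(X) are the principal components.
set.seed(7)
X  <- matrix(rnorm(200 * 5), ncol = 5)
Xc <- scale(X, center = TRUE, scale = FALSE)   # center the columns

e <- eigen(cov(Xc))
p <- prcomp(Xc, center = FALSE)

e$values   # eigenvalues = variances along each PC
p$sdev^2   # identical, up to floating-point error

# Eigenvector signs are arbitrary, so compare absolute values:
max(abs(abs(e$vectors) - abs(p$rotation)))   # ~ 0
```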

Finally, I should probably add a few words about why PCA is an interesting technique. Suppose you have a multivariate dataset consisting of hundreds or thousands of variables. Is all of that information relevant? Probably not. So, one could keep the first k principal components (of N), which together explain almost all of the variance in the dataset (e.g. 95% or 99%), and discard the remaining N - k components as noise. This procedure (known as dimension reduction) works with some data but not all; sometimes the low-variance components can be interesting themselves. Either way, you get a new view of the data you have at hand.
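
A common way to pick k in practice is the cumulative proportion of variance explained; a short sketch (the 95% threshold is just illustrative):

```r
# Choosing how many PCs to keep by cumulative variance explained.
set.seed(3)
X <- matrix(rnorm(100 * 20), ncol = 20)
pca <- prcomp(X, center = TRUE, scale. = TRUE)

var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

k <- which(cum_var >= 0.95)[1]   # smallest k that reaches 95%
reduced <- pca$x[, 1:k]          # data projected onto the first k PCs
```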
