Question

Which tool to use to run a PCA on a microarray dataset to select the most relevant features (preferably in Python)

1

10.0 years ago

fbrundu ▴ 350

Hi all,

I need to run a PCA on a microarray dataset to estimate which features are the most relevant. How can I do it? Is it possible?

I tried with scikit-learn, but I was unable to come up with the relevant genes. I did it like this:

from sklearn.decomposition import PCA
import numpy as np

# X is the matrix transposed (n samples on the rows, m features on the columns) - it is represented as a numpy array
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2], [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]])

# suppose we want the 2 most relevant features
nf = 2
pca = PCA(n_components=nf)
pca.fit(X)

X_proj = pca.transform(X)

With X

array([[-1, -2,  5,  1],
       [-3, -1,  1,  0],
       [-3, -2,  0,  2],
       [ 1,  1,  1,  3],
       [ 2,  1,  1,  4],
       [ 3,  2,  0,  5]])

it returns X_proj

array([[-2.9999967 ,  3.26498171],
       [-3.53939268, -1.18864266],
       [-2.77013188, -2.15637734],
       [ 1.67612209,  0.03059917],
       [ 2.87464655,  0.35674472],
       [ 4.75875261, -0.30730559]])

How can I tell which features were selected? Is there another way to do it (in R, for example)?

Thanks

pca python microarray r features • 5.2k views

ADD COMMENT • link updated 2.6 years ago by Ram 43k • written 10.0 years ago by fbrundu ▴ 350

4

10.0 years ago

brentp 24k

You can do something like that with this tool:

https://github.com/brentp/clinical-components

You are asking which features were selected, but PCA doesn't work like that. It gives you the components with the most variance. You can then take those and correlate them with your clinical or lab data, which is what the above software does.
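As a rough illustration of that last step, here is a minimal NumPy sketch that computes the component scores for the matrix X from the question and correlates each one with a clinical variable. The `clinical` values are made up purely for illustration, and the variable names are assumptions; this is not taken from the clinical-components code.

```python
import numpy as np

# Matrix from the question: 6 samples (rows) x 4 features (columns).
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]], dtype=float)

# Center the features and project the samples onto the first two principal axes.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T  # shape (6, 2): one column per component

# Hypothetical clinical covariate, one value per sample (invented for this sketch).
clinical = np.array([34.0, 41.0, 38.0, 62.0, 59.0, 70.0])

# Pearson correlation of each component with the clinical variable.
component_corr = [float(np.corrcoef(scores[:, k], clinical)[0, 1])
                  for k in range(scores.shape[1])]
print(component_corr)
```

A component that correlates strongly with a clinical variable is a candidate for further inspection; with real data you would also want a significance test, which is left out here.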

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by brentp 24k

0

Thanks for your answer, very useful. I will try it.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by fbrundu ▴ 350

0

I read your code, but unfortunately I have to integrate a smaller module into my own code, so I will have to write a custom routine, sorry. When you say that you get the components with the most variance, can the components be considered pseudo-samples? Thanks anyway for your effort.

ADD REPLY • link 10.0 years ago by fbrundu ▴ 350

Ram · Accepted Answer · 2014-04-25

5

10.0 years ago

Giovanni M Dall'Olio 28k

Hi,

Be careful: I think you are confusing the concept of "components" in PCA.

PCA doesn't return the most relevant "features" in the dataset; rather, it returns two vectors (or more, depending on how many components you calculate) that hold the coordinates of each sample along new axes, the principal components, which are chosen to capture as much of the variance in the data as possible. For a quick explanation of how PCA works, you can read this document or this article in Nature Biotechnology.
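To make the idea of components concrete, here is a minimal NumPy sketch of what PCA computes on the matrix X from the question: center the data, find the directions of largest variance with an SVD, and project the samples onto them. The names `Xc`, `axes` and `scores` are my own choices for illustration.

```python
import numpy as np

# Matrix from the question: 6 samples (rows) x 4 features (columns).
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]], dtype=float)

# 1. Center each feature (column) on its mean.
Xc = X - X.mean(axis=0)

# 2. SVD of the centered matrix; the rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. Keep the first two axes and project the samples onto them.
n_components = 2
axes = Vt[:n_components]   # shape (2, 4): one weight per original feature
scores = Xc @ axes.T       # shape (6, 2): coordinates of each sample

# Fraction of the total variance captured by each kept component.
explained_variance_ratio = (S ** 2 / np.sum(S ** 2))[:n_components]
print(scores)
print(explained_variance_ratio)
```

With its default settings, scikit-learn's `pca.transform(X)` does the same centering and projection, so `scores` is expected to agree with X_proj up to the sign of each column. Note that every component mixes all four original features; none of them is "selected".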

If you want to determine how much each of your original variables contributes to the components, you can either look at the loadings of the PCA (I can't install scikit-learn right now, but they may be inside the pca object), or you can calculate the correlation of each variable with each of the components, as described here. This should give you an idea of the "relevant" features, that is, the features that contribute most to the variation captured by the components.

Another tool for doing PCA with Python is Orange, which gives you both a graphical interface (if you want to use it) and a Python library.
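Both options can be sketched with plain NumPy on the matrix X from the question. The variable names are illustrative assumptions. For reference, a fitted scikit-learn `PCA` object exposes the per-feature weights as its `components_` attribute, which corresponds to `axes` below up to sign.

```python
import numpy as np

# Matrix from the question: 6 samples (rows) x 4 features (columns).
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]], dtype=float)

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
axes = Vt[:2]          # option 1: weight of each feature on each component
scores = Xc @ axes.T   # sample coordinates on the two components

# Option 2: Pearson correlation between each feature and each component.
n_features, n_components = X.shape[1], scores.shape[1]
corr = np.array([[np.corrcoef(X[:, j], scores[:, k])[0, 1]
                  for k in range(n_components)]
                 for j in range(n_features)])  # shape (4, 2)

# Feature indices ordered by absolute weight on the first component.
ranking = np.argsort(-np.abs(axes[0]))
print(axes)
print(corr)
print(ranking)
```

The two views are closely related: the correlation of feature j with component k is the weight scaled by the component's standard deviation and divided by the feature's standard deviation, so the ranking can differ when features have very different variances.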

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by Giovanni M Dall'Olio 28k

0

Hi Giovanni,

Your answer and Brent's are both very useful. I did not understand PCA well before reading the PSU example.

Thanks

ADD REPLY • link 10.0 years ago by fbrundu ▴ 350

0

Just to add: what is the preferred way to correlate a feature (that is, a gene) with a principal component? Can I use Pearson correlation or cosine similarity? Or are there constraints for this data type? Thanks
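One relevant fact for comparing the two measures: the Pearson correlation is the cosine similarity of the mean-centered vectors, so they coincide whenever both vectors have zero mean. A small sketch, using the first feature of X and the first column of X_proj from my question (the helper names `cosine` and `pearson` are hypothetical):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D vectors."""
    return float(np.corrcoef(a, b)[0, 1])

# First feature of X and first column of X_proj, as printed in the question.
gene = np.array([-1.0, -3.0, -3.0, 1.0, 2.0, 3.0])
pc1 = np.array([-2.9999967, -3.53939268, -2.77013188,
                1.67612209, 2.87464655, 4.75875261])

r = pearson(gene, pc1)
cos_centered = cosine(gene - gene.mean(), pc1 - pc1.mean())
cos_raw = cosine(gene, pc1)  # differs from r when a vector is not mean-centered
print(r, cos_centered, cos_raw)
```

Since the component scores already have zero mean, the choice only matters if the gene values are left uncentered.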

ADD REPLY • link 10.0 years ago by fbrundu ▴ 350