Combining gene expression data into a principal components analysis - how to deal with missing data?
0
0
Entering edit mode
7.1 years ago
Tom ▴ 40

Hi all,

I have gene expression for say 9 different samples, all the same age, for say 3 different tissues (see table below, I have three samples from hippocampus, three from cerebellum and three from blood). I want to put these into a PCA plot to see if the gene expression for similar tissues clumps together in the plot.

The problem is that I have overlapping, but not completely overlapping, gene sets between the samples.

So essentially I have a table like this (Hipp= hippocampus, Cer = Cerebellum, Bloo = Blood);

          Hipp1     Hipp2    Hipp3   Cer1   Cer2   Cer3   Bloo1   Bloo2   Bloo3
Gene1      0.01     0.01     0.02    0.01   0.03   0.05   1.0     1.2     1.3
Gene2      -        -        -       0.1    0.2    0.3    1.2     1.4     1.5 
Gene3      0.54     0.4      0.4     0.4    0.4    0.5    0.5     1.2     3.4
Gene4      1.2      1.3      4.4     1.3    3.4    3.1    -       -       -        
Gene5      -        -        -       -      -      -      0.3    0.3      0.2

So for example here, since hippocampus and cerebellum are both parts of the brain, and the blood is different, and all the samples are the same age, I want to know if the brain samples cluster together on a PCA plot more than say hippocampus and blood. But here, I don't have expression data for gene 2 for the hippocampus, and I don't have data for gene 4 for blood, or gene 5 I have no data for the brain.

So I'm wondering if there is a package where I can read in the above table, and it will give me back a PCA plot of the genes, considering that there is some missing data. I do not have the option to remove genes with some missing data, because in reality, I have >10,000 samples and 20,000 genes, so there is at least one missing value for all the genes (obviously, I will consider filtering the genes with almost exclusively missing data).

I'm sorry that I have no code to show, I don't know where to start/what package to use so this is just a thought experiment at the minut, will post the code if I can get something working.

pca missing data gene expression r • 2.1k views
ADD COMMENT
0
Entering edit mode

Since numbers are read counts, why not go with missing == 0 counts?

ADD REPLY
0
Entering edit mode

The numbers are log2 transformed microarray data (actually in this example I just want to show people what I mean so the numbers are made up), but I can't set them to 0, since maybe there is a high expression for that gene, I just don't have the info for it?

ADD REPLY
0
Entering edit mode

So I suppose I'm wondering if there's a PCA package in either python or R, where I can read in the above table, and then there's a parameter that says something like "set missing data to NA"", and then before inputting my table into the package, I would just change all the "-" to NA.

ADD REPLY
0
Entering edit mode

I appreciate the help, but maybe the answer is not as trivial as I thought, i've just found this, so maybe using PCA with explicitly defined missing data is not possible, and I have to somehow impute the data.

ADD REPLY
1
Entering edit mode

Missing data is not an issue specific to PCA. However, the Bioconductor package pcaMethods implements various algorithms to deal with missing data.

ADD REPLY
0
Entering edit mode

since maybe there is a high expression for that gene, I just don't have the info for it?

Wait, how did you get the data then?

ADD REPLY
0
Entering edit mode

because it's a combination of a couple of three different data sets (i.e. two brain parts and blood) that are comparable. Thank you for the help, I appreciate it.

ADD REPLY
0
Entering edit mode

But if the RNA-seq (I suppose) was performed in all three tissues then a highly expressed gene would be picked up, no?

ADD REPLY

Login before adding your answer.

Traffic: 1464 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6