Question

PCA for deregulated genes - technical question

7.0 years ago

arronar ▴ 280

Hello.

I'm working on some microarray data, and I have a list of deregulated genes for each treatment. Now I want to run a PCA on these genes to see whether they group by treatment.

The problem is that the lists of expression values I have are not all the same length. For example (values are random):

deregulated_treat1 = c( 2.334, 1.3244, 2.2223, 4.5554, 7.5444, 3.3455 )
deregulated_treat2 = c( 1.674, 1.3534, 3.2253, 4.5554, 7.5444, 8.1115 )
deregulated_treat3 = c( 2.334, 1.3244, 2.2223, 4.5554 )

The first step before running the PCA is to create a matrix that contains all treatments and their values. But some genes are not deregulated in all treatments, and I don't know what value I should put in their place in the matrix. Should I use 0 or NA?

The matrix I'm thinking of building looks like the following

|Gene Name|Treat1|Treat2|Treat3 |
|Gene1    |2.445 |1.533 |7.344  |
|Gene2    |2.445 |0/NA  |1.424  |
|Gene3    |2.445 |2.463 |0/NA   |
|Gene4    |2.445 |5.533 |0/NA   |

Any ideas ?

R PCA microarray • 1.7k views

ADD COMMENT • link updated 7.0 years ago by Giovanni M Dall'Olio 28k • written 7.0 years ago by arronar ▴ 280

May I ask what you mean by expression values, and how you retrieve them for your genes in each treatment? If this is microarray data, every gene should have an intensity or normalized expression value in every sample, so there should be no NAs. A value of 0 is still possible if a gene shows no expression. PCA reduces dimensionality by building new variables that are linear combinations of the original ones, so your PCA is probably not being run on the correct values. If by deregulated you mean differentially expressed genes from the microarray, you should be able to obtain the normalized expression of those genes across all your samples and simply plot them. Otherwise I see this as a problem; others may have more input, but to me it does not make sense.
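
A minimal sketch of this suggestion, assuming a normalized expression matrix `expr` (genes in rows, samples in columns) and per-treatment index vectors of differentially expressed genes. All object names and the simulated values are hypothetical. Taking the union of the DE genes and reading their values for every sample from the full matrix leaves no missing values, so base R's `prcomp` can be applied directly:

```r
set.seed(1)

# Hypothetical normalized expression matrix: 8 genes x 9 samples
# (3 treatments, 3 replicates each); the values are simulated.
expr <- matrix(rnorm(8 * 9, mean = 5), nrow = 8,
               dimnames = list(paste0("Gene", 1:8),
                               paste0(rep(c("Treat1", "Treat2", "Treat3"), each = 3),
                                      "_rep", 1:3)))

# Row indices of the DE genes found for each treatment (illustrative).
de_idx <- list(Treat1 = c(1, 2, 3, 4, 5, 7),
               Treat2 = c(1, 2, 4, 5, 6, 8),
               Treat3 = c(2, 4, 6))

# Union of all DE genes: every gene keeps its value in every sample.
de_union <- sort(unique(unlist(de_idx)))
expr_de  <- expr[de_union, , drop = FALSE]

# PCA with samples as observations, so transpose to samples x genes.
pca <- prcomp(t(expr_de), center = TRUE, scale. = TRUE)

# Scores on the first two components, one row per sample.
head(pca$x[, 1:2])
```

Plotting `pca$x[, 1]` against `pca$x[, 2]`, colored by treatment, would then show whether the samples group by treatment.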

ADD REPLY • link 7.0 years ago by ivivek_ngs ★ 5.2k

The initial microarray data were expression values, which I normalized with the quantile method. I then ran Dunnett's test on them to get the most significantly deregulated (differentially expressed) genes. Let's say that for each treatment I got the following DE genes

deregulated_treat1 = c( 1, 2, 3, 4, 5, 7 )
deregulated_treat2 = c( 1, 2, 4, 5, 6, 8 )
deregulated_treat3 = c( 2, 4, 6 )

The above values are the indexes of the genes (rows) from the initial dataset. The next step is to match these indexes with the initial normalized expression values

deregulated_treat1 = c( 2.334, 1.324, 2.223, 4.555, 7.544, 3.345 )
deregulated_treat2 = c( 1.674, 1.353, 3.225, 4.554, 7.547, 8.111 )
deregulated_treat3 = c( 2.334, 1.324, 2.222 )

and finally construct the corresponding matrix

|Gene Name|Treat1|Treat2|Treat3 |
|Gene1    |2.334 |1.674 |0/NA   |
|Gene2    |1.324 |1.353 |2.334  |
|Gene3    |2.223 |0/NA  |0/NA   |
|Gene4    |4.555 |3.225 |1.324  |
|Gene5    |7.544 |4.554 |0/NA   |
|Gene6    |0/NA  |7.547 |2.222  |
|Gene7    |3.345 |0/NA  |0/NA   |
|Gene8    |0/NA  |8.111 |0/NA   |

And I want to use these data for PCA.
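
The construction described above can be sketched in base R roughly as follows. This assumes a hypothetical matrix `expr` of normalized values with genes in rows and one column per treatment; the values are simulated, and cells for genes that are not DE in a treatment are left as `NA` here purely for illustration:

```r
set.seed(42)

# Hypothetical per-treatment normalized values: 8 genes x 3 treatments.
expr <- matrix(round(runif(24, 1, 9), 3), nrow = 8,
               dimnames = list(paste0("Gene", 1:8),
                               c("Treat1", "Treat2", "Treat3")))

# Row indices of the DE genes for each treatment, as in the example.
de_idx <- list(Treat1 = c(1, 2, 3, 4, 5, 7),
               Treat2 = c(1, 2, 4, 5, 6, 8),
               Treat3 = c(2, 4, 6))

# Start from an all-NA matrix of the same shape, then copy in only
# the values of genes that are DE in the given treatment.
de_mat <- matrix(NA_real_, nrow = nrow(expr), ncol = ncol(expr),
                 dimnames = dimnames(expr))
for (trt in names(de_idx)) {
  rows <- de_idx[[trt]]
  de_mat[rows, trt] <- expr[rows, trt]
}

de_mat
```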

ADD REPLY • link 7.0 years ago by arronar ▴ 280

score 0 · Answer 1 · 2017-05-05

7.0 years ago

Santosh Anand 5.7k

Missing data should be NA. You are looking for PCA with missing data. Check https://bioconductor.org/packages/release/bioc/html/pcaMethods.html

ADD COMMENT • link 7.0 years ago by Santosh Anand 5.7k

Thanks for your reply.

I was reading this conversation on these methods, and I should mention that my matrix contains 6004 NAs out of 8280 values in total, i.e. 72.5% NAs. Also, from what I read, the methods in the pcaMethods package tolerate only around 10-15% missing values. Isn't my percentage too large? I cannot decide which method to use with such data.
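
The proportion quoted above can be checked directly. The two counts are those given in the post; the toy matrix `m` is only a hypothetical stand-in showing how the same figure, and its per-gene and per-treatment breakdown, could be computed from an actual matrix:

```r
# Counts reported in the post.
n_na    <- 6004
n_total <- 8280
pct_na  <- round(100 * n_na / n_total, 1)
pct_na   # 72.5

# On an actual matrix the same quantity is mean(is.na(m)).
m <- matrix(c(2.334, NA, 1.324, NA, NA, 2.222), nrow = 2)  # toy example
overall_na <- mean(is.na(m))      # fraction of missing cells overall
per_gene   <- rowMeans(is.na(m))  # fraction missing per gene (row)
per_trt    <- colMeans(is.na(m))  # fraction missing per treatment (column)
```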

ADD REPLY • link 7.0 years ago by arronar ▴ 280

Two points: 1) I am not sure why you have so many NAs; you need to re-evaluate the design of your experiment. 2) With such sparse data, you may not get meaningful results from any algorithm, even if you manage to run it successfully.

ADD REPLY • link 7.0 years ago by Santosh Anand 5.7k

score 0 · Answer 2 · 2017-05-05

7.0 years ago

Giovanni M Dall'Olio 28k

You can either set them as NA, or impute them.

> library(caret)  
> d = read.table("yourexample.txt")
> d
# A tibble: 5 × 4
  GeneName Treat1 Treat2 Treat3
     <chr>  <dbl>  <dbl>  <dbl>
1    Gene1  2.445  1.533  7.344
2    Gene2  2.445  2.400  1.424
3    Gene3  3.445  2.463     NA
4    Gene4  3.445  2.533  6.000
5    Gene5  4.231  3.123  1.230


> predict(preProcess(d[2:4], "bagImpute"), d )
# A tibble: 5 × 4
  GeneName Treat1 Treat2  Treat3
     <chr>  <dbl>  <dbl>   <dbl>
1    Gene1  2.445  1.533 7.34400
2    Gene2  2.445  2.400 1.42400
3    Gene3  3.445  2.463 3.44595
4    Gene4  3.445  2.533 6.00000
5    Gene5  4.231  3.123 1.23000

In this case the algorithm implemented in "bagImpute" imputed a value of 3.44595.

While you are at it, you can also center and scale the values in the same statement:

> predict(preProcess(d[2:4], c("center", "scale", "bagImpute")), d )
# A tibble: 5 × 4
  GeneName     Treat1      Treat2      Treat3
     <chr>      <dbl>       <dbl>       <dbl>
1    Gene1 -0.9936022 -1.54171116  1.06671155
2    Gene2 -0.9936022 -0.01827421 -0.82144284
3    Gene3  0.3186036  0.09242536  0.07142773
4    Gene4  0.3186036  0.21542488  0.63804947
5    Gene5  1.3499973  1.25213514 -0.88331817

More info on caret: https://topepo.github.io/caret/pre-processing.html

ADD COMMENT • link 7.0 years ago by Giovanni M Dall'Olio 28k

Doesn't imputation here amount to manipulating the data? Is it applicable to the question the OP asked?

ADD REPLY • link 7.0 years ago by ivivek_ngs ★ 5.2k

Hi Giovanni, the OP has extremely sparse data (>70% NAs), and I am not sure imputation can work robustly (or work at all) in this case. Comments?

ADD REPLY • link 7.0 years ago by Santosh Anand 5.7k

I agree with your comments. It seems something is wrong with the experiment; there should not be so many NAs. The best thing to do would be to revise the design.

ADD REPLY • link 6.9 years ago by Giovanni M Dall'Olio 28k