How to get a half correlation matrix?
5.1 years ago
pablo ▴ 300

Hi guys,

I would like to know whether it is possible to compute only half of a correlation matrix. I mean, I don't want to compute the whole matrix and then extract just the upper or the lower half; I want to compute the half matrix directly.

I use R. I already have some code, but it computes the whole matrix:

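library(doParallel)   # also attaches foreach and parallel
registerDoParallel()  # register a parallel backend; without one, %dopar% runs sequentially with a warning

# Read the table (first column as row names) and add 1 to every value
data <- read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rares.tsv", sep = "\t", header = TRUE, row.names = 1) + 1

# For each column i, compute 1 - var(x_i - x_j) / (var(x_i) + var(x_j)) against every column j.
# Note: with .inorder = FALSE the rows of res are not guaranteed to follow the column order.
res <- foreach(i = seq_len(ncol(data)),
 .combine = rbind,
 .multicombine = TRUE,
 .inorder = FALSE,
 .packages = c('data.table', 'doParallel')) %dopar% {
 apply(data, 2, function(x) 1 - ((var(data[,i] - x)) / (var(data[,i]) + var(x))))
}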

Thanks

correlation matrix R • 3.8k views

For the record, this code is mine, from my previous answer:


I read some documentation about the functions upper.tri and lower.tri, but I don't know whether they would be suitable for this code.


Yes, I know what you are aiming to do. A data-frame/matrix is 'rectangular' in structure, though. If you just fill the upper part, you still have to have the bottom part, even if the cells are empty.
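
To illustrate the point about the rectangular structure, here is a minimal sketch on made-up data (the matrix `toy` and its dimensions are hypothetical, not taken from the question). It computes the full matrix with `cor()` and then blanks the lower triangle using `lower.tri()`. The result is still a full 5 x 5 matrix, only with `NA` in the unused cells, so neither memory nor compute time is saved this way.

```r
set.seed(1)
toy <- matrix(rnorm(50), nrow = 10, ncol = 5)  # hypothetical data: 10 samples, 5 variables

full <- cor(toy)                 # full symmetric 5 x 5 correlation matrix
half <- full
half[lower.tri(half)] <- NA      # blank everything strictly below the diagonal

dim(half)                        # still 5 x 5: the empty cells are still stored
sum(!is.na(half))                # 5 * 6 / 2 = 15 cells carry a value
```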


My goal here is to save time. So, if I fill the upper or the lower part with 0 and compute the correlations only on the other part, will I save time?


You may just have to be patient. You could add a line such as this to your foreach loop:

if ((i %% 100) == 0) {
  print(i)
}

This prints the value of i every 100 iterations. You may see, for example, 300 before 200, because with parallel processing one core can finish before another.


Correlations for what?


These are correlations between OTUs.


You need to provide more information. Which language are you using (R, Python, etc.)?


put in the original question


The answer you get is at most as good as the question you ask:

awk 'BEGIN{FS="\t";print 1.0"\n"0.9,1.0"\n"0.5,0.8,1.0}'
1
0.9 1
0.5 0.8 1

Sorry for the lack of information. I answered above.


Could you show us how your data is formatted? Thanks.

5.1 years ago

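A simple double for loop. The trick is to use the index of the first loop as the starting point of the second loop:

# Preallocate the full square matrix; only the upper triangle and diagonal are filled
res <- matrix(NA, ncol(data), ncol(data))
for (i in seq_len(ncol(data))) {
  for (j in i:ncol(data)) {  # start at i to skip the lower triangle
    res[i, j] <- cor(data[, i], data[, j])
  }
}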
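
As a quick check, and assuming a small made-up matrix in place of the real `data` (the name `toy` is hypothetical), the loop above can be compared against `cor()`, and the lower triangle can be recovered by mirroring, since cor(x, y) equals cor(y, x):

```r
set.seed(42)
toy <- matrix(rnorm(60), nrow = 12, ncol = 5)  # hypothetical stand-in for `data`

n <- ncol(toy)
res <- matrix(NA_real_, n, n)
for (i in seq_len(n)) {
  for (j in i:n) {               # upper triangle and diagonal only
    res[i, j] <- cor(toy[, i], toy[, j])
  }
}

# The filled cells agree with the upper triangle of the full matrix
ref <- cor(toy)
all.equal(res[upper.tri(res, diag = TRUE)], ref[upper.tri(ref, diag = TRUE)])

# Mirror the upper triangle into the lower one to get the full matrix back
full <- res
full[lower.tri(full)] <- t(res)[lower.tri(res)]
all.equal(full, ref)
```

The loop visits n(n+1)/2 pairs instead of n^2, so it does roughly half the work, but the preallocated `res` still has n x n cells.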

Thanks for your answer. But in your code, you need to create a matrix beforehand, right? res <- matrix(NA,ncol(data),ncol(data))

I was asking whether it is possible to compute the upper or the lower triangle directly on my initial matrix, without creating another one first.


Yes, you create a matrix, but you only compute the upper triangle.


I know what you mean. But do you also mean that it is impossible to compute the upper or the lower triangle directly on my first matrix? Do I necessarily need to create another matrix with the dimensions of my first matrix (res <- matrix(NA,ncol(data),ncol(data))), then compute the correlations and store them in the matrix I created? That seems like a waste of time.

My goal was to compute the upper triangle of my matrix (using the code that Kevin Blighe gave me) without creating another matrix, in order to save time.


There is no "upper-triangle" data structure in R, to my knowledge. What is the size of your correlation matrix?
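
One workaround, sketched here under the assumption that a plain numeric vector is acceptable as storage, is to keep only the n(n-1)/2 values above the diagonal and map each pair (i, j) with i < j to a position in that vector. The helper `tri_index` and the toy data are hypothetical names introduced for illustration. Base R's `dist` objects use a similar packed layout for the lower triangle.

```r
set.seed(7)
toy <- matrix(rnorm(40), nrow = 8, ncol = 5)   # hypothetical data: 5 variables
n <- ncol(toy)

# Position of pair (i, j), i < j, when pairs are ordered (1,2), (1,3), ..., (1,n), (2,3), ...
tri_index <- function(i, j, n) (i - 1) * n - i * (i - 1) / 2 + (j - i)

packed <- numeric(n * (n - 1) / 2)             # 10 values instead of 25
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    packed[tri_index(i, j, n)] <- cor(toy[, i], toy[, j])
  }
}

# Look up a single correlation without ever building the square matrix
packed[tri_index(2, 4, n)]
```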


Actually, my correlation matrix is huge (up to 5 TB). My input file has ~800,000 rows. I can't run my code on this input file because it requires too much RAM. I will split the input file into several submatrices and then compute the correlations on these submatrices.
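
The size quoted above is consistent with a quick back-of-the-envelope calculation, assuming one correlation per pair of the ~800,000 rows and 8 bytes per value (R's double precision). The variable names below are illustrative only.

```r
n <- 800000                          # number of variables, taken from the ~800,000 rows
bytes_per_value <- 8                 # R stores numeric values as 8-byte doubles

full_tb <- n^2 * bytes_per_value / 1e12              # full square matrix
half_tb <- n * (n - 1) / 2 * bytes_per_value / 1e12  # values above the diagonal only

full_tb   # 5.12
half_tb   # about 2.56
```

Even the half matrix would need about 2.56 TB under these assumptions, so storing only one triangle does not by itself make the problem fit in RAM.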


As Nicolas said, there is no data structure in R to do this. However, you could simply output each line to a separate list entry, like this:

[[1]]
x x x x x x x x
[[2]]
x x x x x x x 
[[3]]
x x x x x x
[[4]]
x x x x x
[[5]]
x x x x
[[6]]
x x x
[[7]]
x x
[[8]]
x

You should be able to code for this.
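
A minimal sketch of this ragged-list idea, using made-up data (`toy` is a hypothetical stand-in for the real matrix): entry i holds the correlations of column i with columns i, i+1, ..., n, so each entry is one element shorter than the previous one.

```r
set.seed(3)
toy <- matrix(rnorm(80), nrow = 10, ncol = 8)  # hypothetical data: 8 variables
n <- ncol(toy)

# Entry i: correlations between column i and columns i..n, flattened to a plain vector
tri_list <- lapply(seq_len(n), function(i) {
  as.vector(cor(toy[, i], toy[, i:n, drop = FALSE]))
})

lengths(tri_list)      # 8 7 6 5 4 3 2 1
tri_list[[3]][2]       # correlation between columns 3 and 4
```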


Thanks for your answers. I will stay with my first matrix from which I will extract the correlations I want.


Okay, sure. See you soon.


I split my matrix into 10 submatrices of 80,000 rows each, computed the correlations within each of these 10 submatrices, and then joined the results.

Does that look fine?


How will the correlations between the submatrices be computed?


I will use a "double loop", I mean:
- between the first submatrix and the 9 others
- between the second one and the 8 others...

Is this a good way to proceed?


Yes, performing the correlation separately on subsets of the data is one way to do this. I believe this is how bigcor does it. Just make sure that you validate your results, and check them more than 5 times in different ways. Everyone makes mistakes.


Starting from the code you gave me, do you think it is possible to add a few lines to compute the correlations between the submatrices? Or do I need to write a new function?


I tried something (based on the bigcor() code):

> dim(data)
[1]   40 1087

# My last block will contain columns 1001 to 1087

y <- NULL  # no second matrix here (y is an optional argument in bigcor)

ncol <- ncol(data)
if (!is.null(y)) ycol <- ncol(y)

rest <- ncol %% 100     # 100 columns per block; 87 columns remain for the last block
large <- ncol - rest
blocks <- ncol %/% 100  # 10 full blocks (+ 1 block with 87 columns)

# Assign the columns to blocks

group <- rep(1:blocks, each = 100)
if (rest > 0) group <- c(group, rep(blocks + 1, rest))
split <- split(1:ncol, group)

# This gives all the pairwise comparisons between the blocks

combs <- expand.grid(1:length(split), 1:length(split))
combs <- t(apply(combs, 1, sort))
combs <- unique(combs)
if (!is.null(y)) combs <- cbind(1:length(split), rep(1, length(split)))

      [,1] [,2]
  [1,]    1    1
  [2,]    1    2
  [3,]    1    3
  [4,]    1    4
  [5,]    1    5
  [6,]    1    6
....

But after that, I don't know how to create the nested loop to compute the correlations between these blocks.
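
One possible way to finish this, sketched under several assumptions: Pearson correlation via `cor()` stands in for whatever measure is actually wanted, the result is small enough to hold in an ordinary in-memory matrix, and `toy`, `block_size`, `groups`, and `res` are hypothetical names. Each row of `combs` names a pair of blocks, so a single loop over those rows replaces the nested loop: it computes the correlations between the columns of the two blocks and writes them into the matching cells, mirroring them across the diagonal.

```r
set.seed(11)
toy <- matrix(rnorm(40 * 23), nrow = 40, ncol = 23)  # hypothetical data: 23 columns
block_size <- 10                                     # the post uses 100; smaller here

n <- ncol(toy)
group <- ceiling(seq_len(n) / block_size)            # block id of every column
groups <- split(seq_len(n), group)                   # column indices per block (10, 10, 3)

# All unordered block pairs, including each block paired with itself
combs <- expand.grid(seq_along(groups), seq_along(groups))
combs <- unique(t(apply(combs, 1, sort)))

res <- matrix(NA_real_, n, n)
for (k in seq_len(nrow(combs))) {
  cols_a <- groups[[combs[k, 1]]]
  cols_b <- groups[[combs[k, 2]]]
  block <- cor(toy[, cols_a, drop = FALSE], toy[, cols_b, drop = FALSE])
  res[cols_a, cols_b] <- block                       # block (a, b)
  res[cols_b, cols_a] <- t(block)                    # mirrored block (b, a)
}

# Validate against the direct computation, which is feasible at this size
all.equal(res, cor(toy))
```

With the 1087 columns from the post and a block size of 100, the same loop would visit 11 * 12 / 2 = 66 block pairs. For a matrix too large for RAM, each `block` would have to be written to disk rather than into `res`.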
