Compute the correlations between submatrices
5.1 years ago
pablo ▴ 300

Hi guys,

Related to my previous post: How to get a half correlation matrix?

In R, I need to compute the correlations between the columns of my matrix, which has more than 800,000 rows. I have decided to split this matrix into submatrices (60,000 rows each) and to compute the pairwise correlations between these submatrices.

I use SLURM. I want to distribute the computation of the correlation between each pair of submatrices to a node of the cluster, in order to parallelize.

For the moment, I have created an argument that specifies the number of columns of the main matrix to compute.

For example, with the command data=data[, 1:opt$subset] (in R), I can compute the correlations from my 1st column to the 10,000th. For that, I have set up job arrays in my SLURM code: subset=$((SLURM_ARRAY_TASK_ID*10000)). I have defined 10 array tasks, so the first one will compute from the 1st to the 1*10000 = 10,000th column, the second from the 1st to the 2*10000 = 20,000th column, and so on.

With the argument data=data[, (as.numeric(opt$subset)-4999):opt$subset], I can compute the correlations within a block/submatrix with a defined number of columns. For example, if I want to create blocks of 5,000 columns, I set up the argument as above and, in SLURM, my array tasks use subset=$((SLURM_ARRAY_TASK_ID*5000)). So my first block will run from the (1*5000) - 4999 = 1st column to the 1*5000 = 5,000th column, the second block from the (2*5000) - 4999 = 5,001st column to the 2*5000 = 10,000th column, and so on.

My problem is this: the correlations are computed within these blocks independently. What I want is to compute the correlations between all of these blocks, i.e. the pairwise correlations between all block pairs:

      [,1] [,2]
 [1,]    1    1
 [2,]    1    2
 [3,]    1    3
 [4,]    1    4
 [5,]    1    5
 [6,]    1    6
 ...

up to block 6 vs block 6.

Any advice?

Cheers

R CODE (Kevin Blighe gave me the foreach function)

# load packages
library(compositions)
library(parallel)
library(doParallel)
library(optparse)

args <- commandArgs(trailingOnly = FALSE)

# get options

option_list = list(
        make_option(c("-s", "--subset"), type="character", default=NULL, help="Index of the last column of the subset to compute")
);

opt_parser = OptionParser(usage = "Usage: %prog -s SUBSET", option_list=option_list, description= "Description:")

opt = parse_args(opt_parser)

# main code

print('Set up cores')

cores <- 32
options('mc.cores' = cores)
registerDoParallel(cores)

print('Load matrix')

# read the OTU table; +1 adds a pseudocount to every entry
data <- read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rares.tsv", sep="\t", h=T, row.names=1)+1

# transpose so that the OTUs become the columns
data = t(data)

## THIS IS MY ARGUMENT ##

# block of 5,000 columns ending at column opt$subset:
#data = data[, (as.numeric(opt$subset)-4999):as.numeric(opt$subset)]
# columns 1 to opt$subset:
data = data[, 1:as.numeric(opt$subset)]


# for each column i, compute rho(x_i, x_j) = 1 - var(x_i - x_j) / (var(x_i) + var(x_j))
# against every column j, and rbind the rows into one result matrix
res <- foreach(i = seq_len(ncol(data)),
 .combine = rbind,
 .multicombine = TRUE,
 .inorder = FALSE,
 .packages = c('data.table', 'doParallel')) %dopar% {
 if ((i %% 1000) == 0) {
  print(i)  # progress marker every 1,000 columns
 }
 apply(data, 2, function(x) 1 - (var(data[, i] - x) / (var(data[, i]) + var(x))))
}
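
A minimal sketch of the missing inter-block step (not part of the original script: it assumes the same rho measure, hypothetical names n_blocks, block_size and pairs, and reuses --subset as a pair index in place of the column subsetting above, with SLURM passing $SLURM_ARRAY_TASK_ID directly):

# Hypothetical sketch: map one array-task index to a pair of blocks and
# compute rho between the two blocks. Since var(x-y) = var(x)+var(y)-2*cov(x,y),
# rho(x,y) = 1 - var(x-y)/(var(x)+var(y)) = 2*cov(x,y)/(var(x)+var(y)).
n_blocks   <- 6
block_size <- 5000

pairs <- rbind(cbind(1:n_blocks, 1:n_blocks),  # intra-block pairs (i, i)
               t(combn(n_blocks, 2)))          # inter-block pairs (i < j)

pair_id <- as.numeric(opt$subset)              # reuse --subset as the pair index
i <- pairs[pair_id, 1]
j <- pairs[pair_id, 2]

block_i <- data[, ((i - 1) * block_size + 1):(i * block_size)]
block_j <- data[, ((j - 1) * block_size + 1):(j * block_size)]

v_i <- apply(block_i, 2, var)
v_j <- apply(block_j, 2, var)
res <- 2 * cov(block_i, block_j) / outer(v_i, v_j, `+`)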

SLURM CODE

#!/bin/bash
#SBATCH --nodes=1
#SBATCH -o slurmjob-%A-%a.out
#SBATCH --job-name=rho_blocks_5k
#SBATCH --mail-user vincentpailler@hotmail.fr
#SBATCH --partition=normal
#SBATCH --time=1-00:00:00
#SBATCH --mem=250G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --array=1-10

echo "job array number $SLURM_ARRAY_JOB_ID, indices from $SLURM_ARRAY_TASK_MIN to $SLURM_ARRAY_TASK_MAX"

echo $SLURM_ARRAY_TASK_ID

# Load the modules we need to run with

module load gcc/8.1.0 openblas/0.3.3 R

# SET UP DIRECTORIES

OUTPUT="$HOME"/PROJET_M2/bin/propr/$(date +"%Y%m%d")_parallel_blocks_32cpus_5000
mkdir -p "$OUTPUT"

export FILENAME=/home/vipailler/PROJET_M2/bin/coefficient_rho.R

subset=$((SLURM_ARRAY_TASK_ID*10000))

#Run the program

echo "Start job :"`date` >> "$OUTPUT"/temp_"$SLURM_ARRAY_TASK_ID".txt
echo "Start job :"`date`

Rscript $FILENAME --subset $subset  > "$OUTPUT"/"$SLURM_ARRAY_TASK_ID"

echo "Stop job : "`date` >> "$OUTPUT"/temp_"$SLURM_ARRAY_TASK_ID".txt
echo "Stop job : "`date`

The output I get looks like this:

            OTU0001     OTU0004    OTU0014    OTU0016    OTU0017      OTU0027
OTU0001  1.00000000  0.96688301 0.80621218 0.16754758 0.40818524  0.155976198
OTU0004  0.96688301  1.00000000 0.81330915 0.18928670 0.43247749  0.187540302
OTU0014  0.80621218  0.81330915 1.00000000 0.23753965 0.57237416  0.222890740
OTU0016  0.16754758  0.18928670 0.23753965 1.00000000 0.64007329  0.775772234
OTU0017  0.40818524  0.43247749 0.57237416 0.64007329 1.00000000  0.445145905
OTU0027  0.15597620  0.18754030 0.22289074 0.77577223 0.44514590  1.000000000
...

Afterwards, I rearrange my output with:

Df<-data.frame(var1=rownames(res)[row(res)[upper.tri(res)]],
        var2=colnames(res)[col(res)[upper.tri(res)]],
        corr=res[upper.tri(res)])

in order to get:

    var1    var2          corr
1   OTU0001 OTU0004  0.9668830120
2   OTU0001 OTU0014  0.8062121821
3   OTU0004 OTU0014  0.8133091522
4   OTU0001 OTU0016  0.1675475819
5   OTU0004 OTU0016  0.1892866996
6   OTU0014 OTU0016  0.2375396470
7   OTU0001 OTU0017  0.4081852433
8   OTU0004 OTU0017  0.4324774863
...
ADD COMMENT

I can only speak for myself, but you should really consider making your question shorter and more on-point. I would not be motivated to go through a question that long to understand what the issue is.

ADD REPLY

OK, sorry, I was thinking it was too long as well.

I have a big matrix with more than 800,000 columns. I need to compute the correlations between these columns. There are too many to compute all at once, so I have decided to split this big matrix into smaller submatrices (60,000 columns each).

So, I need to compute the correlations of these submatrices against each other, but I am stuck. The R function I have can only compute the correlations on a single submatrix. I need to edit my R code, but I don't really know how to do it.

ADD REPLY

I have created a small subset of the data (100 columns) to work on.

ncol <- ncol(data)

rest   <- ncol %% 10
blocks <- ncol %/% 10

# creation of the blocks (10 blocks of 10 columns each here)

ngroup <- rep(1:blocks, each = 10)
split  <- split(1:ncol, ngroup)

# creation of the pairwise combinations between the groups

combs <- expand.grid(1:length(split), 1:length(split))
combs <- t(apply(combs, 1, sort))
combs <- unique(combs)

      [,1] [,2]
 [1,]    1    1
 [2,]    1    2
 [3,]    1    3
 [4,]    1    4
 [5,]    1    5
 [6,]    1    6
 ...

I don't know if this is a good start towards a solution.

ADD REPLY

You will get more joy on Stack Overflow with a strictly R-programming question. Plus, I thought you had 800k rows, not 800k columns. Is your input a square matrix? If not, precisely what are its dimensions?

ADD REPLY

I asked about my problem on Stack Overflow but I haven't got any answer back yet.

Actually, my starting matrix has 800,000 rows (OTUs) and 40 columns (samples), but I use data=t(data) to work on the columns.

ADD REPLY

OK. I still don't know the answer to the thing that is puzzling me, so I'll ask it a different way. What is the expected dimension of the correlation matrix: (800k * 800k) or (40 * 40)?

ADD REPLY

The expected dimension of the correlation matrix is 800k*800k. That's why I want to split my starting matrix into submatrices.

ADD REPLY

How are you going to store the data?

M <- matrix(rep(0, 64 * 10^10), nrow=8*10^5)

Error: cannot allocate vector of size 4768.4 Gb

ADD REPLY

Actually, I work on a cluster with several nodes. My goal is to split my main matrix into several submatrices (with 60,000 columns each, for example). Then, I want to compute the correlations between the submatrices by distributing the calculations across the nodes (thanks to parallelization).

According to this distribution:

      [,1] [,2]
 [1,]    1    1    # node 1
 [2,]    1    2    # node 2
 [3,]    1    3    # ...
 [4,]    1    4
 [5,]    1    5
 [6,]    1    6

I will arrange each output in order to get a file like this:

      var1    var2       corr
1   OTU0001 OTU0004  0.9668830120
2   OTU0001 OTU0014  0.8062121821
3   OTU0004 OTU0014  0.8133091522
4   OTU0001 OTU0016  0.1675475819
5   OTU0004 OTU0016  0.1892866996

Finally, I will gather all of these outputs together in order to get a single file containing every unique pairwise correlation between the OTUs.

ADD REPLY

There seems to be a lot of overengineering going on here. Can we just fix your R code? First, you need a function that can run the correlation of one set of your OTUs against all of the rest of the OTUs and return a (var1, var2, corr) data.frame where var1 < var2. Use cor(A[, my_indices], A) with as.data.frame.table and dplyr::rename/filter. Don't roll your own correlation function.
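
A minimal sketch of that suggestion (a hypothetical illustration, assuming Pearson correlation; A and my_indices are placeholder names):

# correlate one set of columns against all columns and return long format
library(dplyr)

cor_long <- function(A, my_indices) {
  cc <- cor(A[, my_indices], A)                        # rectangular correlation matrix
  df <- as.data.frame.table(cc, responseName = "corr") # melt to (Var1, Var2, corr)
  df %>%
    rename(var1 = Var1, var2 = Var2) %>%
    filter(as.character(var1) < as.character(var2))    # keep each pair once
}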

ADD REPLY

Do you think it is really tricky to edit my own correlation function? It works really well for the number of columns I have chosen. If I edit it by creating a nested loop, could it work?

res <- foreach(i = seq_len(ncol(data)),
 .combine = rbind,
 .multicombine = TRUE,
 .inorder = FALSE,
 .packages = c('data.table', 'doParallel')) %dopar% {
 apply(data, 2, function(x) 1 - ((var(data[,i] - x)) / (var(data[,i]) + var(x))))
}
ADD REPLY

But why are you using a vector-by-vector implementation of cor that's slow and untested, when there is a perfectly good matrix by matrix version?

ADD REPLY

I found this coefficient in the publication "propr: An R-package for Identifying Proportionally Abundant Features Using Compositional Data Analysis". I thought it was possible to edit my apply function in order to get what I want.

And I am using it to practice R, because I am pretty new to this language.

ADD REPLY

Aaaahhhh, you're not doing correlation. I thought it looked strange.

ADD REPLY

Yes, sorry for the misunderstanding. I say "correlation" through a misuse of language.

But is my problem clearer now?

ADD REPLY

Not to me. I don't understand why you've written all this cluster code before you've tested that your R code works. Where do you specify the matrix that you are doing your proportionality test against?

ADD REPLY

I tested my R code on a small subset of my matrix (10 rows and 40 columns), and it works. I got a 40*40 matrix with the right results, so the code below is good:

data <- read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rares.tsv", sep="\t", h=T, row.names=1)+1

data = t(data)
data = data[, 1:40]
data = data[1:10, ]

cores <- 32
options('mc.cores' = cores)
registerDoParallel(cores)

res <- foreach(i = seq_len(ncol(data)),
 .combine = rbind,
 .multicombine = TRUE,
 .inorder = FALSE,
 .packages = c('data.table', 'doParallel')) %dopar% {
 apply(data, 2, function(x) 1 - (var(data[, i] - x) / (var(data[, i]) + var(x))))
}
ADD REPLY

Yes, but your problem is that you aren't getting the inter-block proportionality results, isn't it? How does this test that?

ADD REPLY

Yes, that's right. I just want to compute the proportionality between each pair of blocks (block 1 vs block 1, block 1 vs block 2, ...) and gather all the outputs. The problem is that my code computes the proportionality only on the first columns of my data file (up to a number of columns set by my computational power). So, in effect, I execute my code on only one block (the data file itself).

And I really don't know how I could modify my code to get the inter-block proportionality results. I guess I can split my data file into many submatrices, but that's as far as I've got for the moment.

ADD REPLY

No it isn't. There's more than one way to split your data up. You can use one set of column indices to pull out block_a and another to pull out block_b. Then iterate over the cols of block a and over the cols of block b. Whatever you do, write a function to do it and test that function.
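
For instance, a minimal sketch along those lines (a hypothetical illustration using the rho measure from the script above; rho_blocks, block_a and block_b are placeholder names):

# rho between every column of block_a and every column of block_b
rho_blocks <- function(block_a, block_b) {
  sapply(seq_len(ncol(block_b)), function(j) {
    y <- block_b[, j]
    apply(block_a, 2, function(x) 1 - var(x - y) / (var(x) + var(y)))
  })
}

# test it on a toy matrix: rho_blocks(m, m) should reproduce the one-block result
m <- matrix(rnorm(400), nrow = 10)
all.equal(rho_blocks(m, m)[1, 2],
          1 - var(m[, 1] - m[, 2]) / (var(m[, 1]) + var(m[, 2])))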

ADD REPLY

I found something. I created two small matrices (which could correspond to two blocks): matrix1 and matrix2. Here it is a test of correlation between the rows of the two matrices (it doesn't matter whether I work on the rows or the columns; my goal is really to compute the correlation/proportionality between my blocks). I can compute the pairwise correlations between matrix1 vs matrix1, matrix1 vs matrix2, and matrix2 vs matrix2:

cor_blocks <- function(a, b) {
  M    <- rbind(a, b)
  M    <- M - rowMeans(M)             # centre each row
  M    <- M / sqrt(rowSums(M^2))      # scale each row to unit norm
  CR   <- tcrossprod(M)               # full row-by-row correlation matrix
  edge <- dim(a)[1]
  CR   <- t(CR[1:edge, -c(1:edge)])   # keep only the a-vs-b block
  return(CR)
}
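
A quick hypothetical check of that function against R's own cor() (a and b are toy blocks; note the transposes, since the function correlates rows):

a <- matrix(rnorm(50), nrow = 5)
b <- matrix(rnorm(50), nrow = 5)
all.equal(cor_blocks(a, b), cor(t(b), t(a)))   # rows of b vs rows of a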
ADD REPLY
