load required packages

Question

Parallelization to compute correlations

3

Entering edit mode

5.1 years ago

pablo ▴ 300

Hello guys,

I want to compute correlations between OTUs (from metagenomic sequencing) . I am working with R and I would like to parallelize the computations because my data set is big.

What I want to do is maybe simple. I work on a cluster with Slurm (I have many CPUs and a lot of memory). I would like to cut my data set into subsets (by incrementing of 500 OTUs for example), and then, compute the correlations between each OTUs of the different subsets.

The main goal of this is to set a number of CPUs and RAM (for example 32 CPUs with 260 Go of RAM), start the computations of correlation between my subsets (500 OTUs, 1000 OTUs, 1500 OTUs... until 5000 OTUs) and look at the scalability. I will change the number of CPUs and the amount of RAM.

At the end, I will conclude by : "with 32 CPUs and 260 Go RAM, I can compute the correlation between X OTUs", "with 124 CPUs and 1To RAM, I can compute the correlation between X OTUs" ...

My problem is that I don't really know how to organize my code, and how to use the package mclapply() to parallelize.

library(optparse)

args <- commandArgs(trailingOnly = F)

# get options
option_list = list(
  make_option(c("-s", "--subset"), type="character", default=NULL,
              help="Input file matrix ")
);
opt_parser = OptionParser(usage = "Usage: %prog -f [FILE]",option_list=option_list,
                          description= "Description:")
opt = parse_args(opt_parser)


library(parallel)
library(doParallel)
library(compositional)

data=read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rares.tsv", row.names=1, h=T, sep="\t")

data=clr(data) #log_ratio transformation
data=data[1:opt$subset,]

data=t(data)


cores<-32
options('mc.cores'=cores)
registerDoParallel(cores)

parallel=mclapply(cor(data, c='spearman')) 

save(parallel, file=sprintf("/home/vipailler/PROJET_M2/data/parallel_%s.RData", opt$subset))

parallelization correlation • 3.5k views

ADD COMMENT • link updated 5.1 years ago by Kevin Blighe 87k • written 5.1 years ago by pablo ▴ 300

0

Entering edit mode

Vincent, I have edited my answer. Please read it again (at the top). Another user identified a better approach to do what you need.

ADD REPLY • link 4.8 years ago by Kevin Blighe 87k

score 10 · Answer 1 · 2019-03-13

Edit July 18, 2019: forget my answer (below). It has been brought to my attention that fastcor can compute a large correlation matrix and only return the top half of this. See here: A: R functions for parallel processing

-------------------------------------

---------------------------------

Hi again Vincent,

Just for everyone else, this is a follow-up to a previous discussion about SNOW functionality and parallel implementation in R: Snowfall Parallelisation to compute Correlation Matrix

Vincent, I recommended mclapply in the previous thread in the knowledge that you were using some other package, which was then conducting some correlation analysis for you.

If you want to do the correlation yourself, foreach (works on all platforms) would be a better option. This code works:

load required packages

library(parallel)
library(doParallel)

Choose and set number of cores

cores <- makeCluster(detectCores(), type='PSOCK') # grabs max available

cores <- 6  # explicitly choose number of cores

Windows

cl <- makeCluster(getOption('cl.cores', cores))
registerDoParallel(cl)
registerDoSEQ()
on.exit(stopCluster(cl))

Mac / Linux / UNIX

options('mc.cores' = cores)
registerDoParallel(cores)

Generate 10000 x 2000 random data matrix

mat <- matrix(rexp(20000000, rate=.1), ncol=10000)
colnames(mat) <- paste0('Sample', 1:ncol(mat))
rownames(mat) <- paste0('Gene', 1:nrow(mat))
dim(mat)
[1]  2000 10000

mat[1:5,1:5]
        Sample1    Sample2    Sample3    Sample4   Sample5
Gene1 0.2991302 16.2747625 12.0524225 50.1207356 14.180350
Gene2 9.5160190  5.1977788  0.2951713  8.6172124  4.022473
Gene3 4.1651056  0.5793664  9.3953203  0.8494895 19.985724
Gene4 8.5216108 15.1742012  7.2660707  7.8875059  6.572094
Gene5 8.9048996  4.7932319  1.6599490  1.2448652 25.325664

use `foreach` to parallelise the correlation

res <- foreach(i = seq_len(ncol(mat)),
  .combine = rbind,
  .multicombine = TRUE,
  .inorder = FALSE,
  .packages = c('data.table', 'doParallel')) %dopar% {
  cor(mat[,i], mat, method = 'pearson')
}

res[1:10,1:10]
           Sample1      Sample2      Sample3      Sample4      Sample5      Sample6      Sample7      Sample8      Sample9     Sample10
 [1,]  1.000000000 -0.002229854 -0.019986425  0.012332275 -0.028186074  0.001989957 -0.023539602  0.019429899  0.018364991 -0.005468089
 [2,] -0.002229854  1.000000000 -0.028642588  0.043449095 -0.042194471  0.022356480  0.033054025 -0.023498252 -0.032226185 -0.009321138
 [3,] -0.019986425 -0.028642588  1.000000000 -0.027933870  0.006611151  0.047751182 -0.038504742 -0.015981761 -0.044780981  0.042516990
 [4,]  0.012332275  0.043449095 -0.027933870  1.000000000 -0.019726310 -0.011895754 -0.004191525  0.039994518  0.011178578 -0.038820658
 [5,] -0.028186074 -0.042194471  0.006611151 -0.019726310  1.000000000  0.010329296 -0.016766063 -0.004144883  0.032845623 -0.019617268
 [6,]  0.001989957  0.022356480  0.047751182 -0.011895754  0.010329296  1.000000000  0.007669371  0.002309490 -0.010981713 -0.008976607
 [7,] -0.023539602  0.033054025 -0.038504742 -0.004191525 -0.016766063  0.007669371  1.000000000  0.007287154 -0.021833100  0.004763376
 [8,]  0.019429899 -0.023498252 -0.015981761  0.039994518 -0.004144883  0.002309490  0.007287154  1.000000000  0.018914405 -0.012672712
 [9,]  0.018364991 -0.032226185 -0.044780981  0.011178578  0.032845623 -0.010981713 -0.021833100  0.018914405  1.000000000  0.004782734
[10,] -0.005468089 -0.009321138  0.042516990 -0.038820658 -0.019617268 -0.008976607  0.004763376 -0.012672712  0.004782734  1.000000000

check some results to ensure that we have done it right

table(round(cor(mat[,1:50]), 3) == round(res[1:50,1:50], 3))
TRUE 
2500 

table(round(cor(mat[,678:900]), 3) == round(res[678:900,678:900], 3))
 TRUE 
49729

table(cor(mat[,1], mat) == res[1,])
 TRUE 
10000 

table(cor(mat[,6666], mat) == res[6666,])
 TRUE 
10000

I have done this now using R on Windows and it worked fine.

Kevin

-------------------------------------

---------------------------------

load required packages

Choose and set number of cores

Generate 10000 x 2000 random data matrix

use foreach to parallelise the correlation

check some results to ensure that we have done it right

use `foreach` to parallelise the correlation