Correlation Matrix - Big Data - R
5.2 years ago
pablo ▴ 300

Hello everyone,

I would like to create a correlation matrix from big data (about 1,000,000 genes and 40 samples).

I use R, and the resulting matrix is too big (4994.2 Gb) to fit in memory, so it can't be created. I work on a cluster with substantial computing power, so I don't think the problem comes from the cluster itself.

Have you got an idea to generate this matrix?

Best,

Vincent

Tags: R, Hadoop, correlation matrix

are you clustering the genes or the samples?


I am clustering genes

5.2 years ago

Could you post the command you tried? Do you want to compute correlation between samples, or between genes? Also, which species has 1,000,000 genes?

For correlation between samples :

# generate test dataset: 40 samples x 1,000,000 genes
m <- matrix(runif(40e6, min = 0, max = 100), nrow = 1e6, ncol = 40)
m <- as.data.frame(m)
colnames(m) <- paste0("sample", 1:40)
rownames(m) <- paste0("gene", 1:1e6)

# compute correlation
cor.res <- cor(m)

For gene-gene correlation you will have to generate a 1,000,000 x 1,000,000 matrix, which will be quite big in memory:

# Example of 1M x 1M matrix in R
m <- matrix(0,ncol=1e6,nrow=1e6)
Error: cannot allocate vector of size 7450.6 Gb
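The number in that error message is easy to verify: a dense matrix of doubles costs 8 bytes per cell, and R reports allocation sizes in binary gigabytes (1024^3 bytes).

```r
# memory needed for a dense 1e6 x 1e6 matrix of doubles (8 bytes per cell)
cells <- 1e6 * 1e6
bytes <- cells * 8
bytes / 1024^3   # ~7450.6, matching the "7450.6 Gb" in the error above
```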

Maybe you could find a solution using the bigmemory or ff packages, which keep the matrix on disk instead of in RAM.

In fact someone already implemented a solution based on ff.
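A minimal block-wise sketch of that idea, in plain base R (the demo dimensions and block size are placeholders; in practice each finished block would be written to an ff/bigmemory backing file or a database rather than kept in memory):

```r
# Sketch: compute gene-gene Pearson correlations block by block, so only
# one block of the huge result matrix exists in RAM at any time.
set.seed(42)
n_genes <- 1000; n_samples <- 40            # demo sizes, scale as needed
m <- matrix(runif(n_genes * n_samples), nrow = n_genes)

xs <- t(scale(t(m)))                        # center & scale each gene (row)
block <- 250
for (i in seq(1, n_genes, by = block)) {
  ii <- i:min(i + block - 1, n_genes)
  # correlation of genes ii against all genes: cross-product of scaled rows
  cor_block <- tcrossprod(xs[ii, , drop = FALSE], xs) / (n_samples - 1)
  # here you would write cor_block to disk (e.g. a bigmemory backing
  # file or a database) instead of accumulating it in memory
}
```

Each `cor_block` is identical to the corresponding rows of `cor(t(m))`, so the blocks can simply be concatenated on disk.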


I would actually like to compute correlation between genes, in order to cluster them afterwards. I work on metagenomics data.

I use the "propr" package:

library(propr)

test <- read.table("/home/vipailler/PROJET_M2/raw/truelength2.prok2.uniref2.rare> tsv", header = TRUE, row.names = 1, sep = "\t")

test <- t(test)

propr <- propr(test, metric = "rho")

Alert: Replacing 0s with next smallest value.
Alert: Saving log-ratio transformed counts to @logratio.
Error: cannot allocate vector of size 4994.2 Gb

5.2 years ago

There are various solutions to this. Here are a few suggestions:

  1. Perform clustering on a random sample of the data, then assign the remaining data points to the closest/most similar cluster.
  2. A related approach is to remove as much of the data as possible in a preprocessing step: are all of the million genes really of interest? A large fraction of them could probably be put into a group such as "not interesting because they do not vary across samples".
  3. Use an online clustering algorithm (e.g. online k-means).
  4. If you need the whole correlation matrix, first note that the matrix is symmetric, so you only need one half of it (minus the diagonal). Second, parallelize the computation by splitting it into groups of rows, and store the results in a database. You can save space by only storing significant values (i.e. sufficiently different from 0). Clustering from the database can then be done, for example, with agglomerative hierarchical clustering.
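Suggestion 4 can be sketched in base R as follows (the threshold, demo sizes, and block size are illustrative; the `edges` data frame stands in for the database table):

```r
# Sketch: keep only the upper triangle and only "significant" values
# (|r| above a cutoff), stored as a sparse edge list rather than the
# full dense matrix.
set.seed(1)
n_genes <- 500; n_samples <- 40             # demo sizes
m <- matrix(runif(n_genes * n_samples), nrow = n_genes)
xs <- t(scale(t(m)))                        # center & scale each gene
threshold <- 0.5                            # illustrative cutoff

edges <- list()
block <- 100
for (i in seq(1, n_genes, by = block)) {
  ii <- i:min(i + block - 1, n_genes)
  r <- tcrossprod(xs[ii, , drop = FALSE], xs) / (n_samples - 1)
  # keep strong pairs in the upper triangle only (global row < column)
  keep <- which(abs(r) >= threshold & (row(r) + i - 1) < col(r),
                arr.ind = TRUE)
  if (nrow(keep) > 0)
    edges[[length(edges) + 1]] <- data.frame(
      gene1 = keep[, 1] + i - 1,            # convert to global index
      gene2 = keep[, 2],
      r     = r[keep])
}
edges <- do.call(rbind, edges)              # one row per strong gene pair
```

In a real run each batch of rows would be inserted into the database instead of appended to a list, and the loop over blocks is what you would parallelize across cluster nodes.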
