Biostar Beta. Not for public use.
Question: Analyzing big Hi-C data: is there a way to find significant interactions?

Hey everyone.

I have just started an internship in bioinformatics and I have to deal with big data from Hi-C. I want to analyze my data with R.
The data look like this: chrom start end count
I want to build a matrix where each bin is filled with the count, then select significant interactions and (if possible) plot a heatmap.

It works at low resolution (500 kb, 100 kb), but when I run my code at high resolution (10 kb, 5 kb), problems occur and R refuses to compute with data that big.
So I tried a sparse matrix, but I can't run all of my code on it; I have to convert it back to a dense matrix.

So if you have a solution and have already dealt with this kind of problem, let me know.
If you have a method to find significant interactions at high resolution, that would be great. =)
Thank you very much,

Baptiste

Asked 4.6 years ago by Baptiste • 90 • updated 4.4 years ago by Bryan Lajoie • 10

Can you be a bit more specific than "R doesn't want to compute"? Are you running out of memory? Or getting an error message (which one)? Is it a problem with a package from CRAN, or with your own code, or with the data itself? What kind of operation are you trying to apply?

4.6 years ago by Jean-Karim Heriche 19k

Hey,

Thank you for the reply. So this is my code:

file1: the raw Hi-C data (chrom start end count)
file2: the file of normalization factors
binSize: the resolution, so 500 kb = 5e5
dimension: size of the matrix

library(Matrix)

# Columns used here are start, end, count (V1..V3); shift the indices
# if the file really has a leading chrom column.
MyMatrix <- sparseMatrix(i = file1$V1 / binSize + 1,
                         j = file1$V2 / binSize + 1,
                         x = file1$V3,
                         dims = c(dimension, dimension))
v <- file2$V1

# Avoid v %o% v: that outer product is a dense dimension x dimension
# matrix. Scaling rows and columns by 1/v gives the same result and
# stays sparse.
D <- Diagonal(x = 1 / v)
MatrixNorm <- D %*% MyMatrix %*% D

# Symmetric matrix for the heatmap; as.matrix() would densify the
# whole thing, so keep it sparse as long as possible.
MatrixNorm1 <- forceSymmetric(MatrixNorm)

The real problem is not really my code, but dealing with big data in R and finding significant interactions between both sides of the DNA. I am sorry if my request was not clear.

The real question is: is there a way to find significant interactions at high resolution?

4.6 years ago by Baptiste • 90

Hey Asaf, thanks for your reply.

The problem is that I don't have the mapped data. I tried to find other software, like:
-SeqMonk
-HOMER
-HiClib
-HiBrowse

But these tools only accept mapped data as input, and converting my data to work that way seems like a lot of effort.
What do you think?

4.6 years ago by Baptiste • 90

To solve your problem with large matrices, I recommend doing your analysis per chromosome. This will dramatically reduce the size of the matrix.

Moreover, which method are you using to identify enriched contacts?

Apart from the problem of handling large matrices in R, I would be concerned that, with increased resolution, the statistical power to discern significant contacts is reduced. Be sure that you have sufficient counts per cell in your matrix. This is of course dependent on the depth of sequencing, the final number of usable reads, and the size of the genome.

4.6 years ago by Fidel ♦ 1.9k

Hi Fidel,
Thank you for your reply.
Yes, I forgot to specify that I only work with intrachromosomal interactions, and on a single chromosome.
To identify enriched contacts, I use quantile() to find a threshold, then select the values above it.
Yes, this is a real problem, because there are a lot of NaN values (meaning the normalization does not converge) and I have to deal with them. Unfortunately, I can't replace NaN with 0.
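For illustration, that quantile-thresholding step might look like this in NumPy (a toy random matrix stands in for the normalized Hi-C data, and the 0.95 cutoff is just an example). The key point is to take the quantile over finite, non-zero entries only, so the NaN and empty bins don't drag the threshold down:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "normalized" contact matrix with a few NaN bins, standing in
# for the real normalized Hi-C matrix.
counts = rng.poisson(2.0, size=(100, 100)).astype(float)
counts[0, :5] = np.nan

# Use only observed (finite, non-zero) entries to pick the threshold.
observed = counts[np.isfinite(counts) & (counts > 0)]
threshold = np.quantile(observed, 0.95)   # 0.95 is an arbitrary choice

# Candidate "significant" interactions: finite and above the threshold.
significant = np.isfinite(counts) & (counts > threshold)
print(threshold, int(significant.sum()))
```

Whether such a global quantile is a sound definition of "significant" is a separate question (see the distance-bias comments below in the thread), but this keeps the NaN handling explicit.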

4.6 years ago by Baptiste • 90

I work with Python and so far I haven't had a problem with matrix size. Maybe you can try Python.
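A sparse Python version of the matrix-building and normalization step above might look like this with scipy.sparse; scaling rows and columns by 1/v replaces the outer product, so nothing dense is ever materialized. The bin size, dimensions, and triples here are made up for illustration, not the poster's data:

```python
import numpy as np
from scipy.sparse import coo_matrix, diags

bin_size = 10_000   # assumed 10 kb resolution
n_bins = 500        # assumed matrix dimension for one chromosome

# Toy (start_i, start_j, count) triples mimicking the dumped Hi-C file.
start_i = np.array([0, 10_000, 20_000])
start_j = np.array([10_000, 20_000, 40_000])
count = np.array([12.0, 7.0, 3.0])

i = start_i // bin_size
j = start_j // bin_size
m = coo_matrix((count, (i, j)), shape=(n_bins, n_bins)).tocsr()
m = m + m.T      # symmetrize (only off-diagonal entries in this toy)

# Per-bin normalization vector; dividing rows and columns by it is
# equivalent to dividing by the outer product, but stays sparse.
v = np.ones(n_bins)
d = diags(1.0 / v)
m_norm = d @ m @ d

print(m_norm[1, 0], m_norm.nnz)
```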

Have you checked the methods to compute long-range contacts by Job Dekker (Sanyal et al., Nature, 2012), Victor Corces (Hou et al., Mol. Cell, 2012), Bing Ren (same as Corces's) (Jin et al., Nature, 2013), and Lieberman-Aiden (Rao et al., 2014)?

4.6 years ago by Fidel ♦ 1.9k

You might want to try python + numpy for this.

Though: why do you need to hold the entire genome-wide matrix in memory? Do you need the trans data as well? Can you do as Fidel suggests and perform your calculations on each chromosome separately? Can you do your calculations in blocks/chunks?

In fact, the matrix format, while useful for visualization, is not ideal as a data structure. What about sub-setting by genome distance? Then you can remove the first n diagonals from the matrix, which in sparse format effectively removes them from memory.
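Dropping the first n diagonals can be done directly on the sparse COO triplets; a small sketch (the function name and the n_diag=1 cutoff are illustrative):

```python
import numpy as np
from scipy.sparse import coo_matrix

def drop_near_diagonal(m, n_diag):
    """Remove entries within n_diag bins of the main diagonal."""
    m = m.tocoo()
    keep = np.abs(m.row - m.col) > n_diag
    return coo_matrix((m.data[keep], (m.row[keep], m.col[keep])),
                      shape=m.shape)

# Toy 4x4 contact matrix: strong short-range signal near the diagonal.
m = coo_matrix(np.array([[9., 5., 1., 0.],
                         [5., 9., 4., 2.],
                         [1., 4., 9., 6.],
                         [0., 2., 6., 9.]]))
far = drop_near_diagonal(m, n_diag=1)
print(far.toarray())
```

Only the four entries more than one bin off the diagonal survive; everything near the diagonal is gone from storage, not just zeroed.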

Some recent papers perform local peak calling, which in effect allows each submatrix to be 'peak-called' independently; this reduces memory requirements and allows you to compute in parallel! Though you need to think carefully about what you hope to achieve when calling peaks and what your definition of a 'peak/loop' actually means (global vs. local peak calling will produce vastly different results).

Also be aware of the distance bias in any interaction data. Loci close in the linear genome will also be close in the 3D genome and will have the strongest interaction signals. Depending on how you implement your peak calling, you may have to normalize for genomic distance before performing any quantile-based peak calling!
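One common way to do that distance normalization is observed/expected: divide each diagonal of the intra-chromosomal matrix by its mean. A dense sketch (on real high-resolution data you would apply the same idea per diagonal of the sparse matrix):

```python
import numpy as np

def observed_over_expected(m):
    """Divide each diagonal of a symmetric intra-chromosomal matrix
    by its mean contact count, removing the distance-decay bias."""
    m = np.asarray(m, dtype=float)
    n = m.shape[0]
    oe = np.zeros_like(m)
    for d in range(n):
        diag = np.diagonal(m, offset=d)
        expected = diag.mean()
        if expected > 0:
            idx = np.arange(n - d)
            oe[idx, idx + d] = diag / expected
            oe[idx + d, idx] = diag / expected
    return oe

m = np.array([[10., 4., 1.],
              [4., 12., 5.],
              [1., 5., 11.]])
oe = observed_over_expected(m)
print(oe)
```

After this, a quantile threshold compares each bin against other bins at the same genomic distance, rather than letting the near-diagonal signal dominate.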

4.4 years ago by Bryan Lajoie • 10

Have you tried the Bioconductor packages GOTHiC and/or HiTC?

Regardless of these packages, you can bin your data according to the restriction enzyme recognition sites, which should reduce its complexity (if it's not already binned).
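The binning itself can be as simple as integer division of fragment midpoints by the bin size; a sketch with made-up fragment coordinates:

```python
import numpy as np

bin_size = 500_000  # 500 kb bins, as in the low-resolution run above

# Toy restriction-fragment coordinates (assumed, not real data).
frag_start = np.array([120_000, 480_000, 510_000, 1_200_000])
frag_end = np.array([130_000, 505_000, 520_000, 1_250_000])

# Assign each fragment to the bin containing its midpoint.
midpoint = (frag_start + frag_end) // 2
bin_index = midpoint // bin_size
print(bin_index)  # -> [0 0 1 2]
```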

4.6 years ago by Asaf 5.6k
