You might want to try python + numpy for this.
Though - why do you need to hold the entire genome-wide matrix in memory? Do you need the trans data as well? Or you can do as Fidel suggests and perform your calculations on each chromosome separately? Can you do your calculations in blocks/chunks?
In fact, the matrix format while useful for visualization, is not ideal as a data structure. What about sub-setting by genome distance (then you can remove n-diagonals from the matrix - effectively hidden from memory in sparse format)?
Some recent papers perform a local peak calling, which in effect allows each submatrix to be 'peak-called' independently and thus reduces memory requirements and allows you to compute in parallel! Though you need to think carefully about what you hope to achieve when calling peaks and what your definition of a 'peak/loop' actually means. (global vs local peak calling will produce vastly different results)
Also be aware about the distance bias with any interaction data. Loci close in the linear genome will also be close in the 3D genome and will have the strongest interactions signals. Depending on how you implement your peak calling - you may have to normalize for genome distance first before performing any quantile based peak calling...!