Question

Struggling to work on a large Count Matrix

0

4.2 years ago

castravete2712 • 0

Hello there,

I'm struggling to load my large count matrix into RStudio so that I can analyze it. It's only medium-sized data (3 GB), but R crashes every time and my PC wants to commit seppuku.

So importing it with a plain read.table won't work. I also tried storing it as a big.matrix, but that doesn't work either: R crashes again.

What can I do? I can't find any nice tutorial for this kind of problem.

R countMatrix single-cell • 2.2k views

ADD COMMENT • link updated 4.2 years ago by Jean-Karim Heriche 27k • written 4.2 years ago by castravete2712 • 0

2

Is it crashing due to running out of RAM? Have you tried a sparse matrix?
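
To illustrate why a sparse matrix helps here, a minimal sketch using the Matrix package. The simulated data (2000 genes, 500 cells, about 5% non-zero entries) are an assumption for illustration only; real single-cell count matrices are often sparser still.

```r
library(Matrix)

set.seed(1)
n_genes <- 2000
n_cells <- 500

# Simulated dense integer count matrix in which most entries are zero
dense <- matrix(0L, nrow = n_genes, ncol = n_cells)
nz <- sample(length(dense), size = 0.05 * length(dense))
dense[nz] <- rpois(length(nz), lambda = 2) + 1L

# Column-compressed sparse copy: only the non-zero entries are stored
sparse <- Matrix(dense, sparse = TRUE)

print(object.size(dense), units = "MB")
print(object.size(sparse), units = "MB")
```

The saving grows with the fraction of zeros, so it only pays off if the matrix never has to exist in dense form in memory first.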

ADD REPLY • link 4.2 years ago by Devon Ryan 104k

0

Yep, the RAM can't keep up.

I was thinking of that, but I can't manage to read the file directly into a sparseMatrix and skip the read.table step. read.matrix maybe?

ADD REPLY • link 4.2 years ago by castravete2712 • 0

2

How about doing that sequentially, like in chunks of 10%?
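
A rough sketch of how the chunked approach could look, combined with the sparse-matrix idea from above. It assumes a tab-separated file whose first column holds gene IDs and whose header names that column followed by one name per cell; the helper name read_counts_chunked and the chunk size are hypothetical.

```r
library(Matrix)

# Hypothetical helper: stream a delimited count table in blocks of rows and
# keep only a sparse copy of each block, so the full dense table never
# sits in memory at once.
read_counts_chunked <- function(path, chunk_rows = 1000L, sep = "\t") {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Assumed layout: header = ID column name, then one name per cell
  header <- strsplit(readLines(con, n = 1L), sep, fixed = TRUE)[[1]]
  cells <- header[-1]

  pieces <- list()
  genes <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break  # end of file reached
    chunk <- read.table(text = lines, sep = sep, header = FALSE, row.names = 1,
                        colClasses = c("character", rep("integer", length(cells))),
                        quote = "", comment.char = "")
    genes <- c(genes, rownames(chunk))
    pieces[[length(pieces) + 1L]] <- Matrix(as.matrix(chunk), sparse = TRUE)
  }

  # Stack the sparse blocks and restore gene and cell names
  counts <- do.call(rbind, pieces)
  dimnames(counts) <- list(genes, cells)
  counts
}

# Small demo file standing in for the real 3 GB matrix
demo <- data.frame(gene  = paste0("gene", 1:7),
                   cellA = c(0L, 2L, 0L, 0L, 5L, 0L, 0L),
                   cellB = c(1L, 0L, 0L, 0L, 0L, 0L, 3L))
path <- tempfile(fileext = ".tsv")
write.table(demo, path, sep = "\t", quote = FALSE, row.names = FALSE)

counts <- read_counts_chunked(path, chunk_rows = 3L)
```

Peak memory is then roughly one dense chunk plus the sparse pieces collected so far, so chunk_rows can be tuned to the available RAM.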

ADD REPLY • link 4.2 years ago by ATpoint 82k

3

By the way, don't bother with read.table, it is super slow. Use for example (among many good options) data.table::fread() or readr::read_delim(). The speed gains are notable.
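
For example, a minimal sketch with data.table::fread(), assuming a tab-separated file with gene IDs in the first column (the tiny demo file and its column names are made up for illustration):

```r
library(data.table)
library(Matrix)

# Tiny demo file standing in for the real count matrix
path <- tempfile(fileext = ".tsv")
demo <- data.table(gene  = paste0("gene", 1:5),
                   cellA = c(0L, 3L, 0L, 0L, 1L),
                   cellB = c(0L, 0L, 0L, 7L, 0L))
fwrite(demo, path, sep = "\t")

# fread parses the file much faster than read.table
dt <- fread(path, sep = "\t", header = TRUE)

# Keep the gene IDs, convert the numeric part to a sparse matrix,
# then drop the dense table to free memory
genes <- dt[[1]]
counts <- Matrix(as.matrix(dt[, -1]), sparse = TRUE)
rownames(counts) <- genes
rm(dt)
```

Note that this is faster but still holds the whole dense table in memory once, so on its own it may not be enough if RAM is the bottleneck.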

ADD REPLY • link 4.2 years ago by ATpoint 82k

1

Might be a job for {disk.frame} https://github.com/xiaodaigh/disk.frame

ADD REPLY • link 4.2 years ago by russhh 5.7k

0

Do you mean "load/import" a big file?

charging my large count Matrix on R

ADD REPLY • link 4.2 years ago by zx8754 11k

0

Yeah, sorry, I did mean to say importing or loading the data.

ADD REPLY • link 4.2 years ago by castravete2712 • 0

0

I struggle to see how this is related to bioinformatics, or why it has attracted so many answers. Loads of questions get killed for asking something about biology and maybe tangentially related to bioinformatics. I don't see how this question is related to either.

ADD REPLY • link 4.2 years ago by Mensur Dlakic ★ 27k

2

Dealing with large data sets has become a more common issue although it is not specific to bioinformatics. However, for bioinformatics data types, there may exist specific tools. Here we're dealing with a count matrix and although replies currently suggest generic solutions, maybe someone has a more specific solution for count matrices as part of their analysis pipeline that they can share.

ADD REPLY • link 4.2 years ago by Jean-Karim Heriche 27k

0

Agreed. The single-cell packages are starting to output counts in sparse matrices inside hdf5 containers for this reason, so if one could go back a step in OP's workflow there are likely some tweaks that could be made there to make life easier.

ADD REPLY • link 4.2 years ago by Devon Ryan 104k

0

Maybe I am missing your point. Do you see anything in the question that implies biology or bioinformatics application of what this poster is trying to do?

My point was that lots of posters are turned away even though they sometimes have a legitimate biology question that may be related to bioinformatics. To me, that is closer to the intended purpose of this site than the current post.

ADD REPLY • link 4.2 years ago by Mensur Dlakic ★ 27k

0

Your point is valid, but as it does not add to the content of this thread I suggest we discuss things like that in our Slack, which you are invited to join:

biostar.slack.com: Chat for the biostars community -- [ feel free to join ]

ADD REPLY • link 4.2 years ago by ATpoint 82k

0

Well, since it didn't seem important to say why I needed it, I didn't mention it. I need to import this large, complex data set into R because I have to analyze a large count matrix from single-cell sequencing (SPLiT-seq, to be precise).

The count matrices produced by the pipelines that process the raw single-cell data are huge, so R has trouble working with them and needs a lot of RAM.

So I was just asking around, because I can't find the right method for using R with such large files, and I want to do it properly.

ADD REPLY • link 4.2 years ago by castravete2712 • 0

score 2 · Answer 1 · 2020-02-12

2

4.2 years ago

Jean-Karim Heriche 27k

Two R packages that may be of interest:

the bigmemory package and its read.big.matrix() function.
the ff package with its read.table.ffdf() function.
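
A minimal sketch of the first option, assuming a purely numeric CSV with a header row of cell names and no gene ID column; the file names and the tiny demo matrix are made up for illustration:

```r
library(bigmemory)

# Fresh working directory for the demo files
dir <- tempfile("bigmem_demo")
dir.create(dir)

# Tiny demo CSV standing in for the real count matrix
csv <- file.path(dir, "counts_demo.csv")
demo <- matrix(c(0L, 3L, 0L, 0L, 0L, 7L), nrow = 3,
               dimnames = list(NULL, c("cellA", "cellB")))
write.csv(demo, csv, row.names = FALSE, quote = FALSE)

# File-backed big.matrix: the values live in a binary file on disk,
# not in R's heap
big <- read.big.matrix(csv, sep = ",", header = TRUE, type = "integer",
                       backingfile = "counts_demo.bin",
                       descriptorfile = "counts_demo.desc",
                       backingpath = dir)

dim(big)
big[1:2, ]  # only the requested rows are pulled into RAM
```

In a later session the same data should be reachable without re-parsing the text file via attach.big.matrix() on the descriptor file. A big.matrix holds a single numeric type, so gene IDs would have to be handled as row names or kept separately.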

ADD COMMENT • link 4.2 years ago by Jean-Karim Heriche 27k