Question

Loading Large Bed Files Into Bioconductor

3

Entering edit mode

10.5 years ago

danielcook2013 ▴ 90

Hello -

I am trying to work with some very large BED Files obtained from GEO of ChIP-Seq results (Specifically - this data here). There are a few BED Files that I am hoping to visualize in addition to the MAC Peaks that were called. I wanted to try to use BioConductor to do this, but the BED Files (which contain upwards of 24 million aligned reads) cause R to choke. I am just starting with BioConductor.

What's the best way to visualize the reads with the peaks (called by MACS)? Is this something that can be done with BioConductor?

Thank you!

bioconductor chip-seq • 14k views

ADD COMMENT • link updated 7.9 years ago by ATpoint 81k • written 10.5 years ago by danielcook2013 ▴ 90

score 5 · Answer 1 · 2013-10-03

5

Entering edit mode

10.5 years ago

seidel 11k

When you say "use BioConductor" what specifically have you tried? The GenomicRanges and rtracklayer packages make this straightforward. Examining peaks could be as simple as:

library(GenomicRanges)
library(rtracklayer)

reads <- import.bed(con="yourfile.bed", asRangedData=F)

cov <- coverage(reads)

# for a peak on chr1 from 2000:3000
plot(cov[["chr1"]][1000:4000])

and it should work very efficiently with 24 million reads on most any computer.

ADD COMMENT • link 10.5 years ago by seidel 11k

0

Entering edit mode

How long should it take to read in the bed file? This works no problem with smaller files but once they get up to 500 Mb or a few Gb it chokes. I am on a brand new macbook pro with 8 Gb of memory.

ADD REPLY • link 10.5 years ago by danielcook2013 ▴ 90

0

Entering edit mode

It shouldn't take more than a few minutes in my experience. However, you can check where the choke point is. Do you use "top" on your mac? Open a terminal (Terminal.app), type 'top' at the command line, then start your import process. By looking at top, you will be able to see how much memory is being consumed, and if you are running out. Type 'q' to quit top.

ADD REPLY • link 10.5 years ago by seidel 11k

0

Entering edit mode

Still no luck in getting this to work. All my memory gets eaten up. Granted - some are large files (2-3gb). Am I missing something major here? I see how bioconductor could be useful ... but don't quite understand how this is the case if it can't work with large files. None of the tutorials cover this import process.... Although I would hope you don't have to load entire files in to work with them?

I am trying to zip and index the files with tabix right now...we'll see if that helps. Thanks for your suggestions.

update: zipping and indexing did not work.

ADD REPLY • link 10.5 years ago by danielcook2013 ▴ 90

0

Entering edit mode

You mention "All my memory gets eaten up." Have you examined the relationship between file size and memory? In other words, if you sample 1 million reads into a bed file (head -1000000 your.bed > new.bed) and try that, how much memory is used? Then try 2 million, how much is used? This will help you extraoplate the nature of your problem. I have noticed with some genomes that rtracklayer can consume resources an be inefficient (i.e. non-canonical genomes with thousands of contigs). If you're simply reading in a bed file, what happens to the tests above is you just read it in as a dataframe? You can always convert it to one of the efficient bionconductor objects after you have it in the R environment. Bioconductor can work with large files, but different packages and functions are not all guaranteed to work in the same hardware footprint.

ADD REPLY • link 10.5 years ago by seidel 11k

score 2 · Answer 2 · 2016-06-11

The question is ~3 years old now, but as sequencing technology constantly improves and creates more and more output per run:

You may use the data.table package, providing the fread() function. It loads a 50mio rows, 6 columns bed file in as little as 10 sec, without using more than 2GB of RAM on standard laptop. It clearly outcompetes the read.table() function of R base, which in my experience may take an eternity (if it finishes at all for very large files).

fread("/path/to/file.bed", data.table=FALSE, header=FALSE)

If you want to select certain columns, then use

fread("/path/to/file.bed", data.table=FALSE, header=FALSE, select=c(1,2,3,6)

This would give you only chr, start, end and strand. You can then proceed by transforming the df into a GRanges object, which allows to use coverage() function from the GenomicRanges package, as described in on of the above posts.