Question

Locate positions in a large bed file

0

Entering edit mode

6.5 years ago

DVA ▴ 630

I have a ~20G bed file for a few samples, and the bed file contains read depth for all positions of the human genome. Also, I have a list of chromosome positions (~30,000 to 50,000) for each sample, and I would like to know the read depth at these positions. What would be the most efficient method please?

I'm currently thinking to split my large bed file and the list per chromosome or block (via grep or BEDOPS), and work from there, but I would love to know better ways to do this. Thank you very much.

bed • 2.2k views

ADD COMMENT • link updated 6.5 years ago by Alex Reynolds 35k • written 6.5 years ago by DVA ▴ 630

1

Entering edit mode

6.5 years ago

GenoMax 141k

BEDOPS should have a function equivalent to intesectbed from Bedtools.

ADD COMMENT • link 6.5 years ago by GenoMax 141k

0

Entering edit mode

Thank you so much for your reply. I am looking at this too - do you happen to know how fast these two functions are for large files?

ADD REPLY • link 6.5 years ago by DVA ▴ 630

score 2 · Accepted Answer · 2017-10-10

You don't need to split the file.

Sort your BED file with BEDOPS sort-bed, if unsorted:

$ sort-bed reads.unsorted.bed > reads.bed

Then pass the chromosome name to bedops or bedmap or closest-features via --chrom <chromosome> to do work only on that chromosome:

$ bedops --chrom chrN --operations ...

Or:

$ bedmap --chrom chrN --operations ...

Etc.

If you want a list of all operations, take a look at the BEDOPS documentation.

If you want a fast list of chromosomes:

$ bedextract --list-chr reads.bed

You can pipe this list to a script loop, to do work over each chromosome on a computational cluster. BEDOPS enables parallelization by chromosome pretty easily.

These operations are very fast, because this uses the sort order in sort-bed sorted BED files to do a binary search that determines chromosome bounds. Operations that take minutes or hours can go down to seconds.

If your reads are single-end, you should be able to add --faster to BEDOPS commands to make them use similar techniques to speed operations up even further.

Paired-end reads can result in "nested" elements that prevent the use of --faster without some pre-processing tricks. Your data is on the multi-GB scale such that, if you are working with paired-end reads, you may be interested in investigating these tricks if you're doing these operations repeatedly. See the "extra work" note in the bedextract documentation.