You don't need to split the file.
Sort your BED file with BEDOPS sort-bed
, if unsorted:
$ sort-bed reads.unsorted.bed > reads.bed
Then pass the chromosome name to bedops
or bedmap
or closest-features
via --chrom <chromosome>
to do work only on that chromosome:
$ bedops --chrom chrN --operations ...
Or:
$ bedmap --chrom chrN --operations ...
Etc.
If you want a list of all operations, take a look at the BEDOPS documentation.
If you want a fast list of chromosomes:
$ bedextract --list-chr reads.bed
You can pipe this list to a script loop, to do work over each chromosome on a computational cluster. BEDOPS enables parallelization by chromosome pretty easily.
These operations are very fast, because this uses the sort order in sort-bed
sorted BED files to do a binary search that determines chromosome bounds. Operations that take minutes or hours can go down to seconds.
If your reads are single-end, you should be able to add --faster
to BEDOPS commands to make them use similar techniques to speed operations up even further.
Paired-end reads can result in "nested" elements that prevent the use of --faster
without some pre-processing tricks. Your data is on the multi-GB scale such that, if you are working with paired-end reads, you may be interested in investigating these tricks if you're doing these operations repeatedly. See the "extra work" note in the bedextract
documentation.
Thank you very much for the detailed information. The list of positions I'm talking about is no longer at reads level - they are just chromosome & positions. It looks like I should turn this list into another bed file, and just use "6.1.2.3.3. Retrieving elements which overlap target elements" and I'm going to give it a try.