I think you could get an answer to the first part of your question with a series of a few set operations, using BEDOPS support for Unix streams, i.e., piping results from one bedops
or bedmap
operation to the next.
Here's the "flow" of operations:
Retrieve all elements from the input BED file that fall entirely within -N of the original element's start and +N of the original element's end (this will include the original element itself, but that's fine)
Of those candidates, do some culling to remove elements that are out of bounds:
a. Eliminate the ones that fall entirely within -N of original start and -N of original end
b. Eliminate those that fall entirely within +N of original start and +N of original end
Merge whatever intervals are left from culling; this is the answer
It may help to sketch this out on paper to see how this works.
To do this in an automated way, perhaps:
#!/bin/bash
N=1234
input=intervals.bed
bedops --range -${N}:${N} --everything ${input} > all.bed
bedops --range -${N}:-${N} --everything ${input} > left.bed
bedops --range ${N}:${N} --everything ${input} > right.bed
bedops --element-of 100% ${input} all.bed \
| bedops --not-element-of 100% - left.bed \
| bedops --not-element-of 100% - right.bed \
| bedops --merge - \
> answer.bed
rm all.bed left.bed right.bed
The result answer.bed
should contain a merge of intervals which meet your criteria.
For example, with N
set to 5
and the following input file intervals.bed
:
chr1 8 22 id-1
chr1 10 25 id-2
chr1 14 26 id-3
The merged genomic space is:
chr1 8 25
because:
chr1 14 26 id-3
will fall outside of the overall +/-5 nt bounds.
I believe the logic here is correct, but check this with your input and see if this works for you.
If you want a more efficient script that doesn't generate intermediate files, process substitutions should work at the cost of readability:
#!/bin/bash
N=1234
input=intervals.bed
bedops --element-of 100% ${input} <(bedops --range -${N}:${N} --everything ${input}) \
| bedops --not-element-of 100% - <(bedops --range -${N}:-${N} --everything ${input}) \
| bedops --not-element-of 100% - <(bedops --range ${N}:${N} --everything ${input}) \
| bedops --merge - \
> answer.bed
The first script, however, may help for troubleshooting and double-checking my work, since it clearly outlines the sets and set operations being performed.
I'm out of time and I'll have to think some more about the second part of your question later on. Using awk
to preprocess the input BED file and piping it to bedops
or bedmap
may prove a useful approach.
for
bedtools merge
with option -dHi Pierre, I've adjusted my description to be more specific. The Bedtools option is much more lenient than what I am looking for.
It's highly unlikely that anyone has written a program to handle this very unusual methodology, so you're probably going to have to write it yourself (or daisy chain a bunch of different things together). Out of curiosity, what is the actual use-case where this is useful?
Hi Devon, I'm trying to merge read pair information that represent anomolous SVs. Given that the length of the SV can be up to a chromosome length, the usual parameters for these tools don't provide options that would suitably merge bed entries that represent the same event, while allowing for the insert size noise, especially when and inversion is inside a deletion. Basically I want to merge read pairs that show the same SV.
I'm a bit surprised that the SV callers don't do this, thanks for the info.
Hopefully piping with BEDOPS works for your needs. Let me know if you have any questions!
Hello,
Just wondering if you ever found a working solution for this since I am dealing with the exact same problem. I wrote a not so great solution in Python but it doesn't seem to work exactly as intended. I can post it if you're interested.
Please let me know, any help would be greatly appreciated. Thanks!