Question

merge Bed entries with both ends within maximum distance

0

Entering edit mode

6.2 years ago

Richard ▴ 590

Hi folks,

I have a bed file like the following:

19      52175458        52177604        13
19      52175458        52177663        17
19      52175458        52177717        39
19      52175458        52178717        43

and I would like to be able to merge any entries that have both ends of both merged entries to be within N bases. For example, I would like to either:

merge when the start coord of two entries are within N bases and the end coords are also within N bases of each other
merge entries where the sum of the differences in start and end coords for two entries is less than 2*N.

EDIT Note that bedtools merge allows a -d option to merge when some amount of each entry overlaps another, or they are within some maximum distance. What I am looking for is more specific in that it is requiring both ends of merged entries to match within some distance

bedops/bedmap comes closer allowing merging of entries with required overlap represented by --fraction-both and similar, but doesn't seem to allow restriction of a match to cases where the total non-overlapping amount of the entries is less than some value.

thanks!

next-gen alignment • 2.4k views

ADD COMMENT • link updated 6.2 years ago by Alex Reynolds 35k • written 6.2 years ago by Richard ▴ 590

0

Entering edit mode

for

merge when the start coord of two entries are within N bases while the end coords are also within N bases of each otherc

bedtools merge with option -d

-d Maximum distance between features allowed for features to be merged. Default is 0. That is, overlapping and/or book-ended features are merged.

ADD REPLY • link 6.2 years ago by Pierre Lindenbaum 162k

0

Entering edit mode

Hi Pierre, I've adjusted my description to be more specific. The Bedtools option is much more lenient than what I am looking for.

ADD REPLY • link 6.2 years ago by Richard ▴ 590

0

Entering edit mode

It's highly unlikely that anyone has written a program to handle this very unusual methodology, so you're probably going to have to write it yourself (or daisy chain a bunch of different things together). Out of curiosity, what is the actual use-case where this is useful?

ADD REPLY • link 6.2 years ago by Devon Ryan 104k

1

Entering edit mode

Hi Devon, I'm trying to merge read pair information that represent anomolous SVs. Given that the length of the SV can be up to a chromosome length, the usual parameters for these tools don't provide options that would suitably merge bed entries that represent the same event, while allowing for the insert size noise, especially when and inversion is inside a deletion. Basically I want to merge read pairs that show the same SV.

ADD REPLY • link 6.2 years ago by Richard ▴ 590

0

Entering edit mode

I'm a bit surprised that the SV callers don't do this, thanks for the info.

ADD REPLY • link 6.2 years ago by Devon Ryan 104k

0

Entering edit mode

Hopefully piping with BEDOPS works for your needs. Let me know if you have any questions!

ADD REPLY • link 6.2 years ago by Alex Reynolds 35k

0

Entering edit mode

Hello,

Just wondering if you ever found a working solution for this since I am dealing with the exact same problem. I wrote a not so great solution in Python but it doesn't seem to work exactly as intended. I can post it if you're interested.

Please let me know, any help would be greatly appreciated. Thanks!

ADD REPLY • link 4.1 years ago by pierrj • 0

score 1 · Answer 1 · 2018-03-27

1

Entering edit mode

6.2 years ago

Alex Reynolds 35k

I think you could get an answer to the first part of your question with a series of a few set operations, using BEDOPS support for Unix streams, i.e., piping results from one bedops or bedmap operation to the next.

Here's the "flow" of operations:

Retrieve all elements from the input BED file that fall entirely within -N of the original element's start and +N of the original element's end (this will include the original element itself, but that's fine)
Of those candidates, do some culling to remove elements that are out of bounds:

a. Eliminate the ones that fall entirely within -N of original start and -N of original end

b. Eliminate those that fall entirely within +N of original start and +N of original end
Merge whatever intervals are left from culling; this is the answer

It may help to sketch this out on paper to see how this works.

To do this in an automated way, perhaps:

#!/bin/bash

N=1234
input=intervals.bed

bedops --range -${N}:${N} --everything ${input} > all.bed
bedops --range -${N}:-${N} --everything ${input} > left.bed
bedops --range ${N}:${N} --everything ${input} > right.bed

bedops --element-of 100% ${input} all.bed \
    | bedops --not-element-of 100% - left.bed \
    | bedops --not-element-of 100% - right.bed \
    | bedops --merge - \
    > answer.bed

rm all.bed left.bed right.bed

The result answer.bed should contain a merge of intervals which meet your criteria.

For example, with N set to 5 and the following input file intervals.bed:

chr1    8       22      id-1
chr1    10      25      id-2
chr1    14      26      id-3

The merged genomic space is:

chr1    8       25

because:

chr1    14      26      id-3

will fall outside of the overall +/-5 nt bounds.

I believe the logic here is correct, but check this with your input and see if this works for you.

If you want a more efficient script that doesn't generate intermediate files, process substitutions should work at the cost of readability:

#!/bin/bash

N=1234
input=intervals.bed

bedops --element-of 100% ${input} <(bedops --range -${N}:${N} --everything ${input}) \
    | bedops --not-element-of 100% - <(bedops --range -${N}:-${N} --everything ${input}) \
    | bedops --not-element-of 100% - <(bedops --range ${N}:${N} --everything ${input}) \
    | bedops --merge - \
    > answer.bed

The first script, however, may help for troubleshooting and double-checking my work, since it clearly outlines the sets and set operations being performed.

I'm out of time and I'll have to think some more about the second part of your question later on. Using awk to preprocess the input BED file and piping it to bedops or bedmap may prove a useful approach.

ADD COMMENT • link 6.2 years ago by Alex Reynolds 35k

0

Entering edit mode

Hi Alex, I was just trying this out, and I'm not quite where I need to be. Can you take a look?

Example bed file (output of sort-bed):

9 90917053 90918018
9 90921192 90922024
9 90945236 90946137
9 90982846 90983792
9 91000814 112010931
9 91005014 91005862
9 91009275 91014097
9 91009296 91014077
9 91009474 91014162

and I'm trying something simpler (and I imagine incorrect) than your example above:

N=300
$sortBed $input > ${input}.sorted.bed

$bedops --range -${N}:${N} --everything ${input}.sorted.bed > outer.bed
$bedops --range ${N}:-${N} --everything ${input}.sorted.bed > inner.bed

$bedops --element-of 100% ${input}.sorted.bed outer.bed \
    | $bedops --not-element-of 100% - inner.bed \
    | $bedops --merge -

but this doesn't report any of my records beyond 9 91000814 112010931. The rest of the records are being dropped during intersection with 'inner.bed'. I believe this is happening because that one large record encapsulates all those that follow it.

ADD REPLY • link 6.2 years ago by Richard ▴ 590

0

Entering edit mode

I think that is correct, since successive elements fall entirely with the "forbidden space" in inner.bed for that large record:

$ echo -e "9\t91009275\t91014097" | bedmap --echo --echo-map --fraction-ref 1 - inner.bed
9       91009275        91014097|9      91001114        112010631

I suspect that you want to do a merge test for every single, individual element against all others. Or maybe successive pairs of elements? Can you confirm if this is the case or not? If so, I think a little tweaking will help though it may take a bit longer to run.

ADD REPLY • link 6.2 years ago by Alex Reynolds 35k

0

Entering edit mode

Hi Alex, Thanks for your continued help with this!

I have bed elements representing all sort of anomalous read pairs from all around the genome. I am looking to merge (and count) the entries that redundantly represent the same event. To qualify as a matching event I wanted to apply a distance threshold on each end of each bed entry based on some insert size distribution estimate. The example you gave me above gets close to this, but got stuck when a large spanning read pair absorbed a number of records that were a small subset of the large record.

The whole point of this is to generate a count of reads that show a similar event. If you can spare the bandwidth to help further, can you also show me how to count the merged entries?

ADD REPLY • link 6.2 years ago by Richard ▴ 590

0

Entering edit mode

This reports the number of elements from the original set, which fall entirely with the listed element:

$ bedmap --echo --count --echo-map --fraction-map 1 elems.bed 
9   90917053    90918018|1|9    90917053    90918018
9   90921192    90922024|1|9    90921192    90922024
9   90945236    90946137|1|9    90945236    90946137
9   90982846    90983792|1|9    90982846    90983792
9   91000814    112010931|5|9   91000814    112010931;9 91005014    91005862;9  91009275    91014097;9  91009296    91014077;9  91009474    91014162
9   91005014    91005862|1|9    91005014    91005862
9   91009275    91014097|2|9    91009275    91014097;9  91009296    91014077
9   91009296    91014077|1|9    91009296    91014077
9   91009474    91014162|1|9    91009474    91014162

As you can see, the 9:91001114-112010631 element grabs a lot of nested elements.

Not all of those nested elements may necessarily meet the +-N criteria you set, but it may provide a starting point for different logic.

ADD REPLY • link 6.2 years ago by Alex Reynolds 35k