bedtools merge - removing features completly overlapped by other features
1
0
Entering edit mode
6.6 years ago
Adrian Pelin ★ 2.6k

Hello,

Assuming we start with only one bed file, example:

F1 ------------------------- 
F2     ----
F3                ----
F4                            --------------

Default bedtools merge behaviour will output one sole feature spanning from F1 start to F4 end, since F1 and F4 overlap.

Is it possible to remove features that are completely covered by any other feature? In other words, output should be F1 and F4.

Thanks.

EDIT: I suppose another way to ask this is "How can I output features from one file that do not overlap 100% any other features in the same file?"

bedtools • 4.1k views
ADD COMMENT
1
Entering edit mode
6.6 years ago

Here's a one-liner with another toolkit:

$ bedops -m A.bed B.bed | bedmap --echo-map --exact - <(bedops -u A.bed B.bed) | sort-bed - > answer.bed

Or if you only have one file:

$ bedops -m C.bed | bedmap --echo-map --exact - C.bed > answer.bed

It is assumed A.bed and B.bed and C.bed are sorted before set operations, i.e.:

$ sort-bed A.unsorted.bed > A.bed
$ sort-bed B.unsorted.bed > B.bed
$ sort-bed C.unsorted.bed > C.bed

Demo:

$ more C.bed
chr1    100     120     id-1
chr1    110     112     id-2
chr1    114     117     id-3
chr1    125     130     id-4

$ bedops -m C.bed | bedmap --echo-map --exact - C.bed
chr1    100 120 id-1
chr1    125 130 id-4

Another option is to filter the input file for purely nested elements.

Nested elements are a pain, but there are ways to deal with them.

Start with a utility awk script:

#!/usr/bin/env awk -f
{
    if (NR > 1) {
        currentChr = $1
        currentStart = $2
        currentStop = $3
        if ((previousStart < currentStart) && (previousStop > currentStop)) {
            print $0;
        }
        else {
            previousChr = currentChr
            previousStart = currentStart
            previousStop = currentStop
        }
    }
    else {
        previousChr = $1
        previousStart = $2
        previousStop = $3
    }
}

Let's call that script getNestedElements.awk. Make sure it is executable (chmod u+x ./getNestedElements.awk).

You can use this script to identify elements in A2.bed that are nested, and you would then pipe them to bedops --not-element-of 100% to filter them out:

$ more A2.bed
chr1    80      101     id-0
chr1    100     120     id-1
chr1    110     112     id-2
chr1    114     117     id-3
chr1    125     130     id-4

$ ./getNestedElements.awk A2.bed | bedops --not-element-of 100% A2.bed -
chr1    80      101     id-0
chr1    100     120     id-1
chr1    125     130     id-4
ADD COMMENT
0
Entering edit mode

There is only one file to begin with. I tried specifying the file twice for A and B and ended up getting every entry as a duplicate from the original file.

ADD REPLY
0
Entering edit mode

Pls see revised answer.

ADD REPLY
0
Entering edit mode

Thanks, getting closer. Your example seems to be working flawlessly (I have recreated A.bed and tried the command), but when I try it on my annotation, I lose any features that overlap with each other. Consider your example, if we were to add "chr1 80-101 id-0", we would lose all entries but the last one. We should be keeping id-0, id-1 and id-4, since only id-2 and id-3 are completely covered by another feature, id-0 and id-1 merely overlap but do not cover each other 100%.

ADD REPLY
0
Entering edit mode

Pls see revised answer.

ADD REPLY
0
Entering edit mode

Eureka! Works amazing. Indeed very important to not forget sort-bed!

ADD REPLY
0
Entering edit mode

Yes, sorting is very important. It only needs to be done once, however, so hopefully the pain is minimal.

ADD REPLY
0
Entering edit mode

The bedops one-liner didn't work correctly with my input file (out.bed). Bedops version is 2.4.26

$ cat out.bed 
chr1    274     324
chr1    292     356
chr2    28      80
chr2    89      156
chr3    16      191
chr3    28      174
chr4    17      191
chr4    24      174
chr4    40      212
chr5    55      144
chr6    77      150
chr7    277     319
chr7    285     327
chr7    298     335
chr7    303     343
chr7    314     351
chr7    319     359
chr7    330     363
chr7    331     362
chr7    335     366
chr7    343     383
chr7    355     416
chr9    274     287
chr9    274     299
chr9    278     319
chr9    288     365
chr9    296     335
chr9    306     343
chr9    312     351
chr9    322     359


$ bedops -m out.bed | bedmap --echo-map --exact - out.bed

chr2    28      80
chr2    89      156
chr3    16      191

chr5    55      144
chr6    77      150
ADD REPLY
0
Entering edit mode

And the awk script doesn't seem to be working either. (To be sure, I used the same A2.bed file as above). The awk script is executable.

$ cat A2.bed
chr1    80      101     id-0
chr1    100     120     id-1
chr1    110     112     id-2
chr1    114     117     id-3
chr1    125     130     id-4

$ ./getNestedElements.awk A2.bed | bedops --not-element-of 100% A2.bed -
chr1    80      101     id-0
chr1    100     120     id-1
chr1    110     112     id-2
chr1    114     117     id-3
chr1    125     130     id-4

$ ./getNestedElements.awk A2.bed
[no output]
ADD REPLY
0
Entering edit mode

Instead of:

$ bedops -m out.bed | bedmap --echo-map --exact - out.bed

Try:

$ bedops -m out.bed | bedmap --echo-map --exact --skip-unmapped - out.bed | sort-bed -
ADD REPLY
0
Entering edit mode

Still fails:

$ bedops -m out.bed | bedmap --echo-map --exact --skip-unmapped - out.bed | sort-bed -
chr2    28      80
chr2    89      156
chr3    16      191
chr5    55      144
chr6    77      150
ADD REPLY

Login before adding your answer.

Traffic: 2544 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6