Question

bedtools intersect "complete outer join"

0

Entering edit mode

6.6 years ago

graeme.thorn ▴ 100

I'm in the process of generating two BED files, and I want to use something like bedtools intersect to join the two together. However, I would like a behaviour like the -loj flag (which reports an empty B record, if no B overlaps a given feature in A), but to report an empty A record if no A feature overlaps B also. i.e. if I have A.bed file

chr1 1 10
chr1 21 30 
chr1 31 40

and B.bed file

chr1 11 20
chr1 21 30

then I would like an output similar to

chr1  1 10 .    -1 -1
.    -1 -1 chr1 11 20
chr1 21 30 chr1 21 30
chr1 31 40 .    -1 -1

so some kind of "complete outer join", being a union of a left outer join (-loj) flag with a "right outer join" (which BEDtools doesn't do)

I suspect I could do this by first doing the left outer joins

bedtools intersect -loj -wa -wb -a A.bed -b B.bed > loj1.bed
bedtools intersect -loj -wa -wb -a B.bed -b A.bed > loj2.bed

then reordering the columns in loj2.bed using "cut" (to put the record from the A file in the first set of columns), followed by

cat loj1.bed loj2_reordered.bed | sort -k1,1 -k2,2n > sorted.bed

merging the two files then

uniq sorted.bed > complete_outer_join.bed

removing duplicate lines, but I wonder if there's a quicker method for doing all this.

P.S. The features in the A and B bed files will be defined on the same coordinates, so the overlap of an A feature on a B feature would be the same as an overlap of a B feature on an A feature.

bedtools • 3.3k views

ADD COMMENT • link updated 6.6 years ago by Alex Reynolds 35k • written 6.6 years ago by graeme.thorn ▴ 100

score 3 · Accepted Answer · 2017-09-28

Assumes use of BEDOPS bedops and bedmap, awk and sorted inputs with BEDOPS sort-bed:

$ sort-bed A.unsorted.bed > A.bed
$ sort-bed B.unsorted.bed > B.bed

Make a union of elements with bedops --everything and pipe to awk to get the uniques:

$ bedops --everything A.bed B.bed | awk '!a[$0]++' > unique-union.bed

Map sets to this union, using bedmap --indicator to get a binary 1/0 overlap flag, which is used to conditionally print the element where the flag is 1 (true), or a placeholder where the flag is 0 (false). Use paste to build a BED6-like pairing:

$ paste <(bedmap --indicator --echo unique-union.bed A.bed | awk -v FS='|' '{ if ($1==1) { printf("%s\n", $2); } else { printf(".\t-1\t-1\n"); } }') <(bedmap --indicator --echo unique-union.bed B.bed | awk -v FS='|' '{ if ($1==1) { printf("%s\n", $2); } else { printf(".\t-1\t-1\n"); } }')
chr1    1   10  .   -1  -1
.   -1  -1  chr1    11  20
chr1    21  30  chr1    21  30
chr1    31  40  .   -1  -1

Streams are pretty useful. The only intermediate file here is the unique-union.bed file, so this should run pretty fast.

Because we're also using streams, while this demo shows how to work with two files, this pipeline scales to n inputs pretty easily and quickly -- just add more paste, and you can glue more sets together.