bedtools intersect "complete outer join"
1
0
Entering edit mode
6.6 years ago
graeme.thorn ▴ 100

I'm in the process of generating two BED files, and I want to use something like bedtools intersect to join the two together. However, I would like a behaviour like the -loj flag (which reports an empty B record, if no B overlaps a given feature in A), but to report an empty A record if no A feature overlaps B also. i.e. if I have A.bed file

chr1 1 10
chr1 21 30 
chr1 31 40

and B.bed file

chr1 11 20
chr1 21 30

then I would like an output similar to

chr1  1 10 .    -1 -1
.    -1 -1 chr1 11 20
chr1 21 30 chr1 21 30
chr1 31 40 .    -1 -1

so some kind of "complete outer join", being a union of a left outer join (-loj) flag with a "right outer join" (which BEDtools doesn't do)

I suspect I could do this by first doing the left outer joins

bedtools intersect -loj -wa -wb -a A.bed -b B.bed > loj1.bed
bedtools intersect -loj -wa -wb -a B.bed -b A.bed > loj2.bed

then reordering the columns in loj2.bed using "cut" (to put the record from the A file in the first set of columns), followed by

cat loj1.bed loj2_reordered.bed | sort -k1,1 -k2,2n > sorted.bed

merging the two files then

uniq sorted.bed > complete_outer_join.bed

removing duplicate lines, but I wonder if there's a quicker method for doing all this.

P.S. The features in the A and B bed files will be defined on the same coordinates, so the overlap of an A feature on a B feature would be the same as an overlap of a B feature on an A feature.

bedtools • 3.3k views
ADD COMMENT
3
Entering edit mode
6.6 years ago

Assumes use of BEDOPS bedops and bedmap, awk and sorted inputs with BEDOPS sort-bed:

$ sort-bed A.unsorted.bed > A.bed
$ sort-bed B.unsorted.bed > B.bed

Make a union of elements with bedops --everything and pipe to awk to get the uniques:

$ bedops --everything A.bed B.bed | awk '!a[$0]++' > unique-union.bed

Map sets to this union, using bedmap --indicator to get a binary 1/0 overlap flag, which is used to conditionally print the element where the flag is 1 (true), or a placeholder where the flag is 0 (false). Use paste to build a BED6-like pairing:

$ paste <(bedmap --indicator --echo unique-union.bed A.bed | awk -v FS='|' '{ if ($1==1) { printf("%s\n", $2); } else { printf(".\t-1\t-1\n"); } }') <(bedmap --indicator --echo unique-union.bed B.bed | awk -v FS='|' '{ if ($1==1) { printf("%s\n", $2); } else { printf(".\t-1\t-1\n"); } }')
chr1    1   10  .   -1  -1
.   -1  -1  chr1    11  20
chr1    21  30  chr1    21  30
chr1    31  40  .   -1  -1

Streams are pretty useful. The only intermediate file here is the unique-union.bed file, so this should run pretty fast.

Because we're also using streams, while this demo shows how to work with two files, this pipeline scales to n inputs pretty easily and quickly -- just add more paste, and you can glue more sets together.

ADD COMMENT
0
Entering edit mode

And, of course, this would work if the BED files have more than three columns, just change the printf(".\t-1\t-1\n"); to include more fields.

ADD REPLY
0
Entering edit mode

Yes, but in the case where there are more than three columns, watch out for how you create the unique-union set. You may need to adjust the awk logic to use only the first three columns as a key, probably, depending on what you would decide to call "unique".

ADD REPLY

Login before adding your answer.

Traffic: 2571 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6