Question

Define unique intervals from a set of peaks

0

Entering edit mode

7.2 years ago

fusion.slope ▴ 250

Hello,

i have called peaks from 6 individual bam files and now I would like to merge the bed files in a way in which i will have uniq intervals. (where each uniq interval is present in all the 6 files).

How is it possible to do that? Is there a way with bedtools? I read the tutorial but there is nothing about this. Any idea would be really appreciated.

Cheers

Overlapping Regions Uniq Intervals Peaks • 1.9k views

ADD COMMENT • link updated 7.2 years ago by Alex Reynolds 35k • written 7.2 years ago by fusion.slope ▴ 250

0

Entering edit mode

Unique would be a set element present in only one of N inputs. Non-unique would mean an element present or overlapping in two or more of N inputs. Which do you mean?

ADD REPLY • link 7.2 years ago by Alex Reynolds 35k

0

Entering edit mode

An element overlapping 6 out of 6 Inputs.

ADD REPLY • link 7.2 years ago by fusion.slope ▴ 250

0

Entering edit mode

The simplest solution would be multiple iterations of BEDTools 'intersect' with each of your six BAMs (intersect bam1 and bam2, intersect that output with bam3, etc).

BTW, you should use the 'Add Comment' button instead of 'Answer' when responding.

ADD REPLY • link 7.2 years ago by harold.smith.tarheel ★ 4.9k

1

Entering edit mode

Consider the following example:

example of overlaps

A intersects B, and B intersects C. However, A does not intersect C.

If you want intervals overlapping across all sets, you can do pairwise comparison tests between all the input sets, which does not scale well and requires storing intermediate files.

Or you can use the faster approach I describe, which uses sorted BED files with bedops and bedmap, which does not require pairwise comparisons or intermediate files.

ADD REPLY • link 7.2 years ago by Alex Reynolds 35k

score 2 · Accepted Answer · 2017-02-10

Here is an efficient solution with BEDOPS, which avoids approaches with other tools that require M^N pairwise comparisons, for M elements and N sets.

Say you have N files (your specific example uses 6 files, but this shows the generic solution for N files).

You appear to want all elements from the N files, which overlap at least one element from each of all other N-1 sets.

First, convert any non-BED files to sorted BED with the appropriate conversion tools, e.g.: bam2bed from BEDOPS:

$ for inFn in `find . -name "*.bam"`; do echo $inFn; outFn=$inFn.bed; bam2bed < $infn > $outFn; done

Second, label each BED file to note where it is coming from:

$ for inFn in `find . -name "*.bam.bed"`; do echo $inFn; outFn=${inFn%%.*}.labeled.bed; prefix=${inFn%%.*}; echo ${prefix} | cut -f1-3 ${inFn} - > $outFn; done

Set a shell variable with the number N set to your overlap threshold, e.g.:

$ export N=6

Then do the set operations on the labeled files with BEDOPS bedops and bedmap:

$ bedops --everything file1.labeled.bed file2.labeled.bed ... fileN.labeled.bed \
    | bedmap --echo --echo-map-id-uniq --delim '\t' - \
    | cut -f1-3,5 - \
    | awk -vN=${N} '{ n = split($4, a, ";"); if (n==N) { print $0; } }' \
    > intervals_overlapping_all_N_inputs.bed

The file intervals_overlapping_all_N_inputs.bed contains a list of intervals that overlap all N sets.

This may be enough, but perhaps you want more detail. Now you can go back and filter all the converted BAM files for those elements:

$ for inFn in `find . -name "*.bam.bed"`; do echo $inFn; $outFn=${inFn%%.*}.filtered.bed; bedops --element-of -1 $inFn intervals_overlapping_all_N_inputs.bed > $outFn; done

Each *.filtered.bed will contain BED-formatted reads specific to the parent BAM file, which overlap all N sets.

This looks like a lot of work, but this will run much faster than a pairwise approach with other tools.