Best way to overlap hundreds of BED files with one master BED file?
8.7 years ago
Ian 6.0k

I have in excess of 400 BED files of transcription factor binding site coordinates that I want to compare with one master BED file of binding regions. The aim is to identify which of the ~400 TFBS overlap with the master file.

Normally I would go straight to bedtools, but given the number of TFBS files I wonder if there is a "better" method. bedtools intersect would appear to do the trick, but for 400 files?

The type of output I am looking to get is:

master_pos1    TFBS8, TFBS16, TFBS200
master_pos2    TFBS1, TFBS333
..

Thank you

bed overlap intersect • 3.2k views
8.7 years ago

If your BED files are sorted, you could use bedops to union all the TFBS to standard output, and pipe that result to bedmap to do the mapping of TFs to master regions.

(This technique assumes that the TFBS files are minimally BED4. That is, the fourth column in each TFBS file contains the ID of the TF. If that is not the case, describe your format in more detail and I'll suggest a quick one-liner with awk to fix up files into the correct form.)
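For the common case of three-column files where the factor name is only in the file name, a fix-up along these lines might work. This is a minimal sketch, not the fix the answer had in mind: the file name CREB1.bed, the scratch directory, and the bed4 output directory are all hypothetical.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
mkdir "$workdir/bed4"

# Hypothetical three-column input, named after its factor.
printf 'chr1\t100\t200\nchr1\t300\t400\n' > "$workdir/CREB1.bed"

# Append the file name, minus its .bed suffix, as the fourth (ID) column.
for fn in "$workdir"/*.bed; do
  id=$(basename "$fn" .bed)
  awk -v id="$id" 'BEGIN { FS = OFS = "\t" } { print $1, $2, $3, id }' "$fn" \
    > "$workdir/bed4/$id.bed"
done

cat "$workdir/bed4/CREB1.bed"
```

Writing to a separate directory keeps the BED4 copies from being picked up by the same glob on a second run.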

Here's the one-liner that unions and maps:

$ bedops --everything tfbs*.bed | bedmap --echo --echo-map-id-uniq --delim '\t' master.bed - > answer.bed

Piping to standard output avoids writing an intermediate file to disk, which would otherwise cost considerable time. So this should be very fast.

Assuming that the ID fields in each of tfbs001.bed through tfbs400.bed contain the desired TF names or other identifiers of choice, the file answer.bed contains the results you expect, except that it uses a semicolon as the ID delimiter instead of a comma. You could add --multidelim ',' to the bedmap statement if a comma is required.
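If the exact "TFBS8, TFBS16" layout from the question is wanted, the ID list can also be rewritten after the fact. The sketch below runs on a made-up output line and assumes the ID list is the last tab-separated column and that semicolons appear nowhere else in it.

```shell
# One made-up line shaped like the bedmap output: master fields, then the ID list.
line=$(printf 'chr1\t100\t200\tCREB1;CST6')

# Replace each semicolon in the last column with a comma and a space.
converted=$(printf '%s\n' "$line" |
  awk 'BEGIN { FS = OFS = "\t" } { gsub(/;/, ", ", $NF); print }')

printf '%s\n' "$converted"
```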

If your BED files are not sorted, you could first prepare them with BEDOPS sort-bed, which is faster at sorting BED files than GNU sort.

$ for tfbs_fn in tfbs*.bed; do sort-bed "$tfbs_fn" > "sorted.$tfbs_fn"; done
$ sort-bed master.bed > sorted.master.bed

Then use the sorted files (sorted.tfbs*.bed and sorted.master.bed) in downstream BEDOPS operations. You only need to sort once.

Great method, I shall have to look more into what bedops offers. Just wondering why '+t' appears at the start of the last column? E.g. '+tCREB1;CST6;'. I have used 'sed' to remove it for the moment.


Do your BED files come from Excel or Windows? Such files usually need to be cleaned up. You might take a look at the suggestions in my answer in this thread: bedmap output on one line
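The suggestions in the linked thread are not reproduced here. One common cleanup for files that have passed through Windows, offered only as a sketch, is to strip carriage returns with tr. The file names below are made up.

```shell
# Scratch directory and a made-up file with Windows (CRLF) line endings.
workdir=$(mktemp -d)
printf 'chr1\t100\t200\tCREB1\r\nchr1\t300\t400\tCST6\r\n' > "$workdir/dos.bed"

# Delete every carriage return, leaving plain newline-terminated lines.
tr -d '\r' < "$workdir/dos.bed" > "$workdir/unix.bed"

# Count the carriage returns that remain (none are expected).
tr -cd '\r' < "$workdir/unix.bed" | wc -c
```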


I ran the one-liner, apparently with success; however, there are no clusters on chromosomes 10-22 (human), even though there are TFBS on those chromosomes in the bedops input. Is there any obvious reason why I am seeing this? -- IGNORE: I failed to also sort the master file --


Yeah, just run sort-bed on BED files and you'll be fine.
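An unsorted input can be caught before the mapping step with a rough check like the one below. It uses plain sort -c rather than any BEDOPS tool, and it assumes that byte-wise ordering on chromosome name followed by numeric ordering on start is a good enough proxy for sort-bed order. The function name is_sorted and the test file are hypothetical.

```shell
# Scratch directory and a made-up master file with chr2 listed before chr10.
workdir=$(mktemp -d)
printf 'chr2\t50\t60\nchr10\t5\t9\n' > "$workdir/master.bed"

# Succeeds only if the file is already ordered by chromosome
# (byte-wise, C locale) and then by start coordinate (numeric).
is_sorted() {
  LC_ALL=C sort -c -k1,1 -k2,2n "$1" 2>/dev/null
}

if is_sorted "$workdir/master.bed"; then
  status=sorted
else
  status=unsorted
fi
echo "$status"
```

In byte-wise order chr10 sorts before chr2, so this file is reported as unsorted. An unsorted master file of this kind is what caused the missing chromosomes described above.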

8.7 years ago

Here's a synopsis of how I would do it:

  • Intersect each TFBS file with the master file, and add a column holding the TFBS ID (here the file name itself) to each output file, like:
    sort -k1,1 -k2,2n mytfbs.bed | intersectBed -sorted -u -b - -a master.bed | awk -v tfbs=mytfbs.bed '{print $0 "\t" tfbs}' > master.mytfbs.bed
    Skip the sort step if the TFBS files are already sorted (master.bed should be sorted as well). Use the appropriate options for intersectBed, of course.
  • Merge-sort all the output files and pipe to mergeBed to get the desired output (-c 4 assumes master.bed has three columns, so that the appended TFBS ID is column 4):
    sort -k1,1 -k2,2n -m master.*.bed | mergeBed -i - -c 4 -o collapse > out.tfbs.bed

(Not tested)
