Split a BED file into several BED files in which each region is separated from every other by at least N bases
8.2 years ago

I have a BED file with regions of interest, and I would like to split it into the minimum number of BED files such that, within each file, all regions are separated from each other by N bases or more. Any ideas what tool I could use?

E.g. input.bed

chr1 1000 1050
chr1 1080 1130
chr1 2000 2050

Would be split by:

split_by_distance -n 150 -i input.bed

And would produce:

input0001.bed

chr1 1000 1050

input0002.bed

chr1 1080 1130
chr1 2000 2050

Explained graphically:

Original file has 3 entries

Minimum distance:

[xxxxxxxxx]

First and second are too close:

       [xxxxxxxxx]
##########
               ##########
                                        ##########

Output is:

file1: first
file2: second and third

Another example:

Minimum distance: [xxxxxx]

        [xxxxxx]     [xxxxxx]      [xxxxxx]        [xxxxxx]
AAAAAAAAAA   BBBBBBBBBB   CCCCCCCCCC     DDDDDDDDD          EEEEEEEEEE

Output files:

File1:

AAAAAAAAAA                CCCCCCCCCC                        EEEEEEEEEE

File2:

             BBBBBBBBBB                  DDDDDDDDD
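One way to read the requirement above is as an interval-partitioning problem: sort the regions, then place each one in the first output file whose last region on that chromosome ends at least N bases earlier, opening a new file only when none fits. The following is a minimal sketch under that reading, not an existing tool; the name `split_by_distance` is borrowed from the hypothetical command above, and the distance is assumed to be `start - previous_end` (half-open BED coordinates).

```python
def split_by_distance(regions, min_gap):
    """Greedy first-fit partition of BED regions (a sketch, not a real tool).

    regions: iterable of (chrom, start, end) tuples.
    Returns a list of groups. Within a group, any two regions on the same
    chromosome are at least min_gap bases apart (start - previous end).
    """
    groups = []    # groups[i] holds the regions destined for output file i
    last_end = []  # last_end[i][chrom] = end of the last region put in group i
    for chrom, start, end in sorted(regions):
        for i, ends in enumerate(last_end):
            prev = ends.get(chrom)
            # A group is compatible if it has nothing on this chromosome yet,
            # or if its last region there ends far enough upstream.
            if prev is None or start - prev >= min_gap:
                groups[i].append((chrom, start, end))
                ends[chrom] = end
                break
        else:
            # No existing group can take this region: open a new one.
            groups.append([(chrom, start, end)])
            last_end.append({chrom: end})
    return groups


example = [("chr1", 1000, 1050), ("chr1", 1080, 1130), ("chr1", 2000, 2050)]
for number, group in enumerate(split_by_distance(example, 150), start=1):
    print(f"input{number:04d}.bed", group)
```

With N = 150 this puts the first and third regions in one group and the second in another. That differs from the grouping shown above (second and third together), but both use two files and both respect the minimum distance. On the A to E example, first-fit gives A, C, E in one file and B, D in the other.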

Thx

bedtools bedops bed • 7.3k views

I don't think this completely solves the problem. But bedtools makewindows might be something to look into. It does not output the results into multiple bed files unfortunately, but it's a start.


So row1 is in its own file because row2 is less than 150bp away, but row3 is with row2 because it's further than 150bp away?

wat.


I think he might have added the 1 in -n 150 by accident. We shouldn't assume, but it seems that he wants to be able to split his input bed file into multiple bedfiles based on the -n number. So every n bases the input would be split into a different file. I'm not sure WHY exactly.


I could understand that, but I think he actually wants to split the file so that each subfile represents a "lone interval" or a cluster of intervals. Perhaps from a peak caller. Also, not exactly sure why :P

8.2 years ago

I quickly wrote something

$ java -jar dist/biostar178713.jar -d 100000 -o out.zip in1.bed in2.bed

[main] INFO jvarkit - sorting 1524
[main] INFO jvarkit - creating zip jeter.zip
[main] INFO jvarkit - creating bed001.bed
[main] INFO jvarkit - closing bed001.bed N=186
[main] INFO jvarkit - creating bed002.bed
[main] INFO jvarkit - closing bed002.bed N=143
[main] INFO jvarkit - creating bed003.bed
[main] INFO jvarkit - closing bed003.bed N=122
$ unzip -t out.zip

Archive:  jeter.zip
    testing: bed001.bed               OK
    testing: bed002.bed               OK
    testing: bed003.bed               OK
No errors detected in compressed data of jeter.zip.
8.2 years ago

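I think you can use the bedtools spacing command:

bedtools spacing -i in.bed | awk '$4 == "." || $4 >= 150' > out1.bed
bedtools spacing -i in.bed | awk '$4 != "." && $4 < 150' > out2.bed

This assumes your input BED has only 3 columns, so the gap reported by spacing ends up in column 4. The first region on each chromosome has no gap (reported as "."), so it is sent to out1.bed explicitly, and a gap of exactly 150 counts as far enough.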
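To see how a gap threshold like this behaves, here is a small Python sketch that imitates the idea: compute each region's gap to the previous region on the same chromosome, then send it to a "far" or "near" list. The function names are hypothetical, and the gap rules (no gap for the first region per chromosome, -1 for overlaps, otherwise start minus previous end) are an assumption about how the reported spacing works, not bedtools code.

```python
def gaps_to_previous(regions):
    """Yield (chrom, start, end, gap) for regions sorted by chrom, start.

    gap is None for the first region on a chromosome, -1 if the region
    overlaps the previous one, otherwise start - previous end.
    """
    prev_chrom, prev_end = None, None
    for chrom, start, end in sorted(regions):
        if chrom != prev_chrom:
            gap = None
        elif start < prev_end:
            gap = -1
        else:
            gap = start - prev_end
        yield chrom, start, end, gap
        prev_chrom, prev_end = chrom, end


def split_by_gap(regions, min_gap):
    """Split regions into (far, near) by their gap to the previous region."""
    far, near = [], []
    for chrom, start, end, gap in gaps_to_previous(regions):
        # Regions with no previous neighbour are treated as "far".
        target = far if gap is None or gap >= min_gap else near
        target.append((chrom, start, end))
    return far, near


# Hypothetical coordinates: four regions, each 50 bases after the previous one.
run = [("chr1", 0, 100), ("chr1", 150, 250), ("chr1", 300, 400), ("chr1", 450, 550)]
far, near = split_by_gap(run, 150)
print("far:", far)
print("near:", near)
```

In a run of closely spaced regions like this one, everything after the first region lands in the "near" list, so that list still contains neighbours only 50 bases apart. A two-way split on the gap column therefore does not by itself guarantee that every file has all regions N or more bases apart.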


I could not find the spacing command in bedtools or bedtools2... Any ideas?


What version of bedtools do you have? The spacing command has been included starting from release 2.23. Here's the help:

bedtools spacing

Tool:    bedtools spacing
Version: v2.25.0
Summary: Report (last col.) the gap lengths between intervals in a file.

Usage:   bedtools spacing [OPTIONS] -i <bed/gff/vcf/bam>

Notes: 
    (1)  Input must be sorted by chrom,start (sort -k1,1 -k2,2n for BED).
    (2)  The 1st element for each chrom will have NULL distance. (".").
    (3)  Distance for overlapping intervaks is -1 and bookended is 0.

Example: 
    $ cat test.bed 
    chr1    0   10 
    chr1    10  20 
    chr1    21  30 
    chr1    35  45 
    chr1    100 200 

    $ bedtools spacing -i test.bed 
    chr1    0   10  . 
    chr1    10  20  0 
    chr1    21  30  1 
    chr1    35  45  5 
    chr1    100 200 55 

    -bed    If using BAM input, write output as BED.

    -header Print the header from the A file prior to results.

    -nobuf  Disable buffered output. Using this option will cause each line
        of output to be printed as it is generated, rather than saved
        in a buffer. This will make printing large output files 
        noticeably slower, but can be useful in conjunction with
        other software tools and scripts that need to process one
        line of bedtools output at a time.

    -iobuf  Specify amount of memory to use for input buffer.
        Takes an integer argument. Optional suffixes K/M/G supported.
        Note: currently has no effect with compressed files.