Question

How To Separate A List Of Genomics Regions Into Several Disjoint Subsets

2

Entering edit mode

11.2 years ago

dustar1986 ▴ 380

Hi,

I‘ve got a list of genomic regions in python, say:

list=["chr1: 100-300", "chr1:150-350", "chr1:200-400", "chr1:500-700", "chr1:600-800", "chr1:900-1000"]

Some of them overlap each other. I wanna create a new list contain subsets of genomic regions and without overlapping regions in each subset. The result should be something like:

new_list=[["chr1:100-300", "chr1:500-700", "chr1:900-1000"], ["chr1:150-350", "chr1:600-800"], ["chr1:200-400"]]

I created my own function but it tooks 20 mins to achieve the result, if the origin list contain 1000 regions.

Is there any quick way to do that? Please help.

python genomics • 3.5k views

ADD COMMENT • link updated 3.1 years ago by Ram 43k • written 11.2 years ago by dustar1986 ▴ 380

2

Entering edit mode

It seems like you could have at least two different answers: new_list_1=[["chr1:100-300", "chr1:500-700", "chr1:900-1000"], ["chr1:150-350", "chr1:600-800"], ["chr1:200-400"]] or new_list_2=[["chr1:100-300", "chr1:600-800", "chr1:900-1000"], ["chr1:150-350", "chr1:500-700"], ["chr1:200-400"]]. Which result would be correct?

ADD REPLY • link 11.2 years ago by Alex Reynolds 35k

0

Entering edit mode

Either one will be great.

ADD REPLY • link 11.2 years ago by dustar1986 ▴ 380

0

Entering edit mode

Could this be done with bedtools?

ADD REPLY • link 11.2 years ago by 14134125465346445 ★ 3.6k

4

Entering edit mode

11.2 years ago

lh3 33k

As Alex has noted, your problem has multiple solutions. If you just want to find one of them, you can use a.zielezinski's script. If you want to iteratively find the maximum non-overlapping chain (i.e. you choose the chain having the longest total length and then choose the next longest chain from the rest of intervals), here is a naive solution based on dynamic programming.

Let I(i) be the i-th interval, |I(i)| be the length of the interval and f(i) the maximum total length if I(i) is chosen. We can compute f(i) with:

f(i) = max{f(j)+|I(i)| : j<i and I(j) not overlapping with I(i)}

If you keep j that maximizes f(i) in the above equation, you can backtrack to get the interval list. This is an O(N^2) algorithm. I said it is naive because there are O(NlogN) algorithm for such problems. You usually need a tree to maintain a sorted list of the rightmost coordinates. But for your 1000 intervals, O(N^2) is more than enough.

ADD COMMENT • link updated 3.1 years ago by Ram 43k • written 11.2 years ago by lh3 33k

0

Entering edit mode

Thanks Ih3. Sorry I did not make it clear. I only need one out of multiple solutions. But yours is really smart.

ADD REPLY • link 11.2 years ago by dustar1986 ▴ 380

Ram · Accepted Answer · 2013-02-01

9

Entering edit mode

11.2 years ago

Andrzej Zielezinski 11k

I would build a recursive function

def disjoint(l, res=[]):
    sl = []
    n = 1
    while n < len(l):
        p,q = l[n-1:n+1]
        if p[2] >= q[1] and p[0]==q[0]:
           l.remove(q)
           sl.append(q)
        else:
            n+=1
    res.append(l)
    if sl: disjoint(sl, res)
    return res

Running it on your example:

l = [('chr1',100,300), ('chr1', 150, 350), ('chr1', 200, 400), ('chr1', 500, 700), ('chr1',600, 800), ('chr1', 900, 1000)] 
print disjoint(l)

Gives you what you want:

[[('chr1', 100, 300), ('chr1', 500, 700), ('chr1', 900, 1000)], [('chr1', 150, 350), ('chr1', 600, 800)], [('chr1', 200, 400)]]

ADD COMMENT • link updated 3.1 years ago by Ram 43k • written 11.2 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

This is exact what I want. Thank you very much indeed Zielezinski.

ADD REPLY • link 11.2 years ago by dustar1986 ▴ 380