How To Separate A List Of Genomics Regions Into Several Disjoint Subsets
2
2
Entering edit mode
11.2 years ago
dustar1986 ▴ 380

Hi,

I‘ve got a list of genomic regions in python, say:

list=["chr1: 100-300", "chr1:150-350", "chr1:200-400", "chr1:500-700", "chr1:600-800", "chr1:900-1000"]

Some of them overlap each other. I wanna create a new list contain subsets of genomic regions and without overlapping regions in each subset. The result should be something like:

new_list=[["chr1:100-300", "chr1:500-700", "chr1:900-1000"], ["chr1:150-350", "chr1:600-800"], ["chr1:200-400"]]

I created my own function but it tooks 20 mins to achieve the result, if the origin list contain 1000 regions.

Is there any quick way to do that? Please help.

python genomics • 3.5k views
ADD COMMENT
2
Entering edit mode

It seems like you could have at least two different answers: new_list_1=[["chr1:100-300", "chr1:500-700", "chr1:900-1000"], ["chr1:150-350", "chr1:600-800"], ["chr1:200-400"]] or new_list_2=[["chr1:100-300", "chr1:600-800", "chr1:900-1000"], ["chr1:150-350", "chr1:500-700"], ["chr1:200-400"]]. Which result would be correct?

ADD REPLY
0
Entering edit mode

Either one will be great.

ADD REPLY
0
Entering edit mode

Could this be done with bedtools?

ADD REPLY
9
Entering edit mode
11.2 years ago

I would build a recursive function

def disjoint(l, res=[]):
    sl = []
    n = 1
    while n < len(l):
        p,q = l[n-1:n+1]
        if p[2] >= q[1] and p[0]==q[0]:
           l.remove(q)
           sl.append(q)
        else:
            n+=1
    res.append(l)
    if sl: disjoint(sl, res)
    return res

Running it on your example:

l = [('chr1',100,300), ('chr1', 150, 350), ('chr1', 200, 400), ('chr1', 500, 700), ('chr1',600, 800), ('chr1', 900, 1000)] 
print disjoint(l)

Gives you what you want:

[[('chr1', 100, 300), ('chr1', 500, 700), ('chr1', 900, 1000)], [('chr1', 150, 350), ('chr1', 600, 800)], [('chr1', 200, 400)]]
ADD COMMENT
0
Entering edit mode

This is exact what I want. Thank you very much indeed Zielezinski.

ADD REPLY
4
Entering edit mode
11.2 years ago
lh3 33k

As Alex has noted, your problem has multiple solutions. If you just want to find one of them, you can use a.zielezinski's script. If you want to iteratively find the maximum non-overlapping chain (i.e. you choose the chain having the longest total length and then choose the next longest chain from the rest of intervals), here is a naive solution based on dynamic programming.

Let I(i) be the i-th interval, |I(i)| be the length of the interval and f(i) the maximum total length if I(i) is chosen. We can compute f(i) with:

f(i) = max{f(j)+|I(i)| : j<i and I(j) not overlapping with I(i)}

If you keep j that maximizes f(i) in the above equation, you can backtrack to get the interval list. This is an O(N^2) algorithm. I said it is naive because there are O(NlogN) algorithm for such problems. You usually need a tree to maintain a sorted list of the rightmost coordinates. But for your 1000 intervals, O(N^2) is more than enough.

ADD COMMENT
0
Entering edit mode

Thanks Ih3. Sorry I did not make it clear. I only need one out of multiple solutions. But yours is really smart.

ADD REPLY

Login before adding your answer.

Traffic: 2036 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6