Common genomic intervals in R
2
0
Entering edit mode
10.0 years ago
viniciushs88 ▴ 50

I would like to infer shared genomic interval between different samples.

My input:

sample    chr start end
NE001      1   100  200
NE001      2   100  200
NE002      1   50   150
NE002      2   50   150
NE003      2   250  300

My expected output:

chr start end  freq
1    100  150   2
2    100  150   2

Where the "freq" is the how many samples have contribuited to infer the shared region. In the above example freq = 2 (NE001 and NE002).

Cheers!

R genome overlap • 3.8k views
ADD COMMENT
0
Entering edit mode

What have you tried so far?

ADD REPLY
2
Entering edit mode

Actually, I know how to measure the shared numbers between two regions.

input:

start1  end1    start2  end2    bp_overlapped   
   20    30       25      35          6        
   25    35       20      30          6
  100    190     126     226          65
  126    226     100     190          65

Function:

f <- function(x) length( intersect(seq(x[1],x[2],1), seq(x[3],x[4],1)) )

a <- apply(X,1,f)

And I known how to overlap regions and a "cbind" with overlapped ones too. But when all regions are in a same dataframe, and have a collumn tagging different animals... sincerely I donĀ“t know how can I start the logic...

ADD REPLY
0
Entering edit mode

Cool, we try to ensure that posters have given it a go before providing help. Hopefully my answer (below) will work well for you.

ADD REPLY
2
Entering edit mode
10.0 years ago

This is a bit easier to do if you use the GenomicRanges package from bioconductor:

library(GenomicRanges)
gr <- GRanges(seqname=c(1,2,1,2,2), range=IRanges(start=c(100,100,50,50,250),
    end=c(200,200,150,150,300))) #N.B., You wouldn't do this manually
bins <- disjoin(gr)
mcols(bins)$count <- countOverlaps(bins, gr)

The disjoin() command splits the ranges into no-overlapping segments.

Edit: I left off a ) in an earlier version. Mea culpa!

ADD COMMENT
1
Entering edit mode
10.0 years ago

If you can use system() within R, or if you don't need to use R, here's a way to do it with BEDOPS (i.e., you'd wrap these calls in system() or use the command-line):

$ sort-bed inputs.unsorted.bed > inputs.bed
$ bedops --intersect inputs.bed | bedmap --echo --count - inputs.bed > answer.bed
ADD COMMENT

Login before adding your answer.

Traffic: 2011 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6