Biostar Beta. Not for public use.
findOverlaps function in R
4
Entering edit mode
13 months ago
SGMS • 60
European Union

Hi everyone,

I have used the "findOverlaps" function in R to find which positions of my two datasets overlap. I have also used "countOverlaps" to see how many overlaps I have.

What I want to do now, is to find the exact coordinates which overlapped. I was looking how to do it using findOverlaps but it has no such options and by doing a google search, I couldn't find much help.

A little help on this would be greatly appreciated.

Thank you

ADD COMMENTlink
0
Entering edit mode

I have switched to foverlaps (data.table package) whenever I working with genomic regions.

A fast binary-search based overlap join of two data.tables. This is very much inspired by findOverlaps function from the bioconductor package IRanges...

Simply intersect two data tables and extract overlapping region. Please post example data and I will show how it may be done.

ADD REPLYlink
0
Entering edit mode

Thank you for your reply Pgibas. Example data as required:

1st dataset:

1   668630  850204
1   713044  755966
1   738570  742020
1   766600  769112

2nd dataset:

1    149854870    149909434
1    17325224     17400665
1    19971779     20026055
1    24180459     24259817

Thanks again!

ADD REPLYlink
0
Entering edit mode

What is the expected output? Do you want a new set of regions that represents the overlaps between dataset 1 and 2?

ADD REPLYlink
0
Entering edit mode

yes! exactly that!

ADD REPLYlink
0
Entering edit mode

You datasets don't overlap.

ADD REPLYlink
0
Entering edit mode

I personally use mergeByOverlaps and subsetByOverlaps for these cases.

ADD REPLYlink
7
Entering edit mode
14 months ago
National Institutes of Health, Bethesda…

The findOverlaps() method is not the right tool for the job. Instead, you want the set intersection which can be obtained with intersect() on two GRanges objects.

# create two GRanges objects for testing
gr1 = GRanges(seqnames='chr1',ranges=IRanges(start=seq(1,1001,100),width=50))
gr2 = GRanges(seqnames='chr1',ranges=IRanges(start=seq(25,1025,100),width=50))

# and "intersect" them, finding the set of regions
# formed by the overlaps of gr1 and gr2
gr3 = intersect(gr1,gr2)

# and the result
gr3

GRanges object with 11 ranges and 0 metadata columns:
       seqnames       ranges strand
          <Rle>    <IRanges>  <Rle>
   [1]     chr1 [  25,   50]      *
   [2]     chr1 [ 125,  150]      *
   [3]     chr1 [ 225,  250]      *
   [4]     chr1 [ 325,  350]      *
   [5]     chr1 [ 425,  450]      *
   [6]     chr1 [ 525,  550]      *
   [7]     chr1 [ 625,  650]      *
   [8]     chr1 [ 725,  750]      *
   [9]     chr1 [ 825,  850]      *
  [10]     chr1 [ 925,  950]      *
  [11]     chr1 [1025, 1050]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD COMMENTlink
0
Entering edit mode

Thank you! A much easier solution! I'm wondering: I get 137 ranges, meaning we have 137 overlapped coordinates right?

When I was using countOverlaps though, it gave me 195. Shouldn't the number of these two be the same -despite the fact that "intersect" also gives us the coordinates-?

Thanks again!

ADD REPLYlink
1
Entering edit mode

If regions in dataset1 overlap with other regions in dataset1, they will end up being "collapsed" into a single region in the output of intersect(). This is the discrepancy with countOverlaps, which will count the overlapping regions from dataset1 twice.

ADD REPLYlink
6
Entering edit mode
13 months ago
PoGibas 4.8k
Vilnius

I have done this with dummy data.

Lets say we have we have regions A (groupA) and regions B (groupB).

groupA <- data.table(
    chr = rep("chr1", 5),
    start = seq(0, 1000, 100),
    end = seq(50, 1050, 100))
setkey(groupA, chr, start, end)

groupB <- data.table(
    chr = rep("chr1", 5),
    start = seq(25, 1025, 100),
    end = seq(75, 1075, 100))
setkey(groupB, chr, start, end)

Check:

  1. If your datasets are data.table class(groupA), if not do setDT(groupA)
  2. If keys are chr start end, if not do setkey(groupA, chr, start, end)

# Find overlaps
over <- foverlaps(groupA, groupB, nomatch = 0)
# Extract exact regions
over2 <- data.table(
    chr = over$chr,
    start = over[, ifelse(start > i.start, start, i.start)],
    end = over[, ifelse(end < i.end, end, i.end)])
ADD COMMENTlink
0
Entering edit mode

Thanks so much for this. My coordinates actually overlap -my datasets are much larger than the above-. I used your code on my data and it works perfectly. I have double-checked the number of hits with findOverlaps and the number is the same, but now I also have the overlapped coordinates! Thanks again :)

ADD REPLYlink
0
Entering edit mode

If it helped to solve your problem mark the answer :)

ADD REPLYlink
0
Entering edit mode

@Pgibas If I want to label the values, for example in a new column I assign two values, "Exact" or "Partial" based on if the value in groupA exactly falls within the range of region in groupB and if there is a partial overlap, respectively. How can I do that?

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1