Biostar Beta. Not for public use.
Question: Merge duplicate GRanges ranges
2
Entering edit mode

Hi there, I have a GRanges object that I wanna merge duplicate ranges and append counts as a metadata columns. I know similar thing can be easily done using uniq -c, but I want do it in R/Bioconductor. Is there any suggestion?

>orginalGRanges


GRanges with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     Chr1   [2, 24]      +
  [2]     Chr1   [2, 25]      +
  [3]     Chr1   [2, 25]      +
  [4]     Chr1   [4, 25]      +
  [5]     Chr1   [5, 25]      +
  [6]     Chr1   [6, 21]      +
  ---
  seqlengths:
           Chr1
             NA


>expectedGRanges


GRanges with 5 ranges and 1 metadata column:
      seqnames    ranges strand |      hits
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]     Chr1   [2, 24]      + |         1
  [2]     Chr1   [2, 25]      + |         2
  [3]     Chr1   [4, 25]      + |         1
  [4]     Chr1   [5, 25]      + |         1
  [5]     Chr1   [6, 21]      + |         1
  ---
  seqlengths:
           Chr1
             NA
ADD COMMENTlink 5.5 years ago ifreecell • 170 • updated 5.5 years ago Devon Ryan 90k
3
Entering edit mode

You're looking for the countOverlaps() function:

g <- GRanges(seqnames=c(rep("Chr1",6)), strand=rep("+",6), ranges=IRanges(start=c(2,2,2,4,5,6), end=c(24,25,25,25,25,21)))
g2 <- unique(g)
g2$hits<- countOverlaps(g2, g, type="equal")

You could also run countOverlaps(g,g,type="equal")first and then use unique(). You'll get the same result either way.

ADD COMMENTlink 5.5 years ago Devon Ryan 90k
Entering edit mode
0

I am testing your approach on a sorted 5 million ranges object, it took hours before Rstudio encountered a fetal error. BTW, countOverlaps is really RAM consuming, it ate up all RAM (64G) on my computer.

ADD REPLYlink 5.5 years ago
ifreecell
• 170
Entering edit mode
0

Yeah, I imagine that that could prove pretty memory heavy, since I don't think it assumes that things are sorted. You might split() by chromosome and then use lapply(). Perhaps that'll be a bit more reasonable.

ADD REPLYlink 5.5 years ago
Devon Ryan
90k
Entering edit mode
0

There is only one chromosome in this GRanges, I don't think split() will help. I wrote the GRanges to a bed file, and it took one second for uniq -c to compute. I really appreciate it if the entire workflow can be done in Bioconductor, is there other workaround?

ADD REPLYlink 5.5 years ago
ifreecell
• 170
Entering edit mode
0

Is it the unique() step that takes forever or just countOverlaps (or both)? I would guess that it's just the countOverlaps() function. Since you only ever care about exact overlaps/matches, you might be best off writing a custom function to do this. At least if you have to do this more than once or twice.

ADD REPLYlink 5.5 years ago
Devon Ryan
90k
Entering edit mode
0

You are right, it's just the countOverlaps() function. Thank you anyway, I will try to write a custom function.

ADD REPLYlink 5.5 years ago
ifreecell
• 170

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0