Question: How do I subset data with the following conditions? The data must match on one column (say, a chromosome), and another column must fall within a range given by two columns of another data set.
Background: I have two treatment groups, each a separate data frame with columns such as chromosome number, start, and stop position. Because the two data frames differ in size, I can't cbind them without introducing NA values. I therefore want to extract a subset of each treatment (two subsets in total), keeping only the rows that are on the same chromosome and whose start-stop range overlaps a range in the other data set. In other words, I want to get two data sets of the same length where each row is directly comparable, by position range, to a row of the other data set.
An alternative solution could be a way to handle the uneven lengths of the data without introducing NAs or truncating viable data.
What I've tried so far:
For one data set:
UU_working = subset(UU, UU$start[UU$chromosome == z] %in%
DD$start[DD$chromosome == z]:DD$end[DD$chromosome==z])
But this gives the error "numerical expression has 120612 elements: only the first used" and returns an empty table with only the header row.
I also tried using the tidyverse:
DD_filtered = DD %>%
filter(DD$chromosome==UU$chromosome,
DD$start >= UU$start,
DD$end <= UU$end)
But this produced a disturbingly low number of matches (6 in a genome's worth of methylation frequencies) and gave the following warnings:
Warning messages:
1: In DD$chromosome == UU$chromosome :
longer object length is not a multiple of shorter object length
2: In DD$start >= UU$start :
longer object length is not a multiple of shorter object length
3: In DD$end <= UU$end :
longer object length is not a multiple of shorter object length
as well as this output when I swapped DD and UU to try to even out the other data set too:
Error in filter_impl(.data, quo) :
Result must have length 2069400, not 2264003
In addition: Warning messages:
1: In UU$chromosome == DD$chromosome :
longer object length is not a multiple of shorter object length
2: In UU$start >= DD$start :
longer object length is not a multiple of shorter object length
3: In UU$end <= DD$end :
longer object length is not a multiple of shorter object length
I think this should work to generate some basic sample data:
chrs = c(1:10)
starts = c(1:10)
ends = c(2:11)
uu_sample=data.frame(chrs,starts,ends)
chrs = c(2:12)
starts = c(2:12)
ends = c(4:14)
dd_sample=data.frame(chrs,starts,ends)
The ideal sample to play with is two data sets of uneven length where some, but not all, of the observations in one data frame overlap with observations from the other. The basic logic of the desired command would be: if the chromosome of an observation in uu_sample matches the chromosome of an observation in dd_sample, and the uu_sample start is between the dd_sample start and end, then the two observations overlap, so keep them and filter out the others. This should generate two data sets of equal length where each observation overlaps with an observation in the other table (I have the code I need to analyze it as soon as I can get these paired up properly and even).
What you are describing is not "filter" but "merge by overlap".
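To make the distinction concrete, here is a minimal base R sketch of a "merge by overlap" on the sample data from the question. The names `pairs` and `overlapping` are made up for illustration, and the overlap test assumes closed intervals (start and end both included). It is only meant to show the idea: the intermediate table holds every same-chromosome pair of rows, so it can become very large on genome-scale data.

```r
# Sample data from the question
uu_sample <- data.frame(chrs = 1:10, starts = 1:10, ends = 2:11)
dd_sample <- data.frame(chrs = 2:12, starts = 2:12, ends = 4:14)

# Step 1: pair every uu row with every dd row on the same chromosome.
# The suffixes keep the two sets of start/end columns apart.
pairs <- merge(uu_sample, dd_sample, by = "chrs", suffixes = c(".uu", ".dd"))

# Step 2: keep only the pairs whose ranges overlap.
# Two closed intervals overlap when each one starts before the other ends.
overlapping <- pairs[pairs$starts.uu <= pairs$ends.dd &
                     pairs$starts.dd <= pairs$ends.uu, ]

print(overlapping)
```

Each row of `overlapping` holds one uu observation next to the dd observation it overlaps, so the two sides are the same length by construction. On the sample data this keeps chromosomes 2 through 10.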
Read about data.table::foverlaps and IRanges::findOverlaps, and see this post at SO:
This one is a little new... I followed the link in the answer using foverlaps to try to replicate it with the following code:
But I get this error running the final line
like this?
Thank you, this did work. It is worth noting that, compared with the foverlaps code provided by zx8754, there is a striking difference in computing time. Although SQL is usually a really efficient language, the SELECT statement took over 30 minutes to complete (I went to bed), since the actual 'DD' data frame contains millions of observations. The foverlaps version completed in less than a minute.
Sure... for certain things data.table is wicked fast.
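For reference, a minimal sketch of the data.table overlap join on the sample data from the question. This is not the code from the linked post; the name `paired` is made up for illustration, and it assumes a data.table version that accepts `nomatch = NULL` (older versions use `nomatch = 0L` for the same effect).

```r
library(data.table)

# Sample data from the question, built as data.tables
uu_sample <- data.table(chrs = 1:10, starts = 1:10, ends = 2:11)
dd_sample <- data.table(chrs = 2:12, starts = 2:12, ends = 4:14)

# foverlaps() needs the second table keyed; the last two key columns
# are treated as the interval start and end
setkey(dd_sample, chrs, starts, ends)

# type = "any" keeps any overlap between the two ranges;
# nomatch = NULL drops uu rows with no overlapping dd row
paired <- foverlaps(uu_sample, dd_sample,
                    by.x = c("chrs", "starts", "ends"),
                    type = "any", nomatch = NULL)

print(paired)
```

In the result, `starts` and `ends` come from dd_sample and `i.starts` and `i.ends` come from uu_sample, one row per overlapping pair.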
It would be easier to visualize if you gave us some sample data. Check out this post on how to produce a reproducible example.
I've edited it to add code for a basic pair of data sets. Let me know if this works - the goal was to make sure there was overlap and that they weren't even.