for loop to filter dataframe based on conditions
3.0 years ago
katinstack • 0

Dear all,

In the following df:

[image of the example data frame, not shown]

I have several groups (A to C, but could be up to 18 groups). I made a for loop to calculate the median T1 for each group like this:

for (Var in unique(df$Group)) {
  assign(paste("T1_", Var, sep = ""), median(filter(df, Group == Var)$T1))
}

I then filter the df so that, for Group A, I keep only the rows whose T1 is within +/-1 of the group's median T1. I have the following code, but I would like a for loop that does this for a varying number of groups.

`%+-1%` <- function(T1, T1_A) (T1 >= T1_A-1) & (T1 <= T1_A+1)

df <- df %>% filter((Group == "A" & T1 %+-1% T1_A) | (Group == "B" & T1 %+-1% T1_B) | (Group == "C" & T1 %+-1% T1_C))

3.0 years ago
ATpoint 81k

As a rule, you rarely need a loop to filter in R; there is usually a way to do it without any explicit iteration. The code below works with any number of groups and avoids explicit looping, which tends to be slow in R. Here are two solutions, the first with the tidyverse (dplyr) and the second with base R:

library(dplyr)

#/ Dummy data:
set.seed(2021)

n <- 20
#/ note: the 25 random values are recycled to fill all n*n = 400 rows
df <- data.frame(Group = rep(LETTERS[1:n], each = n),
                 T1 = rnorm(25, 10, n))

#/ tidyverse:
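tidy <- function(x = df) {
  #/ x defaults to the global df, so tidy() can still be called without arguments
  x %>%
    group_by(Group) %>%
    #/ strict bounds; use >= and <= to also keep rows exactly 1 away from the median
    filter(T1 > (median(T1) - 1) & T1 < (median(T1) + 1))
}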

#/ base R:
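base <- function(x = df) {

  #/ get the median of T1 per group
  medians <- aggregate(T1 ~ Group, median, data = x)

  #/ merge so that every row carries its group median
  #/ (for the dummy data the columns are Group, T1.x, T1.y)
  merged <- merge(x = x, y = medians, by.x = "Group", by.y = "Group", sort = FALSE)

  #/ apply the filter: column 2 is the original T1, column 3 the group median
  merged[merged[, 2] > (merged[, 3] - 1) & merged[, 2] < (merged[, 3] + 1), c(1, 2)]

}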

#/ all() returns TRUE only if every comparison is TRUE, so the results of tidy and base are the same:
all(tidy() == base())
[1] TRUE


#/ The base solution is almost twice as fast but "less handy" (imho):
library(microbenchmark)
microbenchmark(tidy=tidy(),
               base=base())

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
 tidy 3.525646 3.819852 4.303717 4.119709 4.360010 13.25243   100
 base 2.059017 2.118340 2.367351 2.247750 2.347017 11.38729   100

Note that the base solution is not generic, because it relies on fixed column positions. You could of course wrap it in a more customizable function that filters based on column names etc.
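As a rough idea of what such a wrapper might look like, here is a minimal sketch. The function name filter_near_median and its arguments are made up for illustration, and it uses ave() rather than the aggregate/merge approach above. It also uses inclusive bounds (<=), like the %+-1% operator in the question.

```r
#/ Hypothetical helper (not part of any package): keep the rows whose `value`
#/ lies within +/- `width` of the median of `value` in their own `group`.
filter_near_median <- function(data, group, value, width = 1) {
  #/ ave() returns, for every row, the median of the group that row belongs to
  group_median <- ave(data[[value]], data[[group]], FUN = median)
  keep <- abs(data[[value]] - group_median) <= width
  data[keep, , drop = FALSE]
}

#/ Small made-up example: the medians are 2 (group A) and 5.5 (group B)
toy <- data.frame(Group = c("A", "A", "A", "B", "B", "B"),
                  T1    = c(1.5, 2, 10, 5, 5.5, 20))
near <- filter_near_median(toy, group = "Group", value = "T1")
near
```

Because the columns are passed by name, the same call should work for other data frames, and the row order of the input is preserved. Rows with a missing value are not handled specially in this sketch.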

Thanks, could you please explain what the first part does? And also how the second part filters each group by its respective T1 median?

The group_by declares that all rows with the same value for Group form a group. The filter call then calculates the median of the T1 column within each of these declared groups and keeps only the rows that meet the criterion you mention, i.e. being within +/- 1 of the groupwise median. This is all tidyverse syntax, which is a convenient grammar for data science. You can learn more here: https://dplyr.tidyverse.org/

Basically the verbs (that is what the tidyverse calls standardized functions like filter, mutate, group_by etc.) do a lot of work under the hood, but for the end user it comes down to learning the limited tidyverse syntax, which lets you manipulate data efficiently without lots of code. Base R can often be faster for the same operation but requires more custom code, so the tidyverse is often preferable in analysis scripts to keep things short, readable and thereby "tidy".
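To make the group_by/filter explanation above a bit more concrete, the sketch below writes the groupwise median into its own column with mutate before filtering, so the intermediate values can be inspected. The toy data and the column names group_median and keep are made up for illustration.

```r
library(dplyr)

#/ Small made-up example with two groups
toy <- data.frame(Group = c("A", "A", "A", "B", "B", "B"),
                  T1    = c(1.5, 2, 10, 5, 5.5, 20))

steps <- toy %>%
  group_by(Group) %>%
  #/ inside a grouped data frame, median(T1) is computed once per group
  mutate(group_median = median(T1),
         keep = T1 > group_median - 1 & T1 < group_median + 1) %>%
  ungroup()
steps

#/ keeping only the rows flagged above selects the same rows as the one-step filter
kept <- filter(steps, keep)
kept
```

Here group_median is 2 for every row of group A and 5.5 for every row of group B, so each row is compared against the median of its own group and no loop is needed.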

Thanks for all the info. I am trying to adapt it to my data, but it doesn't work or I am missing something. Meanwhile, I would really like to find a way to call the "T1_"Var objects by a generic name, so that I can filter using +/- 1 of T1_A or T1_B, for example.

When I try, for example, for (Var in unique(df$Group)) { temp2 <- df %>% filter((Group== Var)$T1 < T1_A)} it works, but I need to call the respective T1_ object for each group, e.g. T1_A for Group == A.

If you add example data via dput and an example of the desired output, we can try to get some code working.

Hi, thanks. It's the same data and the same question. As I could not get your commands to work, I was trying to find another way. The desired outcome is the same: subset the data, keeping only the rows with T1_"Var"-1<T1<T1_"Var"+1. As I am a newbie, I don't know how to call T1_"Var" once the vectors T1_A, T1_B....T1_W have already been created.
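One possible way to look up the matching median per group, sketched here on made-up toy data rather than the original df, is to store all medians in a single named vector and index it by group name. If the separate objects (T1_A, T1_B, ...) already exist, base R's get() can fetch one by its constructed name. The names T1_median, row_median and filtered are illustrative only.

```r
#/ Small made-up example with two groups
toy <- data.frame(Group = c("A", "A", "A", "B", "B", "B"),
                  T1    = c(1.5, 2, 10, 5, 5.5, 20))

#/ One named vector of medians instead of separate T1_A, T1_B, ... objects
T1_median <- tapply(toy$T1, toy$Group, median)

#/ Look up each row's own group median by the group name
row_median <- as.numeric(T1_median[as.character(toy$Group)])
filtered <- toy[toy$T1 >= row_median - 1 & toy$T1 <= row_median + 1, ]
filtered

#/ If the separate objects already exist, get() retrieves one by its name
T1_A <- median(toy$T1[toy$Group == "A"])
Var <- "A"
get(paste0("T1_", Var))
```

The named-vector lookup is vectorized, so it covers any number of groups without a loop. The get() variant would still need a loop over unique(df$Group), with the per-group results combined afterwards.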
