Empirical p-value after sampling matching distribution in R
0
0
Entering edit mode
5.2 years ago
User 7754 ▴ 250

I am trying to understand how to compute an empirical p-value after a sampling a procedure, to test whether a variable "var" we observe in a "case" dataset appears more than expected from a "control" dataset, after having matched for a variable "length".

cases = data.frame(id = 1:10, length = sample(1:1000,10), var = sample(c(TRUE,FALSE), 10, TRUE)) 
control = data.frame(id = 1:100, length = sample(1:1000,100), var = sample(c(TRUE,FALSE), 100, TRUE)) 

res = data.frame()
nperm = 10
for (perm in 1:nperm) {
    control_random = control[sample(nrow(control),nrow(control),replace=FALSE),] # draw the control into a random order
    for (i in 1:nrow(cases)) { # loop through cases to find a match for each
    for (j in 1:nrow(control)) { # for each case, loop through control looking for a match
        if (abs(control_random$length[j] - cases$length[i]) < 100) { # match if length is within 100
            break
        }
    }

 res = rbind.data.frame(res, data.frame(perm, cases$id[i], control_random$id[j], control_random$var[j] ))
 # and remove it so we don't use it again
 control_random = control_random[-j,]
 }
 }

# pvalue:
# count how many times var is TRUE in control set compared to cases
library(dplyr)
num_controls = res %>%
    group_by(perm) %>%
    summarise(n = sum(control_random.var.j.))
# Is it the number of original true values compared to random controls?
num_cases = sum(cases$var)

pval = (sum(num_controls$n > sum(cases$var))) / nrow(res)

Is the p-value correct? What if we had bins instead of matched random samples, would we compare within each bin? Any help would be very appreciated!

R empirical p-value sampling binning • 1.2k views
ADD COMMENT

Login before adding your answer.

Traffic: 2323 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6