Question

glm() for correlating protein binding and gene expression?

1

Entering edit mode

8.3 years ago

TriS ★ 4.7k

I am doing some analysis using RNA-Seq and ChIP-Seq data and I am playing with different ways to correlate gene expression and binding. other posts on this topic are here or here and another paper here

A big paper in genome biology, that I linked in my answer in the linked post, demonstrated that i.e. H3K79me2 has a relative contribution of ~20% (0.20) to the R^2 between observed and predicted gene expression.

However, I didn't want to go through the binning of the genome etc etc...so I tested a generalized linear model. glm allows to correlate a binary outcome (0-1 or bound-not bound) to continuous data.

So, to try it out, I downloaded the H3K79me2 (and cMyc) data + K562 RNA-Seq data, annotated ChIPSeq with ChIPSeeker (+/- 1kb from TSS) and built a glm with those data, assigning 1 to bound genes and 0 to not bound genes.

The results show a max predicted probability/response of 0.21...pretty close to what the paper found and interestingly it only happens with H3K79me2 within the promoter region...not in distal intergenic, not at the 3'UTR or 5'UTR or in exons.

More interestingly, I found that cMyc gives a predicted response of ~0.92 ONLY in the promoter region, with higher binding=higher expression.

For RNASeq data, I removed 0s + log2'd and ChIPseq data were annotated using protocol described in the ChIPSeeker manual

The code I used for the glm is:

k79_names <- data from ChIPSeeker with col1=gene name, col2=ID, col3=location
x  <- as.numeric(k562) #mean expression
y <- ifelse(rownames(k562) %in% k79_names[which(k79_names[,3] == "Promoter"),1],1,0)
g_k562 <- glm(y~x, family=binomial)
summary(g_k562)
xweight <- seq(min(x), max(x),0.1)
yweight <- predict(g_k562,list(x=xweight),type="response")
max(yweight)

So, in my head those results suggest that the approach could be valid, but I'm not a statistician. So here's my question: Is there any reason why I should NOT use glm() for this approach?

RNA-Seq ChIP-Seq regression • 2.9k views

ADD COMMENT • link updated 22 months ago by Ram 43k • written 8.3 years ago by TriS ★ 4.7k

score 2 · Answer 1 · 2016-01-23

2

Entering edit mode

8.3 years ago

Jean-Karim Heriche 27k

The model you're using i.e. glm(y~x, family=binomial) is called logistic regression and is appropriate when the response is a binary variable as in your case. So unless I've missed something about the data, this is a perfectly valid approach.

ADD COMMENT • link 8.3 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

you are correct, logistic regression is the right term, thanks for the correction! and thanks for looking into it :)

ADD REPLY • link 8.3 years ago by TriS ★ 4.7k