Question

How to in R: Include principal components as covariates in analysis

0

Entering edit mode

5.0 years ago

RNAseqer ▴ 260

I know that principal components can be included in a model as covariates, but what are the r commands necessary to that actually do this? I have been looking around on the internet for hours and have yet to turn up an example. Does anyone have one?

Suppose you have a factors file, and a datas.pca file. And you want to incorporate the first 2 PC's from .pca as covariates alongside things like BMI, Batch, Age etc in the factors file. What commands are necessary?

Many thanks for the help!

pca covariates principal components • 5.7k views

ADD COMMENT • link updated 5.0 years ago by Kevin Blighe 87k • written 5.0 years ago by RNAseqer ▴ 260

score 1 · Accepted Answer · 2019-05-14

1, create random data

randomdata <- matrix(rexp(800, rate=.1), ncol = 40)
colnames(randomdata) <- paste0("sample", 1:ncol(randomdata))
rownames(randomdata) <- paste0("gene", 1:nrow(randomdata))

randomdata[1:5,1:5]
         sample1   sample2   sample3   sample4   sample5
gene1  0.3603136  1.159749 3.0826237  3.309323 0.5730092
gene2 35.0652692 20.814076 5.7487437  3.502387 1.2635026
gene3  8.2158417  6.442191 2.1134863 26.724393 9.2087334
gene4 19.6870141  6.617958 5.7245038 10.698933 9.1174246
gene5  6.8046549  5.262233 0.2679569 25.170229 8.3293813

2, create fake metadata for `DiseaseStatus` and `BMI`

metadata <- data.frame(DiseaseStatus = c(rep("case", 20), rep("control", 20)), BMI = sample(10:40, 40, replace = TRUE))

head(metadata)
  DiseaseStatus BMI
1          case  12
2          case  12
3          case  18
4          case  20
5          case  27
6          case  32

3, perform PCA and plot bi-plot for PC1 and PC2

pca <- prcomp(t(randomdata))

plot(pca$x[,1:2])

Also see my answer here for more 'PCA' plots using base R functions: A: PCA plot from read count matrix from RNA-Seq

4, now prepare an object for regression modeling

modeling <- data.frame(metadata, pca$x[,1:2], t(randomdata))

modeling[1:5,1:5]
        DiseaseStatus BMI        PC1         PC2     gene1
sample1          case  12 -18.485678   8.3374667 0.3603136
sample2          case  12  -3.393260  -0.9962191 1.1597494
sample3          case  18   6.579711 -10.1392232 3.0826237
sample4          case  20  13.311610  26.2884376 3.3093231
sample5          case  27  16.998644  -3.0140936 0.5730092

# set control as reference level
modeling$DiseaseStatus <- factor(modeling$DiseaseStatus, levels = c('control','case'))

5, binary logistic regression - adjust for `BMI`, `PC1`, and `PC2`

summary(glm(DiseaseStatus ~ gene1 + BMI + PC1 + PC2, data = modeling, family = binomial(link = 'logit')))


Call:
glm(formula = DiseaseStatus ~ gene1 + BMI + PC1 + PC2, family = binomial(link = "logit"), 
    data = modeling)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.43692  -1.12846   0.01018   1.15101   1.33862  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.266677   1.030321  -0.259    0.796
gene1       -0.009663   0.041650  -0.232    0.817
BMI          0.013031   0.036195   0.360    0.719
PC1         -0.004580   0.018005  -0.254    0.799
PC2         -0.010359   0.020470  -0.506    0.613

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 55.452  on 39  degrees of freedom
Residual deviance: 54.908  on 35  degrees of freedom
AIC: 64.908

Number of Fisher Scoring iterations: 4

6, linear regression - adjust for `BMI`, `PC1`, and `PC2`

summary(lm(gene1 ~ DiseaseStatus + BMI + PC1 + PC2, data = modeling))

Call:
lm(formula = gene1 ~ DiseaseStatus + BMI + PC1 + PC2, data = modeling)

Residuals:
   Min     1Q Median     3Q    Max 
-8.447 -5.121 -2.338  3.445 31.670 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)  
(Intercept)        8.25736    4.11994   2.004   0.0528 .
DiseaseStatuscase -0.56933    2.60968  -0.218   0.8286  
BMI               -0.04532    0.14733  -0.308   0.7602  
PC1                0.13024    0.06949   1.874   0.0693 .
PC2               -0.11322    0.08107  -1.397   0.1713  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.202 on 35 degrees of freedom
Multiple R-squared:  0.1373,    Adjusted R-squared:  0.03869 
F-statistic: 1.392 on 4 and 35 DF,  p-value: 0.2567

-------------------------------------

You should have good justification for wanting to adjust for PCs. Usually, it is performed in genetic association studies for the purpose of adjusting for population stratification.

Don't just adjust for PCs because your supervisor may have told you to do it - question him/her why. One should not add covariates to a regression model without justification.

Kevin