Question

Gaussian modeling of GWAS data

1

Entering edit mode

6.4 years ago

zizigolu ★ 4.3k

Hi,

If our phenotypes in a table named clinical data is like so

##       FamID CAD sex age  tg  hdl  ldl
## 10002 10002   1   1  60  NA <NA> <NA>
## 10004 10004   1   2  50  55   23   75
## 10005 10005   1   1  55 105   37   69
## 10007 10007   1   1  52 314   54  108
## 10008 10008   1   1  58 161   40   94
## 10009 10009   1   1  59 171   46   92

instead of one variable in our clinical table for instance hdl like below

phenoSub$phenotype <- rntransform(as.numeric(phenoSub$hdl), family="gaussian")

how can I use another variables like tg, hdl, ldl at the same time???

I tried so but went wrong

phenoSub$phenotype <- rntransform(as.numeric(phenoSub$c(colnames(clinical)), family="gaussian")

Thank you

R snp GWAS • 1.9k views

ADD COMMENT • link updated 6.4 years ago by Kevin Blighe 87k • written 6.4 years ago by zizigolu ★ 4.3k

score 1 · Answer 1 · 2017-11-26

1

Entering edit mode

6.4 years ago

Kevin Blighe 87k

Hello again, to use multiple variables, you will have to create a generalised linear model in the form y ~ x, where y is your phenotype or end-point, and x is a predictor.

In this example, we predict coronary artery disease (CAD) using high- and low-density lipids, and triglycerides.

rntransform(CAD ~ hdl + ldl + tg, data=df, family="gaussian")
[1]         NA  1.2815516 -0.2533471 -0.2533471 -0.2533471 -0.2533471

ADD COMMENT • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Sorry, what I should put when I have > 4000 variables including hdl, ldl, tg and there is no Y like CAD?

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

1

Entering edit mode

I had assumed that you were aiming to predict coronary artery disease (CAD). The rntransform function is just attempting to normalise your variables so that they can be more reliably used in modeling.

Thus, when you run:

phenoSub$phenotype <- rntransform(as.numeric(phenoSub$hdl), family="gaussian")

,all that is returned is a normalised version of your hdl variable, which can then be more reliably used in a binomial logistic regression model.

When I run:

rntransform(CAD ~ hdl + ldl + tg, data=df, family="gaussian")

,it is normalising the residuals from my model

Here's what the package says:

Details: Rank-transformation to normality generates perfectly normal distribution from ANY distribution, unless many/heavy ties are present in variable (or residuals, if formula is used).

Thus, if you have no endpoint, my recommendation is to run rntransform independently on all of your variables of interest (preferably in a loop), and then run whatever regression models you were originally aiming to do in order to test your hypotheses, presumably relating to SNP genotypes.

Further reading:

Look at the tutorial Genome-wide association analysis - Association analysis of typed SNPs using parallel processing, where they first normalise the same data that you're looking at.
Take a look at recent work by people I worked with just this past year ( Screening for interaction effects in gene expression data - page 12), where they used rntransform

Good luck

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Sorry, if I have 4914 variables what would be the loop in this part of code to rntransform all one by one?

phenoSub$phenotype <- rntransform(as.numeric(phenoSub$hdl), family="gaussian")

whatever I tried I could not figure out

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

1

Entering edit mode

#Get column indices of variables to be transformed
variableIndices <- which(colnames(phenoSub) %in% c("hdl", "ldl", ..., ...))

#Create a new data-frame to store the transformed variables
newDataFrame <- data.frame(row.names=rownames(phenoSub))

#Keep a loop counter
j <- 1

#Loop through indices in variableIndices
for (i in variableIndices) {
    temp <- rntransform(as.numeric(phenoSub[,i]), family="gaussian")

    newDataFrame <- data.frame(newDataFrame, temp)

    colnames(newDataFrame)[j] <- colnames(phenoSub)[i]

    j <- j + 1
}

Then newDataFrame will contains your transformed variables. You can then merge this with your genotype data and then test your hypotheses through modelling.

ADD REPLY • link 4.9 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you, you are so kind that share your precious time for solving people’s problems. I will try your kindly provided code. Let me thank you once again

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

1

Entering edit mode

No problem, you may also want to take a look at this study, in which rntransform was also used. However, they appeared to have used the normalisation of regression model residuals: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3400731/

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Sorry,

Could one use Gaussian model or another models to predict phenotype for example cholesterol level based on the microarray data??

ADD REPLY • link 6.1 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Yes, that is possible. In that case, you would use a linear model with cholesterol as the y variable:

lm(cholesterol ~ gene1)
lm(cholesterol ~ gene2)
lm(cholesterol ~ gene3)
...
lm(cholesterol ~ geneX)

Then take the statistically significant genes an pt them together into a preliminary 'final' model that would require yet further testing:

lm(cholesterol ~ gene4 + gene10 + gene678)

ADD REPLY • link 6.1 years ago by Kevin Blighe 87k

1

Entering edit mode

Hello friend. I also just replied to your new question: A: Predicting phenotype based on the expression of genes

ADD REPLY • link 6.1 years ago by Kevin Blighe 87k