Gaussian mixture model in R
1
1
Entering edit mode
7.9 years ago
LJ ▴ 280

I have some sets of data(data was transformed to log-ratio),and i want to do data normalization like an article wrote: " A 2-component Gaussian mixture model-based normalization algorithm was used to achieve this normalization.The two Gaussians(μ1,σ1) and(μ2,σ2) for a sample i were fitted and used in the normalization process as follows: the mode mi of the log-ratio distribution was determined for each sample using kernel density estimation with a Gaussian kernel and Shafer-Jones bandwidth. Then A two-component Gaussian mixture model was then fit with the mean of both Gaussians constrained to be 𝑚i, i.e., 𝜇1i = 𝜇2i = 𝑚i. The Gaussian with the smaller estimated standard deviation 𝜎𝑖 = min⁡(𝜎̂1𝑖, 𝜎̂2𝑖) was used to normalize the sample. The sample was standardized using N(mi,σi) by subtracting the mean mi from each gene and dividing by the standard deviation σi. Constrained fitting of mixture models was implemented using the mixtools R package."

And i'm not good at statistics stuff ,So can anyone be kind to teach me how to write the R code to achieve the normalization effect just like the article wrote? Thanks in advance.

R Gaussian mixture model data normalization • 8.0k views
ADD COMMENT
1
Entering edit mode

Does not the paper comes with a code snippet that you can use? it clearly comes with the method of incorporating the mix modeling , the package is a good way to start. Take a look at the package and try to play with the examples and see the mixture plots and how the fit model is derived. You can also see how mclust is used for the same purpose of mixture model fit for clustering here.

This blog post is also having a nice and lucid explanation you can take a look at it and alternatively this link also serves a nice code snippet and explanation of when and why to use with mixtools and apply it for clustering.

ADD REPLY
1
Entering edit mode

Thanks first.There is no code snippet in the article.And I just want to do the normalization by 2-component Gaussian mixture model-based algorithm as the article wrote,So how to finish the following code : library(mixtools) a<-rnorm(1000) mi<-density(a)#### kernel density estimation with a Gaussian kernel and Shafer-Jones bandwidth ??? out<-mvnormalmixEM(a,k=2,lambda=NULL,mu =?,sigma = ?) ####like the article said "μ1i =μ2i =mi" ,so the'mu'should be set to equal? But the mi is a vector of length 1000,SO how do i set the 'mu' and 'sigma'? And at last ,i hope to get the result that 2 component (μ1i =μ2i =mi,sigma1≠sigma2),and i can choose the smaller sigma to normalize the sample.

ADD REPLY
3
Entering edit mode
7.9 years ago

Try something like this:

library(stats) # for bw.SJ() and density()
library(modeest) # for mlv()
library(mixtools) # for mvnormalmixEM()
# Assume data for sample i is in D
b<-bw.SJ(D,...) # Find bandwith with Shafer-Jones method
d<-density(data, bw=b, kernel="gaussian") # Gaussian kernel density estimate
M<-mlv(d,method="Parzen") # Find the mode of a univariate distribution
mode<-M$M
model<-normalmixEM(data,mu=c(mode,mode),sigma=NULL)
sigma = min(model$sigma)
ADD COMMENT
1
Entering edit mode

Thanks for the reply.And some questions for you: 1)M<-mlv(d,method="Parzen") mode<-M$M #### What dou you mean by this ? and the article said "two-component Gaussian mixture model was then fit with the mean of both Gaussians constrained to be mi(μ1i =μ2i =mi) 2)I just ran your code,and i used the 2-component Gaussian mixture model,so k was set to 2,and the result is : summary(model) summary of normalmixEM object: comp1 comp2 lambda 0.9718554 0.0281446 mu -0.0263047 0.0768719 sigma 1.0001167 0.0764447 loglik at estimate:-1402.182 SO the question is why mu1≠mu2≠mode when we set the mu=c(mode,mode)?

ADD REPLY
1
Entering edit mode

mlv() returns an mlv object in which the value of the mode is in $M. To constrain the means in normalmixEM, you need mean.constr=c(mode,mode). Always check the functions docs.

ADD REPLY
1
Entering edit mode

ok,i get it,thanks. And(1)is this right that i set the k to 2 for the situation in the article? other parameters,just like lambda,epsilon,maxit...? or just defalut when there is no special requirement? (2)The article wrote:"The sample was standardized using N(mi,σi) by subtracting the mean mi from each gene and dividing by the standard deviation σi",and the mi is the mode of the sample i data distribution,so mi should be a single value,so why the author wrote"by subtracting the mean mi from each gene"?confused..

ADD REPLY
1
Entering edit mode

1- If by k, you're referring to the number of components in the Gaussian mixture then yes

2- As I understand what you describe, once the model is computed (one model per sample), the Gaussian with the smallest sigma is selected for normalization i.e. its mean and sigma are used to standardize values for each gene in the sample.

ADD REPLY
1
Entering edit mode

i just tried the code:
a<-rnorm(1000) ### example data
b<-bw.SJ(a)
d<-density(a, bw=b, kernel="gaussian") # Gaussian kernel density estimate
M<-mlv(d,method="Parzen") # Find the mode of a univariate distribution
mode<-M$M
And i found the mode sometimes not a single value,maybe 2 or more,so which value should i choose for mu=c(mode,mode) in normalmixEM?

ADD REPLY
1
Entering edit mode

Plot the density estimate with

plot(d)

to check if the density estimate has more than one peak or a pronounced shoulder.

Actually, since we're dealing with a Gaussian, we can also do

mode<-d$x[which.max(d$y)]
ADD REPLY
1
Entering edit mode

Thanks a million for your nice help. It helps me a lot.

ADD REPLY

Login before adding your answer.

Traffic: 1487 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6