estimate population allele frequencies with EM (expectation-maximation) algorithm using R
0
0
Entering edit mode
6.0 years ago
Ana ▴ 200

Hello everyone, I have asked this question in Stack Exchange Bioinformatics forum, unfortunately I have not been able to get any help there. So, I am asking it here, maybe some R professionalists can help me to fix the final part of my R-code. I would really appreciate if anyone can help me to resolve the issue.

I am trying to estimate population allele frequency for CNVs (it can only result in genotyping CNV as a dominant marker) using EM algorithm manual in R. Below is the first few lines of my data-set. I have 40,000 genes and for each gene I have 2 rows (the first row is number of individuals that do not have the gene, __a, and the second row is the number of individuals that actually have the gene, __p). The present gene is dominant and can be either homozygote (pp) or heterozygote (pa).

head(data_all)
                genes pop_1
    1 Ha1_00044004__a    0
    2 Ha1_00044004__p    8
    3 Ha1_00043269__a    1
    4 Ha1_00043269__p    7
    5 Ha1_00045379__a    0
    6 Ha1_00045379__p    8

I want to estimate the frequency of present genes in my population. I have used the following code, but I am stuck in the last part (I would like to repeat the process unless a condition is met and it is converged).

even_indexes<-seq(2,48216,2)  ## draw even rows which are counts for present genes
x.loadings <- data.frame(x=data_all[even_indexes,2])

mu.p <- 0.5 
mu.a <- 0.5

    ## step E  (loop through all genes)
            out_res <- NULL

             for (i in 1:nrow(x.loadings)){
                 #ABOexpect <- function(pp,mu.a,mu.o){
                     mu.p <- 0.5 
                     mu.a <- 0.5
                     npp <- x.loadings[i,] * (mu.p^2/(mu.p^2 + 2*(mu.p*mu.a)))
                     npa <- x.loadings[i,] * ((2*(mu.p*mu.a)/(mu.p^2 + 2*(mu.a*mu.p))))
                     N <- data.frame(npp,npa)
                     out_res<-rbind (out_res,N)
             }

### combine the outcome of Estep with present and absent count 
        data_expect<-cbind(out_res,count_present,count_absent)
        colnames(data_expect)<-c("npp","npa","count_present","count_absent")


            ### step M (loop through all genes)

            out_res_m <- NULL
            for (i in 1:nrow(data_expect)){
            #ABOmaximize <- function(npp,npa,naa) {
                n <-8
                p <- ((2*data_expect[i,1]) + data_expect[i,2])/ (2 * n)
                a <- ((2*data_expect[i,4]) + data_expect[i,2])/ (2 * n)
                L <- data.frame(p,a)
                out_res_m<-rbind (out_res_m,L)
                }

## combine out_res_m with data expect to make a final data set
data_max<-cbind(out_res_m,data_expect)

head(data_mx)
gene_iD.     Present_count absent_count app   naa  p  a
Ha1_00044004  8  0  2.666667  5.333333  0.66666667 0.3333333
Ha1_00043269  7  1  2.333333  4.666667  0.58333333 0.4166667
Ha1_00045379  8  0  2.666667  5.333333  0.66666667 0.3333333
.
.
.

The final part that I am stuck is the iteration part, repeat the process until convergence. If I had only one gene I could use something like this :

   iter <- 1
    diff <- 1
    tol <- 0.0001 ## tolerance

    while (diff > tol) {
          E <- ABOexpect(np,mu.p,mu.a)
          M <- ABOmaximize(E$pp,E$pa,np)
          diff <- abs(M$mu.p - mu.p) + abs(M$mu.a - mu.a)
          mu.a <- M$mu.p
          mu.o <- M$mu.a
          cat(sprintf("iter:%d, diff:%.2f\n",iter,diff))
          iter <- iter + 1
    }
    cat(sprintf("mu_a=%.4f, mu_o=%.4f\n",mu.a,mu.o))

But my problem is that I have 40,000 genes. I do not know how can I modify the code above to get my desired out put!

R EM algorithm • 968 views
ADD COMMENT

Login before adding your answer.

Traffic: 1976 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6