Question

Having Problems Reproducing Published Results Of Classifying Promoters Into High-Cpg, Low-Cpg Classes

5

Entering edit mode

10.9 years ago

daniel.soronellas ▴ 330

Dear all,

I've been trying to classify promoters into HCP, LCP and ICP and didn't had success, so I ask for help or suggestions.

The workflow I used is:

Obtain all UCSC refseq promoters
Take TSS of all genes and make windows around TSS of -2Kb and +500bp
Divide each window in bins of 500bp with a 5bp offset (sliding of 5bp) --> bedtools makewindows -i src -w 500 -s 5

At each window calculate: # of CGs, # of Cs, # of Gs, and % of GC content

bedtools nuc -pattern CG -fi genome.fasta -bed genes.TSS.w500.s5.bed | awk 'NR>1{OFS="\t"; print $1,$2,$3,$4,$14,$8,$9,$13,$6}' > file

The final file looks like:

CHR    Start    End    Gene    CGs   Cs   Gs   Length   GC_content 
chr1    12268    12373    DDX11L1    6    40    35    105    0.714286
chr1    12273    12373    DDX11L1    5    38    32    100    0.700000
chr1    12278    12373    DDX11L1    4    34    32    95    0.694737
chr1    12283    12373    DDX11L1    4    31    31    90    0.688889
chr1    12288    12373    DDX11L1    4    30    29    85    0.694118

Use R to find genes which are HCP, LCP or ICP using the following script:

t<-read.table(file,sep="\t")

ratio_cpg <- function(x){     
  x1=x[,1]*x[,4] 
  x2=x[,2]*x[,3]     
  x3=x1/x2     
  return(x3)
}

t2=cbind(t[,1:ncol(t)],ratio_cpg(t[,5:9]))
colnames(t2)<-c("CHR","Start","End","Gene","CGs","Cs","Gs","Length","GC_content","CpG_ratio")

names=levels(t2[,4])

hcp_genes <- NULL; hcp_values<-NULL; 
lcp_genes <- NULL; lcp_values<-NULL; 
icp_genes <- NULL; icp_values<-NULL;

for ( i in 1:length(names) ) {         ## Take table slices which gene names are equal to NAMES vector created
    gc_con <- as.data.frame(t[ t[,4] == names[i] ,][,9])
    cpg_ratio <- as.data.frame(t[ t[,4] == names[i] ,][,10])
    values <- cbind(gc_con,cpg_ratio)         ## Find HCP promoters
    if ( length(which(values[,1] > 0.55)) > 0 && length(which(values[,2] > 0.75 )) > 0 ) {
            if ( is.null(hcp_genes) ) { 
                    hcp_genes <- names[i]
                       hcp_values <- mean(as.vector(cpg_ratio[cpg_ratio[,1]>0.75,]),na.rm=TRUE)
            } else {
                    if ( !isTRUE( names[i] %in% hcp_genes ) ) { 
                        hcp_genes <- c(hcp_genes,names[i]) 
                        hcp_values <- c( hcp_values, mean(as.vector(cpg_ratio[cpg_ratio[,1]>0.75,]),na.rm=TRUE) ) 
                    }        
            }
              next         ## Find LCP promoters        
    } else if ( length(which(values[,2] < 0.48)) > 0 && length(which(values[,2] > 0.48)) == 0) {
            if ( is.null(lcp_genes) ) {
                    lcp_genes <- names[i]
                    lcp_values <- mean(as.vector(cpg_ratio[cpg_ratio[,1]<0.48,]),na.rm=TRUE)
            } else {
                    if ( !isTRUE( names[i] %in% lcp_genes ) ) { 
                        lcp_genes <- c(lcp_genes,names[i]) 
                        lcp_values <- c( lcp_values, mean(as.vector(cpg_ratio[cpg_ratio[,1]<0.48,]),na.rm=TRUE) )
                    }
            }
              next
      ## Find ICP promoters      
    } else {
            if ( is.null(icp_genes) ) {
                    icp_genes <- names[i]
                    icp_values <- mean(as.vector(cpg_ratio[cpg_ratio[,1]>=0.48 & cpg_ratio[,1]<=0.75,]),na.rm=TRUE)    
            } else {
                    if ( !isTRUE( names[i] %in% icp_genes ) ) { 
                        icp_genes <- c(icp_genes,names[i]) 
                        icp_values <- c( icp_values, mean(as.vector(cpg_ratio[cpg_ratio[,1]>=0.48 & cpg_ratio[,1]<=0.75,]),na.rm=TRUE) )
                    }        
            }
              next
    } }

Unfortunately with this method I could not obtain the results as the paper of Weber (2007, http://www.nature.com/ng/journal/v39/n4/full/ng1990.html#a1), and I don't know why.

This is my result after this steps:

enter image description here

Could anybody give me some advice on this issue?

Thanks for your help!

dna methylation ucsc promoter • 4.5k views

ADD COMMENT • link updated 10.9 years ago by Istvan Albert 100k • written 10.9 years ago by daniel.soronellas ▴ 330

score 2 · Answer 1 · 2013-06-11

2

Entering edit mode

10.9 years ago

Istvan Albert 100k

First you should produce the histogram of the number of promoters vs their CpG content (Figure 2, left side in the paper). This will tell you if your starting data is agrees to theirs.

Then also note how their histograms are not mutually exclusive, they overlap, whereas in your case you are imposing hard cutoff on your data limits.

ADD COMMENT • link 10.9 years ago by Istvan Albert 100k

0

Entering edit mode

Thanks for your point. So you are suggesting that a good starting point would be to just calculate the CpG ratio over promoters and just plot to see if the results are close enough? In that case, do you think that the strategy of defining bins over promoter is not suitable for just plotting the left side of the figure? I mean that maybe I just need to consider that in order to plot that histogram, consider calculating CpG ratio on a window 1000bp for example and use only one value/promoter? (as opposed to calculate CpG ratio over windonws of 500 bins)

ADD REPLY • link 10.9 years ago by daniel.soronellas ▴ 330

0

Entering edit mode

yes, start with that. People tend to come up with some tweaks to the way they define concepts and you have to make sure that you are sufficiently close to that to begin with. I have skimmed over the methods but I did not fully understand how the histograms are separated. But as I pointed out note these do actually overlap whereas in your plot you have sharp cutoffs.

ADD REPLY • link 10.9 years ago by Istvan Albert 100k