Having Problems Reproducing Published Results Of Classifying Promoters Into High-Cpg, Low-Cpg Classes
1
5
Entering edit mode
10.9 years ago

Dear all,

I've been trying to classify promoters into HCP, LCP and ICP and didn't had success, so I ask for help or suggestions.

The workflow I used is:

  1. Obtain all UCSC refseq promoters
  2. Take TSS of all genes and make windows around TSS of -2Kb and +500bp
  3. Divide each window in bins of 500bp with a 5bp offset (sliding of 5bp) --> bedtools makewindows -i src -w 500 -s 5
  4. At each window calculate: # of CGs, # of Cs, # of Gs, and % of GC content

    bedtools nuc -pattern CG -fi genome.fasta -bed genes.TSS.w500.s5.bed | awk 'NR>1{OFS="\t"; print $1,$2,$3,$4,$14,$8,$9,$13,$6}' > file

    The final file looks like:

    CHR    Start    End    Gene    CGs   Cs   Gs   Length   GC_content 
    chr1    12268    12373    DDX11L1    6    40    35    105    0.714286
    chr1    12273    12373    DDX11L1    5    38    32    100    0.700000
    chr1    12278    12373    DDX11L1    4    34    32    95    0.694737
    chr1    12283    12373    DDX11L1    4    31    31    90    0.688889
    chr1    12288    12373    DDX11L1    4    30    29    85    0.694118
    
  5. Use R to find genes which are HCP, LCP or ICP using the following script:

t<-read.table(file,sep="\t")

ratio_cpg <- function(x){     
  x1=x[,1]*x[,4] 
  x2=x[,2]*x[,3]     
  x3=x1/x2     
  return(x3)
}

t2=cbind(t[,1:ncol(t)],ratio_cpg(t[,5:9]))
colnames(t2)<-c("CHR","Start","End","Gene","CGs","Cs","Gs","Length","GC_content","CpG_ratio")

names=levels(t2[,4])

hcp_genes <- NULL; hcp_values<-NULL; 
lcp_genes <- NULL; lcp_values<-NULL; 
icp_genes <- NULL; icp_values<-NULL;

for ( i in 1:length(names) ) {         ## Take table slices which gene names are equal to NAMES vector created
    gc_con <- as.data.frame(t[ t[,4] == names[i] ,][,9])
    cpg_ratio <- as.data.frame(t[ t[,4] == names[i] ,][,10])
    values <- cbind(gc_con,cpg_ratio)         ## Find HCP promoters
    if ( length(which(values[,1] > 0.55)) > 0 && length(which(values[,2] > 0.75 )) > 0 ) {
            if ( is.null(hcp_genes) ) { 
                    hcp_genes <- names[i]
                       hcp_values <- mean(as.vector(cpg_ratio[cpg_ratio[,1]>0.75,]),na.rm=TRUE)
            } else {
                    if ( !isTRUE( names[i] %in% hcp_genes ) ) { 
                        hcp_genes <- c(hcp_genes,names[i]) 
                        hcp_values <- c( hcp_values, mean(as.vector(cpg_ratio[cpg_ratio[,1]>0.75,]),na.rm=TRUE) ) 
                    }        
            }
              next         ## Find LCP promoters        
    } else if ( length(which(values[,2] < 0.48)) > 0 && length(which(values[,2] > 0.48)) == 0) {
            if ( is.null(lcp_genes) ) {
                    lcp_genes <- names[i]
                    lcp_values <- mean(as.vector(cpg_ratio[cpg_ratio[,1]<0.48,]),na.rm=TRUE)
            } else {
                    if ( !isTRUE( names[i] %in% lcp_genes ) ) { 
                        lcp_genes <- c(lcp_genes,names[i]) 
                        lcp_values <- c( lcp_values, mean(as.vector(cpg_ratio[cpg_ratio[,1]<0.48,]),na.rm=TRUE) )
                    }
            }
              next
      ## Find ICP promoters      
    } else {
            if ( is.null(icp_genes) ) {
                    icp_genes <- names[i]
                    icp_values <- mean(as.vector(cpg_ratio[cpg_ratio[,1]>=0.48 & cpg_ratio[,1]<=0.75,]),na.rm=TRUE)    
            } else {
                    if ( !isTRUE( names[i] %in% icp_genes ) ) { 
                        icp_genes <- c(icp_genes,names[i]) 
                        icp_values <- c( icp_values, mean(as.vector(cpg_ratio[cpg_ratio[,1]>=0.48 & cpg_ratio[,1]<=0.75,]),na.rm=TRUE) )
                    }        
            }
              next
    } }

Unfortunately with this method I could not obtain the results as the paper of Weber (2007, http://www.nature.com/ng/journal/v39/n4/full/ng1990.html#a1), and I don't know why.

This is my result after this steps:

enter image description here

Could anybody give me some advice on this issue?

Thanks for your help!

dna methylation ucsc promoter • 4.5k views
ADD COMMENT
2
Entering edit mode
10.9 years ago

First you should produce the histogram of the number of promoters vs their CpG content (Figure 2, left side in the paper). This will tell you if your starting data is agrees to theirs.

Then also note how their histograms are not mutually exclusive, they overlap, whereas in your case you are imposing hard cutoff on your data limits.

ADD COMMENT
0
Entering edit mode

Thanks for your point. So you are suggesting that a good starting point would be to just calculate the CpG ratio over promoters and just plot to see if the results are close enough? In that case, do you think that the strategy of defining bins over promoter is not suitable for just plotting the left side of the figure? I mean that maybe I just need to consider that in order to plot that histogram, consider calculating CpG ratio on a window 1000bp for example and use only one value/promoter? (as opposed to calculate CpG ratio over windonws of 500 bins)

ADD REPLY
0
Entering edit mode

yes, start with that. People tend to come up with some tweaks to the way they define concepts and you have to make sure that you are sufficiently close to that to begin with. I have skimmed over the methods but I did not fully understand how the histograms are separated. But as I pointed out note these do actually overlap whereas in your plot you have sharp cutoffs.

ADD REPLY

Login before adding your answer.

Traffic: 1524 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6