trying to reproduce heat map made via cBioportal using download data
1
1
Entering edit mode
5.9 years ago

I need to make a statistical comparison using breast cancer data. I have made a heat map at the following link on the Bioportal: http://www.cbioportal.org/index.do?cancer_study_id=brca_tcga&Z_SCORE_THRESHOLD=2.0&RPPA_SCORE_THRESHOLD=2.0&data_priority=0&case_set_id=brca_tcga_mrna&gene_list=BRCA1%250ABRCA2%250ATP53%250ACCL2%250ACCR3%250ACD44%250AENG%250AIL6%250AIL33%250ACD33%250ACSF1%250AHIF1A%250ACLEC7A&geneset_list=%20&tab_index=tab_visualize&Action=Submit&genetic_profile_ids_PROFILE_MRNA_EXPRESSION=brca_tcga_mrna_median_Zscores&show_samples=false&heatmap_track_groups=brca_tcga_mrna_median_Zscores%2CBRCA1%2CBRCA2%2CTP53%2CCCL2%2CCCR3%2CCD44%2CENG%2CIL6%2CIL33%2CCD33%2CCSF1%2CHIF1A%2CCLEC7A

enter image description here

To do this I initially went to the http://www.cbioportal.org page and selected the cancer samples that I am interested in: Breast Invasive Carcinoma (TCGA, Provisional)

I then went to the textbook to enter the names of the genes that I am interested in: BRCA1 BRCA2 TP53 CCL2 CCR3 CD44 ENG IL6 IL33 CD33 CSF1 HIF1A CLEC7A

Then I submitted the query and was able to plot a clustered heat map (by clicking 'Heatmap>>Add genes to heatmap>>cluster heatmap') in the "oncoprint" tab.

While this type of annotation is useful, I would also like to be able to download the data that is generating this heat map and make a similar heat map on my local

computer (for experimental reasons).

To attempt this I went to the original search page and clicked the "View summary" button.

From this I found a "Download Data" button at the top of the page.

This returns a 'tar.gz' file with lots of interesting datasets. e.g.:

  • data_mRNA_median_Zscores.txt

  • data_expression_median.txt

  • data_RNA_Seq_v2_mRNA_median_Zscores.txt

  • data_RNA_Seq_v2_expression_median.txt

I want to find the expression data that was used to generate the histogram shown in the first provided link. From the downloaded files, I initially

tried data_RNA_Seq_v2_expression_median.txt

Below is my attempt to reproduce a heatmap similar to the one above:

data=read.table('data_RNA_Seq_v2_expression_median.txt',header=T,fill = T,stringsAsFactors = F)
genes_OI=c("BRCA1","BRCA2","TP53","CCL2","CCR3","CD44","ENG","IL6","IL33","CD33","CSF1","HIF1A","CLEC7A")

data_OI=data.frame()
for(i in genes_OI$V1){
 data_OI=rbind(data_OI,data[which(data[,1]==i),])
}
sumis.na(data_OI))
library(gplots)
png('test_TCGA_patients.png',height = 1000,width=1000)
data_OI[,-c(1,2)]=apply(as.matrix(data_OI[,-c(1,2)]), 2, as.numeric)

data=na.omit(data)
heatmap.2(as.matrix(data_OI[,-c(1,2)]),labCol = NA,
         labRow = data_OI[,1],cexRow = 1.4,keysize = 1.4)
dev.off()

The resulting heatmpat is as follows:

enter image description here

But this is not at all like the heatmap in the link at the top of the page.... is there some normalisation step that I am missing?

I also tried using the file where the the Zscores were computed: data_RNA_Seq_v2_mRNA_median_Zscores.txt

This file however (just from looking at the file content does not require any distance matrix):

data=read.table('data_RNA_Seq_v2_mRNA_median_Zscores.txt',header=T,fill = T,stringsAsFactors = F)
genes_OI=c("BRCA1","BRCA2","TP53","CCL2","CCR3","CD44","ENG","IL6","IL33","CD33","CSF1","HIF1A","CLEC7A")

data_OI=data.frame()
 for(i in genes_OI$V1){
     data_OI=rbind(data_OI,data[which(data[,1]==i),])
 }
png('test_TCGA_patients.png',height = 1000,width=1000)
data_OI[,-c(1,2)]=apply(as.matrix(data_OI[,-c(1,2)]), 2, as.numeric)

data=na.omit(data)

hclustfunc <- function(x, method = "complete", dmeth = "euclidean") {    
  hclust(dist(x, method = dmeth), method = method)
}
rc<-hclustfunc(data_OI[,-c(1,2)])
cd=t(data_OI[,-c(1,2)])
cc<-hclustfunc(cd)
heatmap(as.matrix(data_OI[,-c(1,2)]), Rowv=as.dendrogram(rc), 
        Colv=as.dendrogram(cc),labRow = data_OI[,1],labCol = NA)
dev.off()

This unfortunately does not produce anything similar to the heat map seen in the link above....

Hence I am wondering were it is that I am going wrong....? Am I using the correct file or is there

a normalisation step that I am missing?

RNA-Seq • 3.6k views
ADD COMMENT
0
Entering edit mode

The cBioPortal group only refers to the part at the bottom as the heatmap. The main figure is an 'oncoprint', with some form of a heatmap at the bottom.

Take a look here: OncoPrint

ADD REPLY
0
Entering edit mode

Hi sorry forgot to add step about "clicking 'Heatmap>>Add genes to heatmap>>cluster heatmap". The question has been modified necessarily.

ADD REPLY
0
Entering edit mode

I have posted an answer for you - see below

ADD REPLY
0
Entering edit mode

See: How to add images to a Biostars post to understand how to embed images in a post. I've made the necessary changes for now.

ADD REPLY
0
Entering edit mode

thanks this has been amended

ADD REPLY
0
Entering edit mode

What is the basis of heatmap clustering? From what factor depends the order of patients after selecting "cluster heatmap" in the oncoprint? Thank you.

ADD REPLY
0
Entering edit mode

The code contains the answer to your question. Check out ?hclust and ?heatmap.2

ADD REPLY
0
Entering edit mode

Thanks. So is it essentially hierarchical clustering to find the maximum number of similarities and cluster them together? Because in my case, I have clustered only 4 genes and perhaps it's quite expected to not observe any clustering.

ADD REPLY
0
Entering edit mode

If that's what the manual says, that's what it is. Do you have any questions that the manual does not cover?

ADD REPLY
1
Entering edit mode
5.9 years ago

Hello again,

I see - thank you for clarifying.

Can you do me a favour and plot a histogram of the data from data_RNA_Seq_v2_expression_median.txt ? I assume that it's just the normalised counts, whereas the other file ( data_RNA_Seq_v2_mRNA_median_Zscores.txt ) represents the Z-scores of these normalised counts. Just use:

hist(data)

or

hist(data.matrix(data))

If you use the data from the first file (data_RNA_Seq_v2_expression_median.txt), you are virtually guaranteed a better heatmap if you do the following:

Read in data

data <- read.table('data_RNA_Seq_v2_expression_median.txt',header=T,fill = T,stringsAsFactors = F)
genes_OI <- c("BRCA1","BRCA2","TP53","CCL2","CCR3","CD44","ENG","IL6","IL33","CD33","CSF1","HIF1A","CLEC7A")

Set rownames

rownames(data) <- data[,1]
data <- data.matrix(data[,-1])

Select out genes of interest

data_OI <- data[which(rownames(data) %in% genes_OI),]

Pick a color scheme and set breaks

require("RColorBrewer")
myCol <- colorRampPalette(c("darkblue", "black", "darkred"))(100)

Set coluor 'breaks' at -2 and +2 Z-score

myBreaks <- seq(-2, 2, length.out=101)

Transform your data to the Z-scale

heat <- t(scale(t(data_OI)))

Plot the heatmap, specify your custom breaks and colour scheme, and switch further scaling (by heatmap.2) off

library(gplots)
pdf('test_TCGA_patients.pdf', width=7, height=5)
  heatmap.2(heat,
    col=myCol,
    breaks=myBreaks,
    main="Title",
    key=T, keysize=1.0,
    scale="none",
    density.info="none",
    reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean),
    trace="none",
    labRow
    cexRow=1.0,

    cexCol=0.2,
    distfun=function(x) dist(x, method="euclidean"),
    hclustfun=function(x) hclust(x, method="ward.D2"))
dev.off()

You can switch off the dendrograms by adding the following parameters:

heatmap.2(..., dendrogram="none", Rowv=FALSE, Colv=FALSE, ...)

If you want to do this with the data_RNA_Seq_v2_mRNA_median_Zscores.txt data, then just skip the heat <- t(scale(t(data_OI))) command.

This code is untested because I don't actually have the files in front of me. It should work, though. Some of your functions, like for selecting out genes and then converting to a data matrix, looked overly complex...

Kevin

ADD COMMENT
0
Entering edit mode

Hi thank you. You're answer is certainly an improvement on mine. enter image description here https://ibb.co/giCkcy

As for creating the histogram:

    head(data[,c(1:4)])
   Hugo_Symbol Entrez_Gene_Id TCGA.3C.AAAU.01 TCGA.3C.AALI.01
1 LOC100130426      100130426          0.0000          0.0000
2     UBE2Q2P3      100133144         16.3644          9.2659
3     UBE2Q2P3      100134869         12.9316         17.3790
4    LOC149767          10357         52.1503         69.7553
5       TIMM23          10431        408.0760        563.8934
6        MOXD2         136542          0.0000          0.0000

      hist(data.matrix(data[,-c(1,2)]))

this returns the following histogram

I did this but the resulting distribtion is very skewed

enter image description here https://ibb.co/eEs1Hy

ADD REPLY
0
Entering edit mode

Okay, so, it just looks like normalised RNA-seq counts on the negative binomial scale.

I would use the Z-scaled data and skip my scale() command, as I mentioned.

Also, if you 'squish' the PDF, you can make it look elongated like the cBioPortal image.

For example:

pdf(filename, width=9, height=2.5)
   ...
dev.off()
ADD REPLY

Login before adding your answer.

Traffic: 1697 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6