I have a dataset which consists of 3 groups of replicates the 1st and 2nd with 3 replicates each and the 3rd group with 4 replicates. I performed DE gene expression analysis using edgeR, and obtained a list of statistically significant genes in the form of a table of gene names and TPMs. I created a data matrix and log2-transformed it with the following code:
z <- data.frame(read.table("/Volumes/David_seq/RNAseq_data/5-1_2018_lmo7aMO aggregates bulk RNA-seq/DE_data/Sorted_NCCs_top800deGenes.txt", header = TRUE, sep = "\t"))
row.names(z) <- z$Gene
z <- z[,2:11]
z_matrix <- data.matrix(z)
z_log2matrix <- log2(z_matrix)
Next I performed hierarchical clustering with centroid linkeage and generated a heatmap.
cor_t <- 1 - cor(t(z_log2matrix))
distancet <- as.dist(cor_t)
hclust_centroid <- hclust(distancet, method = "centroid")
dendcentroid <- as.dendrogram(hclust_centroid)
heatmap.2(z_log2normMatrix, Rowv = dendcomplete, Colv = TRUE, scale = "row", col = redblue(256), trace = "none",
dendrogram = "both", cexCol = 0.5, density.info = "none", labRow = NA)
I know there are many ways to perform clustering, and I'm still a newbie. However, I was pleased with the initial results in that it provided me with the information I was looking for. In short, I'm looking for clusters of genes that are upregulated in the 1st group of replicates as compared to the 3rd group (WT), but not upregulated in the 2nd group as compared to the 3rd group (WT).
https://postimg.cc/image/bdsbe0sfb/
I don't know where to go from here. How do I pull out those clusters and put them into their own matrices?
edit: I'm not sure how to insert images, so I've included a URL to the heatmap so you can see what I'm talking about.
Hi Kevin,
Thanks for the advice! I am looking into cutree() as you suggested. I am wondering: what kind of output does it give you? What I ultimately want is a table or matrix with gene names and TPMs for each cluster. Also, it seems you can specify manually how you want it to divide up the clusters, but I don't understand how I would decide those location values. What exactly is meant by "height"?
Thanks!
David
The tree height gives an indication of 'distance' / 'dis-similarity' between the units being clustered in a dendrogram. Here's the image from the other question:
upload pictures
The tree heights (at left) are measured as Euclidean distances in this case. Other possible metrics include Canberra and Manhattan Distance, correlation distance, et cetera. The max distance is roughly 8, in this case, meaning that the 2 most dis-similar genes in this dendrogram are dis-similar by a Euclidean distance of 8. If we cut the tree at a height of Euclidean distance 7.0 (red line), then we are segregating the genes into clusters that are dis-similar at a minimum Euclidean distance of 7.0 - so, quite well segregated. If we cut the tree at lower heights (i.e. lower levels of dis-similarity), then our clusters become less dis-similar and more similar.
In your example, you appear to be using 1 minus Pearson correlation as your distance metric, and then centroid (UPGMC) as your agglomeration function. Your resulting dendrogram clearly has 3 different groups of genes. If you applied
cutree
to that, you may not get exactly what you want. I would encourage you to retry yourhclust
function with ward.D2 as the agglomeration function, and then see if that gives a more evenly-distributed dendrogram, on which you could then applycutree
.Of course, you don't have to use cutree here. It really depends on what exactly you want to do with the information presented to you in the heatmap. In other studies, for example, I have done pathway analysis on the different clusters of genes.
Hi Kevin,
Thanks again for the explanation. I tried what you said, redoing the clustering using ward.D2 as the agglomeration and did get a smoother dendrogram. As you say, there are clearly 3 groups, but within those are some more subtle clusters that represent the groups I am interested in. Specifically, if you look at the heatmap, you'll see groups at the very top and bottom that are up in the first set of replicates only and down in the first set of replicates only, respectively. these are the clusters I want to pull out.
Pathway analysis is actually exactly what I want to do once I've extracted those clusters :) would you be able to point me towards a pipeline for doing this as well? Thanks so much!
Yes, that's generally the result of using Ward's linkage, but there's not exactly a right or wrong in terms of which distance metric and linkage method to use. Some are more suitable for certain types of data, though.
In terms of pathway analysis itself, there is no real standard way to do it. Some people do gene enrichment and pathway analysis via DAVID Functional Annotation, which you may have used? Another option for KEGG-specific pathways includes KEGGprofile.
Usually one would also have clinical data against which you could compare your clusters, too, but that's usually for the samples and not the genes, i.e., this cluster of samples exhibits a high inflammatory response, whereas this cluster is immuno-suppressed, based on blood cytokine and other markers like CRP and ESR.
Back when I was doing something similar, a colleague ad a licence for 'Ingenuity', a commercial tool, but it is costly.
Another tool that would be worth exploring is GSVA, but I find it's implementation a bit cumbersome.