Biostar Beta. Not for public use.
What is the point of standardizing each gene in gene expression data?
15 months ago

I was advised to standardize each gene (so that each gene has a zero mean and unit variance across all samples) when clustering the genes in a gene expression dataset, but I don't understand why. Let's say we have a gene whose expression is around 4000 for all cancer patients, and another gene whose expression is around 4 for all cancer patients, and say the expression levels are highly correlated. Then by standardizing each of those 2 genes, the distance between them (which originally was quite high due to the different scales) will be very low, and those genes will very likely end up in the same gene cluster. However, those genes look quite different to me (totally intuitively), and I don't understand why they should end up in the same cluster.
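To make the scenario in the question concrete, here is a minimal sketch with NumPy. The two genes, the six-sample `pattern`, and the helper `zscore_rows` are made up for illustration and do not come from any real dataset. It compares the Euclidean distance between a gene near 4000 and a perfectly correlated gene near 4, before and after per-gene standardization. It also checks a general identity: for z-scored rows (population standard deviation, n samples), the squared Euclidean distance equals 2n(1 - r), where r is the Pearson correlation.

```python
import numpy as np

def zscore_rows(x):
    """Standardize each row (gene) to zero mean and unit variance across samples."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)  # population standard deviation (ddof=0)
    return (x - mean) / std

# Hypothetical expression pattern shared by both genes across 6 samples
pattern = np.array([-1.0, 0.5, 1.5, -0.5, 0.0, -0.5])
gene_high = 4000 + 100 * pattern   # expression around 4000
gene_low = 4 + 0.1 * pattern       # expression around 4, same shape
expr = np.vstack([gene_high, gene_low])

# Distance on the raw scale is dominated by the difference in level
raw_dist = np.linalg.norm(expr[0] - expr[1])

# Distance after standardization depends only on the shape of the profile
z = zscore_rows(expr)
z_dist = np.linalg.norm(z[0] - z[1])

r = np.corrcoef(expr)[0, 1]        # Pearson correlation between the two genes
n = expr.shape[1]                  # number of samples

print(f"raw distance:          {raw_dist:.1f}")
print(f"standardized distance: {z_dist:.6f}")
print(f"2n(1 - r):             {2 * n * (1 - r):.6f}")
```

So after standardization, Euclidean distance is effectively a correlation-based distance: genes are grouped by how they vary across samples, not by how strongly they are expressed. Whether that is the grouping you want depends on the question being asked of the data.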

An example I read about the need for standardizing is that when one variable is in kilograms and another variable is in grams, for example, these variables are not directly comparable, so standardization is needed, which makes a lot of sense. But in a gene expression dataset, all variables are in the same unit (say intensity in a microarray dataset), so I cannot relate gene expression data to that example. Can somebody explain why we need standardization for gene expression data generally, as well as specifically for clustering the genes? Thanks.

2.9 years ago
Sirus • 770

I think it is just for visualization purposes.

Suppose that you use a certain threshold to select a group of genes to display (e.g. RPM > 1). The highly expressed genes will skew the color scale of your heatmap, which will make some genes (that are also highly expressed in the same group of samples) look as if they are not expressed, because they get the same color as the lowly expressed ones. Some people use log2(RPM+1) to minimize this effect, but in that case a 2-fold change will show up as just a 1-step change.

A z-score will show the relative change, which will make the heatmap (clusters) clearer.
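As a rough illustration of the scaling point, the sketch below uses a made-up 2 x 4 RPM matrix (the values are hypothetical). Both genes double in the last two samples, but one sits around 2000-4000 RPM and the other around 2-4 RPM. Applying the z-score to the log2(RPM+1) values rather than to the raw RPM is an assumption made here; the answer does not say which is meant.

```python
import numpy as np

# Hypothetical RPM values: rows are genes, columns are samples
rpm = np.array([
    [2000.0, 2000.0, 4000.0, 4000.0],  # highly expressed gene, 2-fold up in samples 3-4
    [2.0, 2.0, 4.0, 4.0],              # lowly expressed gene, same 2-fold change
])

# On one shared color scale, the raw range is set by the highly expressed gene
raw_range = rpm.max() - rpm.min()
low_gene_change = rpm[1].max() - rpm[1].min()
print(f"low gene change as a fraction of the raw scale: {low_gene_change / raw_range:.4f}")

# log2(RPM + 1) compresses the scale; a 2-fold change becomes roughly a 1-unit step
log_rpm = np.log2(rpm + 1)
print("log2 step per gene:", log_rpm[:, 2] - log_rpm[:, 0])

# Row-wise z-score: each gene is centered and scaled on its own
row_mean = log_rpm.mean(axis=1, keepdims=True)
row_std = log_rpm.std(axis=1, keepdims=True)
row_z = (log_rpm - row_mean) / row_std
print(row_z)
```

With the row-wise z-score, both genes end up with the same profile across samples, so they would get the same colors in a heatmap regardless of their absolute expression level.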

2.2 years ago
theobroma22 ♦ 1.1k

Standardization helps to cluster the data. Say two clusters look similar before standardization but dissimilar after it. This means standardizing the data made it possible to distinguish these two clusters rather than erroneously forcing them into one cluster.
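One way to picture this is with three made-up genes (all values and helper names below are hypothetical). Genes A and B have similar expression levels but opposite patterns across samples, while gene C has a much lower level but the same pattern as A. The sketch checks which gene is nearest to A under Euclidean distance, before and after per-gene standardization.

```python
import numpy as np

def zscore_rows(x):
    """Standardize each row (gene) to zero mean and unit variance across samples."""
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / std

def pairwise_euclidean(x):
    """Return the matrix of Euclidean distances between all pairs of rows."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def nearest_neighbor(dist, i):
    """Index of the row closest to row i, excluding row i itself."""
    d = dist[i].copy()
    d[i] = np.inf
    return int(np.argmin(d))

pattern = np.array([-1.0, 0.5, 1.5, -0.5, 0.0, -0.5])
genes = ["A", "B", "C"]
expr = np.vstack([
    1000 + 50 * pattern,   # A: high level, follows the pattern
    1000 - 50 * pattern,   # B: high level, opposite pattern
    10 + 0.5 * pattern,    # C: low level, same pattern as A
])

raw_nn = nearest_neighbor(pairwise_euclidean(expr), 0)
z_nn = nearest_neighbor(pairwise_euclidean(zscore_rows(expr)), 0)

print("nearest to A on raw data:         ", genes[raw_nn])
print("nearest to A on standardized data:", genes[z_nn])
```

On the raw values, A is grouped with B because their levels match, even though they move in opposite directions. After standardization, A is grouped with C, and A and B are pulled apart.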

