Hi,

Imagine I have 200 samples, and for each I have expression values of 500 genes (so it's a matrix with 500 rows and 200 columns). The expression values are already normalized as TPM. I need to compute a distance between the samples. What is the best way of doing it? Should I take log(TPM+1), then scale with a z-transformation, and compute Euclidean distances? Or is there a better way? And should I scale rows (calculate z-scores within genes but across samples) or columns?

Thanks

Hi, maybe try a tutorial first to understand the basics. A quick Google search turns up many RNA-seq tutorials, like this one for example. Good luck.

Well, I do know the basics. And I know that people usually calculate the distance with log CPM. However, my question is about TPM. What did I miss in the basics in your opinion?

I mean the basics needed to understand the difference between row and column z-scores.

The row z-score is there to make sure that highly expressed genes don't dominate which samples come out as similar to each other. I just wanted to make sure that this makes sense and that I didn't miss anything.
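To make that point concrete, here is a small sketch in Python/NumPy (not R, and not taken from anyone's post in this thread). The toy matrix and the helper names `row_zscore` and `sample_distances` are made up for illustration. It compares sample-to-sample Euclidean distances before and after row z-scoring.

```python
import numpy as np

# Hypothetical log-scale expression: 3 genes (rows) x 4 samples (columns).
log_expr = np.array([
    [10.0, 10.5, 14.0, 14.5],  # highly expressed gene with a large spread
    [1.0, 2.0, 1.0, 2.0],      # lowly expressed gene
    [0.5, 1.5, 0.5, 1.5],      # lowly expressed gene
])

def row_zscore(m):
    """Z-score each gene (row) across samples (columns), using the sample SD."""
    mean = m.mean(axis=1, keepdims=True)
    sd = m.std(axis=1, ddof=1, keepdims=True)
    return (m - mean) / sd

def sample_distances(m):
    """Euclidean distance between every pair of samples (columns)."""
    diff = m[:, :, None] - m[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

raw = sample_distances(log_expr)
scaled = sample_distances(row_zscore(log_expr))

# Unscaled: the high-expression gene decides that sample 0 is nearer to sample 1.
print(raw[0, 1], raw[0, 2])
# Row-scaled: all genes weigh equally, and sample 0 is now nearer to sample 2.
print(scaled[0, 1], scaled[0, 2])
```

In this toy case the nearest neighbour of sample 0 changes once every gene is put on the same scale, which is the effect the row z-score is meant to have.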

Yes, this is how pheatmap and heatmap.2 scale the data when scale = "row" is set, i.e., each row is converted to Z-scores. It is also the way that I usually scale my own data. Assuming you have samples as columns and genes as rows, this can be done via:

I show that each produces the same output in this proof, here: A: cannot replicate the pheatmap scale function