Distance for gene expression samples
13 months ago
chipolino • 40

Hi,

Imagine I have 200 samples and, for each, expression values for 500 genes (so it's a matrix with 500 rows and 200 columns). The expression values are normalized as TPM. I need to compute a distance between the samples. What is the best way of doing it? Should I take log(TPM+1), scale with a z-transformation, and then compute Euclidean distance? Or is there a better way? And should I scale rows (calculate z-scores within genes but across samples) or columns?

Thanks

RNA-Seq distance • 189 views

Hi, maybe try a tutorial first to understand the basics. A Google search turns up many RNA-seq tutorials, like this one for example. Good luck.

Well, I do know the basics. And I know that people usually calculate the distance with log CPM. However, my question is about TPM. What did I miss in the basics in your opinion?

I mean the basics needed to understand the difference between row and column z-scores.

Row z-scoring makes sure that genes with higher expression don't dominate which samples look similar to each other. I just wanted to make sure that this makes sense and that I didn't miss anything.
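To make that reasoning concrete, here is a minimal sketch with made-up numbers (the gene and sample names are hypothetical, not from any real dataset). It compares how much each gene contributes to the squared Euclidean distance between two samples, before and after row z-scoring:

```r
# Toy TPM-like matrix: 2 genes (rows) x 3 samples (columns); values are invented.
x <- rbind(
  high_gene = c(s1 = 100, s2 = 200, s3 = 300),
  low_gene  = c(s1 = 3,   s2 = 1,   s3 = 2)
)

# Per-gene contribution to the squared Euclidean distance between s1 and s3.
contrib_raw <- (x[, "s1"] - x[, "s3"])^2

# Row z-scores: centre and scale each gene across samples.
z <- t(scale(t(x)))
contrib_z <- (z[, "s1"] - z[, "s3"])^2

# Share of the squared distance owed to each gene.
share_raw <- contrib_raw / sum(contrib_raw)
share_z   <- contrib_z / sum(contrib_z)

print(round(share_raw, 5))  # high_gene accounts for almost everything
print(round(share_z, 5))    # low_gene now contributes a visible share
```

On the unscaled values the highly expressed gene accounts for nearly all of the distance; after row scaling both genes are on the same footing.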

Yes, scaling by row is what pheatmap and heatmap.2 do when called with scale = "row" (in both functions the scale argument defaults to "none"). It is also how I usually transform my data to Z-scores. Assuming you have samples as columns and genes as rows, this can be done via:

t(scale(t(x)))

I show that each approach produces the same output here: A: cannot replicate the pheatmap scale function
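As a quick sanity check of the one-liner, here is a small sketch on simulated data (the matrix is a random stand-in, not real expression values). It compares t(scale(t(x))) against row z-scores written out by hand:

```r
set.seed(1)
# Simulated stand-in for a genes x samples matrix (5 genes, 6 samples).
x <- matrix(rexp(30, rate = 0.1), nrow = 5,
            dimnames = list(paste0("gene", 1:5), paste0("sample", 1:6)))

# scale() centres and scales columns, so transpose, scale, and transpose back.
z <- t(scale(t(x)))

# The same row-wise z-score computed by hand: (value - row mean) / row sd.
z_manual <- (x - rowMeans(x)) / apply(x, 1, sd)

# Both routes should agree up to floating-point tolerance.
print(all.equal(z, z_manual, check.attributes = FALSE))

# Each gene (row) should now have mean ~0 and standard deviation 1.
print(round(rowMeans(z), 10))
print(apply(z, 1, sd))
```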

11 months ago
Republic of Ireland

Hey, there is actually no standard way. For example, one of pheatmap() and heatmap.2() (gplots) (I cannot remember which) will first scale your data to Z-scores, then perform clustering on these Z-scores and produce a heatmap from the same values, whereas the other function will perform clustering on the unscaled data and then present the Z-scaled data in the actual heatmap.

You can have complete control over your own scaling by switching off the scaling feature in both functions (or whatever other function you're using).
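A rough sketch of what that could look like, assuming the pheatmap and gplots packages are installed (the calls are skipped otherwise) and using a simulated matrix in place of real data. The data are scaled once by hand and both functions are told not to scale again, so clustering and colours are based on the same values:

```r
set.seed(42)
# Simulated stand-in for log-transformed expression: 8 genes x 6 samples.
x <- matrix(rnorm(8 * 6, mean = 5), nrow = 8,
            dimnames = list(paste0("gene", 1:8), paste0("sample", 1:6)))

# Scale once, by hand, to row Z-scores.
z <- t(scale(t(x)))

# pheatmap: scale = "none" switches off internal scaling.
if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(z, scale = "none",
                     clustering_distance_rows = "euclidean",
                     clustering_distance_cols = "euclidean",
                     clustering_method = "ward.D2")
}

# heatmap.2: scale = "none" likewise; distance and linkage passed explicitly.
if (requireNamespace("gplots", quietly = TRUE)) {
  gplots::heatmap.2(z, scale = "none", trace = "none",
                    distfun = dist,
                    hclustfun = function(d) hclust(d, method = "ward.D2"))
}
```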

Your log(TPM+1) data should already be on a 'presentable' distribution. I see no issue with further scaling this to Z-scores and then performing both the clustering and the heatmap generation on the Z-scaled data. Euclidean distance would be fine, or 1 minus Pearson correlation. Using Ward's method (ward.D2) for linkage will then give an evenly distributed tree layout.
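Put together, the workflow might look something like the sketch below. The TPM matrix is simulated and the object names (tpm, log_tpm, d_euclid, d_pearson) are placeholders; log2 is used here as an assumption, and any log base would work the same way:

```r
set.seed(123)
# Simulated TPM-like matrix: 50 genes (rows) x 8 samples (columns).
tpm <- matrix(rexp(50 * 8, rate = 0.05), nrow = 50,
              dimnames = list(paste0("gene", 1:50), paste0("sample", 1:8)))

# Log-transform with a pseudocount of 1.
log_tpm <- log2(tpm + 1)

# Row Z-scores (within genes, across samples).
# Genes with zero variance would give NaN here and should be removed first.
z <- t(scale(t(log_tpm)))

# Option 1: Euclidean distance between samples.
# dist() compares rows, so transpose to put samples in rows.
d_euclid <- dist(t(z), method = "euclidean")

# Option 2: 1 minus Pearson correlation between samples.
# cor() compares columns, which are already the samples.
d_pearson <- as.dist(1 - cor(z, method = "pearson"))

# Hierarchical clustering of samples with Ward's method.
hc <- hclust(d_euclid, method = "ward.D2")
print(hc)
```

Swapping d_euclid for d_pearson in the hclust() call gives the correlation-based tree instead.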

Kevin

Thank you for your awesome answer!
