Question

Difference between heatmap using Euclidean distance and Poisson distance

6.5 years ago

tlorin ▴ 360

(This thread is related to a Bioconductor package so I also posted this question on the dedicated forum. However, I thought this could relate to bioinformatics in general.)

Using DESeq2, I would like to obtain a heatmap of sample-to-sample distances using the rlog-transformed values. However, I'm not sure whether I should use the "Euclidean" distance or the "Poisson" distance (both are suggested here). I have obtained both graphs but don't know which one I should "trust". While I think I understand the "Euclidean" approach, I'm not sure I grasp the advantages of the "Poisson" approach. I've tried reading the foundational paper (Witten 2011) but got lost at some point B-)

Could someone:

illustrate a simple sample case where both methods would give the same result?
illustrate a simple sample case where both methods would give different results?
explain the advantages and drawbacks of both methods?

Many thanks!

deseq2 RNA-Seq poisson heatmap • 8.4k views

updated 6.5 years ago by Kevin Blighe 87k • written 6.5 years ago by tlorin ▴ 360

score 6 · Answer 1 · 2017-10-17

It's all about the distribution of the data that you've got.

Witten states that Euclidean distance has long been used in clustering of microarray data, data that is typically normalised to a Gaussian ('normal') distribution through RMA or gcRMA. To this day, most functions will still default to Euclidean distance.

Witten then makes the argument that RNA-seq data has different distributions, with counts and normalised counts more representative of a Poisson or negative binomial. In these distributions, the data is non-negative and the peak will be shifted to the 'left' along the x-axis toward 0 (zero). Mass spectrometry/cytometry data also follows this distribution.

If you plot your normalised DESeq2 counts, you'll see a Poisson- or negative binomial-like distribution. If you plot the regularised log (rlog) counts, you'll see something much closer to a Gaussian. These two observations are important, and are why the DESeq2 authors state:

"We use the R function dist to calculate the Euclidean distance between samples. To ensure we have a roughly equal contribution from all genes, we use it on the rlog-transformed data."

So, only use Euclidean distance when you're working with rlog-transformed counts.
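To make the "roughly equal contribution from all genes" point concrete, here is a toy calculation (plain Python rather than R, only so that it is self-contained). The two samples and their counts are made up, and log2(x + 1) is used as a crude stand-in for rlog, which it is not: the real rlog also shrinks differences for low-count genes. The sketch only shows how much of the squared Euclidean distance comes from the most highly expressed gene, before and after a log-type transform.

```python
import math

def squared_diffs(a, b):
    """Per-gene squared differences between two samples."""
    return [(x - y) ** 2 for x, y in zip(a, b)]

def log2p1(counts):
    """log2(x + 1): a crude stand-in for rlog (assumption, not DESeq2's rlog)."""
    return [math.log2(c + 1) for c in counts]

# Hypothetical counts for three genes (low, medium, high expression).
sample_a = [5, 50, 5000]
sample_b = [10, 100, 5500]  # low and medium genes doubled, high gene up 10%

raw = squared_diffs(sample_a, sample_b)
logged = squared_diffs(log2p1(sample_a), log2p1(sample_b))

# Share of the squared distance contributed by the high-expression gene.
raw_share = raw[-1] / sum(raw)
log_share = logged[-1] / sum(logged)

print(f"high-count gene share on raw counts:  {raw_share:.3f}")
print(f"high-count gene share on log2(x + 1): {log_share:.3f}")
```

On the raw counts, the high-count gene accounts for almost all of the distance even though it changed the least in relative terms; after the log-type transform, the two genes that doubled dominate instead.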

On the Poisson distance, the same authors write: "The PoissonDistance function takes the original count matrix (not normalized) with samples as rows instead of columns, so we need to transpose the counts in dds."

Likewise, only use the Poisson distance on raw counts: as the quoted passage says, the function expects the original, un-normalised count matrix.

There are no drawbacks to either of these, from my perspective. If you apply both to the correct data, then you should generally see the same or very similar result. The resulting dendrogram structure will differ slightly, but the overall message should not.
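As a rough illustration of how the two measures weight genes, here is a toy sketch in plain Python (not R, just to keep it self-contained). Everything in it is an assumption made for illustration: the counts are invented, log2(x + 1) stands in for rlog, and poisson_dissimilarity is a simplified log-likelihood-ratio statistic in the spirit of Witten (2011), not the PoissonDistance function from the package, which may apply additional smoothing or transformation.

```python
import math

def euclidean_log(a, b):
    """Euclidean distance on log2(x + 1) values (stand-in for rlog; assumption)."""
    return math.sqrt(sum((math.log2(x + 1) - math.log2(y + 1)) ** 2
                         for x, y in zip(a, b)))

def poisson_dissimilarity(a, b):
    """Simplified Poisson log-likelihood-ratio dissimilarity for two samples.

    Null model: expected count = gene total * sample's share of all reads.
    Returns sum(obs * log(obs / expected)), with 0 * log(0) taken as 0.
    """
    total_a, total_b = sum(a), sum(b)
    grand = total_a + total_b
    stat = 0.0
    for x, y in zip(a, b):
        gene_total = x + y
        for obs, depth in ((x, total_a), (y, total_b)):
            expected = gene_total * depth / grand
            if obs > 0:
                stat += obs * math.log(obs / expected)
    return stat

# Hypothetical raw counts for three genes.
a = [10, 1000, 500]
b = [20, 1000, 500]   # the low-count gene doubles
c = [10, 2000, 500]   # the high-count gene doubles

print("log-Euclidean  a-b:", round(euclidean_log(a, b), 3),
      " a-c:", round(euclidean_log(a, c), 3))
print("Poisson        a-b:", round(poisson_dissimilarity(a, b), 3),
      " a-c:", round(poisson_dissimilarity(a, c), 3))
```

In this toy case the log-scale Euclidean distance treats the two doublings as about equally large, while the Poisson-style measure rates the doubling of the high-count gene as far more dissimilar, because a two-fold change at 1000 reads is much stronger evidence than a two-fold change at 10 reads. The real rlog shrinks differences for low-count genes, which pulls the Euclidean-on-rlog picture closer to the Poisson one, consistent with the two heatmaps usually telling a similar story. Identical samples give zero under both measures.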