Why are some tumors grouped with the controls?
5.3 years ago
Pin.Bioinf ▴ 340

Hello, I am looking at this heatmap and I do not understand why some of the tumours are grouped with the controls. It looks as if the heatmap is 'moved to the left'.

Top of the heatmap (it is a long heatmap, so I uploaded just a section): [image: Top of the heatmap]

Bottom of the heatmap: [image: Bottom of the heatmap]

I tried to check on the phenotype of the samples but I found no correlations of the phenotype with this grouping.

There seem to be two tumour subgroups: the main, larger group, and a small group of tumours that seem to be grouped with the controls. I did a t-test on the mean beta values across all the CpGs between these two groups, and the difference between the means of the two tumour groups is significant. I am afraid that when we publish a heatmap similar to this one, we will have trouble explaining this phenomenon. Any ideas or opinions on this? Thank you!

Methylation cpgs • 1.5k views
5.3 years ago

There can be many reasons, some not really relating to bioinformatics:

  1. these tumours genuinely exhibit the 'normal-like' methylation profile over these probes
  2. these tumours have normal cell contamination

Some informatics reasons:

  • your coding is incorrect and you have incorrectly assigned a normal sample as a tumour
  • you have scaled your data incorrectly
  • you should consider the distance and linkage metric that you're using

By the way, for methylation, you could probably also perform the Wilcoxon Signed Rank test on the matched T-N pairs. If you are just running a regular t-test, I would at least use the Mann-Whitney (non-parametric) test instead.

You must also have an extra checkpoint: you should obtain the difference in mean β value between tumour and normal, i.e.:

difference in mean = mean β (tumour) - mean β (normal)

Then, use that as an extra cut-off in addition to the p-value.
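
A minimal sketch of the two-step filter described above (paired Wilcoxon Signed Rank test plus a cut-off on the difference in mean β), using base R only. The toy matrices, the BH adjustment, and the thresholds (adjusted p < 0.05, |difference| >= 0.2) are assumptions for illustration and are not taken from the thread:

```r
# Hypothetical toy data: rows are CpGs, columns are matched patients.
set.seed(1)
n_cpg <- 5
n_pairs <- 8
normal <- matrix(runif(n_cpg * n_pairs, 0.1, 0.3), nrow = n_cpg,
                 dimnames = list(paste0("cg", 1:n_cpg), paste0("P", 1:n_pairs)))
tumour <- normal + matrix(rnorm(n_cpg * n_pairs, 0, 0.02), nrow = n_cpg)
tumour[1:2, ] <- tumour[1:2, ] + 0.4  # make the first two CpGs hypermethylated

# Wilcoxon Signed Rank test per CpG on the matched tumour-normal pairs.
p_value <- sapply(seq_len(n_cpg), function(i) {
  wilcox.test(tumour[i, ], normal[i, ], paired = TRUE)$p.value
})
p_adj <- p.adjust(p_value, method = "BH")  # assumed multiple-testing correction

# difference in mean = mean beta (tumour) - mean beta (normal)
delta_beta <- rowMeans(tumour) - rowMeans(normal)

# Keep CpGs that pass BOTH cut-offs (thresholds are assumptions).
keep <- p_adj < 0.05 & abs(delta_beta) >= 0.2
selected <- rownames(normal)[keep]
print(selected)
```

Only the CpGs in `selected` would then go into the heatmap. For unpaired groups, dropping `paired = TRUE` makes `wilcox.test()` run the Mann-Whitney test.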

Kevin

Thanks a lot, Kevin!

For scaling, I tried several approaches and got the same result with all of them:

  • converting to M values and then using the parameter scale='row'
  • using scale='row' on the beta values
  • using scale='none' but scaling before the heatmap
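
As a rough illustration of what these scaling variants compute, here is a base R sketch of the β-to-M conversion and of row-wise scaling done by hand. The toy matrix and variable names are hypothetical:

```r
# Hypothetical toy beta-value matrix: rows are CpGs, columns are samples.
beta <- matrix(c(0.10, 0.20, 0.80, 0.90,
                 0.50, 0.55, 0.45, 0.50),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("cg1", "cg2"), paste0("S", 1:4)))

# M value: log2 ratio of methylated to unmethylated signal, i.e. logit2(beta).
m_values <- log2(beta / (1 - beta))

# Row-wise Z-scores. scale() works on columns, so transpose before and after.
row_z <- t(scale(t(m_values)))

print(round(row_z, 3))
```

One point to check: in base R's `heatmap()` and gplots' `heatmap.2()`, as far as I know, the `scale` argument is applied after the dendrograms are computed and so only changes the colours, whereas scaling the matrix beforehand also changes the input to the clustering. The thread does not say which plotting function was used.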

All look similar, so I think I did it correctly. I will check the distance and linkage metrics, as I used the default ones. And thank you for the input on the test; I now realize I should have used a non-parametric test.

When I previously did this, the Wilcoxon Signed Rank test p-value, combined with an extra cut-off for the difference in mean β, was enough to adequately separate my groups of interest. The heatmap / clustering was then performed on unscaled β values:

[image]

I tried with:

hclustfun = function(x) hclust(x,method = 'centroid'),
distfun = function(x) dist(x,method = 'maximum'),

and here is the heatmap I got. It seems much better; should I stick with this one?

[image: Heatmap with different hclust and dist methods]

I was more interested in just learning about which metrics you were currently using. I would not use a metric that I did not understand, and obviously it is not good practice to just choose the metric that makes the data look better.

My usual default (for most data-types) is either of:

  • Euclidean distance with Ward's linkage (ward.D2)
  • 1 minus Pearson/Spearman correlation distance with Ward's linkage (ward.D2)
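
For reference, the two defaults listed above could be written as `distfun` / `hclustfun` style functions roughly as follows. This is a sketch in base R on hypothetical toy data; the function and variable names are made up for illustration:

```r
# Hypothetical toy data: 20 CpGs x 6 samples with opposite profiles in
# controls (N) and tumours (T), plus a little noise.
set.seed(42)
profile_n <- rep(c(0.2, 0.8), each = 10)
profile_t <- rep(c(0.8, 0.2), each = 10)
beta <- cbind(sapply(1:3, function(i) profile_n + rnorm(20, sd = 0.03)),
              sapply(1:3, function(i) profile_t + rnorm(20, sd = 0.03)))
colnames(beta) <- c(paste0("N", 1:3), paste0("T", 1:3))

# Both distance functions compare the ROWS of x, as dist() does.
dist_euclid <- function(x) dist(x, method = "euclidean")
dist_cor <- function(x) as.dist(1 - cor(t(x), method = "pearson"))

# Ward's linkage (ward.D2) for either distance.
hclust_ward <- function(d) hclust(d, method = "ward.D2")

# Cluster the samples: transpose so that samples are rows.
hc_euclid <- hclust_ward(dist_euclid(t(beta)))
hc_cor <- hclust_ward(dist_cor(t(beta)))

# Cut each tree into two groups and compare with the known labels.
print(cutree(hc_euclid, k = 2))
print(cutree(hc_cor, k = 2))
```

Swapping `method = "pearson"` for `"spearman"` gives the Spearman variant. The same functions can be supplied to a heatmap function that accepts `distfun` and `hclustfun` arguments.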

Yes, I agree. I was using the default parameters, which were "complete" and "euclidean". I tried euclidean with ward.D2 and the differentiation between tumors and controls is still not clear: there is a blur in the middle between the two, and some tumors are the same color as the controls, as in my original heatmap.

...but this may not necessarily be a problem, i.e., it may be the genuine result. Biology is much more complex than we can currently comprehend with our analytical methods. Every time that we take a sample and put it through our instruments, we are only looking at a 'snapshot' / moment in the evolution of the tissue/cell that is being studied, and much information is automatically eliminated because our very analytical methods are limited in what they can show.

So, if you cannot identify any issues with your coding, then it is the genuine result given the data that has been obtained.

I should add that you need to filter by both the p-value and the difference in mean β between tumour and normal.
