Question

Heatmap to identify differential regions between 2 groups

0

Entering edit mode

9.4 years ago

ChIP ▴ 600

Hi,

This looks quite easy but somehow it is proving a bit tricker.

I have a dataset, that looks like this:

Agg1 Agg2 Agg3 N.Agg1 N.Agg2 N.Agg3 N.Agg4 N.Agg5 N.Agg6 N.Agg7 N.Agg8 N.Agg9
cg00000029 0.8819706388 0.9113193695 0.8265891628 0.6743240268 0.4737382432 0.3787381683 0.8866032463 0.653199981 0.8118330618 0.8146570104 0.4004270438 0.5876413372
cg00000165 0.0862068969 0.9415531979 0.1479752852 0.0758735408 0.1122275454 0.3264429366 0.7067720503 0.151948052 0.1052844624 0.1115441623 0.4114138631 0.172884541
cg00000236 0.5258692989 0.5789058517 0.5287079712 0.2164139456 0.6858661668 0.3664265506 0.3543830637 0.6875024064 0.5683123989 0.4136942994 0.5926122866 0.6156535159
cg00000289 0.8347702893 0.8554045135 0.8594854672 0.8138672088 0.9050874925 0.8841502097 0.8316837328 0.5466381905 0.7995203477 0.8175727787 0.7920295767 0.8343047783
cg00000292 0.8562553796 0.8254643383 0.7978685442 0.8792732888 0.8272031231 0.8497711032 0.9001252185 0.7747919571 0.9023242755 0.8702859041 0.8246651357 0.806963772
cg00000321 0.208842496 0.4029158789 0.220634011 0.535740552 0.4670446221 0.3606374648 0.6852900843 0.1361782422 0.5070323351 0.2871776115 0.4047205584 0.4854098091
cg00000363 0.0754226054 0.0797807 0.1840101226 0.0790851263 0.1598712746 0.1237091671 0.1267954259 0.2087704626 0.2059222535 0.1152623609 0.20014435 0.1697362848
cg00000622 0.0109292346 0.0111004687 0.0123186534 0.0097088521 0.0103543813 0.0118732361 0.0107446469 0.0099396902 0.0109167376 0.0119302571 0.0118499582 0.0137091484

where Agg is aggressive & N.Agg is Non-aggressive.

In R it is quite simple.

d<-read.table("table.txt", header=T)
row.names(d)<-d$ID
d<-[,2:8]
dm<-data.matrix(d)
heatmap(dm)

Now, the thing is if I need to plot only those IDs, that are significantly different between the two groups, how can I do it? I cannot plot all the IDs as it is more that 300 thousand.

Kindly suggest an edit in the above code.

Thank you

heatmap R • 4.3k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by ChIP ▴ 600

1

Entering edit mode

There are some typos like "where 01005 & 03002 are untreated and rest are untreated."

If you want to plot heatmap only for significantly different entities/IDs, first you need to do some stats to define which are significant and which are not. For that you need to give more details about your data, so that anybody might suggest you how to go about it.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by GouthamAtla 12k

0

Entering edit mode

Hi, I have corrected the typos, this is basically DNA methylation data. I need to find out the regions which are significantly enriched in one group compared to the other.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by ChIP ▴ 600

3

Entering edit mode

9.4 years ago

Devon Ryan 104k

A shorter version of Giovanni's answer:

pvals <- apply(dm, 1, function(x) wilcox.test(x~c(rep(0,3), rep(1,9)))$p.value)
heatmap(dm[pvals<0.05,])

You might want to adjust the p-values, of course. BTW, the apply() syntax will scale nicely, since you can just use bpapply() instead if needed with the exact same syntax.

ADD COMMENT • link 9.4 years ago by Devon Ryan 104k

1

Entering edit mode

Just a clarification: rep(0,3) stands for the first three columns being Aggressive, and rep(1,9) for the others being Not-Aggressive. If you want to generalize this, without having to manually define which column is which, you may use grepl on the column names:

pvals = apply(dm, 1, function(x) wilcox.test(x~grepl('^N\\.', names(dm), perl=T))$p.value)

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by Giovanni M Dall'Olio 28k

1

Entering edit mode

Yup, though one should really not encode group assignments in a header (as you wisely pointed out in your answer).

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by Devon Ryan 104k

0

Entering edit mode

Hey Devon,

Is it possible to get IDs also for the clusters? If yes then how?

Kindly help.

Thank you

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 8.8 years ago by ChIP ▴ 600

0

Entering edit mode

You'll have to remind me what clusters you're asking about. The labels for the adjusted p-values will typically be the row names of the whatever contains the p-values.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 8.8 years ago by Devon Ryan 104k

0

Entering edit mode

Hi Devon,

Thing is when I run heatmap and cluster the dataset, I get nice clusters that are differential betwee two groups based on some p-value. But I also need to know the IDs of the members that have been plotted.

So how can I have these IDs?

Thank you

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 8.8 years ago by ChIP ▴ 600

0

Entering edit mode

Hi Devon,

The test you are applying is also known as Mann-Whitney test? I think yes. Could you please confirm.

Thank you

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 8.3 years ago by Dataminer ★ 2.8k

1

Entering edit mode

Yup, that test has many names.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 8.3 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you Devon,

So if I understand correctly, I can state that this is Wilcoxon rank sum test (paired), right?

Thank you in anticipation and sorry for disturbing you over and ver again.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by Dataminer ★ 2.8k

0

Entering edit mode

It's not paired unless you use "paired=T", so the default is a Mann-Whitney U test. If you set "paired=T" then it's a Wilcoxon signed-rank test, which is paired.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 8.3 years ago by Devon Ryan 104k

Ram · Accepted Answer · 2014-12-08

Sorry in advance if I use your question to illustrate a nice example of how to apply the principles of tidy data, recently described in a seminal paper by Hadley Wickham. Let me know if the level of R programming in the following lines is too advanced for you. In any case I am sure somebody will post a simpler solution using base R.

First, let's load the dplyr and tidyr packages, and your file:

> library(dplyr)
> library(tidyr)
> d$CpG.id = row.names(d)

If you read the Wickham's paper, this data suffers from the following problem:

3.1. Column headers are values, not variable names

To solve this, we can apply the gather function from tidyr to convert the data to a 'longer' format, along with mutate from dplyr to define aggressive.status:

> d.tidy = d %>%
    gather(Individual, Score, -CpG.id) %>%
    mutate(
        aggressive.status = ifelse(grepl('N\\.', Individual, perl=T),
             'Not-Aggressive', 'Aggressive')
    )

> d.tidy %>% head
      CpG.id Individual     Score aggressive.status
1 cg00000029       Agg1 0.8819706        Aggressive
2 cg00000165       Agg1 0.0862069        Aggressive
3 cg00000236       Agg1 0.5258693        Aggressive
4 cg00000289       Agg1 0.8347703        Aggressive
5 cg00000292       Agg1 0.8562554        Aggressive
6 cg00000321       Agg1 0.2088425        Aggressive

> d.tidy %>% group_by(aggressive.status) %>% do(head(.,3))

      CpG.id Individual      Score aggressive.status
1 cg00000029       Agg1 0.88197064        Aggressive
2 cg00000165       Agg1 0.08620690        Aggressive
3 cg00000236       Agg1 0.52586930        Aggressive
4 cg00000029     N.Agg1 0.67432403    Not-Aggressive
5 cg00000165     N.Agg1 0.07587354    Not-Aggressive
6 cg00000236     N.Agg1 0.21641395    Not-Aggressive

As you can see, the resulting data frame is in a "long" format, in which each line correspond to the CpG status of a single position in a single individual.

From this dataset, we can calculate a wilcoxon test to determine if there is difference between Aggressive and not Aggressive individuals

> d.stats = d.tidy %>% 
    group_by(CpG.id) %>% 
    summarise(
        pvalue= wilcox.test(as.numeric(Score)~factor(aggressive.status),data=.)$p.value)

> d.stats

Source: local data frame [8 x 2]

      CpG.id     pvalue
1 cg00000029 0.03636364
2 cg00000165 1.00000000
3 cg00000236 1.00000000
4 cg00000289 0.20909091
5 cg00000292 0.48181818
6 cg00000321 0.14545455
7 cg00000363 0.28181818
8 cg00000622 0.48181818

You can use this dataframe to select only the CpG ids for which the p.value is lower than 0.05, and use it as input for the heat map. For example:

> d.significant = d[rownames(d) %in% subset(d.stats, pvalue<0.05)$CpG.id,]
> d.significant
                Agg1      Agg2      Agg3   N.Agg1    N.Agg2    N.Agg3    N.Agg4 N.Agg5    N.Agg6   N.Agg7   N.Agg8    N.Agg9     CpG.id
cg00000029 0.8819706 0.9113194 0.8265892 0.674324 0.4737382 0.3787382 0.8866032 0.6532 0.8118331 0.814657 0.400427 0.5876413 cg00000029

In the example you posted there is only one CpG position significant for this test. So, in order to illustrate how to do a heatmap on this, let's take a cut-off of 0.50:

> d.significant = d[rownames(d) %in% subset(d.stats, pvalue<0.50)$CpG.id,]
> d.significant %>% select(-CpG.id) %>% as.matrix %>% heatmap