Question

Help with heatmap

0

Entering edit mode

5.9 years ago

Pin.Bioinf ▴ 340

Hello, I want to check the heatmap for a number of genes but I am seeing all genes don't change through samples, ie. all genes have the same color across samples (I can only see color difference for the grouped genes).

What I do is:

I have RPKMS, this is a little part of the matrix:

                     Sample08_tumor   Sample11_control    Sample12_tumor       Sample13_control
ENSG00000000938               21.785479                46.963138              13.277114                26.846941
ENSG00000001460               17.751131                13.812688               5.310846                10.325746
ENSG00000001461               96.017482                55.250750              71.696417                90.866568
ENSG00000002016               25.012958                24.862838              30.537363                32.009814
ENSG00000002079                0.000000                 0.000000               0.000000                 0.000000
ENSG00000002587                3.227478                 8.287613               3.983134                 3.097724
                          Sample14_tumor           Sample15_control          Sample16_tumor
ENSG00000000938              22.563645                 25.073894               56.586824
ENSG00000001460               8.122912                  4.178982                4.438182
ENSG00000001461             137.186963                121.190490              114.283193





exprs<-as.data.frame(lapply(exprs, FUN = function(x) {sapply(x, FUN = log2)}))
exprs[exprs=="-Inf"]<-0


##now cluster:

heatmap.2(as.matrix(exprs),distfun = function(x) dist(x,method = 'euclidean'),hclustfun = function(x) hclust(x,method = 'average'),tracecol=NA)

Can anybody tell me why all genes don't change across samples? (I'm seeing different colors in bands, but not squares of colors)

Thank you.

heatmap r expression rpkm • 1.8k views

ADD COMMENT • link updated 5.9 years ago by Zhilong Jia ★ 2.2k • written 5.9 years ago by Pin.Bioinf ▴ 340

2

Entering edit mode

That happens, due to the large amount of variation in the RPKM values. So to resolve this problem you can transform RPKM values into log10 and can use it for heatmap.

ADD REPLY • link 5.9 years ago by Tm ★ 1.1k

1

Entering edit mode

I'm not sure that just logging the RPKM data is sufficient, or even using such counts. RPKM and FPKM are not suitable for cross-sample comparisons because they do'nt adjust for differences in library sizes. Logging the data will neither take this information into account.

Pin.Bioinf, te puedo preguntar / may I ask you from where you obtained this data?

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

2

Entering edit mode

RPKM/FPKM do account for library size differences, the main problem is that they are not very good at accounting for differences in the fragment diversity and the relative expression changes within the fragment pools. TPMs will be more suitable for that. More info

I'm seeing different colors in bands, but not squares of colors

I'm not sure what you mean by that.

Generally, you may want to consider using scale = "row" with your command. As a side note, I tend to find pheatmap package a lot more pleasent to interact with when optimizing a heatmap visualization. It offers all the functionalities of heatmap.2, but with more sensible default settings and more intuitive parameter names.

ADD REPLY • link 5.9 years ago by Friederike 8.9k

1

Entering edit mode

I'm reading the more info link, and there's a statement there that I'm unable to understand:

Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor. Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)

How does dividing by a million (a constant number) *normalize for sequencing depth" (a variable between experiments)?

ADD REPLY • link 5.9 years ago by Ram 43k

0

Entering edit mode

RPKM/FPKM do account for library size differences

That's just not true... RPKM/FPKM do not adjust for total library size [edit] across all samples

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

0

Entering edit mode

The formula is here:

[source: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/]

They have the total aligned reads (to all transcripts) as denominator, multiplied by L (gene length). As the numerator they have 10 to the power of 9 multiplied by the reads aligned to the gene in question. This is not doing any sort of adequate adjustment for total library size.

It's about time we end the use of FPKM, RPKM, FPKM-UQ... for good. Incorrect conclusions are being drawn from these types of data by people who are unaware of their failings.

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

0

Entering edit mode

I agree, it's not doing a good job. But it was meant to adjust for library sizes and it is part of the formula.

ADD REPLY • link 5.9 years ago by Friederike 8.9k

0

Entering edit mode

I think that [ again, like in the other thread, my friend! :) ] everything said here is correct, but we are again viewing it from different angles!

There is indeed a library size parameter in the formula, as one can see. However, the normalisation is performed per sample, i.e., some normalisation is made for library size within each sample.

What FPKM/RPKM do not do is adjust for library size differences across all samples in a study. What this means is that, for example, a FPKM value of 100 for geneX in Sample1 may actually reflect less expression than an equivalent value of 60 in Sample2 for the same gene. This happens if they are both sequenced to different depths of coverage and therefore have different library sizes.

What worries me is that people from broad fields are now transitioning into bioinformatics and they do not understand these issues. Just the other day, I saw a presentation where a medical doctor / physician showed some sample data and have plotted FPKM values. In the worst scenario, s/he will make a false conclusion about a particular gene if it was sequenced to different depths across her/his samples.

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

1

Entering edit mode

I agree, FPKM should never be used. But as long as Cufflinks exists and spits those out, you will see them.

ADD REPLY • link 5.9 years ago by Friederike 8.9k

0

Entering edit mode

@Friederike, you rock!

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

0

Entering edit mode

thanks, appreciated. but no worries, I generally don't need to be mollified during a professional discussion :)

ADD REPLY • link 5.9 years ago by Friederike 8.9k

0

Entering edit mode

This data was obtained by another bioinformatician who used to work for my colleague, and now she is asking I do some more analyses on it

ADD REPLY • link 5.9 years ago by Pin.Bioinf ▴ 340

score 1 · Answer 1 · 2018-05-11

1

Entering edit mode

5.9 years ago

Zhilong Jia ★ 2.2k

log RNA-SEQ data after adding a small number: exprs_logged <- log2((exprs+0.001))

and key steps, scale by row heatmap.2(exprs_logged, ..., scale="row",), will make genes colored logically .

ADD COMMENT • link 5.9 years ago by Zhilong Jia ★ 2.2k

0

Entering edit mode

Zhilong, great to see you again - remember me? Do you believe it's correct to just log FPKM values?

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

0

Entering edit mode

Kevin, Nice to see you again at biostars, a place far or near from London.

For your question, I did not dig the RPKM parameter deep. It's not easy to say correct or incorrect for data analysis. lots of times it's related with precision or the results. It seems the RPKM or FPKM issue you mentioned was discussed in Lovén, Jakob, et al. "Revisiting global gene expression analysis." Cell 151.3 (2012): 476-482 six years ago. But there are still lots of data in GEO, only including RPKM values.

As far as I know, tt seems only cuffdiff support FPKM (Maybe you can point some to help me), while DESeq(2) and edgeR, even limma (I usually use) not.

Back the @Pin.Bioinf's question, "check the heatmap for a number of genes but I am seeing all genes don't change through samples", I think my solution will be helpful even it's FPKM data, but other normalization solutions (such as voom from limma), inside of FPKM, probably make it better. Correct me if I'm wrong. Thank you.

ADD REPLY • link 5.9 years ago by Zhilong Jia ★ 2.2k

1

Entering edit mode

Yes, I suppose that we should focus more on the specific issue, i.e., regarding the heatmap. Maybe the user has access to no other data but the RPKM.

Your idea to scale by row will help. @Friederike also mentions it in her comments. In addition, I gave a previous answer in relation to this, here I mention the possibility of converting to the Z-scale and setting breaks: A: Heatmap based with FPKM values

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k