Deconvolution Methods on RNA-Seq Data (Mixed cell types)
8.5 years ago
Paul.Lin ▴ 150

Dear all,

I want to use deconvolution methods to estimate the proportions of different cell types in my RNA-Seq samples. In this post (Differential gene expession analysis in cell populations of mixed tumor and normal cells), it's mentioned that "signals from different cell-types/tissues will sum more linearly in microarrays than RNAseq, where the sum is highly non-linear" and "Any paper talking about signal separation will likely mention that the signals need to be independent for optimal performance, which they self-evidently aren't in RNAseq." Could someone please explain to me why in RNA-Seq samples the signals from different cell-types/tissues are not independent, or why the signals don't sum linearly?

Also, if I do decide to go ahead with using deconvolution methods, should I apply the deconvolution methods to raw RNA-Seq counts, log(CPM) transformed data, or voom transformed data?

Thanks.

Paul

RNA-Seq deconvolution • 21k views

Someone in our group just gave a journal club talk on "CIBERSORT". They at least claim that it can be used for RNAseq data and might be useful for you if you just want to know something like, "what percentage of each sample is composed of one of a number of cell types". I'm still a bit dubious about the method, but in theory it or something like it could possibly work.


Agreeing with Devon: I recently used this software on RNAseq data, and it does give results there as well, reported as percentages of the different cell types: https://cibersort.stanford.edu


CIBERSORT is designed for immune cell types. If you aren't specifically looking at a mixture of immune cells, you might want to use a more generalized deconvolution strategy.

If you are looking at bulk tumor expression, I would typically expect some sort of percent tumor value from the pathologist, which you could use in your differential expression model (if that's available for your samples, that might be a good alternative / positive control option).


Does anyone know of other cell type signatures besides LM22 from CIBERSORT, which has only 22 cell types?

Also, are there any other tools for RNAseq besides CIBERSORT and DeconRNASeq?


Dvir Aran from Atul Butte's lab at UCSF has recently come out with a new tool, xCell, for RNAseq-based deconvolution that might be worth looking into (it is very easy to use): http://xcell.ucsf.edu


If you can wait, I will have a new cell deconvolution method coming out in a publication. It was tailored for detecting immune cell populations from RNA-seq.


Hi Kevin

Just curious if this has been published yet? Looking into a variety of deconvolution methods for RNA-seq, and would be very interested in the method you've developed.


Hi Alex, that work is being continued by my now former colleagues, as I moved to the USA in 2016. I am still in touch, however, and I understand that they are still trying to publish the work. Note that the deconvolution part is only one part of a manuscript that is heavily focused on molecular biology. Have no other programs been released in this area yet?

8.5 years ago

Since you're referring to something I wrote, perhaps it's best if I reply :)

The thing with RNAseq is that it's a zero-sum game: there is a finite (though large) number of reads that will be sequenced, and all of the transcripts/genes are competing with each other for them. This creates a dependence between the measured expression of each transcript/gene, since every read sequenced from gene A means that there's one fewer read that can be sequenced from gene B (and C and D and...). The question then becomes how big of a problem this is. I don't really know the answer to that; perhaps someone has done a study on it.
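To make the competition for reads concrete, here is a minimal sketch. It assumes reads are shared strictly in proportion to RNA abundance and ignores gene length and sampling noise; the gene names, abundances, and sequencing depth are all made up for illustration.

```python
def expected_reads(abundance, depth):
    """Expected reads per gene when a fixed number of reads is shared
    in proportion to each gene's RNA abundance (gene length ignored)."""
    total = sum(abundance.values())
    return {gene: depth * amount / total for gene, amount in abundance.items()}

depth = 1_000_000  # hypothetical sequencing depth

before = expected_reads({"A": 100, "B": 100, "C": 100}, depth)
# Only gene A changes: its abundance rises tenfold.
after = expected_reads({"A": 1000, "B": 100, "C": 100}, depth)

for gene in ("A", "B", "C"):
    print(gene, round(before[gene]), round(after[gene]))
```

In this toy case genes B and C drop from roughly 333k to roughly 83k expected reads even though their own abundance never changed, which is the dependence in question.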

Regarding what to use, you might have luck with logCPM.


I just wanted to give my intuition that this might be an important concern. A lot of approaches rely on signature genes that are highly expressed in 'pure' tissues. Say A and B are signature genes for tissue 1 and C is a signature gene for tissue 2. If all three genes are expressed at equally elevated levels and the true mixture is 80/20, then A and B together might 'steal' a disproportionate share of the reads, leaving gene C with fewer reads and thereby underestimating the contribution of tissue 2.
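The 80/20 example can be worked through with toy numbers. This is only a sketch under strong assumptions: each gene is expressed in exactly one tissue, the per-cell abundances are invented, and the read share of gene C is taken naively as the tissue 2 contribution.

```python
# Per-cell RNA abundance of three hypothetical signature genes.
tissue1 = {"A": 100, "B": 100, "C": 0}   # two signature genes
tissue2 = {"A": 0, "B": 0, "C": 100}     # one signature gene
cell_fraction = {"tissue1": 0.8, "tissue2": 0.2}

# RNA pool of the mixed sample: a cell-weighted sum of the two profiles.
pool = {
    g: cell_fraction["tissue1"] * tissue1[g] + cell_fraction["tissue2"] * tissue2[g]
    for g in tissue1
}

# Sequencing reports each gene's share of the pool, not absolute amounts.
total = sum(pool.values())
read_share = {g: pool[g] / total for g in pool}

# Gene C is the only source of tissue 2 signal here, so its read share
# is the apparent tissue 2 contribution.
apparent_tissue2 = read_share["C"]
print(f"true cell fraction: {cell_fraction['tissue2']:.3f}")
print(f"apparent fraction from read share: {apparent_tissue2:.3f}")
```

With these numbers the pool is 80 + 80 + 20 units, so gene C gets 20/180 (about 11%) of the reads although tissue 2 makes up 20% of the cells.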


@Devon: I am not sure I understand this point correctly.

My understanding is that this premise holds when the transcript availability (i.e., expression quantity) of gene B (or C) is limiting compared to A. If the transcripts for A and B (e.g., housekeeping genes) are identical in expression, does the premise still hold, i.e., do their reads still compete to get sequenced?

In addition, doesn't it depend on the length of the gene as well?


Even if genes are identically expressed they'll still be competing. All genes/transcripts are competing against each other. This is the one nice thing about microarrays, since the probes are independent.

Yes, length comes into play too.


Thanks, Devon! I understand the dependence between gene-level RNA-Seq reads now. :) I need some time to think about what the possible consequences are; meanwhile do you have any suggestion on how to estimate the proportions of different cell types apart from deconvolution methods?

With regard to logCPM, I understand that CPM normalises the data according to library size and hence makes data from different samples comparable, but what's the purpose of the log? Is it to make the data more like microarray data? If so, doesn't the voom transformation make the RNA-Seq data even more like microarray data, and hence a better option here?


If you still have the raw samples then I've had excellent luck with qPCR. In fact, the only reason I'm familiar with this is that we (and much of the field it turned out) had contaminated samples that were screwing up results. qPCR ended up being the best method to prescreen things before sequencing. I had tried signal separation methods but never got great results (it worked well for microarray datasets though).

The purpose of the log is to change the range of the data so that it no longer starts at 0, but instead extends from -infinity to +infinity. The math tends to behave better when you don't have restricted ranges (this is also why people use log2 fold change for everything in RNAseq).
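A small sketch of both points, in plain Python. The `cpm` and `log2_cpm` helpers are simplified stand-ins written for this illustration (they are not the exact formulas used by edgeR or voom), and the counts are made up. The prior of 1 is an assumption added so that zero counts stay finite.

```python
import math

def cpm(counts):
    """Counts per million: scale each gene by the sample's library size."""
    library_size = sum(counts.values())
    return {g: c * 1e6 / library_size for g, c in counts.items()}

def log2_cpm(counts, prior=1.0):
    """log2(CPM + prior); the prior keeps zero counts finite."""
    return {g: math.log2(v + prior) for g, v in cpm(counts).items()}

# Fold changes are lopsided on the linear scale (0 to 1 versus 1 to
# infinity) and symmetric around 0 on the log scale.
for ratio in (0.25, 0.5, 1, 2, 4):
    print(ratio, math.log2(ratio))

sample = {"geneA": 0, "geneB": 50, "geneC": 950}  # made-up counts
print(log2_cpm(sample))
```

A halving and a doubling come out as -1 and +1, which is the symmetry that makes log2 fold changes convenient.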


@ Devon

Thanks.


Hi Ryan.

I am still a bit confused about how such a zero-sum game causes non-linearity. It may cause some sort of dependency between variables, but the overall expression is still the weighted sum of the expression of its components (the different cell types), right?


This thread is very old and it is holiday season. However, just plot RNA-seq count data as a histogram and you will clearly appreciate the non-linearity. It has been found that a negative binomial distribution is better for modeling RNA-seq count data.

Your second question relates more to the type of deconvolution that is being used. I am yet to see any clear winner in terms of methods for RNA-seq deconvolution.

8.4 years ago

Do NOT use log2 cpms. The data need to be in non-log linear space.

I quote

The samples profiled within PRECOG primarily represent bulk diagnostic pre-therapy tumor specimens, which often contain a variety of cell types, including diverse TALs. Given the enrichment of lymphocyte markers in favorably prognostic genes across PRECOG (Figs. 1d and 2d), a method to systematically 'unmix' or deconvolve bulk tumor GEPs in PRECOG may reveal new insights into tumor immunobiology. We recently developed a new approach for CIBERSORT, a machine-learning method that outperformed other approaches in benchmarking experiments[16]. CIBERSORT produces an empirical P value for the deconvolution using Monte Carlo sampling. Like other linear deconvolution methods, CIBERSORT only operates on expression values in non-log linear space[75].
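As a rough illustration of why linear methods want non-log data, consider a single gene and two cell types. This is a toy calculation with invented expression values, not CIBERSORT itself: it just solves the two-component mixing equation once on the linear scale and once on log2 values.

```python
import math

# Made-up linear-scale expression of one gene in two pure cell types.
pure_a, pure_b = 1000.0, 10.0
fraction_a = 0.5  # true mixing proportion

# Physical mixing happens on the linear scale.
mixture = fraction_a * pure_a + (1 - fraction_a) * pure_b

def solve_fraction(mix, a, b):
    """Solve mix = f*a + (1-f)*b for f."""
    return (mix - b) / (a - b)

linear_estimate = solve_fraction(mixture, pure_a, pure_b)
log_estimate = solve_fraction(
    math.log2(mixture), math.log2(pure_a), math.log2(pure_b)
)

print(f"linear space: {linear_estimate:.3f}")
print(f"log2 space:   {log_estimate:.3f}")
```

On the linear scale the true 0.5 is recovered; on log2 values the same equation gives about 0.85, because the log of a sum is not the sum of the logs.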

Ref: http://www.nature.com/nm/journal/v21/n8/full/nm.3909.html?WT.ec_id=NM-201508&spMailingID=49267143&spUserID=ODkwMTM2NjI1NwS2&spJobID=741033692&spReportId=NzQxMDMzNjkyS0


Agreed! The method is for data in linear space.

7.8 years ago
Shicheng Guo ★ 9.4k

Whether or not you need to log-transform depends on the method (code/script). It is easy to find out: make a mixture yourself with known proportions, run the deconvolution with their code/script, and compare the input proportions with the deconvolution result. In my experience, you need to try raw data, counts, signal, log-transformed, and logit-transformed values to find out which one works best. I prefer the log and logit transforms.
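The "make a mixture yourself and compare" test might look like the sketch below. Everything in it is an assumption for illustration: the two signatures are invented, the mixture is noise-free, and the deconvolution is plain ordinary least squares on two columns with negative weights clipped to zero, standing in for whatever tool you are actually evaluating.

```python
import math

# Made-up linear-scale signatures for two cell types over four genes.
sig1 = [900.0, 50.0, 300.0, 10.0]
sig2 = [20.0, 700.0, 250.0, 400.0]
true_w1 = 0.3

# Step 1: build the in-silico mixture on the linear scale.
mix = [true_w1 * a + (1 - true_w1) * b for a, b in zip(sig1, sig2)]

def least_squares_fraction(m, s1, s2):
    """Fit m ~ w1*s1 + w2*s2 by ordinary least squares (2x2 normal
    equations), clip negatives to zero, and return w1 / (w1 + w2)."""
    a11 = sum(x * x for x in s1)
    a12 = sum(x * y for x, y in zip(s1, s2))
    a22 = sum(y * y for y in s2)
    b1 = sum(x * z for x, z in zip(s1, m))
    b2 = sum(y * z for y, z in zip(s2, m))
    det = a11 * a22 - a12 * a12
    w1 = max((b1 * a22 - b2 * a12) / det, 0.0)
    w2 = max((a11 * b2 - a12 * b1) / det, 0.0)
    return w1 / (w1 + w2)

transforms = {
    "linear": lambda v: v,
    "log2(x+1)": lambda v: math.log2(v + 1),
}

# Step 2: deconvolve under each transform and compare with the known input.
results = {}
for name, f in transforms.items():
    est = least_squares_fraction(
        [f(v) for v in mix], [f(v) for v in sig1], [f(v) for v in sig2]
    )
    results[name] = est
    print(f"{name}: estimated {est:.3f}, true {true_w1:.3f}")
```

For this particular least-squares stand-in, the linear input returns the true proportion and the log input does not. A different method may behave differently, which is exactly why running the check against the tool you intend to use is worthwhile.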

7.3 years ago
Ron ★ 1.2k

Here is another package that can be used:

https://www.bioconductor.org/packages/devel/bioc/vignettes/DeconRNASeq/inst/doc/DeconRNASeq.pdf
