Question

RNA-Seq expression analysis of heterogeneous data (cancer+healthy cells)

4

7.2 years ago

Nicolas Rosewick 10k

Hi,

I ran into a significant issue while analyzing RNA-Seq data from leukemia samples. Most of the samples are not homogeneous (a mix of cancer and healthy T-cells), and an exploratory PCA on the gene count table (after the DESeq2 rlog transformation) shows a large disparity between samples, which I attribute to this heterogeneity. However, for each sample I know the percentage of cancer cells. Given this information, would it be possible to correct the RNA-Seq data in order to perform gene expression analysis? I was thinking of deconvolution methods, but I'm not an expert in this field.

Thanks

RNA-Seq heterogeneous deconvolution • 2.3k views

ADD COMMENT • link updated 7.2 years ago by Steven Lakin ★ 1.8k • written 7.2 years ago by Nicolas Rosewick 10k

0

What is your sample size?

ADD REPLY • link 7.2 years ago by GouthamAtla 12k

0

I have ~40 tumor samples (a mix of tumor and healthy T-cells) and ~10 control samples (healthy T-cells).

ADD REPLY • link 7.2 years ago by Nicolas Rosewick 10k

score 0 · Answer 1 · 2017-01-18

0

7.2 years ago

Steven Lakin ★ 1.8k

In your metadata file, record the percentage of cancer cells for each sample and include it as a covariate in your analysis. Differential gene expression analysis uses regression to estimate log-fold changes, so if you regress on the percentage of cancer cells, you'll hopefully capture the variation due to tumor content separately from whatever effect you're trying to model:

~ 0 + MainEffect + PercentCancer

Segmentation by any other method would be too complicated or overkill, in my opinion.
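
To make the formula concrete, here is a minimal base R sketch of the design matrix it would produce. The metadata values are made up for illustration, and the column names `MainEffect` and `PercentCancer` are assumptions (the percentage needs a valid R name, so `%Cancer` itself would not work as a column name).

```r
# Hypothetical metadata: 4 tumor samples of varying purity and 2 healthy controls.
meta <- data.frame(
  MainEffect    = factor(c("Tumor", "Tumor", "Tumor", "Tumor", "Control", "Control"),
                         levels = c("Control", "Tumor")),
  PercentCancer = c(35, 50, 70, 90, 0, 0)
)

# "~ 0 + ..." drops the intercept, so every MainEffect level gets its own column;
# PercentCancer enters as a single numeric covariate.
design <- model.matrix(~ 0 + MainEffect + PercentCancer, data = meta)
print(design)
```

Each row is one sample: two indicator columns for the group and one column carrying the tumor percentage, which is what lets the regression estimate the purity effect separately from the group effect.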

ADD COMMENT • link 7.2 years ago by Steven Lakin ★ 1.8k

0

So in R I could do something like:

# expressionTable contains the expression values (read count, rlog or anything else),
# with genes in rows and samples in columns
# percentage is a numeric vector with the percentage of tumor cells in each sample
# (same order as the columns of expressionTable)
p <- apply(expressionTable, 1, function(x) anova(lm(x ~ percentage))[1, "Pr(>F)"])
fdr <- p.adjust(p, method = "fdr")
# extract significant genes (which() drops genes whose p-value is NA)
expressionTableSig <- expressionTable[which(fdr <= 0.05), , drop = FALSE]

Some questions:

Is it OK to use lm(), or is there a more powerful method for this kind of regression?
Which type of expression data should I choose (normalized read count, rlog, vsd, TPM, FPKM, etc.)?
The percentage is bounded (between 0 and 100). Will the regression be biased because the variable is bounded?

edit: In this paper http://www.nature.com/articles/srep24375 they suggest using TMM normalization with robust regression (the rlm() function from the MASS package) to limit the impact of outliers, since the "classic" lm() function is sensitive to them. Any thoughts about that?
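
For what it's worth, the difference between lm() and rlm() can be seen on a toy example. This is only a sketch on simulated data for a single gene (the numbers, the injected outlier and the variable names are all made up, and no TMM normalization is applied here); rlm() is used with its default settings.

```r
library(MASS)  # provides rlm(); MASS ships with standard R installations

set.seed(1)
# Simulated data (not from my samples): one gene measured in 40 samples.
percentage <- runif(40, min = 0, max = 100)
expr <- 5 + 0.03 * percentage + rnorm(40, sd = 0.3)  # true slope is 0.03

# Inject a single strong outlier at the sample with the highest tumor percentage.
idx <- which.max(percentage)
expr[idx] <- expr[idx] - 6

fit_lm  <- lm(expr ~ percentage)
fit_rlm <- rlm(expr ~ percentage)  # M-estimation, Huber weights by default

slopes <- c(lm  = unname(coef(fit_lm)["percentage"]),
            rlm = unname(coef(fit_rlm)["percentage"]))
print(slopes)
```

The robust fit downweights the outlying sample, so its slope should stay closer to the simulated value than the ordinary least squares slope.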

ADD REPLY • link 7.2 years ago by Nicolas Rosewick 10k

1

DESeq2 and other differential gene expression pipelines fit some kind of GLM internally. lm would be inappropriate here because it assumes normally distributed errors with constant variance, whereas read counts are discrete and overdispersed; you want a model based on a count distribution. DESeq2 is largely based on the negative binomial distribution and takes many other factors into account when doing the regression. However, you still provide DESeq2 with a design formula, such as the one above, which it uses in the regression modeling step. You should record your % tumor cells like so in your metadata:

Sample_ID    MainEffect    PercentCancer
1            Treatment     0
2            Treatment     35
3            Treatment     70
....
15           Control       0
16           Control       0
And so on. Then, at the design step described in the DESeq2 vignette, include PercentCancer as a covariate in the design formula:

~ 0 + MainEffect + PercentCancer

The package will handle the rest. If you're interested in the math behind it, their publication is fairly good at explaining how they handle dispersion, normalization, etc.
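
To give a rough idea of what such a fit looks like for a single gene, here is a sketch using glm.nb() from the MASS package on simulated counts. This is a stand-in, not DESeq2 itself: DESeq2 additionally estimates size factors and shares dispersion information across genes, which this example does not do. The metadata, the simulated counts and the effect sizes are all made up.

```r
library(MASS)  # glm.nb(): negative binomial GLM, used here as a single-gene stand-in

set.seed(42)
# Hypothetical metadata: 40 tumor samples and 10 healthy controls.
meta <- data.frame(
  MainEffect    = factor(rep(c("Tumor", "Control"), c(40, 10)),
                         levels = c("Control", "Tumor")),
  PercentCancer = c(round(runif(40, min = 20, max = 95)), rep(0, 10))
)

# Simulated raw counts for one gene whose expression rises with tumor content.
mu     <- exp(4 + 0.02 * meta$PercentCancer)
counts <- rnbinom(nrow(meta), mu = mu, size = 5)  # size = 1 / dispersion

# Same design formula as above, fitted with a log link on the raw counts.
fit <- glm.nb(counts ~ 0 + MainEffect + PercentCancer, data = meta)
print(summary(fit)$coefficients)
```

The PercentCancer coefficient is on the natural log scale per percentage point, so it estimates how the expected count changes with tumor content once the group term is accounted for.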

I agree with the paper that you need something more robust than a simple lm here. However, without a substantial background in generalized linear models, it might be hard to work this out without the help of a package like DESeq2, metagenomeSeq, edgeR, or baySeq.

ADD REPLY • link 7.2 years ago by Steven Lakin ★ 1.8k