Is linearity maintained in linear regression of RNA-Seq?
5.3 years ago
CY ▴ 750

Linear regression is used in cell-type decomposition (e.g. TIMER or CIBERSORT). I am wondering whether the linearity assumption is met during modeling.

  1. I imagine raw counts are not suitable because differing library sizes cause non-linearity. Log-transformed values probably cause non-linearity as well, right? So is there any kind of normalization that fits the linearity assumption?

  2. Besides, unlike microarray data, an RNA-Seq library is a 0-sum game, which is said to cause non-linearity (although I don't know why this causes non-linearity. It may cause some sort of dependency, but the overall expression is still the weighted sum of the expression of its components, right?)
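To make point 1 concrete, here is a minimal sketch in Python. All numbers are made up for illustration (`pure_a`, `pure_b` and `frac_a` are hypothetical values, not taken from TIMER or CIBERSORT). It shows that a mixture is a weighted sum on the linear scale, but that the same weights applied to log2 values do not give the log2 of the mixture.

```python
import math

# Hypothetical expression of one gene in two pure cell types (linear scale).
pure_a = 100.0
pure_b = 10.0
frac_a = 0.3  # assumed fraction of cell type A in the bulk sample

# On the linear scale the bulk signal is the weighted sum of its components.
bulk_linear = frac_a * pure_a + (1 - frac_a) * pure_b

# log2 of the mixture versus the weighted sum of the log2 values.
log_of_mix = math.log2(bulk_linear)
mix_of_logs = frac_a * math.log2(pure_a) + (1 - frac_a) * math.log2(pure_b)

print("bulk (linear scale):", bulk_linear)
print("log2 of mixture:    ", log_of_mix)
print("mixture of log2s:   ", mix_of_logs)
```

The two log-scale numbers differ, so a model that is linear in cell fractions on the linear scale is no longer linear after a log transform.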

RNA-Seq • 2.4k views

RNA-seq data (raw counts) can be transformed for linear modeling. Try the voom method (limma) on your RNA-seq data.


Even though logCPM (voom)-transformed expression values maintain linearity, we still face the 0-sum game issue, which causes dependence between variables and non-linearity, right? This issue is inherent in the raw data and I can't see any way to fix it.

5.3 years ago

RNA-seq count data are not normally distributed; they more closely follow a negative binomial (over-dispersed Poisson) distribution. For that reason, running ordinary linear regression on RNA-seq counts, normalised or otherwise, is not a great idea. DESeq2, for example, fits a negative binomial generalised linear model to the counts and usually derives its p-values via a Wald test applied to the model terms.
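As a rough illustration of why counts do not behave like normally distributed data, the sketch below compares the Poisson variance (equal to the mean) with the negative binomial variance under the mean-dispersion parametrisation, variance = mu + dispersion * mu^2. The dispersion value of 0.1 is an assumption chosen only for the example.

```python
def poisson_variance(mu):
    """Variance of a Poisson count with mean mu."""
    return mu

def nb_variance(mu, dispersion):
    """Variance of a negative binomial count: mu + dispersion * mu^2."""
    return mu + dispersion * mu ** 2

dispersion = 0.1  # assumed value, for illustration only
for mu in (10, 100, 1000):
    print(mu, poisson_variance(mu), nb_variance(mu, dispersion))
```

The variance grows with the mean in both cases, and much faster in the negative binomial case, which is the heteroscedasticity that the transformations below try to tame.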

If you are looking to use RNA-seq data for cell deconvolution, I would go about obtaining the normalised, transformed counts, such as logCPM (edgeR), variance-stabilised (DESeq2), or regularised log (DESeq2) expression levels. In edgeR, you may play around with the prior count that is added to the counts before the log transformation, so that zero-count genes do not produce log(0). DESeq2's transformations deal with these low-count genes in their own way.
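A simplified sketch of what a logCPM transform with a prior count does, in plain Python. This is not edgeR's exact calculation (edgeR also scales the prior count by library size); `log_cpm` and the `sample` counts are hypothetical and only meant to show the effect of the prior count on zero-count genes.

```python
import math

def log_cpm(counts, prior_count=2.0):
    """Simplified log2 counts-per-million for one sample.

    The prior count is added to every count before taking logs,
    which keeps zero-count genes finite.
    """
    lib_size = sum(counts)
    return [math.log2((c + prior_count) / lib_size * 1e6) for c in counts]

sample = [0, 5, 95, 900]  # made-up raw counts for four genes

print(log_cpm(sample))                   # default prior count
print(log_cpm(sample, prior_count=0.5))  # smaller prior: zero-count gene drops lower
```

A larger prior count shrinks the spread among low-count genes; a smaller one lets them fall further below the rest.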

Personally, I would then obtain Z-scores from the transformed data and use those for deconvolution - this is more readily interpreted. For example, you could regard genes with Z>3 as being highly expressed / representative of a tissue / cell-type, et cetera.
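A minimal sketch of the Z-score step for a single gene, assuming the values are already transformed (e.g. logCPM) and measured across several samples. The numbers in `gene` are made up and `z_scores` is a hypothetical helper.

```python
from statistics import mean, stdev

def z_scores(values):
    """Centre and scale one gene's transformed expression across samples."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        # A gene with no variation carries no information; return zeros.
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]

gene = [5.1, 5.3, 4.9, 5.0, 9.8]  # transformed expression in five samples (made up)
z = z_scores(gene)

# Report the sample in which this gene is most strongly elevated.
top = max(range(len(z)), key=lambda i: z[i])
print("z-scores:", z)
print("highest in sample index:", top)
```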


Thanks Kevin. If I understand you correctly, the methods you suggest normalize for library size and make the counts more normally distributed.

However, the 0-sum game nature of an RNA-Seq library still causes dependency between variables and non-linear relationships. Would that compromise a linear model a lot?
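To show what I mean by the 0-sum game, here is a small sketch with made-up counts (the `cpm` helper is hypothetical). Only the last gene changes in absolute terms, yet the CPM of every other gene drops because all values are forced to sum to one million.

```python
def cpm(counts):
    """Counts per million: each count as a share of the library, times 1e6."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

before = [100, 100, 100, 700]
after = [100, 100, 100, 1700]  # only the last gene's count has changed

cpm_before = cpm(before)
cpm_after = cpm(after)

print("before:", cpm_before)
print("after: ", cpm_after)
```

The first three genes are unchanged in raw counts but their CPM values are halved, which is the dependency between variables I am worried about.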


To go more in depth into the statistics of this, go to StackExchange (CrossValidated), or even Bioconductor support forum.
