Question

What is the best way to combine machine learning algorithms for feature selection such as Variable importance in Random Forest with differential expression analysis?



6.1 years ago

elmahy2005 ▴ 150

I am applying different feature selection algorithms to RNA-seq data, such as Lasso and variable importance from a Random Forest fit, to find the best set of genes for classifying a participant as normal or diseased. Alternatively, I could have run a differential expression analysis, e.g. with DESeq2, to get a set of genes that also distinguishes diseased from normal.

My question is: how do these two approaches differ from each other, and what is the best way to combine them? Any suggested reading is very welcome.

RNA-Seq rna-seq • 8.3k views

updated 6 months ago by Di Wu • 0 • written 6.1 years ago by elmahy2005 ▴ 150

score 43 · Answer 1 · 2018-03-23

NB - this answer has been updated January 28th, 2020

Update: It is important to point out that the assumptions of RandomForest® differ from those of, e.g., a regression model, so RandomForest® and other classification algorithms should certainly be considered. Treat my pointers here as just that, i.e., general guidance.

While things like RandomForest®, lasso-penalised regression (and both elastic-net and ridge regression), deep learning, machine learning, AI, neural networks, etc. may each sound great, from what I have seen, they do not perform better than a well-performed / curated differential expression analysis followed by gene signature refinement of the differentially expressed genes (DEGs) through further modeling. In fact, I frequently find that these algorithms perform worse. The 'craze' surrounding these 'buzz' words has come, but it will fade away once everyone realises that they won't bring about the next revolution in healthcare (hard work and tackling bureaucracy will).

If one could conduct the perfect study, one would do the following:

1. Differential expression analysis to identify DEGs using a program (e.g., edgeR, DESeq2, limma/voom for RNA-seq) that sufficiently normalises your data and deals with the anticipated sources of bias.
2. Further refinement / validation of the DEGs, likely using downstream methods (e.g. PCA, clustering, et cetera) applied to the transformed normalised counts from the same cohort, and / or in an independent dataset (e.g., from TCGA, CCLE for cancer, or some SRA/GEO/ArrayExpress study for other diseases). Here, people may also sometimes just manually pick the top 50 or 100 differentially expressed genes. Others, I have observed, pick them based on their own knowledge and experience of the field in which they conduct research. There are many ways to take an analysis from this point.
3. Validate your top genes in a separate cohort with a larger sample n and using a separate technology, such as high-throughput PCR (Fluidigm), NanoString, or even a customised microarray. The number of top DEGs that you choose in step 2 will be dictated by the technology chosen here.
4. Further refine the top genes through regression modeling, using either stepwise regression (forward, backward, or both) or by testing each gene independently and then keeping all those that are statistically significant independent predictors of your outcome. Clinical parameters will likely be included in these models at this stage, too, possibly as covariates (e.g. BMI status, smoking status, clinical grade, lab markers of inflammation, histology scores, et cetera).
5. Test your final regression model (or models) through various metrics and processes, such as R² shrinkage, cross-validation, Cook's distance (for outliers / influential samples), ROC analysis, and the derivation of precision, accuracy, sensitivity, and specificity. Again, there may be clinical parameters mixed with genes in these final models.
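As a rough sketch of the last two steps (per-gene testing, then checking the final model), something like the following base R code could be used. The data are simulated, and the gene names, effect sizes, 0.05 cutoff, 4/n Cook's distance threshold, and 5 folds are all assumptions made for illustration, not recommendations.

```r
# Simulated stand-in for validated expression data: 120 samples, 6 candidate genes.
# All names and effect sizes here are hypothetical.
set.seed(1)
n <- 120
genes <- paste0("gene", 1:6)
expr <- as.data.frame(matrix(rnorm(n * length(genes)), nrow = n,
                             dimnames = list(NULL, genes)))
# In this simulation only gene1 and gene2 truly drive disease status.
expr$status <- rbinom(n, size = 1,
                      prob = plogis(1.5 * expr$gene1 - 1.2 * expr$gene2))

# Test each gene independently in its own logistic regression.
p_values <- sapply(genes, function(g) {
  fit <- glm(reformulate(g, response = "status"), data = expr, family = binomial)
  coef(summary(fit))[g, "Pr(>|z|)"]
})
kept <- names(p_values)[p_values < 0.05]

# Combine the retained genes into one model.
final_formula <- reformulate(kept, response = "status")
final_fit <- glm(final_formula, data = expr, family = binomial)

# Flag potentially influential samples with Cook's distance.
# The 4/n cutoff is a common rule of thumb, not a formal test.
influential <- which(cooks.distance(final_fit) > 4 / n)

# 5-fold cross-validation of the chosen formula.
folds <- sample(rep(1:5, length.out = n))
cv_prob <- numeric(n)
for (k in 1:5) {
  test_idx <- folds == k
  fit_k <- glm(final_formula, data = expr[!test_idx, ], family = binomial)
  cv_prob[test_idx] <- predict(fit_k, newdata = expr[test_idx, ],
                               type = "response")
}
cv_class <- as.integer(cv_prob > 0.5)

# Accuracy, sensitivity, and specificity from the held-out predictions.
accuracy    <- mean(cv_class == expr$status)
sensitivity <- mean(cv_class[expr$status == 1] == 1)
specificity <- mean(cv_class[expr$status == 0] == 0)
print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 2))
```

Note that the genes are selected on the full dataset before cross-validation, so the held-out metrics here are optimistic; this is one reason why validation in a separate cohort matters.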

If you can do all of that and produce a robust gene signature, then you're talking about stuff that is equivalent to, for example, OncoType DX® and MammaPrint® in breast cancer.

Note that the 'final model' to which I refer here may look something like (example for lm() and glm()):

lm(MutationLoad ~ TP53 + TumourGrade + CCNB1 + ATM + POLE)

glm(ArthritisStatus ~ ESR + CD20 + Age + SmokingStatus, family = binomial)

MutationLoad is continuous; ArthritisStatus is binary / categorical
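To make these two model formulas concrete, here is a minimal runnable sketch on simulated data. The `cohort` data frame, its values, and the simulated effects are invented purely for illustration; only the variable names come from the formulas above.

```r
# Simulated cohort; every column and value here is made up for illustration.
set.seed(42)
n <- 200
cohort <- data.frame(
  TP53 = rnorm(n), CCNB1 = rnorm(n), ATM = rnorm(n), POLE = rnorm(n),
  TumourGrade = factor(sample(1:3, n, replace = TRUE)),
  ESR = rnorm(n, mean = 20, sd = 8), CD20 = rnorm(n),
  Age = round(runif(n, 30, 80)),
  SmokingStatus = factor(sample(c("never", "former", "current"), n,
                                replace = TRUE))
)
cohort$MutationLoad <- 50 + 8 * cohort$TP53 + 5 * cohort$POLE +
  rnorm(n, sd = 10)
cohort$ArthritisStatus <- rbinom(n, 1,
  plogis(-3 + 0.1 * cohort$ESR + 0.02 * cohort$Age))

# Continuous outcome: ordinary linear regression.
lm_fit <- lm(MutationLoad ~ TP53 + TumourGrade + CCNB1 + ATM + POLE,
             data = cohort)

# Binary outcome: logistic regression needs family = binomial;
# glm() would otherwise default to a Gaussian model.
glm_fit <- glm(ArthritisStatus ~ ESR + CD20 + Age + SmokingStatus,
               data = cohort, family = binomial)

# Coefficient table for the linear model
print(round(coef(summary(lm_fit)), 3))
# Odds ratios for the logistic model
print(round(exp(coef(glm_fit)), 3))
```

In a real analysis the gene columns would hold normalised, transformed expression values rather than random draws.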

------------------------------

My recommendations should not dissuade you from nevertheless trying out the RandomForest®. However, I am highly confident that it will not perform better than the method I have presented above.

Let me know if I can assist further

Kevin

PS - if you elect for stepwise regression in step 4, which is somewhat automated and based on AIC or BIC, then you may face criticism from a statistician. Still, this is much better than just chucking all of your data into a 'machine learning' algorithm and letting that do everything for you.
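For reference, stepwise selection by AIC or BIC can be sketched with base R's `step()`. The data below are simulated, and the gene names and effect sizes are assumptions for illustration only.

```r
# Simulated data: 150 samples, 8 candidate predictors, 2 truly informative.
set.seed(7)
n <- 150
gene_names <- paste0("gene", 1:8)
dat <- as.data.frame(matrix(rnorm(n * 8), nrow = n,
                            dimnames = list(NULL, gene_names)))
dat$status <- rbinom(n, 1, plogis(1.4 * dat$gene1 - 1.1 * dat$gene2))

# Start from the intercept-only model and let step() add or drop genes.
null_fit <- glm(status ~ 1, data = dat, family = binomial)
scope <- list(lower = ~ 1, upper = reformulate(gene_names))

# k = 2 is the AIC penalty (the default); k = log(n) is the BIC penalty.
aic_fit <- step(null_fit, scope = scope, direction = "both", trace = 0)
bic_fit <- step(null_fit, scope = scope, direction = "both",
                k = log(n), trace = 0)

print(formula(aic_fit))
print(formula(bic_fit))
```

BIC penalises each extra term more heavily than AIC once n exceeds about 7, so it tends to return the smaller model.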