Question

Determining Most Diagnostic Differentially Expressed Genes

2

10.4 years ago

Daniel Standage 4.1k

I am working on a differential expression analysis, and would like to explore the genes that are the most informative for the contrast I'm studying. I have sorted the genes by magnitude of fold change (using 1/fold change if the fold change is < 1) and looked at the top 10, but the highest fold change does not necessarily mean the most diagnostic between the two conditions. I'm reading an older paper that used leave-one-out cross-validation class prediction to identify the most predictive genes in a differential expression analysis. However, the software they cite requires an expensive paid license, and has no doubt changed significantly in 10 years.

What software is available for leave-one-out cross-validation class prediction these days? Can you provide a simple example usage? My preference would be an R-based solution, but I'm flexible on that point.

rna-seq differential-expression • 3.9k views

score 2 · Answer 1 · 2013-12-10

I'm not sure about an out-of-the-box solution, but if I were you, I would try a basic machine learning method, such as LASSO or Elastic Net (glmnet) or Random Forest (randomForest), and extract the most important features for classification using fold-based validation:

Using cvTools in R, partition your data into training and test sets
Train your model on the training set (X = expression matrix, y = classification vector)
With your trained model, predict the classes of the test set and see how it performs
Repeat this for different folds (i.e. different partitions)

If you think the predictions look reasonable, you can extract the features easily from your fit and use them for new classifications.
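The fold-based procedure above can be sketched in a few lines. This is only an illustration in plain Python with no external packages, not the glmnet or randomForest workflow itself: the nearest-centroid classifier, the mean-difference gene ranking, the function names (`top_genes`, `nearest_centroid`, `loocv`) and the toy expression values are all assumptions made up for the example, and it assumes exactly two conditions. In R, the same loop could wrap a glmnet or randomForest fit instead.

```python
from collections import Counter
from statistics import mean

def top_genes(X, y, k):
    """Rank genes by absolute difference of the two class means; return the top k indices."""
    classes = sorted(set(y))  # assumes exactly two classes
    a = [row for row, label in zip(X, y) if label == classes[0]]
    b = [row for row, label in zip(X, y) if label == classes[1]]
    n_genes = len(X[0])
    score = [abs(mean(r[g] for r in a) - mean(r[g] for r in b)) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: score[g], reverse=True)[:k]

def nearest_centroid(X, y, genes, sample):
    """Predict the class whose mean profile over `genes` is closest to `sample`."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        dist = sum((mean(r[g] for r in rows) - sample[g]) ** 2 for g in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv(X, y, k=2):
    """Leave-one-out CV: return (accuracy, how often each gene was selected)."""
    correct, selected = 0, Counter()
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]             # hold out sample i
        y_train = y[:i] + y[i + 1:]
        genes = top_genes(X_train, y_train, k)  # selection uses training samples only
        selected.update(genes)
        if nearest_centroid(X_train, y_train, genes, X[i]) == y[i]:
            correct += 1
    return correct / len(X), selected

# Toy data (made up): 6 samples x 4 genes; genes 0 and 2 separate the conditions
X = [
    [9.1, 5.0, 1.2, 4.9],
    [8.7, 5.3, 0.9, 5.2],
    [9.4, 4.8, 1.1, 5.0],
    [2.0, 5.1, 7.8, 5.1],
    [2.3, 4.9, 8.1, 4.8],
    [1.8, 5.2, 8.3, 5.0],
]
y = ["A", "A", "A", "B", "B", "B"]

accuracy, selected = loocv(X, y, k=2)
print(accuracy)
print(selected.most_common())
```

Genes that are selected in every fold are the stable candidates. Ranking the genes inside each fold, using only the training samples, keeps the held-out sample from influencing which genes are chosen.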

score 2 · Answer 2 · 2013-12-13

I ended up using the Weka data mining software. I wrote this script to convert the data to ARFF format, and then I did feature selection using information gain. As I have 11 samples, I set it to do 11-fold cross-validation, which with 11 samples is the same as leave-one-out cross-validation. To confirm, I extracted only the top 25 features (genes), loaded only these into Weka, and got near-perfect classification using several classification methods.
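For readers unfamiliar with information gain, here is a rough sketch of how a gene could be scored with it. This is not Weka's implementation: splitting each gene at its median expression value is a simplifying assumption, and the gene names and values are invented for illustration.

```python
from collections import Counter
from math import log2
from statistics import median

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction from splitting samples at the median value of one gene."""
    cut = median(values)
    low = [lab for v, lab in zip(values, labels) if v <= cut]
    high = [lab for v, lab in zip(values, labels) if v > cut]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (low, high) if part)
    return entropy(labels) - remainder

# Toy data (made up): expression of two hypothetical genes across 6 samples
labels = ["A", "A", "A", "B", "B", "B"]
genes = {
    "gene_1": [9.1, 8.7, 9.4, 2.0, 2.3, 1.8],  # separates the conditions cleanly
    "gene_2": [5.0, 5.3, 4.8, 5.1, 4.9, 5.2],  # carries little class information
}

gains = {name: information_gain(values, labels) for name, values in genes.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking)
```

A gene that splits the two conditions perfectly gets a gain equal to the entropy of the class labels (1 bit for balanced classes), while an uninformative gene scores close to 0.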