Feature Selection For Regression
12.9 years ago
User 1933 ▴ 340

From a single sequence one can derive thousands of features, but which features are the most predictive?
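To make "thousands of features" concrete, here is a minimal sketch of one common way to turn a protein sequence into numeric features: relative k-mer frequencies. The function names (`kmer_features`, `sequence_features`) and the toy sequence are made up for illustration, and this is only one of many possible encodings.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def kmer_features(sequence, k):
    """Relative frequency of every possible k-mer over the 20 standard residues."""
    sequence = sequence.upper()
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(windows)
    total = len(windows) or 1  # avoid division by zero for very short sequences
    # k-mers containing non-standard residues (e.g. X) are simply not reported
    return {"".join(p): counts["".join(p)] / total
            for p in product(AMINO_ACIDS, repeat=k)}

def sequence_features(sequence, max_k=2):
    """Length plus k-mer frequencies for k = 1..max_k."""
    features = {"length": float(len(sequence))}
    for k in range(1, max_k + 1):
        features.update(kmer_features(sequence, k))
    return features

example = sequence_features("MKTAYIAKQR")  # hypothetical toy sequence
print(len(example))  # 1 + 20 + 400 = 421 features
```

With `max_k=3` the same sequence already yields 8421 features, most of them zero, which is why some form of selection is needed before regression.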

Basically, we can start with a simple PCA to see which features explain most of the variance and then project the others onto those. But sometimes PCA reveals no structure at all, and every feature seems to contribute equally.

Also, in a classification problem, we can start with a simple tree-based classifier, prune the tree and evaluate the model. But how would you proceed when the problem is regression, say, predicting the solubility of a protein from its sequence?

Is there a baseline or standard procedure among bioinformaticians?

feature prediction sequence • 3.3k views
12.9 years ago
Neilfws 49k

Put simply, feature selection can be "manual" (e.g. through PCA inspection) or "automated" (some algorithm which selects the most predictive features). Classification can be "supervised" (e.g. linear discriminant analysis, where classes and features are supplied) or "unsupervised" (no classes are supplied and an algorithm, such as clustering, has to find the groups itself).

There is no single "baseline" or procedure; you need to consult the statistical literature and decide on an appropriate method, based on the data that you have and what precisely you want to do.

I'd recommend starting with this review: "Penalized feature selection and classification in bioinformatics." It's a good overview. Penalized feature selection is a commonly used approach: as the name implies, a penalty on the size of the model coefficients is added to the fitting criterion, so features that perform poorly as predictors are shrunk towards zero and, with an L1 penalty, dropped from the model entirely.
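Since the question is about regression, here is a minimal sketch of the same penalization idea applied to least squares (the lasso), fitted by cyclic coordinate descent in plain Python. It is not taken from the review; the function names (`lasso`, `soft_threshold`, `standardize`), the penalty value and the simulated data are all assumptions made for illustration.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def standardize(columns):
    """Centre each feature column and scale it to unit variance."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = (sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5 or 1.0
        out.append([(v - mean) / sd for v in col])
    return out

def lasso(columns, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 on standardized features."""
    n = len(y)
    X = standardize(columns)
    y_mean = sum(y) / n
    residual = [v - y_mean for v in y]  # all coefficients start at zero
    coef = [0.0] * len(X)
    for _ in range(n_iter):
        for j, col in enumerate(X):
            # covariance of feature j with the residual, adding back its own contribution
            rho = sum(x * (r + x * coef[j]) for x, r in zip(col, residual)) / n
            new = soft_threshold(rho, lam)  # unit-variance columns, so no rescaling needed
            if new != coef[j]:
                delta = new - coef[j]
                residual = [r - x * delta for x, r in zip(col, residual)]
                coef[j] = new
    return coef

# Simulated data: only features 0 and 3 carry signal, the other eight are noise.
rng = random.Random(0)
n_samples, n_features = 100, 10
columns = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_features)]
y = [3 * columns[0][i] - 2 * columns[3][i] + rng.gauss(0, 0.1) for i in range(n_samples)]

coef = lasso(columns, y, lam=0.5)
selected = [j for j, b in enumerate(coef) if b != 0.0]
print(selected)  # indices of the features that survive the penalty
```

In practice the penalty `lam` would be chosen by cross-validation rather than fixed by hand, and an established implementation would be preferable to this toy loop.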

You may also want to look at "Classification of gene microarrays by penalized logistic regression" (PLR). PLR provides an estimate of the underlying class probability for each sample, rather than only a class label. That paper also describes recursive feature elimination, which repeatedly fits a model and discards the least useful features until the desired number remains.

Another interesting paper: "penalizedSVM: a R-package for feature selection SVM classification", describes methods which "provide automatic feature selection for SVM classification tasks."


If you define your problem more specifically than "the problem is regression", I might do just that :-)


Thanks for the comments, but I wish you would address the "regression" problem more specifically.


I hope the question is more specific now: "prediction of solubility of a protein from its sequence". Perhaps you could edit your response or add some more comments.

12.9 years ago

You could look into random forests, which handle regression natively and also report an importance score for each feature that can be used for feature selection. There is a good implementation in the randomForest R package.
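To show what is going on under the hood, here is a toy random forest regressor in plain Python that ranks features by the total reduction in squared error they produce across all splits. This is a sketch of the general technique, not the randomForest package's code; every function name, parameter default and the simulated data are assumptions made for illustration.

```python
import random

def sq_dev(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(X, y, rows, features):
    """Return (gain, feature, threshold) with the largest drop in squared error."""
    parent = sq_dev([y[i] for i in rows])
    best = (0.0, None, None)
    for f in features:
        for t in sorted({X[i][f] for i in rows})[:-1]:  # skip max so both sides are non-empty
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            gain = parent - sq_dev(left) - sq_dev(right)
            if gain > best[0]:
                best = (gain, f, t)
    return best

def grow_tree(X, y, rows, rng, mtry, depth, importance):
    """Grow one regression tree; add each split's gain to importance[feature]."""
    leaf = sum(y[i] for i in rows) / len(rows)  # mean response of this node
    if depth == 0 or len(rows) < 5:
        return leaf
    features = rng.sample(range(len(X[0])), mtry)  # random feature subset per split
    gain, f, t = best_split(X, y, rows, features)
    if f is None:
        return leaf
    importance[f] += gain
    left = [i for i in rows if X[i][f] <= t]
    right = [i for i in rows if X[i][f] > t]
    return (f, t,
            grow_tree(X, y, left, rng, mtry, depth - 1, importance),
            grow_tree(X, y, right, rng, mtry, depth - 1, importance))

def random_forest(X, y, n_trees=30, mtry=2, depth=4, seed=0):
    """Fit bootstrapped trees; return them with normalized feature importances."""
    rng = random.Random(seed)
    importance = [0.0] * len(X[0])
    trees = []
    for _ in range(n_trees):
        rows = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        trees.append(grow_tree(X, y, rows, rng, mtry, depth, importance))
    total = sum(importance) or 1.0
    return trees, [v / total for v in importance]

def predict(trees, x):
    """Average the predictions of all trees for one feature vector x."""
    preds = []
    for node in trees:
        while isinstance(node, tuple):  # internal nodes are tuples, leaves are floats
            f, t, left, right = node
            node = left if x[f] <= t else right
        preds.append(node)
    return sum(preds) / len(preds)

# Simulated data: feature 0 has a strong effect, feature 2 a weaker one, the rest are noise.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(80)]
y = [4 * row[0] + 2 * row[2] + rng.gauss(0, 0.5) for row in X]

trees, importance = random_forest(X, y)
ranking = sorted(range(len(importance)), key=lambda j: -importance[j])
print(ranking)  # feature indices, most important first
```

For real data you would use the randomForest package (or an equivalent library) and keep the top-ranked features, ideally checking the choice on held-out sequences.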
