Question

Could you suggest a proper feature selection method for mixed-type variable data?

0

7.0 years ago

morovatunc ▴ 550

Hi,

We are working on cancer mutation data, and we found that mutations are enriched in the regions where our TF binds. Since we couldn't find a strong reason for this enrichment, we wanted to use feature selection methods.

We thought of a model such as:

Y (number of mutations in these regions) ~ DHS sites + histone marks + other TF binding events (such as CTCF, EP300, etc.) + RNA-seq reads

So in this matrix, we will have a single row for each binding event of our TF, and each variable will be either categorical (such as DHS site) or numerical (such as RNA-seq reads).

I have seen that people have applied random forest algorithms to predict mutation occurrence in specific regions. But our aim is not to predict anything; we simply want to ask, "What is the cause of the mutation occurrence?" Therefore, I want to separate my data into two subsets (train vs. test).

Please forgive my ignorance of the terminology and consider me a frustrated grad student.

Best regards,

Tunc.

machine learning • 1.5k views

updated 7.0 years ago by cfay • 0 • written 7.0 years ago by morovatunc ▴ 550

1

There are many ways to do this. Random forests are a good choice: after training, you can look at the "variable importance", which ranks the variables of your model according to their contribution to the prediction. You can check the variable importance section of the caret package documentation.

Another choice is lasso regression, which tries to shrink the coefficients of unimportant variables to exactly zero. One thing to consider is normalizing your variables if they are on different scales, so that the resulting coefficients are comparable. There are some good tutorials on lasso, for example here and here.
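To make the lasso idea concrete, here is a minimal sketch in plain Python (standard library only), not the caret or any other package implementation. It standardizes each variable and then runs coordinate descent, so that unimportant variables end up with a coefficient of exactly zero. The data are simulated, and the variable names (`dhs`, `histone`, `rnaseq`, `mutations`) and the penalty `alpha=0.3` are assumptions made up for illustration.

```python
import random
import statistics


def standardize(columns):
    """Scale each column to mean 0 and standard deviation 1."""
    scaled = []
    for col in columns:
        mean = statistics.fmean(col)
        sd = statistics.pstdev(col)
        scaled.append([(v - mean) / sd for v in col])
    return scaled


def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0 inside the threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(columns, y, alpha, n_iter=100):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + alpha*||b||_1."""
    n = len(y)
    y_mean = statistics.fmean(y)
    residual = [v - y_mean for v in y]  # residual while all coefficients are 0
    coefs = [0.0] * len(columns)
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Covariance of feature j with the residual that ignores feature j
            rho = sum(c * (r + coefs[j] * c) for c, r in zip(col, residual)) / n
            norm = sum(c * c for c in col) / n
            new = soft_threshold(rho, alpha) / norm
            delta = new - coefs[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                coefs[j] = new
    return coefs


# Simulated data (hypothetical): one binary and two numeric variables on
# very different scales; only dhs and histone influence the mutation count.
rng = random.Random(0)
n = 500
dhs = [float(rng.random() < 0.5) for _ in range(n)]
histone = [rng.gauss(10, 2) for _ in range(n)]
rnaseq = [rng.gauss(1000, 300) for _ in range(n)]
mutations = [3 * d + 0.5 * h + rng.gauss(0, 0.5) for d, h in zip(dhs, histone)]

names = ["dhs", "histone", "rnaseq"]
coefs = lasso(standardize([dhs, histone, rnaseq]), mutations, alpha=0.3)
selected = [name for name, c in zip(names, coefs) if c != 0.0]
for name, c in zip(names, coefs):
    print(f"{name}: {c:.3f}")
print("selected:", selected)
```

Without the standardization step, the penalty would act very differently on `rnaseq` (values around 1000) than on `dhs` (0 or 1), which is why the scaling matters. In practice the penalty strength is usually chosen by cross-validation rather than fixed by hand.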

Hope it helps.

7.0 years ago by Sirus ▴ 820

0

@Sirus thank you very much for your comment. The part about dividing the data into train and test sets confuses me a lot. Can I only train on my data and not do any prediction? Like I said in the question, I don't want to predict anything. Is this possible with random forest?

7.0 years ago by morovatunc ▴ 550

score 2 · Accepted Answer · 2017-04-30

@morovatunc, to avoid overfitting, you can use all your data with, for example, 10-fold cross-validation (the caret package can do that for you). Then you'll get your variable importance. In theory, a signal that is truly important should be important in any subset of the genome, so a 10-fold CV will help eliminate some of the noise.
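As a rough sketch of that idea in plain Python: split the rows into 10 folds, compute an importance score for each variable on every training split (all folds but one), and then check whether the ranking is stable across splits. The importance score used here is just the absolute Pearson correlation, a simple stand-in for the random forest or lasso importance, and the simulated data and variable names are assumptions for illustration only.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def abs_pearson(x, y):
    """Absolute Pearson correlation; a stand-in importance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return abs(cov) / (vx * vy) ** 0.5


def importance_across_folds(features, y, k=10):
    """Score every feature on each training split (all folds but one)."""
    folds = k_fold_indices(len(y), k)
    scores = {name: [] for name in features}
    for held_out in folds:
        skip = set(held_out)
        train = [i for i in range(len(y)) if i not in skip]
        y_train = [y[i] for i in train]
        for name, col in features.items():
            scores[name].append(abs_pearson([col[i] for i in train], y_train))
    return scores


# Simulated data (hypothetical): only dhs and histone drive the outcome.
rng = random.Random(1)
n = 300
features = {
    "dhs": [float(rng.random() < 0.5) for _ in range(n)],
    "histone": [rng.gauss(10, 2) for _ in range(n)],
    "rnaseq": [rng.gauss(1000, 300) for _ in range(n)],
}
mutations = [
    3 * d + 0.5 * h + rng.gauss(0, 0.5)
    for d, h in zip(features["dhs"], features["histone"])
]

scores = importance_across_folds(features, mutations, k=10)
mean_scores = {name: sum(v) / len(v) for name, v in scores.items()}
for name, v in scores.items():
    print(f"{name}: mean={mean_scores[name]:.3f} min={min(v):.3f} max={max(v):.3f}")
```

A variable whose score stays high in every split is less likely to be noise than one that ranks highly in only a few of them. The held-out fold is not used here; with a real model it would also give an estimate of prediction error.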