Question

How to apply multidimensional models with a low number of samples? (What should I do about the train-test split?)

3.8 years ago

galapacheco • 0

I've been working with huge datasets ever since I started in bioinformatics, but now I'm facing a problem with a new dataset that has very few samples.

I have 5 groups (Control and 4 diseases) with varying frequencies, and a set of 10 features corresponding to the $\log_2(1 + 2^{-\Delta C_T})$ values of gene expression (I had to use a pseudocount to "nullify" my zeros, preventing them from becoming NA or -Inf).
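A minimal sketch of a pseudocount transform of this kind, written in Python purely for illustration. It assumes that an undetected gene is coded as `None` and treated as zero expression; that coding and the function names are hypothetical, not taken from the dataset.

```python
import math

def relative_expression(delta_ct):
    """Return 2^(-delta_ct); a missing value (None) is assumed to mean zero expression."""
    return 0.0 if delta_ct is None else 2.0 ** (-delta_ct)

def log2_pseudocount(delta_ct):
    """Return log2(1 + 2^(-delta_ct)); the pseudocount of 1 keeps zero expression at 0."""
    return math.log2(1.0 + relative_expression(delta_ct))

# Zero expression maps to 0 instead of -Inf; delta_ct = 0 maps to log2(2) = 1.
for value in (None, 0.0, 3.0):
    print(value, log2_pseudocount(value))
```

Without the pseudocount, zero expression would give $\log_2(0) = -\infty$; with it, the same sample maps to $\log_2(1) = 0$.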

Yet I only have a maximum of 20 entries per group (and a minimum of 3, because some features are full of NAs). My plan is to combine some of these features with clinical values in order to fit a model; I have a complete dataset of 87 of them with very few NAs.

But I'm stuck with:

a) How do I split the data into train and test sets to fit my models with so little data?
b) How can I do feature selection with my first 8 gene features? I ran some ANOVAs (despite the data not being normal, and the dataset being full of extreme outliers detected by 'identify_outliers()' and easily visible in boxplots) and some MANOVAs, which gave very few "significant" features (even though the data don't fulfill the assumptions, such as normality).
c) Should I use multinomial logistic regression? By the rule of thumb I need about 10 samples per feature, but I don't know any other multiclass models that assign a probability.
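To make the rule of thumb in (c) concrete, here is a rough arithmetic sketch in Python. It relies on the standard fact that a multinomial logistic regression with K classes and p predictors has (K - 1)(p + 1) free parameters, intercepts included. The sample totals assume every group sits at the stated maximum (20) or minimum (3), which is an interpretation of the numbers above; the function names are hypothetical.

```python
def multinomial_param_count(n_classes, n_features):
    """Free parameters of a multinomial logit: one coefficient vector plus an
    intercept for every class except the reference class."""
    return (n_classes - 1) * (n_features + 1)

def samples_needed(n_features, per_feature=10):
    """Samples suggested by the '10 samples per feature' rule of thumb."""
    return per_feature * n_features

n_classes, n_features = 5, 10
print("parameters to estimate:", multinomial_param_count(n_classes, n_features))
print("samples suggested by the rule:", samples_needed(n_features))
# Assumed best and worst case: all groups at 20 entries, or all at 3.
print("samples available (best, worst):", n_classes * 20, n_classes * 3)
```

Under these assumptions the rule asks for about 100 samples for 10 features, which is only met if every group has its full 20 entries, and the model would still estimate 44 parameters.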

Any recommendations?

r stats multidimensional logistic • 611 views

updated 3.8 years ago by zx8754 • written 3.8 years ago by galapacheco