Question

Negative dataset selection

0

5.2 years ago

Saad Khan ▴ 440

Hi,

I am working on a classification problem using sequence data. I have positive data that belongs to a gene region, and negative data that I selected based on the most common 5 nucleotides in the center. My model seems to be overfitting and gives very high accuracy, so I am not convinced that I have chosen my negative data correctly. Could any of the machine learning experts in bioinformatics offer some advice or point me to a best-practices paper? Would running BLAST with the positive-sequence FASTA database against the rest of the gene regions, and selecting the top x matches that have the motif, be a better solution?

Figure: mRNA seq classification, positive and negative samples

I am using a CNN Inception model on one-hot encoded sequences for training and prediction. My problem is that my data is imbalanced and the model trained on the current data set is biased. Above are the sequence logos (WebLogo) of the positive and negative sequences in the data.
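For readers unfamiliar with the encoding mentioned here, the following is a minimal sketch of one-hot encoding a nucleotide sequence. The channel order `ACGT`, the function name `one_hot`, and the choice to map unknown bases such as `N` to an all-zero row are illustrative assumptions, not details taken from the actual pipeline.

```python
# Illustrative sketch only: one-hot encode a nucleotide sequence.
ALPHABET = "ACGT"  # assumed channel order


def one_hot(seq):
    """Return a len(seq) x 4 matrix of 0/1 values.

    RNA input is handled by treating U as T. Any other symbol
    (for example N) becomes an all-zero row (an assumption).
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    matrix = []
    for base in seq.upper().replace("U", "T"):
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        matrix.append(row)
    return matrix


for row in one_hot("ACGUN"):
    print(row)
```

Each row corresponds to one position in the sequence, so a batch of equal-length sequences becomes a 3-D array that a 1-D convolutional layer can consume.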

I feel that maybe there is a better way to choose negative data. Currently I have taken sequences from gene regions that do not overlap the positive mRNA fragments. Is there a better approach to negative data selection?
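To make the current selection scheme concrete, here is a rough sketch of taking fixed-length windows from a gene sequence while skipping any window that overlaps a positive fragment. The half-open 0-based coordinates, the helper names `overlaps` and `negative_windows`, and the toy sequence are all assumptions made for illustration.

```python
# Illustrative sketch only: candidate negatives that avoid positive fragments.

def overlaps(start, end, intervals):
    """True if the half-open window [start, end) overlaps any interval."""
    return any(start < p_end and p_start < end for p_start, p_end in intervals)


def negative_windows(gene_seq, positive_intervals, window, step=None):
    """Return (start, subsequence) pairs for windows with no positive overlap."""
    step = step or window  # default: non-overlapping tiling
    selected = []
    for start in range(0, len(gene_seq) - window + 1, step):
        end = start + window
        if not overlaps(start, end, positive_intervals):
            selected.append((start, gene_seq[start:end]))
    return selected


gene = "ACGTACGTACGTACGTACGT"  # toy 20 nt sequence (hypothetical)
positives = [(4, 10)]          # positive fragment at positions 4..9
candidates = negative_windows(gene, positives, window=4)
print([start for start, _ in candidates])
```

Windows chosen this way only guarantee no positional overlap; they say nothing about how similar the negatives are to the positives in composition.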

Thanks

PS: I do use CD-HIT to remove sequence redundancy and reduce sequence similarity.

machine-learning sequence classification RNA DNA • 1.6k views

ADD COMMENT • link 5.2 years ago by Saad Khan ▴ 440

0

I don't think you're giving enough information for anyone to help you. It seems that you're not convinced that you're taking the right approach, so what is the biological question you're trying to address? What is the data? How is it represented (i.e. what kind of features do you use)? Which classifier do you want to use?

ADD REPLY • link 5.2 years ago by Jean-Karim Heriche 27k

0

Hi @Jean-Karim Heriche I have edited the question to provide more information. Looking forward to your reply. Thanks!

ADD REPLY • link 5.2 years ago by Saad Khan ▴ 440

1

Just to add to what @Jean-Karim Heriche said, it's difficult to know what makes a good negative set if we don't know what it is you are trying to train. Where have these positive examples come from? What features of them do you wish your net to learn? Are you sure you are overfitting? What is your performance on test data not used in training?

Why is your data imbalanced? If you are "creating" negatives, then presumably you can generate the same number as you have positives?
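As a rough illustration of both points (equal class sizes, and test data never seen in training), the sketch below downsamples the larger class and then holds out a fraction of each class. The function name `balance_and_split`, the 20% test fraction, the fixed seed, and the placeholder sequence IDs are all assumptions for the example.

```python
# Illustrative sketch only: balance classes, then hold out a test set.
import random


def balance_and_split(positives, negatives, test_fraction=0.2, seed=0):
    """Downsample to equal class sizes and return (train, test) lists.

    Each item is a (sequence, label) pair with label 1 for positives
    and 0 for negatives. No sequence appears in both train and test,
    provided the inputs contain no duplicates.
    """
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    pos = rng.sample(positives, n)
    neg = rng.sample(negatives, n)
    n_test = max(1, int(round(n * test_fraction)))
    test = [(s, 1) for s in pos[:n_test]] + [(s, 0) for s in neg[:n_test]]
    train = [(s, 1) for s in pos[n_test:]] + [(s, 0) for s in neg[n_test:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Placeholder IDs stand in for real sequences.
positives = [f"pos_{i}" for i in range(10)]
negatives = [f"neg_{i}" for i in range(50)]
train, test = balance_and_split(positives, negatives)
print(len(train), len(test))
```

If accuracy on the held-out set is close to the training accuracy, the concern shifts from overfitting to whether the negatives are simply too easy to tell apart from the positives.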

ADD REPLY • link 5.2 years ago by i.sudbery 19k

0

You still haven't told us what it is you're trying to do, i.e. what are you trying to predict and what question is this supposed to answer? If you're trying to detect a motif in sequences, you should try standard approaches first, e.g. HMM-based ones. Just because deep learning is fashionable and applicable to many problems doesn't mean it's always a good idea. Also, very deep networks are known to be susceptible to overfitting. There are many tricks to overcome overfitting in CNNs, but the main one is simply to get gigantic training data sets.

ADD REPLY • link 5.2 years ago by Jean-Karim Heriche 27k