I am working on a classification problem using sequence data. My positive data come from a gene region, and I selected my negative data based on the most common 5 nucleotides in the center. The model appears to be overfitting: it reports very high accuracy, and I am not convinced that I have chosen my negative data correctly. Could any machine learning experts in bioinformatics offer some advice or point me to a best-practices paper? Would it be a better solution to run BLAST with a FASTA database of the positive sequences against the rest of the gene regions and select the top x matches that contain the motif?
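To make the selection rule concrete, here is a minimal sketch of one reading of "the most common 5 nucleotides in the center": find the most frequent central 5-mer among the positives, then keep only candidate negatives that carry the same central 5-mer. The function names and the toy sequences are hypothetical, and the interpretation itself is an assumption rather than the exact pipeline used.

```python
from collections import Counter


def center_kmer(seq, k=5):
    """Return the k-mer at the center of seq (left-biased if it cannot be exact)."""
    start = (len(seq) - k) // 2
    return seq[start:start + k]


def most_common_center_kmer(positives, k=5):
    """Most frequent central k-mer across the positive sequences."""
    counts = Counter(center_kmer(s, k) for s in positives if len(s) >= k)
    return counts.most_common(1)[0][0]


def select_negatives(candidates, motif):
    """Keep candidates whose central k-mer equals the motif (assumed rule)."""
    k = len(motif)
    return [s for s in candidates if len(s) >= k and center_kmer(s, k) == motif]


# Toy data, invented for illustration only.
positives = ["AAGGACTTT", "CCGGACTAA", "TTGGACTCC", "AATTTTTGG"]
candidates = ["ACGGACTGT", "ACGTACGTA", "GGGGACTCC"]

motif = most_common_center_kmer(positives)
negatives = select_negatives(candidates, motif)
print(motif, negatives)
```

If negatives chosen this way share the central motif but differ strongly from the positives in the flanking positions, a classifier may be separating the classes on flank composition alone, which would be consistent with a suspiciously high accuracy.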
I am using an Inception-style CNN on one-hot encoded sequences for training and prediction. My problem is that the data are imbalanced, and the model trained on the current data set is biased. The sequence logos (WebLogo) of the positive and negative sequences are shown above.
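For reference, a minimal sketch of the two ingredients mentioned here: one-hot encoding of a nucleotide sequence, and per-class weights as one common way to counter class imbalance. The weighting formula `n_samples / (n_classes * count)` is the usual "balanced" heuristic; whether it suits this data set is an assumption, and the helper names are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows ordered A, C, G, T.

    Characters outside the alphabet (e.g. N) become an all-zero row.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


encoded = one_hot("ACGTN")
labels = [1, 0, 0, 0]  # toy labels: one positive, three negatives
weights = balanced_class_weights(labels)
print(encoded)
print(weights)
```

A dictionary like `weights` can be passed to frameworks that support per-class loss weighting, for example the `class_weight` argument of Keras `Model.fit`, assuming that is the training framework in use. With imbalanced data, accuracy alone can also be misleading, so precision, recall, or a precision-recall curve are worth checking alongside it.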
I feel there may be a better way to choose the negative data. Currently, I take sequences from the gene that do not overlap the positive mRNA fragments. Is there a better approach for selecting negative data?
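The current strategy, as described, can be sketched roughly as follows: draw fixed-length windows from the gene sequence and reject any window that overlaps a positive fragment. The coordinates are half-open `[start, end)` intervals; the function names, the toy gene, and the interval values are hypothetical and only illustrate the idea.

```python
import random


def overlaps(start, end, intervals):
    """True if half-open [start, end) overlaps any half-open interval."""
    return any(start < e and s < end for s, e in intervals)


def sample_negative_windows(gene_seq, positive_intervals, length, n,
                            seed=0, max_tries=10000):
    """Sample up to n windows that overlap neither positives nor each other."""
    rng = random.Random(seed)
    taken = list(positive_intervals)
    windows = []
    last_start = len(gene_seq) - length
    if last_start < 0:
        return windows
    tries = 0
    while len(windows) < n and tries < max_tries:
        tries += 1
        start = rng.randint(0, last_start)  # inclusive on both ends
        end = start + length
        if not overlaps(start, end, taken):
            windows.append((start, gene_seq[start:end]))
            taken.append((start, end))  # avoid overlapping negatives
    return windows


gene = "ACGT" * 50                       # 200 nt toy sequence
positive_regions = [(40, 80), (120, 150)]  # invented coordinates
negs = sample_negative_windows(gene, positive_regions, length=20, n=3)
for start, seq in negs:
    print(start, seq)
```

Sampling negatives at the same length as the positives, and from the same genes, removes some trivial cues (length, gene-level composition) that a CNN could otherwise exploit instead of the signal of interest.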
PS: I do use CD-HIT to remove redundant sequences and reduce sequence similarity.