Does it make sense to use publicly available RNA-Seq data from GEO to train a machine learning model that classifies subjects as cases or controls, and then use that model to predict cases and controls in a completely different dataset? For instance, say I have coronary artery disease RNA-Seq data whose samples I want to classify as NASH/noNASH. Can I train a Random Forest classifier on a NASH-related GEO dataset and test it on my coronary artery disease RNA-Seq data?
Instead of reprocessing the data yourself, why not use one of the reprocessed data sources, such as recount (others also exist)? https://jhubiostatistics.shinyapps.io/recount/
Of course, these generally exist only for datasets in GEO or the public SRA.
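Whichever source the training counts come from, the two datasets need to share a feature space before a classifier trained on one can be applied to the other. Here is a minimal sketch of that step, assuming each dataset is a plain mapping from gene ID to per-sample counts. The gene names, the toy counts, and the helper names (`shared_genes`, `log_zscore`, `harmonize`) are all hypothetical, and per-dataset z-scoring is only one possible normalization, not necessarily the right one for your data.

```python
import math
from statistics import mean, pstdev


def shared_genes(train, test):
    """Genes present in both datasets, sorted for a stable column order."""
    return sorted(set(train) & set(test))


def log_zscore(counts):
    """log2(count + 1), then z-score across the samples of ONE dataset."""
    logged = [math.log2(c + 1) for c in counts]
    mu, sd = mean(logged), pstdev(logged)
    # A gene with no variation carries no signal; map it to zeros.
    return [0.0 if sd == 0 else (x - mu) / sd for x in logged]


def harmonize(dataset, genes):
    """Return a samples x genes matrix (list of rows) restricted to `genes`."""
    columns = [log_zscore(dataset[g]) for g in genes]
    return [list(row) for row in zip(*columns)]


# Hypothetical toy data: gene ID -> raw counts, one value per sample.
train_counts = {
    "GENE_A": [10, 200, 15, 180],
    "GENE_B": [5, 5, 5, 5],
    "GENE_C": [0, 3, 1, 7],
}
test_counts = {
    "GENE_A": [1000, 30],
    "GENE_C": [2, 9],
    "GENE_D": [4, 4],
}

genes = shared_genes(train_counts, test_counts)
X_train = harmonize(train_counts, genes)  # 4 samples x 2 genes
X_test = harmonize(test_counts, genes)    # 2 samples x 2 genes
```

`X_train`, together with the case/control labels, could then be passed to a classifier such as a Random Forest, and `X_test` to its prediction step. Note that scaling each dataset separately implicitly assumes the two cohorts have broadly similar composition, which may not hold when the test cohort was collected for a different disease.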