Hello,
I would like to apply machine learning methods to RNA-seq data from TCGA for survival time analysis. The samples have to be comparable, so I understand I should use a between-sample normalization method such as the one in DESeq2.
I would like to split my dataset into train/test subsets.
1) Is it possible to normalize the training set only using DESeq2 and later use it for normalizing the test set, so the test samples will not affect the normalization of the training set?
2) Will normalizing the training and test subsets separately result in non-comparable samples between train and test?
3) Are there other between-sample normalization methods that are better than upper quartile normalization for this purpose?
Thanks
Thank you for your answer.
3) Are there other between-sample normalization methods that would let me fit a normalizer on the training set and later use it to normalize the test set?
2) I would like the absolute values to be comparable, i.e. if a given gene has a value of X in two samples, it should mean the same thing in both.
1) I do not want information to leak from the test set into the training set, but I would still like the two sets to be comparable. Is it acceptable to use only some genes to compute the scaling factors and then apply those factors to the other genes? I assume that other genes should have different scaling factors.
Thanks
You can have a look at this paper for normalization methods: https://academic.oup.com/bib/article/14/6/671/189645
Raw counts can be informative, depending on the machine learning method you're using. If you include the library depth as an input feature then, again depending on the algorithm, you might be fine with raw data.
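To make the library-depth idea concrete, here is a minimal sketch, not a recommendation for any particular model. It assumes a samples x genes matrix of raw counts in which every sample has at least one read. The function name `add_depth_feature` and the choice of log transforms are mine, for illustration only. Each row is processed on its own, so nothing from the test set can leak into the training set.

```python
import numpy as np

def add_depth_feature(counts):
    """Return log1p(raw counts) with log(library depth) appended as the last column.

    counts: array-like, shape (n_samples, n_genes), raw read counts.
    Assumes every sample has a non-zero total count.
    """
    counts = np.asarray(counts, dtype=float)
    # Library depth = total reads per sample (one value per row).
    depth = counts.sum(axis=1, keepdims=True)
    # Purely per-sample transform: no statistic is shared between samples.
    return np.hstack([np.log1p(counts), np.log(depth)])
```

Whether a model can actually use the extra column to account for depth depends on the algorithm, as said above.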
I agree that keeping the test set separate from the training set is best; it also means you could apply the tool to a new dataset. I would suggest using a set of predefined genes for the normalization. I don't know whether all the samples are from the same tissue (or organism?), so I can't say whether such a set is available for your data.
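One way to "fit on train, apply to test" is a median-of-ratios scheme (the idea behind DESeq2's size factors) with the per-gene reference frozen from the training set. The sketch below is a plain NumPy re-implementation of that idea under my own assumptions, not DESeq2 itself. The function names and the optional `gene_idx` argument (for a predefined gene set) are hypothetical. It assumes samples x genes matrices of raw counts with the same gene order in train and test.

```python
import numpy as np

def fit_reference(train_counts, gene_idx=None):
    """Learn a per-gene reference (log geometric mean) from training samples only.

    train_counts: array-like, shape (n_samples, n_genes), raw counts.
    gene_idx: optional indices of a predefined gene set; defaults to all genes.
    Returns the indices of the genes actually used and their log geometric means.
    """
    counts = np.asarray(train_counts, dtype=float)
    if gene_idx is None:
        gene_idx = np.arange(counts.shape[1])
    gene_idx = np.asarray(gene_idx)
    # Keep only genes with a non-zero count in every training sample,
    # so the log geometric mean is finite.
    keep = (counts[:, gene_idx] > 0).all(axis=0)
    gene_idx = gene_idx[keep]
    log_ref = np.log(counts[:, gene_idx]).mean(axis=0)
    return gene_idx, log_ref

def size_factors(counts, gene_idx, log_ref):
    """Per-sample size factor: median ratio of the sample to the frozen reference."""
    sub = np.asarray(counts, dtype=float)[:, gene_idx]
    with np.errstate(divide="ignore"):
        log_ratios = np.log(sub) - log_ref
    # Zero counts give -inf; ignore them when taking the median.
    log_ratios[~np.isfinite(log_ratios)] = np.nan
    return np.exp(np.nanmedian(log_ratios, axis=1))

def normalize(counts, gene_idx, log_ref):
    """Divide each sample by its size factor, computed against the frozen reference."""
    sf = size_factors(counts, gene_idx, log_ref)
    return np.asarray(counts, dtype=float) / sf[:, None]
```

Usage would be `gene_idx, log_ref = fit_reference(train)`, then `normalize(train, gene_idx, log_ref)` and `normalize(test, gene_idx, log_ref)`. Each test sample is scaled against the training reference only, so the test set never influences the training normalization, and a sample that arrives later can be normalized the same way.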