DESeq2 - Between sample normalization of train/test subsets
5.1 years ago

Hello,

I would like to apply machine learning methods to RNA-seq data from TCGA for survival time analysis. The samples have to be comparable, so I understand I should use a between-sample normalization method such as the one in DESeq2.

I would like to split my dataset into train/test subsets.

1) Is it possible to fit the DESeq2 normalization on the training set only and later apply it to the test set, so that the test samples do not affect the normalization of the training set?

2) Will normalizing the training and test subsets separately result in non-comparable samples between train and test?

3) Are there other between-sample normalization methods, which are better than upper quartile normalization, for this purpose?

Thanks

RNA-Seq normalization DESeq2 TCGA survival • 2.6k views
5.1 years ago
Asaf 10k

I'll answer in reverse order:

3) DESeq2 doesn't use upper quartile normalization. It uses a different method, median-of-ratios size factors, which I consider better.
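For reference, the median-of-ratios size factor can be written as below. This is a sketch of the standard definition, not a quote from the DESeq2 source; the symbols are my own notation: k_{ij} is the raw count of gene i in sample j, m is the number of samples, and g_i is the geometric mean of gene i across samples. Genes with g_i = 0 (a zero count in any sample) are left out of the median.

```latex
\hat{s}_j = \operatorname*{median}_{i \,:\, g_i > 0} \frac{k_{ij}}{g_i},
\qquad
g_i = \Bigl( \prod_{v=1}^{m} k_{iv} \Bigr)^{1/m}
```

Normalized counts are then k_{ij} / \hat{s}_j. Note that g_i depends on every sample in the set, which is why the choice of samples (train only, or train plus test) matters.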

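2) It all depends on whether your ML method uses the actual gene expression values or only their ranks or the relations between them. If it uses only ranks, you might as well not normalize the data, since dividing a sample by a size factor does not change the ordering of its genes.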
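The point about ranks can be checked with a small sketch. The counts and the size factor of 1.7 are made up for illustration; the helper `ranks` is hypothetical, not part of any library.

```python
def ranks(values):
    """Return the 0-based rank of each value (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


# Hypothetical raw counts for five genes in one sample.
sample = [120, 0, 35, 980, 12]

# The same sample after dividing by an arbitrary positive size factor.
scaled = [count / 1.7 for count in sample]

# Within-sample ranks are unchanged by the scaling.
same_ranks = ranks(sample) == ranks(scaled)
```

This only covers ranks within a sample. Features that compare raw values across samples would still be affected by library depth.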

1) I would say either normalize everything together, or select a set of genes to be used for normalization. With a fixed gene set you might be able to normalize in separate runs, provided you assume that these genes have the same overall expression level across samples.
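A minimal sketch of the fixed-gene-set idea, assuming each sample is scaled by the geometric mean of its own counts over the chosen genes. This is not DESeq2's method, and the gene names are placeholders, not a recommended reference set. Because the factor is computed per sample, no information passes between train and test.

```python
import math


def reference_size_factor(counts, reference_genes):
    """Geometric mean of ONE sample's counts over a fixed reference gene set.

    counts: dict mapping gene name -> raw count for a single sample.
    Assumes every reference gene is expressed (count > 0) in the sample.
    """
    values = [counts[gene] for gene in reference_genes]
    if any(value <= 0 for value in values):
        raise ValueError("a reference gene has a zero count in this sample")
    return math.exp(sum(math.log(value) for value in values) / len(values))


def normalize_sample(counts, reference_genes):
    """Divide every count in the sample by its reference-gene size factor."""
    factor = reference_size_factor(counts, reference_genes)
    return {gene: count / factor for gene, count in counts.items()}


# Hypothetical data: sample_b is sample_a sequenced at twice the depth.
reference_genes = ["ACTB", "GAPDH"]
sample_a = {"ACTB": 400, "GAPDH": 100, "TP53": 50}
sample_b = {"ACTB": 800, "GAPDH": 200, "TP53": 100}

norm_a = normalize_sample(sample_a, reference_genes)
norm_b = normalize_sample(sample_b, reference_genes)
```

After scaling, the two samples have matching values for every gene, but only because the reference genes really are stable in this toy data. That is the assumption the approach depends on.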


Thank you for your answer.

3) Are there other between-sample normalization methods that would allow me to fit a normalizer to the training set and later use it to normalize the test set?

2) I would like the absolute values to be comparable, i.e. if a given gene has a value of X in two samples, it should mean the same thing in both.

1) I do not want information to leak from the test set into the training set, but I would still like the two to be comparable. Is it acceptable to compute the scaling factors from only some genes and use them to normalize the other genes? I would assume that the other genes should have different scaling factors.

Thanks


You can have a look at this paper for normalization methods: https://academic.oup.com/bib/article/14/6/671/189645

The raw counts can tell you a lot, depending on the machine learning method you're using. If you take the library depth as an input then, again depending on the algorithm, you might be okay with raw data.

I agree that keeping the test and training sets separate is best; it also means you could use the tool on a new dataset. I would suggest using a set of predefined genes for the normalization. I don't know whether all the samples are from the same tissue (or organism?), which would determine whether such a set exists.
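One way to keep train and test separate while staying comparable is to freeze the per-gene reference on the training set and compute each new sample's size factor against it. The sketch below is a plain-Python version of median-of-ratios under that assumption; it is not DESeq2 code, and `fit_reference`, `size_factor`, and `normalize` are hypothetical names. I believe DESeq2's `estimateSizeFactors` accepts `geoMeans` and `controlGenes` arguments that serve a similar purpose, but check its documentation before relying on that.

```python
import math
from statistics import median


def fit_reference(train_samples):
    """Per-gene log geometric mean, computed from training samples only.

    train_samples: list of dicts mapping gene name -> raw count.
    Genes with a zero count in any training sample are dropped.
    """
    reference = {}
    for gene in train_samples[0]:
        values = [sample[gene] for sample in train_samples]
        if all(value > 0 for value in values):
            reference[gene] = sum(math.log(v) for v in values) / len(values)
    return reference


def size_factor(sample, reference):
    """Median ratio of a sample's counts to the frozen training reference."""
    log_ratios = [
        math.log(sample[gene]) - log_mean
        for gene, log_mean in reference.items()
        if sample.get(gene, 0) > 0
    ]
    if not log_ratios:
        raise ValueError("sample shares no expressed gene with the reference")
    return math.exp(median(log_ratios))


def normalize(sample, reference):
    """Scale one sample (train or test) by its size factor."""
    factor = size_factor(sample, reference)
    return {gene: count / factor for gene, count in sample.items()}


# Hypothetical counts: the test sample never enters fit_reference.
train = [
    {"g1": 100, "g2": 200, "g3": 50},
    {"g1": 200, "g2": 400, "g3": 100},
]
test_sample = {"g1": 300, "g2": 600, "g3": 150}

reference = fit_reference(train)
norm_train = normalize(train[0], reference)
norm_test = normalize(test_sample, reference)
```

To restrict the reference to a predefined gene set, filter the dict returned by `fit_reference` before passing it on. Training samples are normalized with the same function, so a given value has the same meaning in both subsets.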
