Dear all,
I have a question about cleaning RNA-seq counts in the context of tumor deconvolution based on RNA-seq data.
When performing a classic RNA-seq differential expression analysis (DEA), it is common to remove genes that are not expressed, or barely expressed, across the samples, which in my experience removes roughly 5-30% of the genes. This filtering step is the only one I am aware of that is commonly performed for DEA. I am interested in tumor deconvolution, where one also starts from an expression matrix. I am wondering whether this matrix should be preprocessed more extensively than just removing lowly expressed genes, in order to remove non-informative genes and potential noise.
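For reference, the low-expression filter mentioned above can be sketched as follows. This is only a minimal illustration, assuming a genes x samples matrix of raw counts; the function name `filter_low_expression` and the thresholds `min_cpm` and `min_samples` are hypothetical choices for this example, not values taken from any particular package.

```python
import numpy as np

def filter_low_expression(counts, min_cpm=1.0, min_samples=2):
    """Keep genes with at least `min_cpm` counts per million in at least
    `min_samples` samples. `counts` is a genes x samples raw count matrix.
    Assumes every sample has a non-zero library size."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)       # total counts per sample
    cpm = counts / lib_sizes * 1e6       # counts per million
    keep = (cpm >= min_cpm).sum(axis=1) >= min_samples
    return counts[keep], keep

# Toy matrix: 4 genes x 3 samples (made-up numbers, for illustration only)
counts = np.array([
    [0, 0, 0],                 # never expressed
    [500, 600, 550],           # clearly expressed
    [1, 0, 0],                 # detected in a single sample only
    [100000, 120000, 90000],   # highly expressed
])
filtered, keep = filter_low_expression(counts)
```

The thresholds are arbitrary here and would normally be tuned to the library sizes and the size of the smallest group of samples.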
I recently had an introduction to the analysis of methylation data, that is, what to do once you have the percentage of methylation per CpG. Some people remove CpGs that have little variance across samples (mainly unmethylated CpGs, but not only those) as well as CpGs on the X and Y chromosomes. They also check whether the methylation of certain CpGs correlates with clinical variables such as age or gender, when available, in order to filter those CpGs out or adjust for the variables.
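To make these two criteria concrete, here is a rough sketch of a variance filter and a per-feature correlation with a clinical covariate. It assumes a features x samples matrix (CpGs or genes in rows); the function names `top_variable_features` and `covariate_correlation`, and the toy values, are made up for illustration. What cut-off to apply to the correlations is left open.

```python
import numpy as np

def top_variable_features(matrix, n_top):
    """Return the (sorted) row indices of the `n_top` most variable features."""
    matrix = np.asarray(matrix, dtype=float)
    variances = matrix.var(axis=1, ddof=1)      # sample variance per feature
    order = np.argsort(variances)[::-1]         # most variable first
    return np.sort(order[:n_top])

def covariate_correlation(matrix, covariate):
    """Pearson correlation of each feature (row) with one numeric covariate.
    Constant features give NaN."""
    matrix = np.asarray(matrix, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    xc = matrix - matrix.mean(axis=1, keepdims=True)
    yc = covariate - covariate.mean()
    denom = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        return (xc @ yc) / denom

# Toy data: 3 CpGs x 4 samples of methylation fractions, plus patient age
beta = np.array([
    [0.02, 0.03, 0.02, 0.03],   # nearly constant, unmethylated
    [0.10, 0.40, 0.60, 0.90],   # increases with age
    [0.80, 0.20, 0.70, 0.30],   # variable, weakly related to age
])
age = np.array([30, 45, 60, 75])

top = top_variable_features(beta, n_top=2)
r = covariate_correlation(beta, age)
```

The same two functions apply unchanged to a log-transformed expression matrix, which is the case I am asking about.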
I wonder why so many filtering criteria are applied to methylation data that are not (to my knowledge) used on RNA-seq data. Do you know why, and do you think they should be used for tumor deconvolution based on RNA-seq data?
Thank you in advance for your comments. Jane
Sorry, I just saw your answer. Yes, I mean the identification of cell populations in bulk samples by tumor deconvolution.
Using pure cell populations is an interesting approach, and it is what supervised methods such as CIBERSORT, EPIC and xCell do on gene expression data. I assume that with supervised approaches the "cleaning of the dataset" matters less than with unsupervised approaches. I can use both approaches in parallel, but my focus here is mainly on unsupervised methods. That is why I would like to start with a clean and meaningful matrix.
Z-scores are intuitive. What I cannot figure out is whether, and why, using them might improve the analysis itself, beyond making the results easier to interpret.
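To be explicit about what I mean by z-scores, here is a minimal sketch of per-gene standardization on log-scale expression. The function name `gene_zscores` and the toy values are assumptions for illustration; real counts would need a pseudo-count before taking the log.

```python
import numpy as np

def gene_zscores(log_expr):
    """Standardize each gene (row) to mean 0 and standard deviation 1
    across samples. Constant genes give NaN."""
    log_expr = np.asarray(log_expr, dtype=float)
    mean = log_expr.mean(axis=1, keepdims=True)
    sd = log_expr.std(axis=1, ddof=1, keepdims=True)
    sd[sd == 0] = np.nan                 # avoid dividing by zero
    return (log_expr - mean) / sd

# Two genes with the same fold changes but a 100-fold difference in level
expr = np.array([
    [1000.0, 2000.0, 4000.0],
    [10.0, 20.0, 40.0],
])
z = gene_zscores(np.log2(expr))
```

After this step both genes have identical profiles, so the absolute expression level no longer plays any role and every gene carries the same weight in a distance- or variance-based method. Note also that z-scores contain negative values, so they cannot be passed directly to a method that requires a non-negative matrix, such as NMF.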