Fit GLM using part of the data
16 months ago

I have a single-cell RNA-seq experiment with five different treatments. The treatments are likely to result in different cell types, although this isn't known at this stage of the analysis. Regardless, they can be assumed to be quite different. Due to technical problems, the size factors also differ greatly between the treatments.

I now want to use edgeR to find differentially expressed genes between the treatments, but at this stage only treatments 1-3 are of interest. I wonder whether I should use the full dataset for estimating the dispersion and fitting the model, or only the treatments I'm interested in comparing at this stage. On the one hand, you should get better tagwise dispersion estimates with the full data. On the other hand, because this is single-cell data from FACS-sorted cells that represent different cell types, a gene could very well have zero expression in the treatments I'm interested in and quite high expression in some of the others (or the opposite). What would be the statistically more correct approach here? I would like to err on the side of caution. Thanks!

14 months ago
Republic of Ireland

There is no real right or wrong here. You should start by using the entire dataset and applying the usual filtering for low-count transcripts and transcripts with many zeros. With scRNA-seq, as I understand it, imputation methods are also available, which you may consider.
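
To make the filtering step concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not edgeR's own filter (edgeR provides `filterByExpr` in R for this purpose). The toy counts, the `keep_gene` helper, and the thresholds `min_count=5` and `min_cells=3` are all assumptions chosen for the example.

```python
# Hypothetical toy counts: each key is a gene, each list holds
# one read count per cell. Real data would come from your count matrix.
counts = {
    "geneA": [0, 0, 1, 0, 0, 0],      # almost all zeros
    "geneB": [12, 30, 0, 25, 18, 40], # well expressed in most cells
    "geneC": [5, 0, 0, 0, 7, 0],      # expressed in only two cells
}

def keep_gene(row, min_count=5, min_cells=3):
    """Keep a gene if at least `min_cells` cells have >= `min_count` reads.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    return sum(1 for c in row if c >= min_count) >= min_cells

# Drop low-count genes and genes that are zero in most cells.
kept = {gene: row for gene, row in counts.items() if keep_gene(row)}
print(sorted(kept))
```

In practice `min_cells` is often tied to the size of the smallest group, so that a gene expressed in only one treatment is not filtered out just because it is zero everywhere else.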

If you run into a brick wall with the entire dataset, consider reducing it in size. In some cases, if the covariation between groups within your dataset is so great and/or inconsistent (or heteroskedastic), splitting the dataset may be the only way.
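
One rough way to see whether groups are heteroskedastic for a given gene is to compare a crude per-group dispersion. The sketch below uses a simple method-of-moments estimate from the negative binomial relation var = mu + phi * mu^2. This is not how edgeR estimates dispersion (it uses a likelihood-based estimate with shrinkage across genes); the `mom_dispersion` helper and the counts are hypothetical and only meant to illustrate the point.

```python
from statistics import mean, variance

def mom_dispersion(counts):
    """Crude method-of-moments NB dispersion for one gene in one group.

    Solves var = mu + phi * mu^2 for phi and truncates at zero.
    Returns None when every count is zero, because the dispersion
    cannot be estimated from an all-zero group.
    """
    mu = mean(counts)
    if mu == 0:
        return None
    return max((variance(counts) - mu) / mu ** 2, 0.0)

# Hypothetical counts for a single gene, split by treatment.
gene_by_treatment = {
    "T1": [0, 0, 0, 0],       # not expressed in a treatment of interest
    "T2": [8, 12, 10, 10],    # tight, roughly Poisson-like counts
    "T4": [50, 400, 5, 120],  # highly variable in another treatment
}

for name, vals in gene_by_treatment.items():
    print(name, mom_dispersion(vals))
```

If the per-group values differ this much for many genes, a single dispersion per gene shared across all five treatments is a compromise between them, which is one argument for fitting the treatments of interest separately.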


