Question

Fit GLM using part of the data

5.6 years ago

rasmus.agren • 0

I have a single-cell RNA-seq experiment with five different treatments. The treatments are likely to result in different cell types, although this isn't known at this stage of the analysis. Regardless, they can be assumed to be quite different. Due to technical problems the size factors are also very different for the different treatments.

I now want to use edgeR to find differentially expressed genes between the treatments, but at this stage only treatments 1-3 are of interest. Should I use the full dataset for estimating the dispersion and fitting the model, or only the treatments I'm interested in comparing? On the one hand, you should get better tagwise estimates with the full data. On the other hand, since this is single-cell data on FACS-sorted cells that represent different cell types, a gene could very well have zero expression in the treatments I'm interested in and quite high expression in some of the others (or the opposite). What would be the statistically more correct approach here? I would like to err on the side of caution. Thanks!

edger glm scRNA-seq

updated 5.6 years ago by Kevin Blighe • written 5.6 years ago by rasmus.agren

score 1 · Answer 1 · 2018-09-21

There is no strictly right or wrong answer here. Start with the entire dataset and apply the usual filtering for low-count transcripts and transcripts with many zeros. For scRNA-seq, as I understand it, imputation methods are also available, which you may consider.
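To make the filtering step concrete, here is a minimal, language-agnostic sketch in plain Python. It is not edgeR code (in edgeR, `filterByExpr` is the usual route), and the function name `filter_genes`, the thresholds, and the toy counts are all assumptions chosen only for illustration. It keeps a gene only if enough cells reach a minimum count and the fraction of zero cells is not too high.

```python
def filter_genes(counts, min_count=5, min_cells=3, max_zero_frac=0.9):
    """Keep genes with enough expressing cells and not too many zeros.

    counts: dict mapping gene name -> list of raw counts, one per cell.
    All thresholds are illustrative assumptions, not recommended defaults.
    """
    kept = {}
    for gene, row in counts.items():
        # Number of cells where the gene reaches the minimum count.
        n_expressing = sum(1 for c in row if c >= min_count)
        # Fraction of cells with exactly zero counts.
        zero_frac = sum(1 for c in row if c == 0) / len(row)
        if n_expressing >= min_cells and zero_frac <= max_zero_frac:
            kept[gene] = row
    return kept


# Toy data: three genes measured in six cells (hypothetical values).
counts = {
    "geneA": [0, 0, 0, 0, 0, 1],      # almost never detected
    "geneB": [12, 8, 0, 15, 9, 7],    # robustly expressed
    "geneC": [0, 0, 0, 0, 30, 0],     # high in a single cell only
}

filtered = filter_genes(counts, min_count=5, min_cells=3, max_zero_frac=0.5)
print(sorted(filtered))
```

With these toy thresholds only `geneB` survives. On real data the cutoffs would need to reflect the library sizes and the size of the smallest group.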

If you run into a brick wall with the entire dataset, then consider reducing the dataset in size. In some cases, if the covariation between groups within your dataset is very large and/or inconsistent (heteroskedastic), splitting the dataset may be the only way.
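One rough way to see whether groups behave inconsistently is to compare a per-group variability measure for the same gene. The sketch below is plain Python, not edgeR's estimator, and rests on stated assumptions: it uses the negative binomial moment relation var = mu + phi * mu^2 to get a crude dispersion phi = (var - mu) / mu^2, ignores size factors entirely, and the helper name `moment_dispersion` and the counts are hypothetical.

```python
from statistics import mean, variance


def moment_dispersion(values):
    """Crude method-of-moments NB dispersion for one gene in one group.

    Returns None when the group mean is zero (dispersion is undefined),
    and truncates negative estimates at 0.0.
    """
    m = mean(values)
    if m == 0:
        return None
    v = variance(values)  # sample variance (n - 1 denominator)
    return max((v - m) / (m * m), 0.0)


# Hypothetical counts for a single gene in three treatment groups.
gene_counts = {
    "treatment1": [10, 12, 9, 11, 8],   # tight around the mean
    "treatment2": [0, 40, 2, 55, 1],    # highly variable
    "treatment4": [0, 0, 0, 0, 0],      # not expressed at all
}

per_group = {g: moment_dispersion(v) for g, v in gene_counts.items()}
for group, phi in per_group.items():
    print(group, phi)
```

If many genes show estimates that differ widely between groups, or are undefined in some groups because of all-zero counts, a single dispersion per gene across all treatments is a poor summary, which is the situation where fitting only the treatments of interest becomes more defensible.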

Kevin