Fit GLM using part of the data
5.6 years ago

I have a single-cell RNA-seq experiment with five different treatments. The treatments are likely to result in different cell types, although this isn't known at this stage of the analysis. Regardless, the groups can be assumed to be quite different from each other. Due to technical problems, the size factors also differ substantially between treatments.

I now want to use edgeR to find differentially expressed genes between the treatments, but at this stage only treatments 1-3 are of interest. Should I use the full dataset for estimating the dispersion and fitting the model, or only the treatments I want to compare at this stage? On the one hand, the full data should give better tagwise dispersion estimates. On the other hand, this is single-cell data on FACS-sorted cells that represent different cell types, so a gene could very well have zero expression in the treatments I'm interested in and quite high expression in some of the others (or the opposite). Which approach is statistically more correct here? I would like to err on the side of caution. Thanks!

edger glm scRNA-seq • 1.0k views
5.6 years ago

There is no real right or wrong here. You should start by using the entire dataset and doing the normal filtering for low-count transcripts and transcripts with many zeros. For scRNA-seq, as I understand it, there are also imputation methods available, which you may want to consider.

If you run into a brick wall using the entire dataset, then consider reducing it to the groups of interest. In some cases, if the variation between groups within your dataset is very large and/or inconsistent (i.e. heteroskedastic), splitting the dataset may be the only way.
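
To make the two options concrete, here is a minimal sketch in plain Python, not edgeR itself. The toy counts, the gene and treatment names, the helper functions and the thresholds (`min_count=5`, `min_cells=2`) are all made up for illustration. It only shows that the set of genes surviving a simple low-count filter can differ depending on whether you filter on all cells or only on the treatments being compared.

```python
# Hypothetical toy data: one list of counts per gene, one entry per cell.
counts = {
    "geneA": [0, 0, 0, 0, 12, 30],  # silent in t1-t3, high in t4/t5
    "geneB": [5, 8, 0, 7, 9, 6],    # expressed in most cells
    "geneC": [0, 1, 0, 0, 0, 0],    # almost all zeros
}
treatment = ["t1", "t1", "t2", "t3", "t4", "t5"]  # label for each cell


def subset_cells(counts, labels, keep):
    """Keep only the cells whose treatment label is in `keep`."""
    idx = [i for i, lab in enumerate(labels) if lab in keep]
    sub = {gene: [row[i] for i in idx] for gene, row in counts.items()}
    return sub, [labels[i] for i in idx]


def filter_genes(counts, min_count=5, min_cells=2):
    """Keep genes with at least `min_count` reads in at least `min_cells` cells.

    Both thresholds are arbitrary illustration values, not recommendations.
    """
    return {
        gene: row
        for gene, row in counts.items()
        if sum(c >= min_count for c in row) >= min_cells
    }


# Option 1: filter on the entire dataset.
full_kept = filter_genes(counts)

# Option 2: restrict to the treatments of interest first, then filter.
sub_counts, sub_labels = subset_cells(counts, treatment, {"t1", "t2", "t3"})
sub_kept = filter_genes(sub_counts)

print("kept with all cells:  ", sorted(full_kept))
print("kept with t1-t3 only: ", sorted(sub_kept))
```

In this toy case geneA passes the filter only when the other treatments are included, which is the kind of gene the question is worried about: it would enter the model with all-zero counts in the groups actually being compared.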

Kevin
