Entering edit mode

Hi,

I'm still a bit new to computational analysis of single cell data, but I'm doing my best to understand why things are done.

As I understand it, when people cluster their data, typically they do some feature selection, usually taking the most variable genes across the entire dataset or through iteratively subsetting their data and doing this with each subset. The expression of these variable genes is then fit to a negative binomial distribution to estimate scaled expression values, which will then be fed into dimensional reduction and/or clustering algorithms. I'm having a difficult time trying to understand what the purpose of fitting to a negative binomial distribution is. Is it that this takes into account relative abundances better? Please tell me if I'm on the right track or way off: Say gene A is expressed lowly in most cells -- it's 1 or 2 copies in some cells, but relatively highly at 5 copies in a few cells. Gene B is comparatively expressed much higher -- several cells express it at 10-20 copies, while other cells express it relatively highly at 50 copies per cell.

So this fitting to a negative binomial distribution in essence helps take into account the nature of the expression of Gene X to provide a normalized, scaled, and centered value of 2 or 3 for both of these genes, despite the differences in overall expression? And its fit to a negative binomial distribution because gene expression follows this distribution? I've heard this but don't know what paper showed this.

I'd appreciate any explanations or links that might clarify this more.

Thanks, Eric

Entering edit mode

Yes, the fact of the matter is that RNA-seq count data naturally follows a negative binomial distribution, so, one has to *model* the data as such if one wishes to derive statistics from it. Original analysis methods modeled it as a Poisson but it was found that this still resulted in false-positive associations after having fitted the model. Microarray gene expression data, on the other hand, follows a normal distribution.

Practically, what does fitting a model actually mean? - What happens is that we literally create a logistic regression model of the data and specify the negative binomial as the family. In pseudocode:

```
glm(outcome ~ gene1 + CounfoundingFactors, family="NegativeBinomial")
```

That is fit for all genes. Other parameters are added for the purposes of nomalisation and dispersion.

With a model of the RNA-seq data, we can then make statistical inferences.

Some of your other questions are specific to the normalisation method that one actually uses. Typically, what are known as *size factors* are calculated, which will cater for the scenario that you mentioned.

Kevin