Question

scRNA-seq, SEURAT, NormalizeData, ScaleData, PCA, CCA ...

4

5.4 years ago

Bogdan ★ 1.4k

Dear all,

As I have just started reading the Seurat documentation for scRNA-seq (among a few other packages), I would appreciate your answers and insights on the following:

After NormalizeData(), why is ScaleData() needed?
Do FindVariableGenes(), RunPCA(), and FindClusters() work on normalized data or on scaled data?
Does ScaleData() work on raw data or on normalized data?
Does RunCCA() work on normalized data or on scaled data?

An example of R code is at: https://satijalab.org/seurat/immune_alignment.html

Thanks a lot!

-- bogdan

scRNA-seq • 9.8k views

ADD COMMENT • link 5.4 years ago by Bogdan ★ 1.4k

score 8 · Answer 1 · 2018-11-16

8

5.4 years ago

igor 13k

The alignment tutorial focuses on the alignment steps. You should consult the more basic PBMC tutorial, which explains the other steps in more detail.

After NormalizeData(), why is ScaleData() needed?

ScaleData() scales and centers genes in the dataset, which standardizes the range of expression values for each gene. The function can additionally regress out unwanted sources of variation such as technical noise.
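To illustrate what "scales and centers" means for a single gene, here is a minimal sketch in plain Python. It is not Seurat's implementation: the gene names and values are made up, and the use of the sample standard deviation is an assumption made for the example.

```python
from statistics import mean, stdev

def scale_gene(values):
    """Center one gene's values to mean 0 and scale them to unit SD."""
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation (assumption for this sketch)
    if sd == 0:
        # A gene with no variation carries no signal; return zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

# Hypothetical normalized expression: one list of per-cell values per gene.
normalized = {
    "GeneA": [0.0, 1.0, 2.0, 3.0],
    "GeneB": [10.0, 30.0, 50.0, 70.0],
}

# Each gene is scaled on its own, using only its values across cells.
scaled = {gene: scale_gene(vals) for gene, vals in normalized.items()}
```

The two genes differ widely in range before scaling, but because they follow the same pattern across cells they end up with identical scaled values.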

Do FindVariableGenes(), RunPCA(), and FindClusters() work on normalized data or on scaled data?

Scaled data is used for dimensionality reduction and clustering.

Is ScaleData() absolutely needed in scRNA-seq analysis?

It is recommended. Technically, you could skip that step and set the scale.data slot to anything if you would like to see the results without it. However, the results would be dominated by the signal of a few highly expressed genes.
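A rough way to see this dominance effect is to look at each gene's share of the total variance, since variance is what PCA picks up on. This is only a toy sketch in plain Python with made-up gene names and values. It does not run PCA or call Seurat.

```python
from statistics import mean, stdev, variance

# Hypothetical normalized values for three genes across five cells.
normalized = {
    "HighGene": [50.0, 80.0, 20.0, 90.0, 60.0],
    "LowGene1": [0.5, 1.5, 1.0, 0.0, 2.0],
    "LowGene2": [0.2, 0.1, 0.4, 0.3, 0.0],
}

def variance_share(matrix):
    """Fraction of the summed per-gene variance contributed by each gene."""
    per_gene = {g: variance(v) for g, v in matrix.items()}
    total = sum(per_gene.values())
    return {g: var / total for g, var in per_gene.items()}

def zscore(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

share_before = variance_share(normalized)
share_after = variance_share({g: zscore(v) for g, v in normalized.items()})
```

Before scaling, HighGene accounts for almost all of the variance. After scaling, each of the three genes contributes one third.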

Does RunCCA() work on normalized data or on scaled data?

Scaled data is used for dimensionality reduction.

ADD COMMENT • link 5.0 years ago by igor 13k

2

"Data is scaled to regress out "uninteresting" sources of variation such as technical noise."

I believe this is not quite right: data is scaled so that each feature (a gene, in this context) contributes similarly to the downstream steps. Regressing out unwanted signal, which by the way should be used with caution (1), is optional and is not the primary objective of data scaling.

1) A blog post on regression on scRNA-seq datasets

ADD REPLY • link 5.4 years ago by Haci ▴ 680

1

Yes, your interpretation is correct. The main purpose of scaling is to make data comparable across genes. Regression is a secondary (and optional) part of scaling. From ?ScaleData:

Scales and centers genes in the dataset. If variables are provided in vars.to.regress, they are individually regressed against each gene, and the resulting residuals are then scaled and centered.
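The order described in that help text (regress first, then scale and center the residuals) can be sketched for one gene and one covariate as follows. This is a plain-Python illustration, not Seurat's code: the simple least-squares fit, the covariate name n_umi, and all values are assumptions made for the example.

```python
from statistics import mean, stdev

def regress_then_scale(expression, covariate):
    """Fit expression ~ covariate by least squares, then z-score the residuals."""
    x_bar, y_bar = mean(covariate), mean(expression)
    sxx = sum((x - x_bar) ** 2 for x in covariate)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, expression))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residuals: the part of the gene's expression the covariate does not explain.
    residuals = [y - (intercept + slope * x) for x, y in zip(covariate, expression)]
    mu, sd = mean(residuals), stdev(residuals)
    return [(r - mu) / sd for r in residuals]

# Hypothetical per-cell covariate (e.g. total UMI count) and one gene's values.
n_umi = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
gene = [1.2, 1.9, 3.1, 3.8, 5.3]

scaled_residuals = regress_then_scale(gene, n_umi)
```

The result has mean 0 and unit standard deviation, and it is no longer linearly related to the covariate. Each gene would be processed this way independently.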

Indeed, in past versions, regression was a separate function from ScaleData.

ADD REPLY • link 5.4 years ago by Santosh Anand 5.7k

1

I guess I (and the Seurat tutorial) did not explicitly mention the primary objective. Yes, the scaling adjusts the range of expression values across all the genes, which will likely impact the downstream analysis far more than any additional regression. When I originally wrote the answer, I was thinking specifically in the context of the Seurat workflow in addition to the default scale function.

ADD REPLY • link 5.4 years ago by igor 13k

0

Hi Igor, thank you for your reply. If I may add a question:

Does ScaleData() work on raw data or on normalized data?

Thank you!

ADD REPLY • link 5.4 years ago by Bogdan ★ 1.4k

0

On normalized data. You go from raw to normalized to scaled.
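As a rough sketch of that order in plain Python: raw counts are first normalized per cell, and the per-gene scaling then starts from the normalized values. The counts and gene names are made up, and the normalization shown (divide by the cell total, multiply by a scale factor of 10,000, take log1p) is an assumption modeled on the log-normalization described in the Seurat tutorials. Check ?NormalizeData for the exact defaults in your version.

```python
import math
from statistics import mean, stdev

# Hypothetical raw UMI counts: one list of per-cell counts per gene.
raw = {
    "GeneA": [1, 4, 0],
    "GeneB": [9, 6, 20],
    "GeneC": [0, 10, 0],
}
n_cells = 3
SCALE_FACTOR = 10_000  # assumed scale factor for this sketch

# Step 1 (raw -> normalized): divide by each cell's total, rescale, log-transform.
cell_totals = [sum(raw[g][c] for g in raw) for c in range(n_cells)]
normalized = {
    g: [math.log1p(counts[c] / cell_totals[c] * SCALE_FACTOR) for c in range(n_cells)]
    for g, counts in raw.items()
}

# Step 2 (normalized -> scaled): center and scale each gene across cells.
def zscore(values):
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

scaled = {g: zscore(v) for g, v in normalized.items()}
```

Note that step 1 works per cell (column-wise) while step 2 works per gene (row-wise), which is why the two steps are not interchangeable.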

ADD REPLY • link 5.4 years ago by igor 13k

0

ScaleData() scales and centers genes in the dataset, which standardizes the range of expression values across all the genes. The function additionally regresses out unwanted sources of variation such as technical noise.

I think it would be more accurate to say "which standardizes the range of expression values for each gene." I think ScaleData() adjusts the expression values gene by gene. For each gene, it builds a regression model using that gene's expression level across all cells, then shifts the residuals to a mean of zero and divides them by their standard deviation. The phrase "across all the genes" is not accurate.

Am I right?

ADD REPLY • link 5.0 years ago by chansigit • 0

0

I edited the statement to make it clearer. I originally meant that all genes (as opposed to all cells) are scaled, but I can see how it could be interpreted differently.

ADD REPLY • link 5.0 years ago by igor 13k