Single-cell RNA-seq: how many PCs to use for t-SNE/UMAP?
3.0 years ago

How many principal components (PCs) should be used to build the neighbourhood graph for t-SNE/UMAP in the context of single-cell RNA-seq analysis?

Note that I'm not so concerned about performance in terms of speed/memory; I'm interested in accuracy, i.e. denoising the data without removing relevant biological signal.

Options I have seen:

  • A fixed lower number, e.g. 10
  • A fixed higher number, e.g. 50 (the default in Scanpy)
  • Elbow plot visual estimate
  • Elbow plot statistical cutoff (which one?)
  • JackStraw (e.g. in Seurat)
  • Molecular Cross Validation (https://www.biorxiv.org/content/10.1101/786269v1 )
  • Use ICA instead of PCA (but how many ICs?)
  • It doesn't matter (show me the evidence)
  • Other
umap pca scRNA-seq t-sne RNA-seq

From my experience, the elbow method tends to underestimate the required number of PCs for a detailed clustering. I personally prefer to start with the default of 50 and see how the clustering and manifold embeddings (UMAP, or whatever you use) look. Check marker genes and see whether things meet biological expectations: do you at least see the cell types separated that you know must be present, and do any unexpected clusters make sense? If not, you can always go back and tune the number of PCs, the number of nearest neighbours, etc. Be sure to set an explicit seed to make UMAP reproducible.

I think there is no single correct answer to this, many possible ones, and many wrong ones. In the end it comes down to whether you can make sense of whatever you see and can turn it into a biological message that fits your research question. What good are many "interesting" new clusters (or cell types, or just noise) if you can't validate any of it, heh ;-)
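
A minimal Scanpy sketch of that kind of iteration, assuming `adata` is a normalised AnnData object already subset to HVGs (the marker genes below are just example PBMC markers; use whatever fits your tissue):

```python
# Start from the default 50 PCs, fix the seeds, and re-run with fewer PCs
# if the clustering looks over- or under-split.
import scanpy as sc

sc.pp.pca(adata, n_comps=50, random_state=0)          # compute 50 PCs
sc.pp.neighbors(adata, n_pcs=50, n_neighbors=15)      # neighbour graph on those PCs
sc.tl.umap(adata, random_state=0)                     # explicit seed -> reproducible layout
sc.tl.leiden(adata, resolution=1.0, random_state=0)   # clustering on the same graph

# Sanity-check known markers before trusting any "new" clusters.
sc.pl.umap(adata, color=["leiden", "CD3E", "CD14", "MS4A1"])

# If things look noisy, lower n_pcs (e.g. to 30) in sc.pp.neighbors and
# re-run UMAP/Leiden; the PCA itself does not need to be recomputed.
```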

Further reads:

https://bioconductor.org/packages/release/bioc/vignettes/scran/inst/doc/scran.html#5_Automated_PC_choice

http://bioconductor.org/books/release/OSCA/dimensionality-reduction.html#choosing-the-number-of-pcs

There are also a number of similar questions at support.bioconductor.org if memory serves.


I take your point, ATpoint, but it's difficult to automate such a procedure. Also, do you do this for all the other parameters, e.g. different filtering of genes/cells/mitochondrial content, different numbers of HVGs, different scaling/normalisation, different clustering resolutions, etc.? The parameter space and time involved become rather large.


I would not automate these processes, or would at least critically check everything and change parameters if needed. This is the most critical step, as the entire project will be based on this initial analysis.


It depends. Checks are essential, but it can become too subjective if you're tweaking parameters to find 'what you want'. I also need standardised processing of many datasets, where consistency is important.

3.0 years ago

Via Aaron Lun, my PCAtools R package provides four ways to infer the optimal number of PCs; please see https://github.com/kevinblighe/PCAtools#determine-optimum-number-of-pcs-to-retain

These are:

  • Elbow method (findElbowPoint())
  • Horn’s parallel analysis (parallelPCA())
  • Marchenko-Pastur limit (chooseMarchenkoPastur())
  • Gavish-Donoho method (chooseGavishDonoho())

With regard to which is 'better', well, how long is a piece of string? None of them is necessarily better than the others, but, hopefully, by using each you can get a range of PC numbers that are close to optimal.

Kevin


Nice, thanks, is there a Python port?


Unfortunately, no, sorry. If I had my own lab group with funding, this is something that I would have; I am currently maintaining these packages in my spare time.
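
If someone needs a quick approximation in Python in the meantime, a rough sketch of a Marchenko-Pastur-style cutoff on the PCA eigenvalues could look like the following. This is an illustration only, not a port of PCAtools: it assumes the genes have been scaled to roughly unit variance (so the noise variance is approximated as 1), and `X_scaled` is a hypothetical dense cells-by-genes matrix.

```python
# Keep the PCs whose eigenvalues exceed the upper edge of the
# Marchenko-Pastur distribution, i.e. those unlikely to arise from pure noise.
import numpy as np
from sklearn.decomposition import PCA

def n_pcs_marchenko_pastur(X, noise_variance=1.0):
    """X: dense cells x genes matrix, already normalised and scaled."""
    n_cells, n_genes = X.shape
    gamma = n_genes / n_cells
    # Largest eigenvalue expected from a pure-noise matrix of this shape.
    mp_upper = noise_variance * (1.0 + np.sqrt(gamma)) ** 2
    pca = PCA(n_components=min(100, n_cells - 1, n_genes)).fit(X)
    eigenvalues = pca.explained_variance_   # eigenvalues of the covariance matrix
    return int(np.sum(eigenvalues > mp_upper))

# n_keep = n_pcs_marchenko_pastur(X_scaled)
```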

3.0 years ago
Mensur Dlakic ★ 27k

An empirical answer to your question is to use the number of PCs that explains most of the variance in your data. By that I don't mean 51% of the variance, even though that would technically fit the "most" definition, but rather 90-95%.
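
For example, with scikit-learn (assuming `X_scaled` is a hypothetical dense, scaled cells-by-genes matrix and that 100 PCs are enough to reach the target):

```python
# Pick the number of PCs whose cumulative explained variance reaches ~90%.
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=100, svd_solver="randomized", random_state=0).fit(X_scaled)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_pcs = int(np.searchsorted(cumvar, 0.90)) + 1   # first component count reaching 90%
print(f"{n_pcs} PCs explain {cumvar[n_pcs - 1]:.1%} of the variance")

# Shortcut: PCA(n_components=0.90) lets scikit-learn choose the number of
# components needed to reach 90% explained variance directly.
```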

If you want a completely unbiased dimensionality reduction method, the bottleneck layer of an autoencoder can serve as input to t-SNE/UMAP. Autoencoders learn compressed data representations by passing the data through a series of layers with progressively smaller numbers of neurons, and the bottleneck is where the data are most compressed. Since this is a neural network, it will "learn" to throw away the parts of your data that are not found in most data points, which is equivalent to removing the noise. I suspect 16-32 latent-space factors (equivalent to PCs) would be enough for your purpose. See here and here for more details.

It may not work for all data, but you can use an autoencoder with a bottleneck size of 2 to do the same thing that t-SNE and UMAP are doing. That may already give you good enough clustering.
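
A minimal Keras sketch of this approach; the layer sizes, epochs and 32-unit bottleneck are illustrative choices rather than recommendations, and `X_scaled` is again a hypothetical preprocessed cells-by-genes matrix:

```python
# Compress the expression matrix to a 32-dimensional bottleneck, then hand
# that representation to UMAP (or t-SNE) instead of PCs.
import tensorflow as tf
from tensorflow.keras import layers, Model
import umap

n_genes = X_scaled.shape[1]

inputs = tf.keras.Input(shape=(n_genes,))
h = layers.Dense(512, activation="relu")(inputs)
h = layers.Dense(128, activation="relu")(h)
bottleneck = layers.Dense(32, name="bottleneck")(h)        # latent factors ("PCs")
h = layers.Dense(128, activation="relu")(bottleneck)
h = layers.Dense(512, activation="relu")(h)
outputs = layers.Dense(n_genes)(h)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_scaled, X_scaled, epochs=30, batch_size=128, shuffle=True)

encoder = Model(inputs, bottleneck)                        # input -> bottleneck only
latent = encoder.predict(X_scaled)                         # cells x 32

embedding = umap.UMAP(random_state=0).fit_transform(latent)  # 2-D for plotting

# A bottleneck of 2 units instead of 32 gives a direct 2-D embedding,
# which is sometimes good enough on its own.
```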


If you want a visual for what I said above, below is an example of hand-written digits (top row), followed by the same data corrupted by 70% random noise (middle row). With a bottleneck of 2, this autoencoder still learns to ignore the noise and reconstructs the images correctly (bottom row).

[Image: hand-written digits (top row), the same digits corrupted by 70% random noise (middle row), and the autoencoder reconstructions (bottom row)]


Indeed, that's my usual approach, i.e. take the number of PCs achieving >80% or 90% explained variance. In many single-cell datasets, one must wonder just how much variation is being explained by the default number of PCs; for example, 30 PCs, out of the many thousands possible when there are thousands of cells, may only end up explaining 15% of the variation.


You're citing DCA from the Theis lab, but they're not recommending autoencoders over t-SNE/UMAP for visualisation; can you provide further justification?


I am not necessarily recommending autoencoders for visualization over t-SNE/UMAP either. In fact, I said above that it may not work for all data, and t-SNE has a clear advantage because one can fine-tune the perplexity value to alter the plot granularity. By the way, I don't use DCA, and was using their work only as an example that is directly applicable to this problem.

Below I show for this dataset how an autoencoder embeds the data directly in 2D, followed by 32 autoencoder-derived latent factors embedded in 2D by t-SNE. It is obvious that the latter approach is superior in this case. Nevertheless, I always take a look at what a 2D embedding by an autoencoder looks like, because it is easy enough to do even for very large datasets.

[Image: 2-D embedding produced directly by an autoencoder with a bottleneck of 2]

[Image: t-SNE embedding of 32 autoencoder-derived latent factors]


Thanks, I have used autoencoders this way (I was introduced to it by this tutorial). However, I've never actually found an autoencoder visualisation that 'looks better' than t-SNE/UMAP (or diffusion maps/FDGs in some cases). Do you have examples/studies where autoencoders perform better?

3.0 years ago
tim.daley ▴ 10

The idea behind elbow-plot-based methods is that noise will typically contribute little variance along the PCs while signal will contribute a lot, so the small PCs in the tail should correspond to noise. It sounds like this is what you are looking for.
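
One simple way to turn the elbow idea into a number is the "largest distance to the line" heuristic; this is a sketch of that idea, not the exact implementation of any particular package:

```python
# The elbow is taken as the point on the explained-variance curve with the
# largest perpendicular distance to the straight line joining its endpoints.
import numpy as np

def elbow_point(explained_variance):
    y = np.asarray(explained_variance, dtype=float)
    x = np.arange(len(y), dtype=float)
    line = np.array([x[-1] - x[0], y[-1] - y[0]])
    line = line / np.linalg.norm(line)              # unit vector along the chord
    vecs = np.column_stack([x - x[0], y - y[0]])    # points relative to the start
    proj = np.outer(vecs @ line, line)              # projection onto the chord
    dists = np.linalg.norm(vecs - proj, axis=1)     # perpendicular distances
    return int(np.argmax(dists)) + 1                # number of PCs to keep

# e.g. n_pcs = elbow_point(pca.explained_variance_ratio_)
```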


Yes, that's the basic idea behind PC cut-offs, including but not limited to the elbow plot. Also, there is no clear point where biological signal ends and noise begins (and batch effects can contribute to high PCs). Why are you recommending the elbow plot over other methods?
