Question

Single cluster spanning different locations in a UMAP of scRNA-seq data

0

5 weeks ago

sarahmanderni ▴ 100

Hi,

When a single cluster appears to span two relatively distant locations in a UMAP plot of scRNA-seq data, what does that mean? How would you assess whether it reflects a real biological process or a technical issue? I have already tested different integration methods (Seurat RPCA, CCA, and Harmony), and the issue persists.

UMAP scRNAseq • 470 views

1

It means that the UMAP and the clustering algorithm disagree, but since UMAP is mostly just a visualization, you could ignore it or try different UMAP parameters to remove it visually.

ADD REPLY • link 5 weeks ago by ATpoint 82k

0

In general, is there a trick to getting neat UMAPs in which every cluster sits in one piece? In the papers I read, the UMAPs all look perfect and clean, whereas in mine there are always a few cells mixed into other clusters, and sometimes residual cells from a couple of clusters. Do you tune the parameters to get a good-looking UMAP?

ADD REPLY • link 5 weeks ago by sarahmanderni ▴ 100

score 3 · Accepted Answer · 2024-03-19

Hello sarahmanderni,

I like @atpoint's answer, which captures the uncertainty that comes from having two algorithms involved.

I agree - the problem you describe is likely traceable to the parameterization or use of the clustering algorithm, of the visualization algorithm, or of both. I'll try to explain in greater detail:

Whenever an object or a dataset of a certain dimensionality is mapped onto a space of lower dimensionality, some level of distortion or deformation occurs (if interested, read more about the data processing inequality, the manifold hypothesis, or the curse of dimensionality). Consider, for instance, the problem faced by cartographers in the 16th century who wanted to represent the 3D surface of the earth on a 2D object like a sheet of paper.

They succeeded in making many useful maps; however, each of these types of map is distorted in some way(s). Consider, for instance, the well-known Mercator projection. On the map, Alaska and Russia appear to be on opposite sides of the earth, but in reality, at the nearest point, Alaska is only about 89 km from Russia. This is because, when putting the sphere onto the 2D surface, at some point it became necessary to introduce a boundary ...

Let's continue this metaphor using your exact question ... Note that Russia is "cut in half", so to speak. Is that part of Russia far from the part on the other side of the map? Evidently not ... Dots belonging to a given cluster on a UMAP can be subject to the same kind of issue. If the dots of a given cluster are truly very tightly grouped, then the UMAP will typically draw them close together ... but if the boundary is drawn in just the wrong place ... well ... you may see them far apart in the UMAP figure. This is one possibility for you.
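To make the boundary effect concrete, here is a minimal toy sketch. It is not UMAP; it only unrolls points on a circle onto a line, which is the one-dimensional version of cutting a map. The function names and the cut positions are illustrative assumptions of mine.

```python
import math

def circle_distance(a_deg, b_deg, radius=1.0):
    """Straight-line (chord) distance between two points on a circle."""
    a, b = math.radians(a_deg), math.radians(b_deg)
    return math.hypot(radius * (math.cos(a) - math.cos(b)),
                      radius * (math.sin(a) - math.sin(b)))

def unrolled_distance(a_deg, b_deg, cut_deg=0.0):
    """Distance after cutting the circle at cut_deg and laying it flat on [0, 360)."""
    ua = (a_deg - cut_deg) % 360.0
    ub = (b_deg - cut_deg) % 360.0
    return abs(ua - ub)

# Two "cells" from the same tight group, sitting on either side of the 0-degree mark.
p, q = 359.0, 1.0

near_true = circle_distance(p, q)                # small: they really are neighbours
far_flat = unrolled_distance(p, q, cut_deg=0.0)  # cut falls between them: far apart
ok_flat = unrolled_distance(p, q, cut_deg=180.0) # cut elsewhere: they stay together

print(near_true, far_flat, ok_flat)
```

The two points never move; only the place where the circle is cut changes, and that alone decides whether they land next to each other or at opposite ends of the flat picture.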

However, it is not certain that this is the correct explanation. When you performed your clustering, you may have specified certain assumptions. For instance, you may have told the algorithm "I want 11 clusters". If so, then there is another possibility - that constraining the dataset to 11 clusters has caused data points that are actually different (in some sense, such as cellular identity) to be lumped into one cluster ...

We cannot know which of these possibilities is true without knowing more about what you did, which is why it is always good practice to include your code - it helps us help you, and also helps us not to mislead you ...
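Here is a rough sketch of that second possibility on made-up 1D data. It uses a tiny hand-written k-means as a stand-in, which is an assumption on my part and not necessarily what you ran; graph-based clustering is usually steered by a resolution parameter rather than an explicit cluster count, but a setting that yields too few clusters can lump groups in a similar way. The function `kmeans_1d` and the toy values are hypothetical.

```python
def kmeans_1d(values, k, n_iter=20):
    """Minimal Lloyd's k-means on 1D data with a deterministic quantile-based start."""
    ordered = sorted(values)
    n = len(ordered)
    # Initial centers: one value taken from the middle of each of k equal-sized slices.
    centers = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    labels = [0] * n
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center moves to the mean of its current members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

# Three made-up "cell types": A near 1, B near 7, C near 31.
cells = [0.0, 1.0, 2.0, 6.0, 7.0, 8.0, 30.0, 31.0, 32.0]

forced_two = kmeans_1d(cells, k=2)  # asked for fewer clusters than there are groups
free_three = kmeans_1d(cells, k=3)  # asked for the matching number

print(forced_two)
print(free_three)
```

With k=2, types A and B end up sharing one label even though they are distinct groups. A cluster like that can legitimately show up as two separate islands in a UMAP, because the visualization is showing a difference that the clustering was told to ignore.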

You did leave us a clue - that different visualization software all lead to the same lumping. This result somewhat undermines the UMAP as the explanation ... however, this is hand-waving, not definitive. After all, if you gave all the viz algorithms the same starting assumptions, then even though different visualizations will typically be obtained, it is not impossible that you could get a similar result each time, even if the issue is introduced during the viz step ... In addition, note that if the clustering algorithm has been forced to make decisions that don't reflect the data (like making 11 clusters even though there are more or fewer than 11 cell types in the dataset), then it is not surprising if the outcome is repeatedly nonsensical, as it repeatedly depends on the same imprecise assumption ...

This is why the safest answer we can give you is that the result you describe most likely flows either from the parameterization of the clustering algorithm or from that of the UMAP algorithm ... but we cannot say more with certainty.
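One heuristic way to narrow it down, assuming you can export the coordinates of the cluster's cells in the space the clustering was run on (for example the top principal components), is to compare distances there instead of on the UMAP. The sketch below is only that, a sketch: `separation_ratio` is a hypothetical helper, the coordinates are made up, and no particular cutoff is implied.

```python
import math
from itertools import combinations, product

def mean_distance(pairs):
    """Average Euclidean distance over an iterable of (point, point) pairs."""
    dists = [math.dist(a, b) for a, b in pairs]
    return sum(dists) / len(dists)

def separation_ratio(part_a, part_b):
    """Mean between-part distance divided by mean within-part distance."""
    within = mean_distance(list(combinations(part_a, 2)) +
                           list(combinations(part_b, 2)))
    between = mean_distance(product(part_a, part_b))
    return between / within

# Made-up coordinates for the two UMAP "islands" of one cluster,
# measured in the higher-dimensional space (3 dimensions here for brevity).
island_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
island_2_close = [(0.5, 0.5, 0.0), (1.0, 1.0, 0.0), (0.5, 0.0, 0.5)]
island_2_far = [(x + 10.0, y + 10.0, z + 10.0) for x, y, z in island_1]

ratio_close = separation_ratio(island_1, island_2_close)
ratio_far = separation_ratio(island_1, island_2_far)

print(ratio_close, ratio_far)
```

A ratio near 1 would suggest the two islands are intermingled upstream and the split is mostly a projection effect; a ratio well above 1 would suggest the cluster is lumping populations that really are apart. Either reading is a hint, not proof, and checking marker genes for the two islands would be a natural follow-up.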