Question

Threshold to exclude samples based on sample-to-sample distances

1

7.7 years ago

Tobias.Wohland ▴ 70

Hi, I have an RNA-Seq experiment and I'm wondering at what point I should exclude samples from my analysis. I couldn't find a good answer with Google's help.

In both the sample-to-sample distance heatmap and a simple PCA plot I can clearly see (at least I would say so) that one sample in one of the treatment groups differs from the other samples of the same group. But is there a threshold I can use to say "yes" or "no" to excluding it?

Please find attached both plots. The guy in the bottom left corner of the PCA-plot is the sample "SD_2" (see Distance plot).

[Image: sample-to-sample distance heatmap]

[Image: PCA plot]

Thanks for your help in advance.

Best, Tobi

RNA-Seq R • 3.3k views

ADD COMMENT • link updated 7.4 years ago by DG 7.3k • written 7.7 years ago by Tobias.Wohland ▴ 70

0

Hi Tobias,

Did you ever figure out an answer to your problem? I'm wondering about this as well...

Best, Janne

ADD REPLY • link 7.4 years ago by Janne.Swaegers • 0

score 0 · Answer 1 · 2016-11-14

0

7.4 years ago

DG 7.3k

I'm seeing this question now thanks to Janne's bump. I don't think there is a single threshold for this sort of thing that would work in all cases. Most people simply decide from the visualizations that a sample is an outlier and then remove it manually from their analysis. This is part of the "art" of data analysis. Machine learning approaches and clustering can achieve what you want, but they seem like overkill compared with the simple visualization you have already done.
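If you do want a number to put next to the plots, here is a minimal sketch of one possible summary, not a validated rule: for each sample in a group, take its mean Euclidean distance to the other samples and express that as a robust z-score (median and MAD). Everything here is an assumption for illustration: the toy values are made up, the function names are hypothetical, and any cutoff you apply to the score is still a judgment call.

```python
import math
from statistics import median

# Toy log-scale expression values, made up for illustration: sample -> per-gene values.
profiles = {
    "SD_1": [5.0, 8.0, 3.0, 6.0],
    "SD_2": [7.5, 5.0, 5.5, 3.0],
    "SD_3": [5.2, 7.9, 3.1, 6.2],
    "SD_4": [4.9, 8.2, 2.8, 5.9],
    "SD_5": [5.1, 7.8, 3.2, 6.1],
    "SD_6": [4.8, 8.1, 2.9, 6.0],
}

def euclidean(a, b):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_distance_to_others(samples):
    """For each sample, the mean distance to every other sample in the group."""
    result = {}
    for name, vec in samples.items():
        dists = [euclidean(vec, other) for k, other in samples.items() if k != name]
        result[name] = sum(dists) / len(dists)
    return result

def robust_z(scores):
    """Center on the median and scale by the MAD (1.4826 makes it comparable to an SD)."""
    med = median(scores.values())
    mad = median(abs(s - med) for s in scores.values())
    if mad == 0:
        # No spread to scale by; report no outliers rather than divide by zero.
        return {name: 0.0 for name in scores}
    return {name: (s - med) / (1.4826 * mad) for name, s in scores.items()}

scores = mean_distance_to_others(profiles)
z = robust_z(scores)
for name in sorted(z, key=z.get, reverse=True):
    print(f"{name}\tmean distance {scores[name]:.2f}\trobust z {z[name]:.1f}")
```

With so few samples per group the MAD is itself noisy, so I would treat the ranking as a way to describe what the heatmap already shows rather than as an exclusion test.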

ADD COMMENT • link 7.4 years ago by DG 7.3k

0

I agree, but where do you draw the line? When does removing outliers become hypothesis-driven data manipulation?

Given that the sample size is small (in this example and many others), you are sampling individuals from a larger population without knowing the true heterogeneity in that population.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Absolutely agree. In RNA-Seq experiments, when I see this sort of pattern I tend to run parallel analyses with and without the sample included and look for other evidence that it is really behaving oddly. If I'm dealing with clinical samples, I (or someone else) go back to the clinical data to see if there is anything unusual about the sample. Usually I find something there and can simply drop it from the analysis because it was mischaracterized at the phenotype level.

That said, a few points:

1) For an RNA-Seq experiment, 6 samples in a group isn't really a small number. I realize that the number of samples people run is increasing as costs drop, but this is about double the number you usually have in a group.

2) That light blue group of samples is definitely more variable than the red cluster in the PCA, but even then the sample in the lower left sits quite far outside the group's normal range of variation.

I'd think hard about removing it. In this case I would probably take the parallel analysis approach.
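To make the parallel analysis idea concrete, here is a rough sketch under stated assumptions: it computes a naive per-gene log2 fold change (difference of group means on log2-scale values) with and without the suspect sample and lists the genes whose estimate shifts by at least a chosen amount. The toy values, the control sample names, the function names and the `min_shift` parameter are all hypothetical. In a real analysis you would rerun your actual differential expression tool both ways and compare the result tables instead.

```python
from statistics import mean

# Toy log2-scale expression values, made up for illustration: sample -> gene -> value.
treated = {
    "SD_1": {"geneA": 8.0, "geneB": 5.0},
    "SD_2": {"geneA": 2.0, "geneB": 5.2},  # the suspect sample
    "SD_3": {"geneA": 8.2, "geneB": 4.9},
}
control = {
    "CTRL_1": {"geneA": 5.0, "geneB": 5.0},
    "CTRL_2": {"geneA": 5.2, "geneB": 5.1},
}

def log_fold_changes(treated_group, control_group):
    """Naive per-gene log2 fold change: treated mean minus control mean."""
    genes = next(iter(control_group.values())).keys()
    return {
        g: mean(s[g] for s in treated_group.values())
           - mean(s[g] for s in control_group.values())
        for g in genes
    }

def compare_with_without(treated_group, control_group, suspect, min_shift=1.0):
    """Return genes whose fold change moves by >= min_shift when `suspect` is dropped.

    Each value is a (with_sample, without_sample) pair of fold changes.
    """
    full = log_fold_changes(treated_group, control_group)
    kept = {k: v for k, v in treated_group.items() if k != suspect}
    reduced = log_fold_changes(kept, control_group)
    return {
        g: (full[g], reduced[g])
        for g in full
        if abs(full[g] - reduced[g]) >= min_shift
    }

sensitive = compare_with_without(treated, control, suspect="SD_2")
for gene, (with_s, without_s) in sensitive.items():
    print(f"{gene}\twith: {with_s:.2f}\twithout: {without_s:.2f}")
```

If only a handful of genes move, the sample probably doesn't matter much either way. If the conclusions change, that is the point where I'd go looking for independent evidence about the sample before deciding.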

ADD REPLY • link 7.4 years ago by DG 7.3k