Question

RNAseq continuous variable outliers

7.8 years ago

e.antoun ▴ 20

Hi,

I am attempting to analyze some RNAseq data with respect to a couple of continuous variables. I am using DESeq2 in R to do this, but am running into a problem. Below is the code I am using:

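library(DESeq2)
# samples: the sample table, which includes the continuous covariate fast
dds <- DESeqDataSetFromHTSeqCount(sampleTable=samples, directory=".", design=~fast)
dds <- DESeq(dds)
# With a numeric covariate, the log2 fold change is per unit increase in fast
res <- results(dds)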

This has worked fine for some of my other continuous variables, giving me a nice list of genes that change with respect to the variable in question. However, for 2 of the variables, there seems to be one sample that is completely skewing the data, as you can see below:

       count     fast
2    5.194365 65.25974
4    8.032771 65.79634
5   10.929044 35.18518
6    3.501335 63.21429
7   13.352367 53.29342
10   8.261876 59.53079
14  20.103149 45.50562
16   6.315940 64.55331
17  10.014749 53.15985
19   7.377103 46.86469
24   5.593491 58.26772
26  11.172046 67.38461
27   9.525122 62.40000
31   2.556560 76.26373
33   3.521462 61.88679
39   5.633191 58.42697
40   5.482473 54.71698
1   10.567494 55.12144
12   7.319713 49.79920
13   4.362853 53.90836
15  12.794649 76.51869
18   9.682205 55.38462
20   6.072752 64.04494
22  12.648017 61.78660
23   3.287383 55.73034
25  10.516274 24.82269
28  20.266891 39.63636
29   2.744838 74.68750
30  14.990684 55.26316
32   9.224983 51.36364
34   4.702022 45.55874
35   3.972492 48.58657
36   2.542509 51.83246
37   7.500402 44.91228
3    6.942850 74.48649
8    4.000244 63.24786
9   10.290107 67.70187
11   1.928383 61.51079
21   7.866473 76.54958
38 108.088894 11.69065

As you can see, the very last sample has a much higher count than the rest and is skewing the data; looking at the rest of the values, this gene shouldn't be called as differentially expressed. If I remove this sample, I get the same issue but with another sample being the problem. This example isn't too extreme, but for some genes the values are all around 10 and then there is one in the thousands, skewing the analysis.
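To put a number on how much one sample drives the trend for this gene, here is a minimal base R sketch using the values posted above. It assumes the count column holds normalised counts (the values are non-integer), and it uses an ordinary linear model on log2(count + 1) as a rough stand-in for the negative binomial fit, so it illustrates the idea rather than reproducing the DESeq2 result.

```r
# Values copied from the table in the post; row names are the sample IDs
d <- data.frame(
  count = c(5.194365, 8.032771, 10.929044, 3.501335, 13.352367, 8.261876,
            20.103149, 6.315940, 10.014749, 7.377103, 5.593491, 11.172046,
            9.525122, 2.556560, 3.521462, 5.633191, 5.482473, 10.567494,
            7.319713, 4.362853, 12.794649, 9.682205, 6.072752, 12.648017,
            3.287383, 10.516274, 20.266891, 2.744838, 14.990684, 9.224983,
            4.702022, 3.972492, 2.542509, 7.500402, 6.942850, 4.000244,
            10.290107, 1.928383, 7.866473, 108.088894),
  fast = c(65.25974, 65.79634, 35.18518, 63.21429, 53.29342, 59.53079,
           45.50562, 64.55331, 53.15985, 46.86469, 58.26772, 67.38461,
           62.40000, 76.26373, 61.88679, 58.42697, 54.71698, 55.12144,
           49.79920, 53.90836, 76.51869, 55.38462, 64.04494, 61.78660,
           55.73034, 24.82269, 39.63636, 74.68750, 55.26316, 51.36364,
           45.55874, 48.58657, 51.83246, 44.91228, 74.48649, 63.24786,
           67.70187, 61.51079, 76.54958, 11.69065),
  row.names = as.character(c(2, 4, 5, 6, 7, 10, 14, 16, 17, 19, 24, 26, 27,
                             31, 33, 39, 40, 1, 12, 13, 15, 18, 20, 22, 23,
                             25, 28, 29, 30, 32, 34, 35, 36, 37, 3, 8, 9,
                             11, 21, 38))
)

# Linear model on the log scale (an assumption made for illustration only)
fit_all <- lm(log2(count + 1) ~ fast, data = d)

# Leverage: how unusual each sample's covariate value is
lev <- hatvalues(fit_all)
# Cook's distance: how much the fit changes when each sample is left out
cook <- cooks.distance(fit_all)

# Refit without the most influential sample and compare the slopes
top <- names(which.max(cook))
fit_drop <- lm(log2(count + 1) ~ fast, data = d[rownames(d) != top, ])

print(head(sort(cook, decreasing = TRUE), 3))
print(c(all = unname(coef(fit_all)["fast"]),
        dropped = unname(coef(fit_drop)["fast"])))
```

On these numbers sample 38 has both the most extreme fast value and the largest Cook's distance, and the slope shrinks in magnitude once it is left out. A sample that is extreme in the covariate and in the count at the same time can dominate a regression even when it looks unremarkable on a PCA plot.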

Phenotype table:

       files                slow                 fast            status
1   2.counts              34.74026              65.25974          con
2   4.counts              34.20366              65.79634          con
3   5.counts              64.81481              35.18518          con
4   6.counts              36.78571              63.21429          con
5   7.counts              46.70658              53.29342          con
6  10.counts              40.46921              59.53079          con
7  14.counts              54.49438              45.50562          con
8  16.counts              35.44669              64.55331          con
9  17.counts              46.84015              53.15985          con
10 19.counts              53.13531              46.86469          con
11 24.counts              41.73228              58.26772          con
12 26.counts              32.61538              67.38461          con
13 27.counts              37.60000              62.40000          con
14 31.counts              23.73626              76.26373          con
15 33.counts              38.11321              61.88679          con
16 39.counts              41.57303              58.42697          con
17 40.counts              45.28302              54.71698          con
18  1.counts              44.87856              55.12144          pre
19 12.counts              50.20080              49.79920          pre
20 13.counts              46.09164              53.90836          pre
21 15.counts              23.48131              76.51869          pre
22 18.counts              44.61538              55.38462          pre
23 20.counts              35.95506              64.04494          pre
24 22.counts              38.21340              61.78660          pre
25 23.counts              44.26966              55.73034          pre
26 25.counts              75.17731              24.82269          pre
27 28.counts              60.36364              39.63636          pre
28 29.counts              25.31250              74.68750          pre
29 30.counts              44.73684              55.26316          pre
30 32.counts              48.63636              51.36364          pre
31 34.counts              54.44126              45.55874          pre
32 35.counts              51.41343              48.58657          pre
33 36.counts              48.16754              51.83246          pre
34 37.counts              55.08772              44.91228          pre
35  3.counts              25.51351              74.48649         sarc
36  8.counts              36.75214              63.24786         sarc
37  9.counts              32.29814              67.70187         sarc
38 11.counts              38.48921              61.51079         sarc
39 21.counts              23.45041              76.54958         sarc
40 38.counts              88.30935              11.69065         sarc
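Reading the table, slow and fast appear to add up to 100 in every row. If so, the two covariates carry the same information with opposite sign, and a design of ~slow should flag the same genes as ~fast. That is an inference from the posted numbers, not something stated in the post; the short base R check below uses only the values copied from the table.

```r
# slow, fast and status copied from the phenotype table, in table order
pheno <- data.frame(
  slow = c(34.74026, 34.20366, 64.81481, 36.78571, 46.70658, 40.46921,
           54.49438, 35.44669, 46.84015, 53.13531, 41.73228, 32.61538,
           37.60000, 23.73626, 38.11321, 41.57303, 45.28302, 44.87856,
           50.20080, 46.09164, 23.48131, 44.61538, 35.95506, 38.21340,
           44.26966, 75.17731, 60.36364, 25.31250, 44.73684, 48.63636,
           54.44126, 51.41343, 48.16754, 55.08772, 25.51351, 36.75214,
           32.29814, 38.48921, 23.45041, 88.30935),
  fast = c(65.25974, 65.79634, 35.18518, 63.21429, 53.29342, 59.53079,
           45.50562, 64.55331, 53.15985, 46.86469, 58.26772, 67.38461,
           62.40000, 76.26373, 61.88679, 58.42697, 54.71698, 55.12144,
           49.79920, 53.90836, 76.51869, 55.38462, 64.04494, 61.78660,
           55.73034, 24.82269, 39.63636, 74.68750, 55.26316, 51.36364,
           45.55874, 48.58657, 51.83246, 44.91228, 74.48649, 63.24786,
           67.70187, 61.51079, 76.54958, 11.69065),
  status = rep(c("con", "pre", "sarc"), times = c(17, 17, 6))
)

# Largest deviation of slow + fast from 100 (rounding in the post aside)
max_gap <- max(abs(pheno$slow + pheno$fast - 100))
# Correlation between the two covariates
r <- cor(pheno$slow, pheno$fast)
# Group sizes
group_sizes <- table(pheno$status)

print(max_gap)
print(r)
print(group_sizes)
```

If the check holds, the "2 variables" that misbehave may simply be these two, and the sample with the lowest fast value (38.counts) is necessarily also the one with the highest slow value.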

Is there any way around this? I have tried looking into it but can't figure out how to overcome this problem.

Thanks for any help

rnaseq deseq2 outlier continuous • 2.1k views

ADD COMMENT • link 7.8 years ago by e.antoun ▴ 20

Which is the "very last sample"? Fast seems to be a condition here. How many samples do you have? What is your design? Did you do any exploratory analysis, such as PCA?

ADD REPLY • link 7.8 years ago by GouthamAtla 12k

Thanks for the reply. The 'very last sample' is simply the last sample in the data frame above, sample 38. Fast is indeed the variable: it is the percentage of fast fibres in a muscle sample. In total there are 40 samples, grouped by disease status: the first 17 are controls, the next 17 are disease 1 and the last 6 are disease 2. I have done a categorical analysis to look at differences between the groups, but now want to look at continuous variables, fast in this case.

I have done PCA, plotted heatmaps of the most variable genes, boxplots, etc., and there is nothing out of the ordinary between the samples. They do not group on the PCA plot, but no individual sample stands out as an outlier.

Thanks

ADD REPLY • link 7.8 years ago by e.antoun ▴ 20

Can you append your post with a phenotype table, to illustrate your experimental design?

ADD REPLY • link 7.8 years ago by andrew.j.skelton73 6.5k

I have added it to the original post. Hopefully that'll make things a bit clearer.

ADD REPLY • link 7.8 years ago by e.antoun ▴ 20

Sorry for posting this, but does anyone have any ideas on what I can do with this? Thanks

ADD REPLY • link 7.8 years ago by e.antoun ▴ 20

Editing your original post "bumps" it up to the front page. There is no need to use Submit Answer to achieve the same result.

ADD REPLY • link 7.8 years ago by GenoMax 141k

score 0 · Answer 1 · 2016-06-20

7.8 years ago

e.antoun ▴ 20

Just to add, I have tried edgeR and voom+limma and have the same problem: 1 or 2 samples seem to skew the data and appear to be outliers, yet the gene still comes up as highly differentially expressed.

Thanks

ADD COMMENT • link 7.8 years ago by e.antoun ▴ 20