Question

discrepancy between t-test results in Excel and R

1

Entering edit mode

9.2 years ago

nazaninhoseinkhan ▴ 520

Hi

I am trying to find differentially expressed genes in microarray data. I preferred to do this using excel because I did not know how to do it in R (performing large number of t-tests simultaneously). After performing t-test in excel and obtaining p-values, I try t-test for a few number of rows in R, but the results of R and Excel are not the same.

Does anyone any idea about it?

Regards
Nazanin

R • 12k views

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by nazaninhoseinkhan ▴ 520

Ram · Answer 1 · 2015-02-03

9

Entering edit mode

9.2 years ago

Michael 54k

Well, Excel is wrong obviously!

Sincerely, now:

There is not a single (student's) 't-test' but several variants:

single sample,
two-sample with equal or unequal variance
two-sample with paired values
alternative: greater?, two-sided?, less than?

Which one did you use? the tests you used to compare were not the same.

Further side notes:

Excel is not a Bioinformatics application
Do not use t-test for differential expression analysis
....

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Michael 54k

1

Entering edit mode

Other answers contain references for my first sentence. It sounds like a joke, but it is true most of the time, I should explain that maybe a bit better. Why to trust R over Excel?

R is open source, Excel is closed source, you can easily inspect the code for the t.test in R if you have doubts, while in Excel you cannot
R uses open-source libraries, these provide algorithms with very well-understood numerical properties, Excel uses whatever
R's build process contains automated tests, we do not know which test model Excel developers use
R's main field of application is statistics, and possibly t.test is one of the most used functions, while Excel is used for all kinds of things and t-test and statistics is a niche application.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Michael 54k

Ram · Answer 2 · 2015-02-03

Did you used the same input in both tests?

If yes did you check the parameters each t-test uses? In R and Excel the standard parameters may differ, which causes different results.

In R some parameters you can set are for example:

alternative   a character string specifying the alternative hypothesis,
              must be one of "two.sided" (default), "greater" or
              "less". You can specify just the initial letter.
paired        a logical indicating whether you want a paired t-test.
var.equal     a logical variable indicating whether to treat the two
              variances as being equal. If TRUE then the pooled variance
              is used to estimate the variance otherwise the Welch (or
              Satterthwaite) approximation to the degrees of freedom is
              used.

Ram · Answer 3 · 2015-02-03

You should also read a few papers that discuss problems with t-tests and other statistics in Excel, for example Heiser 2006. Statistical tests, tests of significance, and tests of a hypothesis using Excel. Journal of Applied Statistical Methods, 5:155-171 or McCullough and Heiser 2008. On the accuracy of statistical procedures in Microsoft Excel 2007. Computational Statistics & Data Analysis 52:4570-4578. Here's a snippet from the second paper listing a few problems...

Inaccurate t-test results in the presence of missing values. Does not correctly calculate t test values when there are missing data cells in the range. Microsoft identified this fault (KBA 829252) in previous Excel versions, but did not fix it.

Inaccurate p-values from a t-test. Excel uses the Welch test method, which calculates a non-integer degrees-of-freedom value. The Excel algorithm uses the TTEST function, which truncates the degrees-of-freedom input to an integer. This gives an incorrect p value for the test.

Ram · Answer 4 · 2015-02-03

0

Entering edit mode

9.2 years ago

biogirl ▴ 210

You say that you tested a few rows in R, suggesting that the input data for the t-test in R is not the same as the input for the t-test in Excel?

Another source of difference is that R will handle floating point numbers differently, which will influence the outcome of your statistic. This is not a negative against R, however; R is better.

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by biogirl ▴ 210

0

Entering edit mode

this is an interesting point actually - though personally I think the original poster simply is not using the test the same way.

back to the issue - internally Excel seems to use a much larger precision library to compute the probability distributions, and as consequence it will compute very tiny p-values well beyond what actually makes sense. This is reserved strictly to p-values for all other computation the representation double precision.

This can be extremely annoying when communicating with some life scientists as they hate the idea of seeing 0 for p-value and much prefer something like 10^(-100) as that looks more scientific. Educating someone on the spot in the intricacies of floating point representation is hopeless.

Anyway just something popped into my mind - that feeling of having to do something in Excel because it is "more accurate" - thankfully I can mostly refuse doing that

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Istvan Albert 100k