Welch’s t-test for comparing two groups: An Excel calculator (1.0.1)

In this post, I wish to promote my latest Excel calculator, Welch’s t-test for comparing two groups (version 1.0.x). It was released originally 22nd October 2018, and released again with a few slight adjustments as Version 1.0.1 on 5th April 2019.

Version update: 5 April 2019

In Version 1.0, there were a couple of minor bugs related to the calculator’s ancillary output. These have been corrected in Version 1.0.1.

I am currently working on a new version release, which will have the additional functions:

Users will be able to compare more than two groups via Welch’s ANOVA
Users will be able to perform analyses based on either raw measures for two groups, or alternatively, just the aggregate measures (M, SD, and n)
Users who opt for raw score input will be able to perform additional normality checks (e.g. skewness and kurtosis tests)
Users will be able to measure effect sizes as they relate to Student’s t-test, as well as to Welch’s t-test

When this version becomes available, I shall update this page and provide a link to a new notes page.

Who is it for?

I designed this with one of my students in mind, who needed to correct his analyses for the fact that sample size was unbalanced between two comparative groups. However, this calculator is for anyone who wants to compare aggregate measures between any two groups.

What do you need?

Knowledge

Some prior knowledge of what a t-test is, and how to interpret it, is recommended. You also need to know that when comparing differences between two groups using a t-test, there is an optimal variant of the test called Welch’s procedure. Welch’s t-test corrects for measurement bias caused by the two groups’ having different sample sizes and sample variances, whereas your classic Student’s t-test makes no such attempt to correct this bias.

Data

All you need is a set of average and standard deviation scores for two groups. You also need a number describing how many independent scores were made by each group:

In the above demonstration, imagine you are comparing the average height (in cms) of students in two tutorial classes. Group 1 has 13 students, whose average height is 164.2 cm, with a standard deviation of 14.1. Group 2 is larger (n = 25), and consists of taller students on average (M = 173.9), albeit with heights that are less variable (SD = 11.3).

Parameters

By default, the calculator assumes a number of things that you are free to challenge with your own specs:

Mean difference constant is zero. You wish to compare the group mean difference to a baseline of zero
Criterion of significance is five percent. You believe outcomes that have less than a 5 in 100 chance of being caused at random are probably not false positives
Two tails are better than one. You want to run a two-tailed comparison of the groups’ scores
Control event rate is 20 percent. You guess that around 20 in 100 of the individual scores from the control group are more like scores from the experimental group. This parameter might matter if your experiment involved the administration of a certain treatment, and you want to know what effect that treatment is having relative to a control group. If that or something similar is the case, then my calculator will produce you an effect size known as number needed to treat. Otherwise, you can probably ignore this assumption.

Here is a screen capture of what these parameters look like in my calculator, bearing in mind that you are free to swap in your own custom values:

Note that the calculator does not assume that your two groups are the same size, nor does it care about their variances being the same size. Such is the beauty of the Welch procedure – it corrects for these discrepancies ‘just in case‘, whether or not they are large enough to be meaningful.

Look, if you are still paranoid about your groups’ variances being unequal, my calculator has an inequality of variances test, built-in and free of charge. Go nuts! With the demo input data above, you’ll see in my calculator’s ash-coloured box that variances are not statistically different, F(12, 24) = 1.56, p = 0.172; all that worry for nothing!

What do you get?

With your data entered and the parameters set to your liking, my Excel calculator will calculate automatically a range of useful outputs.

In a nutshell, you get your test outcomes, t(df) and p, both with and without the Welch correction. Extending on the demonstration above, here’s what the Welch t-test output looks like:

And here’s what the output looks like without Welch’s magic touch:

Do you see the difference?

With Welch’s correction, the average height of the second tute’s students is statistically higher at the 0.05 significance level, t‘(20.23) = -2.15, p‘ = 0.044, 95% CI_diff [-19.12, -0.28]. Note that without the Welch correction, the t value is biased, which in turn produces an apparently ‘less likely null’ effect (p = .027).

What else? Plus, a warning…

As a bonus, you also get mean difference confidence intervals, plus a range of effect sizes – Cohen’s d and his somewhat misfit cousins, including Cohen’s U₃, which you can read about here.

Note that the inequality of variances test and all of the bonus effect sizes will always be the same, regardless of whether you prefer Welch or Student’s t-test. For instance, you may report a certain Cohen’s d value as an accompaniment to either test, without the need for additional correction.

Continuing with the running demonstration, the collective height advantage of students in Group 2 is quite large (d = 0.79). About four fifths of Group 2’s students are taller than Group 1’s students (Cohen’s U₃ = 78.48%). Almost a third of the two groups’ height measurements do not overlap (100 – OVL = 30.65%). There is a high likelihood that a student picked at random from Group 2 will be taller than someone randomly selected from Group 1 (AUC = 71.14%).

The n Needed to Treat statistic is a little different to the rest. It is relevant in clinical research, where a treatment group is being compared to a control (non-treatment) group on some measure. In this case, we’ll pretend that Group 2 is the ‘treatment group’, where our desired outcome is taller heights. As such, n Needed to Treat expresses the number of new student heights you’ll need to measure, in order to have another taller-than-average student in Group 2. In other words, if another 100 students join Group 2 later in the semester, about 28 will be taller than if they had’ve joined Group 1 instead (100 ÷ 3.5874 = 27.8753)! Of course this is nonsense, because we know that the simple act of joining one or some other tutorial is not going to make us taller (unless…).

The n Needed to Treat statistic depends on user input, specifically the control event rate, or CER. The control event rate is the percentage of control group (in our silly example, students in Group 1) assumed to have ‘favourable outcomes’. By default, my calculator has set this value to 20%, but please do experiment with different values. With the default CER value, we are saying that we expect about 20% of students in Group 1 to measure unexpectedly taller-than-average, when in fact, they are short – again, silly example, but in clinical research, both n Needed to Treat and CER are vital indicators of an intervention’s effectiveness.

How do you use the calculator?

Simple usage instructions are provided in the Excel worksheet. If those are too confusing, feel free to contact me.

If you are preparing a manuscript for publication, please cite this calculator in the body of your text as Gaetano (2019). Please also add the following reference to your reference list or bibliography:

Reference

Gaetano, J. (2019). Welch’s t-test for comparing two independent groups: An Excel calculator (1.0.1). [Microsoft Excel workbook]. doi:10.13140/RG.2.2.14550.91209/2

Please click here to access the calculator hosting page, then click the Download button.

Welch’s t-test for comparing two groups: An Excel calculator (1.0.1)

Version update: 5 April 2019

Who is it for?