What Is a T-Test?
William Sealy Gosset published the t-test in 1908 under the pseudonym "Student" while working at the Guinness brewery in Dublin, where small batch sizes made normal-distribution assumptions unreliable. The key insight behind his work is that with fewer observations the tails of your distribution get fatter, which means you need larger differences before you can call something meaningful — and once you see that logic, the whole test makes intuitive sense. The t-test appears in virtually every introductory statistics curriculum.
Many published t-tests use small samples, often small enough that the difference between Student's version and Welch's correction actually matters for the results. The t-statistic works like a signal-to-noise ratio — the numerator measures how far the means sit apart, the denominator scales that by sampling variability, and the resulting number tells you whether the gap is big enough relative to the noise to take seriously. The p-value then quantifies how often you would see a gap that large if nothing real were going on, and that combination of signal strength plus probability is what makes the t-test the first tool most researchers reach for when comparing groups.
Types of T-Tests
One-Sample T-Test
A one-sample t-test compares your sample mean against a known reference value — such as testing whether student exam scores differ from a national average of 75, or whether a factory's mean fill volume matches its 500 mL label specification. The reason this variant gets its own name is that having a fixed benchmark changes the math: you only need one group of data instead of two. One-sample t-tests are standard practice for batch release decisions in pharmaceutical testing and quality control whenever the target specification is fixed.
t = (x̄ - μ₀) / (s / √n)
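As a minimal sketch of the formula above, here is the exam-score example run through scipy. The scores and the benchmark of 75 are invented illustration values, not real data.

```python
# One-sample t-test: do these (hypothetical) exam scores differ from
# a national average of 75?
from scipy import stats

scores = [72, 78, 81, 69, 74, 77, 70, 83, 76, 71]
t_stat, p_value = stats.ttest_1samp(scores, popmean=75)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

With a sample mean of 75.1, the t-statistic is tiny and the p-value large, so these scores give no reason to doubt the benchmark.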
Two-Sample T-Test (Welch's)
Welch's t-test adjusts degrees of freedom through the Welch-Satterthwaite equation, removing the equal-variances assumption that Student's original test requires. The practical difference matters more than it sounds: when group variances differ by a factor of three or four, Student's version inflates Type I error rates, which means you end up claiming effects that are not real. Delacre, Lakens, and Leys demonstrated this in a 2017 International Review of Social Psychology paper, showing that Welch's test maintains accurate error rates even when variance ratios reach 4:1 — which is why R's t.test() uses it by default. Note that Python's scipy and JASP still default to Student's version, so in those tools you have to request the Welch correction explicitly.
t = (x̄₁ - x̄₂) / √(s₁²/n₁ + s₂²/n₂)
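A short sketch of this formula in scipy, using two invented groups with visibly unequal spread. Note `equal_var=False` must be set explicitly, since scipy's `ttest_ind` defaults to Student's equal-variance version.

```python
# Welch's two-sample t-test on illustration data: group_b has roughly
# four times the spread of group_a, exactly the case Welch's handles.
from scipy import stats

group_a = [23.1, 25.4, 24.8, 26.0, 23.9, 25.2]
group_b = [20.5, 29.3, 18.7, 31.2, 24.4, 27.8]
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```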
Paired T-Test
A paired t-test works by computing the difference within each matched observation and then running a one-sample t-test on those differences. The reason it exists as a separate method is that pairing removes between-subject variability, which can dramatically increase statistical power — an independent test on the same data might miss an effect entirely because it treats each person as a stranger to every other measurement. Paired designs can substantially reduce residual variance compared to independent designs when repeated measures on the same subjects are feasible, which makes this the obvious choice for before-and-after studies.
t = d̄ / (s_d / √n)
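The "one-sample test on the differences" description above can be verified directly in scipy: `ttest_rel` gives exactly the same answer as `ttest_1samp` on the within-pair differences. The before/after values are invented illustration data.

```python
# Paired t-test two ways: scipy's ttest_rel versus a one-sample test
# on the per-subject differences. The results are identical.
from scipy import stats

before = [140, 152, 138, 145, 150, 148, 142, 155]
after  = [135, 147, 136, 140, 144, 145, 138, 149]
t_paired, p_paired = stats.ttest_rel(before, after)

diffs = [b - a for b, a in zip(before, after)]
t_onesample, p_onesample = stats.ttest_1samp(diffs, popmean=0)
```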
How to Interpret T-Test Results
| P-Value | Meaning | Decision |
|---|---|---|
| p < 0.05 | Statistically significant | Reject the null hypothesis |
| p ≥ 0.05 | Not statistically significant | Fail to reject the null hypothesis |
Confidence intervals show the plausible range for the true mean difference, and they answer the question that p-values alone never do: how big is the effect in practical terms? A significant t-statistic tells you the gap is unlikely to be noise, but a confidence interval gives you a range you can actually act on. When a 95% interval for a two-sample comparison excludes zero, the result aligns with p < 0.05 significance. The APA Publication Manual (7th edition) calls for confidence intervals alongside test statistics, and reporting both intervals and p-values has become standard practice in well-regarded journals.
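A sketch of how a 95% interval for a mean difference is built, using the Welch standard error and degrees of freedom from the formula earlier. The two groups are invented illustration data, and the manual construction is shown for transparency rather than as the only way to get it.

```python
# 95% confidence interval for a two-sample mean difference, built from
# the Welch standard error and the t critical value.
import math
from scipy import stats

a = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3]
b = [4.2, 4.5, 4.1, 4.6, 4.4, 4.0]
na, nb = len(a), len(b)
ma, mb = sum(a) / na, sum(b) / nb
va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
vb = sum((x - mb) ** 2 for x in b) / (nb - 1)

se = math.sqrt(va / na + vb / nb)               # Welch standard error
df = (va / na + vb / nb) ** 2 / (
    (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
)                                               # Welch-Satterthwaite df
t_crit = stats.t.ppf(0.975, df)
ci = (ma - mb - t_crit * se, ma - mb + t_crit * se)
```

Here the interval sits entirely above zero, which is the interval-based counterpart of p < 0.05 — and unlike the p-value, it also tells you the difference is plausibly somewhere around half a unit to a bit over one unit.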
Assumptions of the T-Test
The t-test rests on four core assumptions, though their practical strictness varies with sample size. The one that causes the most unnecessary panic is normality — if Shapiro-Wilk rejects normality on a dataset of 85 observations, the Central Limit Theorem has already kicked in and the t-test will still perform fine. Switching to a nonparametric test at that point would sacrifice power for no real gain.
- Continuous data: The outcome variable must sit on a continuous scale — interval or ratio level. Applying t-tests to ordinal Likert data can inflate Type I error rates noticeably, which is why most methodologists discourage the practice.
- Random sampling: Observations should be drawn randomly from the target population. Convenience sampling is a common source of bias that can undermine t-test conclusions.
- Normal distribution: The Central Limit Theorem stabilizes the sampling distribution around n = 30, so moderate departures from normality matter far less than textbooks suggest. Simulation studies consistently show the t-test maintains accurate Type I error rates with skewness values up to 2.0 when n exceeds 40, which means moderate non-normality is rarely a practical concern with decent sample sizes.
- Independence: Two-sample tests require fully independent groups with no shared subjects. Paired tests require independence between pairs, though measurements within each pair are inherently dependent — that within-pair correlation is exactly what gives the paired design its statistical power advantage.
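The checks above can be sketched in a few lines: Shapiro-Wilk for normality and a simple variance ratio as a cue for choosing Welch's over Student's test. Both samples are synthetic, and the 1.5 threshold in the comment is a rough rule of thumb, not a standard from the text.

```python
# Quick assumption checks before a two-sample t-test, on synthetic data.
import random
import statistics
from scipy import stats

random.seed(1)
group_a = [random.gauss(50, 10) for _ in range(40)]
group_b = [random.gauss(53, 20) for _ in range(40)]

# Shapiro-Wilk: p >= 0.05 means no evidence against normality (and with
# n = 40, the Central Limit Theorem cushions moderate departures anyway).
w_stat, w_p = stats.shapiro(group_a)

# A variance ratio well above ~1.5 is a cue to prefer Welch's test.
var_ratio = statistics.variance(group_b) / statistics.variance(group_a)
```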
Frequently Asked Questions
When should I use a t-test vs a z-test?
The practical difference between a t-test and a z-test comes down to one thing most textbooks bury in a footnote: you almost never know the population standard deviation in real research, which makes z-tests for means largely a pedagogical tool. The t-test handles an unknown population standard deviation at any sample size; the distinction matters most when samples are small (roughly n < 30), and as n grows the t-distribution converges to the normal so the two tests give essentially the same answer. Outside of homework problems, the z-test for comparing means barely exists in practice.
What is Welch's t-test and why is it the default?
Welch's version adjusts degrees of freedom through the Welch-Satterthwaite equation, dropping the equal-variances assumption that Student's original test requires. The reason it is increasingly recommended as the default is that equal variances fail more often than most people realize in real data, and Welch's test maintains accurate Type I error rates even when group variances differ by a factor of four. Delacre, Lakens, and Leys demonstrated this in their 2017 International Review of Social Psychology paper, and R's t.test() function defaults to Welch's test for exactly this reason. Python's scipy.stats.ttest_ind() still defaults to Student's version, so you must pass equal_var=False to get the same protection.
How do I run a t-test in Excel?
The formula is =T.TEST(array1, array2, tails, type) where tails is 1 or 2, and type selects paired (1), equal variance (2), or Welch's (3). For example, =T.TEST(A1:A10, B1:B10, 2, 3) runs a two-tailed Welch's test. Most people spend way too long looking for a menu option before discovering the formula approach, which takes about thirty seconds once you know the syntax. Microsoft's official support documentation for Excel 365 confirms that T.TEST returns the p-value directly, saving you from manual t-statistic lookup.
One-tailed or two-tailed: which should I use?
Unless you locked in a directional hypothesis before data collection, two-tailed tests are the safer and more widely accepted choice. The reason journals care about this is straightforward: picking a direction after seeing your results is the definition of p-hacking, and reviewers will catch it. Many journals explicitly state in their submission guidelines that one-tailed tests require pre-registered justification.