What Is Power Analysis?
"Statistical power" sounds more menacing than it actually is. Power is just the probability your study will detect a real effect when one exists. Set the target to 0.80 and you are saying you want an 80% chance of catching the effect you are looking for. Drop below 0.50 and you are flipping a coin on whether the study can find anything at all, which is a terrible basis for spending months on data collection.
Power analysis links four quantities: sample size, effect size, significance level α, and power. Fix any three and the fourth is determined. The modes above solve for the three you typically need: required sample size, achieved power, or minimum detectable effect (α is always an input). The math family covered here is t-tests (one-sample, two-sample independent with equal variance and group size, and paired), with reference values tracking G*Power 3.1.
Effect Size: The Conventions and Their Limits
Cohen's d, the difference between group means in units of the pooled standard deviation, is what everybody reaches for, and it has had its share of criticism. Cohen's own Power Primer (Psychological Bulletin 112(1): 155-159, DOI 10.1037/0033-2909.112.1.155) set d = 0.2 / 0.5 / 0.8 as small / medium / large. Use those cutoffs only when no published meta-analysis gives you something better.
There's been pushback on applying Cohen's labels across every field. Funder and Ozer wrote up the case in 2019 — a d that matters in mortality data is often noise in a psychophysics study. Pick your benchmarks from work done on your specific outcome, not from a 1988 table.
How the Math Works
Non-centrality parameter δ
Two-sample (equal n): δ = d · √(n / 2)
One-sample and paired: δ = d · √n
δ is how far the t-statistic's distribution shifts away from zero when the alternative hypothesis is true. Bigger d or bigger n pushes δ higher, which makes the test more likely to reject H₀.
Degrees of freedom
Two-sample: df = 2n − 2
One-sample and paired: df = n − 1
n is per group for the two-sample test and total (or number of pairs) otherwise.
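In code the bookkeeping is tiny. A minimal Python sketch of the two formula pairs above (the function name is illustrative, not the calculator's source):

```python
import math

def ncp_and_df(d, n, design="two-sample"):
    """Return (non-centrality delta, degrees of freedom) for the t-test family.

    n is per group for "two-sample", total (or number of pairs) otherwise.
    """
    if design == "two-sample":
        return d * math.sqrt(n / 2), 2 * n - 2
    # one-sample and paired tests share the same formulas
    return d * math.sqrt(n), n - 1

print(ncp_and_df(0.5, 64))  # (2.828..., 126): d = 0.5 with 64 per group
```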
Normal approximation (seed)
n ≈ 2 · (zα/2 + zβ)² / d² (per group, two-sample)
n ≈ (zα/2 + zβ)² / d² (one-sample, paired)
The calculator seeds with this closed form, then refines using the actual t critical value at the implied df. Final n is bumped by one if needed so the target power is genuinely met rather than narrowly missed.
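A minimal sketch of that seed-and-refine loop for the two-sample case, using SciPy's non-central t for the exact power check (assumed here for illustration; the calculator's own internals may differ):

```python
import math
from scipy.stats import norm, nct, t

def power_two_sample(d, n, alpha=0.05):
    """Exact two-sided power for a two-sample t-test (equal n, equal variance)."""
    df = 2 * n - 2
    delta = d * math.sqrt(n / 2)
    t_crit = t.ppf(1 - alpha / 2, df)
    # P(|T| > t_crit) when T follows a non-central t with non-centrality delta
    return 1 - nct.cdf(t_crit, df, delta) + nct.cdf(-t_crit, df, delta)

def n_per_group(d, power=0.80, alpha=0.05):
    """Seed with the normal closed form, then bump n until the target power is met."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n = max(2, math.ceil(2 * z**2 / d**2))        # normal-approximation seed
    while power_two_sample(d, n, alpha) < power:  # refine at the real t critical value
        n += 1
    return n

print(n_per_group(0.5))  # 64 per group at alpha = 0.05, power = 0.80
```

The normal seed usually lands within a participant or two of the final answer, so the refinement loop rarely runs more than once.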
Running It Before You Collect Data
Power analysis is the cheapest insurance in research. A 15-minute calculation before recruitment tells you whether your planned sample has any realistic chance of finding the effect you care about. Skip it and you can still run the study, analyze the data, see p > 0.05, and not know whether the effect is small or just invisible at that sample size — a waste of time that a pre-study calculation would have flagged on day one.
The sample size it asks for can be uncomfortable. Detecting a d = 0.2 effect at 80% power in a two-sample t-test (α = 0.05, two-sided) takes 394 participants per group, 788 total. If that is outside your budget the analysis is still useful: it tells you the real options are to measure a more sensitive outcome, run a more efficient design (within-subjects beats between-subjects for power), or accept that the literature's effect sizes do not apply to your specific setup.
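That headline number is easy to cross-check against an independent implementation; statsmodels, for instance, solves the same problem:

```python
import math
from statsmodels.stats.power import TTestIndPower

# d = 0.2, alpha = 0.05 two-sided, power = 0.80, equal group sizes
n = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05, power=0.80)
print(math.ceil(n))  # 394 per group, matching the figure above
```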
The Post-Hoc Power Trap
Skip post-hoc power. Hoenig and Heisey laid out the problem in 2001: retrospective power computed from the observed effect size is just the p-value in disguise. It tells you nothing you didn't already know from the significance test itself. Their paper "The Abuse of Power" (The American Statistician 55(1): 19-24, DOI 10.1198/000313001300339897) is still the cleanest write-up.
The useful question after a non-significant result is different: given the sample size you actually collected, what effect would the study have been able to detect? Plug your n into the Detectable effect mode above. The answer puts a concrete floor on the effects the study could have caught, which is a far more honest report than any post-hoc power number.
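Here is what that question looks like in code, again with statsmodels as a stand-in for the mode above and 50 per group as a made-up example n:

```python
from statsmodels.stats.power import TTestIndPower

# Leave effect_size unspecified and statsmodels solves for it.
d_min = TTestIndPower().solve_power(nobs1=50, alpha=0.05, power=0.80)
print(round(d_min, 3))  # ~0.566: the smallest d detectable with 80% power at 50 per group
```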
Common Mistakes
- Using the observed effect size to plan a follow-up study: Observed effect sizes from small studies are noisy and upwardly biased under publication selection. Either plan around the lower bound of the confidence interval on the effect rather than its point estimate, or pick a smaller, theory-grounded d and accept that the replication needs more participants.
- Treating Cohen's 0.5 as a universally medium effect: d = 0.5 is medium on Cohen's convention chart. That tells you nothing about whether it is a plausible effect size in your subfield. Published meta-analyses are almost always a better anchor.
- Assuming equal-variance, equal-n when neither is true: This calculator assumes both. If your groups have unequal n or unequal variance, Welch-style adjustments give slightly different sample-size targets — usually within a few participants but worth knowing.
- Computing power after the test: See the post-hoc trap section above. Use the Detectable effect mode instead, which answers a question that actually has an answer.
- Ignoring measurement reliability: Attenuation from unreliable measures inflates the sample size you actually need. If your instrument has a reliability of 0.7, a d = 0.5 on the true construct shrinks by a factor of √0.7 to roughly d = 0.42 on the observed scale, and since required n scales with 1/d², the sample size jumps by about 1.4x; the sketch after this list runs the numbers.
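The attenuation arithmetic from the last item, as a back-of-envelope sketch (assuming the classical d_observed = d_true · √reliability relation):

```python
import math

d_true, reliability = 0.5, 0.7
d_observed = d_true * math.sqrt(reliability)  # ~0.42 on the observed scale
inflation = 1 / reliability                   # required n scales with 1/d^2, so ~1.43x
print(round(d_observed, 2), round(inflation, 2))
```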
Frequently Asked Questions
What does statistical power mean?
Power is the probability your study will detect a real effect when one exists. Set it to 0.80 and you are saying you want an 80% chance of a significant result given a true effect of the size you assumed. The remaining 20% is Type II error — failing to detect something that is genuinely there.
Why is 0.80 the usual target?
Cohen proposed 0.80 as a reasonable floor that balances false negatives against the cost of data collection, not because the universe favors that number. Journals in several fields have since baked it into reviewer expectations. Targeting 0.90 or 0.95 is fine when missing the effect would be costly — drug trials and diagnostic studies commonly do.
What effect size should I plug in?
Recent meta-analyses in your subfield are the best source. If nothing comparable exists in the literature, Cohen's 0.2 / 0.5 / 0.8 is the fallback — but treat it as a guess, since Cohen himself called it that.
Does this calculator match G*Power?
For t-tests, yes — within ±1 participant across the standard α and power combinations. The algorithm uses a central-t shift approximation to the non-central t CDF that is within about 0.2% of the exact version for n > 10. Any occasional one-person difference comes from different approximations, not from a methodological disagreement.
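For the curious, one common form of that shift approximation is NCT(x; df, δ) ≈ T(x − δ; df); a SciPy sketch comparing it to the exact non-central t power (the calculator's internals may differ in detail):

```python
import math
from scipy.stats import nct, t

d, n, alpha = 0.5, 64, 0.05                 # 64 per group, two-sample
df, delta = 2 * n - 2, d * math.sqrt(n / 2)
t_crit = t.ppf(1 - alpha / 2, df)

exact = 1 - nct.cdf(t_crit, df, delta) + nct.cdf(-t_crit, df, delta)
approx = 1 - t.cdf(t_crit - delta, df)      # central t shifted by delta
print(round(exact, 4), round(approx, 4))    # the two agree closely at this n
```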
Can I use this for ANOVA, correlation, or chi-square?
Not yet — this page covers the t-test family only (one-sample, two-sample independent, paired). For F-tests and categorical methods the non-centrality parameters are different — Cohen's f for ANOVA, w for chi-square, q for proportions — and deserve their own calculator, which is planned.
Is post-hoc power analysis worth running?
No. Hoenig and Heisey showed retrospective power is just the p-value in a different hat — it cannot separate "low power" from "small real effect." Use the Detectable effect mode above instead; that one answers a question with an actual answer.