
Sample Sizing & Statistical Foundations

Knowing how many participants you actually need — and why — is the difference between research that drives decisions and research that merely fills a slide deck.


Getting sample size wrong is the most quietly damaging mistake in UX research. Too few participants in a quantitative benchmark and your task-success numbers are noise dressed up as signal. Too many participants in a qualitative study and you’ve burned four weeks of recruiting budget on diminishing returns. This lesson gives you the mental model to match sample size to study type, interpret statistical results honestly, and push back when stakeholders confuse “n=5” with proof.

Why Sample Size Is a Research Design Decision, Not an Afterthought

Sample size determines two distinct things: what claims you can legitimately make, and how much confidence you can place in those claims. Those are not the same question.

A qualitative study with five participants can legitimately say “these are plausible usability problems worth investigating.” It cannot say “70% of our users experience this.” A quantitative benchmark with forty participants at 95% confidence can say “our task success rate is between 62% and 84%.” Both statements are valid — but only when the sample matches the claim.

The outdated habit of applying the “five-user rule” to every study collapses this distinction. That rule comes from Nielsen and Landauer’s 1993 work on qualitative problem-finding. It has a specific scope: one usability study, one homogeneous user segment, looking for problems rather than measuring rates. Applying it to a benchmark, a survey, or a card sort is a category error.

Qualitative vs. Quantitative: Different Goals, Different Math

Start by asking: “What kind of claim do I need this research to support?” That question almost always splits into two modes.

Qualitative research — user interviews, moderated usability tests, diary studies, contextual inquiry — generates hypotheses, surfaces unforeseen problems, and reveals the texture of user experience. The goal is saturation: the point where additional participants stop surfacing new themes. Saturation typically arrives around:

  • 5–8 participants per distinct user segment for problem-finding usability studies
  • 8–12 participants per segment for generative interview studies
  • 15–20 participants total if you have two clearly distinct segments and want cross-segment coverage

These are not magic numbers — they are experience-based heuristics. If you reach your fifth interview and every session is introducing major new themes, keep going. If you reach your twelfth and nothing is new, stop.

Quantitative research — unmoderated benchmark tests, surveys, A/B tests, analytics — measures rates, tests hypotheses with statistical confidence, and detects differences between conditions. It requires enough participants to produce narrow confidence intervals. Three key variables drive that requirement:

  • Confidence level: typically 90%, 95%, or 99% — how often your interval would contain the true value if you repeated the study many times
  • Margin of error: how wide an interval you can tolerate around your point estimate
  • Expected proportion: your estimate of the true rate (used to calculate variance)

For 95% confidence and a ±10% margin of error around an estimated 50% proportion, you need roughly 96 participants. Tighten to ±5% and you need around 384. This is why most UX survey research is systematically under-powered.
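Both figures come from solving the standard margin-of-error expression for a proportion for n. A worked version, assuming the normal approximation and z = 1.96 for 95% confidence:

```latex
n = \frac{z^{2}\,p\,(1-p)}{ME^{2}}

n_{\pm 10\%} = \frac{1.96^{2} \times 0.5 \times 0.5}{0.10^{2}} = \frac{0.9604}{0.01} \approx 96

n_{\pm 5\%} = \frac{1.96^{2} \times 0.5 \times 0.5}{0.05^{2}} = \frac{0.9604}{0.0025} \approx 384
```

Halving the margin of error quadruples the required sample, because the margin sits squared in the denominator.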

| Study Type | Typical Minimum | What You Can Claim |
|---|---|---|
| Qualitative problem-finding (1 segment) | 5–8 | Plausible usability problems worth investigating |
| Qualitative generative interviews (1 segment) | 8–12 | Themes and patterns in user mental models |
| Unmoderated benchmark (directional) | 20–30 | Directional task success rate (wide CI) |
| Unmoderated benchmark (95% confidence, ±10%) | 96+ | Reliable benchmark; CI narrow enough to track change |
| Survey (95% confidence, ±5%) | 384+ | Statistically reliable prevalence estimates |
| A/B test (80% power, 5% significance, 20% lift) | Calculated per effect size | Causal attribution of metric change to treatment |

Confidence Intervals: Report Ranges, Not Just Percentages

A task success rate of 72% means almost nothing without its confidence interval. As a bare percentage, it looks like a fact. Reported as “72% (95% CI: 58–83%)” it tells you the true rate could plausibly sit anywhere from just over half of users succeeding to more than four in five succeeding, which is a very different message.

Confidence intervals are not optional decoration. They are the honest translation of your sample size into the precision of your claim. Here is how to construct one for a binary outcome (success/fail):

  1. Compute the sample proportion: p = successes / n
  2. Use the Wilson score interval (preferred over the Wald interval for proportions near 0 or 1, and for small samples)
  3. Report as: “Task success rate: 72% (95% CI: 58–83%, n=40)”

Most research platforms (Maze, UserTesting) and spreadsheet tools can compute Wilson intervals with a formula. If you are reporting to stakeholders who aren’t statistically fluent, translate it plainly: “We are 95% confident the true success rate is between 58% and 83%.” Then explain what that range means for the decision at hand.
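If no tool is at hand, the Wilson score interval is short enough to compute directly. The following is a minimal sketch using only the Python standard library; the function name `wilson_interval` is illustrative and not taken from any research platform, and the 29-of-40 example data is hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a 95% confidence level.
    Returns (lower, upper) as proportions between 0 and 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    # The Wilson interval is centred on a value pulled slightly toward 0.5.
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hypothetical benchmark: 29 of 40 participants completed the task.
low, high = wilson_interval(29, 40)
print(f"95% CI: {low:.1%} to {high:.1%} (n=40)")
```

For 29 successes out of 40 this gives an interval of roughly 57% to 84%, in line with the illustrative range quoted above.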

The minimum-detectable-effect implication: to detect a 10-percentage-point improvement (say from 65% to 75%) with 80% statistical power at a 5% significance level (two-sided), you need roughly 330 participants per condition in an A/B comparison. Many teams run A/B tests on far smaller samples, observe a non-significant result, and conclude the change had no effect, when in reality they simply lacked the power to detect it.
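To see how little power a small test has, you can estimate the power a given sample actually achieves. This sketch assumes a two-sided two-proportion z-test under the normal approximation; `approx_power` is a hypothetical helper, and dedicated tools such as G*Power may give slightly different figures.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    p1, p2     : true success rates in the two conditions
    n_per_arm  : participants in each condition
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    # Standard error under the null (pooled) and under the alternative.
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_arm)
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm)
    return norm.cdf((abs(p2 - p1) - z_alpha * se_null) / se_alt)

# A 65% -> 75% improvement tested with only 50 participants per condition.
print(f"Power with 50 per condition: {approx_power(0.65, 0.75, 50):.0%}")
```

Under these assumptions, 50 participants per condition detect a real 10-point improvement only about one time in five, so a non-significant result from such a test says very little.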

Survey Design and the Sample Math Behind It

Surveys are the most commonly misused quantitative method in UX research. They are fast to deploy and easy to read — which masks the precision problem: most UX surveys run on 50–200 respondents and get reported as if they were census data.

The margin-of-error formula for a proportion survey (simplified) is:

ME = z * sqrt( p*(1-p) / n )

Here z is 1.96 for 95% confidence, p is the estimated proportion, and n is sample size. At n=50, you have a margin of error of ±14% around a 50% proportion. That means “half of users do X” is actually compatible with anywhere from 36% to 64% doing X. That is not useful.

What this means in practice:

  • A survey of 50 people reporting product satisfaction is decorative, not diagnostic.
  • A survey of 200 people supports claims with ±7% margin at 95% confidence — useful for prioritization, not precise measurement.
  • A survey of 400+ people supports ±5% claims — appropriate for benchmarks and longitudinal tracking.
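The margins quoted above can be reproduced directly from the formula. A minimal sketch, assuming the worst-case proportion p = 0.5 and z = 1.96; the function name is illustrative:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error for a proportion: ME = z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (50, 200, 400):
    print(f"n={n}: ±{margin_of_error(n):.1%}")
# n=50:  ±13.9%
# n=200: ±6.9%
# n=400: ±4.9%
```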

Validated questionnaires sidestep much of this problem. The System Usability Scale (SUS), UMUX-Lite, and Single Ease Question (SEQ) are normed instruments with published benchmark data. A SUS score of 74 means something because you can compare it against thousands of previously scored products. A five-question homegrown satisfaction scale has no such anchor. Use validated instruments wherever one exists.
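Part of what makes a SUS score comparable is that it is always scored the same way. As a sketch of the standard scoring procedure (ten items answered on a 1–5 scale, odd items positively worded, even items negatively worded), with a hypothetical function name:

```python
def sus_score(responses):
    """Score one participant's SUS questionnaire (0-100).

    responses: ten integers from 1 to 5, in item order (item 1 first).
    """
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs exactly ten responses, each from 1 to 5")
    total = 0
    for item_number, r in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += r - 1   # odd items: higher agreement is better
        else:
            total += 5 - r   # even items: higher agreement is worse
    return total * 2.5       # rescale the 0-40 sum to 0-100

# Hypothetical participant who answers "3" (neutral) to every item.
print(sus_score([3] * 10))
```

The study-level SUS score is the mean of the per-participant scores, and like any other estimate it should be reported with a confidence interval.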

Do

  • Match sample size to the claim you need the research to support, not to a rule of thumb.
  • Report quantitative results as ranges (confidence intervals), not bare percentages.
  • Use validated questionnaires (SUS, UMUX-Lite, SEQ) rather than homegrown satisfaction questions.
  • Calculate required sample size before fielding a study — tools like G*Power or online calculators for proportions take minutes to run.
  • Combine qualitative and quantitative methods sequentially: quantitative for “what” and “how much,” qualitative for “why.”

Don't

  • Apply the 5-user rule to quantitative benchmarks, surveys, or cross-segment studies.
  • Report a task success rate without a confidence interval.
  • Treat a survey of 50 respondents as statistically reliable evidence of user prevalence.
  • Conflate statistical significance (p < 0.05) with practical significance (the effect is large enough to matter).
  • Use engagement metrics — DAU, time-on-page, click-through rate — as proxies for usability without pairing them with task-success or friction measurements.

Segmentation and the Multiple-Segment Problem

Most product research involves more than one distinct user type. When you have multiple segments, the sample math compounds. If you need 8 participants per segment for a qualitative study and you have three segments, you need 24 participants, not 8. Mixing segments into a single analysis pool and treating the combined group as one homogeneous audience produces findings that may not be accurate for any of your actual users.

Segment definitions matter more than segment count. A segment is only a distinct group if the users in it have meaningfully different mental models, task flows, or goals. “Enterprise users” and “SMB users” often qualify. “Users who signed up in Q1” and “users who signed up in Q4” usually do not, unless you have evidence that onboarding timing predicts behavior.

Practical heuristics for segmented studies:

  • If two segments have fundamentally different primary tasks or goals, treat them as separate studies. Combining them masks the most important differences.
  • If a segment represents less than 15% of your user base and the product decisions aren’t specifically designed for them, note them as a secondary segment. Recruit 2–3 qualitative participants to check for major divergences — not a full primary sample.
  • For quantitative studies, precision is determined by each segment’s own sample, not by the study total. A 40-participant study split evenly across two segments gives you 20 per segment, which is a directional benchmark at best.

Benchmarking: Measuring Change Over Time

Benchmarking means measuring the same metrics at regular intervals to track whether product changes improve or degrade user experience. It requires the most rigorous sample discipline of any UX measurement approach, because the entire value of a benchmark is comparability across time.

What makes a benchmark valid:

  • Consistent participant criteria across rounds (same screener, same platform)
  • Consistent tasks (identical wording — no “improvements” to the tasks themselves)
  • Consistent measures (same questionnaires, same success criteria)
  • Sufficient sample per round to detect the minimum change that would affect a product decision

That last point is frequently overlooked. Before establishing a benchmark, calculate the minimum detectable effect: how large a change in task-success rate, SUS score, or error rate would actually change a product decision? Then work backwards to find the sample per round that would detect that change with adequate power. A benchmark that can only detect a 25-percentage-point shift — when you care about a 10-point shift — is just recording noise.
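A rough way to check what a planned per-round sample can detect is to invert the power calculation. The sketch below assumes two independent rounds of equal size, a two-sided test under the normal approximation, and the baseline variance applied to both rounds; `min_detectable_shift` is a hypothetical helper, so treat its output as a planning estimate only.

```python
from math import sqrt
from statistics import NormalDist

def min_detectable_shift(n_per_round, baseline, alpha=0.05, power=0.80):
    """Approximate smallest change in a success rate that two rounds
    of n_per_round participants each can detect."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    # Standard error of the difference between two independent proportions.
    se_diff = sqrt(2 * baseline * (1 - baseline) / n_per_round)
    return (z_alpha + z_beta) * se_diff

# Hypothetical benchmark: 40 participants per round, 65% baseline success.
shift = min_detectable_shift(40, 0.65)
print(f"Smallest detectable shift: about {shift * 100:.0f} percentage points")
```

With 40 participants per round and a 65% baseline, the detectable shift is about 30 percentage points, which is exactly the kind of benchmark that cannot see a 10-point change.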

Common benchmark metrics:

| Metric | What It Measures | Validated? |
|---|---|---|
| Task success rate | Whether users can complete a defined goal | Yes; binary completion is unambiguous |
| Task time (median) | Efficiency of the primary flow | Yes; report median, not mean |
| SUS score | Overall perceived usability | Yes; normed against thousands of products |
| UMUX-Lite | Perceived ease and capability | Yes; validated against SUS |
| SEQ (Single Ease Question) | Per-task perceived difficulty | Yes; 7-point scale, normed data available |
| Error rate | Frequency of mistakes in a task | Depends on how errors are defined; define explicitly |

The Say/Do Gap and Why Behavioral Data Wins

Self-report research — surveys, interviews, rating scales — measures what people say they do, think, or prefer. Behavioral data — clickstreams, task completion logs, session recordings, A/B test results — measures what people actually do. When these conflict, behavioral data is almost always more accurate for predicting future behavior.

This is not an argument against qualitative research. Interviews and surveys are often the only way to access mental models, past experiences, and latent needs that never show up as observable behavior. But it is an argument against using attitudinal data as a substitute for behavioral evidence when behavioral data is obtainable.

The modern research practice is triangulation: use behavioral data to establish what is happening, and qualitative methods to understand why. A task-success rate tells you that 60% of users fail the checkout flow. An interview study tells you they fail because they don’t trust the unknown payment processor. You need both facts to design the fix.

Specific situations where the say/do gap is most dangerous:

  • Feature prioritization surveys (“which of these would you use?”) — users consistently over-report intent to use features they ultimately ignore.
  • Preference testing without behavioral grounding — stated preference for a design variant has weak correlation with actual usage patterns.
  • Satisfaction ratings after a failed task — users who fail will often still rate an experience favorably, especially if they blame themselves.

Practical Power Calculations Before You Field a Study

Running a power calculation before fieldwork is the difference between a study that can answer your question and one that cannot. Most researchers skip this step because it feels academic. It is not — it is the single most important design decision in quantitative research.

For a single-proportion benchmark (measuring task success rate):

  1. Decide your confidence level (use 95% as the standard).
  2. Decide your acceptable margin of error (±10% for directional, ±5% for reliable benchmarks).
  3. Estimate the expected proportion (use 50% if unknown; it yields the largest, most conservative required sample).
  4. Use the formula or an online calculator to get required n.
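The four steps above reduce to one small function. A minimal sketch, assuming the normal approximation and no finite-population correction; `required_n` is an illustrative name:

```python
import math

def required_n(margin, p=0.5, z=1.96):
    """Participants needed to estimate a proportion within +/- margin.

    margin : acceptable margin of error as a proportion (0.10 for ±10%)
    p      : expected proportion (0.5 is the conservative default)
    z      : z-score for the confidence level (1.96 for 95%)
    """
    if not 0 < margin < 1:
        raise ValueError("margin must be between 0 and 1")
    # Round up: a fractional participant cannot be recruited.
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(required_n(0.10))          # directional benchmark
print(required_n(0.05))          # reliable benchmark
print(required_n(0.10, p=0.72))  # if prior data suggests about 72% success
```

Rounding up gives 97 and 385 for the first two cases, one more than the commonly quoted 96 and 384. An informed estimate of the proportion lowers the requirement: at an expected 72%, ±10% needs 78 participants.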

For an A/B comparison (detecting a difference between two designs):

  1. Decide on statistical power (80% is the convention, which accepts a 20% chance of missing a real effect of the target size).
  2. Decide on significance level (5% is the convention, which accepts a 5% chance of a false positive when there is no real difference).
  3. Estimate the minimum effect size you care about detecting (in percentage points).
  4. Use G*Power, Statsig’s sample size calculator, or Evan Miller’s A/B testing calculator.
  5. Plan your study to run until that sample is reached — stopping early when results look promising inflates false-positive rates.
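For a quick estimate before opening a calculator, the per-condition sample can be approximated by hand. This is a sketch of the common normal-approximation formula for a two-sided comparison of two proportions; `n_per_arm` is a hypothetical name, and the calculators listed above may apply corrections that shift the answer by a few participants.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate participants per condition to detect p1 vs p2."""
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = norm.inv_cdf(power)           # power requirement
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Baseline 65% task success, smallest improvement worth detecting: 10 points.
print(n_per_arm(0.65, 0.75))
# Halving the effect you care about roughly quadruples the sample.
print(n_per_arm(0.65, 0.70))
```

Fix the resulting number before fielding and analyze only once it is reached; the calculation assumes a single look at the data.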

The core principle: calculate what you need to detect what matters, then recruit to that number. Under-powered studies waste resources and mislead decisions. Oversized qualitative studies waste time that could be spent iterating.