Surveys & Standardized Questionnaires (SUS, SUPR-Q, UEQ)

Key takeaways

SUS, SUPR-Q, and UEQ measure different dimensions: SUS covers overall usability, SUPR-Q adds trust and loyalty for websites, and UEQ separates pragmatic from hedonic quality.
Valid quantitative benchmarking requires 40-plus participants for SUS at useful precision — the 5-user rule applies only to qualitative problem-finding.
Administration conditions (item order, scale anchors, post-task timing) must be preserved exactly to keep scores comparable to normative databases.
Attitudinal scores are a signal, not a verdict — always triangulate with task-completion rates, time-on-task, or behavioral analytics to close the say/do gap.
Homegrown satisfaction questions look like data but are statistically indefensible without psychometric validation; exhaust the validated instrument library before writing your own.

The full lesson

Standardized questionnaires give UX teams a shared language for measuring experience quality. Unlike custom satisfaction questions that change from study to study, validated scales produce scores you can benchmark against industry data, track across product releases, and compare across teams.

But getting that right takes more than downloading a scoring spreadsheet. You need to choose the right instrument for your research question, run it under conditions that keep it valid, and read the results alongside behavioral evidence — not in isolation.

Why Standardized Scales Exist

Writing a good survey question is hard. Small wording differences change how people respond. Scale direction and label choices introduce systematic bias. Question order primes the answers that follow.

Validated questionnaires solve this by fixing the wording, order, and response scale of every item. They are developed through factor analysis and test-retest reliability studies, then backed by large normative databases built from hundreds of products.

The payoff is threefold:

Comparability. A SUS score of 72 on your checkout flow today can be compared to the same flow six months ago, or to competitor benchmarks, because the instrument is identical each time.
Efficiency. Five to ten items replace the exploratory question-writing process you’d otherwise need.
Credibility. Validated scales carry peer-reviewed citations, which give stakeholders and procurement offices confidence in the numbers.

The tradeoff is sensitivity. Standardized items are broad by design, so they won’t surface the specific friction points a custom survey could. Use them for measurement, not discovery.

The Three Instruments You Need to Know

System Usability Scale (SUS)

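John Brooke developed SUS in 1986 (the paper describing it was published in 1996) as a “quick and dirty” usability measure given after a participant has used a system. It now has one of the largest normative databases in the field.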

SUS uses ten items on a five-point Likert scale (a rating scale from “strongly disagree” to “strongly agree”). The items alternate polarity — five are worded positively, five negatively. Scores range from 0 to 100, but the scale is not a percentage.

Scoring: For odd-numbered items (positive), subtract 1 from the raw response. For even-numbered items (negative), subtract the raw response from 5. Add up all 10 adjusted values, then multiply by 2.5.
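The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not an official scoring tool; the function name `sus_score` and the input format (a list of ten raw answers in questionnaire order) are assumptions made for the example.

```python
def sus_score(responses):
    """Return the 0-100 SUS score for one participant.

    `responses` holds the ten raw answers (integers 1-5) in questionnaire order.
    The name and input format are illustrative assumptions.
    """
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")

    total = 0
    for item_number, raw in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd-numbered items are positively worded: subtract 1.
            total += raw - 1
        else:
            # Even-numbered items are negatively worded: subtract from 5.
            total += 5 - raw
    # Ten adjusted values of 0-4 sum to 0-40; scaling by 2.5 gives 0-100.
    return total * 2.5


if __name__ == "__main__":
    print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))
```

Score each participant separately, then average the per-participant scores to get the study-level SUS.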

Interpreting the number: Sauro and Lewis (2016) mapped SUS scores to letter grades using a large normative pool. A score above 80.3 = A / Excellent. A score of 68 is the industry average = C / OK. A score below 51 = F / Awful. Always report the grade alongside the number — the raw score alone doesn’t mean much to non-researchers.

Where SUS fits: Short usability tests (moderated or unmoderated), quick post-release benchmarks, regression testing after redesigns. It says nothing about aesthetics, trust, or loyalty.

SUPR-Q (Standardized User Experience Percentile Rank Questionnaire)

Jeff Sauro developed SUPR-Q specifically for website experiences. It fills the gaps SUS leaves open: trust, appearance, and loyalty. Eight items produce an overall score plus four subscale scores. Seven items use a five-point agreement scale; the likelihood-to-recommend item uses the 0–10 scale familiar from NPS. The overall score converts to a percentile rank against a normative database of 150-plus websites.

The four subscales:

Subscale	Items	What it measures
Usability	2	Ease of use and task efficiency
Trust & Credibility	2	Perceived security and dependability
Appearance	2	Visual design quality
Loyalty	2	NPS-derived likelihood to recommend and return

When to use SUPR-Q over SUS: Choose SUPR-Q whenever the experience extends beyond just task completion — e-commerce, SaaS marketing sites, content portals. If a stakeholder cares about brand perception or conversion alongside usability, the subscales give you separate levers to discuss.

Normative data caveat: SUPR-Q’s norms are proprietary and require purchase from Sauro’s MeasuringU. Freely available benchmarks are limited to aggregated published research, so percentile rankings require the paid database.

User Experience Questionnaire (UEQ)

The UEQ (Laugwitz, Held, Schrepp, 2008) covers more ground than SUS or SUPR-Q. It’s popular in research and product evaluations where holistic experience quality — not just usability — is the goal. The standard form has 26 items across six scales. A short form (UEQ-S) cuts this to eight items covering the most discriminating dimensions.

Six scales:

Attractiveness — overall impression (not split into pragmatic/hedonic)
Perspicuity — ease of learning and understanding
Efficiency — task speed, no unnecessary effort
Dependability — user feels in control, product meets expectations
Stimulation — interesting and motivating to use
Novelty — creative, inventive, forward-looking

Each item uses a seven-point semantic differential format — pairs of opposite adjectives like “obstructive / supportive.” Scores range from -3 (strongly negative) to +3 (strongly positive).
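As a rough sketch of how raw UEQ answers become scale scores: each 1–7 answer is shifted to the -3 to +3 range, items whose positive adjective is printed on the left are flipped, and the values are averaged per scale. The item IDs and the item-to-scale mapping below are hypothetical placeholders, not the real UEQ key; for actual studies, use the official item key and analysis tool.

```python
from statistics import mean

# HYPOTHETICAL layout for illustration only: item id -> (scale, reversed?).
# "reversed" marks items whose positive adjective sits on the left of the form.
ITEMS = {
    "q1": ("Efficiency", False),
    "q2": ("Efficiency", True),
    "q3": ("Stimulation", False),
    "q4": ("Stimulation", True),
}


def to_signed(raw, reversed_item):
    """Map a raw 1-7 answer to -3..+3 so that +3 is always the positive pole."""
    if raw not in range(1, 8):
        raise ValueError("UEQ answers must be integers from 1 to 7")
    signed = raw - 4
    return -signed if reversed_item else signed


def scale_means(participants):
    """Average the signed answers per scale.

    `participants` is a list of dicts mapping item id -> raw answer.
    Assumes every participant answered every item.
    """
    per_scale = {}
    for answers in participants:
        for item_id, raw in answers.items():
            scale, reversed_item = ITEMS[item_id]
            per_scale.setdefault(scale, []).append(to_signed(raw, reversed_item))
    return {scale: mean(values) for scale, values in per_scale.items()}


if __name__ == "__main__":
    sample = [
        {"q1": 7, "q2": 1, "q3": 4, "q4": 4},
        {"q1": 5, "q2": 3, "q3": 6, "q4": 2},
    ]
    print(scale_means(sample))
```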

Interpreting UEQ scores: The free benchmark dataset at ueq-online.org contains 20,000-plus evaluations. It groups results into five categories per scale: Excellent (top 10%), Good, Above Average, Below Average, and Bad. Unlike SUS, UEQ separates pragmatic quality (perspicuity, efficiency, dependability) from hedonic quality (stimulation, novelty). That split is especially useful for enterprise software, which often scores well on efficiency but poorly on engagement.

Sample Size: The Number That Breaks Most Survey Studies

This is where practitioners most commonly go wrong. The “5-user rule” applies to qualitative problem-finding — specifically, moderated usability testing analyzed with affinity mapping. It has nothing to do with quantitative benchmarking.

For a standardized questionnaire to produce scores with usable precision at 95% confidence:

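SUS: assuming a within-study standard deviation of about 17.5 points (an assumption; estimate it from your own data where you can), 26 participants give roughly a ±7-point margin of error and about 50 give ±5. A ±3-point margin, which matters when tracking small improvements, takes roughly 130.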
SUPR-Q: 40-plus for reliable subscale scores.
UEQ: 20 is a practical minimum; 50-plus for stable subscale means and valid benchmark comparisons.

Running SUS on eight participants produces a confidence interval of roughly ±15 points — wide enough to miss real regressions or manufacture false improvements. When you can’t reach the minimum, report the score with an explicit confidence interval rather than presenting the number as definitive.
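For planning, the relationship between sample size and margin of error can be sketched as below. This uses the normal approximation (margin = z × SD / √n), which slightly understates the margin for small samples; below about 30 participants a t critical value (for example from `scipy.stats.t.ppf`) is the safer choice. The function names and the standard deviation passed in are assumptions for the example.

```python
import math
from statistics import NormalDist


def margin_of_error(sd, n, confidence=0.95):
    """Approximate half-width of the confidence interval for a mean score.

    Normal approximation; `sd` is an assumed or pilot standard deviation.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / math.sqrt(n)


def required_n(sd, margin, confidence=0.95):
    """Smallest n whose approximate margin of error is at most `margin`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sd / margin) ** 2)


if __name__ == "__main__":
    assumed_sd = 17.5  # assumption: replace with the SD from your own data
    for target in (7, 5, 3):
        print(f"±{target} points needs about {required_n(assumed_sd, target)} participants")
```

Because the required sample grows with the square of the precision, halving the margin of error roughly quadruples the number of participants.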

Do

Report SUS scores with a confidence interval and sample size, e.g. “SUS = 74 (95% CI: 69–79, n=42).” Treat score changes as meaningful only when they exceed the margin of error. Combine quantitative scores with qualitative task observations to explain the number.

Don't

Report a single SUS score from 8 participants as a benchmark. Apply the 5-user rule to quantitative surveys. Present a 3-point score increase as a significant improvement without checking whether it exceeds the margin of error.
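One way to check whether a change between two releases exceeds the margin of error is to put a confidence interval around the difference in mean scores. The sketch below assumes two independent groups of participants and uses a normal approximation with unpooled variances; the function names are illustrative, and for small samples a proper t-test from a statistics package would be more appropriate.

```python
import math
from statistics import NormalDist, mean, stdev


def difference_ci(before, after, confidence=0.95):
    """Return (difference, low, high) for mean(after) - mean(before).

    Assumes independent samples; normal approximation, unpooled variances.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = mean(after) - mean(before)
    standard_error = math.sqrt(
        stdev(before) ** 2 / len(before) + stdev(after) ** 2 / len(after)
    )
    return diff, diff - z * standard_error, diff + z * standard_error


def is_meaningful(before, after, confidence=0.95):
    """True when the interval around the difference excludes zero."""
    _, low, high = difference_ci(before, after, confidence)
    return low > 0 or high < 0


if __name__ == "__main__":
    release_a = [60, 70, 80] * 10   # made-up per-participant SUS scores
    release_b = [63, 73, 83] * 10   # made-up scores, 3 points higher on average
    print(difference_ci(release_a, release_b))
    print(is_meaningful(release_a, release_b))
```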

Choosing the Right Instrument

Research question	Best instrument	Notes
“How usable is this feature post-launch?”	SUS	Fast, widely normed, pairs well with task metrics
“How does our website compare to competitors on trust and loyalty?”	SUPR-Q	Website-specific norms; requires paid database for percentile rank
“Is this enterprise tool perceived as both efficient and engaging?”	UEQ	Separates pragmatic and hedonic dimensions
“What’s our usability trend over six sprints?”	SUS or UMUX-Lite	UMUX-Lite (2 items) is faster; correlates well with SUS
“Should we use NPS?”	Avoid as sole CX metric	NPS measures loyalty intention, not experience quality; use SUPR-Q instead, whose Loyalty subscale already includes likelihood to recommend

A note on UMUX-Lite: this two-item scale (Lewis, Utesch, and Maher, 2013, derived from Finstad’s 2010 UMUX) correlates at r=0.81 with SUS and works well for continuous in-product measurement where even 10 items would be intrusive. It is not a replacement when you need SUS-level statistical precision for formal benchmarking.

Fielding Conditions That Preserve Validity

Validated instruments were developed under specific conditions. Quietly deviating from those conditions invalidates the norms.

Administer after task completion, not before. All three instruments measure post-interaction impressions. Running SUPR-Q before users explore a site measures expectations, not experience — a different and less useful construct.

Don’t change item order. The alternating polarity in SUS is intentional. Reordering or removing items breaks the validated scoring algorithm. If you need to customize, use a different instrument or build a validated custom scale — not a trimmed SUS.

Preserve the response scale. SUS requires a 5-point agreement scale with specific anchors. UEQ requires a 7-point bipolar format. Switching to a 10-point scale because it “feels more precise” changes the psychometrics.

Use neutral framing. Introducing the survey with “We just redesigned this — we hope you found it easier!” primes positive responses. Neutral framing: “Please rate your experience completing the tasks you just performed.”

Run competitive benchmarks in separate sessions. When comparing your product to a competitor, have each participant evaluate only one product. Exposing them to both in sequence creates contrast effects that inflate the perceived gap.

Triangulation: What the Score Doesn’t Tell You

Attitudinal data — what people say they think — is an imperfect proxy for behavior. Users often rate an experience highly after struggling through it (the “completion effect”), rate it poorly after a single jarring moment that wasn’t representative, or adapt to poor design so thoroughly they no longer notice it.

Treat attitudinal scores as one signal among several:

Behavioral complement: Pair SUS scores with task-completion rate and time-on-task from the same session. A SUS of 78 with a 60% task-completion rate is a worse situation than a SUS of 68 with a 92% completion rate — the first case suggests users feel good about something they’re frequently failing at.
Qualitative complement: Use think-aloud or post-session interview data to understand why a subscale (like UEQ’s Novelty) scored low. “The design feels dated” is something a UEQ score can signal; the specific element driving that perception requires qualitative investigation.
Analytics complement: If your SUPR-Q Trust subscale drops after a product change, cross-reference it with support ticket volume and checkout-abandonment rate from analytics. Convergence across data types builds stakeholder confidence.

Homegrown vs. Validated: When to Roll Your Own

Sometimes none of the standard instruments fits your product type — a physical-digital hybrid, a voice interface, a specialized professional tool. Before writing custom items, exhaust the validated options:

SEQ (Single Ease Question): One item on a 7-point scale, validated for post-task perceived ease. Faster than SUS per task; useful in task-by-task protocols.
CSUQ / PSSUQ: IBM’s computer usability scales. More items than SUS but more diagnostic when you need to distinguish interface quality from information quality.
Validated domain-specific scales: Healthcare UX, gaming, AR/VR each have published domain scales. Search Google Scholar before writing your own items.

If you must write custom items, run at least a pilot reliability check — Cronbach’s alpha (a measure of internal consistency) should be 0.7 or higher before you treat aggregate scores as meaningful. Homegrown unvalidated satisfaction questions produce numbers that look precise but are statistically indefensible. They also can’t be compared across studies.
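The pilot reliability check can be sketched with the standard library alone. This is a minimal illustration of the usual Cronbach's alpha formula, α = k/(k−1) × (1 − Σ item variances / variance of totals); the function name and input layout are assumptions, and negatively worded items must be reverse-scored before calling it.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a set of custom items.

    `rows` holds one list of item scores per participant, all the same length,
    with any negatively worded items already reverse-scored.
    """
    if len(rows) < 2:
        raise ValueError("need at least two participants")
    k = len(rows[0])
    if k < 2:
        raise ValueError("need at least two items")
    if any(len(row) != k for row in rows):
        raise ValueError("every participant must answer every item")

    # Sample variance of each item across participants.
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each participant's total score.
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    pilot = [  # made-up pilot data: 5 participants x 3 items
        [4, 5, 4],
        [2, 3, 2],
        [5, 5, 4],
        [3, 3, 3],
        [1, 2, 2],
    ]
    alpha = cronbach_alpha(pilot)
    print(f"alpha = {alpha:.2f}", "(ok)" if alpha >= 0.7 else "(below 0.7)")
```

A pilot with only a handful of participants gives an unstable alpha, so treat the 0.7 check as a screening step and not as validation.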

Reporting to Stakeholders

A SUS score presented as a single number invites misinterpretation. A rigorous report includes:

Score + letter grade/adjective — e.g., “SUS 76 = B / Good”
Confidence interval — e.g., “95% CI: 72–80”
Sample characteristics — are participants representative of the target user population?
Task context — which tasks, in which prototype or live product
Trend or benchmark comparison — how does this compare to the previous release or industry median?
Qualitative annotation — two to three themes from the session that explain the score direction

This format moves the conversation from “is 76 good?” to “we’re above industry average on overall usability, the UEQ Efficiency scale is dragging us down, and three participants mentioned the filter interaction as the main friction point.” That second version is actionable.