Standardized Usability Surveys: SUS, UMUX-Lite & SEQ

Key takeaways

SUS is the most validated and benchmarked instrument (norm ≈ 68); scores above 80 are reliably "good," but always report confidence intervals and sample size alongside the number.
UMUX-Lite (2 items, 7-point scale) is the right choice for in-product micro-surveys and longitudinal pulses; use the conversion formula to align it with SUS benchmarks.
SEQ belongs immediately after each task — not at session end — and its average benchmark of 5.5/7 makes it a fast signal for prioritizing redesign effort.
Quantitative benchmarking needs 40+ participants; applying the 5-user rule to survey studies produces dangerously wide confidence intervals dressed up as precision.
Pair every survey score with at least one behavioral metric — task-success rate or completion rate — to bridge the say/do gap and make findings credible to engineering and product stakeholders.

The full lesson

Standardized usability surveys are among the few UX measurement tools that have been psychometrically validated. That means researchers have tested their reliability (do they give consistent results?) and validity (do they measure what they claim to measure?) across thousands of studies. When you use SUS, UMUX-Lite, or SEQ correctly, you get scores you can honestly compare across releases, products, and competitors. When you use them carelessly, the numbers feel rigorous but mislead.

This lesson covers how each instrument works, when to pick one over another, how to score them properly, and how to combine them with behavioral data for an honest picture of usability.

Why Standardized Instruments Exist

Every team is tempted to write their own satisfaction questions. The problem is that homegrown scales have unknown reliability and no external benchmarks. A score of 72 on a custom survey is meaningless without knowing the distribution that score came from.

Standardized instruments solve this in three ways:

Validated psychometric properties — internal consistency (Cronbach’s alpha is typically 0.85–0.92 for SUS), test-retest reliability, and construct validity, all confirmed across many studies.
External benchmarks — decades of published data let you say “our score is at the 60th percentile of enterprise software” instead of just “users seem fairly satisfied.”
Comparability over time — a consistent instrument lets you detect real change across sprints and releases, not just noise from wording variation.
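To make "internal consistency" concrete, here is a minimal sketch of the standard Cronbach's alpha calculation. The function name and the sample data are hypothetical, and the sketch assumes negatively worded items (such as the even-numbered SUS items) have already been reverse-scored so that all items point in the same direction.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha, where items[i][p] is participant p's answer to item i.

    Assumes every item is scored in the same direction (reverse-score
    negatively worded items first) and that total scores are not all equal.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    n = len(items[0])
    if n < 2 or any(len(item) != n for item in items):
        raise ValueError("every item needs the same number (>= 2) of responses")

    # Each participant's total score across all items.
    totals = [sum(item[p] for item in items) for p in range(n)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical data: three items answered by five participants.
example_items = [
    [4, 3, 2, 4, 1],
    [4, 2, 2, 3, 1],
    [3, 3, 1, 4, 2],
]
print(round(cronbach_alpha(example_items), 2))
```

Values closer to 1 mean the items move together; a homegrown scale that has never been checked this way has no such evidence behind it.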

The System Usability Scale (SUS)

John Brooke introduced SUS in 1986 as a “quick and dirty” usability scale at Digital Equipment Corporation. It has since become the most widely cited usability questionnaire in the world, with thousands of published studies providing solid norms.

The 10 Items

SUS uses 10 statements. Participants rate each one on a 5-point Likert scale (1 = Strongly disagree, 5 = Strongly agree). The items alternate between positive and negative framing to reduce acquiescence bias — the tendency to agree with whatever is asked.

#	Statement	Polarity
1	I think that I would like to use this system frequently.	Positive
2	I found the system unnecessarily complex.	Negative
3	I thought the system was easy to use.	Positive
4	I think that I would need the support of a technical person to be able to use this system.	Negative
5	I found the various functions in this system were well integrated.	Positive
6	I thought there was too much inconsistency in this system.	Negative
7	I would imagine that most people would learn to use this system very quickly.	Positive
8	I found the system very cumbersome to use.	Negative
9	I felt very confident using the system.	Positive
10	I needed to learn a lot of things before I could get going with this system.	Negative

Scoring SUS Correctly

The scoring algorithm trips up many practitioners. Follow these steps exactly:

For odd-numbered items (positive): subtract 1 from the raw score.
For even-numbered items (negative): subtract the raw score from 5.
Sum all 10 converted scores, then multiply by 2.5.

The result is a score from 0 to 100. This is not a percentage. Treating a score of 72 as “72% satisfaction” is one of the most common misinterpretations of SUS.
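The scoring steps above can be sketched as a small function. This is an illustrative implementation, not an official one: the function name `sus_score` and the input validation are assumptions, and it expects one participant's ten responses in item order.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Score one participant's SUS questionnaire (items 1-10, each rated 1-5)."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")

    total = 0
    for item_number, raw in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += raw - 1   # odd, positively worded item
        else:
            total += 5 - raw   # even, negatively worded item
    return total * 2.5         # scale the 0-40 sum to 0-100


# Strongly agree with every positive item, strongly disagree with every negative one.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
# Neutral (3) on every item.
print(sus_score([3] * 10))  # 50.0
```

Score each participant separately and then average the per-participant scores; a missing or out-of-range answer is rejected here rather than silently imputed.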

Interpreting SUS Scores

Jeff Sauro and James Lewis (2012) mapped SUS scores to percentile ranks and letter grades using data from 500+ studies:

SUS Score	Grade	Adjective	Percentile range
78.9–100	A/A+	Excellent / Best Imaginable	85th–100th (top 15%)
72.6–78.8	B	Good	65th–84th
62.7–72.5	C	OK	35th–64th
51.7–62.6	D	Poor	15th–34th
0–51.6	F	Awful / Worst Imaginable	0–14th (bottom 15%)

The average SUS score across the products in that benchmark data is approximately 68, which puts a score of 68 at the 50th percentile. A score above 80.3 lands in the A band and is reliably perceived as “good” by most users.
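As a small illustration, the grade bands in the table above can be turned into a lookup. The function name `sus_grade` is hypothetical, and the sketch collapses the top band to a plain "A" rather than distinguishing A from A+.

```python
def sus_grade(score: float) -> str:
    """Map a 0-100 SUS score to the letter-grade bands used in the table."""
    if not 0 <= score <= 100:
        raise ValueError("SUS scores range from 0 to 100")
    # Lower bound of each band, highest band first.
    bands = [(78.9, "A"), (72.6, "B"), (62.7, "C"), (51.7, "D")]
    for lower_bound, grade in bands:
        if score >= lower_bound:
            return grade
    return "F"


print(sus_grade(68))    # C
print(sus_grade(80.3))  # A
```

Grade the mean score of a study, not individual participants; a single respondent's score is too noisy to grade.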

SUS Limitations

Ten items feel long for post-task micro-surveys or mobile contexts.
The alternating polarity confuses some participants, leading to response errors that inflate or deflate scores — especially in low-literacy or non-native-language settings.
SUS was designed for overall system usability, not for specific features or flows.

UMUX-Lite: A Two-Item Alternative

The Usability Metric for User Experience (UMUX) was developed in 2010 by Finstad to align with the ISO 9241-11 definition of usability. UMUX-Lite is a two-item version proposed by Lewis, Utesch, and Maher (2013). It trades statistical richness for speed.

The Two Items

Both items use a 7-point Likert scale (1 = Strongly disagree, 7 = Strongly agree):

“[This system’s] capabilities meet my requirements.” (positive)
“[This system] is easy to use.” (positive)

Scoring UMUX-Lite

First, sum both items. That raw score ranges from 2 to 14.

To convert to the same 0–100 scale as SUS, use this formula:

UMUX-Lite score = ((raw score - 2) / 12) * 100

A correction equation also lets you equate UMUX-Lite scores to SUS equivalents (Lewis et al., 2013):

SUS-equivalent = 0.65 * UMUX-Lite + 22.9

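Use this correction carefully. Because it is a linear regression fit, converted values can only fall between 22.9 and 87.9, so it cannot represent SUS scores at the extremes, and it is least trustworthy outside the range of the original sample.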
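Both formulas above are simple enough to sketch directly. The function and parameter names below are hypothetical; the arithmetic follows the two formulas as given in this lesson.

```python
def umux_lite_score(capabilities: int, ease: int) -> float:
    """Convert the two 1-7 UMUX-Lite ratings to a 0-100 score."""
    for value in (capabilities, ease):
        if value not in range(1, 8):
            raise ValueError("each UMUX-Lite item is rated from 1 to 7")
    raw = capabilities + ease          # raw score ranges from 2 to 14
    return (raw - 2) / 12 * 100


def sus_equivalent(umux_lite: float) -> float:
    """Regression-based SUS equivalent of a 0-100 UMUX-Lite score."""
    return 0.65 * umux_lite + 22.9


print(umux_lite_score(4, 4))  # 50.0
# The converted scale is compressed compared with the full 0-100 SUS range.
print(f"{sus_equivalent(0):.1f} to {sus_equivalent(100):.1f}")  # 22.9 to 87.9
```

Keeping the two steps as separate functions makes it explicit in a report whether a number is a plain UMUX-Lite score or a SUS-equivalent estimate.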

When to Choose UMUX-Lite Over SUS

Situation	Choose
Full usability benchmark study	SUS
In-product micro-survey (post-task)	UMUX-Lite
Longitudinal pulse with minimal friction	UMUX-Lite
Comparing to published benchmarks	SUS
Non-native language audiences (simpler wording)	UMUX-Lite
Need ISO 9241-11 alignment in documentation	UMUX-Lite

Both UMUX-Lite items are positive, so the alternating-polarity confusion that affects SUS is not an issue here. Keep in mind that two items produce less precise estimates. Confidence intervals are wider, and detecting small changes requires larger samples.

SEQ: Single Ease Question

The Single Ease Question, evaluated and popularized by Jeff Sauro and Joseph Dumas (2009), is a task-level difficulty rating. SUS and UMUX-Lite assess overall perceived usability after a whole session. SEQ is different: you administer it immediately after each task.

The Item

A single 7-point scale:

“Overall, how would you rate the difficulty of this task?”

Anchors: 1 = Very difficult, 7 = Very easy.

Scoring and Benchmarks

Report the mean SEQ on the 1–7 scale. Sauro and Lewis’s normative data from 500+ tasks across diverse products places the average SEQ at 5.5. A score below 5.0 warrants serious investigation. A score above 6.0 is strong.

Because SEQ is per-task, you can:

Rank tasks by difficulty to prioritize redesign work.
Track SEQ for a specific flow across sprints.
Correlate SEQ with behavioral metrics (completion rate, time) to validate the measure.
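Ranking tasks by mean SEQ, the first use listed above, might look like the sketch below. The task names and ratings are invented for illustration, and the 5.0 cut-off is the "warrants investigation" threshold from this section.

```python
from statistics import mean


def rank_tasks_by_seq(ratings_by_task: dict[str, list[int]]) -> list[tuple[str, float]]:
    """Return (task, mean SEQ) pairs, hardest task (lowest mean) first."""
    means = {task: mean(ratings) for task, ratings in ratings_by_task.items()}
    return sorted(means.items(), key=lambda pair: pair[1])


# Hypothetical ratings on the 1-7 scale (1 = very difficult, 7 = very easy).
ratings = {
    "create invoice": [6, 7, 6, 5, 7],
    "export report": [4, 3, 5, 4, 4],
    "invite teammate": [6, 5, 6, 6, 5],
}

for task, seq in rank_tasks_by_seq(ratings):
    flag = "investigate" if seq < 5.0 else "ok"
    print(f"{task}: {seq:.1f} ({flag})")
```

With only five ratings per task, as in this toy data, the ranking is a prompt for where to look, not a benchmark.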

SEQ vs. NASA-TLX and After-Scenario Questionnaire

NASA-TLX measures cognitive workload across six dimensions. It is appropriate for safety-critical or high-complexity domains, but overkill for most product UX work.
ASQ (After-Scenario Questionnaire) uses 3 items per task, covering satisfaction with ease of completion, time taken, and support information. It is more granular than SEQ but adds survey fatigue in longer sessions.
SEQ is the default choice for most product UX research: one item, minimal fatigue, and normed data available.

Combining Instruments: A Practical Protocol

No single instrument gives a complete picture. A mature usability measurement protocol triangulates several sources:

Behavioral metrics — task-success rate, time-on-task, error count (collected from observation or logs).
SEQ — administered immediately after each task, before task-specific impressions fade from memory.
SUS or UMUX-Lite — administered at the end of the session, after the full experience.
One open-ended question — “What, if anything, would you change about this experience?” This surfaces the why behind the scores.

This sequence respects a cognitive reality: task-specific impressions fade quickly, while overall impressions persist.
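One way to keep behavioral and attitudinal data side by side is to record them in the same structure per task attempt. The sketch below is only an illustration: the `TaskAttempt` fields, the function name, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskAttempt:
    task: str
    completed: bool   # behavioral: did the participant finish the task?
    seconds: float    # behavioral: time on task
    seq: int          # attitudinal: SEQ rating, 1 (very difficult) to 7 (very easy)


def summarize_task(attempts: list[TaskAttempt], task: str) -> dict[str, float]:
    """Success rate, mean time, and mean SEQ for one task."""
    rows = [a for a in attempts if a.task == task]
    if not rows:
        raise ValueError(f"no attempts recorded for task {task!r}")
    return {
        "success_rate": sum(a.completed for a in rows) / len(rows),
        "mean_seconds": mean(a.seconds for a in rows),
        "mean_seq": mean(a.seq for a in rows),
    }


# Hypothetical session data for a single task.
attempts = [
    TaskAttempt("checkout", True, 40.0, 6),
    TaskAttempt("checkout", True, 55.0, 6),
    TaskAttempt("checkout", False, 120.0, 2),
    TaskAttempt("checkout", True, 45.0, 5),
]
print(summarize_task(attempts, "checkout"))
```

A task with a high mean SEQ but a low success rate (or the reverse) is exactly the say/do gap this protocol is meant to expose.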

Do

Administer SEQ immediately after each task is completed or abandoned, before moving on.
Administer SUS or UMUX-Lite at the end of the entire session, not after individual tasks.
Report scores with confidence intervals, not just means.
Use at least 40 participants for quantitative benchmarking at 95% confidence.

Don't

Administer SUS after individual tasks: it measures the whole system.
Mix SUS and UMUX-Lite scores in the same trend line without converting to a common scale.
Apply the 5-user rule to quantitative studies: five users find problems but cannot produce reliable mean scores.
Write custom satisfaction questions in place of validated instruments and claim they are equivalent.

Sample Sizing: The Most Common Mistake

The “5 users” heuristic applies only to qualitative, problem-finding research. For quantitative benchmarking with standardized surveys, the numbers are very different:

40+ participants to achieve a 95% confidence interval of roughly ±7 SUS points.
200+ participants for a ±3-point confidence interval — appropriate for major competitive benchmarks or accessibility audits.
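For longitudinal tracking, detecting a meaningful change (e.g., 5 SUS points) at 80% power with α = 0.05 requires roughly 250 participants per wave when each wave is an independent sample and the standard deviation is about 20 points; re-testing the same participants in each wave typically needs considerably fewer.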

Reporting a SUS score of 74 from 8 participants as a “benchmark” is false precision. The confidence interval around that score spans roughly ±20 points.
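The arithmetic behind these sample-size figures can be sketched with the standard normal-approximation formulas. Everything here rests on assumptions: a SUS standard deviation of about 20 points (set your own from pilot data), independent samples in each wave, and a z-based interval, which understates the width for very small samples where a t-based interval is appropriate. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def margin_of_error(n: int, sd: float = 20.0, confidence: float = 0.95) -> float:
    """Approximate half-width of a confidence interval around a mean score."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / sqrt(n)


def n_per_wave(difference: float, sd: float = 20.0,
               alpha: float = 0.05, power: float = 0.80) -> int:
    """Participants per wave to detect `difference` between two independent waves."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / difference) ** 2)


for n in (40, 200):
    print(f"n={n}: about ±{margin_of_error(n):.1f} points")

print("per wave for a 5-point change:", n_per_wave(5))
```

The required sample grows with the square of the standard deviation and shrinks with the square of the difference you want to detect, so halving the detectable change roughly quadruples the sample.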

Reporting and Communicating Scores

Scores gain authority when you frame them correctly for stakeholders:

Always name the instrument and version — “SUS (10-item, Brooke 1986),” not just “usability score.”
Include the sample size and confidence interval — a score of 76 ± 4 (n=52) is credible; “76 from our test” is not.
Anchor to benchmarks — “Our score of 76 places us at the 72nd percentile of enterprise software.”
Show trend, not snapshots — a chart of SUS scores across six sprints communicates progress more powerfully than a single number.
Pair with a behavioral metric — “SUS improved from 68 to 76 (n=45) after the navigation redesign; task completion on the same flows rose from 61% to 79%.” This pairing neutralizes the “but it’s just a survey” objection from engineering and product peers.
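A reporting helper can enforce the first two habits above by never emitting a score without its instrument name, sample size, and interval. This is a minimal sketch with a hypothetical function name and invented scores; it uses a normal approximation for the interval, which is reasonable at the 40+ sample sizes recommended in this lesson but too narrow for small samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def report_line(instrument: str, scores: list[float], confidence: float = 0.95) -> str:
    """Format a mean score with its confidence interval and sample size."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate an interval")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(scores) / sqrt(n)
    return f"{instrument}: {mean(scores):.1f} ± {half_width:.1f} (n={n}, {confidence:.0%} CI)"


# Hypothetical per-participant SUS scores from a 40-person study.
scores = [70.0, 80.0] * 20
print(report_line("SUS (10-item, Brooke 1986)", scores))
```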

Modern Integration: Continuous Benchmarking

Running SUS once a year is the outdated pattern. Modern teams build surveys into their product analytics pipeline:

Intercept surveys trigger UMUX-Lite for a random sample of users completing a key flow, feeding into a dashboard alongside behavioral funnel data.
Release-gated benchmarks run a full SUS study (40+ recruited participants) before and after major launches, building a defensible baseline for product decisions.
Competitive benchmarking uses the same instrument on competitor products with recruited participants who use both — yielding a relative position that resonates with leadership.
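For the intercept-survey pattern, one possible way to pick a sample is deterministic hash-based bucketing rather than a fresh random draw, so the same user gets the same decision on every visit. This is a swapped-in technique for illustration, not something the lesson prescribes; the function name, the `flow` key, and the 5% default rate are assumptions.

```python
import hashlib


def in_survey_sample(user_id: str, flow: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of users for the intercept survey."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(f"{flow}:{user_id}".encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical usage: decide whether to show UMUX-Lite after a key flow.
if in_survey_sample("user-1234", "checkout"):
    print("show UMUX-Lite intercept")
else:
    print("skip survey for this user")
```

Including the flow name in the hash means a user sampled for one flow is not automatically sampled for every other flow, which spreads survey load across the user base.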

The key discipline is instrument consistency: switching from SUS to UMUX-Lite mid-program breaks your trend line. Pick one primary instrument per program and stick with it.