Standardized Usability Surveys: SUS, UMUX-Lite & SEQ
Validated psychometric instruments give design teams comparable, credible usability scores — but only when administered, scored, and interpreted correctly.
Standardized usability surveys are among the few UX measurement tools that have been psychometrically validated: researchers have tested their reliability and validity across many independent studies. Used correctly, SUS, UMUX-Lite, and SEQ yield scores you can legitimately compare across releases, products, and competitors. Used carelessly, they produce numbers that look rigorous but mislead.
This lesson covers how each instrument works, when to pick one over another, how to score them properly, and how to combine them with behavioral data for an honest picture of usability.
Why Standardized Instruments Exist
Every team is tempted to write their own satisfaction questions. The problem is that homegrown scales have unknown reliability and no external benchmarks. A score of 72 on a custom survey is meaningless without knowing the distribution that score came from.
Standardized instruments solve this in three ways:
- Validated psychometric properties — internal consistency (Cronbach’s alpha is typically 0.85–0.92 for SUS), test-retest reliability, and construct validity, all confirmed across many studies.
- External benchmarks — decades of published data let you say “our score is at the 60th percentile of enterprise software” instead of just “users seem fairly satisfied.”
- Comparability over time — a consistent instrument lets you detect real change across sprints and releases, not just noise from wording variation.
The System Usability Scale (SUS)
John Brooke introduced SUS in 1986 as a “quick and dirty” usability scale at Digital Equipment Corporation. It has since become the most widely cited usability questionnaire in the world, with thousands of published studies providing solid norms.
The 10 Items
SUS uses 10 statements. Participants rate each one on a 5-point Likert scale (1 = Strongly disagree, 5 = Strongly agree). The items alternate between positive and negative framing to reduce acquiescence bias — the tendency to agree with whatever is asked.
| # | Statement | Polarity |
|---|---|---|
| 1 | I think that I would like to use this system frequently. | Positive |
| 2 | I found the system unnecessarily complex. | Negative |
| 3 | I thought the system was easy to use. | Positive |
| 4 | I think that I would need the support of a technical person to be able to use this system. | Negative |
| 5 | I found the various functions in this system were well integrated. | Positive |
| 6 | I thought there was too much inconsistency in this system. | Negative |
| 7 | I would imagine that most people would learn to use this system very quickly. | Positive |
| 8 | I found the system very cumbersome to use. | Negative |
| 9 | I felt very confident using the system. | Positive |
| 10 | I needed to learn a lot of things before I could get going with this system. | Negative |
Scoring SUS Correctly
The scoring algorithm trips up many practitioners. Follow these steps exactly:
- For odd-numbered items (positive): subtract 1 from the raw score.
- For even-numbered items (negative): subtract the raw score from 5.
- Sum all 10 converted scores, then multiply by 2.5.
The result is a score from 0 to 100. This is not a percentage. Treating a score of 72 as “72% satisfaction” is one of the most common misinterpretations of SUS.
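The three scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not an official implementation: the function name `sus_score` and the input format (a list of ten raw ratings in item order) are assumptions made for the example.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Score one participant's SUS questionnaire on the 0-100 scale.

    `responses` holds the ten raw ratings (1-5) in item order (item 1 first).
    """
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")

    total = 0
    for item_number, raw in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd-numbered items are positively worded: subtract 1.
            total += raw - 1
        else:
            # Even-numbered items are negatively worded: subtract from 5.
            total += 5 - raw
    # The converted sum ranges from 0 to 40; 2.5 rescales it to 0-100.
    return total * 2.5


print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 3]))  # 80.0
```

Score each participant separately and then average the per-participant scores; averaging raw item ratings first makes it easy to lose track of the polarity correction.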
Interpreting SUS Scores
Jeff Sauro and James Lewis (2012) mapped SUS scores to percentile ranks and letter grades using data from 500+ studies:
| SUS Score | Grade | Adjective | Percentile range |
|---|---|---|---|
| 78.9–100 | A/A+ | Excellent / Best Imaginable | 85th–100th |
| 72.6–78.8 | B | Good | 65th–84th |
| 62.7–72.5 | C | OK | 35th–64th |
| 51.7–62.6 | D | Poor | 15th–34th |
| 0–51.6 | F | Awful / Worst Imaginable | 0–14th |
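The average SUS score across the benchmark data is approximately 68, which sits at the 50th percentile: anything below it is below average. Sauro identifies about 80.3 as the point above which users become more likely to recommend the product, which makes it a useful target for a standout experience.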
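The grade bands in the table can be turned into a simple lookup. The sketch below is an assumption-laden illustration: `sus_grade` is a hypothetical helper, the cut-offs are copied from the table above, and the "A" label covers the whole 78.9–100 band.

```python
def sus_grade(mean_score: float) -> str:
    """Map a mean SUS score (0-100) to the letter grade from the table."""
    if not 0 <= mean_score <= 100:
        raise ValueError("SUS scores range from 0 to 100")

    # (lower bound of the band, grade), highest band first.
    bands = [(78.9, "A"), (72.6, "B"), (62.7, "C"), (51.7, "D")]
    for lower_bound, grade in bands:
        if mean_score >= lower_bound:
            return grade
    return "F"


print(sus_grade(68))  # C
```

The bands describe study-level mean scores, so the lookup is best applied to the mean of a sample rather than to a single participant's score.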
SUS Limitations
- Ten items can feel long for post-task micro-surveys or mobile contexts.
- The alternating polarity confuses some participants, leading to response errors that inflate or deflate scores — especially in low-literacy or non-native-language settings.
- SUS was designed for overall system usability, not for specific features or flows.
UMUX-Lite: A Two-Item Alternative
The Usability Metric for User Experience (UMUX) was developed in 2010 by Finstad to align with the ISO 9241-11 definition of usability. UMUX-Lite is a two-item version proposed by Lewis, Utesch, and Maher (2013). It trades statistical richness for speed.
The Two Items
Both items use a 7-point Likert scale (1 = Strongly disagree, 7 = Strongly agree):
- “[This system’s] capabilities meet my requirements.” (positive)
- “[This system] is easy to use.” (positive)
Scoring UMUX-Lite
First, sum both items. That raw score ranges from 2 to 14.
To convert to the same 0–100 scale as SUS, use this formula:
UMUX-Lite score = ((raw score - 2) / 12) * 100
A correction equation also lets you equate UMUX-Lite scores to SUS equivalents (Lewis et al., 2013):
SUS-equivalent = 0.65 * UMUX-Lite + 22.9
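Apply this correction with care. Because UMUX-Lite ranges from 0 to 100, the SUS-equivalent can only fall between 22.9 and 87.9, so the regression cannot reproduce very low or very high SUS scores, and it may not hold for products unlike those in the original sample.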
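Both formulas above are simple enough to sketch directly. The function names `umux_lite_score` and `sus_equivalent` and their parameter names are hypothetical; the arithmetic follows the two formulas as given in this section.

```python
def umux_lite_score(capabilities: int, ease: int) -> float:
    """Convert the two 1-7 UMUX-Lite ratings to a 0-100 score."""
    for rating in (capabilities, ease):
        if rating not in range(1, 8):
            raise ValueError("each rating must be an integer from 1 to 7")
    raw = capabilities + ease          # raw sum ranges from 2 to 14
    return (raw - 2) / 12 * 100


def sus_equivalent(umux_lite: float) -> float:
    """Apply the regression from this section to estimate a SUS score."""
    return 0.65 * umux_lite + 22.9


score = umux_lite_score(capabilities=6, ease=5)
print(score)  # 75.0
print(f"SUS-equivalent: {sus_equivalent(score):.2f}")
```

As with SUS, compute the score per participant and then average across the sample.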
When to Choose UMUX-Lite Over SUS
| Situation | Choose |
|---|---|
| Full usability benchmark study | SUS |
| In-product micro-survey (post-task) | UMUX-Lite |
| Longitudinal pulse with minimal friction | UMUX-Lite |
| Comparing to published benchmarks | SUS |
| Non-native language audiences (simpler wording) | UMUX-Lite |
| Need ISO 9241-11 alignment in documentation | UMUX-Lite |
Both UMUX-Lite items are positive, so the alternating-polarity confusion that affects SUS is not an issue here. Keep in mind that two items produce less precise estimates. Confidence intervals are wider, and detecting small changes requires larger samples.
SEQ: Single Ease Question
The Single Ease Question is a task-level difficulty rating, evaluated by Sauro and Dumas (2009) and widely promoted by Jeff Sauro and James Lewis. SUS and UMUX-Lite assess overall perceived usability after a whole session. SEQ is different: you administer it immediately after each task.
The Item
A single 7-point scale:
“Overall, how would you rate the difficulty of this task?”
Anchors: 1 = Very difficult, 7 = Very easy.
Scoring and Benchmarks
Report the mean SEQ on the 1–7 scale. Sauro and Lewis’s normative data from 500+ tasks across diverse products places the average SEQ at 5.5. A score below 5.0 warrants serious investigation. A score above 6.0 is strong.
Because SEQ is per-task, you can:
- Rank tasks by difficulty to prioritize redesign work.
- Track SEQ for a specific flow across sprints.
- Correlate SEQ with behavioral metrics (completion rate, time) to validate the measure.
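Ranking tasks by difficulty, the first use listed above, amounts to grouping ratings by task and sorting by the mean. The following is a rough sketch; `rank_tasks_by_seq`, the `(task, rating)` input format, and the sample task names are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, List, Tuple


def rank_tasks_by_seq(
    ratings: Iterable[Tuple[str, int]],
) -> List[Tuple[str, float, int]]:
    """Return (task, mean SEQ, n) rows sorted from hardest to easiest."""
    by_task = defaultdict(list)
    for task, rating in ratings:
        if rating not in range(1, 8):
            raise ValueError("SEQ ratings must be integers from 1 to 7")
        by_task[task].append(rating)

    rows = [(task, mean(values), len(values)) for task, values in by_task.items()]
    # Lower mean SEQ means the task was perceived as more difficult.
    return sorted(rows, key=lambda row: row[1])


sample = [
    ("checkout", 4), ("checkout", 5),
    ("search", 6), ("search", 7),
    ("signup", 5), ("signup", 6),
]
for task, mean_seq, n in rank_tasks_by_seq(sample):
    print(f"{task}: mean SEQ {mean_seq:.1f} (n={n})")
```

Keeping `n` in each row makes it harder to over-read a mean that rests on only a handful of ratings.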
SEQ vs. NASA-TLX and After-Scenario Questionnaire
- NASA-TLX measures cognitive workload across six dimensions. It is appropriate for safety-critical or high-complexity domains, but overkill for most product UX work.
- ASQ (After-Scenario Questionnaire) uses 3 items per task: satisfaction with ease of completion, with the time taken, and with the support information. It is more granular than SEQ but adds survey fatigue in longer sessions.
- SEQ is the default choice for most product UX research: one item, minimal fatigue, and normed data available.
Combining Instruments: A Practical Protocol
No single instrument gives a complete picture. A mature usability measurement protocol triangulates several sources:
- Behavioral metrics — task-success rate, time-on-task, error count (collected from observation or logs).
- SEQ — administered immediately after each task, before task-specific impressions fade from memory.
- SUS or UMUX-Lite — administered at the end of the session, after the full experience.
- One open-ended question — “What, if anything, would you change about this experience?” This surfaces the why behind the scores.
This sequence respects a cognitive reality: task-specific impressions fade quickly, while overall impressions persist.
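One way to keep the four sources together is a per-session record that mirrors the order of collection. The sketch below is purely illustrative: `TaskResult`, `SessionRecord`, and every field name are hypothetical, not part of any survey standard or tool.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class TaskResult:
    task: str
    completed: bool      # behavioral: task success
    seconds: float       # behavioral: time on task
    errors: int          # behavioral: error count
    seq: int             # SEQ rating (1-7), collected right after the task

    def __post_init__(self) -> None:
        if self.seq not in range(1, 8):
            raise ValueError("SEQ must be an integer from 1 to 7")


@dataclass
class SessionRecord:
    participant_id: str
    tasks: List[TaskResult] = field(default_factory=list)
    sus_responses: Optional[List[int]] = None  # ten 1-5 ratings, end of session
    open_comment: str = ""                     # answer to the open-ended question

    def completion_rate(self) -> float:
        if not self.tasks:
            raise ValueError("no tasks recorded")
        return sum(t.completed for t in self.tasks) / len(self.tasks)

    def mean_seq(self) -> float:
        if not self.tasks:
            raise ValueError("no tasks recorded")
        return mean(t.seq for t in self.tasks)


record = SessionRecord("p01")
record.tasks.append(TaskResult("checkout", True, 42.0, 0, 6))
record.tasks.append(TaskResult("returns", False, 120.5, 3, 2))
print(record.completion_rate())  # 0.5
```

Storing the SEQ rating on the task and the SUS responses on the session makes the intended timing of each instrument explicit in the data model.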
Do
- Administer SEQ immediately after each task is completed or abandoned, before moving on.
- Administer SUS or UMUX-Lite at the end of the entire session, not after individual tasks.
- Report scores with confidence intervals, not just means.
- Use at least 40 participants for quantitative benchmarking at 95% confidence.
Don't
- Administer SUS after individual tasks; it measures the whole system.
- Mix SUS and UMUX-Lite scores in the same trend line without converting to a common scale.
- Apply the 5-user rule to quantitative studies: five users find problems but cannot produce reliable mean scores.
- Write custom satisfaction questions in place of validated instruments and claim they are equivalent.
Sample Sizing: The Most Common Mistake
The “5 users” heuristic applies only to qualitative, problem-finding research. For quantitative benchmarking with standardized surveys, the numbers are very different:
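- 40+ participants to achieve a 95% confidence interval of roughly ±7 SUS points, which corresponds to an assumed standard deviation of about 21 points.
- 200+ participants for a ±3-point confidence interval, appropriate for major competitive benchmarks or accessibility audits.
- For longitudinal tracking with a fresh sample each wave, detecting a 5-point SUS change at 80% power with α = 0.05 requires roughly 280 participants per wave at a standard deviation of about 21. About 68 per wave is enough only for a change of roughly 10 points. A within-subjects design, where the same participants rate both versions, typically needs fewer.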
Reporting a SUS score of 74 from 8 participants as a “benchmark” is false precision. With a standard deviation near 21, the 95% confidence interval around that score spans roughly ±18 points.
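The arithmetic behind figures like these can be sketched with the standard library. Everything here is a planning approximation under stated assumptions: the standard deviation of 21 is an assumed value, the helper names are hypothetical, the normal distribution is used in place of the t distribution (which understates the margin for very small samples), and the power calculation treats the two waves as independent samples.

```python
from math import ceil, sqrt
from statistics import NormalDist

ASSUMED_SD = 21.0  # assumption: replace with the SD observed in your own data


def margin_of_error(n: int, sd: float = ASSUMED_SD, confidence: float = 0.95) -> float:
    """Approximate half-width of the confidence interval around a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / sqrt(n)


def n_per_wave(min_change: float, sd: float = ASSUMED_SD,
               alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants per wave to detect `min_change` between
    two independent waves with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / min_change) ** 2)


for n in (40, 200):
    print(f"n={n}: about ±{margin_of_error(n):.1f} points")
print(f"participants per wave for a 5-point change: {n_per_wave(5)}")
```

The result is sensitive to the assumed standard deviation and to the study design: a lower observed variance or a paired design reduces the required sample, so treat the output as a starting point for planning rather than a fixed rule.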
Reporting and Communicating Scores
Scores gain authority when you frame them correctly for stakeholders:
- Always name the instrument and version — “SUS (10-item, Brooke 1986),” not just “usability score.”
- Include the sample size and confidence interval — a score of 76 ± 4 (n=52) is credible; “76 from our test” is not.
- Anchor to benchmarks — “Our score of 76 places us at the 72nd percentile of enterprise software.”
- Show trend, not snapshots — a chart of SUS scores across six sprints communicates progress more powerfully than a single number.
- Pair with a behavioral metric — “SUS improved from 68 to 76 (n=45) after the navigation redesign; task completion on the same flows rose from 61% to 79%.” This pairing neutralizes the “but it’s just a survey” objection from engineering and product peers.
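A small helper can enforce the first two reporting habits above by always emitting the instrument name, sample size, and interval together. This is a sketch only: `summarize_scores` and its default label are hypothetical, and the interval uses a normal approximation that is reasonable for samples of about 30 or more but too narrow for very small ones.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import Sequence


def summarize_scores(scores: Sequence[float],
                     instrument: str = "SUS (10-item, Brooke 1986)",
                     confidence: float = 0.95) -> str:
    """Format per-participant scores as 'instrument: mean ± margin (n, CI)'."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate an interval")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(scores) / sqrt(n)   # normal approximation
    return (f"{instrument}: {mean(scores):.1f} ± {half_width:.1f} "
            f"(n={n}, {confidence:.0%} CI)")


print(summarize_scores([70, 80] * 20))
```

Passing the per-participant scores, rather than a precomputed mean, keeps the interval tied to the variability actually observed in the study.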
Modern Integration: Continuous Benchmarking
Running SUS once a year is the outdated pattern. Modern teams build surveys into their product analytics pipeline:
- Intercept surveys trigger UMUX-Lite for a random sample of users completing a key flow, feeding into a dashboard alongside behavioral funnel data.
- Release-gated benchmarks run a full SUS study (40+ recruited participants) before and after major launches, building a defensible baseline for product decisions.
- Competitive benchmarking uses the same instrument on competitor products with recruited participants who use both — yielding a relative position that resonates with leadership.
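For the intercept-survey pattern, one possible way to pick the sample is to hash the user and flow identifiers into a stable bucket, so a given user gets the same decision every time they complete the flow. This is a swapped-in technique for illustration, not something the pattern requires: `in_survey_sample`, the identifier format, and the 10% rate are all assumptions, and a plain random draw per event would also work.

```python
import hashlib


def in_survey_sample(user_id: str, flow: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of users for a flow's survey."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(f"{flow}:{user_id}".encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a bucket in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


selected = sum(in_survey_sample(f"user-{i}", "checkout", 0.10) for i in range(10_000))
print(f"{selected} of 10000 simulated users would see the survey")
```

Including the flow name in the hash means a user selected for one flow's survey is not automatically selected for every other flow, which spreads the survey burden across the user base.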
The key discipline is instrument consistency: switching from SUS to UMUX-Lite mid-program breaks your trend line. Pick one primary instrument per program and stick with it.