The Simplest Usability Metric
In addition to being expensive, collecting usability metrics interferes with the goal of gathering qualitative insights to drive design decisions. As a compromise, you can measure users' ability to complete tasks. Success rates are easy to understand and represent the UX bottom line.
Numbers are powerful (even though they are often misused in user experience). They offer a simple way to communicate usability findings to a general audience. Saying, for example, that "Amazon.com complies with 72% of the e-commerce usability guidelines" is a much more specific statement than "Amazon.com has great usability, but it doesn't do everything right."
Metrics are great for assessing long-term progress on a project and for setting goals. They are an integral part of a benchmarking program and can be used to assess if the money you invested in your redesign project was well spent.
Unfortunately, there is a conflict between the need for numbers and the need for insight. Although numbers can help you communicate usability status and the need for improvements, the true purpose of a user experience practice is to set the design direction, not to generate numbers for reports and presentations. Thus, some of the best research methods for usability (and, in particular, qualitative usability testing) conflict with the demands of metrics collection.
The best usability tests involve frequent small tests, rather than a few big ones. You gain maximum insight by working with 4–5 users and asking them to think out loud during the test. As soon as users identify a problem, you fix it immediately (rather than continue testing to see how bad it is). You then test again to see if the "fix" solved the problem.
Although small tests give you ample insight into how to improve a design, such tests do not generate the sufficiently tight confidence intervals that traditional metrics require. Think-aloud protocols are the best way to understand users' thinking and thus how to design for them, but the extra time it takes users to verbalize their thoughts contaminates task-time measures. Plus, qualitative tests often involve small design tweaks from one session to the next and, because of that, metrics collected in such tests are rarely measuring the same thing.
Thus, the best usability methodology is the one least suited for generating detailed numbers.
One of the more common metrics used in user experience is task success or completion. This is a very simple binary metric. When we run a study with multiple users, we usually report the success (or task-completion) rate: the percentage of users who were able to complete a task in a study.
Like most metrics, it is fairly coarse — it says nothing about why users fail or how well they perform the tasks they did complete.
Nonetheless, success rates are easy to collect and a very telling statistic. After all, if users can't accomplish their target task, all else is irrelevant. User success is the bottom line of usability.
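As a minimal sketch, the success rate described above is just the proportion of users who completed the task. The outcomes below are hypothetical, not from any real study:

```python
# Hypothetical study: 1 = user completed the task, 0 = user failed.
outcomes = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

success_rate = sum(outcomes) / len(outcomes)
print(f"Success rate: {success_rate:.0%}")  # prints: Success rate: 70%
```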
Levels of Success
Success rates are easy to measure, with one major exception: How do we account for cases of partial success? If users can accomplish part of a task, but fail other parts, how should we score them?
Let's say, for example, that the users' task is to order twelve yellow roses to be delivered to their mothers on their birthday. True task success would mean just that: Mom receives a dozen roses on her birthday. If a test user leaves the site in a state where this event will occur, we can certainly score the task as a success. If the user fails to place any order, we can just as easily determine the task a failure.
But there are other possibilities as well. For example, a user might:
- order twelve yellow tulips, twenty-four yellow roses, or some other deviant bouquet
- fail to specify a shipping address, and thus have the flowers delivered to their own billing address
- specify the correct address, but the wrong date
- do everything perfectly except forget to specify a gift message to enclose with the shipment, so that mom gets the flowers but has no idea who they are from
Each of these cases constitutes some degree of failure.
If a user does not perform a task as specified, you could be strict and score it as a failure. It's certainly a simple model: Users either do everything correctly or they fail. No middle ground. Success is success, without qualification.
However, we sometimes grant partial credit for a partially successful task. It can seem unreasonable to give the same score (zero) to both users who did nothing and those who successfully completed much of the task. How to score partial success depends on the magnitude of user error.
In the flower example, we might define several levels of success:
- complete success: the user places the order with no error, exactly as specified
- success with one minor issue: the user places the order but omits the gift message or orders the wrong flowers
- success with a major issue: the user places the order but enters the wrong date or delivery address
- failure: the user is not able to place the order
Of course, the precise levels of success will depend on the task and on your and your users' particular needs. (For example, if a survey determined that most mothers would consider it a major offense to get tulips instead of roses, you might reclassify that error accordingly.)
Reporting Levels of Success
To report levels of success, you simply report the percentage of users who were at a given level. So, for example, if, out of 100 users, 35 completed the task with a minor issue, you would say that 35% of your users were able to complete the task with a minor issue. As with any metric, you should report the confidence interval for that number.
[Table omitted: percentage of users at each success level. In the original table, the ranges represent 95% confidence intervals calculated using the adjusted-Wald method.]
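The adjusted-Wald interval mentioned above can be computed directly. Here is a minimal Python sketch, reusing the hypothetical 35-of-100 figure from the text; `adjusted_wald_ci` is an illustrative helper, not a library function:

```python
import math

def adjusted_wald_ci(successes, n, z=1.96):
    """95% adjusted-Wald confidence interval for a proportion.

    Adds z^2/2 pseudo-successes and z^2 pseudo-trials before
    applying the standard Wald formula (Agresti & Coull, 1998).
    """
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# Example: 35 of 100 users completed the task with a minor issue.
low, high = adjusted_wald_ci(35, 100)
print(f"35% (95% CI: {low:.0%}–{high:.0%})")  # prints: 35% (95% CI: 26%–45%)
```

The adjustment pulls the estimate slightly toward 50%, which gives intervals with better coverage than the plain Wald formula for the small samples typical of usability studies.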
Note that this method simply amounts to using multiple metrics for success instead of just one — each level of success is a separate metric.
You can also use other metrics such as number of errors; for example, you could define different error types (e.g., wrong flowers, wrong shipping address) and track the number of people who made each of these errors. Doing so may actually give you a more nuanced picture than using levels of success because you might be able to say precisely which of the different errors is more common and, thus, focus on fixing that one.
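Tracking error types as described above amounts to counting, per error, how many users made it. A minimal sketch with hypothetical observations from the flower-ordering task:

```python
from collections import Counter

# Hypothetical per-user error observations; an empty list means no errors.
errors_per_user = [
    ["wrong flowers"],
    [],
    ["wrong shipping address", "no gift message"],
    ["wrong flowers"],
    [],
]

counts = Counter(e for user_errors in errors_per_user for e in user_errors)
n = len(errors_per_user)
for error, count in counts.most_common():
    print(f"{error}: {count}/{n} users ({count / n:.0%})")
```

With this breakdown you can see at a glance which error is most common and prioritize fixing it.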
Do Not Use Numbers for Success Levels
A common error that people make when working with success levels is to assign numbers to them; for example, they may say:
- complete success = 1
- success with one minor issue = 0.66
- success with a major issue = 0.33
- failure = 0
And then, instead of reporting success, they simply average these success levels for their participants. In our example, they might say that the success rate is:
(20 × 1 + 35 × 0.66 + 30 × 0.33 + 15 × 0) / 100 = 0.53 = 53%
This approach is wrong! The numbers assigned to the different levels of success are simply labels; they form an ordinal scale, not an interval or ratio scale. Although there is an order across these levels (e.g., failure is worse than success with a major issue), the numbers themselves have no mathematical meaning, so we cannot average them: nothing guarantees that the levels are evenly spaced on a 0-to-1 scale (or whatever other scale we're using between complete success and complete failure). In other words, we have no reason to assume that the difference between complete success and success with a minor issue is the same as the difference between failure and success with a major issue.
Since the temptation to average numbers is so strong in practice, we strongly recommend assigning word labels to levels of success instead of numbering them.
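The recommended reporting style can be sketched as follows: each level keeps a word label, and we report the percentage of users at each level rather than a single averaged score. The counts reuse the hypothetical 100-user distribution from the example above:

```python
from collections import Counter

# Hypothetical outcomes for 100 users, labeled with words, not numbers.
levels = (["complete success"] * 20
          + ["success with minor issue"] * 35
          + ["success with major issue"] * 30
          + ["failure"] * 15)

counts = Counter(levels)
for level, count in counts.items():
    print(f"{level}: {count / len(levels):.0%}")
```

Because the labels are strings, there is no way to accidentally average them; each level stays a separate metric, as the article recommends.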
Jakob Nielsen, Ph.D., is a User Advocate and principal of the Nielsen Norman Group which he co-founded with Dr. Donald A. Norman (former VP of research at Apple Computer). Dr. Nielsen established the "discount usability engineering" movement for fast and cheap improvements of user interfaces and has invented several usability methods, including heuristic evaluation. He holds 79 United States patents, mainly on ways of making the Internet easier to use.
Raluca Budiu is Director of Research at Nielsen Norman Group, where she consults for clients from a variety of industries and presents tutorials on mobile usability, designing interfaces for multiple devices, quantitative usability methods, cognitive psychology for designers, and principles of human-computer interaction. She also serves as editor for the articles published on NNgroup.com. Raluca coauthored the NN/g reports on tablet usability, mobile usability, iPad usability, and the usability of children's websites, as well as the book Mobile Usability. She holds a Ph.D. from Carnegie Mellon University.
This article was originally posted on July 20, 2021 at nng.com