To market and sell a new medical device in the United States, manufacturers must provide the Food and Drug Administration (FDA) with evidence that the device can be used safely and effectively. Part of this “evidence” is a validation usability study that evaluates all of your identified critical tasks. As resources, timelines, and even jobs can hinge on the success of a study, this process can be understandably stressful. Fortunately, the FDA has published multiple guidance documents to help manufacturers plan and conduct a sound validation usability study.
Yet, not all validation studies go well. In fact, quite a few of them go poorly due to issues that could have been minimized or avoided altogether.
For this reason, we have outlined a few prevalent mistakes responsible for derailing a Human Factors validation usability study. We hope that by increasing awareness of these common pitfalls, our readers and clients can better prepare for premarket submissions of their medical devices to the FDA.
1. Skipping formative usability studies altogether
Without a doubt, failing to conduct formative usability studies is the #1 reason validation studies fail. Human factors validation testing is the real thing; it’s the “final exam”. If the product doesn’t perform well, you may have to start testing all over again. That means an entirely new set of participants, study dates, and of course, revisions to your project’s budget.
And unfortunately, it’s astonishingly easy to miss usability problems without talking to end-users. You can brainstorm, predict, and assume all you want, but usability problems won’t rear their ugly heads until a user gets their hands on the product.
That’s why we conduct formative usability studies: they are the perfect opportunity to “practice”.
In addition to identifying design improvements, we employ formative studies to ensure there are no surprises once validation comes along. Formative testing is a catalyst for identifying usability issues and uncovering potential solutions. It allows you to work out the kinks before conducting a validation test that will be reviewed by the FDA.
Many manufacturers elect to skip formative testing for a variety of reasons. Some have confidence in their product’s ability to get through validation unscathed. Others are low on time or money, so they take the gamble based on resource limitations. It’s a simple matter of weighing the risks, but we always suggest doing formative work whenever possible.
2. Combining user groups
The FDA requires at least 15 participants per user group when completing a human factors validation study. In an effort to save time and resources, it’s tempting to combine users from similar fields into one user group. For example, a company might try to combine doctors and nurses into a single “healthcare provider” user group (rather than treating them separately). While there are times when this can be done, these exceptions are few and far between.
Here’s the problem: intended use is not always the same across all users within the “healthcare provider” domain.
Let’s say a company is creating a new patient multi-parameter monitoring system for use in a hospital. That is, the type of device that displays the patient’s vitals (e.g., CO2 levels, SpO2, heart rate, blood pressure). Through formative research, we might learn that nurses use similar (predicate) devices to verify that vital measures are recorded. A nurse might also monitor the data to verify that the patient is responding within a “normal” range across the board. By contrast, a cardiologist might use the same device to investigate the patient’s multi-day history of arrhythmia events, searching for trends, patterns, or responses to various treatments. In this sense, the cardiologist is using the device for a different purpose than the nurse.
It often comes down to whether components of those different tasks involve a critical task. If even one critical task is performed by a cardiologist and not a nurse, a separate group of cardiologists should evaluate that task. However, if nurses and cardiologists complete the same task, collapsing nurses and cardiologists into a single group might work. In such a case, combining user groups must be thoroughly justified in your FDA submission. FDA will ask for clarification if confusion arises, which can lead to lengthy delays during their review process. It is another matter of costs and risks.
As a side note: the vast majority of FDA submissions require some level of clarification after the submission. Annually, less than 1% of 510(k) submissions make it through the FDA submission process without some sort of clarifying documentation.
3. Omitting training (and a decay period)
When companies read the term “user interface” in the Human Factors guidance, they think only in terms of their “device”. After all, isn’t this what users “interface” with? The truth is that the term “user interface” applies to several device-related components, and not just the device itself.
Indeed, according to the FDA, a “user interface” refers to any and all components of the product, including the device, manufacturer supplied accessories, instructions for use (IFU), labeling, packaging, and training content.
Training is often neglected when preparing for a validation study. One reason might be that training is (unfortunately) the last piece of the puzzle manufacturers tackle in their design process. Many manufacturers approach training like a contingency plan; they will only start talking about training if they see obvious weaknesses in their device design or IFU. Incidentally, this delayed planning approach makes it harder to integrate training into the existing workflow — both from a process design standpoint, and from a sales model standpoint.
Another issue is that manufacturers may not guarantee onboarding or continual device training to all of their customers under all circumstances. For this reason, manufacturers may not have standardized training procedures prepared leading up to a validation study.
For the record: If the manufacturer will provide training every time in the real world, training must be included in validation testing.
Depending on the device in question, training is something that should be anticipated from the outset of the design process. It should be tested early and often, just like the device, to ensure it is consistent and effective.
There is another part to this training issue. Even when manufacturers incorporate training into a validation study, many omit a “decay period” following training. A decay period refers to the period of time between training completion and the start of the testing session. The goal is to replicate memory degradation (i.e., forgetting, misremembering) that naturally occurs between training and real-world use.
Importantly, an appropriate decay period depends on how the device is intended to be used. While uncommon, a manufacturer may administer training minutes before initial use of a new medical device. For example, some self-care smartphone apps have begun incorporating a mandatory app training module into the initial setup process; the user cannot proceed unless he or she completes the training module(s). In this situation, the “training” component is baked directly into the use of the product. However, these atypical cases shouldn’t be followed across all devices. Most medical devices used in a hospital, for example, rarely implement training in a similar way.
Regardless of device type or platform, however, FDA requires a training decay period that reasonably corresponds with the device’s real-world training and use. In some cases, this might be as little as an hour, in others it might be several days or weeks. The FDA guidance states:
FDA, 2016: Applying Human Factors and Usability Engineering to Medical Devices
“In some cases, giving the participants a break of an hour (e.g., a “lunch break”) is acceptable; in other cases, a gap of one or more days would be appropriate, particularly if it is necessary to evaluate training decay as a source of use-related risk”.
Keep in mind, including training and a training decay period adds time and cost to a study. Even if the hands-on part of the validation session takes only 30 minutes, you have to include time for training (e.g., 30 minutes) as well as time for training decay (e.g., 1–2 hours). This increases your participant incentive amounts and the total amount of time you need to reserve lab space.
While these factors do increase costs overall, understand that they may be required for the device in question. Keep focused on the long-term goal, and don’t get bogged down in these short-term costs.
4. Failing to create a realistic testing environment
Some manufacturers believe it’s necessary to replicate every last aspect of a use environment, spending thousands on background props. Other manufacturers fail to replicate the use environment at all; they supply a table, chair, and device to the user, then tell them to figure it out.
There is a balance point with setting up a realistic test environment. FDA recommends that a simulated-use context should be:
“…sufficiently realistic so that the results of the testing are generalizable to actual use. The need for realism is therefore driven by the analysis of risks related to the device’s specific intended use, users, use environments, and the device user interface”. (FDA, 2016: Applying Human Factors and Usability Engineering to Medical Devices)
In other words, if the manufacturer anticipates that an environmental factor will influence a user’s interaction or ability to interact with a device, then it should be present and accounted for in some way.
A good example of this is the presence of background noise in a hospital ICU simulation. There are many “dings” and “pings” and “beeps” in an ICU. Some of them might be the elevator down the hallway. Others might be due to patient monitors from neighboring areas. If a manufacturer needs to test awareness and response to their device’s alarm, then these confounding noises should be present during testing too. To read more about alarm design, check out our recent post on designing “user-friendly alarms”.
5. Excluding data from ‘bad’ participants
We’ve all had them in our studies. People who can’t keep their hands off their phones for more than two minutes. Others who don’t listen to the simplest of directions. And don’t forget the ones who make up their own rules because, hey, “this isn’t real anyways”.
Try as you may to design a study to minimize the impact of “bad” participants, Murphy’s Law holds true nonetheless: “Anything that can go wrong, will go wrong.”
Especially in a validation study for FDA, “bad” participants make it challenging to conduct effective research. But, that doesn’t mean you get to instantly toss out their data like it never happened. In fact, it’s all the more reason to take note of what is happening. This “bad” participant is arguably representative of the “worst” type of user you will have in a real-world setting. While atypical, their behavior is not beyond the realm of possibility, nor are the accompanying risks that result from poor compliance.
As the brains behind the development of a medical device, you have to approach your design with the following mindset: Design as if your least competent user will interact with your device in the most complicated situation. If you can drive success in these situations, you will be successful in “normal” situations as well.
Okay. So, that’s all well and good. But what do you do about the data collected from a “bad” participant?
There are a couple of approaches you can take. First and foremost, if you truly believe the “bad” participant was an anomaly, recruit and run one or two extra participants. Next, use the “bad” participant’s performance data to quantitatively demonstrate that he or she was beyond the norm of your sample. “Beyond the norm” is often defined as some measure falling at least two standard deviations above or below the sample’s mean. If your “bad” participant fails considerably more tasks than your sample mean, this is a good starting point in your argument to FDA. (Note that this on its own may not be enough rationale for FDA.)
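To make the “two standard deviations” check concrete, here is a minimal sketch in Python. The function name, threshold, and failure counts are all hypothetical illustrations, not anything FDA prescribes; it simply flags whether one participant’s task-failure count sits more than two sample standard deviations from the group mean.

```python
# Hypothetical outlier check for a "bad" participant's task failures.
# All data below are made up for illustration.
import statistics

def is_beyond_norm(failures_per_participant, candidate_index, threshold_sd=2.0):
    """Return True if the candidate's failure count deviates from the
    sample mean by more than threshold_sd sample standard deviations."""
    mean = statistics.mean(failures_per_participant)
    sd = statistics.stdev(failures_per_participant)  # sample standard deviation
    deviation = abs(failures_per_participant[candidate_index] - mean)
    return deviation > threshold_sd * sd

# 16 participants; the last one failed far more critical tasks than the rest.
failures = [0, 1, 0, 0, 2, 1, 0, 1, 0, 0, 1, 2, 0, 1, 0, 9]
print(is_beyond_norm(failures, candidate_index=15))  # True for this sample
```

Remember that a numeric argument like this is only a starting point; FDA will still expect a qualitative root-cause discussion of what that participant actually did.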
Running extra participants also gives your submission a “full” 15 users in the event that FDA accepts your argument that the “bad” participant’s data should be taken with a grain of salt. If FDA accepts your argument but you didn’t run those extra participants, they might come back and ask you to run a few more to meet their minimum sample requirement of 15. Running a couple extra lets you get ahead of the game so you don’t have to replicate your study efforts post-hoc.
Importantly, FDA views your validation study as a “qualitative” evaluation. Meaning, it’s not all about the numbers. If there is reason to suspect that your “bad” participant’s outcomes could plausibly occur in the real world (i.e., the root causes of their use errors are rationally tied to the design of the interfaces), then your argument for FDA to overlook a given use error might be kaput.
A lot of work should be put into the recruiting process for your study to minimize the possibility of ever running a “bad” participant in the first place. If you are working with an outside recruiting agency, for example, ask about their approach or strategies for keeping “bad” participants out of their participant pool. Likewise, ensure that the researchers conducting your validation study have the experience needed to rein in a participant when they start to go rogue. This way, even if a “bad” participant does manage to make it into your validation study, the negatives can be minimized. Far too often, inexperienced researchers find themselves in the role of the “Human Factors expert” on their team. When things go wrong during a study session — remember, Murphy’s Law — they don’t know how to handle it effectively.
Keep in mind, there is a huge difference between a “bad” participant and a participant who performs poorly in the study. A “bad” participant is one whose performance is negatively affected by their own, self-imposed, extraneous factors. A poorly performing participant is one whose performance is primarily due to your device. Don’t start pointing fingers until you know for sure which type of participant you have. What may appear to be a “bad” participant at the beginning of a study may turn out to be the norm by the time you finish running your entire sample of participants.
6. Coaching participants
Unlike formative research where there are no “rules” for how a study should be conducted, a validation study has a few “must do” and “must not do” items. One of the “must not do” items is coaching participants.
What do we mean by “coaching participants”?
“Coaching” in observational research refers to research personnel (i.e., moderator, confederate) helping the participant directly or indirectly reach a judgment, decision, or action. To state the obvious, “coaching” is problematic because it fundamentally changes what the participant might have actually done or said, had he or she been left alone to figure things out. Once the participant’s opinion or behavior is altered in some way, it’s impossible to tease apart what was due to the influence of coaching and what reflected the participant’s own interpretation of the user interface.
Obvious forms of coaching involve directing “stuck” participants to appropriate sections of the interface. As painful as it can be to sit there and watch while a participant struggles, a researcher’s primary job in a validation study is to remain hands-off at all costs. The only time you should intervene in a participant session is if he or she is going to hurt themselves, or cause permanent damage to your only device. (If the latter situation occurs, prepare to write a major justification in your FDA submission.)
Less obvious forms of coaching involve the way you frame or present questions. For example, questions like, “how much do you like this product?” may introduce bias due to the fact that the question is “positively” framed. Test items like, “please rate how much you like or dislike this product” are better because they are neutrally framed.
Even less obvious forms of coaching might involve the mere presentation of questions altogether. As an example, imagine that you are conducting a long, 3-hour session as part of a validation study. It would be unreasonable to expect any participant to recall a single task or decision made hours earlier in the session, so your team agrees to conduct multiple debriefs throughout the study instead. That way, you are following up on a more manageable and memorable period of time.
However, during one of your initial debriefs, you only call attention to tasks where you observed use-errors, close calls, and difficulties. You never cover a task with successful performance. The participant observes this trend after a few questions, and realizes that anything you ask about has to do with something they “messed up” in some way. And, while he or she doesn’t know the intended response or behavior you expected to see, they know to pay closer attention to tasks similar to this one later in the study.
These less obvious forms of coaching are difficult to avoid. In some cases, though, avoiding them requires changing major components of your study design. Conducting shorter study sessions is one work-around. Another solution is to also follow up on tasks the participant completed successfully. Sometimes you can frame these questions like, “walk me through how you did x, y, or z.” You can use a similar approach with tasks the participant struggled with, too.
7. Failing to properly identify root causes
FDA requires that all observed “difficulties”, “close-calls”, “use-errors”, and “negative use comments” for critical tasks be evaluated to determine each root cause. Specifically, what about the user interface(s) — if anything — made the participant perform in the way that they did?
Suffice it to say, understanding the root cause for each and every issue is a challenge.
Less experienced researchers tend to accept a participant’s initial explanation at face value. For example, you will often hear a participant say, “oh, I just skipped that step by mistake.” But this may or may not suffice to address the root cause of the issues you observed. People are not particularly good at identifying the hows and whys of their behaviors. What’s more, some have a hard time opening up in a social situation like an interview to explain their thinking on the fly. Less experienced researchers feel uncomfortable with this dilemma, so they move on, fearing that they will irritate a participant or make them feel embarrassed.
By contrast, more experienced researchers know how to press participants in subtle and respectful ways. They have also learned that people are more resilient in these situations than you might expect. During their probing efforts, experienced researchers understand that there may be deeper reasons why a participant skipped a step:
- Differences from the participant’s workplace processes
- Instructions that put too many “steps” together in one sequence or sentence
- Instructions failing to make the step stand out enough
- Difficulties understanding a word or phrase in the step
- Fatigue from the study session (i.e., multiple hours of testing)
Note that not all “root causes” have to conclude that a device interface was to blame. Sometimes, the user’s background, experience, and expectations play a role too. Likewise, participants do occasionally skip a step by genuine mistake — an actual mistake, such as the pages of the IFU getting stuck together. That’s okay; no big deal. However, as a researcher you must do your due diligence to determine whether the root cause can be explained at a deeper level.
The more transparency you have in your root cause analysis, the easier it is for FDA reviewers to make sense of the “story” being told through your analysis. If that story has gray areas, they will put your submission on pause while they ask for further clarification.
Human factors validation testing can be challenging. But it doesn’t have to be if medical device manufacturers make themselves aware of common pitfalls. Because these mistakes truly are so prevalent, the list above is a very good starting point.
What common mistakes have we missed? If you have suggestions or tips you want us to add to our list, please drop a note in the comment section below. We’d love to hear from you!
Lastly, what other topics would you like to see on the Research Collective blog? We would love to address any of your questions or suggestions!
About the Authors
Joe O’Brian | Senior Human Factors Scientist | Research Collective
Joe O’Brian is a Senior Human Factors Scientist at Research Collective. He has co-authored articles on topics ranging from judgment and decision making to education and healthcare technologies. At Research Collective, his contributions include project planning, observational and biometric research, and advanced statistical analysis for major automotive and healthcare organizations. You can find Joe on LinkedIn here.
Anders Orn | Human Factors Scientist | Research Collective
As a Human Factors and User Experience Researcher, Anders Orn plans and conducts observational research at Research Collective. While he is involved in many aspects of research, Anders especially enjoys usability testing in the healthcare and automotive industries, as it offers a unique opportunity to examine human behavior. You can find Anders on LinkedIn here.