The NASA-TLX and Human Factors

In 1988, Sandra G. Hart of NASA’s Human Performance Group and Lowell E. Staveland of San Jose State University introduced the NASA Task Load Index (NASA-TLX). With more than 15,000 citations listed on Google Scholar since 1988, it has spread far beyond its original application in aviation and far beyond the English language (Hart, 2006).

The NASA-TLX estimates users’ perceived cognitive demand, which can help gauge a system’s usability, effectiveness, or comfort. The “immediate, often unverbalized impressions that occur spontaneously” (Hart & Staveland, 1988) are of particular interest, as they are either difficult or impossible to observe objectively. Researchers are not mind-readers, so understanding how a user feels sometimes requires – shocker – asking them how they feel.

Six Subscales

A key benefit of the NASA-TLX is how quick and easy it is to administer. Users simply rate how demanding a given task was along six subscales, or dimensions, on a scale from 1 (low/good) to 20 (high/poor). The original dimensions are as follows (a toy administration sketch appears after the list):

  • Mental Demand – How much mental and perceptual activity was required? Was the task easy or demanding, simple or complex, exacting or forgiving?
  • Physical Demand – How much physical activity was required? Was the task easy or demanding? Slow or brisk? Slack or strenuous? Restful or laborious?
  • Temporal Demand – How much time pressure did you feel due to the rate or pace at which the tasks or task elements occurred? Was the pace slow and leisurely or rapid and frantic?
  • Frustration Level – How insecure, discouraged, irritated, stressed, and annoyed versus secure, gratified, content, relaxed and complacent did you feel during the task?
  • Effort – How hard did you have to work (mentally and physically) to accomplish your level of performance?
  • Performance – How successful do you think you were in accomplishing the goals of the task set by the experimenter (or yourself)? What is your level of satisfaction?
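
To make the administration concrete, here is a toy Python sketch of a console-based version. It is purely illustrative rather than the official instrument: the prompts paraphrase the questionnaire items above, and the function name collect_ratings is our own invention.

    # Toy console administration of the six NASA-TLX subscales.
    # Prompts paraphrase the items above; this is not the official instrument.

    SUBSCALES = {
        "Mental Demand": "How much mental and perceptual activity was required?",
        "Physical Demand": "How much physical activity was required?",
        "Temporal Demand": "How much time pressure did you feel?",
        "Frustration Level": "How insecure, discouraged, or annoyed did you feel?",
        "Effort": "How hard did you have to work?",
        "Performance": "How successful were you in accomplishing the task goals?",
    }

    def collect_ratings() -> dict[str, int]:
        """Collect a 1 (low/good) to 20 (high/poor) rating for each subscale."""
        ratings = {}
        for name, question in SUBSCALES.items():
            while True:
                answer = input(f"{name}: {question} [1-20]: ")
                if answer.isdigit() and 1 <= int(answer) <= 20:
                    ratings[name] = int(answer)
                    break
                print("Please enter a whole number from 1 to 20.")
        return ratings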

The authors held these dimensions to be key ingredients of overall “workload” for “most people” in “most scenarios” (Hart, 2006), but they have been tailored to different studies over the years. Subscales are often added, deleted, or modified. For example, the Surgery Task Load Index (SURG-TLX; Wilson et al., 2011) changed three dimensions to better represent the workload of surgeons in the operating room. Modifying questionnaires in this way should be done cautiously because it essentially creates a new questionnaire, which in turn requires “establishing validity, sensitivity, and reliability of the new instrument before using it” (Hart, 2006).

Validating the NASA-TLX was critical to show that subscale ratings could be combined into a single overall workload score. This has traditionally been done using one of two methods, both sketched in code after the list:

  • Weighted combination – users are shown all 15 pairwise comparisons of dimensions (e.g., Mental Demand vs. Physical Demand, Effort vs. Performance, and so on) and asked to select which better represented their experience of demand. The number of times a dimension is selected determines its weight (or importance); the weights are then multiplied by the raw ratings, and the weighted ratings are averaged or summed into an overall score.
  • Unweighted combination – users skip the pairwise-comparison procedure. Instead, raw ratings from each subscale are simply averaged or summed into an overall score, which assumes that each dimension was equally important (this is often referred to as a Raw-TLX, or RTLX, score; Hart, 2006).
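
To make the arithmetic concrete, here is a minimal Python sketch of both calculations. It assumes the 1–20 ratings described above and a completed set of 15 pairwise comparisons; the function names raw_tlx and weighted_tlx are ours, and the example numbers are made up.

    # Minimal sketch of the two traditional NASA-TLX scoring methods.
    # Assumes 1-20 ratings; names and example data are illustrative only.

    SUBSCALES = ["Mental Demand", "Physical Demand", "Temporal Demand",
                 "Frustration Level", "Effort", "Performance"]

    def raw_tlx(ratings: dict[str, float]) -> float:
        """Unweighted (Raw-TLX) score: the plain mean of the six ratings."""
        return sum(ratings[s] for s in SUBSCALES) / len(SUBSCALES)

    def weighted_tlx(ratings: dict[str, float], wins: dict[str, int]) -> float:
        """Weighted score: each rating is multiplied by the number of pairwise
        comparisons its dimension won (weights sum to 15, one per pair), and
        the weighted sum is divided by 15."""
        assert sum(wins.values()) == 15, "expected one winner per pair (15 pairs)"
        return sum(ratings[s] * wins[s] for s in SUBSCALES) / 15

    # One participant's (made-up) ratings and pairwise-comparison wins.
    ratings = {"Mental Demand": 14, "Physical Demand": 3, "Temporal Demand": 10,
               "Frustration Level": 8, "Effort": 12, "Performance": 6}
    wins = {"Mental Demand": 5, "Physical Demand": 0, "Temporal Demand": 3,
            "Frustration Level": 2, "Effort": 4, "Performance": 1}

    print(f"Raw-TLX:      {raw_tlx(ratings):.2f}")          # 8.83
    print(f"Weighted TLX: {weighted_tlx(ratings, wins):.2f}")  # 11.33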

The ability to obtain a single value that captures a user’s overall experience of workload has been a particularly appealing feature of the NASA-TLX, and researchers have long compared the different calculation methods (cf. Byers, Bittner, & Hill, 1989; Hendy, Hamilton, & Landry, 1993; Liu & Wickens, 1994). Yet 35 years after its inception, the mere idea of calculating an overall score is now on the chopping block…

Overall Workload Scores – ‘Mathematically Meaningless’?

A recent article has given some cognitive workload researchers sweaty palms, and with a title like The Mathematical Meaninglessness of the NASA Task Load Index (Bolton, Biltekoff, & Humphrey, 2023), it’s no wonder. The authors performed a sophisticated level-of-measurement analysis to assess fundamental properties of the NASA-TLX, including what different ratings even mean. We highly recommend reading that article for more detail, but here are some main takeaways:

  • Overall workload scores are ‘mathematically meaningless’ if the traditional calculations mentioned above are used. This is for a few reasons, including that:
    • Ratings for the Effort subscale are not directly comparable to other dimensions, so combining them all is conceptually like combining “different temperatures, some in Fahrenheit and some in Celsius” (Bolton et al., 2023, p. 598).
    • Ratings across some dimensions tend to be highly correlated, meaning that they are not truly independent parts of the “workload” whole.
  • Subscale ratings alone are not ‘mathematically meaningless’. After all, they are nothing more than simple Likert-scale ratings for different concepts.
    • The analysis also confirmed that these ratings can be analyzed with popular parametric statistical tests (we’re looking at you, t-tests).
  • Future research is needed to determine whether and how dimension scores can be collapsed into an overall workload score.
    • In a pinch, you could drop the oddball Effort subscale and combine the others.
    • Some alternative methods have been proposed but not fully evaluated (yet).

The moral of the story is to stick to analyzing the subscale scores separately, at least until these problems are resolved or alternative scoring methods are validated.
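
In practice, that means testing each subscale on its own. Here is a hedged sketch using SciPy’s independent-samples t-test to compare, say, Mental Demand ratings between two interface designs; the data are invented for illustration and the variable names are ours.

    # Hedged sketch: comparing one NASA-TLX subscale between two designs.
    # Data are made up for illustration. Requires SciPy (pip install scipy).
    from scipy import stats

    # Mental Demand ratings (1-20) from two independent participant groups.
    design_a = [12, 14, 9, 15, 11, 13, 10, 14]
    design_b = [7, 9, 6, 10, 8, 5, 9, 7]

    # Per Bolton et al. (2023), subscale ratings can be analyzed with popular
    # parametric tests, so an independent-samples t-test is appropriate here.
    result = stats.ttest_ind(design_a, design_b)
    print(f"Mental Demand: t = {result.statistic:.2f}, p = {result.pvalue:.4f}")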

NASA-TLX Advantages

Valid

The NASA-TLX subscales reasonably represent sources of cognitive workload across a variety of tasks and have been repeatedly validated as acceptable estimates of the ingredients of subjective workload (Hart & Staveland, 1988; Rubio, Díaz, Martín, & Puente, 2004; Xiao, Wang, Wang, & Lan, 2005).

Applicable to Multiple Domains

The NASA-TLX is flexible and applicable to a number of domains. Originally, it was intended for use in aviation, but quickly spread to air traffic control, civilian and military cockpits, robotics, and unmanned vehicles. In later years, studies in the automotive, healthcare, and technology domains used the NASA-TLX (Hart, 2006) or variants like the SURG-TLX (Wilson et al., 2011).

Diagnostic-ish

The NASA-TLX has been shown to have some diagnostic abilities. While the original aggregate scores are now under scrutiny, the subscales can still help identify sources of workload during a given task (Hart, 2006). This feature can be remarkably helpful for developers hoping to improve their design.

Highly Accessible

In addition to having been translated into at least 12 languages, the NASA-TLX can be administered in various formats. Paper and pencil is still popular, but it has also been integrated into computer software packages, an iOS app, and Android apps. Each of these options is completely free, which is significant. The accessibility of this measure has also produced a massive repository of data from different studies, which allows meta-analyses to survey ratings across hundreds of studies and benchmark what should be interpreted as a “high” or “low” score (cf. Grier, 2015).

NASA-TLX Disadvantages

Overall Workload Scores Are Unreliable

The ability to obtain a single score for overall workload has been a hallmark NASA-TLX feature, so evidence that this should not be done (Bolton et al., 2023) is a glaring disadvantage. Hopefully, a suitable alternative calculation method will be validated soon to fill the current void.

Memory

Asking a user to complete a scale during a task can be rather intrusive. Unfortunately, waiting until the task is complete introduces its own problems: users tend to forget details of the task, and because human memory deteriorates over time, any delay between the task in question and the NASA-TLX itself erodes the quality of the ratings. This applies to all retrospective self-report questionnaires and is not unique to the NASA-TLX.

Task Performance Can Bias Ratings

A user’s perception of their own task performance can weigh heavily on all sorts of ratings. For example, workload ratings could be higher or lower depending on whether the user believed they completed the task successfully – even though the workload was in fact the same either way.

Subjective

The NASA-TLX is a subjective measure of a user’s perceived cognitive workload, nothing more. Importantly, it is NOT a measure of the “true” biological workload required by the brain to use a system, however sensational that would be. Practitioners employing the NASA-TLX must keep in mind what they are measuring and avoid treating perceived workload ratings as anything other than what they are. This again applies to all self-report questionnaires, not just the NASA-TLX. We recommend using biometrics in addition to or instead of subjective ratings for more reliable data.

References

Bolton, M. L., Biltekoff, E., & Humphrey, L. (2023). The mathematical meaninglessness of the NASA Task Load Index: A level of measurement analysis. IEEE Transactions on Human-Machine Systems, 53(3), 590-599.

Grier, R. A. (2015). How high is high? A meta-analysis of NASA-TLX global workload scores. Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 59(1), 1727-1731.

Hart, S. G. (2006). NASA-Task Load Index (NASA-TLX); 20 years later. Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 50(9), 904-908.

Hart, S. G., & Staveland, L. E. (1988). Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in Psychology (Vol. 52, pp. 139-183). North-Holland.

Hendy, K. C., Hamilton, K. M., & Landry, L. N. (1993). Measuring subjective workload: When is one scale better than many? Human Factors, 35(4), 579-601.

Hill, S. G., Iavecchia, H. P., Byers, J. C., Bittner Jr., A. C., Zaklade, A. L., & Christ, R. E. (1992). Comparison of four subjective workload rating scales. Human Factors, 34(4), 429-439.

Liu, Y., & Wickens, C. D. (1994). Mental workload and cognitive task automaticity: An evaluation of subjective and time estimation metrics. Ergonomics, 37(11), 1843-1854.

Rubio, S., Díaz, E., Martín, J., & Puente, J. M. (2004). Evaluation of subjective mental workload: A comparison of SWAT, NASA‐TLX, and workload profile methods. Applied Psychology, 53(1), 61-86.

Wilson, M. R., Poolton, J. M., Malhotra, N., Ngo, K., Bright, E., & Masters, R. S. (2011). Development and validation of a surgical workload measure: The Surgery Task Load Index (SURG-TLX). World Journal of Surgery, 35, 1961-1969.

Xiao, Y. M., Wang, Z. M., Wang, M. Z., & Lan, Y. J. (2005). The appraisal of reliability and validity of subjective workload assessment technique and NASA-Task Load Index. Chinese Journal of Industrial Hygiene and Occupational Diseases, 23(3), 178-181.
