In research, no two concepts are more confused with one another than validity and reliability. It’s not necessarily the difference between them that is so baffling; it’s more the fact that there is so much more to it. The tricky stuff comes when we examine the different types of validity and reliability, as well as when or why they matter the most.
Below we’ve outlined the basics of validity and reliability as they apply to user-centered research. We hope that you find them helpful. If you have an example you’d like to share, please talk about it in the comment section below!
- Inter-Rater Reliability
- Test-Retest Reliability
- Parallel-Forms Reliability
- Internal Consistency
- Internal Validity
- External Validity
- Construct Validity
- Criterion-Related Validity (Concurrent & Predictive)
To start things off, let’s get on the same page about what we mean by the term “reliability”. Reliability is the degree to which a specific research method or tool is capable of producing consistent results from one test to the next.
In domains such as mechanical engineering, reliability is pretty easy to conceptualize. For example, if you wanted to know the distance between points on a flat surface, you could use a ruler. If you were to perform the measurement several times in a row, your results would have relatively high reliability.
However, the truth is that each measurement you make won’t be exactly the same as the ones before it. One reason is that your measurement reliability depends on your interpretation of what the tool is telling you. Is the distance closer to 5-11/16 inches, or 5-3/4? Did you start measuring from the inside or the outside of the line last time? In this situation, you impose a qualitative interpretation on a tool believed to afford a quantitative result. As a result, your reliability (i.e., consistency) is good – maybe even great – but not perfect.
If you were a general contractor framing a new house or installing kitchen cabinets, the level of measurement reliability described above (±1/16″) is probably fine. But when you get into situations where high reliability is essential – like counting atoms to redefine the kilogram – then you start to see how seemingly small deviations can wreak havoc on the overall accuracy of your results.
Reliability is just as important in the social sciences as it is in the physical sciences. For instance, if Joey completes the same survey once in January, once in February, and again in March, we could call the survey ‘reliable’ if his scores do not waver. The more they change from one month to the next, the lower we estimate the survey’s overall reliability to be.
All study criteria must remain consistent through all three months. This would include things like the ordering of survey questions, the time of day Joey took the survey, and the time he gets to think about each question before responding. Even environmental and contextual factors are notable. Whether Joey ate breakfast, who scored his survey, and even what the relative humidity was on test day matter. They all add up. Social science research can be incredibly difficult to conduct with a high degree of reliability.
In reality, we cannot control every last detail in social science research. There are too many variables, and their individual and collective impact on the final result(s) is very, very difficult to model and measure without studying relationship(s) at a population level (i.e., the entire known group of end-users in the world who could potentially interact with the device or system). Of course, this doesn’t stop us from trying to account for the variables we anticipate to matter the most.
Indeed, a key difference between reliability as it applies to physical sciences vs. social sciences is that the latter must rely almost exclusively on building theoretical models to explain the relationships between variables. There is no direct, single, absolute measurement of “human cognition”. There are only smaller, proxy measures; things that tell us indirectly about one component of cognition (e.g., cognitive workload). And, through an amalgamation of these proxy measures, we make claims about cognition as a whole.
One goal of new research – often unstated – is to increase the reliability of these proxy measures over time. Only through a process of successive approximations can we approach a consensus on how these small, proxy measures are handled. To aid in this pursuit, researchers have defined multiple types of reliability as an estimator of the measure’s overall reliability. The following sections describe each type of reliability in greater detail.
Inter-rater reliability means that there is high agreement among judges or raters when making assessment decisions. The terms inter-rater agreement, inter-rater concordance, and interobserver reliability are synonymous.
In human-centered research, inter-rater reliability might be considered if researchers are attempting to categorize a qualitative task. During this categorization process, two or more raters individually assign an outcome or event into predetermined “bins” in an effort to quantify a user’s performance.
Not surprisingly, this can get tricky. If raters don’t universally understand these “bins”, you run the risk of lowering the inter-rater reliability. For example, suppose a person struggles to navigate through a website, but eventually finds the information they needed. One rater might score that outcome as a “success”, while a second rater might deem it a “success with difficulty”. Two things might be happening in situations like this. Either the “bins” you’ve defined are not discrete enough, and/or your raters need more training. Adding in another rater is always a good way to test these assumptions, too.
Having clear category criteria and consensus amongst raters is essential before beginning an analysis. The more consensus there is between raters, the higher your inter-rater reliability is, and vice versa. There are a number of statistical methods you can use to determine inter-rater reliability. We won’t discuss them here, but we encourage you to investigate each method on your own:
- Cohen’s kappa: easy to compute; commonly used with categorical data
- Fleiss’ kappa: helpful when sampling from multiple raters
- Krippendorff’s alpha: more complicated to compute, but highly accurate
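To make the first of these concrete, here is a minimal sketch of Cohen’s kappa in Python. The ratings and the “success” / “difficulty” / “failure” bins come from our website-navigation example and are purely hypothetical:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: proportion of items both raters binned identically.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance overlap of each rater's category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical outcomes assigned by two raters to six usability tasks.
rater_1 = ["success", "success", "difficulty", "failure", "success", "difficulty"]
rater_2 = ["success", "difficulty", "difficulty", "failure", "success", "success"]
print(round(cohens_kappa(rater_1, rater_2), 2))  # → 0.45
```

Note how kappa (0.45) is lower than the raw agreement rate (4/6 ≈ 0.67): the statistic discounts the agreement the two raters would reach by chance alone.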
Note that inter-rater reliability is different from intra-rater reliability, in the same sense that an interstate is different from an intrastate. While inter-rater reliability focuses on the agreement or disagreement between two independent raters, intra-rater reliability refers to multiple rating attempts made by the same individual. Human-centered research uses this latter approach less frequently due to its susceptibility to bias. However, in some situations (e.g., a limited number of subject matter experts), it might be the best option you have.
Test-retest reliability refers to the consistency of a test over time, when completed by the same individual(s). In our example above, Joey completed the same survey once per month for a total of three months. If scores on the three surveys were highly consistent, we could make the argument that his test-retest reliability is high.
To say the survey itself has high test-retest reliability, however, many more people like Joey would need to take it. Then, we would evaluate each one (i.e., within subjects) to determine if their results were as consistent as Joey’s results. Next, we would compare those results across individuals (i.e., between subjects). If there is minimal deviation across scores, then we could say that our survey has a high test-retest reliability overall.
It’s important to restate here that establishing test-retest reliability cannot occur unless all other variables remain the same over time. Joey’s scores probably wouldn’t have high test-retest reliability if he was taking antidepressants between his first and second survey attempts; the medication (i.e., a confounding variable) might explain the change in scores. The following conditions need to be the same in order to rule out other explanations for test similarities or differences:
- Measurement tools (i.e., Joey’s survey, computer)
- Testing conditions (i.e., time, location)
- Observers (i.e., for scoring that requires qualitative assessment)
- The same interval of testing (i.e., one month apart)
Generally speaking, if the Cronbach’s alpha score – a coefficient of reliability most commonly associated with internal consistency – is 0.7 or higher, we consider that measure to have “good” reliability. To learn more about Cronbach’s alpha and how to calculate it for your data, we suggest visiting: https://stats.idre.ucla.edu/spss/faq/what-does-cronbachs-alpha-mean/
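For the curious, Cronbach’s alpha can be computed directly from item-level data. The following Python sketch uses invented survey scores (three items, five respondents) purely for illustration:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """items: one list of scores per survey item, aligned by respondent."""
    k = len(items)
    item_variance_sum = sum(pvariance(item) for item in items)
    # Each respondent's total score across all items.
    total_scores = [sum(scores) for scores in zip(*items)]
    return (k / (k - 1)) * (1 - item_variance_sum / pvariance(total_scores))

# Hypothetical 3-item survey answered by five respondents (1-5 scale).
items = [
    [3, 4, 5, 2, 4],  # item 1
    [3, 5, 5, 2, 4],  # item 2
    [2, 4, 5, 3, 4],  # item 3
]
print(round(cronbach_alpha(items), 2))  # → 0.94, above the 0.7 cutoff
```

Because the three items rise and fall together across respondents, alpha comes out well above 0.7; if the items were unrelated, the item variances would swamp the total-score variance and alpha would drop toward zero.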
Parallel-forms reliability – also referred to as equivalent forms reliability – is the degree to which two sets of measures in a study evaluate the same construct. To be clear: we’re not talking about “measures” as a single survey question or item here. The emphasis is on the “set” of measures.
Some research situations may require the use of multiple versions of the same questionnaire. For example, if a researcher wants to know if their revised version of a questionnaire still produces the same results as the original questionnaire, he or she might administer both questionnaires to the same individual in a single session. If the two versions of the questionnaire have high parallel-forms reliability, then the results should be the same for both. But, if there is a noticeable difference in scores, this indicates that the revised questionnaire was different from the original.
Parallel-forms reliability is essential to check on standardized tests, such as IQ tests, the GRE, and the SAT. Standardized tests require constant updating and revisions to minimize learning effects and cheating. Therefore, a new “batch” of questions must replace an existing “batch” of questions.
Internal consistency is similar to parallel-forms reliability in that it evaluates the degree to which two measures assess the same construct. It is different, however, because it refers to individual items within a survey or questionnaire.
Consider a researcher who wants to measure the cognitive load a new smartphone application puts on a user. To accomplish this goal, he or she might evaluate an assortment of measures believed to be proxies for cognitive load, such as the user’s pupil diameter, heart rate, and responses to the NASA Task Load Index (TLX). Knowing from the published literature that these measures possess varying levels of construct validity (described below), the question becomes how consistent the measures are with one another (i.e., their internal consistency) – that is, how well they correlate.
For the record, internal consistency does not indicate how well these items actually predict a central construct (e.g., cognitive load). It only tells you whether items are related amongst themselves.
Like the concept of “reliability”, there are many types of validity. Our working definition of validity is, “the degree to which the tool actually measures what it claims to measure”. It refers to the credibility or integrity of your research. Additionally, it helps you answer whether your results are generalizable to other situations or contexts.
Internal validity refers to the extent to which the experimental manipulations cause the effects that are observed. In other words, it is the strength of the causal relationship between the independent and dependent variables. This type of validity is a large reason why experts use significance testing in research. Significance testing helps determine whether results are due to experimental manipulations (i.e., independent variables), or due to randomness and chance. If internal validity is low, even a statistically significant result may reflect confounding variables rather than your manipulation.
In social science research, significance testing – for better or for worse – is often based on two values: your study’s effect size, and that effect size’s p-value. The effect size speaks to the “strength” of the relationship between the independent variable (IV) and the dependent variables (DVs). The larger the effect size, the stronger the relationship.
The p-value tells a different story – one that relates more to internal validity. Formally, a p-value is the probability of observing an effect at least as large as yours if there were, in truth, no relationship between your IV and DVs. It answers the question: could this result plausibly be smoke and mirrors?
A small p-value indicates that the observed relationship would rarely arise from randomness alone. For example, p = 0.05 means that if there were truly no relationship, an effect this large would appear only 5% of the time. Note that this is not the same as saying there is a 95% chance your manipulation caused the effect; it simply means that chance is an unlikely (though never impossible) explanation.
External validity is an umbrella term that describes the extent to which research conclusions are generalizable to other people, situations, and time. These three components can be subdivided into their own forms of validity:
- Ecological validity: Can the results be generalized to other settings?
- Population validity: Can the results be generalized to other people?
- Historical validity: Can the results be generalized over time?
External validity – especially its ecological and population components – is important to consider when progressing your testing from a controlled lab setting to a real-world environment. As an example, let’s say two teams of researchers want to evaluate how easily users can change the radio station in a new automotive infotainment system. Both teams will use the same set of tasks, then compare results at the end. Team A conducts their study first and decides to test the infotainment system on a realistic benchtop computer station in a stationary, well-lit, air-conditioned lab with members from their own team. To their pleasure, they find that almost all participants complete the study tasks without a single issue or negative comment.
Team B runs their study afterwards, anticipating that few issues – if any – will be observed. They test the same infotainment system as Team A, but do so in an actual vehicle, and with random people they recruited from around town. What’s more, they ask their participants to complete the study tasks while driving on roadways with moderate traffic. Much to their displeasure, they find that more than half of their participants experience issues during one or more tasks. And, to make matters worse, they receive several negative comments about the system’s ease of use.
Team B panics. They ruminate on the number of ways they possibly, maybe, probably, definitely biased each and every participant. In reality, though, the environmental demands and participant characteristics were simply different from those in Team A’s study. Taken together, the two studies illustrate poor external validity. In truth, Team A’s study is the culprit responsible for this mess. They set the bar high and didn’t offer the caveat that their study sample and environment may have influenced the results. As a result, Team B expected diamonds but wound up with dirt.
If you find yourself in this situation, a good strategy to follow from the beginning of the study is to communicate two messages. First, any study conducted in a controlled environment and with “knowledgeable” participants will almost always generate attractive results, compared to results you’d see in the real-world. Second, any issues you see in a controlled environment – no matter how small they seem – will be exacerbated once you add in additional environmental demands and participant variability. Fix those design issues before going any further with testing, or else you’ll run the risk of getting the same feedback across all participants.
Construct validity is the degree to which a research tool measures the construct it’s supposed to measure. As an example, think back to our usability study example in which the researcher used pupil diameter, heart rate, and the NASA-TLX to measure cognitive load. While internal consistency is how well the three tools correlate with one another, construct validity is how well they actually measure cognitive load. If construct validity is low, there is little point in using those research instruments because they aren’t measuring what you want to measure!
The weird part about construct validity is that it’s never based on absolutes in the beginning. For example, while our standard convention of a “linear foot” (12 inches) appears on millions of measuring devices in (nearly) the exact same way, its origins were completely arbitrary. Indeed, the “foot”, by which we define many of today’s important measurements, comes from human body dimensions. The Egyptians, for instance, defined their foot – a “djeser” – as a measure of four male palms placed side by side (thumbs tucked under). Centuries later, some Romans used the “Drusianus foot”, taken from the measurement of Nero Claudius Drusus’ sweaty, sandaled soles. Then, one day in July 1959, the International Yard and Pound Agreement established the standard for the “foot” as we know it today.
In the context of social science, the same issues apply. In the early stages of developing a new psychological construct, some researcher has to eventually place a flag in the ground and say, “henceforth, this tool is a measurement of this construct!” Researchers then attempt to define areas around and related to that construct, with the goal of eventually replacing the original tools with (ideally) more reliable and accurate ones.
Criterion-related validity refers to how well one measure predicts the outcome of another. Just to keep you on your toes, there are two main types: concurrent validity and predictive validity. Both are very similar and differ from one another solely by the time at which the two measures are administered. Concurrent validity focuses on two measures administered at roughly the same time; if Measure A predicts the outcome of Measure B, the concurrent validity is high. Predictive validity, on the other hand, refers to one measure predicting the results of another measure at a later time.
Concurrent validity is a measure of how much a tool correlates with existing, validated tools intended to measure the same construct. Demonstration of concurrent validity is a common method of justifying the use of a new research measure; if a proposed measure significantly correlates with an already validated one, that is promising evidence towards eventually establishing the new measure’s validity.
Meanwhile, predictive validity is the degree to which a future measurement of a variable can be predicted by a current measurement. The go-to example everyone knows is the SAT. Many colleges ask for prospective students’ SAT scores because the test’s predictive validity is high. While certainly imperfect, high SAT scores significantly correlate with high GPA in college. Tests like these are used all the time by organizations that want to recruit the best prospects using the fewest resources.
Correlation is heavily involved in establishing concurrent and predictive validity. While both positive and negative correlations can be meaningful, researchers examining concurrent and predictive validity are nearly always interested in positive correlations, because they are usually administering two measures of similar constructs. A negative correlation coefficient of -0.68 certainly means something, but it’s typically not what researchers are hoping for. When interpreting correlation coefficients – or r values – the following can be considered a rule of thumb:
- 0.00 – no relationship, very weak
- 0.30 – weak relationship
- 0.50 – moderate relationship
- 0.70 – strong relationship
When evaluating criterion-related validity of a measure, stronger relationships indicate higher validity.
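As a sketch of how such a coefficient is obtained, here is a small Python implementation of Pearson’s r, applied to hypothetical scores from a new measure and an established one (all values invented for illustration):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Numerator: how the two variables co-vary around their means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: the product of each variable's spread.
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical: six participants' scores on a new measure vs. an
# established, validated measure of the same construct.
new_measure = [12, 15, 11, 18, 16, 14]
established = [30, 34, 29, 40, 37, 33]
print(round(pearson_r(new_measure, established), 2))  # → 0.99
```

An r of 0.99 would sit well above the 0.70 “strong relationship” threshold in the list above – promising evidence of concurrent validity for the new measure, assuming the established one is itself valid.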
Validity concerns a test’s ability to measure what it is intended to measure. Meanwhile, reliability concerns the degree of consistency in the results if we repeat the test over and over. Now that we know the difference between the two, and the various types of each construct, it is important to note that they are distinct properties: a measure can be reliable without being valid (it consistently measures the wrong thing), but a measure that is not reliable cannot be very valid.
So there you have it! Not so difficult, right? If you find that remembering a definition or two is challenging, try using an example to help you. We find that it’s much easier to remember an example that is more concrete rather than an abstract definition.
Lastly, what other topics would you like to see on the Research Collective blog? We would love to address any of your questions or suggestions!
About the Authors
Joe O’Brian | Senior Human Factors Scientist | Research Collective
Joe O’Brian is a Senior Human Factors Scientist at Research Collective. He has co-authored articles on topics ranging from judgment and decision making to education and healthcare technologies. At Research Collective, his contributions include project planning, observational and biometric research, and advanced statistical analysis for major automotive and healthcare organizations. Joe can be found on LinkedIn here.
Anders Orn | Human Factors Scientist | Research Collective
As a Human Factors and User Experience Researcher, Anders Orn plans for and conducts observational research at Research Collective. While he is involved in many aspects of research, Anders enjoys usability testing in the healthcare and automotive industries as they are a unique opportunity to examine human behavior. You can find Anders on LinkedIn here.