In research, few concepts are confused with one another more often than validity and reliability. It’s not the basic difference between them that’s so baffling; it’s everything layered on top of that difference. The tricky part comes when we examine the different types of validity and reliability, and when or why each matters most.
Below we’ve outlined the basics of validity and reliability as they apply to user-centered research. We hope that you find them helpful. If you have an example you’d like to share, please talk about it in the comment section below!
To start things off, let’s get on the same page about what we mean by the term “reliability”.
Reliability is the degree to which a specific research method or tool is capable of producing consistent results from one test to the next.
In domains such as mechanical engineering, reliability is pretty easy to conceptualize. For example, if you wanted to know the distance between point A and point B on a flat sheet of paper, you could use a plastic, grade-school ruler to find out how many inches apart the two points are. If you were to measure the distance between the two points several times in a row, few would dispute that your results have relatively high reliability.
However, the truth is that each measurement you made wouldn’t be exactly the same as the ones before it.
One reason for this is that your measurement reliability depends on your interpretation of what the tool is telling you. Is the distance closer to 5-11/16 inches or 5-3/4? Did you start measuring from the inside or the outside of the line last time?
In this situation, you impose a qualitative interpretation of a tool believed to afford a quantitative result. As a result, your reliability (i.e., consistency) is good — maybe even great — but not perfect.
If you were a general contractor framing a new house, or installing kitchen cabinets, the level of measurement reliability described above (+/- 1/16th”) is probably fine. But when you get into situations where high reliability is essential — like defining the kilogram in terms of a fixed physical constant — then you start to see where seemingly small deviations can wreak havoc on the overall accuracy of your results.
Reliability is just as important in the world of social sciences as it is in the physical sciences.
For instance, if Joey completes a survey once in January, once in February, and once again in March, the survey would be considered ‘reliable’ if his scores do not waver. The more they change from one month to the next, the lower we predict the survey’s overall reliability to be.
The caveat is that all the study criteria from one monthly test to the next must be exactly the same to suggest reliability among your results. This would include things like the ordering of survey questions, the time of day Joey took the survey, and the amount of time he had to think about each question before responding. And, if we’re really splitting hairs, we would also need to account for environmental and contextual factors like whether Joey ate breakfast, who scored Joey’s survey, and even what the relative humidity and weather were like on the day Joey took each survey.
All of these things add up. Social science research can be incredibly difficult to conduct with a high degree of reliability.
In reality, we cannot control every last detail in social science research. There are too many variables, and their individual and collective impact on the final result(s) is very, very difficult to model and measure without studying relationship(s) at a population level (i.e., the entire known group of end-users in the world who could potentially interact with the device or system).
Of course, this doesn’t stop us from trying to account for the variables we anticipate will matter most.
Indeed, a key difference between reliability as it applies to physical sciences vs. social sciences is that the latter must rely *almost exclusively* on building theoretical models to explain the relationships between variables. There is no direct, single, absolute measurement of “human cognition”. There are only smaller, proxy measures; things that tell us indirectly about one component of cognition (e.g., cognitive workload). And, through an amalgamation of these proxy measures, we make claims about cognition as a whole.
One goal of new research — often unstated — is to increase the reliability of these proxy measures over time. Through a process of successive approximation, we move closer to consensus on how these small, proxy measures are defined and approached through research.
The following sections describe each type of reliability in greater detail.
Inter-rater reliability means that there is high agreement among judges or raters when making assessment decisions. The terms inter-rater agreement, inter-rater concordance, and interobserver reliability are also used throughout the industry and academic literature.
In human-centered research, inter-rater reliability might be considered when researchers are attempting to categorize qualitative outcomes. During this categorization process, two or more raters individually assign an outcome or event into predetermined “bins” in an effort to quantify a user’s performance.
Not surprisingly, this can get tricky. If your “bins” aren’t universally understood, you run the risk of lowering the inter-rater reliability. For example, if a person struggles to navigate through a website, but eventually finds the information they needed, you might have one rater score that outcome as a “success”, while a second rater scores the same outcome as a “success with difficulty”. In situations like this, two things might be happening: either the “bins” you’ve defined are not discrete enough, and/or your raters need more training. Adding in another rater is always a good way to test these assumptions too.
Having clear category criteria and consensus amongst your raters is essential before beginning an analysis that will involve an inter-rater reliability assessment. The more consensus there is between raters, the higher your inter-rater reliability is, and vice versa.
There are a number of statistical methods you can use to determine inter-rater reliability. While the application of each method is beyond the scope of the present discussion, we encourage you to look into the pros and cons of each on your own (a short sketch of the first follows the list):
- Cohen’s kappa: easy to compute; common for categorical data
- Fleiss’ kappa: extends the kappa approach to three or more raters
- Krippendorff’s alpha: more complicated to compute, but handles multiple raters, missing data, and different data types
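For a concrete feel, here is a minimal sketch of computing Cohen’s kappa in Python with scikit-learn. The outcome labels and ratings below are hypothetical, not data from a real study:

```python
# A minimal sketch: Cohen's kappa between two raters (hypothetical ratings).
from sklearn.metrics import cohen_kappa_score

# Two raters independently bin the same ten task outcomes.
rater_1 = ["success", "success with difficulty", "failure", "success", "success",
           "failure", "success with difficulty", "success", "success", "failure"]
rater_2 = ["success", "success", "failure", "success", "success",
           "failure", "success with difficulty", "success", "success with difficulty", "failure"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement; ~0 = chance-level agreement
```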
Note that inter-rater reliability is different from intra-rater reliability, in the same sense that an interstate highway is different from an intrastate one. Unlike inter-rater reliability, which focuses on the agreement or disagreement between two independent raters, intra-rater reliability refers to multiple rating attempts made by the same individual.
This latter approach is less widely used in human-centered research due to its susceptibility to biases. However, in some situations (e.g., limited number of subject matter experts), it might be the best option you have.
Test-retest reliability refers to consistency of a test over time, when completed by the same individual(s). In our example above, Joey completed the same survey once per month for a total of three months. If his scores on each of the three surveys were highly consistent, we could make the argument that his test-retest reliability is high.
However, in order to say that our survey itself has high test-retest reliability, we would need a lot more people like Joey to take our survey. Then, we would evaluate each one (i.e., within subjects) to determine if their results were as consistent as Joey’s results. Next, we would compare those results across individuals (i.e., between subjects). If there is minimal deviation across these compared scores, then we could say that our survey has a high test-retest reliability overall.
It’s important to restate here that establishing test-retest reliability cannot occur unless all other variables remain the same over time. For example, Joey’s scores probably wouldn’t have high test-retest reliability if he were prescribed antidepressants between his first and second survey attempts; the medication (i.e., a confounding variable) might explain the change in scores.
The following conditions need to be met in order to rule out other explanations for test similarities or differences:
- The same measurement tools (e.g., Joey’s survey, computer)
- The same testing conditions (e.g., time, location)
- The same observers (e.g., for scoring that requires qualitative assessment)
- The same interval between tests (e.g., one month apart)
Generally speaking, if the test-retest correlation — a coefficient of reliability — is 0.7 or higher, a measure is said to have “good” test-retest reliability.
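As a rough sketch of that arithmetic, with invented scores standing in for real survey data, you could estimate test-retest reliability by correlating each person’s first and second scores:

```python
# A minimal sketch: test-retest reliability as a correlation (hypothetical scores).
from scipy.stats import pearsonr

# Survey scores for ten respondents, collected one month apart under matched conditions.
month_1 = [72, 65, 88, 54, 91, 70, 63, 79, 85, 60]
month_2 = [74, 63, 86, 57, 90, 72, 61, 80, 83, 62]

r, p = pearsonr(month_1, month_2)
print(f"Test-retest r = {r:.2f} (p = {p:.3f})")  # r of 0.7+ is commonly read as "good"
```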
Parallel-forms reliability — also referred to as equivalent forms reliability — is the degree to which two sets of measures in a study evaluate the same construct. To be clear: we’re not talking about “measures” as a single survey question or item here. The emphasis is on the “set” of measures.
Some research situations require multiple versions of the same questionnaire. For example, if a researcher wants to know whether a revised version of a questionnaire still produces the same results as the original, he or she might administer both questionnaires — the new one and the old one — to the same individual in a single session. Theoretically, if the two versions have high parallel-forms reliability, then the results should be the same for both. But if there is a noticeable difference in scores, this suggests the revised questionnaire is measuring something different from the original.
Parallel-forms reliability is an essential check on standardized tests, such as IQ tests, the Graduate Record Examinations (GRE), and the Scholastic Aptitude Test (SAT). Standardized tests require constant updating and revision to minimize learning effects and cheating. As a result, a new “batch” of questions must be incorporated into the test to replace an existing “batch” of questions. The hard part is ensuring that the new questions measure the same things, to the same degree, as the questions being removed.
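Here is a minimal sketch of what such a check might look like in Python, using made-up scores: administer both forms to the same respondents, then confirm the forms correlate strongly and that neither form scores systematically higher:

```python
# A minimal sketch: a parallel-forms check (hypothetical scores, same ten respondents).
from scipy.stats import pearsonr, ttest_rel

original_form = [34, 28, 41, 37, 25, 39, 31, 36, 29, 42]
revised_form = [33, 29, 40, 38, 24, 40, 30, 35, 30, 41]

r, _ = pearsonr(original_form, revised_form)   # do the forms rank people the same way?
t, p = ttest_rel(original_form, revised_form)  # does either form score systematically higher?

print(f"r = {r:.2f}, paired-t p = {p:.3f}")  # high r + non-significant t suggests parallel forms
```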
Internal consistency is similar to parallel-forms reliability in that it evaluates the degree to which two measures assess the same construct. Internal consistency is different, however, because it refers to individual items within a survey or questionnaire.
As an example, let’s say that a researcher is interested in measuring the cognitive load a new smartphone application puts on a user. To accomplish this goal, he or she might evaluate an assortment of measures believed to be proxies for cognitive load, such as the users’ pupil diameter, heart rate, and responses to the NASA Task Load Index (TLX). Knowing from the published literature that these measures possess varying levels of construct validity (described below), the question becomes what level of consistency exists between these measures (i.e., internal consistency) — that is, how well the measures correlate with one another.
For the record, internal consistency does not indicate how well these items actually predict a central construct (e.g., cognitive load). It only tells you whether items are related amongst themselves.
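To make that concrete, here is a hedged sketch of computing Cronbach’s alpha from scratch with NumPy. The data are simulated, and we assume the three proxy measures have already been standardized to a common scale:

```python
# A minimal sketch: Cronbach's alpha for internal consistency (simulated data).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: one row per participant, one column per measure/item."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Twenty participants x three proxy measures, all driven by the same latent "load"
# and already standardized to a common scale (an assumption of this sketch).
rng = np.random.default_rng(0)
load = rng.normal(size=20)
measures = np.column_stack([load + rng.normal(scale=0.5, size=20) for _ in range(3)])

print(f"Cronbach's alpha: {cronbach_alpha(measures):.2f}")  # higher = items hang together
```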
Like the concept of “reliability”, there are many types of validity. Our working definition of validity is “the degree to which a tool actually measures what it claims to measure”. It refers to the credibility or integrity of your research. Additionally, it helps you answer whether your results can be generalized to other situations or contexts.
Internal validity refers to the extent to which the experimental manipulations cause the effects that are observed. In other words, it is the strength of the causal relationship between the independent and dependent variables. This type of validity is a large reason why significance testing is used in research. Significance testing helps us determine whether the results we get are due to one or more experimental manipulations (i.e., independent variables), or due to randomness and chance. If our internal validity is low, our significance tests will usually be unconvincing as well.
In social science research, significance testing — for better or for worse — is often based on two values: your study’s effect size, and that effect size’s p-value. The effect size speaks to the “strength” of the relationship between the independent variable (IV) and the dependent variables (DVs). The larger the effect size, the stronger the relationship.
The p-value tells a different story — one that relates more to internal validity. The p-value is the probability of observing an effect at least as large as yours if, in reality, there were no relationship between your IV and DVs at all — in other words, if it were all smoke and mirrors.
A small p-value says that the relationship found in your study would be unlikely to appear by chance alone. For example, a p-value of 0.05 means there is only a 5% chance of observing an effect this large if no true relationship existed between your variables. It does not prove the relationship is real; it simply makes randomness and false positives a much less plausible explanation.
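As a small illustration of those two values, here is a sketch comparing task times between two hypothetical design conditions, reporting both Cohen’s d (the effect size) and the p-value from an independent-samples t-test. The data are simulated:

```python
# A minimal sketch: effect size (Cohen's d) and p-value for a two-group comparison.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
design_a = rng.normal(loc=40, scale=8, size=30)  # simulated task times (s), design A
design_b = rng.normal(loc=34, scale=8, size=30)  # simulated task times (s), design B

t, p = ttest_ind(design_a, design_b)

# Cohen's d: the standardized difference between the two group means.
pooled_sd = np.sqrt((design_a.var(ddof=1) + design_b.var(ddof=1)) / 2)
d = (design_a.mean() - design_b.mean()) / pooled_sd

print(f"Cohen's d = {d:.2f}, p = {p:.4f}")  # a small p makes chance a less plausible story
```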
External validity is an umbrella term that describes the extent to which research conclusions can be generalized to other people, situations, and time. These three components can be subdivided into their own forms of validity:
- Ecological validity: Can the results be generalized to other settings?
- Population validity: Can the results be generalized to other people?
- Historical validity: Can the results be generalized over time?
External validity — especially its ecological and population components — is important to consider when progressing your testing from a controlled lab setting to a real-world environment.
As an example, let’s say two teams of researchers are tasked with evaluating how easily users can change the radio station in a new automotive infotainment system. Both teams will use the same set of tasks, then compare results at the end. Team A conducts their study first, and decides to test the infotainment system on a realistic benchtop computer station in a stationary, well-lit, air-conditioned lab, with members of their own team as participants. To their pleasure, they find that almost all of the participants are able to complete the study tasks without a single issue or negative comment.
Team B runs their study afterwards, anticipating that few issues — if any — will be observed. They test the same infotainment system as Team A, but do so in an actual vehicle, and with random people they recruited from around town. What’s more, their participants are asked to complete the study tasks while driving on roadways with moderate traffic. Much to their displeasure, they find that more than half of their participants experience issues during one or more tasks. And, to make matters worse, they receive several negative comments about the system’s ease of use.
Team B panics.
They ruminate on the number of ways they possibly, maybe, probably, definitely biased each and every participant. Why, oh why, did I chew gum during Participant #3’s session! That threw him out of his rhythm. And, I knew Participant #8 was no good. Why didn’t we just dismiss her data altogether? In their rush to figure it out, they don’t stop to consider how their environmental demands and participant pool differed from Team A’s study.
This is a long-winded example of a set of studies with poor external validity. In truth, Team A’s study is the culprit responsible for this mess. They set the bar high, and didn’t offer the caveat that their study sample and environment may have influenced the results. As a result, Team B expected diamonds, but wound up with dirt.
If you find yourself in this situation, a good strategy is to communicate two messages from the beginning of the study. First, any study conducted in a controlled environment and with “knowledgeable” participants will almost always generate more attractive results than you’d see in the real world. Second, any issues you see in a controlled environment — no matter how small they seem — will be exacerbated once you add in environmental demands and participant variability. Fix those design issues before going any further with testing, or you’ll run the risk of seeing the same problems resurface with every participant.
Construct validity is the degree to which a research tool measures the construct it’s supposed to measure. As an example, think back to our usability study example in which the researcher used pupil diameter, heart rate, and the NASA-TLX to measure cognitive load. While internal consistency is how well the three tools correlate with one another, construct validity is how well they actually measure cognitive load. If construct validity is low, there is little point in using those research instruments because they aren’t measuring what you want to measure!
The weird part about construct validity is that it’s never based on absolutes in the beginning.
For example, while our standard convention of a “linear foot” (12 inches) is printed on millions of measuring devices in (nearly) exactly the same way, its origins were completely arbitrary. Indeed, the “foot” by which we define many of today’s important measurements — like, you know, the “$5 foot-long” — was originally derived from human body dimensions. The Egyptians, for instance, defined their foot — a “djeser” — as a measure of four male palms placed side by side (thumbs tucked under). Centuries later, some Romans used the “Drusianus foot”, taken from the measurement of Nero Claudius Drusus’ sweaty, sandaled soles.
Then, one day in July 1959, the International Yard and Pound Agreement established the standard for the “foot” as we know it today.
In the context of social science, the same issues apply. In the early stages of developing a new psychological construct, some researcher eventually has to plant a flag in the ground and say, “henceforth, this tool is a measurement of this construct!” *cue lightning strikes and thunder* Then, a giant game of academic king of the hill is played over subsequent decades. Researchers attempt to define areas around and related to that construct, with the goal of usurping the original tools and replacing them with (ideally) more reliable and accurate ones.
Criterion-related validity refers to how well one measure predicts the outcome of another. Just to keep you on your toes, there are two main types: concurrent validity and predictive validity. The two are very similar, differing solely in the time at which the two measures are administered. Concurrent validity focuses on two measures administered at roughly the same time; if Measure A predicts the outcome of Measure B, concurrent validity is high. Predictive validity, on the other hand, refers to one measure predicting the results of another measure taken at a later time.
Concurrent validity is a measure of how much a tool correlates with existing, validated tools intended to measure the same construct. Demonstration of concurrent validity is a common method of justifying the use of a new research measure; if a proposed measure significantly correlates with an already validated one, that is promising evidence towards eventually establishing the new measure’s validity.
Meanwhile, predictive validity is the degree to which a future measurement of a variable can be predicted by a current measurement. The go-to example everyone knows is the SAT. Many colleges ask for prospective students’ SAT scores because the test’s predictive validity is high. While certainly imperfect, high SAT scores significantly correlate with high GPA in college. Tests like these are used all the time by organizations that want to recruit the best prospects using the fewest resources.
Correlation is heavily involved in establishing concurrent and predictive validity. While positive and negative correlations are both meaningful in most situations, negative correlations are less relevant here. Researchers examining concurrent and predictive validity are nearly always administering two measures of similar constructs, so they are interested in positive correlations. A correlation coefficient of -0.68 certainly means something, but it’s typically not what researchers are hoping for.
When interpreting correlation coefficients — or “r” values — the following can be considered a rule of thumb:
- 0.00 – no relationship, very weak
- 0.30 – weak relationship
- 0.50 – moderate relationship
- 0.70 – strong relationship
When evaluating the criterion-related validity of a measure, stronger relationships indicate higher validity.
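As a final sketch, with fabricated numbers purely for illustration, here is how you might correlate a new measure against an established criterion and map the resulting r onto the rule of thumb above:

```python
# A minimal sketch: criterion-related validity as a correlation (fabricated data).
from scipy.stats import pearsonr

new_measure = [55, 62, 48, 71, 66, 59, 74, 52, 68, 63]           # e.g., a new screening test
criterion = [2.9, 3.2, 2.6, 3.8, 3.4, 3.0, 3.7, 2.8, 3.5, 3.1]   # e.g., later college GPA

r, p = pearsonr(new_measure, criterion)
strength = ("very weak" if abs(r) < 0.3 else
            "weak" if abs(r) < 0.5 else
            "moderate" if abs(r) < 0.7 else
            "strong")
print(f"r = {r:.2f} ({strength}), p = {p:.3f}")
```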
Validity concerns the test’s ability to measure what it is intended to measure. Meanwhile, reliability concerns the degree of consistency in the results if we repeat the test over and over. Now that we know the difference between the two, and the various types of each construct, it is important to note that validity and reliability are independent of one another. A measure could be only valid, only reliable, both, or neither.
So there you have it! Not so difficult, right? If you find that remembering a definition or two is challenging, try using an example to help you. We find that it’s much easier to remember an example that is more concrete rather than an abstract definition.
Lastly, what other topics would you like to see on the Research Collective blog? We would love to address any of your questions or suggestions!
About the Authors
Joe O’Brian | Senior Human Factors Scientist | Research Collective
Joe O’Brian is a Senior Human Factors Scientist at Research Collective. He has co-authored articles on topics ranging from judgment and decision making to education and healthcare technologies. At Research Collective, his contributions include project planning, observational and biometric research, and advanced statistical analysis for major automotive and healthcare organizations. Joe can be found on LinkedIn here.
Anders Orn | Human Factors Scientist | Research Collective
As a Human Factors and User Experience Researcher, Anders Orn plans and conducts observational research at Research Collective. While he is involved in many aspects of research, Anders especially enjoys usability testing in the healthcare and automotive industries, as it offers a unique opportunity to examine human behavior. You can find Anders on LinkedIn here.