Editorial

A Test Is Much More Than Just the Test Itself

Some Thoughts on Adaptation and Equivalence

Published Online: https://doi.org/10.1027/1015-5759/a000428

The very substance of psychological assessment is based on carefully developed tests, but the field does not revolve exclusively around the test instrument itself. In fact, tests, and through them the field of psychological assessment, are to a large extent influenced by the characteristics of the testing process: How the test is administered, scored, reported, secured, or disposed of are important circumstantial variables. Stated differently, assessment is more than just the test, and research on assessment goes beyond attention to the instrument alone. Many of the pitfalls of test usage, and perhaps even of failed attempts to demonstrate validity, are due to process-related rather than test-related issues. This holds in particular for test adaptations, a highly relevant topic given that many instruments are used across languages and cultures. Of note, a substantial number of papers published in the European Journal of Psychological Assessment involve adapted versions of psychological test instruments and their validity.

Test adaptation, which often also goes under the names “test localization” or “test indigenization,” is a scientific and professional activity that refers to the development of a derived test version, the adapted test, which is obtained by transferring the original test from its source language or culture to a target language or culture. The adaptation process usually includes a translation of the instrument, but is much more than mere translation: It involves a thorough scientific process and is guided by the principles of the scientific method, most prominent of all being the need to offer proof of the psychometric appropriateness of the adapted test in the new language and culture and of its psychometric similarity (“equivalence”) to the original test.

The current editorial is written in light of the often insufficiently covered topics of adaptation and equivalence. It champions the need to be more inclusive in our reports of research on psychological assessment instruments, covering aspects related to the adaptation process (both the development/adaptation and the actual testing process) and not only the actual outcome of the adaptation (the final adapted test). However, editorials can serve only as an initial spark and as a way to raise awareness of an important topic. Comprehensive information on test adaptation and equivalence is provided by the work of the International Test Commission, such as its various guidelines (e.g., International Test Commission [ITC], 2006, 2012, 2014, 2015, 2017). More specifically, some process-related issues for test adaptation have been covered by such important documents as the International Guidelines on Computer-Based and Internet-Delivered Testing (ITC, 2006), the International Guidelines on Quality Control in Scoring, Test Analysis, and Reporting of Test Scores (ITC, 2012), the International Guidelines on the Security of Tests, Examinations, and Other Assessments (ITC, 2014), the International Guidelines for Practitioner Use of Test Revisions, Obsolete Tests, and Test Disposal (ITC, 2015), and others.

Although they may be strictly correct in their approach, researchers who use or develop test adaptations seem to have absorbed the technical and statistical skills more easily than the underlying (and yet equally important) philosophy of test development and adaptation, which emphasizes the process and assigns relatively greater importance to the a priori steps (before data collection) than to the a posteriori steps (after data collection) of test development and adaptation. This, in turn, leads to a number of gaps in the general manner in which research papers are approached:

  • Low variability in the statistical approach used in the various papers: mostly structural equation modeling (SEM) instead of a broad array including multidimensional scaling, cluster analysis, item response theory (IRT), and so forth.
  • Lack of integration of multiple sources of data (e.g., data from test-takers and experts).
  • A lack of detail when reporting on the actual development or adaptation process (the “test craft,” based on important a priori analyses; see below), and a tendency to lean more toward the statistical (confirmatory) aspects.
  • Lack of sophistication in the adaptation design (e.g., using mostly target-monolingual designs based on back translation, and avoiding mono- or multisample bilingual designs).

We argue that a stronger and more inclusive emphasis on the a priori steps in test adaptation and test validation is needed. Along this line of thinking, the current editorial is a continuation of a previous editorial (Ziegler & Bensch, 2013) that outlined why mere translations, as they offer a very limited approach to test adaptation, are of little interest to the European Journal of Psychological Assessment. To extend this view, in this editorial we would like to encourage (a) a holistic understanding of equivalence in test adaptations as well as (b) a more comprehensive view of the methodologies employed to establish equivalence.

(a) A Holistic Understanding of Equivalence in Test Adaptations

“Equivalence,” often referred to (with a more statistical undertone) as “invariance,” refers to the comparability of scores that are obtained from the administration of different forms (original vs. adapted) of a test and is considered a specific source of validity. The fact that one form of a test is equivalent to another has two important implications. First, test scores derived from the two forms can be directly compared (at the level of equivalence they reflect). Second, any evidence generated by one form is also valid for the other form in the sense that validity evidence is transferable. The terms “equivalence” and “bias” are closely connected: Bias is associated with error and is an expression of nonequivalence. If the original and adapted forms of a test are not equivalent, not only can responses collected with the two forms of the test not be directly compared, but conclusions based on evidence from the source form cannot be advanced for scores generated with the target form.
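To make the notion of "levels of equivalence" concrete, one common formalization (not spelled out in this editorial and offered here only as an illustrative sketch) is the multiple-group confirmatory factor model, in which the original and adapted forms are treated as groups g:

\[
x_{ij}^{(g)} = \tau_j^{(g)} + \lambda_j^{(g)}\,\xi_i^{(g)} + \delta_{ij}^{(g)},
\]

where \(x_{ij}^{(g)}\) is the response of person i to item j in form g. Configural equivalence requires only the same pattern of loadings across forms; metric (weak) equivalence additionally constrains \(\lambda_j^{(g)} = \lambda_j\); scalar (strong) equivalence additionally constrains \(\tau_j^{(g)} = \tau_j\) and is the minimum level at which latent means can be compared; strict equivalence also equates the residual variances \(\mathrm{Var}(\delta_{ij}^{(g)})\). Each level thus licenses a different class of comparisons between scores obtained from the original and the adapted form.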

Bias and equivalence related to adaptations of psychological tests come in at least three major forms (van de Vijver & Poortinga, 2005): construct bias (i.e., incomplete overlap of the measured constructs in the original and adapted versions of the test), method bias (i.e., nuisance factors arising from aspects of the method, e.g., sample, instrument, or administration conditions), and item bias (i.e., anomalies in items such as those stemming from incorrect or poor item translation, differential item familiarity, cultural appropriateness, and so forth).

Most of these forms of bias, although all of them are important when establishing equivalence, are not sufficiently considered by much of the published research with adapted tests, which tends to focus on only some narrow aspects of construct and item bias/equivalence. For example, in terms of construct (non)equivalence, construct contamination (i.e., the incomplete overlap of construct-relevant indicators across the source and target culture), construct deficiency (i.e., the incomplete coverage of the construct in the target culture), or the differential appropriateness of construct-relevant indicators across the source and target culture are rarely if ever discussed. In terms of method (non)equivalence, sample bias, which may emerge from a lack of comparability or even minor differences in sample characteristics between the source and target culture samples, is hardly discussed at all. Instrument bias, which may emerge from phenomena such as the differential familiarity of test takers from the two cultures with the stimulus material or with the response procedures, is not discussed much either. And administration bias, which may emerge from technological, physical, or social administration conditions or the differential expertise of test administrators, is also rarely if ever discussed. In terms of item (non)equivalence, incorrect or poor item translations, inadequate item formulation, or cultural variations in item familiarity or item appropriateness are usually not addressed either.

All of these different forms of equivalence could, generally speaking, be established on the basis of a posteriori analyses. However, tests are rarely – extremely rarely – perfectly equivalent. They might be equivalent at a certain level (for a review, see Schmitt & Kuljanin, 2008), which implies that they are also nonequivalent at a certain level. The sources of nonequivalence should be carefully documented. Of note, documentation of the actual reason behind nonequivalence cannot be performed with a posteriori analyses but requires analytical depth, qualitative reasoning, and supplementary data. For instance, take a case in which the researcher establishes partial equivalence and advances the hypothesis that nonequivalence stems from differential item familiarity. Whereas IRT (or SEM) can isolate the biased items, confirming the source of bias (differential familiarity in the two cultures) is impossible without a dedicated design, more data, and professional reasoning.
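To illustrate what an a posteriori item-level screen of this kind can (and cannot) deliver, the following minimal sketch computes a Mantel-Haenszel DIF statistic for a single dichotomous item. It is not the procedure referred to above, and its details (the function name, the use of the raw total score as the matching variable, the two-group coding) are simplifying assumptions made purely for illustration.

```python
# Minimal Mantel-Haenszel DIF screen for one dichotomous (0/1) item.
# Illustrative sketch only: the raw total score is used as the matching variable,
# group is coded 0 = reference (source-language) sample, 1 = focal (target) sample.
import numpy as np

def mantel_haenszel_dif(item, total, group):
    item, total, group = map(np.asarray, (item, total, group))
    num = den = a_obs = a_exp = a_var = 0.0
    for k in np.unique(total):                        # one 2x2 table per matching-score level
        m = total == k
        a = np.sum((group[m] == 0) & (item[m] == 1))  # reference, correct
        b = np.sum((group[m] == 0) & (item[m] == 0))  # reference, incorrect
        c = np.sum((group[m] == 1) & (item[m] == 1))  # focal, correct
        d = np.sum((group[m] == 1) & (item[m] == 0))  # focal, incorrect
        n = a + b + c + d
        if n < 2 or (a + b) == 0 or (c + d) == 0:
            continue                                  # strata containing only one group are uninformative
        num += a * d / n
        den += b * c / n
        a_obs += a
        a_exp += (a + b) * (a + c) / n                # expected count of correct reference-group responses
        a_var += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
    if den == 0 or a_var == 0:
        return np.nan, np.nan, np.nan
    alpha = num / den                                 # common odds ratio across strata
    delta = -2.35 * np.log(alpha)                     # ETS delta metric; |delta| >= 1.5 is often read as salient DIF
    chisq = (abs(a_obs - a_exp) - 0.5) ** 2 / a_var   # MH chi-square with continuity correction
    return alpha, delta, chisq
```

Even with such a statistic in hand, a flagged item only signals that the two language versions differ; establishing that the flag reflects, say, differential item familiarity still requires the dedicated design, supplementary data, and professional reasoning described above.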

As a first approach to a more inclusive view on adaptation and equivalence, we encourage authors to consider the following suggestions and guidelines.

  • Including different forms of bias in the analyses (not only construct or item bias): method bias, which is considered by some authors to be the most insidious and pervasive of the three types of bias (e.g., van de Vijver & Leung, 2011), as well as the interplay between item and construct bias.
  • Collecting supplementary variables about the context in which the test was administered; this may help in identifying specific forms of bias. For instance, instrument bias may pass undetected if data are not collected on variables such as the familiarity of test takers with the item format, cultural response sets, or the (culturally moderated) social desirability of items. Administration bias may pass undetected if data are not collected on variables such as the physical, social, or technological administration conditions, the differential expertise of test administrators, or other administrator/interviewer characteristics. Obviously, considering these aspects places high demands on the research design, but it also increases sophistication and aids the discovery of new areas of research.
  • Combining qualitative with quantitative approaches: Some forms of bias (e.g., construct deficiency; i.e., the incomplete coverage of the construct in the target culture) are virtually undetectable through exclusively quantitative research.
  • Iterating across several cycles of test adaptation: If bias is detected, simply shrugging the flagged item(s) off (e.g., by deletion or minimization of impact) is the easy way out. More appropriate explanations may be gained by going back and specifically collecting additional data to unravel the mechanism through which the bias was generated.

The ways in which we construe the relation of an “original test” with its various “test adaptations” have evolved over time along with our understanding of a test’s validity. Logically and chronologically, we could probably outline three such evolutionary phases. The first and most simplistic view is that the test is the label (i.e., the name of the test), no matter whether we are looking at the original form or at an adaptation. In this view, no evidence of equivalence is needed – the fact that the adaptation carries the same name as the original is considered proof supreme that it is the same test. Obviously, this view is outdated today.

The second understanding emphasizes the difference between the original form and the various adaptations. Test adaptations are derived versions, which are legally and empirically different from the original though inspired by it. Items may change, scales may disappear; cultural indigenization may change the adapted form of the original test in such a way that, although the original may still be discernible, the two forms are not identical but share a complicated relationship.

The third and most recent understanding emphasizes the fact that we cannot have real indigenization without strong evidence of equivalence (see also Ziegler & Bensch, 2013). When such evidence exists, the two forms are virtually identical for a given purpose (e.g., measurement equivalence), and in a sense, this makes the adapted form of the test part of the original: Equivalence of the adapted form contributes to the validity of the entire test and all its forms (original and other forms). This is, of course, in many respects an oversimplification, and the depth of the full process of cultural adaptation (see the nine levels at which test indigenization is conducted as outlined by Church, 2001), the method used for translation and indigenization of item content (e.g., back translation vs. decentering), and other variables may play a role.

In this third understanding of equivalence, a test needs to be considered an “extended family,” with the original form and derived forms contributing to the overall validity of the “family” and ideally together contributing to a comprehensive picture. This would, for instance, imply comparing several different forms with each other rather than comparing just one form with another. A good example that adopted such an approach is the work by Byrne and van de Vijver (2014), who compared the structure of the Family Values Scale across 27 countries within a multilevel equivalence framework. However, few if any papers have comprehensively discussed the equivalence of adapted forms, an issue that is highly relevant for the field of psychological assessment and, thus, also for the European Journal of Psychological Assessment. From an editorial perspective, we wonder how much of the current replication crisis in the social sciences might be due to a lack of measurement equivalence between test forms. After all, equivalence is a form of validity (Iliescu, 2017), and a lack of validity, including a lack of equivalence, has many sources and implications.

(b) On the Preferred (and the Forgotten) Methods for Establishing Equivalence

Many papers that deal with adapted tests report models with a good fit (please note that an editorial on model fit is planned to appear later in 2017), for instance, when it comes to the internal structure of the adapted instrument. Often, the reader might have the impression that these almost perfect versions just fell into the researcher’s lap without much effort at all: all that was needed was a quick translation, and then the data were collected, and the results of the analyses were reported. However, anybody who has ever been involved in test adaptation knows how difficult and sometimes even cumbersome it is to obtain an appropriate translation, how much tweaking, pulling, and pushing of items it requires, and how many cycles of going back and forth – often with regard to minor and unexpected cultural issues – between the original and the new version are required.

This effort is rarely (if ever) reported, and papers often focus exclusively on the quantitative (and a posteriori) aspects of adapting the test, for instance, by establishing measurement invariance in SEM. However, robust reports on the cultural comparability of various forms of the same test should be inspired by both a priori (usually judgment-driven) and a posteriori (usually data-driven) methods (van de Vijver, 2011). A priori procedures are applied to prevent the appearance of bias. Examples of a priori procedures are judgment-based approaches to the translation and cultural adaptation of the various components of the test (e.g., items, rating scales, instructions), as well as structured or unstructured approaches to the work of the adaptation committee, including qualitative methods (e.g., think-aloud studies) and quantitative methods (e.g., ratings of item appropriateness by members of the translation panel). Interactions with actual test takers also qualify as a priori procedures, as long as they are not aimed at collecting test data; examples are cognitive interviews or ratings of the cultural appropriateness or social desirability of items (Iliescu, 2017).

A posteriori procedures are those that are used after the data have been collected and are much more commonly seen in published manuscripts. These procedures may detect the existence of bias, and they may sometimes statistically control for the effects of nonequivalence, but they can never actively generate the adapted form of the test by themselves. That is, a posteriori procedures are reactive; they may tell the researcher that there is a problem but not what the problem is or how to solve it. Examples of a posteriori procedures include invariance analyses in SEM or, more generally, the analysis of psychometric characteristics.
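For readers unfamiliar with how such a posteriori invariance analyses are typically summarized, the short sketch below compares a hypothetical configural-metric-scalar sequence. All fit values are invented for illustration, and the decision rules (the chi-square difference test and a ΔCFI ≤ .01 convention) are common heuristics rather than fixed standards.

```python
# Hypothetical fit results for a nested invariance sequence (configural -> metric -> scalar).
# All numbers are invented for illustration; the thresholds are conventions, not rules.
from scipy.stats import chi2

fits = {
    "configural": {"chisq": 210.4, "df": 96,  "cfi": 0.962},
    "metric":     {"chisq": 228.9, "df": 108, "cfi": 0.959},
    "scalar":     {"chisq": 297.3, "df": 120, "cfi": 0.938},
}

levels = list(fits)
for lower, higher in zip(levels, levels[1:]):
    d_chisq = fits[higher]["chisq"] - fits[lower]["chisq"]
    d_df = fits[higher]["df"] - fits[lower]["df"]
    p = chi2.sf(d_chisq, d_df)                  # chi-square difference test for nested models
    d_cfi = fits[lower]["cfi"] - fits[higher]["cfi"]
    verdict = "tenable" if d_cfi <= 0.01 else "questionable"
    print(f"{lower} -> {higher}: d_chisq = {d_chisq:.1f} (df = {d_df}, p = {p:.3f}), "
          f"dCFI = {d_cfi:.3f} -> {verdict}")
```

Even in this toy example, the output only indicates that the scalar constraints are questionable; it does not say which items are responsible or why, which is precisely the gap that the a priori work and supplementary data are meant to fill.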

Here, we highlight the fact that test adaptation should mainly result from carefully conducted a priori procedures (possibly after repeated iterations) and should only marginally rely on a posteriori procedures. Given this situation, it is worth noting that we continue to see mostly reports of a posteriori analyses across various papers, without much insight into the a priori approaches. The a priori efforts are the actual “craft” (the “test craft” if you will) of test development or adaptation, and they are invaluable learning points for future research and practice. Statistical reports cannot inspire future projects as much as prospective insights into the craft can.

Conclusion

In a nutshell, this editorial asks for a more holistic approach to conducting test adaptations and to the methods employed to do so. Obviously, the request for more information conflicts with the natural limitations of journal space (which are rather strict in the European Journal of Psychological Assessment). An excellent opportunity to be more inclusive is to make use of electronic supplementary material (ESM) to inform readers about both the a priori and the a posteriori measures that were used to establish equivalence. This allows researchers to achieve two ends at the same time: informing readers concisely and precisely, in the main article, about what was done, and providing additional information in the ESM for readers who would like to dive more deeply into the adaptation process.


Samuel Greiff, Cognitive Science and Assessment, University of Luxembourg, 11, Porte des Sciences, 4366 Esch-sur-Alzette, Luxembourg
Dragos Iliescu, Department of Psychology, University of Bucharest, Sector 6, Sos Panduri 90, 050663 Bucharest, Romania