Editorial

A Farewell, a Welcome, and an Unusual Exchange

Published online: https://doi.org/10.1027/1015-5759/a000203

This will be no ordinary Editorial, for two reasons. First, I want to announce some changes within the Editorial Board and take the opportunity to thank several people for supporting the Editorial Board of this journal over the last few years. Working as an action editor is never easy, which makes me appreciate all the more the commitment shown by Johnny Fontaine, Symeon Vlachopoulos, and Richard Griffith. Without you, we could not have made the journal what it is today. Thanks! Moreover, I want to welcome Iris Engelhard, Lena Lämmle, Martin Bäckström, Stephan Dilchert, and Samuel Greiff as new action editors. Having you on board makes me feel very optimistic.

My job as Editor-in-Chief includes a lot of e-mail contact with a variety of people, e.g., action and consulting editors, authors, and reviewers. Below I want to share some passages from an interesting, challenging, and – I believe – inspiring e-mail exchange I (MZ) had with Stéphane Vautier (SV), one of our consulting editors. Including this exchange in an editorial is the second reason why this Editorial is a bit different. My hope is that our exchange will affect readers as much as it did me and thereby also affect the submissions made to the journal.

The starting point of the exchange was the meaning of test scores derived from tests and questionnaires regularly published in this and other journals:

SV

Let us take the following propositions.

1. The test Z was developed to measure the trait zeta.
2. Person X scored y points on this test.
3. The score y comes close to the norm’s average of the test.
4. Hence, the trait zeta of person X is classified as average compared to the norm group.

[Comment: here, my goal is to take as starting point the typical reasoning used by a psychologist who wants to assess an individual’s trait through his/her test score, as you suggested.]

Is the conclusion (4) valid with respect to the premises (1–3)? No, because to be valid it would have to follow deductively. It does not follow deductively because a necessary condition for a deductive outcome is that it makes sense (false conclusions used in scientific reasoning to refute hypotheses also need to make sense in order to be either false or true). Why do I state that the conclusion does not make sense? Because I cannot interpret it, for the following reason: The norm’s average of the test is the mean score of a sample of scores drawn from a theoretical statistical population of scores (that is, the set of scores that the human population in focus would have if they could be tested at least once), but the trait zeta of person X cannot be equated with, nor usefully related to, his/her score y on a logical basis. We just know that any permissible score refers to a response pattern that is compatible with the scoring rule of the test. And we do not know how the response pattern can be interpreted as the result of a measurement process. Specifically, item response modeling states that any value of the quantity postulated in the psychometric model used to validate the test (or the test’s data) can be associated with any response pattern, because unknown factors also determine the responses. Classical test theory neglects the empirical meaning of the test scores. And, ironically, the norm’s average of the test cannot refer to any response pattern.

My conclusion is that the argument (1–4) is only a rhetorical device, the function of which is to allow assessment psychologists to believe that their tests enable them to measure psychological constructs in persons, provided that the meaning of “measure” is not scrutinized too deeply. They need this rhetoric insofar as they claim to be able to draw, on a scientific basis, prescriptive or evaluative conclusions (for example, “the trait of person X is neither too low nor too high,” “this person should be preferred to that one”), which matches the social demand. Hence, as long as they need the label “scientifically validated,” they are unable to admit scientific reasoning that would restrict this pretension. And psychometricians also need this rhetoric, because they need psychologists to need them.

MZ

Stéphane, you are making quite a few very strong statements. I have to admit, I’m a bit torn. I agree with you that there is a fundamental problem regarding measurement. As Michell has pointed out in many publications, we as psychologists probably do not measure in a strict sense. Stevens’ definition of measurement is widely accepted, and it probably should be evaluated more critically. However, if we assume that there are latent traits or constructs that have a causal influence on our behavior, they should be measurable. So let’s say there is something like intelligence. We would not be able to see intelligence itself but rather intelligent behavior. After all, we often describe people in terms of differences in their “intelligent behavior.” An intelligence test could then be viewed as a measure evoking intelligent behavior that is scored and compared to a norm. In this sense, the test score would be a manifestation of the latent trait. Of course, the whole idea of latent traits can be and has been criticized. But my understanding of your first statement is that you believe that summing the items is inappropriate because the items carry different information regarding the measured trait. Thus, two scores based on different response patterns would not be comparable. You also doubt the ability of IRT models to test specific objectivity, i.e., among other things, the equality of items. Thus, there is just no way that a score can be compared to other scores. Again, I agree with many of these points. However, I would say that constructing a test based on sound theory, in combination with a strong validation effort, can help with some of these problems. If there is a strong theory, the items should clearly reflect the theory, and thus it could be explained how the answers are formed. In this regard, most test publications have much potential for improvement. The theory should further explain the nomological net to allow convergent and discriminant validity testing based on concrete hypotheses. Moreover, the mechanisms underlying test-criterion correlations should be explained and tested. There is more that could be said with regard to validity, but I want to conclude this argument by stating that validity is more than factorial validity and that the whole validation process should be theory-based and thus hypothesis-driven. If this were the case, the test score interpretation would at least be backed by some empirical findings that hold true for the norm group we use for comparison. This leaves the issue of item inequality. How about leaving the trivial scoring function behind and using a factor score or something similar instead? Here we would weight each item according to its contribution to the total score.
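[Editorial note: to make this last suggestion concrete, here is a minimal sketch in Python (my illustration, with invented loadings rather than estimates from any real instrument) contrasting the unit-weighted sum score with a loading-weighted composite. A proper factor score would use regression weights derived from the estimated model, but the weighting idea is the same.]

```python
import numpy as np

# Hypothetical loadings from a one-factor model of a four-item test
# (illustrative numbers only, not taken from any published instrument).
loadings = np.array([0.8, 0.6, 0.4, 0.7])
responses = np.array([1, 0, 1, 1])      # one person's dichotomous item responses

unit_weighted = int(responses.sum())            # the "trivial" sum score
loading_weighted = float(loadings @ responses)  # each item weighted by its loading

print(unit_weighted, round(loading_weighted, 2))  # 3 vs. 1.9
```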

SV

About Michell and Stevens

Michell doubts that psychologists are able to make quantitative measurements of theoretical quantitative attributes, and he presents Stevens’ (1946) scales as a misleading version of measurement.

“Most quantitative psychologists think that measurement is simply the assignment of numerals to objects and events according to rule. [Stevens’] understanding of the concept of measurement is clearly misleading because it ignores the fact that only quantitative attributes are measurable.” (Michell, 1999, p. xii)

I would like to elaborate on two points. Firstly, it is possible that what psychologists call constructs does not match what Michell calls quantitative attributes. If a construct is not a quantity, saying that it can nevertheless be measured is misleading. For example, it is semantically absurd to say that we measure a psychological process: a process is not a quantity. If the construct refers to a theoretical quantity, one has to suppose a theoretical origin, that is, zero quantity is a theoretical state of the object that possesses the quantity, and one has to suppose a maximal quantity too. Thus, the construct refers to the segment [0, max], where “max” is unknown, and its label, “Intelligence” or “Dispositional optimism” for example, serves to identify the kind of quantity. It is worth noting that psychologists who want to measure a construct do not refer to such a segment. Secondly, I see no methodological trouble in speaking of the ordinal measurement of a theoretical quantity, which corresponds to the claim that there is a step function that links any point of the segment, i.e., the quantity, to a simply ordered observable of the ordinal scale.

The word “simply” means that two distinct observables can be compared using “more than” or “less than”.

Of course, this function is a theoretical hypothesis, and the next issue is whether it can be tested. I would like to recall Michell’s (2012) provocative argument, which is that psychometricians “… studiously turned away from investigating whether the attributes they aspired to measure really are quantitative …” (p. 7). But I see no methodological trouble in using numbers to code the observables of an ordinal scale, provided that these numbers are not entered into algebraic computations that would be meaningless. For example, your “intelligent behavior” can be described by the series “correct, correct, incorrect,” which can be coded “1, 1, 0.” But the sum 1 + 1 + 0 describes no observation, and the resulting test scores do not constitute an ordinal scale because the scores describe no behavior at all. The fact that the vectors (or m-tuples) of item responses that can be imagined for a given test of m items are only partially ordered by the product order has been overlooked. For the sake of clarity, let me detail this important point. There is no problem with the idea that the logically possible responses to a single item are simply ordered. But psychologists deal with vectors of item responses. Thus, how should we compare (1, 1, 0) and (0, 0, 1), for example? We have to define a binary relation on the set of the possible 3-tuples, and it cannot be derived from the orderings that we use at the level of single items. The product ordering of the vectors specifies that two distinct vectors are comparable only if the comparisons that can be made item by item are homogeneous. Hence, (1, 1, 0) and (0, 0, 1) are incomparable because sometimes we find “more than” and sometimes we find “less than” (heterogeneity of the item-by-item comparisons). Thus, the set of the logically possible vectors cannot serve as an ordinal scale for the ordinal measurement of a theoretical quantity, because ordinal measurement requires a simple ordering, as opposed to a partial ordering, of the scale’s values. As the comparison of vectors is a matter of definition, the issue of finding and choosing a definition that would simply order the set of the possible vectors remains open.
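[Editorial note: for readers who prefer to see the product order spelled out operationally, here is a minimal sketch in Python (my illustration, not part of SV’s argument) that compares two response vectors item by item and declares them incomparable when the item-wise comparisons are heterogeneous. Note that the sum score would rank (1, 1, 0) above (0, 0, 1), an ordering the product order does not license.]

```python
def product_order_compare(u, v):
    """Compare two response vectors under the product order.

    Two distinct vectors are comparable only if every item-by-item
    comparison points in the same direction; otherwise they are incomparable.
    """
    assert len(u) == len(v)
    some_less = any(a < b for a, b in zip(u, v))
    some_greater = any(a > b for a, b in zip(u, v))
    if some_less and some_greater:
        return "incomparable"   # heterogeneous item-by-item comparisons
    if not some_less and not some_greater:
        return "equal"
    return "less" if some_less else "greater"

print(product_order_compare((1, 1, 0), (0, 0, 1)))  # incomparable
print(product_order_compare((1, 1, 0), (1, 0, 0)))  # greater
```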

About Causality and Measurability

You ask for “strong theory” and I agree. A step function is useful to conceptualize that a quantitative, theoretical variation causes a discrete, observable variation. Here again, two points seem worth noting to me. Firstly, we have to think about what happens at the level of any single item, and about the validity of the relevant inferences. Secondly, we have to realize that at the level of the response vectors, the theory is quite testable. And we have to be ready to admit that our theory is wrong. At the item level, the step function is tautological. For the sake of simplicity, let us consider a dichotomous item, that is, one whose possible responses are in the set {0, 1}. Let us think about the consequences of variation of the quantity in [0, max]. The step function stipulates that when the quantity lies below a threshold A, the value of which is unknown, the observed response is 0, and that when it lies above the threshold A, the observed response is 1. Thus, we may infer that an observed change from 0 to 1 means that the quantity increased, and that an observed change from 1 to 0 means that the quantity diminished. Such inferences are valid if we suppose that the observed variations depend only on the quantitative variation to be measured (ordinally). As soon as we admit that the observed responses also depend on other factors, the inferences from the observed responses to the theoretical quantitative variation are no longer valid. This is why experimental control is decisive in the development of measurement devices (see Chang, 2004; Sherry, 2011; Trendler, 2009, 2013). This is a strong epistemological point: any probabilistic measurement model that does not restrict the observable effect of the random component acknowledges that the inference from the observed response to the quantity is logically invalid (see Vautier, Lacot, & Veldhuis, in press; Vautier, Veldhuis, Lacot, & Matton, 2012). From this perspective, current IRT models ratify the failure to measure the target quantity, that is, the failure to specify the experimental conditions that allow one to make valid inferences from the data to the quantity.
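[Editorial note: in symbols, and writing the theoretical quantity as θ, the item-level step function just described can be restated (my notation, not SV’s) as:]

```latex
f(\theta) =
  \begin{cases}
    0 & \text{if } 0 \le \theta < A,\\
    1 & \text{if } A \le \theta \le \max,
  \end{cases}
\qquad \theta \in [0, \max]
```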

Secondly, when we consider not the responses to a single item but the m-tuples from several items that are supposed to measure the same amount of quantity, one has to solve a new theoretical problem, namely, to derive the possible step functions. To grasp the issue, and for the sake of simplicity, it suffices to consider the case of two dichotomous items. We start with two step functions, the first one for the first item, with threshold A1, and the second one for the second item, with threshold A2. Consequently, there are three ways of ordering A1 and A2. If A1 = A2, the response vectors (0, 1) and (1, 0) are precluded, provided that one postulates that the quantity does not vary during the short time needed to obtain the responses to the two items. If A1 < A2, the 2-tuple (0, 1) is precluded. Finally, if A2 < A1, (1, 0) is precluded. These measurement hypotheses are testable because they imply some restriction with respect to the logically possible response vectors. It is likely that such falsifiers will be observed in practice, in which case these observations cannot be used for valid inference about the course of the quantity on the basis of any step function. Moreover, the overall reasoning has to be replicated for any person we are interested in. Importantly, we cannot blindly assume that if a measurement function holds for a given person, it is correct for another person, as opposed to what is done in IRT. One may remember that a similar situation was reached in psychology when Guttman (1944) reflected on the use of qualitative data (see also Johnson, 1935, 1943). Overall, it is because scientific measurement entails causal and testable thinking that we have to conclude that psychological constructs are not ordinally measurable by m-tuples from psychological tests. Hence, they are not quantitatively measurable either, because ordinal measurability is a necessary condition for quantitative measurability.
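[Editorial note: the preclusions just listed can also be checked mechanically. Below is a minimal sketch in Python (my illustration, assuming the two step functions above, a quantity range rescaled to [0, 1], and a quantity that does not change while the two items are answered); the threshold values are arbitrary placeholders chosen only to realize the three orderings of A1 and A2. Observing a precluded vector in practice would falsify the corresponding measurement hypothesis, which is exactly the testability SV points to.]

```python
def response_vector(theta, thresholds):
    """Apply one step function per item: the response is 1 iff the
    quantity theta has reached that item's threshold."""
    return tuple(int(theta >= a) for a in thresholds)

def attainable_vectors(thresholds, steps=1000):
    """Enumerate the response vectors allowed as theta sweeps [0, 1]."""
    return {response_vector(i / steps, thresholds) for i in range(steps + 1)}

all_vectors = {(0, 0), (0, 1), (1, 0), (1, 1)}
for a1, a2 in [(0.5, 0.5), (0.3, 0.7), (0.7, 0.3)]:  # A1 = A2, A1 < A2, A2 < A1
    precluded = sorted(all_vectors - attainable_vectors((a1, a2)))
    print(f"A1 = {a1}, A2 = {a2}: precluded vectors {precluded}")
```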

About the Test Validation Doctrine

You say that “The theory should further explain the nomological net to allow convergent and discriminant validity testing based on concrete hypotheses.” Although I understand your effort to acknowledge our measurement problems while saving a sense of historical continuity in methodology, in my opinion the test validation doctrine is a dying paradigm. This doctrine suffers from serious problems, as Borsboom, Cramer, Kievit, Scholten, and Franić (2009) and Michell (2009, 2013), among others, have forcefully argued. From my point of view, the worst problem is that, in its attempt to save the idea that test scores measure constructs, the doctrine abandoned the scientific principle of valid reasoning and replaced it with a kind of art of weighing uncertainty in the absence of any (causal) law: likelihood maximization is not measurement. Suppose that we, as psychologists, recognize that our test observations cannot be viewed as (ordinal) measurements of theoretical quantities, that is, as dependent variables of a unique cause of quantitative variation. We are not committed to rejecting the observations if they are useful. However, we are committed to elaborating on their usefulness. Maybe the future of psychological testing lies in our ability to show how specific social issues can be addressed helpfully by test users. This is quite different from conducting validation studies as an end in themselves.

I would be tempted to distinguish two kinds of goals. The first is the assessment goal, which consists in projecting a person onto an appreciative scale through his/her responses. Here, the problem is that we have to build a convention for identifying people in an evaluative space – see Coombs’ (1964, chapter 13) requisite for compression. But, like monetary values assigned to things, the resulting values are not natural properties. We have to accept that our evaluative judgments on persons are not based on brute ontology (Searle, 1995) – scientific measurement refers to brute ontology. I see psychometrics as an attempt to naturalize the values we want to assign to persons. Because the art of assessment is the art of a social construction, we have to be ready to investigate the philosophical and political problems raised by the practice of assessment (e.g., Cromby & Willis, 2013; Vautier et al., 2012).

The second kind of goal is prediction at the macro level (at the level of the aggregate; see, e.g., Krause, 2010; Lamiell, 1998). Efforts dedicated to construct validation, that is, psychometric modeling, are irrelevant in terms of prediction. What is more relevant for prediction is to use the test observations as the independent variable and to delineate the dependent variables they can predict, or, given the dependent variable, to identify the test observations that convey “predictive” information. From this perspective, the statistical task consists in conditioning a dependent variable on conditioning points, in such a way that the residual is minimized with respect to the approaches investigated in previous research. Importantly, there is no conceptual imperative to condition on scores, as response vectors can also be used as conditioning points. The statistical price to be paid is that the reference classes related to the conditioning points must have a reasonable size, which may require the building of shared databases within the research community (“big data”). In short, predictive validity has nothing to do with construct validity, and the tradition of psychological testing (or assessment) should be allowed to be appreciated for its predictive merits.
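[Editorial note: as a toy illustration of what conditioning on response vectors rather than on sum scores could look like, here is a sketch in Python (my illustration, with invented numbers; “criterion” stands for whatever dependent variable one wants to predict). One simply groups the criterion observations by response pattern and uses each pattern’s reference class for prediction; the practical constraint mentioned above is that each reference class must be large enough.]

```python
from collections import defaultdict

# Toy data: (response vector, observed criterion value) pairs; values are invented.
data = [((1, 1, 0), 3.2), ((1, 1, 0), 2.8), ((0, 0, 1), 1.1),
        ((1, 0, 1), 2.0), ((1, 1, 0), 3.0), ((0, 0, 1), 0.9)]

# Condition the criterion on the full response vector, not on the sum score.
by_pattern = defaultdict(list)
for pattern, criterion in data:
    by_pattern[pattern].append(criterion)

for pattern, values in sorted(by_pattern.items()):
    mean = sum(values) / len(values)
    print(pattern, "n =", len(values), "mean criterion =", round(mean, 2))
```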

A Source of Reluctance

Even if psychological tests were “validated” for prediction instead of construct validity, the skillful work of data analysts would nevertheless stress the fact that human behavior seems partially, if not totally, unpredictable (Vautier, 2011). But if the unpredictability of human behavior is the scientific fact, then it is the psychologists’ task to make this publicly known. And the empirical issue is whether unpredictability is total or restricted (Vautier et al., 2013). As long as the community of test users is committed to communicating with a rhetoric that suggests their efficiency on human beings in order to justify its social legitimacy, the current validation doctrine will remain useful, not for scientific reasons, but because it plays the role of a paradigm in Kuhn’s sense, that is, a community of believers. A more interesting scientific goal consists in investigating the reasons for the unpredictability of human behavior that test users experience day after day in their assessment practice. The reflections above suggest that the European Journal of Psychological Assessment could be renamed the European Journal of Psychological Assessment and Prediction. This would be an outlet for those who want to investigate assessment issues as the art of placing people in evaluative spaces to help specific decision making, and for those who want to investigate the human fact as a challenge for nomothetic prediction, i.e., prediction that is valid for any person.

MZ

Stéphane, I can see that you have put a lot of thought into this. Your standpoint seems well anchored. I do not want to argue and discuss all the issues here. There would not be enough space, and I would agree with you on many issues anyway. So, let me start with your statement that IRT fails because it does not define the experimental conditions under which the model holds and thus the conditions under which the items are all equal and a score can be interpreted. I have heard this quite often and think it is important to stress that even if an IRT model holds, this does not mean that a score can be interpreted for each and every person. The model test only holds true for the sample in which the data were collected. An example can be found in the faking literature. Some studies show that the same instrument used within the same sample differs in person homogeneity simply due to the situational demand (Zickar, Gibby, & Robie, 2004; Ziegler, 2011; Ziegler & Kemper, 2013). Thus, you are right to say that the conditions under which a certain model holds have to be defined. However, this should be standard procedure within IRT. Thus, if a person can be considered part of the sample a specific IRT model was tested in, the model should also hold for this person. Within some boundaries ;) I would tentatively argue that the same can be said for factor scores. Having said this, I must also express my concern and my agreement with another one of your statements. The samples are often convenience samples, and the test is often recommended for a much wider variety of persons than those the statistical model was tested on. This is in fact inappropriate. I agree. However, if test constructors base their constructions on sound theory, define the population(s) and conditions the test is to be used in, draw their samples from these populations, and test them under the defined conditions, I would not share your concern. I would be concerned again if test users did not respect these defined borders of test application. Thus, the issue might not only lie with psychometricians but maybe also with test users … This argument calls for stronger emphasis on how test results are fed back. For example, a reference to the specifics of the norm group as well as the interpretation of a confidence interval taking measurement error into account should be mandatory.

My second answer is directed at your validation argument. I agree that exploring the nomological net in search of construct-validity evidence alone does not suffice. Validation is a process. Of course, test-criterion correlations are just as important, if not more important. We need the nomological net to know what we measure, and we need test-criterion correlations to know that it predicts something. In both cases, the validation should be based on theoretical arguments and hypotheses. However, I do see papers that deal exactly with such criterion-oriented studies. I agree that there should be more, maybe especially within our journal. Such studies should focus on a wide range of different ways to estimate a score’s predictive power. Sensitivity and specificity, clinically significant change, and mediation and moderation models are just some examples.

I like your idea about changing the name of the journal. However, I would rather keep the name and see some of the ideas you and I shared manifest themselves in the papers published. In that sense, I totally agree with you that much more can, and maybe must, be done to show that a test works. The current practice of testing factorial validity and then adding some correlations surely has its merits but also leaves much room for improvement.

SV and MZ

Maybe we can join our voices to conclude this dialogue by making some recommendations to readers who would like to submit manuscripts to the journal. Please specify the main statements a user may be interested in if s/he were to employ the test you worked on. And do not hesitate to critically assess these statements in the light of logic and of your data, as this will help delineate our state of scientific knowledge and ignorance regarding specific assessment or prediction settings.

References

Matthias Ziegler, Institut für Psychologie, Humboldt University Berlin, Rudower Chaussee 18, 12489 Berlin, Germany, +49 30 2093-9447, +49 30 2093-9361,
Stéphane Vautier, Octogone, University of Toulouse-Le Mirail, 5 allées Antonio Machado, 31058 Toulouse cedex 9, France,