Evaluation von Beurteilungen:
Abstract
Zusammenfassung. Auf der Grundlage des Multifacetten-Rasch-Modells (Linacre, 1989; Linacre & Wright, 2002) wird eine Systematik von Methoden präsentiert, die eine detaillierte Untersuchung der psychometrischen Qualität von Beurteilungen in verschiedenen Anwendungsbereichen (z.B. bei Leistungs- oder Eignungsbeurteilungen) erlauben. Wesentliche Ziele sind: (a) Messung der Strenge der Beurteiler, der Fähigkeit der beurteilten Personen und der Schwierigkeit von Aufgaben und Kriterien in einem einheitlichen Bezugssystem, (b) Konstruktion fairer Leistungsmaße durch Berücksichtigung der Beurteilerstrenge sowie der Aufgaben- bzw. Kriterienschwierigkeit, (c) Erfassung der Konsistenz des Bewertungsverhaltens, (d) Prüfung weiterer Beurteilereffekte (z.B. zentrale Tendenz und Halo-Effekte), (e) Analyse von Interaktionseffekten und differenziellen Facettenfunktionen. Perspektiven für die Entwicklung und Anwendung möglichst objektiver, genauer und fairer Beurteilungsverfahren werden diskutiert.
Abstract. Building on the many-facet Rasch measurement model (Linacre, 1989; Linacre & Wright, 2002), this paper presents a general framework of statistical procedures suitable for a detailed analysis of the psychometric quality of rating data collected in various kinds of applied settings (e.g., performance assessment). Major goals are: (a) measuring severity (or leniency) of raters, ability of examinees, and difficulty of tasks and items (or criteria) in a single frame of reference, (b) deriving fair measures of examinee ability by taking rater severity and task and item difficulty into account, (c) assessing the degree of rater consistency, (d) detecting other rater effects (e.g., central tendency and halo effects), and (e) analyzing interaction effects and differential facet functioning. Perspectives for the development and application of rating systems that are as objective, precise, and fair as possible are discussed.
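The many-facet Rasch model referred to in both abstracts (Linacre, 1989) is, in its common three-facet rating scale form, a straightforward extension of the Andrich (1978) model; the following sketch uses conventional notation, which is not taken verbatim from the paper:

```latex
% Many-facet Rasch model (rating scale form) with three facets:
% examinee n, task/criterion i, rater j, rating category k
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_n - \delta_i - \alpha_j - \tau_k
% \theta_n : ability of examinee n
% \delta_i : difficulty of task or criterion i
% \alpha_j : severity of rater j
% \tau_k  : threshold of category k relative to category k-1
```

Because rater severity $\alpha_j$ and task/criterion difficulty $\delta_i$ enter the model as separate additive terms, the examinee parameter $\theta_n$ is estimated net of these influences, which is the basis for the "fair measures" named under goal (b).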
Literatur
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.
Andrich, D., Sheridan, B. E. & Luo, G. (2004). RUMM2020: Rasch unidimensional measurement models [Computer software]. Perth, Western Australia: RUMM Laboratory.
Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that what we count counts. Language Testing, 17, 1–42.
Bachman, L. F., Lynch, B. K. & Mason, M. (1995). Investigating variability in tasks and rater judgements in a performance test of foreign language speaking. Language Testing, 12, 238–257.
Baumert, J. (2001). Internationale Schulleistungsvergleiche. In D. H. Rost (Hrsg.), Handwörterbuch Pädagogische Psychologie (2. Aufl., S. 294–303). Weinheim: Psychologie Verlags Union.
Bond, T. G. & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Erlbaum.
Bortz, J. & Döring, N. (2002). Forschungsmethoden und Evaluation für Human- und Sozialwissenschaftler (3. Aufl.). Berlin: Springer.
Brennan, R. L. (2001). Generalizability theory. New York: Springer.
Campbell, S. K., Kolobe, T. H. A., Osten, E. T., Lenke, M. & Girolami, G. L. (1995). Construct validity of the Test of Infant Motor Performance. Physical Therapy, 75, 585–596.
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York: Harper & Row.
Cronbach, L. J., Gleser, G. C., Nanda, H. & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Eckes, T. (2003). Qualitätssicherung beim TestDaF: Konzepte, Methoden, Ergebnisse. Fremdsprachen und Hochschule, 69, 43–68.
Eckes, T. (2004a). Beurteilerübereinstimmung und Beurteilerstrenge: Eine Multifacetten-Rasch-Analyse von Leistungsbeurteilungen im „Test Deutsch als Fremdsprache" (TestDaF). Diagnostica, 50, 65–77.
Eckes, T. (2004b). Facetten des Sprachtestens: Strenge und Konsistenz in der Beurteilung sprachlicher Leistungen. In A. Wolff et al. (Hrsg.), Materialien Deutsch als Fremdsprache (S. 451–484). Regensburg: FaDaF.
Eckes, T. & Grotjahn, R. (in Druck). Der C-Test als Ankertest für TestDaF: Analysen auf der Basis eines probabilistischen Testmodells. In R. Grotjahn (Ed.), The C-test: Theory, empirical research, applications. Frankfurt: Lang.
Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Engelhard, G. (1992). The measurement of writing ability with a many-faceted Rasch model. Applied Measurement in Education, 5, 171–191.
Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31, 93–112.
Engelhard, G. (1997). Constructing rater and task banks for performance assessments. Journal of Outcome Measurement, 1, 19–33.
Engelhard, G. (2002). Monitoring raters in performance assessments. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment programs for all students: Validity, technical adequacy, and implementation (pp. 261–287). Mahwah, NJ: Erlbaum.
Engelhard, G. & Myford, C. M. (2003). Monitoring faculty consultant performance in the Advanced Placement English Literature and Composition Program with a many-faceted Rasch model (College Board Research Report No. 2003-1). New York: College Entrance Examination Board.
Europarat. (2001). Gemeinsamer europäischer Referenzrahmen für Sprachen: Lernen, lehren, beurteilen. Berlin: Langenscheidt.
Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests: Grundlagen und Anwendungen. Bern: Huber.
Fischer, G. H. (1995). The linear logistic test model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 131–155). New York: Springer-Verlag.
Fischer, G. H. & Ponocny-Seliger, E. (2003). Structural Rasch modeling: Handbook of the usage of LPCM-WIN 1.0 [Software manual]. Groningen, The Netherlands: Science Plus Group.
Fisher, A. G. (1993). The assessment of IADL motor skills: An application of many-faceted Rasch analysis. American Journal of Occupational Therapy, 47, 319–329.
Fitzpatrick, A. R., Ercikan, K., Yen, W. M. & Ferrara, S. (1998). The consistency between raters scoring in different test years. Applied Measurement in Education, 11, 195–208.
Greve, W. & Wentura, D. (1997). Wissenschaftliche Beobachtung: Eine Einführung. Weinheim: Psychologie Verlags Union.
Hambleton, R. K., Robin, F. & Xing, D. (2000). Item response models for the analysis of educational and psychological test data. In H. E. A. Tinsley & S. D. Brown (Eds.), Handbook of applied multivariate statistics and mathematical modeling (pp. 553–581). San Diego, CA: Academic Press.
Hornke, L. F. (2004). Normen, Standards, Richtlinien auch für die Personalarbeit. In L. Hornke & U. Winterfeld (Hrsg.), Eignungsbeurteilungen auf dem Prüfstand: DIN 33430 zur Qualitätssicherung (S. 9–25). Heidelberg: Spektrum Akademischer Verlag.
Hoskens, M. & Wilson, M. (2001). Real-time feedback on rater drift in constructed-response items: An example from the Golden State Examination. Journal of Educational Measurement, 38, 121–145.
Hoyt, W. T. (2000). Rater bias in psychological research: When is it a problem and what can we do about it? Psychological Methods, 5, 64–86.
Hoyt, W. T. & Kerns, M.-D. (1999). Magnitude and moderators of bias in observer ratings: A meta-analysis. Psychological Methods, 4, 403–424.
Hyde, J. S. & Linn, M. C. (1988). Gender differences in verbal ability: A meta-analysis. Psychological Bulletin, 104, 53–69.
Johnson, S. (1996). The contribution of large-scale assessment programmes to research on gender differences. Educational Research and Evaluation, 2, 25–49.
Keeves, J. P. & Alagumalai, S. (1999). New approaches to measurement. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 23–41). Amsterdam: Pergamon.
Kersting, M. (2004). Qualitätsstandards. In K. Westhoff, L. J. Hellfritsch, L. F. Hornke, K. D. Kubinger, F. Lang, H. Moosbrugger et al. (Hrsg.), Grundwissen für die berufsbezogene Eignungsbeurteilung nach DIN 33430 (S. 22–32). Lengerich: Pabst.
Kubinger, K. D. (2003). Testtheorie, Probabilistische. In K. D. Kubinger & R. S. Jäger (Hrsg.), Schlüsselbegriffe der Psychologischen Diagnostik (S. 415–423). Weinheim: Beltz.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.
Linacre, J. M. (1996). Generalizability theory and many-facet Rasch measurement. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3, pp. 85–98). Norwood, NJ: Ablex.
Linacre, J. M. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3, 103–122.
Linacre, J. M. (2002a). Judging debacle in Pairs Figure Skating. Rasch Measurement Transactions, 15(4), 839–840.
Linacre, J. M. (2002b). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16(2), 878.
Linacre, J. M. (2003). Size vs. significance: Standardized chi-square fit statistic. Rasch Measurement Transactions, 17(1), 918.
Linacre, J. M. (2004). A user's guide to FACETS: Rasch-model computer programs [Software manual]. Chicago: Winsteps.com.
Linacre, J. M. & Wright, B. D. (2002). Construction of measures from many-facet data. Journal of Applied Measurement, 3, 484–509.
Lumley, T. & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12, 54–71.
Lunz, M. E. & Linacre, J. M. (1998). Measurement designs using multifacet Rasch modeling. In G. A. Marcoulides (Ed.), Modern methods for business research (pp. 47–77). Mahwah, NJ: Erlbaum.
Lunz, M. E. & Stahl, J. A. (1993). The effect of rater severity on person ability measure: A Rasch model analysis. American Journal of Occupational Therapy, 47, 311–317.
Lunz, M. E., Stahl, J. A. & Wright, B. D. (1996). The invariance of judge severity calibrations. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3, pp. 99–112). Norwood, NJ: Ablex.
Lunz, M. E. & Wright, B. D. (1997). Latent trait models for performance examinations. In J. Rost & R. Langeheine (Eds.), Applications of latent trait and latent class models in the social sciences (pp. 80–88). Münster: Waxmann.
Lunz, M. E., Wright, B. D. & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3, 331–345.
Lynch, B. K. & McNamara, T. F. (1998). Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing, 15, 158–180.
MacMillan, P. D. (2000). Classical, generalizability, and multifaceted Rasch detection of interrater variability in large, sparse data sets. Journal of Experimental Education, 68, 167–190.
Marcoulides, G. A. (1999). Generalizability theory: Picking up where the Rasch IRT model leaves off? In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every psychologist and educator should know (pp. 129–152). Mahwah, NJ: Erlbaum.
Marcus, B. & Schuler, H. (2001). Leistungsbeurteilung. In H. Schuler (Hrsg.), Lehrbuch der Personalpsychologie (S. 397–431). Göttingen: Hogrefe.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
McNamara, T. F. (1996). Measuring second language performance. London: Longman.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: Macmillan.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.
Micko, H. C. (1970). Eine Verallgemeinerung des Meßmodells von Rasch mit einer Anwendung auf die Psychophysik der Reaktionen. Psychologische Beiträge, 12, 4–22.
Moosbrugger, H. (2002). Item-Response-Theorie (IRT). In M. Amelang & W. Zielinski, Psychologische Diagnostik und Intervention (3. Aufl., S. 68–92). Berlin: Springer-Verlag.
Müller, H. (1999). Probabilistische Testmodelle für diskrete und kontinuierliche Ratingskalen: Einführung in die Item-Response-Theorie für abgestufte und kontinuierliche Items. Bern: Huber.
Myford, C. M. & Wolfe, E. W. (2000). Strengthening the ties that bind: Improving the linking network in sparsely connected rating designs (TOEFL Technical Report, TR-15). Princeton, NJ: Educational Testing Service.
Myford, C. M. & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4, 386–422.
Myford, C. M. & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5, 189–227.
O'Neill, T. R. & Lunz, M. E. (2000). A method to study rater severity across several administrations. In M. Wilson & G. Engelhard (Eds.), Objective measurement: Theory into practice (Vol. 5, pp. 135–146). Stamford, CT: Ablex.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. (Original erschienen 1960)
Rost, J. (1999). Was ist aus dem Rasch-Modell geworden? Psychologische Rundschau, 50, 140–156.
Rost, J. (2004). Lehrbuch Testtheorie, Testkonstruktion (2. Aufl.). Bern: Huber.
Rost, J. & Carstensen, C. H. (2002). Multidimensional Rasch measurement via item component models and faceted designs. Applied Psychological Measurement, 26, 42–56.
Rost, J. & Langeheine, R. (1997). A guide through latent structure models for categorical data. In J. Rost & R. Langeheine (Eds.), Applications of latent trait and latent class models in the social sciences (pp. 13–37). Münster: Waxmann.
Rost, J. & Spada, H. (1983). Die Quantifizierung von Lerneffekten anhand von Testdaten. Zeitschrift für Differentielle und Diagnostische Psychologie, 4, 29–49.
Saal, F. E., Downey, R. G. & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413–428.
Shavelson, R. J. & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Stahl, J. A. & Lunz, M. E. (1996). Judge performance reports: Media and message. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3, pp. 113–125). Norwood, NJ: Ablex.
Steyer, R. & Eid, M. (2001). Messen und Testen (2. Aufl.). Berlin: Springer-Verlag.
Tyndall, B. & Kenyon, D. M. (1996). Validation of a new holistic rating scale using Rasch multi-faceted analysis. In A. Cumming & R. Berwick (Eds.), Validation in language testing (pp. 39–57). Clevedon, UK: Multilingual Matters.
Wang, W.-C. (2000). The simultaneous factorial analysis of differential item functioning. Methods of Psychological Research Online, 5, 57–76.
Wang, W.-C., Chen, P.-H. & Cheng, Y.-Y. (2004). Improving measurement precision of test batteries using multidimensional item response models. Psychological Methods, 9, 116–136.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15, 263–287.
Wilson, M. & Case, H. (2000). An examination of variation in rater severity over time: A study in rater drift. In M. Wilson & G. Engelhard (Eds.), Objective measurement: Theory into practice (Vol. 5, pp. 113–133). Stamford, CT: Ablex.
Wirtz, M. & Caspar, F. (2002). Beurteilerübereinstimmung und Beurteilerreliabilität: Methoden zur Bestimmung und Verbesserung der Zuverlässigkeit von Einschätzungen mittels Kategoriensystemen und Ratingskalen. Göttingen: Hogrefe.
Wright, B. D. (1988). The efficacy of unconditional maximum likelihood bias correction. Applied Psychological Measurement, 12, 315–318.
Wright, B. D. (1999). Rasch measurement models. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 85–97). Amsterdam: Pergamon.
Wright, B. D. & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B. D. & Masters, G. N. (2002). Number of person or item strata. Rasch Measurement Transactions, 16(3), 888.
Wright, B. D. & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Wu, M. L., Adams, R. J. & Wilson, M. R. (1997). ConQuest: Generalized item response modeling [Computer software]. Melbourne, Australia: Australian Council for Educational Research.