Evaluation von Beurteilungen:
Abstract
Zusammenfassung. Auf der Grundlage des Multifacetten-Rasch-Modells (Linacre, 1989; Linacre & Wright, 2002) wird eine Systematik von Methoden präsentiert, die eine detaillierte Untersuchung der psychometrischen Qualität von Beurteilungen in verschiedenen Anwendungsbereichen (z.B. bei Leistungs- oder Eignungsbeurteilungen) erlauben. Wesentliche Ziele sind: (a) Messung der Strenge der Beurteiler, der Fähigkeit der beurteilten Personen und der Schwierigkeit von Aufgaben und Kriterien in einem einheitlichen Bezugssystem, (b) Konstruktion fairer Leistungsmaße durch Berücksichtigung der Beurteilerstrenge sowie der Aufgaben- bzw. Kriterienschwierigkeit, (c) Erfassung der Konsistenz des Bewertungsverhaltens, (d) Prüfung weiterer Beurteilereffekte (z.B. zentrale Tendenz und Halo-Effekte), (e) Analyse von Interaktionseffekten und differenziellen Facettenfunktionen. Perspektiven für die Entwicklung und Anwendung möglichst objektiver, genauer und fairer Beurteilungsverfahren werden diskutiert.
Abstract. Building on the many-facet Rasch measurement model (Linacre, 1989; Linacre & Wright, 2002), this paper presents a general framework of statistical procedures suitable for a detailed analysis of the psychometric quality of rating data collected in various kinds of applied settings (e.g., performance assessment). Major goals are: (a) measuring severity (or leniency) of raters, ability of examinees, and difficulty of tasks and items (or criteria) in a single frame of reference, (b) deriving fair measures of examinee ability by taking rater severity and task and item difficulty into account, (c) assessing the degree of rater consistency, (d) detecting other rater effects (e.g., central tendency and halo effects), and (e) analyzing interaction effects and differential facet functioning. Perspectives for the development and application of rating systems that are as objective, precise, and fair as possible are discussed.
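The many-facet Rasch model referred to in both abstracts (Linacre, 1989) is, in its common three-facet rating scale form, a straightforward extension of the Andrich (1978) model; the following sketch uses conventional notation, which is not taken verbatim from the paper:

```latex
% Many-facet Rasch model (rating scale form) with three facets:
% examinee n, task/criterion i, rater j, rating category k
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_n - \delta_i - \alpha_j - \tau_k
% \theta_n : ability of examinee n
% \delta_i : difficulty of task or criterion i
% \alpha_j : severity of rater j
% \tau_k  : threshold of category k relative to category k-1
```

Because rater severity $\alpha_j$ and task/criterion difficulty $\delta_i$ enter the model as separate additive terms, the examinee parameter $\theta_n$ is estimated net of these influences, which is the basis for the "fair measures" named under goal (b).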
Literatur
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.
Andrich, D., Sheridan, B. E. & Luo, G. (2004). RUMM2020: Rasch unidimensional measurement models [Computer software]. Perth, Western Australia: RUMM Laboratory.
Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that what we count counts. Language Testing, 17, 1–42.
Bachman, L. F., Lynch, B. K. & Mason, M. (1995). Investigating variability in tasks and rater judgements in a performance test of foreign language speaking. Language Testing, 12, 238–257.
Baumert, J. (2001). Internationale Schulleistungsvergleiche. In D. H. Rost (Hrsg.), Handwörterbuch Pädagogische Psychologie (2. Aufl., S. 294–303). Weinheim: Psychologie Verlags Union.
Bond, T. G. & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Erlbaum.
Bortz, J. & Döring, N. (2002). Forschungsmethoden und Evaluation für Human- und Sozialwissenschaftler (3. Aufl.). Berlin: Springer.
Brennan, R. L. (2001). Generalizability theory. New York: Springer.
Campbell, S. K., Kolobe, T. H. A., Osten, E. T., Lenke, M. & Girolami, G. L. (1995). Construct validity of the Test of Infant Motor Performance. Physical Therapy, 75, 585–596.
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York: Harper & Row.
Cronbach, L. J., Gleser, G. C., Nanda, H. & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Eckes, T. (2003). Qualitätssicherung beim TestDaF: Konzepte, Methoden, Ergebnisse. Fremdsprachen und Hochschule, 69, 43–68.
Eckes, T. (2004a). Beurteilerübereinstimmung und Beurteilerstrenge: Eine Multifacetten-Rasch-Analyse von Leistungsbeurteilungen im „Test Deutsch als Fremdsprache" (TestDaF). Diagnostica, 50, 65–77.
Eckes, T. (2004b). Facetten des Sprachtestens: Strenge und Konsistenz in der Beurteilung sprachlicher Leistungen. In A. Wolff et al. (Hrsg.), Materialien Deutsch als Fremdsprache (S. 451–484). Regensburg: FaDaF.
Eckes, T. & Grotjahn, R. (in Druck). Der C-Test als Ankertest für TestDaF: Analysen auf der Basis eines probabilistischen Testmodells. In R. Grotjahn (Ed.), The C-test: Theory, empirical research, applications. Frankfurt: Lang.
Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Engelhard, G. (1992). The measurement of writing ability with a many-faceted Rasch model. Applied Measurement in Education, 5, 171–191.
Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31, 93–112.
Engelhard, G. (1997). Constructing rater and task banks for performance assessments. Journal of Outcome Measurement, 1, 19–33.
Engelhard, G. (2002). Monitoring raters in performance assessments. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment programs for all students: Validity, technical adequacy, and implementation (pp. 261–287). Mahwah, NJ: Erlbaum.
Engelhard, G. & Myford, C. M. (2003). Monitoring faculty consultant performance in the Advanced Placement English Literature and Composition Program with a many-faceted Rasch model (College Board Research Report No. 2003-1). New York: College Entrance Examination Board.
Europarat. (2001). Gemeinsamer europäischer Referenzrahmen für Sprachen: Lernen, lehren, beurteilen. Berlin: Langenscheidt.
Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests: Grundlagen und Anwendungen. Bern: Huber.
Fischer, G. H. (1995). The linear logistic test model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 131–155). New York: Springer-Verlag.
Fischer, G. H. & Ponocny-Seliger, E. (2003). Structural Rasch modeling: Handbook of the usage of LPCM-WIN 1.0 [Software manual]. Groningen, The Netherlands: Science Plus Group.
Fisher, A. G. (1993). The assessment of IADL motor skills: An application of many-faceted Rasch analysis. American Journal of Occupational Therapy, 47, 319–329.
Fitzpatrick, A. R., Ercikan, K., Yen, W. M. & Ferrara, S. (1998). The consistency between raters scoring in different test years. Applied Measurement in Education, 11, 195–208.
Greve, W. & Wentura, D. (1997). Wissenschaftliche Beobachtung: Eine Einführung. Weinheim: Psychologie Verlags Union.
Hambleton, R. K., Robin, F. & Xing, D. (2000). Item response models for the analysis of educational and psychological test data. In H. E. A. Tinsley & S. D. Brown (Eds.), Handbook of applied multivariate statistics and mathematical modeling (pp. 553–581). San Diego, CA: Academic Press.
Hornke, L. F. (2004). Normen, Standards, Richtlinien auch für die Personalarbeit. In L. Hornke & U. Winterfeld (Hrsg.), Eignungsbeurteilungen auf dem Prüfstand: DIN 33430 zur Qualitätssicherung (S. 9–25). Heidelberg: Spektrum Akademischer Verlag.
Hoskens, M. & Wilson, M. (2001). Real-time feedback on rater drift in constructed-response items: An example from the Golden State Examination. Journal of Educational Measurement, 38, 121–145.
Hoyt, W. T. (2000). Rater bias in psychological research: When is it a problem and what can we do about it? Psychological Methods, 5, 64–86.
Hoyt, W. T. & Kerns, M.-D. (1999). Magnitude and moderators of bias in observer ratings: A meta-analysis. Psychological Methods, 4, 403–424.
Hyde, J. S. & Linn, M. C. (1988). Gender differences in verbal ability: A meta-analysis. Psychological Bulletin, 104, 53–69.
Johnson, S. (1996). The contribution of large-scale assessment programmes to research on gender differences. Educational Research and Evaluation, 2, 25–49.
Keeves, J. P. & Alagumalai, S. (1999). New approaches to measurement. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 23–41). Amsterdam: Pergamon.
Kersting, M. (2004). Qualitätsstandards. In K. Westhoff, L. J. Hellfritsch, L. F. Hornke, K. D. Kubinger, F. Lang, H. Moosbrugger et al. (Hrsg.), Grundwissen für die berufsbezogene Eignungsbeurteilung nach DIN 33430 (S. 22–32). Lengerich: Pabst.
Kubinger, K. D. (2003). Testtheorie, Probabilistische. In K. D. Kubinger & R. S. Jäger (Hrsg.), Schlüsselbegriffe der Psychologischen Diagnostik (S. 415–423). Weinheim: Beltz.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.
Linacre, J. M. (1996). Generalizability theory and many-facet Rasch measurement. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3, pp. 85–98). Norwood, NJ: Ablex.
Linacre, J. M. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3, 103–122.
Linacre, J. M. (2002a). Judging debacle in Pairs Figure Skating. Rasch Measurement Transactions, 15(4), 839–840.
Linacre, J. M. (2002b). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16(2), 878.
Linacre, J. M. (2003). Size vs. significance: Standardized chi-square fit statistic. Rasch Measurement Transactions, 17(1), 918.
Linacre, J. M. (2004). A user's guide to FACETS: Rasch-model computer programs [Software manual]. Chicago: Winsteps.com.
Linacre, J. M. & Wright, B. D. (2002). Construction of measures from many-facet data. Journal of Applied Measurement, 3, 484–509.
Lumley, T. & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12, 54–71.
Lunz, M. E. & Linacre, J. M. (1998). Measurement designs using multifacet Rasch modeling. In G. A. Marcoulides (Ed.), Modern methods for business research (pp. 47–77). Mahwah, NJ: Erlbaum.
Lunz, M. E. & Stahl, J. A. (1993). The effect of rater severity on person ability measure: A Rasch model analysis. American Journal of Occupational Therapy, 47, 311–317.
Lunz, M. E., Stahl, J. A. & Wright, B. D. (1996). The invariance of judge severity calibrations. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3, pp. 99–112). Norwood, NJ: Ablex.
Lunz, M. E. & Wright, B. D. (1997). Latent trait models for performance examinations. In J. Rost & R. Langeheine (Eds.), Applications of latent trait and latent class models in the social sciences (pp. 80–88). Münster: Waxmann.
Lunz, M. E., Wright, B. D. & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3, 331–345.
Lynch, B. K. & McNamara, T. F. (1998). Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing, 15, 158–180.
MacMillan, P. D. (2000). Classical, generalizability, and multifaceted Rasch detection of interrater variability in large, sparse data sets. Journal of Experimental Education, 68, 167–190.
Marcoulides, G. A. (1999). Generalizability theory: Picking up where the Rasch IRT model leaves off? In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement: What every psychologist and educator should know (pp. 129–152). Mahwah, NJ: Erlbaum.
Marcus, B. & Schuler, H. (2001). Leistungsbeurteilung. In H. Schuler (Hrsg.), Lehrbuch der Personalpsychologie (S. 397–431). Göttingen: Hogrefe.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
McNamara, T. F. (1996). Measuring second language performance. London: Longman.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: Macmillan.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.
Micko, H. C. (1970). Eine Verallgemeinerung des Meßmodells von Rasch mit einer Anwendung auf die Psychophysik der Reaktionen. Psychologische Beiträge, 12, 4–22.
Moosbrugger, H. (2002). Item-Response-Theorie (IRT). In M. Amelang & W. Zielinski, Psychologische Diagnostik und Intervention (3. Aufl., S. 68–92). Berlin: Springer-Verlag.
Müller, H. (1999). Probabilistische Testmodelle für diskrete und kontinuierliche Ratingskalen: Einführung in die Item-Response-Theorie für abgestufte und kontinuierliche Items. Bern: Huber.
Myford, C. M. & Wolfe, E. W. (2000). Strengthening the ties that bind: Improving the linking network in sparsely connected rating designs (TOEFL Technical Report, TR-15). Princeton, NJ: Educational Testing Service.
Myford, C. M. & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4, 386–422.
Myford, C. M. & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5, 189–227.
O'Neill, T. R. & Lunz, M. E. (2000). A method to study rater severity across several administrations. In M. Wilson & G. Engelhard (Eds.), Objective measurement: Theory into practice (Vol. 5, pp. 135–146). Stamford, CT: Ablex.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. (Original erschienen 1960)
Rost, J. (1999). Was ist aus dem Rasch-Modell geworden? Psychologische Rundschau, 50, 140–156.
Rost, J. (2004). Lehrbuch Testtheorie, Testkonstruktion (2. Aufl.). Bern: Huber.
Rost, J. & Carstensen, C. H. (2002). Multidimensional Rasch measurement via item component models and faceted designs. Applied Psychological Measurement, 26, 42–56.
Rost, J. & Langeheine, R. (1997). A guide through latent structure models for categorical data. In J. Rost & R. Langeheine (Eds.), Applications of latent trait and latent class models in the social sciences (pp. 13–37). Münster: Waxmann.
Rost, J. & Spada, H. (1983). Die Quantifizierung von Lerneffekten anhand von Testdaten. Zeitschrift für Differentielle und Diagnostische Psychologie, 4, 29–49.
Saal, F. E., Downey, R. G. & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413–428.
Shavelson, R. J. & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Stahl, J. A. & Lunz, M. E. (1996). Judge performance reports: Media and message. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3, pp. 113–125). Norwood, NJ: Ablex.
Steyer, R. & Eid, M. (2001). Messen und Testen (2. Aufl.). Berlin: Springer-Verlag.
Tyndall, B. & Kenyon, D. M. (1996). Validation of a new holistic rating scale using Rasch multi-faceted analysis. In A. Cumming & R. Berwick (Eds.), Validation in language testing (pp. 39–57). Clevedon, UK: Multilingual Matters.
Wang, W.-C. (2000). The simultaneous factorial analysis of differential item functioning. Methods of Psychological Research Online, 5, 57–76.
Wang, W.-C., Chen, P.-H. & Cheng, Y.-Y. (2004). Improving measurement precision of test batteries using multidimensional item response models. Psychological Methods, 9, 116–136.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15, 263–287.
Wilson, M. & Case, H. (2000). An examination of variation in rater severity over time: A study in rater drift. In M. Wilson & G. Engelhard (Eds.), Objective measurement: Theory into practice (Vol. 5, pp. 113–133). Stamford, CT: Ablex.
Wirtz, M. & Caspar, F. (2002). Beurteilerübereinstimmung und Beurteilerreliabilität: Methoden zur Bestimmung und Verbesserung der Zuverlässigkeit von Einschätzungen mittels Kategoriensystemen und Ratingskalen. Göttingen: Hogrefe.
Wright, B. D. (1988). The efficacy of unconditional maximum likelihood bias correction. Applied Psychological Measurement, 12, 315–318.
Wright, B. D. (1999). Rasch measurement models. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 85–97). Amsterdam: Pergamon.
Wright, B. D. & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B. D. & Masters, G. N. (2002). Number of person or item strata. Rasch Measurement Transactions, 16(3), 888.
Wright, B. D. & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Wu, M. L., Adams, R. J. & Wilson, M. R. (1997). ConQuest: Generalized item response modeling [Computer software]. Melbourne, Australia: Australian Council for Educational Research.