Abstract
Testlets sind Teilmengen von Testitems, die sich auf denselben Input beziehen. Testverfahren, die Testlets enthalten, sind in der pädagogisch-psychologischen Diagnostik weit verbreitet. Mit der Verwendung von Testlets ist allerdings ein ernstes psychometrisches Problem verbunden: Items, die einem Testlet angehören, verletzen die grundlegende Annahme der lokalen Unabhängigkeit. Gegenstand dieser Arbeit waren Einflüsse von Testlets im Prüfungsteil Leseverstehen des Tests Deutsch als Fremdsprache (TestDaF). Anhand eines Modells der Testlet-Response-Theorie (Wainer, Bradlow & Wang, 2007) wurden Antworten von Teilnehmenden (N = 2 859) auf 30 Items, aufgeteilt auf drei Lesetexte (Testlets) mit je 10 Items, analysiert. Im ersten Lesetext fielen die Testlet-Effekte deutlich aus; in den beiden anderen Lesetexten ergaben sich nur schwache Effekte. Weitere Analysen zeigten, dass die Vernachlässigung der Testlet-Effekte eine erhöhte Schätzung der Testreliabilität sowie abweichende Schätzungen der Itemtrennschärfe und Itemschwierigkeit zur Folge hatte. Implikationen der Ergebnisse für die Entwicklung, Analyse und Evaluation testlet-basierter Testverfahren werden diskutiert.
Testlets are subsets of test items that refer to the same input. Tests containing testlets are common in educational and psychological assessment. However, the use of testlets involves a serious psychometric problem: Items belonging to a testlet violate the fundamental assumption of local independence. The present study examined testlet effects in the reading section of the Test of German as a Foreign Language (TestDaF). Using a testlet response theory model (Wainer, Bradlow, & Wang, 2007), responses of test takers (N = 2,859) to 30 items, divided among three reading texts (testlets) with 10 items each, were analyzed. The first reading text showed pronounced testlet effects; the other two texts showed only weak effects. Further analyses revealed that neglecting the testlet effects resulted in an inflated estimate of test reliability and biased estimates of item discrimination and item difficulty. Implications of these findings for the construction, analysis, and evaluation of testlet-based tests are discussed.
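The reliability-inflation finding summarized above can be illustrated with a small simulation. The sketch below assumes a Rasch testlet model, logit P(y = 1) = θ − b + γ, in which each person draws a testlet-specific effect γ for every testlet; all sample sizes, difficulty values, and testlet standard deviations are invented for illustration and are not the values estimated in the study. It compares Cronbach's alpha with the sum score's squared correlation with the simulated general trait θ:

```python
# Minimal sketch (not the paper's analysis): simulate responses under a Rasch
# testlet model and show that, with testlet effects present, an internal-
# consistency coefficient counts testlet-specific variance as "true" variance
# and thus overstates how well the sum score reflects the general trait.
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_testlets, items_per_testlet = 2000, 3, 10
n_items = n_testlets * items_per_testlet

theta = rng.normal(size=(n_persons, 1))                   # general ability
b = rng.normal(scale=0.8, size=n_items)                   # item difficulties
d = np.repeat(np.arange(n_testlets), items_per_testlet)   # testlet membership

def simulate(testlet_sd):
    """Draw 0/1 responses under the Rasch testlet model."""
    gamma = rng.normal(size=(n_persons, n_testlets)) * testlet_sd
    logit = theta - b + gamma[:, d]
    return (rng.random((n_persons, n_items)) < 1 / (1 + np.exp(-logit))).astype(float)

def cronbach_alpha(y):
    k = y.shape[1]
    return k / (k - 1) * (1 - y.var(axis=0, ddof=1).sum() / y.sum(axis=1).var(ddof=1))

def r2_with_theta(y):
    """Squared correlation of the sum score with the simulated general trait."""
    return np.corrcoef(y.sum(axis=1), theta.ravel())[0, 1] ** 2

# One strong and two weak testlet effects, loosely echoing the pattern in the
# abstract; the SD values themselves are invented.
y_dep = simulate(np.array([1.0, 0.3, 0.3]))
y_ind = simulate(np.zeros(n_testlets))

gap_dep = cronbach_alpha(y_dep) - r2_with_theta(y_dep)   # clearly positive
gap_ind = cronbach_alpha(y_ind) - r2_with_theta(y_ind)   # near zero
```

Under local independence the two quantities nearly coincide, whereas under testlet dependence alpha exceeds the trait-based value, mirroring the overestimated reliability reported above.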
Literatur
(2012). ConQuest (Version 3.0) [Computer software]. Camberwell, Australia: ACER Press.
(2013). Prior approval: The growth of Bayesian methods in psychology. British Journal of Mathematical and Statistical Psychology, 66, 1 – 7.
(1978). A rating formulation for ordered response categories. Psychometrika, 43, 561 – 573.
(1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395 – 479). Reading, MA: Addison-Wesley.
(1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153 – 168.
(2011). A boundary mixture approach to violations of conditional independence. Psychometrika, 76, 57 – 76.
(2007). Copula functions for residual dependency. Psychometrika, 72, 393 – 411.
(2008). Estimation of a Rasch model including subdimensions. IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, 1, 51 – 69.
(2012). Robustness of multidimensional analyses against local item dependence. Psychological Test and Assessment Modeling, 54, 36 – 53.
(2011). Einführung in die Test- und Fragebogenkonstruktion (3. Aufl.). München: Pearson Studium.
(2013). flexMIRT® version 2.00: A numerical engine for flexible multilevel multidimensional item analysis and test scoring [Computer software]. Chapel Hill, NC: Vector Psychometric Group.
(2013). IRTPRO for Windows [Computer software]. Skokie, IL: Scientific Software International.
(2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48 (6).
(2010, July). Examining testlet effects on the PIRLS 2006 assessment. Paper presented at the 4th IEA International Research Conference, Gothenburg, Sweden.
(2007). Effects of ignoring item interaction on item parameter estimation and detection of interacting items. Applied Psychological Measurement, 31, 388 – 411.
(1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265 – 289.
(1999). A comparison of three polytomous item response theory models in the context of testlet scoring. Journal of Outcome Measurement, 3, 1 – 20.
(2010). BUGS code for item response theory. Journal of Statistical Software, 36 (1).
(2006). Application of the bi-factor multidimensional item response theory model to testlet-based tests. Journal of Educational Measurement, 43, 145 – 168.
(2012). Confirming testlet effects. Applied Psychological Measurement, 36, 104 – 121.
(2007). Konstruktion und Analyse von C-Tests mit Ratingskalen-Rasch-Modellen. Diagnostica, 53, 68 – 82.
(2008). Assuring the quality of TestDaF examinations: A psychometric modeling approach. In L. Taylor & C. J. Weir (Eds.), Multilingualism and assessment: Achieving transparency, assuring quality, sustaining diversity – Proceedings of the ALTE Berlin Conference, May 2005 (pp. 157 – 178). Cambridge, UK: UCLES/Cambridge University Press.
(2011). Item banking for C-tests: A polytomous Rasch modeling approach. Psychological Test and Assessment Modeling, 53, 414 – 439.
(2013). A study of differential item functioning in the TestDaF reading and listening sections. In E. D. Galaczi & C. J. Weir (Eds.), Exploring language frameworks: Proceedings of the ALTE Kraków Conference, July 2011 (pp. 362 – 388). Cambridge, UK: UCLES/Cambridge University Press.
(2014). Examining testlet effects in the TestDaF listening section: A testlet response theory modeling approach. Language Testing, 31, 39 – 61.
(2010). Bayesian item response modeling: Theory and applications. New York: Springer.
(2010). Mathematische Kompetenz von PISA 2003 bis PISA 2009. In E. Klieme (Hrsg.), PISA 2009: Bilanz nach einem Jahrzehnt (S. 153 – 176). Münster: Waxmann.
(2004). Bayesian data analysis (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.
(2012). Jahresbericht 2010/11. Zugriff am 25. 03. 2013. Verfügbar unter www.testdaf.de/institut/pdf/Jahresbericht_10-11.pdf
(1992). Full-information bi-factor analysis. Psychometrika, 57, 423 – 436.
(2000). MML and EAP estimation in testlet-based adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 271 – 288). Boston, MA: Kluwer-Nijhoff.
(2010). Scoright. In M. Maier & R. Hatzinger (Hrsg.), IRT Software: Überblick und Anwendungen (S. 27 – 38). Wien: Wirtschaftsuniversität. Zugriff am 20. 11. 2012. Verfügbar unter epub.wu.ac.at/2910/
(2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65 – 110). Westport, CT: American Council on Education/Praeger.
(1992). Context-dependent item sets. Educational Measurement: Issues and Practice, 11, 21 – 25.
(2004). Developing and validating multiple-choice test items (3rd ed.). Mahwah, NJ: Erlbaum.
(2010). IRT models for the analysis of polytomously scored data. In M. L. Nering & R. Ostini (Eds.), Handbook of polytomous item response theory models (pp. 21 – 42). New York: Routledge.
(2010). Modellierung von Kompetenzen mit mehrdimensionalen IRT-Modellen. In E. Klieme, D. Leutner & M. Kenk (Hrsg.), Kompetenzmodellierung: Zwischenbilanz des DFG-Schwerpunktprogramms und Perspektiven des Forschungsansatzes (S. 189 – 198). Weinheim: Beltz.
(2012, April). Model selection for equating testlet-based tests in the NEAT design: An empirical study. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, BC, Canada.
(2007). Anlage und Durchführung von IGLU 2006. In W. Bos (Hrsg.), IGLU 2006: Lesekompetenzen von Grundschulkindern in Deutschland im internationalen Vergleich (S. 21 – 45). Münster: Waxmann.
(2000). Adjusting for information inflation due to local dependency in moderately large item clusters. Psychometrika, 65, 73 – 91.
(2001). Testing for local dependency in dichotomous and polytomous item response models. Psychometrika, 66, 109 – 132.
(2002). Locally dependent latent trait model and the Dutch identity revisited. Psychometrika, 67, 367 – 386.
(2010a). Empirically indistinguishable multidimensional IRT and locally dependent unidimensional item response models. British Journal of Mathematical and Statistical Psychology, 63, 395 – 416.
(2010b). Interpretation of the three-parameter testlet response model and information function. Applied Psychological Measurement, 34, 467 – 482.
(2004). Locally dependent latent trait model for polytomous responses with application to inventory of hostility. Psychometrika, 69, 191 – 216.
(2009). Bayesian analysis for the social sciences. Chichester, UK: Wiley.
(2012). A multilevel testlet model for dual local dependence. Journal of Educational Measurement, 49, 82 – 100.
(2005). Modeling local item dependence with the hierarchical generalized linear model. Journal of Applied Measurement, 6, 311 – 321.
(2007). Hierarchical item response theory models. In C. R. Rao & S. Sinharay (Eds.), Psychometrics (Handbook of statistics, Vol. 26, pp. 587 – 606). Amsterdam: Elsevier.
(2011). Validierung von Sprachprüfungen: Die Zuordnung des TestDaF zum Gemeinsamen europäischen Referenzrahmen für Sprachen. Frankfurt: Lang.
(2010). Putting the Manual to the test: The TestDaF–CEFR linking project. In W. Martyniuk (Ed.), Aligning tests with the CEFR: Reflections on using the Council of Europe’s draft Manual (pp. 50 – 79). Cambridge, UK: UCLES/Cambridge University Press.
(2003). Evaluating scoring procedures for context-dependent item sets. Applied Measurement in Education, 16, 207 – 222.
(2007). Estimating item response theory models using Markov chain Monte Carlo methods. Educational Measurement: Issues and Practice, 26 (4), 38 – 51.
(2011). Doing Bayesian data analysis: A tutorial with R and BUGS. Burlington, MA: Academic Press/Elsevier.
(2012). Steuerung zukünftiger Aufgabenentwicklung durch Vorhersage der Schwierigkeiten eines Tests für die erste Fremdsprache Englisch durch Dutch Grid Merkmale. Diagnostica, 58, 31 – 44.
(2006). A comparison of alternative models for testlets. Applied Psychological Measurement, 30, 3 – 21.
(2012). WINSTEPS (Version 3.75) [Computer software]. Chicago, IL: Winsteps.com.
(2010). Bayes’sche Methoden in der Statistik. In H. HollingB. SchmitzHrsg., Handbuch Statistik, Methoden und Evaluation (S. 730 – 742). Göttingen: Hogrefe.
(1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149 – 174.
(2011). PISA test format assessment and the local independence assumption. IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, 4, 131 – 155.
(1999). Probabilistische Testmodelle für diskrete und kontinuierliche Ratingskalen: Einführung in die Item-Response-Theorie für abgestufte und kontinuierliche Items. Bern: Huber.
(2010). A comparison of item selection techniques for testlets. Applied Psychological Measurement, 34, 424 – 437.
(2010). Lesekompetenz von PISA 2000 bis PISA 2009. In E. Klieme (Hrsg.), PISA 2009: Bilanz nach einem Jahrzehnt (S. 23 – 71). Münster: Waxmann.
(2007). Leseverstehen. In B. Beck & E. Klieme (Hrsg.), Sprachliche Kompetenzen: Konzepte und Messung (S. 197 – 211). Weinheim: Beltz.
(2012). Minimizing the testlet effect: Identifying critical testlet features by means of tree-based regression. In T. J. H. M. Eggen & B. P. Veldkamp (Eds.), Psychometrics in practice at RCEC (Chap. 6). Enschede, Netherlands: Cito/University of Twente. Zugriff am 20. 11. 2012. Verfügbar unter doc.utwente.nl/80198/1/hfstTheoEggen_def_%283%29.pdf.
(1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. (Original erschienen 1960)
(2009). Multidimensional item response theory. New York: Springer.
(2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47, 667 – 696.
(2010). Formal relations and an empirical comparison among the bi-factor, the testlet, and a second-order multidimensional IRT model. Journal of Educational Measurement, 47, 361 – 372.
(2009). Methodische Herausforderungen bei der Kalibrierung von Leistungstests. In D. Granzer (Hrsg.), Bildungsstandards Deutsch und Mathematik: Leistungsmessung in der Grundschule (S. 42 – 106). Weinheim: Beltz.
(2010). Naturwissenschaftliche Kompetenz von PISA 2006 bis PISA 2009. In E. Klieme (Hrsg.), PISA 2009: Bilanz nach einem Jahrzehnt (S. 177 – 198). Münster: Waxmann.
(1988). Item bundles. Psychometrika, 53, 349 – 359.
(2004). Lehrbuch Testtheorie, Testkonstruktion (2. Aufl.). Bern: Huber.
(2004). To Bayes or not to Bayes, from whether to when: Applications of Bayesian methodology to modeling. Structural Equation Modeling, 11, 424 – 451.
(2008). Forschungsmethoden und Statistik in der Psychologie. München: Pearson Studium.
(2004). Experiences with Markov chain Monte Carlo convergence assessment in two psychometric examples. Journal of Educational and Behavioral Statistics, 29, 461 – 488.
(1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28, 237 – 247.
(1989). Trace lines for testlets: A use of multiple-categorical-response models. Journal of Educational Measurement, 26, 247 – 260.
(2001). The effect of ignoring item interactions on the estimated discrimination parameters in item response theory. Psychological Methods, 6, 181 – 195.
(2004). Models for residual dependencies. In P. De Boeck & M. Wilson (Eds.), Explanatory item response models: A generalized linear and nonlinear approach (pp. 289 – 316). New York: Springer.
(1995). Precision and differential item functioning on a testlet-based test: The 1991 Law School Admissions Test as an example. Applied Measurement in Education, 8, 157 – 186.
(2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245 – 270). Boston, MA: Kluwer-Nijhoff.
(2007). Testlet response theory and its applications. Cambridge, UK: Cambridge University Press.
(1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185 – 201.
(1990). Toward a psychometrics for testlets. Journal of Educational Measurement, 27, 1 – 14.
(1991). Differential testlet functioning: Definitions and detection. Journal of Educational Measurement, 28, 197 – 219.
(2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37, 203 – 220.
(2005a). Assessment of differential item functioning in testlet-based items using the Rasch testlet model. Educational and Psychological Measurement, 65, 549 – 576.
(2005b). Exploring local item dependence using a random-effects facet model. Applied Psychological Measurement, 29, 296 – 318.
(2005c). The Rasch testlet model. Applied Psychological Measurement, 29, 126 – 149.
(2002). A general Bayesian model for testlets: Theory and applications. Applied Psychological Measurement, 26, 109 – 128.
(2005). User’s guide for SCORIGHT (Version 3.0): A computer program for scoring tests built of testlets including a module for covariate analysis (ETS Research Report RR-04-49). Princeton, NJ: Educational Testing Service. Retrieved May 5, 2010, from www.cambridge.org/resources/052168126X/4366_SCORIGHT_manual.pdf
(2011). On applications of Rasch models in international comparative large-scale assessments: A historical review. Educational Research and Evaluation, 17, 419 – 446.
(1995). Rasch models for item bundles. Psychometrika, 60, 181 – 198.
(1982). Rating scale analysis. Chicago: MESA Press.
(1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125 – 145.
(1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187 – 213.
(2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 111 – 153). Westport, CT: American Council on Education/Praeger.
(2010). Assessing the accuracy and consistency of language proficiency classification under competing measurement models. Language Testing, 27, 119 – 140.
(2010, April). Polytomous IRT or testlet model: An evaluation of scoring models in small testlet size situations. Paper presented at the 15th International Objective Measurement Workshop, Boulder, CO.
(