Open Access | Original Article

Challenges in Estimating Trends in Large-Scale Student Assessments

A Scaling of the German PISA Data

Published Online: https://doi.org/10.1026/0012-1924/a000177

Summary. International large-scale assessments such as the Programme for International Student Assessment (PISA) provide the participating countries with information on the performance of their school systems. In PISA, the target population (15-year-old students) is tested every 3 years. Of particular importance is the trend information, which indicates whether the target population's achievement has changed relative to earlier assessments. For such trends to be interpreted validly, the PISA assessments should be administered under conditions that are as comparable as possible, and the statistical procedures used should remain comparable. In PISA 2015, testing was computer-based for the first time; earlier cycles used paper-and-pencil tests. In addition, the scaling model was changed and new item formats were introduced in science. In this article, we use the German national PISA samples from 2000 to 2015 to examine the extent to which the change of test mode and the change of scaling model affect the interpretation of the trend estimates. The analyses show that the change from paper-and-pencil to computer-based testing could have biased the trend estimation for Germany.


Challenges in Estimating Trends in Large-Scale Assessments: A Calibration of the German PISA Data

Abstract. International large-scale assessments such as the Programme for International Student Assessment (PISA) are conducted to provide information on the effectiveness of educational systems. In PISA, the target population of 15-year-old students is assessed every 3 years. Trends show whether competencies in the target population have changed between PISA cycles. To ensure valid trend information, it is necessary to keep the test conditions and statistical methods as constant as possible across all PISA cycles. In PISA 2015, however, several changes were introduced: the test mode changed from paper-and-pencil to computer-based testing, the scaling method was changed, and new types of tasks were used in science. In this article, we investigate the effects of these changes on trend estimation in PISA using German data from all PISA cycles (2000–2015). Findings suggest that the change from paper-and-pencil to computer-based testing could have biased the trend estimation.
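To make the mode-effect argument concrete, the following sketch is a minimal simulation, not the article's actual analysis; the number of items, the sample size, and the size of the mode effect delta are hypothetical. It simulates two assessment cycles with identical true ability distributions under the Rasch model, makes the items of the second cycle uniformly harder by delta logits (mimicking a computer-administration effect), and then scales both cycles with the original paper-based item parameters. The result is a spurious decline in the estimated mean, illustrating how an unmodelled mode effect can bias a trend estimate.

```python
# Minimal simulation sketch (hypothetical numbers, not the article's analysis):
# an unmodelled mode effect shifts item difficulties and shows up as a spurious trend.
import numpy as np

rng = np.random.default_rng(2015)

n_persons, n_items = 2000, 30
item_difficulty = rng.uniform(-2.0, 2.0, n_items)  # paper-based difficulties (logits)
delta = 0.25  # hypothetical mode effect: items are harder when administered on computer

def simulate(theta, b):
    """Draw 0/1 responses from the Rasch model: P(correct) = 1 / (1 + exp(-(theta - b)))."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.uniform(size=p.shape) < p).astype(int)

def eap_ability(responses, b, grid=np.linspace(-4.0, 4.0, 121)):
    """EAP ability estimates under a N(0, 1) prior, with item difficulties fixed at b."""
    p = 1.0 / (1.0 + np.exp(-(grid[None, :] - b[:, None])))           # items x grid points
    loglik = responses @ np.log(p) + (1 - responses) @ np.log(1 - p)  # persons x grid points
    post = np.exp(loglik - loglik.max(axis=1, keepdims=True)) * np.exp(-0.5 * grid ** 2)
    post /= post.sum(axis=1, keepdims=True)
    return post @ grid

# Two cycles with the SAME true ability distribution, i.e., the true trend is zero.
theta_a = rng.normal(0.0, 1.0, n_persons)   # e.g., a paper-and-pencil cycle
theta_b = rng.normal(0.0, 1.0, n_persons)   # e.g., the first computer-based cycle

resp_a = simulate(theta_a, item_difficulty)
resp_b = simulate(theta_b, item_difficulty + delta)  # mode effect shifts all difficulties

# Scaling both cycles with the paper-based item parameters mimics a link that
# (wrongly) treats the item parameters as unchanged across modes.
mean_a = eap_ability(resp_a, item_difficulty).mean()
mean_b = eap_ability(resp_b, item_difficulty).mean()

# Rescaling logits by 100 for illustration (a PISA-like reporting metric):
print(f"apparent trend: {mean_b - mean_a:+.3f} logits "
      f"({(mean_b - mean_a) * 100:+.1f} points on a PISA-like scale)")
```

The printed "trend" is negative by roughly delta logits even though the two simulated populations do not differ. The article examines, with the real German PISA data, whether the change of test mode and of the scaling model affects the reported trends in this way.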
