Open Access | Original Article

Challenges in Estimating Trends in Large-Scale Student Assessments

A Scaling of the German PISA Data

Published Online: https://doi.org/10.1026/0012-1924/a000177

Summary. International large-scale assessments such as the Programme for International Student Assessment (PISA) provide the participating countries with information on the performance of their school systems. In PISA, the target population (15-year-old students) is tested every 3 years. Of particular importance is the trend information, which indicates whether the target population's achievement has changed relative to earlier assessments. For such trends to be interpreted validly, the PISA assessments should be administered under conditions that are as comparable as possible, and the statistical procedures used should remain comparable. In PISA 2015, testing was computer-based for the first time; earlier cycles used paper-and-pencil tests. In addition, the scaling model was changed and new item formats were introduced in science. In this article, we use the German national PISA samples from 2000 to 2015 to examine the extent to which the change of test mode and the change of scaling model affect the interpretation of the trend estimates. The analyses show that the change from paper-and-pencil to computer-based testing could have biased the trend estimation for Germany.


Challenges in Estimating Trends in Large-Scale Assessments: A Calibration of the German PISA Data

Abstract. International large-scale assessments such as the Programme for International Student Assessment (PISA) are conducted to provide information on the effectiveness of educational systems. In PISA, the target population of 15-year-old students is assessed every 3 years. Trends show whether competencies in the target population have changed between PISA cycles. To ensure valid trend information, it is necessary to keep the test conditions and statistical methods as constant as possible across all PISA cycles. In PISA 2015, however, several changes were introduced: the test mode changed from paper-and-pencil to computer-based testing, the scaling method was changed, and new types of tasks were used in science. In this article, we investigate the effects of these changes on trend estimation in PISA using German data from all PISA cycles (2000–2015). Findings suggest that the change from paper-and-pencil to computer-based testing could have biased the trend estimation.
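To make the mode-effect argument concrete, the following sketch is a minimal simulation, not the article's actual analysis; the number of items, the sample size, and the size of the mode effect delta are hypothetical. It simulates two assessment cycles with identical true ability distributions under the Rasch model, makes the items of the second cycle uniformly harder by delta logits (mimicking a computer-administration effect), and then scales both cycles with the original paper-based item parameters. The result is a spurious decline in the estimated mean, illustrating how an unmodelled mode effect can bias a trend estimate.

```python
# Minimal simulation sketch (hypothetical numbers, not the article's analysis):
# an unmodelled mode effect shifts item difficulties and shows up as a spurious trend.
import numpy as np

rng = np.random.default_rng(2015)

n_persons, n_items = 2000, 30
item_difficulty = rng.uniform(-2.0, 2.0, n_items)  # paper-based difficulties (logits)
delta = 0.25  # hypothetical mode effect: items are harder when administered on computer

def simulate(theta, b):
    """Draw 0/1 responses from the Rasch model: P(correct) = 1 / (1 + exp(-(theta - b)))."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.uniform(size=p.shape) < p).astype(int)

def eap_ability(responses, b, grid=np.linspace(-4.0, 4.0, 121)):
    """EAP ability estimates under a N(0, 1) prior, with item difficulties fixed at b."""
    p = 1.0 / (1.0 + np.exp(-(grid[None, :] - b[:, None])))           # items x grid points
    loglik = responses @ np.log(p) + (1 - responses) @ np.log(1 - p)  # persons x grid points
    post = np.exp(loglik - loglik.max(axis=1, keepdims=True)) * np.exp(-0.5 * grid ** 2)
    post /= post.sum(axis=1, keepdims=True)
    return post @ grid

# Two cycles with the SAME true ability distribution, i.e., the true trend is zero.
theta_a = rng.normal(0.0, 1.0, n_persons)   # e.g., a paper-and-pencil cycle
theta_b = rng.normal(0.0, 1.0, n_persons)   # e.g., the first computer-based cycle

resp_a = simulate(theta_a, item_difficulty)
resp_b = simulate(theta_b, item_difficulty + delta)  # mode effect shifts all difficulties

# Scaling both cycles with the paper-based item parameters mimics a link that
# (wrongly) treats the item parameters as unchanged across modes.
mean_a = eap_ability(resp_a, item_difficulty).mean()
mean_b = eap_ability(resp_b, item_difficulty).mean()

# Rescaling logits by 100 for illustration (a PISA-like reporting metric):
print(f"apparent trend: {mean_b - mean_a:+.3f} logits "
      f"({(mean_b - mean_a) * 100:+.1f} points on a PISA-like scale)")
```

The printed "trend" is negative by roughly delta logits even though the two simulated populations do not differ. The article examines, with the real German PISA data, whether the change of test mode and of the scaling model affects the reported trends in this way.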
