Abstract
Academic achievements are often assessed in written exams and tests using selection-type (e.g., multiple-choice, MC) and supply-type (e.g., constructed-response, CR) item response formats. The present article examines how MC items and CR items differ with regard to reliability and criterion validity in two educational large-scale assessments with 4th-graders. The reading items of PIRLS 2006 were compiled into MC scales, CR scales, and mixed scales. Scale reliabilities were estimated according to item response theory (international PIRLS sample; n = 119,413). MC showed smaller standard errors than CR around the reading proficiency mean, whereas CR was more reliable for low and high proficiency levels. In the German sample (n = 7,581), there was no format-specific differential validity (criterion: German grades; r ≈ .5; Δr = 0.01). The mathematics items of TIMSS 2007 (n = 160,922) showed similar reliability patterns. MC validity was slightly larger than CR validity (criterion: mathematics grades; n = 5,111; r ≈ .5; Δr = −0.02). Effects of format-specific test extensions were very small in both studies. It seems that in PIRLS and TIMSS, reliability and validity do not depend substantially on response formats. Consequently, other response format characteristics (like the cost of development, administration, and scoring) should be considered when choosing between MC and CR.
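The reliability pattern reported above (smaller MC standard errors near the proficiency mean, smaller CR standard errors at the extremes) follows from the IRT relation between test information and the conditional standard error of measurement, SE(θ) = 1/√I(θ). The sketch below illustrates this mechanism only; it uses a standard 2PL information function and entirely hypothetical item parameters (the variables mc_items and cr_items are invented for illustration and are not PIRLS or TIMSS estimates). Scales whose item difficulties cluster near the mean concentrate information there, whereas scales with more widely spread difficulties retain more information at low and high θ.

```python
import numpy as np

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def conditional_se(theta, items):
    """Conditional standard error of measurement: SE(theta) = 1 / sqrt(test information)."""
    total_info = sum(info_2pl(theta, a, b) for a, b in items)
    return 1.0 / np.sqrt(total_info)

# Hypothetical parameters (a = discrimination, b = difficulty), not estimated from data:
# an "MC-like" scale with difficulties clustered near the mean, and a "CR-like" scale
# with difficulties spread toward the extremes.
mc_items = [(1.2, b) for b in np.linspace(-0.5, 0.5, 10)]
cr_items = [(1.2, b) for b in np.linspace(-2.5, 2.5, 10)]

for theta in (-2.0, 0.0, 2.0):
    print(f"theta = {theta:+.1f}  SE(MC-like) = {conditional_se(theta, mc_items):.2f}  "
          f"SE(CR-like) = {conditional_se(theta, cr_items):.2f}")
```

With these made-up parameters, the MC-like scale yields the smaller standard error at θ = 0 and the CR-like scale yields the smaller standard errors at θ = ±2, mirroring the qualitative pattern described in the abstract.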