Multistudy Report

Reliability and Validity of PIRLS and TIMSS

Does the Response Format Matter?

Published online: https://doi.org/10.1027/1015-5759/a000338

Abstract. Academic achievement is often assessed in written exams and tests using selection-type (e.g., multiple-choice, MC) and supply-type (e.g., constructed-response, CR) item response formats. The present article examines how MC and CR items differ with regard to reliability and criterion validity in two educational large-scale assessments of 4th-graders. The reading items of PIRLS 2006 were compiled into MC scales, CR scales, and mixed scales. Scale reliabilities were estimated according to item response theory (international PIRLS sample; n = 119,413). MC items showed smaller standard errors than CR items around the reading proficiency mean, whereas CR items were more reliable at low and high proficiency levels. In the German sample (n = 7,581), there was no format-specific differential validity (criterion: German grades; r ≈ .5; Δr = 0.01). The mathematics items of TIMSS 2007 (n = 160,922) showed similar reliability patterns. MC validity was slightly higher than CR validity (criterion: mathematics grades; n = 5,111; r ≈ .5; Δr = −0.02). Effects of format-specific test extensions were very small in both studies. In PIRLS and TIMSS, reliability and validity thus do not appear to depend substantially on the response format. Consequently, other characteristics of the response formats (such as the costs of development, administration, and scoring) should be considered when choosing between MC and CR.
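The reliability comparison summarized above rests on the test information function from item response theory: the conditional standard error of a proficiency estimate is the inverse square root of the information the administered items provide at that proficiency level, so a format is "more reliable" wherever its items contribute more information. The following is a minimal sketch of these standard relations, not taken from the article itself; it assumes a dichotomous 3PL model for MC items (discrimination a_j, difficulty b_j, guessing parameter c_j), a common choice in large-scale assessment scaling.

```latex
% Sketch of the standard IRT relations behind conditional standard errors
% (assumed for illustration; not reproduced from the article).
% Test information is the sum of the item information functions, and the standard
% error of the proficiency estimate \hat{\theta} is its inverse square root:
\[
  I(\theta) \;=\; \sum_{j=1}^{k} I_j(\theta),
  \qquad
  \mathrm{SE}\!\left(\hat{\theta}\right) \;=\; \frac{1}{\sqrt{I(\theta)}} .
\]
% For a dichotomous 3PL item with
% P_j(\theta) = c_j + (1 - c_j)\,\mathrm{logit}^{-1}\!\left(a_j(\theta - b_j)\right),
% the item information is
\[
  I_j(\theta) \;=\; a_j^{2}\,
  \frac{\bigl(P_j(\theta) - c_j\bigr)^{2}}{\bigl(1 - c_j\bigr)^{2}}\,
  \frac{1 - P_j(\theta)}{P_j(\theta)} .
\]
```

Under these relations, a nonzero guessing parameter attenuates the information an MC item yields at low proficiency, which is consistent with the pattern reported above that CR scales measured low-proficiency students more precisely; where each format is most informative otherwise depends on the location and discrimination of its items.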
