Abstract
Academic achievements are often assessed in written exams and tests using selection-type (e.g., multiple-choice, MC) and supply-type (e.g., constructed-response, CR) item response formats. The present article examines how MC items and CR items differ with regard to reliability and criterion validity in two educational large-scale assessments with 4th-graders. The reading items of PIRLS 2006 were compiled into MC scales, CR scales, and mixed scales. Scale reliabilities were estimated according to item response theory (international PIRLS sample; n = 119,413). MC showed smaller standard errors than CR around the reading proficiency mean, whereas CR was more reliable for low and high proficiency levels. In the German sample (n = 7,581), there was no format-specific differential validity (criterion: German grades; r ≈ .5; Δr = 0.01). The mathematics items of TIMSS 2007 (n = 160,922) showed similar reliability patterns. MC validity was slightly larger than CR validity (criterion: mathematics grades; n = 5,111; r ≈ .5; Δr = −0.02). Effects of format-specific test extensions were very small in both studies. It seems that in PIRLS and TIMSS, reliability and validity do not depend substantially on response formats. Consequently, other response format characteristics (like the cost of development, administration, and scoring) should be considered when choosing between MC and CR.
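The reliability pattern reported above (smaller MC standard errors near the proficiency mean, smaller CR standard errors at the extremes) follows from the IRT relation between test information and the conditional standard error of measurement, SE(θ) = 1/√I(θ). The sketch below illustrates this mechanism only; it uses a standard 2PL information function and entirely hypothetical item parameters (the variables mc_items and cr_items are invented for illustration and are not PIRLS or TIMSS estimates). Scales whose item difficulties cluster near the mean concentrate information there, whereas scales with more widely spread difficulties retain more information at low and high θ.

```python
import numpy as np

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def conditional_se(theta, items):
    """Conditional standard error of measurement: SE(theta) = 1 / sqrt(test information)."""
    total_info = sum(info_2pl(theta, a, b) for a, b in items)
    return 1.0 / np.sqrt(total_info)

# Hypothetical parameters (a = discrimination, b = difficulty), not estimated from data:
# an "MC-like" scale with difficulties clustered near the mean, and a "CR-like" scale
# with difficulties spread toward the extremes.
mc_items = [(1.2, b) for b in np.linspace(-0.5, 0.5, 10)]
cr_items = [(1.2, b) for b in np.linspace(-2.5, 2.5, 10)]

for theta in (-2.0, 0.0, 2.0):
    print(f"theta = {theta:+.1f}  SE(MC-like) = {conditional_se(theta, mc_items):.2f}  "
          f"SE(CR-like) = {conditional_se(theta, cr_items):.2f}")
```

With these made-up parameters, the MC-like scale yields the smaller standard error at θ = 0 and the CR-like scale yields the smaller standard errors at θ = ±2, mirroring the qualitative pattern described in the abstract.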