Multiple-Choice Exams at Universities?
A Literature Review and a Plea for More Practice-Oriented Research
Abstract
Abstract. Multiple-choice questions (MCQs) are particularly efficient for measuring achievement in large student groups. Owing to the high volume of examinations in the bachelor–master system, German universities are administering MCQ exams with increasing frequency. But what is the diagnostic quality of exams based on MCQs, and what benefits and drawbacks are associated with their use? In this literature review we draw four central conclusions: (1) In many cases, high-quality MCQs share similar diagnostic characteristics with constructed-response questions. (2) Effective strategies exist to address the problem of guessing on MCQs. (3) Effects of the MCQ format on learning and test-taking strategies are hardly avoidable. (4) The multiple-response and multiple-true-false formats, as well as computer-based item formats, are particularly suitable for university exams. In addition, we identify a considerable lack of research in this area and propose research desiderata so that the diagnostic value of MCQs in higher education can be reliably evaluated in the future.