The Answer-Until-Correct Item Format Revisited
Abstract
The current availability of computers has led to a new series of response formats that serve as alternatives to the classical dichotomous format, and to the recovery of other formats, such as the answer-until-correct (AUC) format, whose efficient administration requires this kind of technology. The goal of the present study is to determine whether the AUC format improves test reliability and validity compared with the classical dichotomous format. Three samples of 174, 431, and 1,446 Spanish students from secondary education, professional training, and high school, aged between 13 and 20 years, were used. A 100-item test and a 25-item test assessing knowledge of Universal History were administered over the Internet with the AUC format. There were 56 experimental conditions, resulting from the manipulation of eight scoring models and seven test lengths. The data were analyzed from the perspective of Classical Test Theory and also with Item Response Theory (IRT) models. Reliability and construct validity, analyzed from the classical perspective, did not improve significantly with the AUC format; however, when reliability is assessed with the Information Function obtained from IRT models, the advantages of the AUC format over the dichotomous format become clear: for low levels of the assessed trait, scores obtained with the AUC format provide more information than scores obtained with the dichotomous format. Finally, these results are discussed, and the possibilities and limits of the AUC format in highly computerized psychological and educational contexts are analyzed.
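To make the contrast between the two formats concrete, the sketch below implements one common AUC partial-credit rule alongside classical dichotomous scoring. The linear rule shown (credit decreasing with each failed attempt) and the function names are illustrative assumptions, not the eight scoring models manipulated in the study itself.

```python
def auc_item_score(attempts: int, n_options: int) -> float:
    """Partial credit for an AUC item answered correctly on the given attempt.

    Assumed linear rule: with k options, a first-try success earns 1 point,
    and each additional attempt reduces credit by 1/(k - 1), so exhausting
    all options earns 0.  This is one illustrative AUC scoring model.
    """
    if not 1 <= attempts <= n_options:
        raise ValueError("attempts must be between 1 and n_options")
    return (n_options - attempts) / (n_options - 1)


def dichotomous_item_score(attempts: int) -> int:
    """Classical dichotomous scoring: credit only for a first-try success."""
    return 1 if attempts == 1 else 0


# A four-option item answered correctly on the second attempt:
# AUC scoring retains partial information about the examinee's knowledge,
# while dichotomous scoring discards it entirely.
print(auc_item_score(2, 4))          # 2/3 of a point under the AUC rule
print(dichotomous_item_score(2))     # 0 points under dichotomous scoring
```

Under this rule an examinee who eliminates distractors quickly is distinguished from one who guesses through all the options, which is precisely the extra information the IRT analysis credits to the AUC format at low trait levels.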