Find the Mistake!
Psychometric Properties of an Innovative Response Format for Figural Matrices Tasks
Abstract
Reasoning ability is commonly regarded as the best predictor of academic and occupational success. Owing to concerns about the validity of multiple-choice (MC) formats, breaches of test security, and the fact that most existing reasoning assessments target their difficulty at the population mean, there is an ongoing need for new reliable and valid test instruments to assess fluid intelligence in advanced cognitive performance areas. We developed a novel computerized figural matrices test to assess nonverbal reasoning for university student aptitude assessment. In two studies, we generated, revised, and empirically validated the Isometric Matrices Test (IMT). Our results show that the IMT is less prone to test-wiseness strategies than existing reasoning tests. In a third study, we created and evaluated an innovative Find the Mistake (FtM) response format as an alternative to classical multiple-choice formats. Overall, both response formats showed satisfactory psychometric quality in terms of item difficulty and discrimination, test-retest reliability, construct and criterion validity, and Rasch or two-parameter logistic (2PL) model fit; in one MC version, however, internal consistency was low owing to negative discrimination indices. The MC response format proved easier than the FtM format, with men slightly outperforming women in both response modes. We propose the IMT as a useful tool for assessing nonverbal reasoning ability in above-average performance areas and discuss the automatic generation of larger IMT item pools for adaptive testing in order to increase test security and reliability.
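The link the abstract draws between negative discrimination indices and low internal consistency can be illustrated on simulated data. The sketch below (hypothetical data, not IMT data; all variable names are illustrative) computes corrected item-total correlations as a discrimination index and Cronbach's alpha, then reverse-codes one item to show how a negatively discriminating item depresses alpha:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate dichotomous (0/1) responses of 200 examinees to 10 items,
# all driven by a single latent ability (a Rasch-type model).
theta = rng.normal(size=(200, 1))
difficulty = np.linspace(-1.5, 1.5, 10)
p_correct = 1 / (1 + np.exp(-(theta - difficulty)))
responses = (rng.random((200, 10)) < p_correct).astype(float)

def cronbach_alpha(items):
    """Alpha = k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

def corrected_item_total(items):
    """Discrimination: each item's correlation with the rest-score
    (total score excluding that item)."""
    total = items.sum(axis=1)
    return np.array(
        [np.corrcoef(items[:, j], total - items[:, j])[0, 1]
         for j in range(items.shape[1])]
    )

alpha_clean = cronbach_alpha(responses)

# Reverse-code one item: its discrimination index turns negative,
# and internal consistency drops accordingly.
flawed = responses.copy()
flawed[:, 0] = 1 - flawed[:, 0]
alpha_flawed = cronbach_alpha(flawed)

print(corrected_item_total(flawed)[0])  # negative discrimination
print(alpha_clean, alpha_flawed)        # alpha drops for the flawed set
```

In operational test construction, such negatively discriminating items would be flagged for revision or removal, which is the remedy the studies reported here imply for the affected MC version.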