Validity and Reliability of Automatically Generated Propositional Reasoning Items
A Multilingual Study of the Challenges of Verbal Item Generation
Abstract
This study introduces a newly developed public-domain multilingual automatic item generator that creates propositional reasoning (PR) items belonging to 15 item families by using various inference rules. Psychometric properties of the resulting written PR test were investigated in three diverse samples in English, simplified Chinese, and German, respectively. Internal consistency was good to excellent across samples. The ICAR16 short form test of cognitive abilities (Condon & Revelle, 2014) was used to evaluate construct validity. Correlations between ICAR16 scores and PR scores were high. Furthermore, items within families appeared to be equivalent, with only minor differential item functioning between the Chinese- and English-speaking samples. Performance on the PR test was shown to be reasonably stable over the course of 1 week. No differences in total scores were detected between test forms (pen-and-paper vs. computerized administration). Findings suggest that the automatically generated PR test is a valuable instrument for the assessment of propositional reasoning ability.
References
(2005). The effect of different types of perceptual manipulations on the dimensionality of automatically generated figural matrices. Intelligence, 33, 307–324. https://doi.org/10.1016/j.intell.2005.02.002
(2010). Evaluating the contribution of different item features to the effect size of the gender difference in three-dimensional mental rotation using automatic item generation. Intelligence, 38, 574–581. https://doi.org/10.1016/j.intell.2010.06.004
(2006). Automatic generation of quantitative reasoning items. Journal of Individual Differences, 27, 2–14. https://doi.org/10.1027/1614-0001.27.1.2
(2012). Using automatic item generation to simultaneously construct German and English versions of a word fluency test. Journal of Cross-Cultural Psychology, 43, 464–479. https://doi.org/10.1177/0022022110397360
(1986). Working memory. Clarendon Press.
(1999). Mental models in conditional reasoning and working memory. Thinking & Reasoning, 5, 289–302.
(2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67, 1–48. https://doi.org/10.18637/jss.v067.i01
(1988). Synthesizing standardized mean-change measures. British Journal of Mathematical and Statistical Psychology, 41, 257–278. https://doi.org/10.1111/j.2044-8317.1988.tb00901.x
(2002). Generative testing: From conception to implementation. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 199–217). Erlbaum.
(2002). A feasibility study of on-the-fly item generation in adaptive testing. ETS Research Report Series, 2002, i–44.
(2016). Task difficulty prediction of figural analogies. Intelligence, 56, 72–81. https://doi.org/10.1016/j.intell.2016.03.001
(2002). Sample size requirements for testing and estimating coefficient alpha. Journal of Educational and Behavioral Statistics, 27, 335–340. https://doi.org/10.3102/10769986027004335
(1978). On the relation between the natural logic of reasoning and standard logic. Psychological Review, 85, 1–21. https://doi.org/10.1037/0033-295X.85.1.1
(2011). Working memory, text comprehension, and propositional reasoning: A new semantic anaphora WM test. The Spanish Journal of Psychology, 14, 37–49. https://doi.org/10.5209/rev_SJOP.2011.v14.n1.3
(1993). Human cognitive abilities. Cambridge University Press. https://doi.org/10.1017/CBO9780511571312
(2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48, 1–29. https://doi.org/10.18637/jss.v048.i06
(2015). Extended mixed-effects item response models with the MH-RM algorithm. Journal of Educational Measurement, 52, 200–222.
(1985). Pragmatic reasoning schemas. Cognitive Psychology, 17, 391–416. https://doi.org/10.1016/0010-0285(85)90014-3
(2014). Additive multilevel item structure models with random residuals: Item modeling for explanation and item generation. Psychometrika, 79, 84–104. https://doi.org/10.1007/s11336-013-9360-2
(2014). The international cognitive ability resource: Development and initial validation of a public domain measure. Intelligence, 43, 52–64. https://doi.org/10.1016/j.intell.2014.01.004
(1989). Evolutionary psychology and the generation of culture, part II: Case study: A computational theory of social exchange. Ethology and Sociobiology, 10, 51–97.
(1992). Cognitive adaptations for social exchange. In J. Barkow, L. Cosmides, & J. Tooby (Eds.), The adapted mind (pp. 163–228). Oxford University Press.
(1997). Evolutionary psychology: A primer. Center for Evolutionary Psychology.
(1999). Generating items during testing: Psychometric issues and models. Psychometrika, 64, 407–433.
(1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359–374. https://doi.org/10.1016/0001-6918(73)90003-6
(2008). Explaining and controlling for the psychometric properties of computer-generated figural matrix items. Applied Psychological Measurement, 32, 195–210. https://doi.org/10.1177/0146621607306972
(2011). How to get really smart: Modeling retest and training effects in ability testing using computer-generated figural matrix items. Intelligence, 39, 233–243.
(2005). Watson-Glaser critical thinking appraisal, form-s for education majors. Journal of Instructional Psychology, 32, 9–12.
(2007). Mental models in propositional reasoning and working memory’s central executive. Thinking & Reasoning, 13, 370–393. https://doi.org/10.1080/13546780701203813
(2011). Modeling rule-based item generation. Psychometrika, 76, 337–359. https://doi.org/10.1007/s11336-011-9204-x
(2012). The role of item models in automatic item generation. International Journal of Testing, 12, 273–298. https://doi.org/10.1080/15305058.2011.635830
(2003). Computerized adaptive testing with item cloning. Applied Psychological Measurement, 27, 247–261. https://doi.org/10.1177/0146621603027004001
(2018). A new look to a classic issue: Reasoning and academic achievement at secondary school. Frontiers in Psychology, 9, Article 400. https://doi.org/10.3389/fpsyg.2018.00400
(2010). Gesamtdarbietung, Einzeltextdarbietung, Zeitbegrenzung und Zeitdruck: Auswirkungen auf Item- und Testkennwerte und C-Test-Konstrukt [Overall presentation, single text presentation, time limitation and time pressure: Effects on item and test characteristics and C-test construct]. In R. Grotjahn (Ed.), Der C-Test: Beiträge aus der aktuellen Forschung (pp. 265–296). Peter Lang.
(2010). S-C-Tests: Messung automatisierter sprachlicher Kompetenzen anhand von C-Tests mit massiver textspezifischer Zeitlimitierung [S-C-Tests: Measurement of automated linguistic competence using C-tests with massive text-specific time limitation]. In R. Grotjahn (Ed.), Der C-Test: Beiträge aus der aktuellen Forschung (pp. 297–319). Peter Lang.
(2017). Fremd- und Zweitsprachenlernerfolg und seine Erklärung durch Erwerbsalter, kognitive, affektiv-motivationale und sozio-kulturelle Variablen: Eine empirische Studie [Foreign and second language learning success and its explanation by working age, cognitive, affective-motivational and socio-cultural variables: An empirical study] (PhD thesis). Kassel University Press. https://dx.medra.org/10.19211/KUP9783737602730
(2009). Automatic item generation of probability word problems. Studies in Educational Evaluation, 35, 71–76. https://doi.org/10.1016/j.stueduc.2009.10.004
(2018). Multilevel analysis: Techniques and applications (3rd ed.). Routledge.
(1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55. https://doi.org/10.1080/10705519909540118
(2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Sage.
(2002). Item generation for test development. Routledge. https://doi.org/10.4324/9781410602145
(2001). Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329–349. https://doi.org/10.1207/S15324818AME1404_2
(1983). Mental models: Towards a cognitive science of language, inference, and consciousness. Cambridge University Press.
(1992). Propositional reasoning by model. Psychological Review, 99, 418–439. https://doi.org/10.1037/0033-295X.99.3.418
(1972). Reasoning and a sense of reality. British Journal of Psychology, 63, 395–400.
(1996). Illusory inferences about probabilities. Acta Psychologica, 93, 69–90.
(1997). Working memory involvement in propositional and spatial reasoning. Thinking & Reasoning, 3, 9–47.
(2016). Psychometric properties of the learning potential test. Procedia – Social and Behavioral Sciences, 217, 652–656. https://doi.org/10.1016/j.sbspro.2016.02.089
(2017). lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software, 82, 1–26. https://doi.org/10.18637/jss.v082.i13
(2018). Evaluating an automated number series item generator using linear logistic test models. Journal of Intelligence, 6, 1–25. https://doi.org/10.3390/jintelligence6020020
(1965). Item sampling in test theory and in research design (Technical report). Educational Testing Service.
(2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42, 847–862. https://doi.org/10.3758/BRM.42.3.847
(1999). Test theory: A unified treatment. Erlbaum.
(2009). CHC theory and the human cognitive abilities project: Standing on the shoulders of the giants of psychometric intelligence research. Intelligence, 37, 1–10. https://doi.org/10.1016/j.intell.2008.08.004
(1993). Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin, 114, 449–458. https://doi.org/10.1037/0033-2909.114.3.449
(2001). Propositional reasoning and working memory: The role of prior training and pragmatic content. Acta Psychologica, 106, 303–327. https://doi.org/10.1016/S0001-6918(00)00055-X
(1991). A note on a general definition of the coefficient of determination. Biometrika, 78, 691–692. https://doi.org/10.1093/biomet/78.3.691
(1994). Propositional reasoning by mental models? Simple to refute in principle and in practice. Psychological Review, 101, 711–724. https://doi.org/10.1037/0033-295X.101.4.711
(1989). Reliability and validity of the propositional logic test. Educational and Psychological Measurement, 49, 667–672. https://doi.org/10.1177/001316448904900320
(2018). R: A language and environment for statistical computing (Version 3.5.0). R Foundation for Statistical Computing. https://www.R-project.org
(2018). psych: Procedures for psychological, psychometric, and personality research (Version 1.8.4). Northwestern University. https://cran.R-project.org/package=psych
(2017). Web and phone based data collection using planned missing designs. In N. Fielding, R. M. Lee, & G. Blank (Eds.), The SAGE handbook of online research methods (2nd ed., pp. 578–595). Sage.
(2010). Individual differences in cognition: New methods for examining the personality-cognition link. In A. Gruszka, G. Matthews, & B. Szymura (Eds.), Handbook of individual differences in cognition: Attention, memory and executive control (pp. 27–49). Springer. https://doi.org/10.1007/978-1-4419-1210-7_2
(2001). FAM: Ein Fragebogen zur Erfassung aktueller Motivation in Lern- und Leistungssituationen [FAM: A questionnaire to assess current motivation in learning and performance situations]. Diagnostica, 47, 57–66. https://doi.org/10.1026//0012-1924.47.2.57
(2001). Propositional reasoning: The differential contribution of “rules” to the difficulty of complex reasoning problems. Memory & Cognition, 29, 165–175. https://doi.org/10.3758/BF03195750
(2002). The random weights linear logistic test model. Applied Psychological Measurement, 26, 271–285. https://doi.org/10.1177/0146621602026003003
(1983). Cognitive processes in propositional reasoning. Psychological Review, 90, 38–71. https://doi.org/10.1037/0033-295X.90.1.38
(1979). Further examination of formal operational reasoning abilities. Child Development, 50, 478. https://doi.org/10.2307/1129426
(1982). The formal operational reasoning test. The Journal of General Psychology, 106, 61–67. https://doi.org/10.1080/00221309.1982.9710973
(2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48, 1–36. https://doi.org/10.18637/jss.v048.i02
(2018). Retest effects in cognitive ability tests: A meta-analysis. Intelligence, 67, 44–66. https://doi.org/10.1016/j.intell.2018.01.003
(2012). The Cattell-Horn-Carroll model of intelligence. In D. P. Flanagan & P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (pp. 99–144). Guilford Press.
(2010). Testing reasoning ability with handheld computers, notebooks, and paper and pencil. European Journal of Psychological Assessment, 26, 284–292. https://doi.org/10.1027/1015-5759/a000038
(2003). Calibrating item families and summarizing the results using family expected response functions. Journal of Educational and Behavioral Statistics, 28, 295–313. https://doi.org/10.3102/10769986028004295
(1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370. https://doi.org/10.1111/j.1745-3984.1990.tb00754.x
(1993). Working memory and conditional reasoning. The Quarterly Journal of Experimental Psychology Section A, 46, 679–699.
(2016). Handbook of item response theory: Three volume set. Chapman and Hall/CRC.
(2006). Motivational effects on self-regulated learning with different tasks. Educational Psychology Review, 18, 239–253. https://doi.org/10.1007/s10648-006-9017-0
(2000). Psychologie des schlussfolgernden Denkens: Differentialpsychologische Prüfung von Strukturüberlegungen [Psychology of reasoning: Differential psychological testing of structural considerations]. Kovac.
(2002). Ability and achievement testing on the world wide web. In B. Batinic, U.-D. Reips, & M. Bosnjak (Eds.), Online social sciences (pp. 151–180). Hogrefe & Huber.
(1997). A measure of effect size for a model-based approach for studying DIF (Working paper). Edgeworth Laboratory for Quantitative Behavioral Science, University of Northern British Columbia.