Editorial

The Potential of Machine Learning Methods in Psychological Assessment and Test Construction

Published Online: https://doi.org/10.1027/1015-5759/a000817

In psychological assessment and test construction, we often face practical problems and challenges, such as dealing with large, complex datasets, for example, when predicting offenders’ probability of relapse, educational outcomes, or treatment effects. Another challenge might be generating a sufficiently large pool of items from which to create parallel versions of tests to counteract item content becoming public online. In these – and many more – cases, machine learning might provide a useful toolbox for overcoming the challenges we face. Machine learning (ML) – a subfield of artificial intelligence (AI) research – has emerged as a catalyst for statistical learning, providing powerful tools that are reshaping the way we approach complex problems in industry, health, and science. More precisely, ML is a versatile framework for developing and evaluating algorithmic procedures for learning from data. That is, ML techniques use data to estimate or describe functional relationships between variables (e.g., James et al., 2023; Murphy, 2022). The most common variants of ML are supervised and unsupervised learning.

In supervised learning, the data contains predictors (referred to as features) and an outcome variable, referred to as the target. The data are split into a training set, the subset of data that a model uses to learn the relationship between features and target, and a test set, the subset of data on which the function derived from the training set is evaluated as new, unseen data. Because the training set contains observed values of the target variable, these values guide the learning process. Ultimately, the goal of supervised learning is to create a function that maps features to the target and accurately predicts unseen data in the test set. In contrast, unsupervised learning lacks a clear distinction between features and targets. There is just the set of inputs and the goal, simply put, is “to make sense of” them (Murphy, 2022, p. 14), for example, in exploratory factor analysis or latent class analysis.
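
To make this concrete, here is a minimal, hypothetical sketch of the supervised workflow in Python with scikit-learn (the feature matrix and target are simulated placeholders, not data from any study cited here):

```python
# Minimal supervised-learning sketch: learn on a training split, judge on a test split.
# X (features) and y (target) are simulated placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))                                   # 500 cases, 10 features
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500)    # simulated target

# Training set: used to learn the feature-target mapping; test set: held out.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Generalization (predictive) performance is judged on the unseen test set only.
print("Test R^2:", round(r2_score(y_test, model.predict(X_test)), 2))
```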

The far-reaching influence of ML has already extended into psychology and related fields (e.g., Dwyer et al., 2018, for an overview of clinical psychology, neuroscience, and psychiatry), including psychological assessment and psychometrics (e.g., Fokkema et al., 2022; Gonzalez, 2021). In part, this is due to the increasing amount of data available in these areas. For example, Gladstone et al. (2019) used ML techniques with over two million bank account records combined with survey responses from account holders to infer psychological traits such as the Big Five, materialism, and self-control from spending behavior. As another example, Stachl et al. (2020) used ML techniques to predict the Big Five personality traits from smartphone data, such as overall phone usage or specific app usage.

The goal of this editorial is to provide a brief overview of the potential of ML methods in psychological assessment and test construction. In the following, we focus on three applications on which we believe more research is needed and which the European Journal of Psychological Assessment (EJPA) would welcome: (1) automated item generation and (2) automated test assembly, which both focus on test construction, and (3) clinical decision support systems, which address questions relevant to the psychological assessment and diagnosis of individuals. In addition, we (re-)introduce the technique of cross-validation. We end by noting some benefits, but also problems, of ML, highlighting open questions, and identifying future directions for the field of psychological assessment and test construction.

Automated Item Generation

Items have been generated for psychological tests automatically (i.e., automated item generation; AIG) for a long time, especially in the ability domain, where narrow item types exist and variations of preexisting items can easily be constructed (e.g., figural matrices). With the development of large language models, it has now become possible to also use AIG to generate text-based items for the assessment of personality traits or other constructs typically assessed with self- (or other-) ratings of statements. Large language models are ML models that use text-based input to predict words, sentences, or whole essays. Because they are trained on vast amounts of data, they are able to use contextual information to generate text that is more human-like than that of previous statistical language models (Demszky et al., 2023). For example, Götz et al. (2023) recently introduced the Psychometric Item Generator, a tool based on GPT-2 (Radford et al., 2019) for generating questionnaire items. Using this tool, they were able to create a new, shorter version of a Big Five questionnaire that showed psychometric properties similar to the original, though human judgments of the generated items were necessary to pre-select items for this new questionnaire version. Large language models are not only useful for generating items, however. They can also be used to revise the item pool by rewriting items, contextually adapting items, translating items, or eliminating items that linguistically overlap too closely with other items.
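
As a purely illustrative sketch of text-based AIG (this is not the Psychometric Item Generator; the prompt and the screening rule are hypothetical), GPT-2 can be prompted via the Hugging Face transformers library to draft candidate item stems, which would still require human review:

```python
# Hedged sketch of automated item generation with GPT-2 via Hugging Face transformers.
# The prompt and the naive "first sentence" screening rule are hypothetical examples.
from transformers import pipeline, set_seed

set_seed(0)
generator = pipeline("text-generation", model="gpt2")

prompt = "I see myself as someone who"
drafts = generator(prompt, max_length=25, num_return_sequences=5, do_sample=True)

# Keep only the first sentence of each completion as a candidate item stem;
# human judges would still need to screen these for content and clarity.
for draft in drafts:
    candidate = draft["generated_text"].split(".")[0].strip() + "."
    print(candidate)
```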

The ability to automatically create large numbers of text-based items could also make computerized adaptive testing (CAT) more feasible in the assessment of non-cognitive traits. Most applications of CAT currently rely on a large item pool that has been calibrated using item response models, though ML methods are starting to be used, for example, to predict item difficulty from the characteristics of automatically generated items. By combining AIG and ML, CAT could become more widespread. Thus, AIG using ML methods opens up many new possibilities in test construction. We encourage the submission of manuscripts dedicated to this issue, both presenting use cases and comparing AIG with traditional item generation procedures.
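
As a hedged illustration of that idea (the item features and calibrated difficulties below are simulated, and the feature coding is made up for the example), a regression-type learner can be trained to predict an item’s difficulty from coded characteristics of generated items:

```python
# Hypothetical sketch: predict item difficulty from coded item characteristics.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n_items = 300
features = np.column_stack([
    rng.integers(5, 25, n_items),   # word count
    rng.integers(0, 2, n_items),    # contains a negation (0/1)
    rng.integers(0, 2, n_items),    # reverse keyed (0/1)
])
difficulty = 0.05 * features[:, 0] + 0.4 * features[:, 1] + rng.normal(scale=0.3, size=n_items)

model = GradientBoostingRegressor(random_state=0).fit(features, difficulty)

new_item = np.array([[18, 1, 0]])   # coded features of a newly generated item
print("Predicted difficulty:", round(model.predict(new_item)[0], 2))
```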

Automated Test Assembly

In automated test assembly (ATA), items are selected from a larger item pool to maximize a characteristic of the test, such as reliability or response time, potentially under several constraints, such as content coverage or avoiding similar items (van der Linden, 2005). For example, ATA can be used to construct a short form, a parallel test, or a test for a selection decision, where information peaks at a certain trait level. Once all the relevant information about the test and the goal for test construction are defined, an algorithm is needed to find the (optimal) solution. Linear problem solvers can be used as long as the criterion is linear or can be linearly approximated (van der Linden, 2005). Apart from that, local search heuristics can be applied. They are not guaranteed to find the optimal solution and are often specifically tailored to a certain application.
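
To make the optimization concrete, the following hedged sketch (simulated item information values and content codes; requires SciPy 1.9 or later for the milp solver) selects a fixed-length short form that maximizes summed item information under a minimum-items-per-content-area constraint:

```python
# Hedged ATA sketch: 0/1 integer program that maximizes summed item information
# at a target trait level, subject to test length and content constraints.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

rng = np.random.default_rng(0)
n_items, test_length = 60, 12
info = rng.uniform(0.2, 1.0, n_items)       # item information at the target theta (simulated)
content = rng.integers(0, 3, n_items)       # content area of each item (0, 1, or 2)

c = -info                                   # milp minimizes, so negate to maximize information
constraints = [LinearConstraint(np.ones((1, n_items)), test_length, test_length)]  # exact length
for area in range(3):                       # at least 3 items from each content area
    indicator = (content == area).astype(float).reshape(1, -1)
    constraints.append(LinearConstraint(indicator, 3, np.inf))

res = milp(c, constraints=constraints, integrality=np.ones(n_items), bounds=Bounds(0, 1))
selected = np.flatnonzero(res.x > 0.5)
print("Selected items:", selected)
print("Total information:", round(info[selected].sum(), 2))
```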

Recently, ML approaches based on lasso regression have been used to shorten instruments for estimating transdiagnostic mental health factors while maintaining predictive accuracy (e.g., Wise & Dolan, 2020). Promising local search heuristics also come from areas closely related to ML, such as computational intelligence and artificial intelligence: For example, Olaru and Danner (2021) used ant colony optimization to construct a short form of a Big Five questionnaire with optimized measurement invariance and validity. Kreitchmann et al. (2022) developed a genetic algorithm for pairing items to be presented in a forced-choice format that outperformed the best test found in a large random sample of possible assemblies. As the authors pointed out, linear programming algorithms were computationally unfeasible in this situation.
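
A hedged sketch of the lasso idea with simulated Likert-type responses (this is not the procedure of Wise and Dolan, 2020): items whose regression coefficients are shrunk to exactly zero are candidates for removal, yielding a shorter instrument that retains predictive accuracy for the criterion:

```python
# Hedged sketch of lasso-based scale shortening with simulated data.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n_persons, n_items = 400, 30
responses = rng.integers(1, 6, size=(n_persons, n_items)).astype(float)   # 1-5 Likert ratings

# Simulate a criterion that depends on a small subset of items plus noise.
signal_items = [0, 3, 7, 12, 21]
criterion = responses[:, signal_items].sum(axis=1) + rng.normal(scale=2.0, size=n_persons)

# LassoCV chooses the penalty by cross-validation; zero coefficients mark droppable items.
lasso = LassoCV(cv=5).fit(responses, criterion)
retained = np.flatnonzero(lasso.coef_ != 0)
print("Retained items:", retained)
```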

In sum, ML, optimization, and AI techniques can help in ATA when computation would be inefficient or unfeasible otherwise. This will hopefully in turn increase the use of ATA in the future. However, more research is needed on comparisons of different algorithms for ATA, both in the form of applications and simulations, and EJPA welcomes contributions to tackling this problem.

Clinical Decision Support Systems

Clinical decision support systems (CDSS) are computer-based tools integrated into healthcare information systems that are designed to improve clinician decision-making in real-time. By analyzing patient-specific data, including medical history and treatment records, CDSS aim to augment expert knowledge to provide personalized, evidence-based care to patients. New digital data sources, such as smartphone data, and recent developments in the field of ML, or AI more broadly, appear to be promising additions to improve traditional CDSS. One important contribution that ML can bring to CDSS is the ability to enhance predictions of treatment outcomes from complex data sources, such as neuroimaging data. For example, Nguyen et al. (2019) developed a dense feed-forward neural network that uses pretreatment task-based functional magnetic resonance imaging (fMRI) data to accurately predict individual response to bupropion, a common antidepressant. Especially with fMRI data, deep learning architectures show promising results that are difficult to achieve with other learning algorithms (e.g., Eitel et al., 2021; Squarcina et al., 2021).
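
The following is a minimal, hypothetical sketch of such a feed-forward classifier (it is not Nguyen et al.’s architecture; simulated tabular features stand in for pretreatment, e.g., fMRI-derived, predictors):

```python
# Hedged sketch: feed-forward network predicting treatment response (responder vs. non-responder).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 50))                                       # 300 patients, 50 predictors
y = (X[:, :5].sum(axis=1) + rng.normal(size=300) > 0).astype(int)    # simulated response status

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(
    StandardScaler(),                                                 # scale inputs for the network
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0),
)
clf.fit(X_train, y_train)
print("Test ROC AUC:", round(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]), 2))
```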

To further increase the predictive accuracy of an algorithm, ensemble learning techniques can be considered. Ensemble learning is a general framework for combining the predictions of multiple models to ultimately improve out-of-sample performance. Although not limited to CDSS, ensemble learning has shown promise in this field (e.g., Ragab et al., 2022, for a medical example). For example, to enhance the prediction of treatment outcomes, ensemble learning algorithms such as the Super Learner (van der Laan et al., 2007) provide an asymptotically optimal strategy for combining multiple models that often outperforms any individual learning algorithm (Golmakani & Polley, 2020). In summary, ML can provide new ways to obtain or improve the prediction of treatment effects in CDSS. This area of application has also not been explored enough in psychological research, and we encourage submissions to EJPA that either present use cases of ML in the development of CDSS or discuss issues of validity in order to gauge the impact of ML compared with traditional approaches.
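
Because the Super Learner is, in essence, cross-validated stacking, a minimal stacking sketch (not the original Super Learner software; data simulated with scikit-learn) conveys the idea of combining several base learners through a meta-learner fitted on out-of-fold predictions:

```python
# Hedged stacking sketch: base learners are combined by a meta-learner trained on
# out-of-fold predictions, mirroring the idea behind the Super Learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,   # out-of-fold predictions are used to fit the meta-learner
)

aucs = cross_val_score(stack, X, y, cv=5, scoring="roc_auc")
print("Stacked ROC AUC:", round(aucs.mean(), 2))
```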

Cross-Validation

Cross-validation has a long history in psychology, dating back to at least the 1950s (see Browne, 2000, for a review). The basic idea of cross-validation is to avoid judging a model’s performance on the data it has learned from and instead to test it on unseen data. This provides an indication of a model’s generalization ability or predictive performance, as opposed to its retrodictive performance on the sample. However, in some assessment contexts (e.g., psychiatric contexts), it can be difficult to create independent datasets or to obtain a sample large enough to split. In these settings, k-fold cross-validation can be applied. In k-fold cross-validation, the data are randomly partitioned into k equal parts, called folds. The model is trained on all but one fold, and the held-out fold serves as a single test of generalization performance. The procedure is repeated so that each fold serves once as the hold-out, ultimately yielding a distribution of generalization performance estimates. Different versions of k-fold cross-validation have been examined in sparse data settings to provide reliable estimates of the area under the receiver operating characteristic curve (ROC AUC; e.g., Airola et al., 2011).
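
A minimal k-fold cross-validation sketch (simulated data; scikit-learn assumed) illustrates how each fold serves once as the hold-out, yielding a distribution of out-of-sample ROC AUC estimates:

```python
# Hedged k-fold cross-validation sketch with ROC AUC as the performance measure.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # each fold is held out once
aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")
print("Fold-wise ROC AUC:", aucs.round(2), "Mean:", round(aucs.mean(), 2))
```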

Concluding Remarks

Of course, the applications described above are only a small selection of possible applications of ML in the area of psychological assessment and test construction. Others, for example, include enhancing item response modeling and CAT using ML and AI techniques such as deep learning, reinforcement learning, and (Bayesian) optimal experimental design (Keurulainen et al., 2023; Mujtaba & Mahapatra, 2020). Current hindrances to the use of ML in psychological assessment and test construction include a lack of transparency and interpretability (“black box models”) and – especially in the case of CDSS – a lack of trust in AI-based systems. However, probabilistic approaches to ML and advances in interpretable ML may fill this gap (e.g., Hüllermeier & Waegeman, 2021). In addition, software should be developed that is easy to use and flexible enough to adapt to practitioners’ needs.

We believe that ML has a lot of potential for psychological assessment and test construction despite current challenges. At the same time, many of these potential applications are only hinted at and not featured prominently in the psychological assessment literature. If they are to be picked up by our community and used on a larger scale, they need to be featured and dissected in terms of return (validity, especially incremental validity) on investment (which is relatively large, at least in terms of big data and computational modeling). Because of this need to generate impactful research on ML methods, at EJPA we encourage submissions that investigate and/or apply methods of automatically generating or revising items. For example, an interesting potential submission could focus on using large language models to adapt an instrument to a different language and culture and investigating the measurement invariance and validity of this new version. We also encourage submissions that apply ATA for constructing tests in different domains. With CDSS, we think promising topics for submission could address improving the prediction of outcomes (treatment outcomes, educational achievement, etc.) using digital data sources. Lastly, submissions including innovative applications of cross-validation are also of interest to us. These examples are just a few that come to mind from the applications we have focused on, but of course, other ideas, including from other applications of ML methods, are also welcome. Thus, at EJPA, we look forward to receiving research on the application of ML methods in psychological assessment and test construction.

References

  • Airola, A., Pahikkala, T., Waegeman, W., De Baets, B., & Salakoski, T. (2011). An experimental comparison of cross-validation techniques for estimating the area under the ROC curve. Computational Statistics & Data Analysis, 55(4), 1828–1844. https://doi.org/10.1016/j.csda.2010.11.018

  • Browne, M. W. (2000). Cross-validation methods. Journal of Mathematical Psychology, 44(1), 108–132. https://doi.org/10.1006/jmps.1999.1279

  • Demszky, D., Yang, D., Yeager, D. S., Bryan, C. J., Clapper, M., Chandhok, S., Eichstaedt, J. C., Hecht, C., Jamieson, J., Johnson, M., Jones, M., Krettek-Cobb, D., Lai, L., JonesMitchell, N., Ong, D. C., Dweck, C. S., Gross, J. J., & Pennebaker, J. W. (2023). Using large language models in psychology. Nature Reviews Psychology, 2(11), 688–701. https://doi.org/10.1038/s44159-023-00241-5

  • Dwyer, D. B., Falkai, P., & Koutsouleris, N. (2018). Machine learning approaches for clinical psychology and psychiatry. Annual Review of Clinical Psychology, 14(1), 91–118. https://doi.org/10.1146/annurev-clinpsy-032816-045037

  • Eitel, F., Schulz, M.-A., Seiler, M., Walter, H., & Ritter, K. (2021). Promises and pitfalls of deep neural networks in neuroimaging-based psychiatric research. Experimental Neurology, 339, Article 113608. https://doi.org/10.1016/j.expneurol.2021.113608

  • Fokkema, M., Iliescu, D., Greiff, S., & Ziegler, M. (2022). Machine learning and prediction in psychological assessment: Some promises and pitfalls. European Journal of Psychological Assessment, 38(3), 165–175. https://doi.org/10.1027/1015-5759/a000714

  • Gladstone, J. J., Matz, S. C., & Lemaire, A. (2019). Can psychological traits be inferred from spending? Evidence from transaction data. Psychological Science, 30(7), 1087–1096. https://doi.org/10.1177/0956797619849435

  • Golmakani, M. K., & Polley, E. C. (2020). Super Learner for survival data prediction. The International Journal of Biostatistics, 16(2), Article 20190065. https://doi.org/10.1515/ijb-2019-0065

  • Gonzalez, O. (2021). Psychometric and machine learning approaches for diagnostic assessment and tests of individual classification. Psychological Methods, 26(2), 236–254. https://doi.org/10.1037/met0000317

  • Götz, F. M., Maertens, R., Loomba, S., & van der Linden, S. (2023). Let the algorithm speak: How to use neural networks for automatic item generation in psychological scale development. Psychological Methods. Advance online publication. https://doi.org/10.1037/met0000540

  • Hüllermeier, E., & Waegeman, W. (2021). Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning, 110(3), 457–506. https://doi.org/10.1007/s10994-021-05946-3

  • James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. E. (2023). An introduction to statistical learning: With applications in Python. Springer.

  • Keurulainen, A., Westerlund, I., Keurulainen, O., & Howes, A. (2023). Amortised design optimization for item response theory. In N. Wang, G. Rebolledo-Mendez, V. Dimitrova, N. Matsuda, & O. C. Santos (Eds.), Artificial intelligence in education. Posters and late breaking results, workshops and tutorials, industry and innovation tracks, practitioners, doctoral consortium and blue sky (Vol. 1831, pp. 359–364). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-36336-8_56

  • Kreitchmann, R. S., Abad, F. J., & Sorrel, M. A. (2022). A genetic algorithm for optimal assembly of pairwise forced-choice questionnaires. Behavior Research Methods, 54(3), 1476–1492. https://doi.org/10.3758/s13428-021-01677-4

  • Mujtaba, D. F., & Mahapatra, N. R. (2020). Artificial intelligence in computerized adaptive testing. 2020 International Conference on Computational Science and Computational Intelligence (CSCI), 649–654. https://doi.org/10.1109/CSCI51800.2020.00116

  • Murphy, K. P. (2022). Probabilistic machine learning: An introduction. The MIT Press.

  • Nguyen, K. P., Fatt, C. C., Treacher, A., Mellema, C., Trivedi, M. H., & Montillo, A. (2019). Predicting response to the antidepressant bupropion using pretreatment fMRI. In I. Rekik, E. Adeli, & S. H. Park (Eds.), Predictive intelligence in medicine (Vol. 11843, pp. 53–62). Springer International Publishing. https://doi.org/10.1007/978-3-030-32281-6_6

  • Olaru, G., & Danner, D. (2021). Developing cross-cultural short scales using ant colony optimization. Assessment, 28(1), 199–210. https://doi.org/10.1177/1073191120918026

  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. https://api.semanticscholar.org/CorpusID:160025533

  • Ragab, M., Albukhari, A., Alyami, J., & Mansour, R. F. (2022). Ensemble deep-learning-enabled clinical decision support system for breast cancer diagnosis and classification on ultrasound images. Biology, 11(3), Article 439. https://doi.org/10.3390/biology11030439

  • Squarcina, L., Villa, F. M., Nobile, M., Grisan, E., & Brambilla, P. (2021). Deep learning for the prediction of treatment response in depression. Journal of Affective Disorders, 281, 618–622. https://doi.org/10.1016/j.jad.2020.11.104

  • Stachl, C., Au, Q., Schoedel, R., Gosling, S. D., Harari, G. M., Buschek, D., Völkel, S. T., Schuwerk, T., Oldemeier, M., Ullmann, T., Hussmann, H., Bischl, B., & Bühner, M. (2020). Predicting personality from patterns of behavior collected with smartphones. Proceedings of the National Academy of Sciences, 117(30), 17680–17687. https://doi.org/10.1073/pnas.1920484117

  • van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super Learner. Statistical Applications in Genetics and Molecular Biology, 6(1), 1–21. https://doi.org/10.2202/1544-6115.1309

  • van der Linden, W. J. (2005). Linear models for optimal test design. Springer. https://doi.org/10.1007/0-387-29054-0

  • Wise, T., & Dolan, R. J. (2020). Associations between aversive learning processes and transdiagnostic psychiatric symptoms in a general population sample. Nature Communications, 11(1), Article 4179. https://doi.org/10.1038/s41467-020-17977-w