A Common Measurement Scale for Self-Report Instruments in Mental Health Care
T Scores With a Normal Distribution
Abstract
The diversity of measures in clinical psychology hampers the straightforward interpretation of test results, complicates communication with patients, and poses a challenge to the implementation of measurement-based care. In educational research and assessment, it is common practice to convert test scores to a common metric, such as T scores. We recommend adopting this practice in clinical psychology as well, and we propose and test a procedure, applicable to individual test scores, that yields T scores approximating a normal distribution. Using data from a large population sample and from clinical samples, we derived crosswalk formulas that estimate normalized T scores from raw scale scores by regressing IRT-based θ scores on raw scores. We examined the validity of these formulas by comparing the T scores they produce with IRT-based T scores. The two approaches yielded very similar T scores, supporting the validity of the formula-based approach. We discuss the theoretical and practical advantages and disadvantages of both approaches to converting scores to a common metric, as well as alternative approaches. Provided that scale characteristics allow their computation, T scores will help patients and practitioners better understand measurement results and make it easier for them to use test results in joint decision-making about the course of treatment.
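The crosswalk idea summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that IRT θ estimates are already available for each respondent (estimated beforehand with IRT software) and that θ is scaled to mean 0 and SD 1 in the reference population, so that T = 50 + 10θ. The cubic polynomial used for regressing θ on raw scores is a hypothetical choice of functional form. Agreement between formula-based and IRT-based T scores is summarized as a mean difference with 95% limits of agreement in the Bland-Altman style, which is one common option rather than necessarily the one used in the study. All function names and the example data are hypothetical.

```python
from statistics import mean, stdev


def fit_polynomial(x, y, degree=3):
    """Ordinary least-squares polynomial fit via the normal equations.

    Returns coefficients [b0, b1, ..., b_degree] such that y is approximated
    by b0 + b1*x + ... + b_degree*x**degree. Needs at least degree + 1
    distinct x values; otherwise the system is singular or ill-conditioned.
    """
    n = degree + 1
    # Normal equations (X'X) b = X'y for the design matrix with columns x**0 .. x**degree.
    xtx = [[sum(xi ** (i + j) for xi in x) for j in range(n)] for i in range(n)]
    xty = [sum(xi ** i * yi for xi, yi in zip(x, y)) for i in range(n)]
    # Augmented matrix, solved by Gaussian elimination with partial pivoting.
    a = [row + [rhs] for row, rhs in zip(xtx, xty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    coef = [0.0] * n
    for r in reversed(range(n)):
        rest = sum(a[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (a[r][n] - rest) / a[r][r]
    return coef


def make_crosswalk(raw, theta, degree=3):
    """Return a function mapping a raw scale score to an estimated T score.

    Assumption: theta has mean 0 and SD 1 in the reference population,
    so T = 50 + 10 * theta.
    """
    # Standardize raw scores first to keep the normal equations well conditioned.
    m, s = mean(raw), stdev(raw)
    coef = fit_polynomial([(r - m) / s for r in raw], theta, degree)

    def raw_to_t(score):
        z = (score - m) / s
        theta_hat = sum(b * z ** k for k, b in enumerate(coef))
        return 50 + 10 * theta_hat

    return raw_to_t


def agreement(t_formula, t_irt):
    """Mean difference and 95% limits of agreement (Bland-Altman style)."""
    diffs = [f - i for f, i in zip(t_formula, t_irt)]
    bias, sd = mean(diffs), stdev(diffs)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)


if __name__ == "__main__":
    # Hypothetical paired data: raw sum scores 0..40 and IRT theta estimates
    # that rise non-linearly with the raw score (illustrative values only).
    raw_scores = list(range(41))
    theta_scores = [-2 + 0.15 * r - 0.001 * r ** 2 for r in raw_scores]
    to_t = make_crosswalk(raw_scores, theta_scores, degree=3)
    t_formula = [to_t(r) for r in raw_scores]
    t_irt = [50 + 10 * th for th in theta_scores]
    bias, (lower, upper) = agreement(t_formula, t_irt)
    print(f"T for raw score 20: {to_t(20):.1f}")
    print(f"mean difference {bias:.3f}, 95% limits of agreement [{lower:.3f}, {upper:.3f}]")
```

The polynomial degree is a modelling choice rather than a prescription. In practice one would check that the fitted mapping is monotone over the full possible raw-score range and validate it on data not used to derive the formula.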