
Degrees of Freedom in Multigroup Confirmatory Factor Analyses

Are Models of Measurement Invariance Testing Correctly Specified?


Abstract. Measurement invariance is a key concept in psychological assessment and a fundamental prerequisite for meaningful comparisons across groups. In the prevalent approach, multigroup confirmatory factor analysis (MGCFA), specific measurement parameters are constrained to equality across groups. The degrees of freedom (df) for these models readily follow from the hypothesized measurement model and the invariance constraints. In light of research questioning the soundness of statistical reporting in psychology, we examined how often reported df match the df recalculated from the information given in the publications. More specifically, we reviewed 128 studies from six leading peer-reviewed journals focusing on psychological assessment and recalculated the df for 302 measurement invariance testing procedures. Overall, about a quarter of all articles included at least one discrepancy, with tests of metric and scalar invariance being affected most frequently. We discuss moderators of these discrepancies and identify typical pitfalls in measurement invariance testing. Moreover, we provide example syntax for different methods of scaling latent variables and introduce a tool that allows for the recalculation of df in common MGCFA models to improve the statistical soundness of invariance testing in psychological research.

Psychology as a discipline has adopted a number of strategies to improve the robustness and trustworthiness of its findings (Chambers, 2017; Eich, 2014): emphasizing statistical power (Bakker, van Dijk, & Wicherts, 2012), acknowledging uncertainty in statistical results (Cumming, 2014), and disclosing flexibility in data collection and analysis (Nelson, Simmons, & Simonsohn, 2018; Simmons, Nelson, & Simonsohn, 2011). In particular, making all study materials – questionnaires, experimental manipulations, raw data, and analysis scripts – available to others is expected to increase the replicability of published findings (Nosek et al., 2015; Simonsohn, 2013). This transparency can help clarify why many peer-reviewed articles in psychology contain inconsistent statistical results that might affect the interpretation of the reported findings (Bakker & Wicherts, 2011; Cortina, Green, Keeler, & Vandenberg, 2017; Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts, 2016). Recent reviews highlighted major weaknesses in the reporting of null-hypothesis significance tests (NHSTs) and structural equation models (SEMs) that seriously undermine the trustworthiness of psychological science. In the present study, we review potential deficits in the modeling of multigroup measurement invariance testing.

Discrepancies in Statistical Results

Statistical results in journal articles are typically vetted by multiple peer reviewers and sometimes additionally by statistical editors. Despite this thorough review process, many published articles contain statistical ambiguities. For example, Bakker and Wicherts (2011) scrutinized 281 articles from six randomly selected psychological journals (three with a high impact factor and three with a low impact factor) and found that around 18% of the statistical results were reported incorrectly. More recently, Nuijten, Hartgerink, van Assen, Epskamp, and Wicherts (2016) reinvigorated this line of research by introducing the R package statcheck, which automatically scans publications for reporting errors, that is, inconsistencies between a reported test statistic (e.g., t-value, F-value), the degrees of freedom (df), and the corresponding p-value. The sobering result of scanning over 250,000 publications in eight top-tier peer-reviewed journals (Nuijten et al., 2016) was that half of the articles contained at least one inconsistent p-value. Moreover, around 12% of the articles contained a discrepancy that changed the statistical significance of a result, often in line with the researchers’ expectations. Even though the text recognition and evaluation routine has been criticized for being too sensitive (Schmidt, 2017), the study points to serious issues in the way researchers report their findings.

Considering the comprehensive methodological toolbox of psychologists, the test statistics regularly used in NHST are comparatively simple. In applied research, more sophisticated latent variable techniques are often used to test structural hypotheses about several variables of interest. Recently, Cortina and colleagues (2017) reviewed 784 SEMs published in two leading organizational journals to examine whether the reported df matched the information given in the text. When all information necessary to recalculate the df was available, the reported and recalculated df matched only 62% of the time; the discrepancies were particularly prevalent in structural (rather than measurement) models and were often large in magnitude. Thus, the trustworthiness of model evaluations seems questionable for a significant number of SEMs reported in the literature. In test and questionnaire development, the methods used to examine the internal structure, determine the reliability, and estimate the validity of measures typically also rely on latent variable modeling. The implementation of such procedures in standard statistical software packages also extends the spectrum of test construction – beyond the traditional topics of reliability and validity – to other pressing issues such as test fairness and the comparability of test scores across groups.

Measurement Invariance in Multigroup Confirmatory Factor Analysis

Measurement invariance (MI) between two or more groups is given if individual differences in psychological test results can be entirely attributed to differences in the construct in question rather than to membership in a certain group (see AERA, APA, & NCME, 2014). Thus, MI is an essential prerequisite for valid and fair comparisons across cultures, administration modes, language versions, or sociodemographic groups (Borsboom, 2006a). Contemporary psychometric approaches to testing for MI include various latent variable modeling techniques (e.g., Raju, Laffitte, & Byrne, 2002). In an SEM framework, MI is often tested with multigroup confirmatory factor analysis (MGCFA). Analogously, in item response theory (IRT), invariance or bias is assessed by studying differential item functioning. Despite different traditions and a focus on either the scale level (SEM) or the item level (IRT), both techniques share the same logic and concepts (Millsap, 2011). In the remainder of this article, we focus on the SEM approach.

Although different sequences can be used to test for MI in MGCFA (Cheung & Rensvold, 2002; Wicherts & Dolan, 2010), a straightforward procedure of four hierarchically nested steps is often followed (Millsap, 2011). If constraining a certain type of measurement parameter to equality leads to a considerable deterioration in model fit, the invariance assumption is considered violated. In the first step, configural MI, all model parameters except for the necessary identification constraints are freely estimated across groups. For metric or weak MI, the factor loadings are constrained to invariance across groups, allowing for comparisons of bivariate relations (i.e., correlations and regressions). In the third step, scalar or strong MI, the intercepts are set to be invariant in addition to the factor loadings. If scalar invariance holds, it is possible to compare the factor means across groups. In the last step, strict MI, the item residuals are additionally constrained to be equal across groups.
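Because the df at each of these steps follow mechanically from counting sample moments against free parameters, they can be recalculated by hand. The following sketch is our own illustration (the function name is ours): it assumes a simple-structure model with p indicators, m factors, and g groups, no cross-loadings or residual covariances, and counts parameters under the marker-variable scheme.

```python
def mi_degrees_of_freedom(p, m, g):
    """df for the four MI steps in a simple-structure MGCFA with
    p indicators, m factors, and g groups (mean structure included),
    counted under the marker-variable identification scheme."""
    # Sample moments per group: p(p+1)/2 (co)variances plus p means
    moments = g * (p * (p + 1) // 2 + p)
    configural = g * ((p - m)                   # free loadings (markers fixed to 1)
                      + p                       # residual variances
                      + m + m * (m - 1) // 2    # factor variances and covariances
                      + p)                      # saturated mean structure
    metric = configural - (g - 1) * (p - m)     # loadings equated across groups
    scalar = metric - (g - 1) * (p - m)         # intercepts equated, factor means freed
    strict = scalar - (g - 1) * p               # residual variances equated
    return {step: moments - params
            for step, params in [("configural", configural), ("metric", metric),
                                 ("scalar", scalar), ("strict", strict)]}
```

For example, a two-group model with six indicators and two factors yields df of 16, 20, 24, and 30 for the configural, metric, scalar, and strict steps, respectively.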

Depending on the chosen identification scheme for the latent factors (i.e., marker variable method, reference group method, or effects-coding method), different additional constraints have to be introduced (see Table 1): The default setting, the marker variable method, fixes the factor loading of a marker variable to 1 and its intercept to 0 in all MI steps outlined above. In the reference group method, the variances of the latent variables are set to 1 in a reference group and the factor loadings are freely estimated. This approach is sometimes preferable because the marker variable method presupposes an invariant marker variable: fixing the loading of a marker variable that is in fact non-invariant across groups in metric MI (and above) might lead to convergence problems or otherwise affect the results (Millsap, 2001). In practice, researchers frequently adopt a hybrid approach by fixing the factor loading of a marker variable to 1 and the mean of the latent variables in a reference group to 0, because this allows differences in factor means to be interpreted directly. Other identification schemes are possible and equally valid but require different sets of identifying constraints. For example, Little, Slegers, and Card (2006) proposed the effects-coding method, a nonarbitrary way of identifying the mean and covariance structure by constraining the mean of the loadings to 1 and the sum of the intercepts to 0 for each factor. Importantly, the choice of identification constraints affects neither the number of estimated parameters nor the results of the MI tests. To facilitate the implementation of MI testing, we provide example syntax for these MI steps for all three methods of identification in lavaan (Rosseel, 2012) and Mplus (Muthén & Muthén, 1998–2017) in the Electronic Supplementary Material, ESM 1.
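That the identification scheme does not change the number of estimated parameters can be checked with a simple count. The following hypothetical sketch (function name and setup ours; single group, one factor, mean structure included) shows that each of the three schemes trades the same number of fixed parameters for free ones:

```python
def free_parameters(p, scheme):
    """Free parameters of a one-factor CFA with p indicators and a mean
    structure, under three identification schemes (single-group case)."""
    residuals = p  # residual variances are free under every scheme
    if scheme == "marker":     # marker loading fixed to 1, marker intercept to 0
        return (p - 1) + (p - 1) + 1 + 1 + residuals  # loadings, intercepts, factor mean, factor variance
    if scheme == "reference":  # factor variance fixed to 1, factor mean fixed to 0
        return p + p + residuals                      # all loadings and intercepts free
    if scheme == "effects":    # mean of loadings = 1, sum of intercepts = 0
        return (p - 1) + (p - 1) + 1 + 1 + residuals  # one loading and one intercept are implied

# All three schemes estimate 3p parameters, e.g., 15 for p = 5
counts = {s: free_parameters(5, s) for s in ("marker", "reference", "effects")}
```

Because the counts are identical, the model-implied df – and hence the chi-square test of each MI step – do not depend on the scaling choice.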

Table 1 Constraints in MGCFA tests for measurement invariance

The Present Study

Given several critical reviews highlighting inconsistencies in NHST and SEM (Bakker & Wicherts, 2011; Cortina et al., 2017; Nuijten et al., 2016), we pursued two objectives: First, we examined the extent of discrepancies in MI testing with MGCFA. Because the number of df at each MI step is mathematically determined by the hypothesized measurement model, we recalculated the df for the aforementioned MI steps based on the information provided in articles published over the last 20 years in major peer-reviewed journals focusing on psychological assessment. Second, we tried to identify potential causes of the misspecifications (e.g., the complexity of the model or the software package used). Furthermore, we highlight potential pitfalls in specifying the different steps of MI testing. To this end, we also provide example syntax for MI testing and introduce an easy-to-use statistical application that allows double-checking the df in MI testing. Thus, the overarching aim is to improve the statistical soundness of MI testing in psychological research.


Method

Inconsistent df in MGCFA tests of MI were identified in issues of six leading peer-reviewed journals that regularly report on test development and questionnaire construction, covering a period of 20 years (1996–2016): Assessment (ASMNT), European Journal of Psychological Assessment (EJPA), Journal of Cross-Cultural Psychology (JCCP), Journal of Personality Assessment (JPA), Psychological Assessment (PA), and Personality and Individual Differences (PAID). Studies were limited to reports of MGCFA that included one or more of the four MI steps outlined above. We did not consider single-group tests of MI (i.e., longitudinal MI or multi-trait multi-method MI), second-order models, exploratory structural equation models, or MI testing with categorical data.

We first recalculated the df for all MI models from the information given in the text, tables, and figures (e.g., regarding the number of indicators, latent factors, and cross-loadings). A configural model was coded as incorrect if the reported and recalculated df did not match. Then, the df for the metric, scalar, and strict MI models were also recalculated and compared to the reported df. If inconsistent df were identified at a specific step, the df for subsequent models were recalculated by taking the reported (inconsistent) df of the previous step into account, which adopts a more liberal perspective. For example, if an author claimed to have tested metric invariance while also constraining the factor variances across all groups, this step was coded as incorrect. However, if in scalar MI testing the intercepts were additionally set to be invariant, this was coded as correct (despite the constrained factor variances). The coding was limited to the four types of MI outlined above, and we did not code partial MI. Each author coded about half of the studies. If inconsistent df were identified, the other author independently coded the respective study again. Diverging evaluations were discussed until a consensus was reached. We provide our coding sheets and all syntax within the Open Science Framework (Soderberg, 2018). All analyses were conducted with R 3.4.4 (R Development Core Team, 2018).


Results

We identified a total of 302 MI testing sequences published in 128 different research articles. Most articles were published in PA (31.3%) and PAID (23.4%), followed by EJPA (16.4%) and ASMNT (13.3%), whereas fewer articles were retrieved from JCCP and JPA (7.8% each). The number of articles reporting MI testing within an MGCFA framework has increased sharply in recent years: Nearly two-thirds of the articles were published in the 5 years between 2012 and 2016, and over 88% within the last 10 years (see Figure 1). Whereas the absolute number of discrepancies exhibited a slight increase in recent years, the percentage of discrepancies between reported and recalculated df remained rather stable at around 5%.

Figure 1 Studies reporting measurement invariance tests over time. The thin solid line represents the number of studies reporting MI tests; the dashed line represents the number of studies with at least one discrepancy. The bold black line gives the percentage of discrepancies.

Out of the 128 articles, 49 (38.3%) used Mplus to conduct MI testing, 24 (18.8%) used LISREL, and 23 (18.0%) used AMOS. The remaining articles relied on other software such as EQS (n = 10) or R (n = 4), did not report their software choice (n = 17), or used more than one program (n = 1). On average, each article reported 2.36 MI testing sequences (SD = 2.29). Further descriptive information on the model specifications, grouped by journal and publication year, is summarized in Table S1 of ESM 1.

Discrepancies Between Reported and Recalculated Degrees of Freedom

Half of the studies (48.4%) reported multiple MI tests (e.g., for age and sex groups); that is, the identified MI tests were not independent. Since variation in the identified discrepancies (0 = no discrepancy in df, 1 = discrepancy in df) was found at the study level rather than the MI test level (intra-class correlation = .995), we analyzed discrepancies in df at the level of studies rather than single tests of MI. Therefore, we aggregated the results to the article level and examined, for each article, whether at least one inconsistent df was identified for the models in each MI step. The analyses revealed that out of 120 studies reporting configural MI, only 7 showed discrepancies (5.8%; see Table 2). In contrast, tests for metric and scalar MI exhibited discrepancies between the reported and recalculated df more frequently (15.9% and 21.1%, respectively). Only two studies reported incorrect df in strict MI.

Table 2 Discrepancies between reported and recalculated degrees of freedom

To shed further light on potential predictors of the discrepancies, we conducted a logistic regression analysis (0 = no discrepancy, 1 = at least one discrepancy). We added the (1) complexity of the model, (2) publication year, (3) journal, and (4) software package as predictors (see Table 3). The complexity of the model did not predict the occurrence of reporting errors. In contrast, the year of publication influenced the error rate with more recent publications exhibiting slightly more discrepancies. Given that most of the studies have been reported in recent years, the average marginal effect (AME; Williams, 2012) for an article including a discrepancy was about 3.0% (p = .003) per year. Across all journals, a quarter of all published articles on MI included at least one df that we were unable to replicate (see Figure 2).

Figure 2 Reporting inconsistencies across journals. n = 128. The dashed line indicates the average of the discrepancies across journals. The number below the journal abbreviation represents the number of studies. ASMNT = Assessment; EJPA = European Journal of Psychological Assessment; JCCP = Journal of Cross-Cultural Psychology; JPA = Journal of Personality Assessment; PA = Psychological Assessment; PAID = Personality and Individual Differences.
Table 3 Predicting occurrence of discrepancies based on study characteristics

A comparison of the journals demonstrates subtle differences: In comparison with PA, the outlet that published most MI tests, JCCP (AME = 22.5%, p = .13) and PAID (AME = 17.4%, p = .05) reported slightly more inconsistent df. The highest rate of discrepancies between reported and recalculated df was found for JPA (AME = 48.3%, p = .001) – with 5 out of 10 studies. The most important predictor in the logistic analysis was the software package used in MI testing. In comparison with Mplus, studies using other software packages were more likely to have discrepancies: AMOS (AME = 22.3%, p = .02), LISREL (AME = 26.2%, p = .01), and most severely EQS (AME = 57.0%, p < .001).

Pitfalls in Testing Measurement Invariance

Without inspecting the analysis syntax of the reported studies, we can only speculate about the reasons for the discrepancies. However, in our attempts to replicate the df, we spotted two likely sources of model misspecification: In testing metric MI, discrepancies seem to have resulted in many cases (13 of 20 flagged publications) from a misspecified model using the reference group approach for factor identification. As a reminder, the configural model fixes the variances of the latent variables to 1 in all groups while freely estimating all factor loadings. The metric model, however, requires equality constraints on the factor loadings across groups while relaxing the constraints on the variances of the latent variables except in the reference group. It seems that some authors neglected to free the factor variances and thus, instead of testing a metric MI model, evaluated a model with invariant loadings and invariant variances. This is important because the reference group method is sometimes preferred over the marker variable method, which presupposes an invariant marker variable. Fixing the factor loading of a non-invariant marker variable in metric MI might lead to convergence problems or otherwise biased estimates (Millsap, 2001). Several methods have been proposed to identify such invariant indicators (Rensvold & Cheung, 2001). For instance, Yoon and Millsap (2007) prefer the reference group method (i.e., fixing the variance of the latent variables to 1 in the first group only and constraining all factor loadings to equality across groups) and then – if full metric invariance is lacking – systematically freeing loading constraints based on modification indices to identify non-invariant factor loadings and establish partial metric invariance.
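This pitfall leaves a countable footprint in the df: keeping the factor variances fixed in all groups removes (g − 1) × m parameters that a correctly specified metric model would estimate, so the reported df are inflated by exactly that amount. A sketch under our simple-structure assumptions (function name and setup hypothetical):

```python
def metric_df(p, m, g, free_factor_variances=True):
    """df of a metric MI model under reference-group scaling
    (p indicators, m factors, g groups, simple structure, mean structure
    saturated). free_factor_variances=False reproduces the pitfall of
    leaving the factor variances fixed to 1 in every group."""
    moments = g * (p * (p + 1) // 2 + p)       # (co)variances plus means per group
    params = (p                                 # invariant factor loadings
              + g * p                           # residual variances
              + g * (m * (m - 1) // 2)          # factor covariances
              + g * p)                          # saturated mean structure
    if free_factor_variances:
        params += (g - 1) * m                   # variances freed outside the reference group
    return moments - params
```

With two groups, six indicators, and two factors, the correct metric model has 20 df, whereas the misspecified variant has 22 – inflated by (g − 1) × m = 2.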

Issues in reporting scalar MI can in many instances (12 out of 20 flagged studies) be traced back to a misspecified mean structure. SEM is a variance–covariance-based modeling approach, and in a single group case, researchers are usually not interested in the mean structure. Therefore, scalar MI tests, in which the mean structure plays a vital role, seem to present particular difficulties. Again, we suspect that researchers adopting the reference group or hybrid approach for factor identification neglected to free previously constrained latent factor means (see Table 1). As a result, instead of testing for scalar MI, these models in fact evaluated a model with invariant intercepts and means fixed to 0 across groups. Such model misspecifications are not trivial and have severe consequences for model fit evaluations: In a simulated MGCFA MI example, we compared a correctly specified scalar MI model with freely estimated latent factor means (except for the necessary identifying constraint) to a model, in which all factor means were fixed to zero.

Figure 3 demonstrates that even moderate differences in the latent means (d ≈ .50) result in a drop in the comparative fit index (CFI) from an initially good-fitting model (CFI = .98) to values below what is usually considered acceptable (CFI ≥ .95). Thus, if the means are constrained to zero, any differences in the latent means are passed on to the intercepts; if these are also constrained to equality, the unmodeled mean differences can result in substantial model deterioration. As a consequence, misspecified scalar MI models can lead to the erroneous rejection of scalar invariance.
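As with the metric pitfall, the scalar misspecification is detectable from the df alone: fixing all latent means to 0 removes the (g − 1) × m mean parameters that a correct scalar model frees outside the reference group. A sketch under the same hypothetical simple-structure assumptions (function name ours):

```python
def scalar_df(p, m, g, free_latent_means=True):
    """df of a scalar MI model under reference-group scaling
    (p indicators, m factors, g groups, simple structure).
    free_latent_means=False reproduces the pitfall of fixing all latent
    means to 0 instead of freeing them outside the reference group."""
    moments = g * (p * (p + 1) // 2 + p)       # (co)variances plus means per group
    params = (p                                 # invariant factor loadings
              + p                               # invariant intercepts
              + g * p                           # residual variances
              + (g - 1) * m                     # factor variances outside the reference group
              + g * (m * (m - 1) // 2))         # factor covariances
    if free_latent_means:
        params += (g - 1) * m                   # latent means outside the reference group
    return moments - params
```

For two groups, six indicators, and two factors, the correct scalar model has 24 df; the misspecified variant reports 26. Reported df that exceed the recalculated value by (g − 1) × m are thus a telltale sign of this error.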

Figure 3 Consequences of fixing the means to 0 in scalar measurement invariance testing on model fit.


Discussion

The concept of measurement equivalence is pivotal for psychological research and practice. To address substantive research questions, researchers depend on information about the psychometric functioning of their instruments across sex and ethnic groups, clinical populations, and so on. Accordingly, reporting issues in MI testing are not restricted to a specific field but affect different disciplines such as clinical (Wicherts, 2016) and I/O psychology (Vandenberg & Lance, 2000). The extent of the discrepancies we found in the psychological assessment literature was rather surprising: One out of four studies reporting MI tests included an incorrectly specified or, at least, incorrectly described model. Thus, a substantial body of literature on the measurement equivalence of psychological instruments seems questionable or inaccurate. This percentage is probably a lower bound on the true error rate, given the way we coded the MI tests (i.e., no subsequent errors). Since our analysis was limited to discrepancies in the df, additional errors may well have occurred (e.g., in the handling of missing data or the incorporation of nested structures). To identify these and similar flaws, both the raw data and the analysis scripts would be necessary to reanalyze the data. As outlined above, we also did not consider single-group tests of MI (i.e., longitudinal MI or multi-trait multi-method MI), second-order models, exploratory structural equation models, or MI testing with different estimators that are more appropriate for categorical data. In our assessment, these often statistically more complex scenarios of MI testing likely offer additional potential for misspecification.

Regarding the causes of the inconsistencies, the results of the logistic regression provide some valuable clues: The increased popularity of MGCFA MI testing in psychological research was accompanied by an increase in discrepancies. This is not an unusual pattern in the dissemination of psychological methods: After the formal (and often formalized) introduction of a new method by psychometricians, more and more users adopt and apply the method – sometimes without a deeper understanding of the underlying statistics. However, the strongest effect was observed for the software package used to conduct the MI tests. In comparison with Mplus, other software packages performed worse, which might be due to Mplus's extensive documentation and training materials. More likely, however, it can be attributed to a selection effect, because more advanced users tend to prefer script-based software. Taken together, we think that the results point to a general problem with the formal methodological and statistical training of psychologists (Borsboom, 2006b).

The two issues that most predominantly cause discrepancies – inadvertently keeping the factor variances fixed across groups in the metric MI model and fixing the factor means to 0 in the scalar MI model – presumably point to a conceptual misunderstanding: Measurement invariance is, technically speaking, only concerned with the relationship between indicator variables and latent variables. Variances, covariances, and means of latent variables, however, pertain to a different aspect of invariance, called structural invariance (Vandenberg & Lance, 2000), because they concern the properties of the latent variables themselves (see also Beaujean, 2014). Researchers are often especially interested in how these structural parameters vary across groups. Confusing measurement and structural parameters and specifying more restrictive models than necessary can result in failing to establish MI even though the difference lies in the structural parameters. Accordingly, meaningful differences might be wrongly attributed to a measurement artifact.

Recommendations for Conducting and Reporting MGCFA MI Testing

In the following, some recommendations are given to improve the accuracy of conducting and reporting statistical results in the framework of MI testing. These recommendations apply to all parties involved in the publication process – authors, reviewers, editors, and publishers.

First, familiarize yourself with the constraints of MI testing under different identification strategies (see Table 1) and pay attention to the aforementioned pitfalls. Furthermore, we encourage researchers to use the effects-coding method (Little et al., 2006), which allows the factor loadings, variances, and latent means to be estimated and tested simultaneously. In contrast to other scaling methods, the effects-coding method does not rely on fixing single measurement parameters to identify the scale – parameters that might cause problems in MI testing if they function differently across groups but are constrained to be equal. This method might also be helpful in finding a measurement model that is only partially invariant (Rensvold & Cheung, 2001; Yoon & Millsap, 2007).

Second, describe the measurement model in full detail (i.e., the number of indicators, factors, cross-loadings, residual covariances, and groups) and explicitly state which parameters are constrained at the different MI steps, so that it is clear which models are nested within each other. In addition, use unambiguous terminology when referring to specific steps in MI testing. In our literature review, we found several cases in which the description in the method section did not match the restrictions given in the respective table. One way to clarify which model constraints have been introduced is to label the invariance step by the parameters that have been fixed (e.g., “invariance of factor loadings” instead of “metric invariance”).

Third, in line with the recommendations of the Association for Psychological Science (Eich, 2014) and the efforts of the Open Science Framework (Nosek et al., 2015) to make scientific research more transparent, open, and reproducible, we strongly advocate making the raw data and the analysis syntax available in freely accessible data repositories. As a pleasant side effect, there is also evidence that sharing detailed research data is associated with an increased citation rate (Piwowar, Day, & Fridsma, 2007). If legal restrictions or ethical considerations prevent the sharing of raw data, it is possible to create synthesized data sets (Nowok, Raab, & Dibben, 2016).

Fourth, we encourage authors and reviewers to routinely double-check the df of the reported models. In this context, we welcome the recent efforts of journals in psychology to include soundness checks on manuscript submission by default to improve the accuracy of statistical reporting. To this end, one may refer to ESM 1, which includes example syntax for all steps of MI in lavaan and Mplus for different ways of scaling latent variables, or use our JavaScript tool to double-check the df in MI testing.

Fifth, statistical and methodological courses need to be taught more rigorously at universities, especially in structured PhD programs. Rigorous training should include both conceptual (Borsboom, 2006a; Markus & Borsboom, 2013) and statistical work (Millsap, 2011). To bridge the gap between psychometric researchers and applied psychologists, a variety of teaching resources can be recommended that introduce invariance testing in general (Cheung & Rensvold, 2002; Wicherts & Dolan, 2010) or specific aspects of MI such as longitudinal MI (Geiser, 2013) and MI with categorical data (Pendergast, von der Embse, Kilgus, & Eklund, 2017).

Electronic Supplementary Material

The electronic supplementary material is available with the online version of the article.


References

  • American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

  • Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543–554.

  • Bakker, M., & Wicherts, J. M. (2011). The (mis)reporting of statistical results in psychology journals. Behavior Research Methods, 43, 666–678.

  • Beaujean, A. A. (2014). Latent variable modeling using R: A step-by-step guide. New York, NY: Routledge/Taylor & Francis Group.

  • Borsboom, D. (2006a). When does measurement invariance matter? Medical Care, 44, 176–181.

  • Borsboom, D. (2006b). The attack of the psychometricians. Psychometrika, 71, 425–440.

  • Chambers, C. (2017). The seven deadly sins of psychology: A manifesto for reforming the culture of scientific practice. Princeton, NJ: Princeton University Press.

  • Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255.

  • Cortina, J. M., Green, J. P., Keeler, K. R., & Vandenberg, R. J. (2017). Degrees of freedom in SEM: Are we testing the models that we claim to test? Organizational Research Methods, 20, 350–378.

  • Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25, 7–29.

  • Eich, E. (2014). Business not as usual. Psychological Science, 25, 3–6.

  • Geiser, C. (2013). Data analysis with Mplus. New York, NY: Guilford Press.

  • Little, T. D., Slegers, D. W., & Card, N. A. (2006). A non-arbitrary method of identifying and scaling latent variables in SEM and MACS models. Structural Equation Modeling, 13, 59–72.

  • Markus, K. A., & Borsboom, D. (2013). Frontiers of test validity theory: Measurement, causation, and meaning. New York, NY: Routledge.

  • Millsap, R. E. (2001). When trivial constraints are not trivial: The choice of uniqueness constraints in confirmatory factor analysis. Structural Equation Modeling, 8, 1–17.

  • Millsap, R. E. (2011). Statistical approaches to measurement invariance. New York, NY: Routledge.

  • Muthén, L. K., & Muthén, B. O. (1998–2017). Mplus user’s guide (8th ed.). Los Angeles, CA: Muthén & Muthén.

  • Nelson, L. D., Simmons, J., & Simonsohn, U. (2018). Psychology’s renaissance. Annual Review of Psychology, 69, 511–534.

  • Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., … Yarkoni, T. (2015). Promoting an open research culture. Science, 348, 1422–1425.

  • Nowok, B., Raab, G. M., & Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74, 1–26.

  • Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 48, 1205–1226.

  • Pendergast, L., von der Embse, N., Kilgus, S., & Eklund, K. (2017). Measurement equivalence: A non-technical primer on categorical multi-group confirmatory factor analysis in school psychology. Journal of School Psychology, 60, 65–82.

  • Piwowar, H. A., Day, R. S., & Fridsma, D. B. (2007). Sharing detailed research data is associated with increased citation rate. PLoS One, 2, e308.

  • R Development Core Team. (2018). R: A language and environment for statistical computing [Computer software].

  • Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87, 517–529.

  • Rensvold, R. B., & Cheung, G. W. (2001). Testing for metric invariance using structural equation models: Solving the standardization problem. In C. A. Schriesheim & L. L. Neider (Eds.), Equivalence in measurement (Research in management, pp. 21–50). Greenwich, CT: Information Age.

  • Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48, 1–36.

  • Schmidt, T. (2017). Statcheck does not work: All the numbers. Reply to Nuijten et al. (2017). First citation in articleGoogle Scholar

  • Simmons, J. P., Nelson, L. D. & Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. First citation in articleCrossrefGoogle Scholar

  • Simonsohn, U. (2013). Just post it: The lesson from two cases of fabricated data detected by statistics alone. Psychological Science, 24, 1875–1888. First citation in articleCrossrefGoogle Scholar

  • Soderberg, C. K. (2018). Using OSF to share data: A step-by-step guide. Advances in Methods and Practices in Psychological Science, 1, 115–120. First citation in articleCrossrefGoogle Scholar

  • Vandenberg, R. J. & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–70. First citation in articleCrossrefGoogle Scholar

  • Wicherts, J. M. (2016). The importance of measurement invariance in neurocognitive ability testing. The Clinical Neuropsychologist, 30, 1006–1016. First citation in articleCrossrefGoogle Scholar

  • Wicherts, J. M. & Dolan, C. V. (2010). Measurement invariance in confirmatory factor analysis: An illustration using IQ test performance of minorities. Educational Measurement: Issues and Practice, 29, 39–47. First citation in articleCrossrefGoogle Scholar

  • Williams, R. (2012). Using the margins command to estimate and interpret adjusted predictions and marginal effects. Stata Journal, 12, 308–331. First citation in articleCrossrefGoogle Scholar

  • Yoon, M. & Millsap, R. E. (2007). Detecting violations of factorial invariance using data-based specification searches: A Monte Carlo study. Structural Equation Modeling, 14, 435–463. First citation in articleCrossrefGoogle Scholar

Ulrich Schroeders, Psychological Assessment, Institute of Psychology, University of Kassel, Holländische Str. 36-38, 34127 Kassel, Germany, E-mail