
Using Differential Item Functioning to Analyze the Domain Generality of a Common Scientific Reasoning Test

Published Online: https://doi.org/10.1027/1015-5759/a000662

Abstract

A significant problem that assessments of scientific reasoning face at the level of higher education is the question of domain generality, that is, whether a test will produce biased results for students from different domains. This study applied three recently developed methods of analyzing differential item functioning (DIF) to evaluate the domain generality assumption of a common scientific reasoning test. Additionally, we evaluated the usefulness of these new, tree- and lasso-based, methods to analyze DIF and compared them with methods based on classical test theory. We gave the scientific reasoning test to 507 university students majoring in physics, biology, or medicine. All three DIF analysis methods indicated a domain bias present in about one-third of the items, mostly benefiting biology students. We did not find this bias with methods based on classical test theory; those methods indicated instead that all items were easier for physics students compared to biology students. Thus, the tree- and lasso-based methods provide a clear added value to test evaluation. Taken together, our analyses indicate that the scientific reasoning test is neither entirely domain-general nor entirely domain-specific. We advise against using it in high-stakes situations involving domain comparisons.

The assessment of scientific reasoning has been highlighted as a particularly important challenge of this century (Osborne, 2013). Not all conceptualizations of scientific reasoning include the same skills, but there is considerable overlap: at its core, the set comprises skills such as formulating questions, formulating hypotheses, gathering evidence, evaluating evidence, explaining results, and communicating results (Fischer et al., 2014; National Research Council [NRC], 2012). These skills are relevant not only in conducting scientific studies but also in professional practice. While there is no shortage of scientific reasoning tests in the literature, many of them are of unknown quality and many of their assumptions remain untested (Opitz et al., 2017). This article focuses on one aspect of the assessment of scientific reasoning: the domain generality of scientific reasoning tests when applied to higher education students.

Whether scientific reasoning is domain-general or domain-specific has been debated for decades. The position of domain generality entails that the core set of scientific reasoning skills is very similar or identical across domains like chemistry or physics. It also implies that the domain context of a test has little or no influence on the assessment of scientific reasoning skills. Proponents of this position claim that the potential influence of domain-specific test elements can be neutralized by using content that students are familiar with, thereby preventing scientific reasoning tests from being dominated by content effects (Harlen, 1999). Early on, test authors, inspired by the works of Jean Piaget (e.g., Inhelder & Piaget, 1958), tended toward such domain-general conceptualizations. The popularity of this position declined over time, though, as doubts arose about whether a universally applicable scientific method exists (Kind & Osborne, 2017). Declaring scientific reasoning to be domain-specific thus seemed to be the logical conclusion. The position of domain specificity proposes a very close connection between scientific reasoning skills and the domain in which they are tested. According to this position, one cannot generalize from the results of a scientific reasoning test in one context to results in another context. Its proponents argue that there should be no separation between reasoning and knowledge and that, because real-life scenarios always involve domain knowledge, no useful insights can be drawn from knowledge-lean tasks (Kind, 2013; Osborne, 2013; Zimmerman, 2000). Additionally, some conceptualizations of scientific reasoning have moved away from a strict general versus specific dichotomy (Hetmanek et al., 2018; Karmiloff-Smith, 2012; Niaz, 1995). They postulate that scientific reasoning skills apply to more than one specific context but are not as general as intelligence, and that some scientific reasoning skills can be more general than others.

We chose higher education because knowing the degree of generality of scientific reasoning skills is especially relevant for determining how successful universities are in teaching these skills independent of a particular major. Considering the general relevance of scientific reasoning skills in both academic and non-academic work environments, it would be ideal if students acquired these skills independent of their major, especially within the sciences. Assessing whether such major-independent acquisition is indeed happening requires an evaluation of whether a test can be given to students from different majors without producing biased results.

One test that makes such a claim about its domain generality is the Classroom Test of Scientific Reasoning (CTSR; Lawson, 2000). As it is one of the few common scientific reasoning tests that have been used in multiple studies, we selected it for an evaluation of its domain generality assumption. The test has been used in higher education settings with participants majoring in science as well as participants from non-science majors (Bao et al., 2009; Coletta & Phillips, 2005; Lawson, Alkhoury, et al., 2000; Lawson, Clark, et al., 2000). Other tests that aim to measure scientific reasoning at the higher education level use items that are similar to CTSR items (see e.g., Gormally et al., 2012; Tobin & Capie, 1981), so it is reasonably representative of scientific reasoning assessments. Other authors who evaluated the CTSR described the domain-specific knowledge required for the test as “minimal” (Osborne, 2013, p. 269).

Unfortunately, the domain generality assumptions of scientific reasoning tests, including the CTSR, are rarely tested (Opitz et al., 2017). The few studies that do test domain-related assumptions are problematic: They use methods from Classical Test Theory (CTT) that falsely equate observed test scores with underlying latent abilities (Cloonan & Hutchinson, 2011; Weld et al., 2011). To better understand why this is problematic, consider the study by Weld et al. (2011) as an example. The authors compared test results from elementary education and biology majors and used the absence of a significant difference as support for their assumption of domain generality. However, this line of reasoning allows for a scenario in which the results of the two groups are only on the same level because one group benefited from an unfair advantage. This advantage might have masked the fact that the latent ability of one group is actually below that of the other group. The approach also fails to separate two potential realities in another scenario: If the students majoring in biology had achieved a significantly better result, there would be two possible explanations. First, their latent scientific reasoning ability could be higher, which would still be in accordance with an assumption of domain generality. Second, the biology majors could have benefited from an unfair advantage due to biased elements in the test, which would contradict the assumption of domain generality. These two possibilities cannot be separated by merely looking at the difference in mean scores between two groups.

Another shortcoming of the previously used methods is that they focus only on bias at the level of the total score, but the absence of bias at this level does not necessarily indicate the absence of bias at the level of individual items (Borsboom, 2006). A test could, for instance, contain biased items with a biology context and biased items with a physics context, yet a comparison of means between physics and biology students might not indicate any bias because the biased items cancel each other out at the level of the total score.

This article addresses these shortcomings of previously used CTT methods by employing analysis techniques that were developed within the framework of item response theory (IRT). The IRT concept most relevant for our study is differential item functioning (DIF). We use DIF to judge whether the measurement properties of a test are invariant across the groups of interest. If the analyses point to an absence of measurement invariance, we have to assume that the results of an assessment were influenced by characteristics of group membership unrelated to the trait being measured, which would imply that group means cannot be compared without bias. As a new idea for the present study, we propose that DIF analyses, as an indicator of measurement invariance, can be applied to the issue of domain generality. This idea is based on the following assumption: If DIF analyses indicate that the measurement properties of an assessment vary between students majoring in different domains, we should not expect this test to assess scientific reasoning in a way that could be considered domain-general. While previous techniques for analyzing DIF were limited to comparisons of two groups, we employ more recent methods that overcome this limitation. We use these techniques to see whether bias is present and, if so, which items are responsible for it. Using more than one method allows us to see how stable the results of IRT-based bias analyses are when different methods are applied to the same dataset.

Specifically, we selected so-called tree models as one of our methods, which are also known under their technical term of model-based recursive partitioning (Strobl et al., 2009). We employed two tree-model techniques: We used a Rasch tree to analyze DIF at the global test level, that is, without looking at specific items (Strobl et al., 2015). DIF at the level of specific items was evaluated with a second technique, namely item-focused trees (Tutz & Berger, 2016). Both tree-based techniques follow the same general procedure. The first step is to estimate all parameters jointly for the whole sample. The second step is to check how stable these parameter estimates are when covariates that might cause DIF are considered; in this article, for instance, we check how stable the parameter estimates are for different student groups. If DIF is indeed present, the introduction of the covariate leads to a systematic deviation in the parameter estimates rather than just a random fluctuation, and the sample should therefore be split into subsamples with different estimates. To test this, the deviation is transformed into a test statistic that can be submitted to a significance test. Because this is done simultaneously for all potential splits, the α-level of the significance test is adjusted for multiple testing. This adjustment is important for controlling the false alarm rate; without it, we would find DIF where none exists. If at least one covariate leads to a significant deviation, the sample is split in a third step. If multiple splits are significant, the one that improves the model fit the most is chosen. These three steps are repeated within the subsample branches produced by the split until no significant deviations remain (or the subsample size falls below a predetermined threshold). Thus, a tree emerges. As pointed out above, the technique by Strobl et al. (2015) applies these steps at the global level, while the technique by Tutz and Berger (2016) applies them at the item level. The advantage over prior methods that checked DIF at the item level is that all items are considered simultaneously rather than independently of one another; the latter approach assumes, for the test of each item, that all other items are free of DIF, which is often unrealistic. Items that are selected for at least one split are considered to exhibit DIF, whereas items that are never selected do not.
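
To make these two techniques more concrete, the following is a minimal R sketch with simulated data. The variable names and the simulated responses are illustrative assumptions; the raschtree() call follows the psychotree interface, whereas the exact argument names of DIFtree() are assumptions based on our reading of the package documentation and should be checked against its help pages.

```r
library(psychotree)   # raschtree(): global, tree-based DIF detection (Strobl et al., 2015)
library(DIFtree)      # DIFtree(): item-focused trees (Tutz & Berger, 2016)

set.seed(1)
n     <- 400
major <- factor(sample(c("physics", "biology", "medicine"), n, replace = TRUE))
theta <- rnorm(n)                                   # latent ability of simulated students
resp  <- sapply(seq(-1.5, 1.5, length.out = 12),    # 12 items of varying difficulty
                function(b) rbinom(n, 1, plogis(theta - b)))

## Global Rasch tree: the item responses enter as a matrix column of the data frame.
d      <- data.frame(major = major)
d$resp <- resp
rasch_tree <- raschtree(resp ~ major, data = d)
plot(rasch_tree)   # with this DIF-free toy data, a single node is expected;
                   # with real DIF, the tree branches by covariate values

## Item-focused tree on dummy-coded covariates (argument names are assumptions).
X <- data.frame(biology  = as.integer(major == "biology"),
                medicine = as.integer(major == "medicine"))
item_tree <- DIFtree(Y = resp, X = X, model = "Rasch", alpha = 0.05)
item_tree          # printing the fitted object lists the items selected for a split
```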

As our second method, we selected a technique known as the least absolute shrinkage and selection operator (lasso; Tibshirani, 1996). Specifically, we chose a technique devised by Tutz and Schauberger (2015). This lasso-based technique introduces many parameters for potential DIF and then reduces them such that only genuine DIF remains in the model. In the first step, DIF parameters are introduced for every covariate in every item. The eventual goal is that every parameter that differs from zero indicates DIF. The problem is that many of these parameters will differ from zero just by chance, and we need to separate those from the parameters that indicate genuine DIF. The lasso solves this problem by introducing a penalty into the parameter estimation. The penalty takes the form of a tuning parameter λ that shrinks the DIF parameter estimates. If λ is set to zero, we obtain the standard estimates, resulting in a high rate of false hits, that is, indications of DIF where there is none. As λ approaches infinity, all of these extra DIF parameters shrink to zero and no item is considered to exhibit DIF. These are extreme cases used only for explanatory purposes, however; for typical λ values, most but not all parameters are reduced to zero. If λ is chosen carefully, the parameters that remain non-zero indicate genuine DIF. To find the optimal λ value, the lasso uses the Bayesian information criterion (BIC; Schwarz, 1978), a criterion that balances model fit against model complexity.
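
The sketch below illustrates this penalized approach with the DIFlasso package, continuing with the simulated resp and major objects from the tree sketch above. The exact arguments of DIFlasso(), any requirements regarding the scaling of the covariates, and the availability of a plot method are assumptions and should be verified against the package documentation.

```r
library(DIFlasso)

## Item responses and dummy-coded covariates as data frames
## (reusing the simulated 'resp' and 'major' objects from the tree sketch).
Y <- as.data.frame(resp)
X <- data.frame(biology  = as.integer(major == "biology"),
                medicine = as.integer(major == "medicine"))

## Fits the model along a grid of lambda values and selects the one with the
## minimal BIC; items with non-zero DIF parameters at that lambda are flagged.
dif_lasso <- DIFlasso(Y = Y, X = X)
dif_lasso          # printing the fitted object summarizes the flagged items
plot(dif_lasso)    # coefficient paths across the lambda grid (assumed plot method)
```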

To summarize, we employed and compared the following methods to look for domain bias: At the level of the whole test, we compared the previously used CTT method of comparing means with a Rasch tree DIF analysis (Strobl et al., 2015). At the item level, CTT only allows an inspection of item difficulties; we compared this with tree- and lasso-based IRT techniques for analyzing DIF (Tutz & Berger, 2016; Tutz & Schauberger, 2015).

We selected these techniques because it has been shown that they can detect DIF while rarely producing false alarms, that is, they rarely indicate DIF where it is not present, and because the models they produce are easy to interpret (Strobl et al., 2015; Tutz & Berger, 2016; Tutz & Schauberger, 2015). Additionally, the analyses at the item level provide an advantage if we discover DIF: In that case, we can gain insights into the connection between the domain specificity of items and their respective DIF results, that is, we can see whether items that experts consider to be more domain-specific have a higher rate of DIF. We do want to add a note about the lasso method, though: Because all parameters are shrunk in its calculation, it underestimates the exact size of the bias (Tutz & Schauberger, 2015).

Using the aforementioned methods, we wanted to address the following two research questions:

Research Question 1 (RQ 1):

Can the CTSR be considered domain-general when used in higher education? A high amount of bias between domains, along with corresponding performance advantages, would speak against this.

Research Question 2 (RQ 2):

How useful are the employed tree- and lasso-based DIF analyses compared to previously used methods for analyzing domain generality assumptions? Two aspects in particular would speak in favor of the new analyses being useful: First, do they effectively address the shortcomings of previously used methods of classical test theory when analyzing domain bias? Second, do they provide conceptual and practical insights at the item level that go beyond analyzing bias at the level of the complete test?

Method

Sample

We started data collection with the goal of including 500 participants. This number was informed by studies about the methods we wanted to use, in which simulations had revealed an acceptable relation of true- and false-positive indications of DIF for that sample size (Strobl et al., 2015; Tutz & Berger, 2016; Tutz & Schauberger, 2015). Our final sample comprised 507 university students (249 male students, 256 female students). The students were on average M = 23.01 years old (SD = 2.91) and had already been at university for M = 6.76 semesters (SD = 2.53). They were majoring in physics (192 participants), biology (167 participants), or medicine (148 participants). We chose biology and physics students for our sample because the CTSR contains items with contexts from these two domains. Additionally, we included students majoring in medicine because we were interested in seeing whether we could find a domain bias between participants enrolled in a science major – the combination of biology and physics majors in our case – and participants studying a related discipline.

In terms of exclusion criteria, we removed students who stopped taking the test before they completed the full assessment. This was the case for 27 participants. We also excluded students if their time for completing the test was below 40% of the full testing time (which a pilot study had shown to be the minimum time needed to finish the assessment when actually working on the questions) and, at the same time, their rate of correct answers was below chance level. If both criteria were met, we considered it safe to assume that the respective student had answered the test questions at random. We made one exclusion based on this procedure. The inclusion and exclusion criteria were established prior to data analysis.
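
As a small illustration, a minimal R sketch of this exclusion rule is given below. The variable names, the 60-minute full testing time, and the 25% guessing rate are illustrative assumptions, not values reported in the study.

```r
set.seed(1)
raw <- data.frame(
  completed    = sample(c(TRUE, FALSE), 535, replace = TRUE, prob = c(0.95, 0.05)),
  time_used    = runif(535, 5, 60),   # minutes spent on the test (toy values)
  prop_correct = runif(535)           # proportion of correct answers (toy values)
)
full_time    <- 60                    # assumed full testing time in minutes
chance_level <- 0.25                  # assumed guessing probability

## Flag presumably random responders: very fast AND below-chance accuracy.
random_responder <- raw$time_used < 0.40 * full_time & raw$prop_correct < chance_level
included <- raw[raw$completed & !random_responder, ]
```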

Test Instruments

We translated the questions of the CTSR into German. The phrasing was checked in a pilot phase of the study, and we observed no issues with it. The majority of the 24 CTSR questions are paired: First, the participants choose what they think is the correct answer. Second, they choose what they think is the corresponding justification for this answer. The assessment is then scored following a suggestion by the authors of the CTSR: Each pair of questions is merged into a single item. The only exceptions are questions 23 and 24, which each form an item of their own. The total possible score resulting from this procedure is 13.
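
A minimal R sketch of this scoring rule is shown below, using a simulated answer matrix as a stand-in for the real responses; the assumption here is that questions 1-22 form the 11 answer/justification pairs and that questions 23 and 24 are scored individually.

```r
set.seed(1)
answers <- matrix(rbinom(507 * 24, 1, 0.7), ncol = 24,       # toy 0/1 responses
                  dimnames = list(NULL, paste0("q", 1:24)))

## A paired item is correct only if both the answer and its justification are correct.
pair_starts <- seq(1, 22, by = 2)
items <- sapply(pair_starts,
                function(q) as.integer(answers[, q] == 1 & answers[, q + 1] == 1))
items <- cbind(items, answers[, 23], answers[, 24])          # 11 pairs + 2 single questions
colnames(items) <- paste0("item", 1:13)

total_score <- rowSums(items)                                # possible range: 0 to 13
```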

In order to check the degree to which the solution of biased items depends on domain-specific aspects, we gave the items to two researchers from the department of physics and two researchers from the department of biology. They provided us with a context rating of the items in terms of their domain dependency. We had created a rating scheme for this task with four possible classifications: Items with no biology or physics context were rated with a zero. A rating of 1 was given if an item had a biology or physics context but the raters did not expect that context to induce an advantage for a specific domain. An item received a rating of 2 if the raters did expect an advantage for students from a specific domain when solving the item, but also thought it possible that a scientific reasoning skill useful across domains, such as the interpretation of data, could be applied in solving it. Last, the raters assigned a rating of 3 if they thought that mastery of a domain-specific concept was necessary to find the correct solution for an item. Table 1 contains a list of item numbers, information about which questions were combined to form each item, abbreviated names for the items, a simplified explanation of the contents of the questions (which differs from their exact wording), as well as the rating given to the item context by the experts. The context rating was established via consensus among the experts.

Table 1 Overview of scientific reasoning test items and their context rating

Additionally, in order to check whether possible domain differences are based on differences in general reasoning, we chose three subscales from the Intelligence Structure Test, revised version (IST 2000 R; Amthauer et al., 2001), selected 10 items from each subscale, and added them to the test booklet. These subscales assessed numerical, verbal, and figural reasoning.

Analysis

The DIF analyses we used do not return parameter estimates for items that every student (or every student from one domain, when domains are compared) solved correctly, nor for students who answered all questions correctly. We therefore had to exclude one item from the DIF calculations because every student majoring in physics had answered it correctly (Item 1 in Table 1). Additionally, we had to remove the students who had achieved the maximum score of 13 in the CTSR from our DIF calculations. After this removal, n = 461 students could be included in the DIF calculations. We created two dummy variables to represent the three domains of our participants in the DIF calculations. Domain differences were additionally analyzed with CTT-based methods that have been used in the past to establish the domain generality status of tests. These methods included an analysis of variance (ANOVA), an analysis of covariance (ANCOVA), and the calculation of item easiness. Numerical, verbal, and figural reasoning served as control variables in the ANCOVA. While we found ceiling effects (see the Results section for more details) that violate the normality assumption of the ANOVA and ANCOVA, the assumption of homogeneous slopes was met for the ANCOVA.
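
The following minimal R sketch illustrates this data preparation with simulated stand-in objects; the variable names and the choice of physics as the reference category for the dummy coding are assumptions for illustration only.

```r
set.seed(1)
## Toy stand-ins: in practice, 'items' comes from the scoring step described above.
items <- matrix(rbinom(507 * 13, 1, 0.75), ncol = 13,
                dimnames = list(NULL, paste0("item", 1:13)))
major <- factor(sample(c("physics", "biology", "medicine"), 507,
                       replace = TRUE, prob = c(192, 167, 148) / 507))

keep      <- rowSums(items) < 13                       # drop perfect scores
items_dif <- items[keep, colnames(items) != "item1"]   # drop the item solved by all physics majors
major_dif <- major[keep]

## Two dummy variables code the three domains (physics as reference category).
domain_dummies <- data.frame(biology  = as.integer(major_dif == "biology"),
                             medicine = as.integer(major_dif == "medicine"))
```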

We calculated descriptive analyses as well as the CTT-based scale characteristics and mean comparisons (i.e., ANOVA and ANCOVA) with IBM® SPSS® 23. We conducted the analysis of DIF with R, version 3.2.5, using the DIFlasso (Schauberger, 2016), DIFtree (Berger, 2016), and psychotree (Strobl et al., 2015) packages.

Results

Before we present the results of the DIF analyses, several important descriptive statistics are given. The mean total score in the CTSR for the whole sample was M = 9.91 (SD = 2.22), with a maximum possible total score of 13. The reliability (Cronbach’s α) of the CTSR was .63; for figural, verbal, and numerical reasoning it was .67, .47, and .77, respectively. Table 2 displays the easiness of the CTSR items according to CTT. The table contains the values for the complete sample and the three subsamples grouped by major. It is noteworthy that all values for biology majors are below the corresponding values for physics majors (i.e., the items were more difficult for biology students in CTT terms).

Table 2 Item easiness of the scientific reasoning items according to CTT
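
As a small illustration of how these CTT quantities can be computed, the R sketch below uses simulated stand-in data; with real data, the items matrix and the major factor would hold the observed responses and majors.

```r
set.seed(1)
items <- matrix(rbinom(507 * 13, 1, 0.75), ncol = 13,
                dimnames = list(NULL, paste0("item", 1:13)))
major <- factor(sample(c("physics", "biology", "medicine"), 507, replace = TRUE))

## Item easiness (proportion correct), overall and per major.
easiness_total    <- colMeans(items)
easiness_by_major <- aggregate(as.data.frame(items), by = list(major = major), FUN = mean)

## Cronbach's alpha from the classical formula.
cronbach_alpha <- function(x) {
  k <- ncol(x)
  k / (k - 1) * (1 - sum(apply(x, 2, var)) / var(rowSums(x)))
}
cronbach_alpha(items)   # reported as .63 for the CTSR in this sample
```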

Differential Item Functioning Analyses

The first DIF analysis was conducted at the level of the complete test using the tree-based technique described in the introduction. The analysis suggested two splits: The first split was made between physics students and all remaining students, p < .001. The second split was made within the group of non-physics students, separating biology from medicine, p = .014. Based on these splits, we would expect that comparing any of the three subgroups of students would produce a biased result.

As the previous analysis only covered the level of the whole test, we turn to the analyses using the item-focused tree and the lasso to single out the items that violate the assumption of measurement invariance. The item difficulties (in the form of logit values) produced by these analyses are shown in Table 3. When looking at the numbers in Table 3, it is important to remember the note from the introduction: The lasso analysis underestimates the absolute value of the difference between student groups. The emphasis of the lasso analysis is on providing accurate information about which items are biased and in which direction, rather than on estimating the size of the bias.

Table 3 Item difficulty parameter from the tree- and lasso-based analyses

The two analyses consistently pointed to domain bias in four items (6, 7, 10, and 13). One additional biased item (Item 11) was present in the lasso results. A comparison between participants from biology and physics regarding these biased items shows consistent matches between the bias and the context domain (which can be seen in the last columns of Tables 1 and 3). In the biased items with a biology context (6, 7, 10, and 13), the bias favored participants from biology. The bias in Item 11, which is embedded in a physics context, favored participants from physics. At the same time, two other items (2 and 5) that were embedded in a physics context appear to be unbiased based on the results. However, these items were comparatively easy: Among all other items in the DIF calculations, only one (Item 8) had a lower difficulty.

Domain Differences in ANOVA and ANCOVA Analyses

A one-way ANOVA comparing the scientific reasoning scores of the student subgroups revealed a significant difference, F(2, 503) = 24.82, p < .001, ηp2 = .09. We used a Bonferroni-corrected post hoc analysis to compare the three subgroups and found that biology majors achieved significantly lower scores than medical students, who in turn achieved significantly lower scores than physics students. In a subsequent ANCOVA, the domain effect remained significant after we controlled for numerical, verbal, and figural reasoning, F(2, 473) = 11.94, p < .001, although the effect size was reduced to ηp2 = .05.
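
A minimal R sketch of these mean comparisons with simulated stand-in scores is given below (the study itself used SPSS for these analyses); the variable names and toy values are illustrative assumptions.

```r
set.seed(1)
n     <- 507
major <- factor(sample(c("physics", "biology", "medicine"), n, replace = TRUE))
total <- rbinom(n, 13, 0.75)                                     # toy CTSR total scores
numerical <- rnorm(n); verbal <- rnorm(n); figural <- rnorm(n)   # toy reasoning scores

## One-way ANOVA and Bonferroni-corrected post hoc comparisons.
summary(aov(total ~ major))
pairwise.t.test(total, major, p.adjust.method = "bonferroni")

## ANCOVA: reasoning covariates entered before the domain factor.
summary(aov(total ~ numerical + verbal + figural + major))
```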

Discussion

With the help of DIF analyses, we discovered a domain bias in multiple CTSR items between two domains of science, physics and biology. Thus, on the one hand, the answer to our first research question, whether the CTSR can be considered domain-general, seems to be that the evidence for domain generality produced by this study is weak. The context rating conducted by domain experts for this study shows two things: First, the biased items are embedded in a context that matches the bias; for example, the items that were biased toward biology majors are embedded in a biological context. Second, it is possible, in theory, to solve these items without domain-specific knowledge. It seems that bias can occur in an item simply because some of its context features are domain-specific, even if no domain-specific knowledge is strictly necessary for providing the correct solution. On the other hand, the CTSR does not seem to be a completely domain-specific test either. When we look at the items that favored biology majors, we do not observe a high failure rate for physics majors. In fact, one could argue that the opposite is more accurate: In terms of CTT, the biased items were actually more difficult for biology majors than for physics majors. It seems that the physics students managed to use their scientific reasoning skills to solve these items. The same picture emerges at the level of the complete test: The physics students performed significantly better than the biology students, even when controlling for general reasoning.

Based on these observations, we advise caution when authors of scientific reasoning assessments claim that their assessment measures scientific reasoning in a fully domain-specific or fully domain-general fashion. This has implications for the debate surrounding the domain generality or specificity of scientific reasoning. Based on the presented results, it seems reasonable for researchers to explore alternative conceptualizations that go beyond the classical dichotomy (Hetmanek et al., 2018; Karmiloff-Smith, 2012; Niaz, 1995; Zimmerman, 2000) and instead imagine scientific reasoning as a set of skills that are relevant in some but not all contexts. The way forward then is to test the limits of such conceptualizations: For instance, looking at the content of the biased items in this study, we might hypothesize that items involving the interpretation of experimental designs are more domain-specific than others.

The results also have implications for the evaluation and construction of scientific reasoning tests. Based on this study we can only speculate about the cause of the domain bias, for example, whether motivational factors are at work. To explore this in more detail we might have to analyze the way the items are solved with the help of think-aloud interviews. A study by Adams and Wieman (2015) using problem-solving tasks did exactly that and might serve as an inspiration for such an approach.

As it is very common for scientific reasoning items to be embedded in a domain context (see e.g., Gormally et al., 2012; Schwichow et al., 2016), test creators should pay particular attention to the potential bias this induces, especially with regard to the interpretation of test results. It might be tempting to simply discard all biased items, but this could eliminate important aspects of scientific reasoning, so it should be considered very carefully. As an alternative for long assessments with many questions, it might be feasible to produce domain-specific difficulty estimates for items exhibiting DIF by using the unbiased items as a fixed comparison (Boone et al., 2014).

Besides these theoretical and psychometric implications, we also want to consider the practical implications of our results. First, it should be noted that an absence of measurement invariance does not necessarily imply a simultaneous absence of predictive invariance (Millsap, 1995). Whether the presence of bias means that the bias will be a concern in the application of the test depends on the purpose the test is used for (Borsboom, 2006). Considering that predictive invariance might be present if the focus of an application of the CTSR is on the relationship with other variables, the test might still produce unbiased results for that specific purpose. However, the bias we found should not be neglected. A more convenient way to interpret its size is to transform logit values into the probabilities that a hypothetical person would solve an item with and without the bias: The largest difference in item difficulty was 1.76 logits between medicine and physics in Item 13. A bias of that size means that a person who has a 50% chance of solving the item without the bias would have an 85% chance of solving it when benefiting from the bias (see the short sketch below). Therefore, we would advise against using the CTSR, or similar scientific reasoning tests, in situations that involve comparisons of students who were not previously enrolled in the same major, such as the selection of students for PhD positions, as the results would be of doubtful validity. Another area that might be affected by biased test results is the evaluation of educational programs. The outcomes of the PISA assessment (Organisation for Economic Co-operation and Development [OECD], 2007) had a substantial influence on educational policies in secondary education, and similar assessments are in demand for higher education (Zlatkin-Troitschanskaia et al., 2015). If these assessments aim to measure scientific reasoning for a wide range of students with supposedly domain-general items, it is important to consider the domain bias that is introduced by item context alone in order not to base decisions on biased results.
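
The logit-to-probability conversion referred to above can be reproduced with one line of base R; the value of 1.76 logits is the largest difficulty difference reported in Table 3.

```r
plogis(0)          # a 50% solution probability corresponds to 0 logits
plogis(0 + 1.76)   # adding the 1.76-logit bias yields approximately 0.85
```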

The second goal of this study was to find out how useful IRT-based, and in particular DIF-based, methods are for evaluating the domain generality assumption of a scientific reasoning test. This is in line with others who have pointed toward the importance of IRT-based methods in the assessment of scientific reasoning (Edelsbrunner & Dablander, 2019), and the analyses in this study clearly show the added value of the applied DIF techniques. They helped us reveal biased items, mostly in favor of biology students, which we would not have flagged as biased toward biology majors if we had only considered CTT-based difficulty values. The added value of the DIF techniques becomes even more apparent when we contrast them with simple comparisons of mean values. Such mean comparisons are one way test authors have previously made claims about their assumptions regarding domain specificity or generality (Cloonan & Hutchinson, 2011; Weld et al., 2011). Had we used only this previous method, we might have come to a different bias evaluation than with the DIF-based analyses, namely that the bias is directed against biology students. In contrast, the results from the DIF calculations imply that, on a latent level, our biology majors had lower scientific reasoning skills, and that any differences in comparisons of means would actually be more substantial without the bias that favors biology majors.

Among the strengths of the applied analyses was the detailed information provided at the item level, which was one benefit we aimed to achieve by using more sophisticated analyses. It was this item-level information that allowed us to understand the role of the domain specificity of item contexts, and it can help identify problematic items during test development and evaluation. The high convergence of the two techniques, which rely on different calculations, indicates the reliability of the findings gained from them. Based on these strengths, we recommend continuing to apply the methods developed by Tutz and Schauberger (2015), Tutz and Berger (2016), and Strobl et al. (2015).

In terms of limitations of the present study, we need to mention that the CTSR scores of our participants were at the higher end of possible values. We are not the first to encounter this problem with the CTSR: In a study by Bao et al. (2009), participants from a higher education setting achieved a similar mean score of around 75% of the maximum. Studies in higher education settings likely require scientific reasoning items of more advanced difficulty. Additionally, our reliability value for the CTSR is below the value of .81 that at least one other study achieved (Lawson, Alkhoury, et al., 2000). We want to point out, though, that a consistency of .63 is not without precedent: Lawson, Clark, et al. (2000) recorded a comparable reliability value of .65 in a group of 667 college students. As we cannot rule out that the low difficulty and reliability affected the results, it is worthwhile to consider what this effect might look like. Based on what is known in general about the influence of ceiling effects and low reliability, it seems most reasonable to assume that they lead to a reduction of systematic variance, thereby making it harder to detect differences between groups (Charter, 2003; Šimkovic & Träuble, 2019). Thus, it seems reasonable to assume that the bias we found exists, especially as it was found with two different techniques, but that it might be an underestimation of the overall bias. In particular, we consider it possible that items with a strong physics context, for example, Item 2, might have exhibited bias if their overall difficulty had been higher. It should be stressed that this assumption holds only if the measurement error is purely random; under systematic or differential measurement error, group differences can be either over- or underestimated (van Smeden et al., 2020).

Last, it could be argued that our conclusions are tied to one particular test. In response, we would point to how commonly the CTSR is used and to its similarities to other scientific reasoning tests, which likewise cover skills such as generating and evaluating evidence and drawing conclusions (Opitz et al., 2017). We are therefore confident that this study has consequences beyond just one test and that our conclusions are valid for scientific reasoning assessments in general.

Conclusion

In summary, based on our findings, we advise against using the CTSR in high-stakes situations that involve domain comparisons. Furthermore, we demonstrated that, at the higher education level, DIF offers insights about domain-induced bias that go beyond the insights offered by CTT. DIF methods offer more information not only on tests as a whole but also on specific items. We think this line of research deserves to be continued.

We thank Richard Shavelson for his input on the study design, Stella Bollmann for answering questions about the DIF analysis methods, and Benjamin Schweizer for his help in constructing the online version of the test instruments.

References