Open Access Review Article

Reducing Literature Screening Workload With Machine Learning

A Systematic Review of Tools and Their Performance

Published Online: https://doi.org/10.1027/2151-2604/a000509

Abstract

In our era of accelerated accumulation of knowledge, the manual screening of literature for eligibility is increasingly becoming too labor-intensive for summarizing the current state of knowledge in a timely manner. Recent advances in machine learning and natural language processing promise to reduce the screening workload by automatically detecting unseen references with a high probability of inclusion. As a variety of tools have been developed, the current review provides an overview of their characteristics and performance. A systematic search in various databases yielded 488 eligible reports, revealing 15 tools for screening automation that differed in methodology, features, and accessibility. For the review on the performance of screening tools, 21 studies could be included. In comparison to sampling records randomly, active screening with prioritization approximately halves the screening workload. However, a comparison of tools under equal or at least similar conditions is needed to derive clear recommendations.

Need for Automation in Systematic Reviews

The massive increase in published research findings (Bornmann et al., 2021) pushes traditional research synthesis workflows to their limits. On the one hand, the high velocity of publication demands that research syntheses be produced and updated more quickly (Beller et al., 2018). On the other hand, the high volume of scientific output increasingly exceeds the capacity of human labor for eligibility screening (Borah et al., 2017). Where human information processing reaches its limits, machine learning (ML) and natural language processing (NLP) can help cope with the information overload. In recent years, various tools have been developed to automate different steps of the review process, such as deduplication, data extraction, or reference screening.

Typically, many abstracts have to be screened for a meta-analysis, while only a small proportion is relevant. Hence, there is great potential for reducing the workload: With the help of screening automation tools, only a subset of the data needs to be screened manually. This subset is used to train an algorithm that learns the relationship between text features of the screened papers and the decision to include or exclude them. Applied to the remaining set of unscreened publications, the algorithm predicts the probability of inclusion. The papers that are most likely relevant are then presented to the researcher earlier, so that relevant articles (or most of them) can be found before all search results have been screened. This kind of interaction between the user and the list of prioritized literature items is called active learning (at least in the field of review automation; in computer science, the term has a slightly different meaning). With the continuous user feedback on inclusion or exclusion, the tool improves the accuracy of its relevance predictions, and the automatically generated priority ranking (papers with the highest predicted inclusion probability are displayed first) is updated continuously.
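
To make this loop concrete, the following sketch simulates prioritized screening. It is an illustration only, not the implementation of any particular tool: a plain logistic regression stands in for the tool-specific classifier, and the inputs dfm (a document-feature matrix) and decisions (the 0/1 inclusion decisions a user would provide interactively) are assumed to exist.

```r
# Simplified active screening loop. Assumptions: `dfm` is a numeric
# document-feature matrix; `decisions` holds the 0/1 inclusion decisions a
# user would provide interactively; logistic regression stands in for the
# tool-specific classifier.
simulate_active_screening <- function(dfm, decisions, n_init = 50, batch = 25) {
  screened <- sample(seq_len(nrow(dfm)), n_init)   # random initial training set
  while (length(screened) < nrow(dfm)) {
    train <- data.frame(include = decisions[screened], dfm[screened, , drop = FALSE])
    fit   <- suppressWarnings(glm(include ~ ., data = train, family = binomial()))
    pool  <- setdiff(seq_len(nrow(dfm)), screened)
    prob  <- suppressWarnings(
      predict(fit, newdata = data.frame(dfm[pool, , drop = FALSE]), type = "response")
    )
    # present the unreviewed records with the highest predicted inclusion probability
    next_batch <- pool[order(prob, decreasing = TRUE)][seq_len(min(batch, length(pool)))]
    screened   <- c(screened, next_batch)          # new decisions extend the training data
  }
  screened  # order in which the records were screened
}
```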

A very simple classification model for screening automation might be a logistic regression with every word appearing in the data set (i.e., the vocabulary of the text corpus) as a predictor. This requires transforming each document into a numeric vector whose length equals the size of the vocabulary. For instance, if the vocabulary V comprised a total of six words (V = {this, is, a, very, simple, example}), a document (publication) i with the text di = {this is very very simple} would be transformed into the numeric vector vi = {1, 1, 0, 2, 1, 0}. This vector can be regarded as a document-specific frequency distribution of the vocabulary words. Transforming all documents into numeric vectors results in a document-feature matrix, with documents in rows and features (here: words) in columns. These columns can be used as predictors in a logistic regression model. In reality, the vocabulary can consist of thousands or tens of thousands of words, and plain words combined with logistic regression might not yield the best predictions. This is where machine learning comes into play: Both feature extraction and classification can be optimized. Instead of words, latent topics or semantic features can improve predictive modeling. Instead of plain logistic regression, neural networks (deep learning) or other pattern recognition algorithms can be used to avoid overfitting the training data.
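
A minimal base R sketch reproduces the toy transformation from this example; only the vocabulary and document given in the text are used.

```r
# Toy vectorization from the example above (base R only)
vocab <- c("this", "is", "a", "very", "simple", "example")
doc_i <- "this is very very simple"

# Count how often each vocabulary word occurs in the document
words <- strsplit(doc_i, " ")[[1]]
v_i   <- sapply(vocab, function(w) sum(words == w))
v_i
#>   this      is       a    very  simple example
#>      1       1       0       2       1       0

# Stacking such vectors for all documents yields the document-feature matrix;
# its columns could then serve as predictors, e.g., in
# glm(include ~ ., data = ..., family = binomial()).
```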

One of the first trials to evaluate the usefulness of automated classification for reducing the literature screening workload was conducted by Cohen et al. (2006). Text mining and classification approaches were applied to annotated literature references from 15 systematic drug class reviews to construct a machine learning-based classification system. The study already demonstrated the potential of semiautomated screening to substantially reduce the workload while maintaining a high level of recall, meaning that almost all items of interest can be found without screening the complete collection of abstracts. Since then, algorithms and tools have been developed further and have meanwhile reached maturity (Marshall & Wallace, 2019), leading to research question 1 (RQ1): Which tools have been employed for the automation of the literature selection process for systematic reviews? In which aspects do they differ?

An overview of existing tools for systematic review automation is given by Marshall and Wallace (2019). Because that review covers several steps of systematic reviews rather than focusing explicitly on abstract screening, and because the field is developing rapidly, only six screening tools are mentioned there. Information on the performance of the tools is not provided. In their cost-effectiveness analysis, Shemilt et al. (2016) compare the use of text mining in addition to single screening with the conventional double screening approach and find screening workload reductions of more than 60%. Their comparison, however, was based on only one case study. A systematic review of text mining approaches for study screening reveals similar workload reductions, with potential savings of between 30% and 70% (O’Mara-Eves et al., 2015).

Comparative evaluations of screening performance are mainly available for classifiers and limited sets of example data (e.g., Liu et al., 2018; Yu et al., 2018). Many practitioners, however, are mainly interested in easy-to-use tools with user-friendly web applications and look for guidance and benchmarks on their performance. One of the first studies to explicitly compare three machine learning tools on the same data by means of retrospective screening simulations is Gates et al. (2019). As in Shemilt et al. (2016), supplementing single screening with the use of a tool is compared to double screening. Abstrackr performed best in the three case studies and seems to provide an alternative to double screening. In the simulations, the predictions of the tools were based on 200 screening decisions that were given to each tool as a training set. A subsequent study of Abstrackr with a sample of 16 reviews (Gates et al., 2020a) also indicates that a considerable amount of screening time can be saved, while the risk of excluding a relevant paper is limited.

Another recent study that provides an overview of existing ML screening tools and compares the performance of three of them is Robledo et al. (2021). However, this comparison is based on a single, relatively small case study, resulting in small workload saving potentials and limited generalizability. Only recently, various new applications have emerged (e.g., Research Screener in 2018, ASReview in 2019, and SWIFT Active Screener in 2020), and they are often presented to the research community with at least one case study demonstrating their functionality (e.g., Chai et al. (2021) for Research Screener, van de Schoot et al. (2021) for ASReview, and Howard et al. (2020) for SWIFT Active Screener). This suggests that there should be potential to summarize and compare performance tests of different tools and answer the second research question (RQ2): How have these tools been tested, and what is known about their performance (recall, work savings) so far?

All of these individual studies are highly informative, yet each of them is limited to a small number of tools or lacks comparability between tools because only very few data sets were used as case studies. A further problem is that users in a real screening setting do not know when they have screened enough records (i.e., reached a certain recall). Adequate stopping rules that save a considerable amount of time while still achieving sufficient recall are still under investigation (Callaghan & Müller-Hansen, 2020). More practical applications and further evaluation of the tools are therefore needed for clear guidance and recommendations on their use. The third research question (RQ3) should therefore give practitioners a notion of how much work is needed to achieve sufficient recall: How much work can be reduced while still achieving certain recall rates (92.5%, 95%, 97.5%)?

In this paper, an overview of existing tools for semiautomated abstract screening will be given. For each of the tools identified, a systematic search for previous practical applications and evaluations of their performance will be conducted and results presented and discussed. The overall aim of the review is to draw a broader picture of the use and performance of automation tools for literature screening in systematic reviews.

Method

Inclusion and Exclusion Criteria

This review is reported in accordance with the PRISMA 2020 statement (Page et al., 2021). To define the eligibility criteria for the review, SPIDER (Cooke et al., 2012) was used instead of PICO, as it is less focused on clinical research and better suited for mixed methods evidence syntheses. As can be seen in Table 1, the categories defined as relevant are Sample, Phenomenon of Interest, Design, Evaluation, and Research Type.

Table 1 Eligibility criteria according to the SPIDER tool

The sample of interest for the overview of screening automation tools and their performance consisted of studies that used tools for (semi)automated literature screening. Hence, we excluded studies that described the use of tools only for other steps of a systematic review (e.g., duplicate detection) or that reported the performance of specific algorithms not embedded in user-friendly software tools. The phenomenon of interest was the evaluation of the screening quality of the tools. Relevant study designs were practical applications, such as literature reviews or scoping reviews, with a retrospective evaluation of a (semi)automated screening process, or a comparison of automated screening with available screening decisions from purely manual coding as the gold standard. We excluded subjective evaluations of tools, for example, via interviews or surveys; only objective evaluations were of interest. In the case of screening, measures of recall, precision, and work reduction through the use of the tool were relevant.

Search Strategies and Screening Procedures

To find relevant literature, we queried PubMed (medicine), Web of Science (various fields of science), and PubPsych (psychology and related sciences). In addition, we looked for relevant articles in the journal Research Synthesis Methods. The searches were conducted in November 2021. As search terms, we used the names of relevant tools, descriptions of the focus (abstract screening and systematic review automation), and relevant methods in the field of review automation. A list of the search terms used and the results per database can be found online in PsychArchives in supplementary material 1 (Burgard & Bittermann, 2022a). Apart from the database search, forward and backward searches were conducted. Originating from a starter set of 11 relevant publications in the field, forward searches were conducted via Web of Science and PubMed. Backward searches were conducted via Web of Science and the journal Systematic Reviews. We only included records published from 2006 onward, taking the first trials on screening automation by Cohen et al. (2006) as a starting point.

During abstract screening, one third of these records were screened by two coders. Coder 1 was a postdoc with experience in abstract screening and meta-analysis. Coder 2 was a scientific assistant screening for the first time. The total inter-rater agreement was 77%. Accounting for chance agreement, Cohen’s κ (Cohen, 1960) was 0.48, representing moderate agreement (Landis & Koch, 1977). All discrepancies regarding the 700 double-screened abstracts were checked and discussed in detail. This procedure led to complete consensus and confirmed the decisions of Coder 1 in all cases. Most importantly, no relevant record went undetected. Coder 1 thus coded the remaining records.

During full text screening, some basic information was coded right away, such as the type of research, the tool (if any) used or described, and – in case of exclusion – the reason for exclusion. In addition to Coder 1 from the abstract screening, another postdoc with less experience in literature screening participated in the full text screening. A Google spreadsheet was used to collaboratively code the needed information from the full texts. In case of uncertainties, the respective fields were marked and open questions were collected. These comments were discussed and resolved in regular meetings.

Data Items and Performance Measurement

For the finally included records, additional information was collected. Regarding the report, information such as the first author, the publication year, and the type of study was coded. Each set of candidate papers used for the screening process in the selected reports was identified with an ID. This was necessary because some available literature sets are used in several studies. The topic of interest and the screening tool used in the study were also coded. To describe the screening process, the number of candidate papers, the number of papers screened manually, and the number of studies classified as relevant were collected for each study.

To evaluate the performance of algorithms and tools for the screening process, we compared tool-assisted screening to complete manual screening. A number of measures can be used to quantify performance, depending on the information given in the individual reports (a computational sketch follows the list):

  • The true number of relevant studies identified: This is the gold standard that the results from the corresponding experiment are compared to.
  • The number of true positives (TP, relevant papers found), true negatives (TN, irrelevant papers correctly classified), false positives (FP, irrelevant papers classified as relevant), and false negatives (FN, relevant papers missed; Swets et al., 2000).
  • Precision (P): TP/(TP + FP): The share of relevant papers from all papers classified as relevant (Liu et al., 2018).
  • Recall (R): TP/(TP + FN): The share of papers found from all relevant papers (Przybyła et al., 2018).
  • X95: Number of papers needed to be reviewed for 95% recall (Yu et al., 2018).
  • WSS: Work saved over sampling for a certain recall: ((Number of candidate papers − Number of manual screenings)/Number of candidate papers) − (1 − Recall achieved) (Van de Schoot et al., 2021).
  • F1: Balanced F-Score: (2 × P × R)/(P + R) (Liu et al., 2018).
  • F2: Balanced F-Score with higher weight on recall: (5 × P × R)/(4 × P + R) (Van Rijsbergen, 1979).
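
As a minimal sketch (in base R, with purely hypothetical counts), the listed measures relate to each other as follows; mapping the manually screened set to the records classified as relevant is a simplifying assumption for illustration.

```r
# Hypothetical screening outcome (all counts invented for illustration)
tp <- 95; fp <- 400; fn <- 5; tn <- 1500
n_candidates <- tp + fp + fn + tn  # all retrieved records
n_screened   <- tp + fp            # simplifying assumption: screened = classified as relevant

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
wss <- (n_candidates - n_screened) / n_candidates - (1 - recall)
f1  <- 2 * precision * recall / (precision + recall)
f2  <- 5 * precision * recall / (4 * precision + recall)
c(precision = precision, recall = recall, WSS = wss, F1 = f1, F2 = f2)
# approximately: precision 0.19, recall 0.95, WSS 0.70, F1 0.32, F2 0.53
```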

As Shemilt et al. (2016) point out, cost-effectiveness in abstract screening means that the number of manual screenings (cost) should be reduced while avoiding false negatives (effect). The goal of the screening process is to find all relevant studies. Therefore, when evaluating the effectiveness of an assisted literature screening process, a high recall is more important than a large reduction of manual labor. Put differently, the goal is to reduce screening costs as long as a high recall can still be maintained. This is why the F2-score is preferable to the F1-score in the context of evaluating literature screening tools, as it puts more weight on recall.

Review Methods and Analyses

To answer the first research question on which tools have been employed for the automation of the literature selection process for systematic reviews and in which aspects they differ, all reports that described these tools or reported their application were used. On the basis of these reports, the characteristics and use of these tools were summarized.

For the second and third research questions (actual performance/work reduction), data from a complete manual screening were used as the gold standard against which the machine-assisted screening was compared. A report could only be included in the analyses of tool effectiveness if sufficient data on this comparison were available to compute F-scores or workload savings. In case of incomplete reports, we computed precision, recall, workload savings, and F-scores as described above in the list of performance measures.

To review the evidence on the performance of the screening tools, the tools are compared regarding the Work Saved over Sampling (WSS) reported or computed in their evaluations. In this context, the association between the share of manually screened abstracts and the recall rate is examined. This gives a first overview of how effectively the tools reduce manual labor while considering the achieved recall rates. The performance of different tools regarding WSS is then compared. Furthermore, a question of practical relevance is how many abstracts have to be screened manually to achieve recall rates that are considered acceptable. Therefore, the association between the number of manual screenings and recall is examined further.

As an alternative measure for the performance of machine-assisted screening, the F2-score is compared to the WSS. Due to limitations in the data given in the reports, there are many missing values for the F2-score. Nevertheless, the association between F2 and WSS in the complete data set can be assessed to identify and discuss differences between the two measures to evaluate screening tools.

Finally, there are data sets that were used for the evaluation of different tools: in some cases within the same study (e.g., Gates et al., 2019; Robledo et al., 2021) and sometimes also between studies, for example, with the data sets of Cohen et al. (2006) (e.g., Howard et al., 2016). These direct comparisons will be investigated further to examine whether there are tools that are consistently better than others or whether the performance of tools depends on the actual data set.

All analyses were performed using the statistical programming language R (R Core Team, 2022), version 4.0.2. The analysis script and data are made available online in the supplementary materials (Burgard & Bittermann, 2022b). Additional packages used during the analyses are readxl (Wickham & Bryan, 2019), ggplot2 (Wickham, 2016), dplyr (Wickham et al., 2020), GGally (Schloerke et al., 2020), and ggExtra (Attali & Baker, 2022).

Results

After deduplication of the records found in database searches, forward searches, and backward searches, 2,101 records remained to be screened. The results of the screening process are depicted in the PRISMA flowchart in Figure 1. During abstract screening, 1,602 records were excluded in total. This led to 499 reports that were sought for retrieval for full text screening. Of these, 11 reports could not be retrieved. Based on the remaining 488 reports, we compiled a list of tools that offer screening automation (see the Overview and Functioning of Screening Automation Tools section).

Figure 1 PRISMA 2020 flowchart.

To examine the empirical evidence on the performance of these tools, the 488 reports were assessed for eligibility by two coders. The main reasons for exclusion were that tools were used only to support collaborative screening rather than to reduce the number of reports to be screened via machine-assisted prioritization. Furthermore, some reports did not use screening tools at all, or the study was not about screening. In the end, only 21 studies could be used for the review on the performance of screening tools (see the Applications and Performance of Screening Automation Tools section).

Overview and Functioning of Screening Automation Tools

First of all, to answer the first research question, the existing tools for abstract screening automation were identified. While the 15 tools found offer many functionalities that support researchers during all phases of literature review (e.g., deduplication, collaborative screening and conflict resolution, full text integration), we focused on features directly related to screening automation. As can be seen in Table 2, the most popular tools were Covidence (n = 181, 37.09% of the 488 reports), Rayyan (n = 66, 13.52%), and Abstrackr (n = 17, 3.48%). However, not all studies utilized the specific screening automation features of these tools; rather, they used them for organizing the review process, deduplication, and manual screening. The main commonalities and special features of these and the remaining tools will be described in the following, focusing on the role of machine learning and the type of (semi)automation. In Table 2, we report at which step machine learning is used by the respective tool.

Table 2 Basic information on tools to assist literature screening

Abstrackr (Wallace et al., 2012, 2013), DistillerSR, Rayyan (Ouzzani et al., 2016), EPPI-Reviewer (Thomas et al., 2022), Covidence, and RobotAnalyst (Przybyła et al., 2018) are online tools for screening automation that employ support vector machines (SVM) for predicting the inclusion probability of unreviewed references. Simply put, the manually reviewed references (i.e., training data) are used to identify word combinations that differentiate between included and excluded references. Once the predictive model is computed, inclusion probabilities are assigned to the remaining records. Thus, unreviewed references can be sorted according to their relevance. By providing more decisions on inclusion or exclusion, the performance of the model can be further improved (active learning). While Abstrackr and Rayyan can be used for free after registration, DistillerSR, EPPI-Reviewer, and Covidence require a fee. For using RobotAnalyst, the provider needs to be contacted.

With a focus on medical and health science, Covidence and RobotSearch (Marshall et al., 2018) use machine learning to identify randomized controlled trials (RCTs) in the data set. Both tools integrate an SVM ensemble model (i.e., different SVMs combined) trained on title-abstract records from biomedicine and pharmacology that were manually labeled by the Cochrane community. The model presented in Marshall et al. (2018) was the basis for the Cochrane RCT Classifier (Thomas et al., 2021), which was integrated in Covidence and EPPI-Reviewer. Thomas et al. (2021) report a recall of 99% for the classifier. Regarding screening automation, these tools are limited to reviews focusing on RCTs. With a similar focus on syntheses of biomedical research, RobotReviewer (Marshall et al., 2016) employs machine learning (SVM) for risk of bias assessment.

In contrast to the previously mentioned web tools, Research Screener (Chai et al., 2021), Colandr (Cheng et al., 2018), and Concept Encoder (Yamada et al., 2020) use machine learning algorithms only for feature extraction. Specifically, variants of word embeddings are employed, an NLP procedure that transforms words into numerical vectors. In this way, semantic nearness can be expressed numerically (e.g., by correlating two words’ vectors). It is noteworthy that Research Screener requires seed articles, that is, references known to be representative of inclusion or exclusion, and that Colandr offers the possibility of full text screening. ML-based NLP for feature extraction is also part of RobotAnalyst (Przybyła et al., 2018), RobotReviewer (Marshall et al., 2016), and ASReview (van de Schoot et al., 2021). Moreover, ASReview offers multiple machine learning algorithms for both feature extraction and classification and has a strong open science focus.
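
As a toy illustration of how semantic nearness can be expressed numerically, the following sketch compares hypothetical embedding vectors via cosine similarity (closely related to correlating the vectors); the words and all vector values are invented for illustration.

```r
# Cosine similarity between two word vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical 5-dimensional embeddings (real embeddings have hundreds of dimensions)
vec_therapy   <- c(0.8, 0.1, 0.3, 0.5, 0.0)
vec_treatment <- c(0.7, 0.2, 0.4, 0.4, 0.1)
vec_banana    <- c(0.0, 0.9, 0.1, 0.0, 0.8)

cosine_sim(vec_therapy, vec_treatment)  # high value: semantically close
cosine_sim(vec_therapy, vec_banana)     # low value: semantically distant
```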

SWIFT-Active Screener (Howard et al., 2020) is a web-based tool that features a recall estimation for the included documents, thus guiding reviewers when to stop the screening process. The related SWIFT-Review software (Howard et al., 2016) requires a local installation and has several drawbacks compared to SWIFT-Active Screener (i.e., a large training set is needed, there is no recall estimation, and it is a single-user application). A recall estimation feature is also included in FAST2 (Yu & Menzies, 2019), an SVM-based tool that expanded FASTREAD (Yu et al., 2018).

Different from the other tools, the R package revtools (Westgate, 2019) employs only topic modeling to facilitate the screening process. Users can inspect the articles’ topics in an R Shiny app (Chang et al., 2021) to find the documents that are representative of the topic of interest. Conversely, a topic that includes terms indicating exclusion can be used to exclude the respective articles. This method is unsupervised, meaning that no training data of manually screened articles are necessary. On the downside, it does not make specific predictions of inclusion. Hence, the author stresses that revtools should be used for very broad classification only. Although users can exclude articles, words, and topics and recalculate the topic model at any time, active learning in the sense of improving predictions of inclusion during the screening process is not possible.

While screening the 488 studies for this overview, we identified two additional machine-learning-based tools. They are not included in Table 2, as information on them was scarce or could not be retrieved, and both were reported only once. These tools are Twister (Kreiner et al., 2018), a tool for increasing the rate of screening using visual aids, and JBI SUMARI (Munn et al., 2019), a fee-based commercial tool with automated risk of bias assessment. In supplementary material 2 (Burgard & Bittermann, 2022a), we provide a list of literature on screening automation, sorted by tool.


Applications and Performance of Screening Automation Tools

In the following, the evidence on the performance of screening automation tools from previous applications will be summarized in a systematic review. The dependent variable that best describes the trade-off between screening effort and recall is the WSS introduced in the Method section. It indicates how much work could be saved by using a screening tool compared to randomly sampling abstracts for screening. Without prioritization, it can be expected that after screening 50% of the records, about 50% of the relevant papers are found. If, due to prioritization, 50% of the relevant papers are already found after screening 25% of the records, the WSS is 25%. The 21 studies that reported performance measures of screening tools yielded 250 results that allowed the computation of WSS. The mean WSS is 0.55, 95% CI [0.51, 0.58]. In half of the reported trials, at least 58% of the manual screening could be saved.

Study designs and reported data were diverse across the studies. To ensure comparability, we used the WSS as the measure of the trade-off between screening costs and recall throughout. The WSS was reported in eight of the 21 studies (38.10%). For the remaining 13 studies, we computed the WSS using the formula reported in van de Schoot et al. (2021). Recall was reported in about half of the studies (11 of 21, 52.38%); for the remaining studies, we computed recall following Przybyła et al. (2018). Concerning the assumption of when to stop screening, studies most often targeted a certain recall value (10 of 21, 47.62%). Seven studies (33.33%) stopped screening after a certain number of papers, and four studies (19.05%) relied on a relevance value provided by the tool.

A first overview of the costs and benefits of using semiautomated screening tools in relation to WSS is given in Figure 2. It depicts the relationship between the share of manually screened abstracts, recall, and WSS. There is a positive association between the number of manual screenings given as input and the achieved recall rate (r = .24, t = 3.84, df = 248, p < .001): The more training data a tool gets, the better its predictions. However, there are also cases with a small share of manual screenings and a high recall; these are the cases that achieve high WSS values. The black dotted diagonal, on which the share of screened abstracts equals the recall rate, represents a WSS of 0, as this result would be achieved by chance without prioritization of the records. The more distant a result is from this diagonal, the more work could be saved due to automation.

Figure 2 Costs and benefits of using semiautomated screening tools.

On the right of the plot, the univariate distribution of recall is shown. Most of the trials (76%) achieved recall rates of 95% or more. In the experiments with known true decisions, some studies defined a certain recall rate in advance and tested how many screenings were needed to achieve it. Had a threshold for the size of the training data or a stopping criterion for manual screening been set without knowing the true decisions, the distribution of recall would probably look different.

In Figure 3, the tools with at least 16 data points are compared with respect to WSS. The WSS relates the workload for manual screening to the achieved recall rate and is thus well suited as a performance measure of the tools. Four of the tools achieve median work savings of about 50%. However, the range of WSS per tool is wide. Abstrackr shows the greatest variation and is at the same time the tool with the most data points (n = 94), suggesting that the heterogeneity in the results might be caused by differences between the studies. The same holds for the conspicuous results for Concept Encoder, which shows very low variation between results and very high work savings. The results on Concept Encoder stem from only one study with eight different data sets from the field of medicine (Yamada et al., 2020). Several trials of Concept Encoder with different training data were conducted, and the performance was consistently high.

Figure 3 Work saved over sampling (WSS) per tool. Note. The results for Concept Encoder stem from different data sets reported in only one study.

Potential differences between the studies that could explain the variation in results were analyzed and can be found in supplementary material 3 (Burgard & Bittermann, 2022a). For example, some studies compare the screening results to the final inclusions after full text screening (FTS) to measure how many finally included records were actually missed due to semiautomating the screening. Others take the results of a manual title and abstract screening (TAS) as the reference point, directly comparing manual and semiautomated TAS. Figure S1 in supplementary material 3 (Burgard & Bittermann, 2022a) shows that more manual screening is needed to achieve recall rates of 95% and 100%, respectively, when the reference point is the TAS. This is plausible, as more potentially relevant studies are found during TAS than are finally included in a review; the tool therefore has to find more records to achieve a certain recall rate, making TAS the stricter reference point. Another variable that correlates with WSS is the number of candidate papers (r = .22, t = 3.62, df = 248, p < .001). This is also plausible: In screening projects with many candidate papers, a lower share of manual screenings already provides a sufficient amount of training data, and the savings that can then be achieved with the help of the tool are larger. Put differently, providing training data to a tool pays off more when more additional records remain to be screened by the tool afterward. Finally, the performance of a tool can also vary with the set of records to be screened. It may depend, for example, on the coherence of the finally relevant records or the structure of the abstracts in the data set. Therefore, we also conducted case studies comparing certain tools that were used on the same data sets (see Figure S2 and Figure S3 in supplementary material 3; Burgard & Bittermann, 2022a).

Taking the results across all tools, a question of practical relevance is how many screenings are needed to reach certain recall rates. Figure 4 depicts the association between the share of manual screenings and the achieved recall rate. The blue line is the estimated linear association, with the confidence interval of the estimate shown as the shaded area around the line. The turquoise dotted lines indicate that the estimated recall of 90% is reached after about 12% of screenings. Put differently, 90% recall can be achieved while saving 78% of the manual screening. The yellow dotted lines illustrate that for 95% recall, about 60% of the abstracts have to be screened manually. This equals a WSS of only 35%, yet 95% recall is typically considered acceptable.

Figure 4 Share of screenings to achieve certain recall rates. Note. The dashed lines indicate the share of manual screened abstracts to achieve recall of .90 and .95, respectively.
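
As a quick arithmetic check (a sketch using the WSS definition from the Method section; the screening shares and recall values are those read off Figure 4):

```r
# WSS from the share of abstracts screened (s) and the achieved recall (r),
# following the formula given in the Method section
wss <- function(s, r) (1 - s) - (1 - r)
wss(s = 0.12, r = 0.90)  # approx. 0.78: 78% work saved for 90% recall
wss(s = 0.60, r = 0.95)  # approx. 0.35: 35% work saved for 95% recall
```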

A noticeable aspect of Figure 4 is the black dots that lie far below the estimated regression line. Supplementary material 3 shows that these outliers mainly stem from the experiments with the Prostate Cancer data set in Reddy et al. (2020). Figure S4 shows the difference in the estimates between the Prostate Cancer data set and all other results. Figure S5 then replicates the analyses of Figure 4 without the 37 data points from the Prostate Cancer data set. This leads to a WSS of 85% for 90% recall and a WSS of 52% for 95% recall. As in Figure 4, about 40% more abstracts have to be screened to achieve 5% more recall. However, screening about half of the candidate abstracts results in at least 95% recall.

A typical measure for the performance of algorithms is the F2-score, as defined in the Method section. However, as F2 depends on recall and precision, and the latter could not be calculated from the data given in all studies, only n = 76 data points are available for F2. For these 76 comparisons, Figure 5 depicts the relationship between F2 and WSS. Despite the positive relationship between the two performance measures (r = .29, t = 2.61, df = 74, p < .01), the graphical display makes clear that F2 depends mainly on precision (r = .88, t = 15.99, df = 74, p < .001), whereas WSS is related about equally to recall (r = .23, t = 3.75, df = 248, p < .001) and to precision (r = .23, t = 2.03, df = 76, p = .046). Although F2 already weights recall more heavily than the F1-score (which is also used as a performance measure for algorithms), the association between recall and F2 is rather low (r = .32, t = 2.93, df = 74, p < .01) compared to that between precision and F2. In the case of screening, however, recall is more important. Thus, WSS is not only preferable because of the more complete data but is also better suited as a measure of the performance of screening tools.

Figure 5 Relationship between F2-score and work saved over sampling (WSS).

Discussion

In this study, we aimed to give an overview of existing tools for semiautomated abstract screening, as well as of their previous practical applications and performance.

In summary, the review identified 15 tools that can assist with literature screening (RQ1). Most tools employ machine learning classifiers (support vector machines in particular) as the backbone for predicting the inclusion probabilities of unreviewed references. Some tools also utilize machine-learning-based natural language processing to preprocess the data and improve predictions. Tools that are free to use and can be accessed in the web browser without local installation will be attractive to many users, especially researchers trying screening automation for the first time or researchers and students with limited financial resources. We found five such tools: Rayyan, Abstrackr, Colandr, Research Screener, and RobotReviewer/RobotSearch. With the exception of the latter, these tools also feature active learning.

For most researchers, an easy-to-use web application at reasonable cost will be paramount, alongside a reliable and time-saving reduction of the screening load. Our review indicates that, in comparison to sampling records randomly, active screening with prioritization approximately halves the screening workload (RQ2). In most experiments, recall rates of 95% and more were achieved while the number of papers to be screened manually was reduced by approximately 50%. For 90% recall, about 10% of screenings are already sufficient, yielding work savings of about 80% (RQ3).

The comparison of the tools regarding their performance in terms of WSS did not yield clear recommendations. Instead, characteristics of the experiments and of the set of records to be screened seem to play a crucial role for the performance of the tools. One limitation of the review is thus that these potentially influential factors were not controlled for. Moreover, there were few targeted experiments that compared different tools or assessed performance over the course of screening, which would allow conclusions about potential stopping rules. Another obvious limitation is that for Rayyan, the active screening tool mentioned most often in the literature overview (see Table 2), only one data point was available in the experiments reporting performance results.

For future research, it is of interest how easily a direct and comprehensive comparison of tools could be implemented. In Table 2, we listed the specific machine learning methods that were employed. However, it remains unclear which specific parameter settings for these methods are relevant, as most tools are not licensed as open source software. Only three tools have open source code available: ASReview (Apache 2.0 license), FASTREAD (MIT license), and RobotReviewer/RobotSearch (GPL-3.0 license). For Colandr, only the front end is published under MIT license.

All in all, machine learning offers great potential to reduce screening time, especially for systematic reviews with many candidate papers. However, some open questions remain and need further investigation. Above all, comparing different tools under equal or at least similar conditions and across various data sets is of interest to gain a better understanding of how the structure of the search results affects the performance of different tools. In particular, the application of different stopping rules should be part of a planned experiment to evaluate at which point the tools have enough training data and whether this point varies between tools.

The authors wish to thank Julian Scherhag, Sarah Marie Müller, and Lisa Clef for their assistance during the screening process and in preparing the manuscript.

References

*Reports used for the review of performance evaluation

  • Attali, D., & Baker, C. (2022). ggExtra: Add Marginal Histograms to 'ggplot2', and More 'ggplot2' Enhancements. R package version 0.10.0. https://CRAN.R-project.org/package=ggExtra

  • Beller, E., Clark, J., Tsafnat, G., Adams, C., Diehl, H., Lund, H., Ouzzani, M., Thayer, K., Thomas, J., Turner, T., Xia, J., Robinson, K., & Glasziou, P., & Founding Members of the ICASR Group. (2018). Making progress with the automation of systematic reviews: Principles of the International Collaboration for the Automation of Systematic Reviews (ICASR). Systematic Reviews, 7, Article 77. 10.1186/s13643-018-0740-7

  • Borah, R., Brown, A. W., Capers, P. L., & Kaiser, K. A. (2017). Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open, 7(2), Article e012545. 10.1136/bmjopen-2016-012545

  • Bornmann, L., Haunschild, R., & Mutz, R. (2021). Growth rates of modern science: A latent piecewise growth curve approach to model publication numbers from established and new literature databases. Humanities and Social Sciences Communications, 8, Article 224. 10.1057/s41599-021-00903-w

  • Burgard, T., & Bittermann, A. (2022a). Supplemental materials to "Reducing literature screening workload with machine learning: A systematic review of tools and their performance". https://doi.org/10.23668/psycharchives.8405

  • Burgard, T., & Bittermann, A. (2022b). Supplemental material to "Reducing literature screening workload with machine learning: A systematic review of tools and their performance". https://doi.org/10.23668/psycharchives.8404

  • Burgard, T., & Bittermann, A. (2022c). Supplemental materials to "Reducing literature screening workload with machine learning: A systematic review of tools and their performance". https://doi.org/10.23668/psycharchives.8406

  • Callaghan, M. W., & Müller-Hansen, F. (2020). Statistical stopping criteria for automated screening in systematic reviews. Systematic Reviews, 9, Article 273. 10.1186/s13643-020-01521-4

  • *Chai, K. E. K., Lines, R. L. J., Gucciardi, D. F., & Ng, L. (2021). Research screener: A machine learning tool to semi-automate abstract screening for systematic reviews. Systematic Reviews, 10, Article 93. 10.1186/s13643-021-01635-3

  • Chang, W., Cheng, J., Allaire, J., Sievert, C., Schloerke, B., Xie, Y., Allen, J., McPherson, J., Dipert, A., & Borges, B. (2021). shiny: Web Application Framework for R. R package version 1.7.1. https://CRAN.R-project.org/package=shiny

  • Cheng, S. H., Augustin, C., Bethel, A., Gill, D., Anzaroot, S., Brun, J., Dewilde, B., Minnich, R. C., Garside, R., Masuda, Y. J., Miller, D. C., Wilkie, D., Wongbusarakum, S., & McKinnon, M. C. (2018). Using machine learning to advance synthesis and use of conservation and environmental evidence. Conservation Biology, 32(4), 762–764. 10.1111/cobi.13117

  • Cohen, A. M., Hersh, W. R., Peterson, K., & Yen, P. Y. (2006). Reducing workload in systematic review preparation using automated citation classification. Journal of the American Medical Informatics Association, 13(2), 206–219. 10.1197/jamia.m1929

  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46. 10.1177/001316446002000104

  • Cooke, A., Smith, D., & Booth, A. (2012). Beyond PICO: The SPIDER tool for qualitative evidence synthesis. Qualitative Health Research, 22(10), 1435–1443. 10.1177/1049732312452938

  • *Gartlehner, G., Wagner, G., Lux, L., Affengruber, L., Dobrescu, A., Kaminski-Hartenthaler, A., & Viswanathan, M. (2019). Assessing the accuracy of machine-assisted abstract screening with DistillerAI: A user study. Systematic Reviews, 8(1), Article 277. 10.1186/s13643-019-1221-3

  • Gates, A., Gates, M., DaRosa, D., Elliott, S. A., Pillay, J., Rahman, S., Vandermeer, B., & Hartling, L. (2020a). Decoding semi-automated title-abstract screening: Findings from a convenience sample of reviews. Systematic Reviews, 9(1), Article 272. 10.1186/s13643-020-01528-x

  • *Gates, A., Guitard, S., Pillay, J., Elliott, S. A., Dyson, M. P., Newton, A. S., & Hartling, L. (2019). Performance and usability of machine learning for screening in systematic reviews: A comparative evaluation of three tools. Systematic Reviews, 8(1), Article 278. 10.1186/s13643-019-1222-2

  • *Gates, A., Gates, M., Sebastianski, M., Guitard, S., Elliott, S. A., & Hartling, L. (2020b). The semi-automation of title and abstract screening: A retrospective exploration of ways to leverage Abstrackr's relevance predictions in systematic and rapid reviews. BMC Medical Research Methodology, 20(1), Article 139. 10.1186/s12874-020-01031-w

  • *Gates, A., Johnson, C., & Hartling, L. (2018). Technology-assisted title and abstract screening for systematic reviews: A retrospective evaluation of the Abstrackr machine learning tool. Systematic Reviews, 7(1), Article 45. 10.1186/s13643-018-0707-8

  • *Giummarra, M. J., Lau, G., & Gabbe, B. J. (2020). Evaluation of text mining to reduce screening workload for injury-focused systematic reviews. Injury Prevention, 26(1), 55–60. 10.1136/injuryprev-2019-043247

  • *Hamel, C., Kelly, S. E., Thavorn, K., Rice, D. B., Wells, G. A., & Hutton, B. (2020). An evaluation of DistillerSR's machine learning-based prioritization tool for title/abstract screening—Impact on reviewer-relevant outcomes. BMC Medical Research Methodology, 20(1), Article 256. 10.1186/s12874-020-01129-1

  • *Howard, B. E., Phillips, J., Miller, K., Tandon, A., Mav, D., Shah, M. R., Holmgren, S., Pelch, K. E., Walker, V., Rooney, A. A., Macleod, M., Shah, R. R., & Thayer, K. (2016). SWIFT-Review: A text-mining workbench for systematic review. Systematic Reviews, 5, Article 87. 10.1186/s13643-016-0263-z

  • *Howard, B. E., Phillips, J., Tandon, A., Maharana, A., Elmore, R., Mav, D., Sedykh, A., Thayer, K., Merrick, B. A., Walker, V., Rooney, A., & Shah, R. R. (2020). SWIFT-Active Screener: Accelerated document screening through active learning and integrated recall estimation. Environment International, 138, Article 105623. 10.1016/j.envint.2020.105623

  • Kreiner, K., Hayn, D., & Schreier, G. (2018). Twister: A tool for reducing screening time in systematic literature reviews. Studies in Health Technology and Informatics, 255, 5–9. 10.3233/978-1-61499-921-8-5

  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. 10.2307/2529310

  • Liu, J., Timsina, P., & El-Gayar, O. (2018). A comparative analysis of semi-supervised learning: The case of article selection for medical systematic reviews. Information Systems Frontiers, 20, 195–207. 10.1007/s10796-016-9724-0

  • Marshall, I. J., Kuiper, J., & Wallace, B. C. (2016). RobotReviewer: Evaluation of a system for automatically assessing bias in clinical trials. Journal of the American Medical Informatics Association: JAMIA, 23(1), 193–201. 10.1093/jamia/ocv044

  • Marshall, I. J., Noel-Storr, A., Kuiper, J., Thomas, J., & Wallace, B. C. (2018). Machine learning for identifying randomized controlled trials: An evaluation and practitioner's guide. Research Synthesis Methods, 9(4), 602–614. 10.1002/jrsm.1287

  • Marshall, I. J., & Wallace, B. C. (2019). Toward systematic review automation: A practical guide to using machine learning tools in research synthesis. Systematic Reviews, 8(1), 1–10. 10.1186/s13643-019-1074-9

  • Munn, Z., Aromataris, E., Tufanaru, C., Stern, C., Porritt, K., Farrow, J., Lockwood, C., Stephenson, M., Moola, S., Lizarondo, L., McArthur, A., Peters, M., Pearson, A., & Jordan, Z. (2019). The development of software to support multiple systematic review types: The Joanna Briggs Institute system for the Unified Management, assessment and Review of Information (JBI SUMARI). JBI Evidence Implementation, 17(1), 36–43. 10.1097/XEB.0000000000000152

  • *Odintsova, V. V., Roetman, P. J., Ip, H. F., Pool, R., Van der Laan, C. M., Tona, K.-D., Vermeiren, R. R. J. M., & Boomsma, D. I. (2019). Genomics of human aggression: Current state of genome-wide studies and an automated systematic review tool. Psychiatric Genetics, 29(5), 170–190. 10.1097/YPG.0000000000000239

  • O'Mara-Eves, A., Thomas, J., McNaught, J., Miwa, M., & Ananiadou, S. (2015). Using text mining for study identification in systematic reviews: A systematic review of current approaches. Systematic Reviews, 4, Article 5. 10.1186/2046-4053-4-5

  • Ouzzani, M., Hammady, H., Fedorowicz, Z., & Elmagarmid, A. (2016). Rayyan—a web and mobile app for systematic reviews. Systematic Reviews, 5(1), Article 210. 10.1186/s13643-016-0384-4

  • Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., Shamseer, L., Tetzlaff, J. M., Akl, E. A., Brennan, S. E., Chou, R., Glanville, J., Grimshaw, J. M., Hróbjartsson, A., Lalu, M. M., Li, T., Loder, E. W., Mayo-Wilson, E., McDonald, S., McGuiness, L. A., … Moher, D. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ, 372, Article n71. 10.1136/bmj.n71

  • *Przybyła, P., Brockmeier, A. J., Kontonatsios, G., Le Pogam, M., McNaught, J., von Elm, E., Nolan, K., & Ananiadou, S. (2018). Prioritising references for systematic reviews with RobotAnalyst: A user study. Research Synthesis Methods, 9(3), 470–488. 10.1002/jrsm.1311

  • R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

  • *Rathbone, J., Hoffmann, T., & Glasziou, P. (2015). Faster title and abstract screening? Evaluating Abstrackr, a semi-automated online screening program for systematic reviewers. Systematic Reviews, 4, Article 80. 10.1186/s13643-015-0067-6

  • *Reddy, S. M., Patel, S., Weyrich, M., Fenton, J., & Viswanathan, M. (2020). Comparison of a traditional systematic review approach with review-of-reviews and semi-automation as strategies to update the evidence. Systematic Reviews, 9(1), Article 243. 10.1186/s13643-020-01450-2

  • *Robledo, S., Aguirre, A. M. G., Hughes, M., & Eggers, F. (2021). "Hasta la vista, baby"—Will machine learning terminate human literature reviews in entrepreneurship? Journal of Small Business Management. 10.1080/00472778.2021.1955125

  • Schloerke, B., Cook, D., Larmarange, J., Briatte, F., Marbach, M., Thoen, E., Elberg, A., & Crowley, J. (2020). GGally: Extension to 'ggplot2'. R package version 2.0.0. https://CRAN.R-project.org/package=GGally

  • Shemilt, I., Khan, N., Park, S., & Thomas, J. (2016). Use of cost-effectiveness analysis to compare the efficiency of study identification methods in systematic reviews. Systematic Reviews, 5(1), Article 140. 10.1186/s13643-016-0315-4

  • Swets, J. A., Dawes, R. M., & Monahan, J. (2000). Psychological science can improve diagnostic decisions. Psychological Science in the Public Interest, 1(1), 1–26. 10.1111/1529-1006.001

  • Thomas, J., Graziosi, S., Brunton, J., Ghouze, Z., O'Driscoll, P., Bond, M., & Koryakina, A. (2022). EPPI-Reviewer: Advanced software for systematic reviews, maps and evidence synthesis. EPPI-Centre, UCL Social Research Institute, University College London.

  • Thomas, J., McDonald, S., Noel-Storr, A., Shemilt, I., Elliott, J., Mavergames, C., & Marshall, I. J. (2021). Machine learning reduced workload with minimal risk of missing studies: Development and evaluation of a randomized controlled trial classifier for Cochrane Reviews. Journal of Clinical Epidemiology, 133(May), 140–151. 10.1016/j.jclinepi.2020.11.003

  • *Tsou, A. Y., Treadwell, J. R., Erinoff, E., & Schoelles, K. (2020). Machine learning for screening prioritization in systematic reviews: Comparative performance of Abstrackr and EPPI-Reviewer. Systematic Reviews, 9(1), Article 73. 10.1186/s13643-020-01324-7

  • *van de Schoot, R., de Bruin, J., Schram, R., Zahedi, P., de Boer, J., Weijdema, F., Kramer, B., Huijts, M., Hoogerwerf, M., Ferdinands, G., Harkema, A., Willemsen, J., Ma, Y., Fang, Q., Hindriks, S., Tummer, L., & Oberski, D. L. (2021). An open source machine learning framework for efficient and transparent systematic reviews. Nature Machine Intelligence, 3(2), 125–133. 10.1038/s42256-020-00287-7

  • Van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). Butterworth-Heinemann. http://www.dcs.gla.ac.uk/Keith/Preface.html

  • Wallace, B. C., Dahabreh, I. J., Moran, K. H., Brodley, C. E., & Trikalinos, T. A. (2013). Active literature discovery for scoping evidence reviews: How many needles are there? KDD Workshop on Data Mining for Healthcare. http://chbrown.github.io/kdd-2013-usb/workshops/DMH/doc/dmh217_wallace.pdf

  • *Wallace, B. C., Small, K., Brodley, C. E., Lau, J., & Trikalinos, T. A. (2012, January). Deploying an interactive machine learning system in an evidence-based practice center: Abstrackr. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium (pp. 819–824). Association for Computing Machinery. 10.1145/2110363.2110464

  • Westgate, M. J. (2019). revtools: An R package to support article screening for evidence synthesis. Research Synthesis Methods, 10(4), 606–614. 10.1002/jrsm.1374

  • Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag.

  • Wickham, H., & Bryan, J. (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl

  • Wickham, H., François, R., Henry, L., & Müller, K. (2020). dplyr: A Grammar of Data Manipulation. R package version 1.0.0. https://CRAN.R-project.org/package=dplyr

  • *Yamada, T., Yoneoka, D., Hiraike, Y., Hino, K., Toyoshiba, H., Shishido, A., Noma, H., Shojima, N., & Yamauchi, T. (2020). Deep neural network for reducing the screening workload in systematic reviews for clinical guidelines: Algorithm validation study. Journal of Medical Internet Research, 22(12), Article e22422. 10.2196/22422

  • *Yu, Z., Kraft, N. A., & Menzies, T. (2018). Finding better active learners for faster literature reviews. Empirical Software Engineering, 23(6), 3161–3186. 10.1007/s10664-017-9587-0

  • *Yu, Z., & Menzies, T. (2019). FAST2: An intelligent assistant for finding relevant papers. Expert Systems with Applications, 120(April), 57–71. 10.1016/j.eswa.2018.11.021