Systematic Review of Health Economic Evaluations Focused on Artificial Intelligence in Healthcare: The Tortoise and the Cheetah

Objectives: This study aimed to systematically review recent health economic evaluations (HEEs) of artificial intelligence (AI) applications in healthcare. The aim was to discuss pertinent methods, reporting quality, and challenges for future implementation of AI in healthcare, and additionally advise future HEEs. Methods: A systematic literature review was conducted in 2 databases (PubMed and Scopus) for articles published in the last 5 years. Two reviewers performed independent screening, full-text inclusion, data extraction, and appraisal. The Consolidated Health Economic Evaluation Reporting Standards and Philips checklists were used for the quality assessment of included studies. Results: A total of 884 unique studies were identified; 20 were included for full-text review, covering a wide range of medical specialties and care pathway phases. The most commonly evaluated type of AI was automated medical image analysis models (n = 9, 45%). The prevailing health economic analysis was cost minimization (n = 8, 40%) with the costs saved per case as preferred outcome measure. A total of 9 studies (45%) reported model-based HEEs, 4 of which applied a time horizon > 1 year. The evidence supporting the chosen analytical methods, assessment of uncertainty, and model structures was underreported. The reporting quality of the articles was moderate, as on average studies reported on 66% of Consolidated Health Economic Evaluation Reporting Standards items. Conclusions: HEEs of AI in healthcare are limited and often focus on costs rather than health impact. Surprisingly, model-based long-term evaluations are just as uncommon as model-based short-term evaluations. Consequently, insight into the actual benefits offered by AI is lagging behind current technological developments.


Introduction
Within the healthcare sector, artificial intelligence (AI) has seen a substantial rise in development over the past years because of growing interest and its potential impact on healthcare delivery and effectiveness.1 Advancements in computing power and algorithms, together with the digitization of large volumes of health data, have made AI-supported healthcare increasingly common. The progression of AI is taking center stage in how healthcare is personalized and delivered to patients, leading to new opportunities and challenges in clinical practice.1,2 A fundamental challenge in today's healthcare is that the growth of digital health data is quickly exceeding the human capacity to process and analyze it in routine clinical practice. The advancement of AI carries the potential to address this gap and simultaneously improve patient care in clinical practice.1 Additionally, impending healthcare staff shortages, aging populations, and increasing costs under narrowing budgets are exerting pressure on healthcare systems. Consequently, the healthcare industry is progressively and understandably resorting to AI to address these challenges. Nowadays, AI is growing in different domains of healthcare, from the automation of clinical workflows to the interpretation of clinical findings and the prediction of health outcomes, treatment response, and disease recurrence.3 At the rate at which AI applications are being developed, augmented, and used, AI creates an opportunity for accessible and evidence-based decision making within the global health community.4-7 Therefore, processing digital health data with AI could support the delivery of effective and efficient healthcare.8

Nonetheless, even though the advancement of AI carries much potential, what value AI can and will deliver in actual clinical practice remains a central question, and proper implementation guidance is crucial. Although the number of publications describing AI applications in a healthcare setting has been growing rapidly over the past years, the majority solely report on their accuracy and precision.9 Nevertheless, neither excellent prediction accuracy nor clear explainable relations between patient or image characteristics and outcomes guarantee clinical effectiveness and adoption. Moreover, the commonly used area under the receiver operating characteristic curve of a detection task does not unquestionably reflect clinical applicability.10 At the same time, evidence supporting clinical effectiveness, specifically comparative effectiveness, cost-effectiveness, or other formal health technology assessment (HTA), of AI in a clinical healthcare setting appears to be limited.1 HTA serves an important purpose for stakeholders and decision makers as a method to establish policies that make the most efficient use of available health resources before a technology is implemented in clinical practice. Regulatory policy is currently lacking, hindering the adoption of AI. The medical community is overwhelmed by the large number of developed AIs, yet the absence of clear guidelines makes it difficult for researchers, policy makers, and developers to determine when an AI is indeed qualified for clinical adoption.
Therefore, the aim was to systematically review recent health economic evaluations (HEEs) of AI applications in healthcare and to discuss their methods, outcomes, and reported challenges. This systematic review focuses specifically on formal HEEs, such as cost-effectiveness and cost-utility analyses, as one of the dimensions of HTA, generating evidence on (long-term) impact to support implementation and financing decisions.11 In addition, the quality of the reported studies was assessed based on published checklists. Challenges for future HEEs and for the implementation of AI in healthcare were identified and discussed.

Methods
A systematic literature review of health economic analyses of AI applications was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement and is described in the following sections.12

Literature Search Strategy
Considering that AI is currently developing at an accelerated pace and novel AI applications will likely outperform older ones, the systematic literature search focused on studies published within the last 5 years, between January 1, 2016, and April 1, 2021. The studies were extracted from the PubMed and Scopus databases separately using the search queries provided in Table 1. To identify potential studies regarding health economic analysis, relevant free-text terms included "health outcomes" or "effects," "economic," "cost," "budget," and "quality." These terms were required to be present in article titles only, because they were too broad when applied to abstract searches. Cost-related free-text terms included terms such as "reduction," "minimization," "benefit," and "sensitive." Free-text terms for AI were used in both title and abstract and included "artificial intelligence," "machine learning," "deep learning," "computer-aided," and "data-driven." Besides broad terms such as "artificial intelligence," "machine learning," and "deep learning," more specialized terms such as "support vector machine," "neural network," and "random forest" also fit within the AI paradigm. Medical Subject Headings terms and keywords describing AI and cost analysis methodologies were used to further limit results. On the presumption that any relevant study would contain not only specialized terms but also at least 1 broader term, a sensitivity analysis for "neural networks," "support vector machine," and "random forest" in both title or abstract and Medical Subject Headings terms was conducted. Databases and search strategies were discussed with information specialists, and search strategies were pilot tested to ensure that all studies previously identified by the authors were captured. The final database searches were performed on May 10, 2021.

Inclusion and Exclusion Criteria
Studies that evaluated an AI application compared with standard care, other types of care, or another AI within the same healthcare setting and reported a quantified impact evaluation in terms of costs, health-related or process outcomes, or resources were included for analysis. Studies that did not comply with the inclusion criteria were excluded, as were those outside of healthcare, without any type of quantitative HEE of the AI application, or not available in English. Moreover, studies of types other than "original research" or "systematic review," such as "commentary," "letter to the editor," or "editorial," were also excluded. Reviewers M.V. and R.K. independently screened the titles and abstracts of all identified records after duplicates were removed. Definitive inclusion or exclusion of the studies was concluded by the same 2 reviewers, who independently reviewed the full texts of the included studies. Persisting uncertainties or disagreements between reviewers during the screening or full-text review process were settled after consulting a third independent reviewer (H.K.).

Data Extraction
Relevant data of the included studies were extracted independently by the 2 reviewers (M.V. and R.K.). A data extraction form was designed and included several general aspects: year of publication, patient population, study location, and funding source. Specific information extracted regarding the subject AI included a description of the AI, field of application, care pathway phase (prevention, diagnostics, intervention, prognosis, etc), and technological aspects (ie, pattern recognition, natural language processing, virtual reality, etc). Finally, extracted information related to the HEE involved the type of health economic analysis, intervention and comparator, time horizon, perspective, health or cost outcomes, modeling technique, and sensitivity analysis.

Quality Assessment
The methodological quality of the included studies was evaluated using the Consolidated Health Economic Evaluation Reporting Standards (CHEERS) and Philips checklists.13,14 CHEERS includes 24 items subdivided into 6 main categories to support thorough reporting of HEEs but does not include items directly related to model-based HEEs. The Philips checklist was specifically designed to analyze the quality of model-based HEEs and includes 3 pillars: modeling approach, model data, and assessment of uncertainty. The decision for a specific model type and the justification of model parameter values and handling of uncertainty are crucial to the perceived quality of the studies.18 Therefore, the Philips checklist was applied to model-based studies in addition to the CHEERS checklist. Points were awarded for each of the criteria met; a point was withheld if the criterion was not completely met. A checklist score was derived for each included study based on the proportion of the 24 criteria met. The final included studies were independently reviewed by the same 2 reviewers (M.V. and R.K.).
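The scoring described above is simple arithmetic: each study's score is the share of checklist criteria fully met. A minimal sketch of this calculation (the item labels and the function below are illustrative placeholders, not the authors' actual scoring tool or the real CHEERS item wording):

```python
# Illustrative sketch (not the authors' scoring script): a reporting-quality
# score as the percentage of checklist items fully met. Item names below are
# hypothetical placeholders, not the real CHEERS item wording.

def cheers_score(items_met: dict[str, bool]) -> float:
    """Return the percentage of checklist criteria fully met."""
    if not items_met:
        raise ValueError("checklist must contain at least one item")
    return 100 * sum(items_met.values()) / len(items_met)

# Example: a study meeting 16 of 24 items scores ~66.7%.
example = {f"item_{i}": (i <= 16) for i in range(1, 25)}
print(round(cheers_score(example), 1))  # 66.7
```

With 24 items, a study meeting 16 of them scores roughly 66.7%, close to the average reporting quality found in this review.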

Search Results
The database search identified 982 records in total, of which 98 were excluded as duplicates. In the remaining sample of 884 unique records, 853 records were excluded after title and abstract screening based on the exclusion criteria. The sensitivity analysis yielded 62, 13, and 1 additional articles, respectively; none were eligible for full-text screening after assessment of title and abstract. A total of 31 full-text studies were screened, after which another 11 studies were excluded because they did not meet the inclusion criteria. A flow diagram of records found, screened, selected, and excluded with corresponding exclusion criteria is shown in Figure 1.19-38 The relevant data were extracted, and the 2 reviewers independently further assessed the quality of the articles per the CHEERS and Philips checklists.

General Overview of the Included Studies
A general overview of the included studies, including details of the reported AI applications, is provided in Table 2.19-38 The majority of studies was published in 2019 or later. The studies were conducted in a range of medical specialties, yet ophthalmology was evidently the dominant field. A total of 4 of 20 articles (20%) involved ophthalmology, all of which evaluated the same AI application for diabetic retinopathy screening.20,21,28,29 Nevertheless, all phases of the care pathway from prevention to follow-up were supported by AI in at least 1 included study; screening and treatment monitoring (intervention) were the most prevalent phases of the care pathway in which AI was applied. Notably, 7 of 20 studies (35%) reported complete government funding, and 6 (30%) reported industry funding. One study (5%) reported that the industry funder participated in the analysis, data interpretation, and writing of the article.36

Health Economic Analysis
The primary health economic analysis was cost-minimization analysis, with the costs per case as the primary outcome (n = 8, 40%). Most cost-minimization analyses adopted the hospital perspective (n = 4, 50%), and 2 others adopted the payer perspective. The remaining 2 studies applied a health system or patient perspective. A total of 3 studies incorporated the societal perspective in their evaluation, although the definition of this perspective varied.32,34,37 One study defined the societal perspective by including healthcare utilization and productivity losses,34 and another by including reimbursement, opportunity, and additional hospitalization costs.37 The third study did not explicitly elaborate on its definition of the societal perspective.32 The second most prevailing health economic analysis was incremental cost-effectiveness analysis (n = 6, 30%), with different health outcomes. One study reported improvements in life expectancy. A total of 3 studies performed incremental cost-utility analysis, using incremental quality-adjusted life-years as the health outcome. The remaining 3 studies reported the incremental effectiveness in context-specific outcome measures. Overall, the time horizons adopted by the studies ranged from 28 days to lifetime. A total of 6 studies (30%) reported a time horizon shorter than 12 months, 7 (35%) adopted a time horizon of 12 months, and 3 studies (15%) included patient lifetime as the time horizon. Of these studies, 4 (20%) reported the applied discount rates for health and economic outcomes. Two studies (10%) did not report a time horizon; nevertheless, 1 of these evaluated incremental effectiveness on overall survival as the health outcome.23 A complete overview of the health economic methodological details of the included studies can be found in Table 3.19-38
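The incremental analyses summarized above all reduce to the same arithmetic: the difference in (discounted) costs divided by the difference in (discounted) health effects, for example, quality-adjusted life-years (QALYs). A minimal sketch with invented numbers (the 3.5% discount rate and all cost/QALY streams below are illustrative assumptions, not values taken from any included study):

```python
# Illustrative ICER calculation with discounting (made-up numbers, not data
# from any included study). Costs and QALYs are given per year over the horizon.

def discounted_total(per_year: list[float], rate: float) -> float:
    """Sum a yearly stream, discounting year t by 1/(1+rate)**t (year 0 undiscounted)."""
    return sum(v / (1 + rate) ** t for t, v in enumerate(per_year))

def icer(cost_ai, qaly_ai, cost_soc, qaly_soc, rate=0.035):
    """Incremental cost-effectiveness ratio: extra cost per QALY gained."""
    d_cost = discounted_total(cost_ai, rate) - discounted_total(cost_soc, rate)
    d_qaly = discounted_total(qaly_ai, rate) - discounted_total(qaly_soc, rate)
    return d_cost / d_qaly

# Hypothetical AI screening: higher upfront cost, small yearly QALY gain
# versus standard of care over a 3-year horizon.
print(icer(cost_ai=[1200, 100, 100], qaly_ai=[0.80, 0.78, 0.76],
           cost_soc=[400, 300, 300], qaly_soc=[0.78, 0.76, 0.74]))
```

Here the AI strategy trades a higher upfront screening cost against small yearly QALY gains; whether the resulting cost per QALY gained is acceptable depends on the local willingness-to-pay threshold.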

Quality Assessment
The methodological quality of the studies was evaluated using the CHEERS and Philips checklists. A score was calculated as the percentage of criteria fulfilled for the analysis of the CHEERS results. As shown in Table 2,19-38 the scores ranged from 25% to 96%, and the average score was 66%. The study with the highest score was an elaborate HTA study commissioned by the National Health Service in the United Kingdom.20 A short article presumably containing preliminary results received the lowest score.24 Additionally, Figure 2 specifies the number of studies that satisfied each checklist item, ranging from 1 to 20. All studies provided an explicit statement of the study context and the main question in the introduction. Only 1 study reported on handling variations among subgroups by conducting tests with different training and test cohorts.23 A total of 6 studies (30%) described the analytical methods, 4 of which were modeling-based studies. Furthermore, 1 study evaluated an AI algorithm to maximize hospital profit based on retrospective data but did not compare the outcomes with standard care profit.24 This was the only study not to report a comparator. Finally, 11 (55%) and 13 of 20 studies (65%) explicitly mentioned the study location and perspective, respectively.
The review included 9 modeling-based studies; a summary of their characteristics is shown in Table 4.20,21,25,28-30,32,37,38 The Philips checklist was used to report on the quality of the modeling-based studies.14 Among the 9 modeling-based studies, merely 2 distinct model types were identified: decision trees and Markov models. A total of 4 of 9 modeling-based studies listed their specific reason for choosing a model type.29,32,37,38 Markov modeling was particularly chosen because it allows incorporating recurring events32 and time-dependent transitions.37 Reasons listed for choosing decision trees were the trade-off between interpretability and accuracy38 and the simplicity of the model.29 Studies that did not list the specific rationale for the model structure used made it difficult to determine whether the modeling type was sufficiently reasoned (as reflected in the Philips items on structural assumptions and model type). The decision trees and the Markov models had simplistic structures. All decision trees had 2 or 3 arms, 1 for the AI application under evaluation and 1 or 2 for the comparator strategies, with similar events/states in each arm. The cycle lengths used in the Markov models ranged from 1 day to 2 years and were mostly determined based on healthcare protocols (ie, testing frequencies in practice) rather than the natural progression of disease. Regarding model data, only 3 of 9 studies reported having used multiple studies or systematic reviews to synthesize model parameter values.25,29,30 The other 6 studies (66%) reported a single data source, in 2 studies even from the same group or institution.21,37 None of the 9 studies reported how heterogeneity and structural uncertainties were addressed. Methodological uncertainties were evaluated in 2 studies, both by means of sensitivity analysis with different clinical scenarios.25,37 Nevertheless, all 9 studies reported on parameter uncertainty through sensitivity analysis. One study conducted 3-way sensitivity analysis.38
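To make the structure of such models concrete, below is a minimal Markov cohort sketch in the spirit of the models described above. All transition probabilities, costs, and utilities are invented for illustration (here the hypothetical AI strategy simply lowers the yearly probability of progressing from "well" to "sick"); none of these numbers come from the included studies:

```python
# Minimal Markov cohort model sketch (hypothetical numbers): a cohort moves
# between "well", "sick", and "dead" states each yearly cycle; each state
# accrues a cost and a utility (QALY weight) per cycle.

def run_markov(trans, state_cost, state_qaly, start, cycles, rate=0.035):
    """Return (discounted cost, discounted QALYs) per person over `cycles` years."""
    states = list(start)
    dist = dict(start)                      # cohort distribution over states
    total_cost = total_qaly = 0.0
    for t in range(cycles):
        disc = 1 / (1 + rate) ** t
        total_cost += disc * sum(dist[s] * state_cost[s] for s in states)
        total_qaly += disc * sum(dist[s] * state_qaly[s] for s in states)
        dist = {s2: sum(dist[s1] * trans[s1][s2] for s1 in states) for s2 in states}
    return total_cost, total_qaly

# Hypothetical strategy: AI-supported monitoring lowers the well->sick probability.
trans_soc = {"well": {"well": 0.85, "sick": 0.10, "dead": 0.05},
             "sick": {"well": 0.10, "sick": 0.70, "dead": 0.20},
             "dead": {"well": 0.0,  "sick": 0.0,  "dead": 1.0}}
trans_ai = {**trans_soc, "well": {"well": 0.90, "sick": 0.05, "dead": 0.05}}
costs = {"well": 200.0, "sick": 5000.0, "dead": 0.0}
qalys = {"well": 0.85, "sick": 0.55, "dead": 0.0}
start = {"well": 1.0, "sick": 0.0, "dead": 0.0}

c_ai, q_ai = run_markov(trans_ai, costs, qalys, start, cycles=10)
c_soc, q_soc = run_markov(trans_soc, costs, qalys, start, cycles=10)
print(f"delta cost {c_ai - c_soc:+.0f}, delta QALY {q_ai - q_soc:+.3f}")
```

In this toy setup the AI strategy dominates (lower discounted cost, more QALYs) because keeping patients in the cheap "well" state avoids expensive "sick" cycles; a real evaluation would also add the AI's own acquisition and running costs, which this sketch omits.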

Discussion
In this systematic review, 20 studies were identified that reported on an HEE of an AI application in healthcare. Given the large total number of studies describing the development of AI (over 120 000 in 2019) or the number of HEEs (nearly 20 000 in 2016 alone), 20 included studies is quite a limited number.39,40 The most common AI technology was automated medical image analysis, applied in a variety of care pathway phases. The majority of studies (n = 17) compared an AI application with usual care and concluded the AI to be cost saving. Additionally, a large number of studies reported no details regarding the characterization of uncertainty (n = 12), model assumptions (n = 11), and analytical methods (n = 14).
These limitations may be attributed to the choice of health economic method, because the included studies used relatively simple modeling methods, such as Markov models and decision trees. Therefore, one important finding of this study is that current HEEs of AI applications are unfortunately both quantitatively and qualitatively limited. The fact that only few assessments are published, often of suboptimal quality, may severely hinder the adoption of AI into clinical practice. Considerable challenges still need to be overcome to progress beyond the generally limited adoption in individual institutions and achieve AI's full potential.9,41 The continuous development of innovations is at the core of improving outcomes and affordability in healthcare. There is an abundance of exciting AI innovations and inspiring initiatives, but clinical practice and policy makers are asking for suitable methodologies and outcome measures relevant to assess (added) value and improve patient care. The availability and applicability of these outcome measures and methodologies are the basis on which appropriate research can be performed and the collection of evidence can be further improved. Hence, continuous developments in AI require an accompanying regulatory framework. Although the Food and Drug Administration (FDA), the European Commission, and many European countries individually are developing strategies and policies to regulate the development of AI, the process is time consuming.2,42

The disparity between AI development and implementation stems from the fact that AI is uniquely situated among healthcare innovations because of its ability to learn and improve performance from experience through retraining. Currently, the regulatory groundwork for AI applications as medical devices considers AI to be static models, meaning that alterations (ie, retraining) are difficult to regulate. To this end, the FDA is developing a framework to allow adjustments to AI applications and support a total product life cycle approach.42 In Europe, since the introduction of the Medical Device Regulation in May 2020, approval for dynamic AI exists, yet in many situations it still requires renewed risk assessment. In these proposals, the FDA and the European Commission expect AI applications to demonstrate analytical and clinical validation, yet validation guidelines have not been established.43 Thus, AI researchers do not know when AI performance is acceptable in the clinical validation setting or whether the AI needs further adjustments. The quality of data from daily clinical practice may be much lower, given that healthcare professionals do not always collect all the necessary clinical information in real time.43

Therefore, an AI ready for clinical adoption should be able to manage low-quality data adequately, but this is not explicitly addressed by the proposed frameworks. This is reflected in this systematic review, given that only 5 included studies mentioned validation of their AI,20,25,30,33,34 and data uncertainty (n = 8) and heterogeneity (n = 1) were greatly underreported. Thereupon, until regulatory policy is adopted for AI, the translation of AI toward clinical practice will remain fragmented and the incentive for qualitatively thorough evaluations remains low. Currently, no governing body has clear and definitive guidelines on the admission procedure for AI applications. The traditional paradigm of regulation of medical devices was not designed for adaptive AI technologies, resulting in inadequate high-quality evidence to support clinical implementation.44,45 Several of the included studies affirmed this fragmented translation. They accentuated how AI implementation requires organizational development,34 how evidence about long-term costs and benefits is incomplete,21 and that they were the first to conduct an HEE in their field.27 Nonetheless, the number of Conformité Européenne-marked and FDA-approved AI-based medical devices is increasing substantially, indicating that a degree of scientific evidence showing safety and effectiveness is available.46 Nevertheless, few of the currently approved AIs have yet been proven to be "value for money" from a societal perspective, given that this requires performing a formal HEE.47 Of the 6 FDA- or Conformité Européenne-approved AIs included in this review, only 1 of them29 is currently adopted in a clinical setting.19-22,27 Between market approval and clinical adoption, the question arises of how AI applications can best be deployed as an integrated part of a healthcare system.48

If the AI application needs to be reimbursed by health insurance, an economic evaluation becomes increasingly important to provide insight into the health gained and the timeframe in which costs are incurred and benefits are realized. Unfortunately, the proposed regulatory frameworks for AI do not explicitly mention HEE, and therefore, it remains unclear whether HEE will become a necessary condition for health insurance reimbursement. A positive HEE also does not guarantee the adoption of AI in clinical practice. The current absence of regulatory frameworks could explain why so few HEEs are reported. Furthermore, a previous systematic review published by Wolff et al49 in early 2020 evaluating the economic impact of AI in healthcare similarly reported a scarcity of publications conducting extensive and qualitatively sound economic impact evaluations.
Comparable with findings in our study, this previous review concluded that the economic evaluations of AI in healthcare often focused on specific elements within health economics (eg, only included direct costs) rather than performing a comprehensive analysis.
Compared with many other interventions, the primary goal of an AI may not be to improve health outcomes but instead to improve other outcomes, such as shared decision making, well-being, or patient independence. This makes performing HEEs more challenging, because such outcomes are notoriously difficult to value, for example, in monetary terms. Nevertheless, society's expectations are growing toward reflecting the broader benefits of healthcare interventions not captured in traditional HEEs. Therefore, future HEEs of AI need to include additional benefits beyond the standard health dimensions, so that these benefits may weigh in as contextual factors in decision making.50 Furthermore, Guo et al51 identified a gap in the evaluation methods used in practice for digital health innovations, including AI-based software as a medical device. The wider use of different types of simulation approaches, such as computational, clinical, and system simulation, is needed to overcome limitations and support better decision making. Applying advanced simulation methods such as system dynamics, discrete event, and agent-based simulation, maintaining a complex system view, could provide a methodology highly capable of modeling the effects of a complex intervention such as AI from a systems perspective.52,53 In this systematic review, 9 articles included simulation-based modeling, and many mentioned limitations of their simulation models. These limitations will continue to persist unless new and advanced simulation methods are used for the evaluation of advanced and complex technologies such as AI. Other limitations discussed were underestimations of costs. Costs not directly linked to the assessed intervention were often not considered in the evaluation. For example, costs incurred from increased staff time, physician training, or software updates were not included.19,26 Equivalently, excluding future health benefits resulting from effective interventions leads to an underestimation of benefits.
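As a flavor of the discrete-event style of simulation advocated above, the toy sketch below models a single-reader image-reading queue, where a hypothetical AI triage step is assumed to shorten the per-image reading time; the arrival rate, reading times, and effect size are all invented for illustration:

```python
# Toy discrete-event simulation (invented parameters): images arrive at a
# single human reader with exponential interarrival times; the queue is FIFO.
# We compare mean waiting time with and without a hypothetical AI triage step
# that is assumed to shorten reading time per image.
import random

def mean_wait(n_images: int, mean_interarrival: float, read_time: float,
              seed: int = 42) -> float:
    """Average minutes an image waits before reading starts."""
    rng = random.Random(seed)
    clock = 0.0        # arrival time of the current image
    reader_free = 0.0  # time at which the single reader becomes free
    total_wait = 0.0
    for _ in range(n_images):
        clock += rng.expovariate(1 / mean_interarrival)
        start = max(clock, reader_free)      # wait if the reader is busy
        total_wait += start - clock
        reader_free = start + read_time      # reader occupied while reading
    return total_wait / n_images

# Hypothetical effect: AI triage cuts reading from 10 to 6 minutes per image.
baseline = mean_wait(10_000, mean_interarrival=12, read_time=10)
with_ai = mean_wait(10_000, mean_interarrival=12, read_time=6)
print(f"mean wait: {baseline:.1f} min without AI vs {with_ai:.1f} min with AI")
```

Queueing effects like this are exactly what per-case cost comparisons miss: at high reader utilization, a modest reduction in reading time shrinks waiting times nonlinearly, a system-level benefit that only simulation (or queueing theory) can surface.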

Limitations
Despite the development of AI taking off in the last 2 decades, the number of scientific publications has been increasing even faster in recent years, which justifies our search for publications since 2016.9 Nevertheless, it is possible that relevant articles not written in English were missed. Many articles were excluded in this review based on title and abstract screening. Remarkably, we found that many of the articles in our initial search results claimed to describe a cost-effective AI application yet did not conduct an HEE to justify those claims, leading to a high exclusion rate. Additionally, articles evaluating only patient health outcomes or hospital process outcomes were not included, even though such improved outcomes could lead to a reduction in costs, per the concepts of value-based healthcare.54 Finally, following the CHEERS and Philips checklists can ensure that economic evaluations include the appropriate components, but it does not necessarily reflect the correct implementation of each item. Although the computed score can be questioned for assuming equal weight for each checklist item, it does provide an estimation of the completeness of the evaluation per study.

Conclusions
This systematic review exposes an important gap in the methods used for HEEs of AI applications in healthcare. In the context of health economics, the cheetah of AI innovation is pursued only at a slow pace by the tortoise of formal HEE. Currently, HEEs of AI are incapable of capturing the complexities and clinical applicability needed to support appropriate decision making. Unless this tortoise catches up, beneficial AI applications run the risk of not being adopted because of a lack of proven health and economic benefits. Moreover, there is a risk of nonvaluable AI applications being adopted based on poor and limited evidence. Both situations lead to potential health loss and unnecessary costs and will likely persist until AI, with its seemingly endless possibilities, is recognized as an intervention that can and should be properly assessed. Therefore, further work to enhance the health economic assessment of AI will likely be crucial to its future adoption into clinical practice.

Figure 1. PRISMA flowchart describing study selection and reasons for exclusion during full-text screening.

Figure 2. Overview of the proportion of studies reporting CHEERS checklist items. Green, yes; red, no.

Table 1. Search queries for titles, abstracts, and keywords in the database search conducted on May 10, 2021.

Table 2. Continued. AI, artificial intelligence; CAD, computer-aided diagnosis; CE, Conformité Européenne; CHEERS, Consolidated Health Economic Evaluation Reporting Standards; CNN, convolutional neural network; CT, computed tomography; DESP, Diabetic Eye Screening Program; EHR, electronic health record; EU, European Union; FDA, Food and Drug Administration; HAPI, hospital-acquired pressure injury; ML, machine learning; NHS, National Health Service; NICE, National Institute for Health and Care Excellence; NR, not reported; UK, United Kingdom; USA, United States of America.
*Item not reported. †Model-based study, included in Table 4.