Estimating Cost-Effectiveness Using Alternative Preference-Based Scores and Within-Trial Methods: Exploring the Dynamics of the Quality-Adjusted Life-Year Using the EQ-5D 5-Level Version and Recovering Quality of Life Utility Index

Objectives: This study aimed to explore quality-adjusted life-year (QALY) and subsequent cost-effectiveness estimates based on the more physical health–focused EQ-5D 5-level version (EQ-5D-5L) value set for England or cross-walked EQ-5D 3-level version UK value set scores or more mental health recovery-focused Recovering Quality of Life Utility Index (ReQoL-UI), when using alternative within-trial statistical methods. We describe possible reasons for the different QALY estimates based on the interaction between item scores, health state profiles, preference-based scores, and mathematical and statistical methods chosen. Methods: QALYs are calculated over 8 weeks from a case study 2:1 (intervention:control) randomized controlled trial in patients with anxiety or depression. Complete case analyses and analyses with missing cases imputed using multiple imputation are conducted, using unadjusted and regression baseline-adjusted QALYs. Cost-effectiveness is judged using incremental cost-effectiveness ratios and acceptability curves. We use previously established psychometric results to reflect on estimated QALYs. Results: A total of 361 people (241:120) were randomized. EQ-5D-5L crosswalk produced higher incremental QALYs than the value set for England or ReQoL-UI, which produced similar unadjusted QALYs, but contrasting baseline-adjusted QALYs. Probability of cost-effectiveness <£30 000 per QALY ranged from 6% (complete case ReQoL-UI baseline-adjusted QALYs) to 64.3% (multiple-imputation EQ-5D-5L crosswalk unadjusted QALYs). The control arm improved more on average than the intervention arm on the ReQoL-UI, a result not mirrored on the EQ-5D-5L nor condition-specific (Patient-Health Questionnaire-9, depression; Generalized Anxiety Disorder-7, anxiety) measures. Conclusions: ReQoL-UI produced contradictory cost-effectiveness results relative to the EQ-5D-5L. The EQ-5D-5L’s better responsiveness and “anxiety/depression” and “usual activities” items drove the incremental QALY results.
The ReQoL-UI’s single physical health item and “personal recovery” construct may have influenced its lower 8-week incremental QALY estimates in this patient sample.


Introduction
Economic evaluation evidence helps inform resource allocation between alternative care interventions within a finite care budget. 1 Cost-effectiveness analysis via cost per quality-adjusted life-year (QALY) is recommended internationally, including by the National Institute for Health and Care Excellence (NICE) for England and Wales. [2][3][4] QALYs are measured on a preference-based quality-adjustment scale, anchored at 0 (a state equivalent to dead) and 1 (full health), combined with length of life allowing comparisons between interventions that affect quantity and quality of life. 1,5 Nevertheless, the concept of "a QALY is a QALY" for cross-comparable decision making has been debated extensively given that different preference-based measures and value sets produce different QALYs, stemming from aspects such as content and size of classification systems, and methods and populations used to value health states. [5][6][7][8][9][10][11][12] Additionally, alternative mathematical and statistical methods can influence QALY estimates and associated cost-effectiveness evidence. [13][14][15] A more consistent, comparable approach is a rationale for NICE and reimbursement agencies internationally recommending the EQ-5D 3-level version (EQ-5D-3L), which represents 3^5 = 243 possible health states, as a generic health measure. [2][3][4] In comparison, the newer EQ-5D 5-level version (EQ-5D-5L) represents 5^5 = 3125 possible health states, resulting in increased sensitivity and reduced ceiling effects. [16][17][18][19][20][21][22] Country-specific EQ-5D-5L preference-based value sets are available (https://euroqol.org/), with the value set for England (VSE) based on a composite time trade-off (TTO) and discrete choice experiment hybrid model. [23][24][25][26][27][28][29] Nevertheless, an independent quality assurance study led to NICE recommending the van Hout et al crosswalk over the VSE. 
[30][31][32][33][34] Therefore, EQ-5D-5L preference-based values are cross-walked/ mapped EQ-5D-3L UK value set scores based on the conventional TTO method. 35 Nevertheless, cross-walked scores have inherent concerns (eg, predictive errors) and do not represent a direct value set for the EQ-5D-5L. 36,37 Analyses internationally comparing EQ-5D-5L and EQ-5D-3L value sets and alternative cross-walked scores suggest that they estimate different preference-based values and subsequent QALYs. [38][39][40][41] Related to mental health, the EQ-5D measures' underlying health domains/items (mobility, self-care, usual activities, pain/ discomfort, anxiety/depression) have been argued to be more physical than mental health focused, stimulating debate as to their appropriateness within mental health populations. 10,[42][43][44][45][46][47][48] The 2010 Global Burden of Disease study estimated that depression and anxiety disorders contribute to a large portion of the total disability among all mental health and substance use disorders. 49 Approximately 1 in 6 adults in England has a common mental health disorder. 50 Mental health services and interventions have evolved to deal with care demand; for example, stepped-care within Improving Access to Psychological Therapies (IAPT) services in England and use of low-intensity interventions such as Digital Mental Health Interventions (DMHIs), which require appropriate cost-effectiveness evidence. [51][52][53][54] For reimbursement agencies such as NICE, alternative preference-based measures can be rationalized based on aspects such as psychometric performance (4; p. 42), as suggested by Brazier and Deverill. 55 EQ-5D measures' psychometric results offer better support in common (eg, anxiety and depression) than severe (eg, schizophrenia and bipolar disorder) mental health disorders. 
[44][45][46][47]56 The Recovering Quality of Life (ReQoL) and ReQoL 10-item (ReQoL-10) versions are "recovery-focused quality of life" measures for mental health service users. 57 A UK value set using the composite TTO method has been developed to calculate QALYs from 7 ReQoL-10 items: the ReQoL-Utility Index (ReQoL-UI), which represents 5^7 = 78 125 possible health states. 58 The ReQoL-UI's developers suggest it is arguably a more mental health-focused generic measure relative to the more physical health-focused EQ-5D measures. 58 A psychometric analysis by Franklin and Enrique 59 in patients with anxiety and/or depression identified that, compared with the EQ-5D-5L using the VSE or UK crosswalk, the ReQoL-UI had better construct validity with depression severity, that is, Patient-Health Questionnaire-9 (PHQ-9) score, 60 whereby construct validity was assessed based on "convergent" (eg, correlation with the PHQ-9) and "known-group" validity (eg, assessing effect sizes between depression severity groupings; eg, "moderate" relative to "mild" severity). Nevertheless, the EQ-5D-5L preference-based score was more responsive (based on assessing standardized response means) and had better construct validity with anxiety severity, that is, Generalized Anxiety Disorder-7 (GAD-7) score. 59,61,62 These results suggest that the 2 preference-based measures may systematically differ in how they measure anxiety and depression, with implications for the precision of QALY estimation. 59 We aim to explore the various QALY and subsequent cost-effectiveness estimates based on the EQ-5D-5L (VSE or crosswalk) or ReQoL-UI, when using alternative within-trial statistical methods based on a case study trial. 
Throughout we describe possible reasons for different QALY estimates based on the interaction between item scores, health state profiles, preference-based scores, and mathematical and statistical methods chosen, with suggested implications for evaluating interventions within mental health services such as IAPT and future research.
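As a quick check on the classification-system sizes quoted above, the number of possible health states is the number of response levels raised to the power of the number of items. The sketch below is illustrative only; the function and variable names are hypothetical and do not come from the original article.

```python
def n_health_states(levels: int, items: int) -> int:
    """Number of distinct health state profiles for a descriptive system
    with `items` dimensions, each answered on `levels` response levels."""
    if levels < 1 or items < 1:
        raise ValueError("levels and items must be positive integers")
    return levels ** items


# Classification systems named in the text, as (levels, items).
SYSTEMS = {
    "EQ-5D-3L": (3, 5),
    "EQ-5D-5L": (5, 5),
    "ReQoL-UI": (5, 7),
}

STATE_COUNTS = {name: n_health_states(*shape) for name, shape in SYSTEMS.items()}

if __name__ == "__main__":
    for name, count in STATE_COUNTS.items():
        print(f"{name}: {count} possible health states")
```

A larger state space does not by itself imply greater sensitivity; as discussed later in the article, that also depends on item content and on the scoring algorithm.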

Data Source
A parallel-group, randomized waitlist-controlled trial examining the effectiveness and cost-effectiveness of internet-delivered cognitive behavioral therapy (iCBT) for patients presenting with depression or anxiety was conducted at an established IAPT service. 63,64 Before 2:1 randomization (intervention: 8-week waiting-list control), trial eligibility criteria were applied (see Appendix S1 in Supplemental Materials found at https://doi.org/10.1016/j.jval.2021.11.1358). Trial inclusion criteria were people (1) aged between 18 and 80 years, (2) above clinical thresholds for depression (PHQ-9 ≥10) or anxiety (GAD-7 ≥8), [60][61][62] and (3) suitable for iCBT (ie, willing to use iCBT, internet access). The structured Mini-International Neuropsychiatric Interview 7.0.2, administered by telephone by psychological wellbeing practitioners (ie, clinicians trained to deliver low-intensity support), established the presence or absence of a primary diagnosis of depression or anxiety disorder at baseline. 65 National Health Service England Research Ethics Committee provided trial ethics approval (Research Ethics Committee reference: 17/NW/0311). The trial was prospectively registered: current controlled trials ISRCTN91967124.
The trial is complete, and the protocol and main results are published. They show that iCBT produced statistically significant improvements in depression (PHQ-9) and anxiety (GAD-7) severity compared with wait-list controls at 8 weeks, with further statistically significant intervention-group improvements from 8 weeks to 12 months. 63,64 Over 8 weeks, the probability of cost-effectiveness was 46.6% at <£30 000 per EQ-5D-5L crosswalk-based QALY as the NICE reference case. 64 VSE and ReQoL-UI results were not published given NICE's VSE position and the nonfinalized ReQoL-UI at the point of submission.

Economic Evaluation
This 8-week within-trial cost-effectiveness analysis focuses on the NICE reference case of cost-per-QALY from a health and social care perspective. Because estimated QALYs are the main interest here, intervention (£94.63 per person) and other cost calculations are described elsewhere. 64 We followed NICE guidelines, Consolidated Health Economic Evaluation Reporting Standards checklist, and recommended methods for handling preference-based (utility), cost, and missing data using Stata version 15 and Microsoft Excel 2016. 4,13-15,30,66-71

Calculating QALYs
QALYs are calculated from preference-based scores using the total area under the curve (AUC) method 15 :

$$\mathrm{QALY}_{ij,t} = \frac{p_{ij,t} + p_{ij,t-1}}{2} \times d_t \qquad (1)$$

whereby p, preference-based score; i, an individual; and t, time (ie, baseline, t = 0). For each group j (j = 0, control; j = 1, intervention), the consecutive time measures are added, averaged, and then rescaled (d) for the percent of a year that t and t-1 cover, that is, 0.15 for 8 weeks. From Eq. (1), total QALYs (Q) for each individual's trial duration are the summation of QALY calculations for each follow-up time point starting at t = 1:

$$Q_{ij} = \sum_{t=1}^{T} \mathrm{QALY}_{ij,t} \qquad (2)$$

where T is the final follow-up time point. Preference-based scores at baseline (t = 0) and 8 weeks (t = 1) are reported alongside subsequent QALY estimates for both trial arms and from 8 weeks to 12 months (t = 5) for the intervention arm only.
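The AUC calculation described above can be sketched as follows. This is a minimal illustration, not the authors' Stata code: the function name, the 52-week year, and the example scores are assumptions made here for demonstration.

```python
WEEKS_PER_YEAR = 52  # assumption: a 52-week year, so 8 weeks is about 0.15 of a year


def auc_qalys(scores, weeks):
    """Total QALYs for one individual by the area-under-the-curve method.

    scores: preference-based scores at each time point, baseline first
    weeks:  assessment times in weeks, in the same order (baseline = 0)
    """
    if len(scores) != len(weeks) or len(scores) < 2:
        raise ValueError("need matching scores and times for at least 2 time points")
    total = 0.0
    for t in range(1, len(scores)):
        # d: share of a year covered by the interval between t-1 and t
        d = (weeks[t] - weeks[t - 1]) / WEEKS_PER_YEAR
        # average the two consecutive scores, then rescale by d
        total += (scores[t] + scores[t - 1]) / 2 * d
    return total


if __name__ == "__main__":
    # Hypothetical individual: score 0.60 at baseline and 0.70 at 8 weeks.
    print(round(auc_qalys([0.60, 0.70], [0, 8]), 4))
```

For the hypothetical individual shown, the 8-week QALY is roughly 0.65 multiplied by 8/52, or about 0.10. This also shows why incremental QALYs over 8 weeks are necessarily small in absolute terms.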

Statistical analyses
Analyses included complete cases (CCs) and analyses with missing cases imputed using multiple imputation (MI) by chained equations with predictive mean matching, drawing inference from a pool of 10 donors (k-nearest neighbors = 10) and thus avoiding predicting missing values outside the plausible and observed range. 67,72 The MI method was chosen post hoc once the mechanism for missingness was deemed to be missing at random based on logistic regression, which identified baseline sex, GAD-7 caseness, work and social adjustment scale, and IAPT phobia scale scores as predictors of missingness. [13][14][15]73 VSE, crosswalk, ReQoL-UI, and future cost missing cases at all follow-up time points were imputed. The number of imputed data sets was based on the percent of missing CC data across all time points in the intervention arm (m = 43). 13,74 Rubin's rules were applied when estimating MI analyses' means and standard errors of the mean (SEM). 75,76 Baseline-adjusted QALYs are estimated using baseline preference-based values and trial arm as covariates within 2 independent regression models: ordinary least squares and seemingly unrelated regression, the latter accounting for the bivariate relationship between costs and QALYs. 15,77,78 Incremental mean point estimates of trial arm differences (ie, intervention minus control) related to mean costs over mean QALYs are used to estimate incremental cost-effectiveness ratios (ICERs).
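The pooling step under Rubin's rules mentioned above can be sketched as below. This is a generic illustration of the standard formulas (pooled estimate as the mean across imputed data sets, total variance as within-imputation variance plus (1 + 1/m) times between-imputation variance). The function name and the example numbers are hypothetical and are not trial results.

```python
from math import sqrt
from statistics import mean, variance


def rubin_pool(estimates, std_errors):
    """Pool one parameter across m imputed data sets using Rubin's rules.

    estimates:  point estimate from each imputed data set
    std_errors: standard error of that estimate in each imputed data set
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least 2 imputations with matching standard errors")
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = variance(estimates)                  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


if __name__ == "__main__":
    # Hypothetical incremental QALY estimates from 3 imputed data sets.
    est, se = rubin_pool([0.002, 0.004, 0.003], [0.001, 0.001, 0.001])
    print(est, se)
```

The pooled standard error is larger than any single-data-set standard error whenever the estimates differ across imputations, which is how the uncertainty due to missing data is carried into the final result.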
Table 1. Preference-based score descriptive statistics for observed cases at baseline across and within-trial arms. Cont. indicates control; EQ-5D-5L, EQ-5D 5-level version; Int., intervention; Max, maximum; Min, minimum; N, number of responders; P., possible; ReQoL-UI, Recovering Quality of Life-utility index; UHSP, unique health state profile; UPBS, unique preference-based score; VSE, value set for England. *Number of participants at baseline was as follows: both trial arms ("Both"), N = 361; Int. arm, N = 241; Cont. arm, N = 120. † UHSP: the descriptive system element of the EQ-5D-5L and ReQoL-UI questionnaires produces a 5-digit or 7-digit health state profile, respectively, that represents the level of reported problems on each of the 5 or 7 dimensions of health, for example, 11223 for the EQ-5D-5L or 1112234 for the ReQoL-UI. UHSP refers to the number of UHSPs represented by the group of participants on that specific measure.

Bootstrapping was used to calculate bootstrapped 95% confidence intervals (bCIs) and bootstrapped SEMs around costs and QALYs and for plotting cost-effectiveness acceptability curves. CC and MI analyses involved 5000 or 21 500 (ie, 500 nested within imputed data sets: m = 43) bootstrapped iterations, respectively. 67 Cost-effectiveness acceptability curves present the probability of intervention cost-effectiveness compared with control across a range of cost-effectiveness thresholds, for example, NICE's £20 000 to £30 000 per QALY. 4 CC analyses bCIs are bias-corrected and accelerated 95% confidence intervals (95% BCa CIs), which correct for the bias and skewness in the distribution of bootstrap estimates. This correction is methodologically complicated for MI data sets when jackknifing; therefore, percentile method bCIs (95% bCIs) are used to reflect value coverage across bootstrapped MI data sets. 13,76 Additional analyses exploring the interaction between estimated QALYs, preference-based scoring algorithms, and item scores are described in the Appendices in Supplemental Materials found at https://doi.org/10.1016/j.jval.2021.11.1358.
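The idea behind a bootstrapped probability of cost-effectiveness, as plotted on an acceptability curve, can be sketched as follows. This is a simplified, non-nested percentile bootstrap on hypothetical data. It assumes cost-effectiveness is judged by a positive incremental net monetary benefit (threshold × incremental QALYs − incremental cost), and it does not reproduce the authors' nested MI bootstrap or BCa intervals.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def prob_cost_effective(intervention, control, threshold, n_boot=1000, seed=0):
    """Share of bootstrap replicates in which the intervention is cost-effective.

    intervention, control: lists of (cost, qaly) tuples, one per participant
    threshold: willingness to pay per QALY (eg, 30000)
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    wins = 0
    for _ in range(n_boot):
        # Resample participants with replacement within each arm,
        # keeping each participant's cost and QALY paired.
        boot_int = rng.choices(intervention, k=len(intervention))
        boot_con = rng.choices(control, k=len(control))
        d_cost = _mean([c for c, _ in boot_int]) - _mean([c for c, _ in boot_con])
        d_qaly = _mean([q for _, q in boot_int]) - _mean([q for _, q in boot_con])
        if threshold * d_qaly - d_cost > 0:  # incremental net monetary benefit
            wins += 1
    return wins / n_boot


if __name__ == "__main__":
    # Hypothetical (cost in GBP, QALY) pairs; not trial data.
    int_arm = [(120.0, 0.110), (95.0, 0.104), (150.0, 0.118), (80.0, 0.099)]
    con_arm = [(60.0, 0.102), (40.0, 0.097), (90.0, 0.108), (30.0, 0.095)]
    for wtp in (20000, 30000):
        print(wtp, prob_cost_effective(int_arm, con_arm, wtp))
```

Repeating the calculation over a grid of thresholds gives the points of an acceptability curve. Resampling cost and QALY as a pair preserves their within-participant correlation.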

Descriptive Statistics
Overall, 361 people were randomized (241 intervention:120 control): 71.5% were female, with a mean age of 33 years (range …). Table 1 provides preference-based score descriptive statistics for observed cases at baseline across and within-trial arms. The crosswalk suggests this patient sample has the lowest, and the ReQoL-UI suggests the highest, mean preference-based health status at baseline. The EQ-5D-5L suggests this patient population is less heterogeneous than the ReQoL-UI does, categorizing 355 participants into 111 unique health state profiles (UHSPs), whereas the ReQoL-UI categorizes 353 participants into 319 UHSPs. Relatedly, each ReQoL-UI UHSP is accompanied by its own unique preference-based score (UPBS). In comparison, 111 UHSPs are quantified by 100 VSE UPBSs and 105 crosswalk UPBSs, because some health states are represented by the same preference-based score (see Table 1). Figure 1 shows kernel density estimates for the CC analyses preference-based scores at baseline and 8 weeks, plotted within and across trial arms; the change in score over this 8-week period is also presented. Figure 1 shows the VSE's distribution is "smoother" than for the crosswalk, but not the ReQoL-UI, which is partly due to the number of UHSPs and UPBSs represented by each measure. Smoother in this context implies a broader distribution of scores across the score range, resulting in less clustering and lower density around specific score ranges dependent on the prespecified bandwidth (ie, 0.02 for Fig. 1). Nevertheless, particularly at baseline, the ReQoL-UI presents higher density at the upper end of the scale (eg, >0.7) than the crosswalk or VSE, which can relatively restrict the scope for greater ReQoL-UI improvement after baseline. 
Relatedly in the intervention arm, the ReQoL-UI's high central density just above zero for 8-week score change is similar to the VSE and crosswalk, but the VSE and crosswalk have a broader distribution and additional peaks (eg, >0.15), which contributes to a greater mean change. Figure 2 presents MI mean preference-based scores with 95% confidence intervals across all data collection time points; Table 2 presents them up to 8 weeks. These results suggest crosswalk-based health is poorer than that estimated using the VSE or ReQoL-UI, which are more similar to each other than to the crosswalk (Fig. 2). The ReQoL-UI suggests that over 8 weeks the mean difference in preference-based health between trial arms decreases, whereas the EQ-5D-5L suggests it increases, with implications for estimating incremental QALYs. In the intervention arm, a statistically significant difference from baseline preference-based scores is achieved by 8 weeks for the EQ-5D-5L but not until 3 months for the ReQoL-UI; this 3-month period represents the natural treatment timeframe in the intervention arm not captured by the 8-week comparative trial period nor the incremental QALY estimates. Table 2 indicates the crosswalk produces the largest incremental QALY difference between trial arms over 8 weeks, although the ReQoL-UI produces more incremental QALYs than the VSE, suggesting the opposite to the change in preference-based scores (Fig. 2 and Table 2). This is because baseline imbalances are not accounted for across the individuals' total AUC calculations (Eq. 1), with regression-based adjustment recommended over individual-level adjustment as part of the AUC calculation. 78,79 Regression-based baseline-adjustment using the total AUC takes into account baseline imbalances in preference-based scores and the phenomenon that those individuals with preference-based scores that are lower or higher than the mean at baseline will usually experience a respectively higher or lower improvement at follow-up. Therefore, because of the baseline imbalance and greater variation between the 2 arms in the ReQoL-UI, when a baseline-adjustment is statistically applied, the mean incremental difference in QALYs between the 2 arms is smaller than without the baseline-adjustment. 78 Table 3 and Figure 3 show that across both CC and MI unadjusted analyses, EQ-5D-5L and ReQoL-UI suggest iCBT is cost-effective at <£30 000 per QALY (probability range 54%-64%). Baseline-adjusted QALY results are contrary to the aforementioned, whereby for the same MI analyses the ReQoL-UI suggests the highest ICER (£1 252 542) relative to the crosswalk's lowest "cost-effective" ICER (£27 684). When accounting for baseline-adjusted QALYs across CC and MI analyses, probability of cost-effectiveness at <£30 000 per QALY ranged from 6% (CC ReQoL-UI baseline-adjusted QALY) to 58.9% (CC crosswalk baseline-adjusted QALY). 
The largest change in probability of cost-effectiveness when moving from unadjusted to baseline-adjusted QALYs was for the ReQoL-UI in the MI analysis, which dropped from 60.9% to 7.4%, an absolute decrease of 53.5 percentage points. Baseline-adjusted costs and seemingly unrelated regression results are presented in Appendix S4 in Supplemental Materials found at https://doi.org/10.1016/j.jval.2021.11.1358.
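To illustrate why regression-based baseline adjustment can change the incremental QALY estimate, the sketch below fits QALY = b0 + b1 × arm + b2 × baseline score by ordinary least squares and compares b1 with the unadjusted difference in arm means. It is a minimal pure-Python illustration on hypothetical data: the helper names are invented here, and it covers neither the seemingly unrelated regression nor the authors' Stata implementation.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X)b = X'y,
    solved with Gauss-Jordan elimination and partial pivoting."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def adjusted_increment(qalys, arm, baseline):
    """Coefficient on trial arm, holding baseline preference-based score fixed."""
    X = [[1.0, float(a), float(b)] for a, b in zip(arm, baseline)]
    return ols(X, qalys)[1]


def unadjusted_increment(qalys, arm):
    """Difference in mean QALYs: intervention (arm = 1) minus control (arm = 0)."""
    treated = [q for q, a in zip(qalys, arm) if a == 1]
    control = [q for q, a in zip(qalys, arm) if a == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


# Hypothetical data: the intervention arm starts with higher baseline scores.
ARM = [0, 0, 0, 1, 1, 1]
BASELINE = [0.50, 0.60, 0.70, 0.55, 0.65, 0.80]
QALYS = [0.01 + 0.002 * a + 0.15 * b for a, b in zip(ARM, BASELINE)]

if __name__ == "__main__":
    print("unadjusted:", unadjusted_increment(QALYS, ARM))
    print("adjusted:  ", adjusted_increment(QALYS, ARM, BASELINE))
```

In this constructed example the unadjusted difference attributes part of the baseline imbalance to the intervention, so it is several times larger than the adjusted coefficient. The direction and size of any such gap in real data depend on the imbalance actually observed.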

Incremental Results
The change in EQ-5D-5L and ReQoL-UI item-level scores are described in Appendix S5 in Supplemental Materials found at https://doi.org/10.1016/j.jval.2021.11.1358. To summarize, the EQ-5D-5L's cost-effectiveness results seem to be driven by the "usual activities" and "anxiety/depression" items, with the intervention arm having better outcomes on average than the control arm across all EQ-5D-5L domains. Nevertheless, ReQoL-UI's item results were more varied, with the control group having better outcomes on average than the intervention arm across 3 (belonging and relationship, physical activity, self-perception) of its 7 items, influencing the incremental ReQoL-UI results and subsequent QALY estimates.

Discussion
This study supports current empirical evidence that value sets such as the VSE and cross-walked scores produce different QALYs even when from the same classification system. 38,39,41,80 We found that the VSE preference-based scores were more similar to those from the ReQoL-UI than the crosswalk. This meant the VSE and ReQoL-UI produced similar unadjusted QALYs. These similarities disappeared when statistically accounting for baseline preference-based scores, given that the control group improved more on average than the intervention group over 8 weeks on the ReQoL-UI, a result not mirrored on the EQ-5D-5L nor the trial's condition-specific (GAD-7 and PHQ-9) measures. 64 This meant the ReQoL-UI had a lower probability of the intervention being cost-effective than the VSE or crosswalk: a decision maker is unlikely to consider implementing iCBT based on these ReQoL-UI results, but might when using the crosswalk results. These differences stem from the analyses conducted (eg, CC vs MI; unadjusted vs baseline adjusted) and the measures themselves.

Figure 2. MI preference-based score means with 95% CIs at baseline and 8 weeks per trial arm and at 3, 6, 9, and 12 months in the intervention arm. CI indicates confidence interval; Cont., control; EQ-5D-5L, EQ-5D 5-level version; Int., intervention; MI, multiple imputation; ReQoL-UI, ReQoL-Utility Index; VSE, value set for England.

Exploring Why the ReQoL-UI and EQ-5D-5L Produce Different QALYs
The different preference-based scores produced by the ReQoL-UI and EQ-5D-5L stem from aspects such as the content and size of their classification systems, the methods and populations used to value health states, and how their underlying preference-based scoring algorithms are constructed.
The ReQoL-UI can quantify a larger number of health states than the EQ-5D-5L (ie, 78 125 vs 3125), suggesting our study sample are more heterogeneous by categorizing them into almost 3 times more health state profiles than the EQ-5D-5L. This categorization stems from responses at the item-score level, which indicated more response variability for the ReQoL-UI than the EQ-5D-5L (see Appendix S5 in Supplemental Materials found at https://doi.org/10.1016/j.jval.2021.11.1358). The ability to categorize population samples into more health states should permit the measure to be more sensitive to change in generic health status, as long as that change is represented by the measure's items and preference-based score.
As far as the current authors are aware, there is only one published psychometric assessment of the EQ-5D-5L and ReQoL-UI: a study conducted by the current authors using the same data source as this article specifically to inform the associated within-trial economic evaluation. 59 This psychometric analysis suggests the ReQoL-UI has poorer responsiveness to change in GAD-7 anxiety or PHQ-9 depression severity than the EQ-5D-5L, which will have contributed to the smaller incremental QALY gains observed in this within-trial economic evaluation. Additionally, although the EQ-5D-5L was identified as having better construct validity with GAD-7 anxiety severity than the ReQoL-UI, the ReQoL-UI had better construct validity with PHQ-9 depression severity. The items that drove these construct validity results, particularly for the EQ-5D-5L, were the same items for which we identified a statistically significant difference between trial arms over 8 weeks (eg, "anxiety/depression" and "usual activities") as shown in Appendix S5 in Supplemental Materials found at https://doi.org/10.1016/j.jval.2021.11.1358. 59 The ReQoL-UI has some perceived benefits over the EQ-5D-5L in mental health populations, including the ability to represent a larger number and variety of mental health states with better depression construct validity. Nevertheless, in the MI analysis, for example, the incremental ReQoL-UI baseline-adjusted QALYs were minimal (<0.0001) compared with those estimated from the VSE (0.0023) or crosswalk (0.0031), a result stemming in part from the ReQoL-UI's poorer responsiveness (particularly over 8 weeks). This is an unexpected result given that we would expect the ReQoL-UI to be more responsive given its mental health focus and classification system. Nevertheless, the psychometric analysis only partly explains the different QALY estimations. 
Also influencing the result is that the ReQoL-UI's preference-based score is based on a "random effects model consisting of a quadratic specification of θ with interaction terms for θ and levels 3, 4, and 5 of physical health [sic]." 58 In other words, the physical health item or dimension has a direct interaction with the mental health dimension (θ) within the ReQoL-UI preference-based scoring algorithm. This is practically and conceptually different to how the EQ-5D value sets are scored, with implications for the derived preference-based score. It is important that researchers currently using, or considering using, the ReQoL-UI are aware of this interaction and associated rationale as described by Keetharuth and Rowen. 58 It is our hypothesis that the interaction with the physical health item contributed to the responsiveness statistics identified by the previous psychometric analysis and to why the control group improved more on average than the intervention group in this IAPT-based within-trial analysis, as discussed further in the next sub-section and Appendix S6 in Supplemental Materials found at https://doi.org/10.1016/j.jval.2021.11.1358. 59

Implications for Mental Health Services, Users, and Research
The trial context is important for interpreting our results. IAPT step 2 focuses on specific mental health populations and interventions; that is, common mental health conditions that could benefit from low-intensity therapies as brief psychological interventions (eg, DMHI, Bibliotherapy) offered with support from clinicians. 81 Furthermore, IAPT standards of patient recovery focus on symptom improvement, where "recovery" is defined as moving from "caseness" (PHQ-9 ≥10; GAD-7 ≥8) to "no caseness." 54 The ReQoL-UI psychometrics and within-trial results are potentially representative of its intended "recovery-focused" construct, which is different to "recovery" as operationalized by IAPT. Such symptomatic changes seem to be captured in part by the EQ-5D-5L dimensions of "usual activities" and "anxiety/depression," which drive our within-trial results (Appendix S5 in Supplemental Materials found at https://doi.org/10.1016/j.jval.2021.11.1358). In comparison, the ReQoL-UI is developed from a conceptual framework of personal recovery in mental health, which is more focused on improving long-term wellbeing through self-management and having personally meaningful life goals, therefore expanding beyond the traditional symptom-based recovery paradigm. 57,[82][83][84][85] Given that IAPT performance metrics are, in part, symptom-based recovery with a focus on mental health, previous psychometric results suggest that the EQ-5D-5L captures these aspects better for anxiety severity and with greater responsiveness than the ReQoL-UI, and this is reflected in our IAPT-based within-trial economic evaluation results. 59 Additionally, as mentioned in the previous sub-section, the ReQoL-UI's preference-based scoring algorithm includes a physical health interaction term with the mental health domain; this type of interaction term is not used in the EQ-5D measures' preference-based value set scoring algorithms.
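The symptom-based notion of "recovery" described above (moving from caseness to no caseness) can be expressed as a simple rule. The sketch below uses the thresholds quoted in the text. Treating a person as a case when either measure is at or above its threshold is an assumption made here for illustration, and the function names are hypothetical.

```python
PHQ9_CASENESS = 10  # depression caseness: PHQ-9 >= 10 (threshold from the text)
GAD7_CASENESS = 8   # anxiety caseness: GAD-7 >= 8 (threshold from the text)


def is_case(phq9: int, gad7: int) -> bool:
    """Assumption: caseness means being at or above threshold on either measure."""
    return phq9 >= PHQ9_CASENESS or gad7 >= GAD7_CASENESS


def symptom_recovered(baseline: tuple, follow_up: tuple) -> bool:
    """True when a person moves from caseness at baseline to no caseness at follow-up.

    baseline, follow_up: (PHQ-9 score, GAD-7 score) pairs
    """
    return is_case(*baseline) and not is_case(*follow_up)


if __name__ == "__main__":
    # Hypothetical scores, not trial data.
    print(symptom_recovered((14, 9), (6, 5)))
    print(symptom_recovered((14, 9), (9, 8)))
```

A binary rule of this kind captures symptom change only. It says nothing about the broader personal recovery construct that the ReQoL measures are designed around, which is the contrast drawn in the surrounding text.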
Step 2 IAPT patients are referred on the basis of experiencing acute depression and/or anxiety symptomology, with improvements in physical health not being a key purpose of the service. In this trial's IAPT-based population, the majority of participants reported baseline physical health as "no problem" or "slight problem," with the majority not moving from this baseline state (Appendix S5 in Supplemental Materials found at https://doi.org/10.1016/j.jval.2021.11.1358). Because of the interaction term in the ReQoL-UI's preference-based scoring algorithm, and because the majority of the study sample had no or slight problems with physical health at baseline that did not change over the trial period, there is restricted ability for the ReQoL-UI's preference-based score to change, even if there are changes across the mental health domain. This will have influenced the ReQoL-UI's responsiveness, but also incremental QALY estimates, particularly given that the control arm, by chance, had more people who reported worse physical health at baseline and had a higher mean improvement in physical health over 8 weeks than the intervention arm. Compared with the ReQoL-UI, for the EQ-5D value sets, an interaction term is not imposed between the physical and mental health items, allowing more independence between items in the preference-based scoring algorithm. This aspect is explored further in Appendix S6 in Supplemental Materials found at https://doi.org/10.1016/j.jval.2021.11.1358.
In different mental health settings (eg, hospital outpatients) and patient populations (severe mental health disorders), with different intervention types (high-intensity interventions), these psychometric and within-trial results could be different. Further research is warranted including to what extent various mental health interventions, from medication to DMHIs, are intended to promote symptomatic or personal recovery and physical health, which itself could dictate whether the EQ-5D-5L or ReQoL-UI may be the more appropriate preference-based measure to estimate cost-effectiveness. Further exploratory analysis of the ReQoL-UI is warranted before it is used to guide resource-allocation decision making, particularly as a complement or substitute to the EQ-5D-5L. Additionally, EuroQol's blog provides updates for its new health and wellbeing instrument (EQ-HWB), which should be considered for future research. 86

Limitations
The 8-week between trial arm analyses limited the ability to assess incremental QALYs over a longer time-horizon. Common mental health disorder trials rarely exceed 12-month follow-up, with most follow-up periods aligning with when clinical change is most likely to be observed following treatment: between 8 and 12 weeks. [87][88][89] The lack of longer-term data also limits the ability and/or reliability of conducting extrapolated or modeling-based analyses over an even longer time-horizon. A systematic review of DMHI economic evaluations stated that 54 of 66 included articles did not explore the results beyond trial endpoints: "lack of longer-term modeling is likely to be due to, in part, the lack of reliable data about the long-term performance of DMHIs." 51 These data-driven limitations suggest longer-term comparative trial follow-ups are needed whenever possible with statistical methods as secondary options. 14  The VSE has suggested complications beyond what our analysis explores, with a new UK valuation study underway. [31][32][33]59,92 Nevertheless, as an imperfect direct value set for the EQ-5D-5L relative to the crosswalk that represents the EQ-5D-3L UK value set, it is still useful and informative for this exploratory analysis.

Conclusions
These results indicate the importance of reflecting on a preference-based measure's whole design before using it for economic evaluation. Aspects of that design can be revealed by conducting psychometric analyses, given that it is difficult to wholly understand from QALYs at face value why different preference-based measures produce different QALYs. These differences stem from the mathematical and statistical methods used and from the preference-based measure itself, which need to be considered holistically to understand any subsequent QALY and cost-effectiveness estimates before suggesting to decision makers whether an intervention is "cost-effective" based on such evidence.

Supplemental Materials
Supplementary data associated with this article can be found in the online version at https://doi.org/10.1016/j.jval.2021.11.1358.