
# Modifying the Composite Time Trade-Off Method to Improve Its Discriminatory Power

Open Access. Published: October 13, 2022

## Highlights

• Composite time trade-off (cTTO) method used to elicit utility values for health states has limited discriminatory power due to some respondents not trading time for quality for mild states, multiple observations being censored at −1 for severe states, and counterintuitive lack of association between the negative part of utility and state severity.
• We show how changing cTTO can help to overcome all the above 3 phenomena. Offering smaller trades or changing the framing to avoid loss aversion induces trading. Making a task more consistent across better than dead and worse than dead parts restores the association between the negative utility and state severity. The changes result in much lower utility values.
• The values elicited with cTTO may overestimate the utility. In consequence, the value of health-related quality of life improvements from severe states may be underestimated, and the value of life prolonging treatments in impaired health may be overestimated, whereas the health interventions offering small health-related quality of life gains for small incremental costs may be artificially disadvantaged.

## Abstract

### Objectives

In cost-effectiveness analysis of health technologies, health state utilities are needed. They are often elicited with a composite time trade-off (cTTO) method, particularly for the widely used EQ-5D-5L. Unfortunately, cTTO discriminatory power is hindered by (1) respondents’ nontrading (NT) of time for quality, (2) censoring of utilities at −1, and (3) poor correlation of negative utilities with state severity. We investigated whether modifying cTTO can mitigate these effects.

### Methods

We interviewed 478 students online (February to April 2021), each of whom valued the same 10 EQ-5D-5L health states in 1 of 3 arms. Arm A used a standard cTTO, expanded with 2 questions to explore reasons for NT and censoring. Arms B and C used a time trade-off with modified alternatives offered to overcome loss aversion, to unify the tasks for positive and negative utilities, and to enable eliciting utilities < −1.

### Results

In arms B and C, we observed less NT than in A (respectively, 4% and 4% vs 10%), more strictly negative utilities (38% and 40% vs 25%), and more utilities ≤ −1 (18% and 30% vs 10%). The average utility of state 55555 dropped to −2.15 and −2.52 from −0.53. Enabling finer trades in arm A reduced NT by 70%. Arms B and C yielded an intuitive association between negative utilities and state severity. These arms were considered more difficult and resulted in more inconsistencies.

### Conclusions

The discriminatory power of cTTO can be improved, but it may require increasing the difficulty of the task. The standard cTTO may overestimate the utilities, especially of severe states.

## Introduction

To decide whether to reimburse a health technology, its cost and effects are compared in cost-utility analysis. The effects are usually measured as quality-adjusted life-years (QALYs). Each health state is assigned a weight expressing how good or bad it is, usually based on societal preferences. The product of the weight and time spent in the state defines QALYs (see Bleichrodt et al,
[Bleichrodt H, Wakker P, Johannesson M. Characterizing QALYs by risk neutrality.]
for formalities).
Health states are often defined using one of the EQ-5D family of instruments (see Kennedy-Martin et al,
[Kennedy-Martin M, Slaap B, Herdman M, et al. Which multi-attribute utility instruments are recommended for use in cost-utility analysis? A review of national health technology assessment (HTA) guidelines.]
for a review). To assign weights, the preferences are elicited with various tasks in a sample of respondents. The elicited utilities are averaged and extrapolated to all the states, to form a collection of weights (a value set).
The time trade-off (TTO) method has been widely used to elicit utilities, especially for the EQ-5D family of instruments (for instance
[Versteegh M, Vermeulen K, Evers S, de Wit G, Prenger R, Stolk E. Dutch tariff for the five-level version of EQ-5D.]
[Pickard A, Law E, Jiang R, et al. United States valuation of EQ-5D-5L health states using an international protocol.]
[Yang F, Katumba K, Roudijk B, et al. Developing the EQ-5D-5L value set for Uganda using the ‘lite’ protocol.]
). TTO aims to identify the indifference point where foregoing either quality of life (QoL) or duration is equivalent; that is, the following 2 options seem equally attractive to the respondent: (1) living in state $Q$ with impaired health for 10 years (usually) and (2) living in full health (FH) for $T$ years, $T≤10$. An iterative procedure is used to find $T$. Under the QALY model, the utility, $u(Q)$, can then be calculated as $T/10$, using the QALY scale where $u(FH)=1$ and $u(dead)=0$. If the respondent considers a state worse than dead (WTD) and trades all the time, a modification is required to elicit negative utilities. In lead-time TTO (LT-TTO), 10 years in FH are added to both alternatives to enable further trading
[Jansen BM, Oppe M, Versteegh MM, Stolk EA. Introducing the composite time trade-off: a test of feasibility and face validity.]
.
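The two indifference conditions above translate into utilities by simple arithmetic. The following is a minimal sketch under the QALY model with a 10-year horizon; the function names are illustrative assumptions and are not taken from any valuation software.

```python
def tto_utility(t_full_health: float, horizon: float = 10.0) -> float:
    """Regular TTO: (FH, T) ~ (Q, horizon)  =>  u(Q) = T / horizon."""
    if not 0.0 <= t_full_health <= horizon:
        raise ValueError("T must lie between 0 and the horizon")
    return t_full_health / horizon


def lead_time_tto_utility(t_full_health: float,
                          lead: float = 10.0,
                          horizon: float = 10.0) -> float:
    """LT-TTO: (FH, T) ~ (FH, lead) + (Q, horizon).

    Under the QALY model, T = lead + horizon * u(Q),
    so u(Q) = (T - lead) / horizon, which cannot fall below -lead / horizon.
    """
    if not 0.0 <= t_full_health <= lead:
        raise ValueError("T must lie between 0 and the lead time")
    return (t_full_health - lead) / horizon


if __name__ == "__main__":
    print(tto_utility(7.5))            # a better-than-dead state
    print(lead_time_tto_utility(4.0))  # a worse-than-dead state
    print(lead_time_tto_utility(0.0))  # all lead time traded: the floor of the scale
```

With a 10-year lead time and a 10-year horizon, trading all the time yields the floor of −1, which is the censoring point discussed later in the text.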
The combination of a regular TTO and LT-TTO (for better than dead [BTD] and for WTD states, respectively) is called a composite TTO (cTTO, see Jansen et al
and Stolk et al
[Stolk E, Ludwig K, Rand K, van Hout B, Ramos-Goñi JM. Overview, update, and lessons learned from the international EQ-5D-5L valuation work: version 2 of the EQ-5D-5L valuation protocol.]
). It is commonly used to build value sets for EQ-5D instruments using an operationalization standardized in the EuroQol Valuation Technology (EQ-VT) protocol (henceforth, we refer to this operationalization as standard cTTO). cTTO was the sole source of data for some value sets (eg, Pickard et al
and Versteegh et al
). Even when cTTO is combined with discrete choice experiment (DCE) without duration, it remains the sole basis for anchoring of the utilities on the QALY scale (see Ramos-Goñi et al,
[Ramos-Goñi JM, Oppe M, Cabasés JM, Serrano-Aguilar P, Rivero-Arias O. Valuation and modeling of EQ-5D-5L health states using a hybrid approach.]
Golicki et al,
[Golicki D, Jakubczyk M, Graczyk K. Valuation of EQ-5D-5L health states in Poland: the first EQ-VT-based study in Central and Eastern Europe.]
and Jensen et al,
[Jensen CE, Sørensen SS, Gudex C, Jensen MB, Pedersen KM, Ehlers LH. The Danish EQ-5D-5L value set: a hybrid model using cTTO and DCE data.]
for examples). Therefore, the performance of cTTO is crucial for a reliable measurement of societal preferences for health. Unfortunately, cTTO still has shortcomings. In this article, we focus on its limited discriminatory power, that is, limited ability to capture the utility differences between health states, for very mild or severe states. For mild states, at the top of the utility scale, cTTO experiments occasionally end with nontrading (NT), that is, respondents not accepting any shortening of duration as equivalent to the worsening of QoL. In Golicki et al,
NT occurred in > 6% of EQ-5D-5L valuations, the proportion exceeded 30% for multiple mild health states, and > 50% of respondents refused to trade for at least 1 state. NT results in equating the utility of $Q$ and FH, which violates the logical ordering based on dominance (an impaired state $Q$ should have a utility lower than $u(FH)=1$).
Understanding the causes of NT, and perhaps reducing its frequency, is important for several reasons. If NT accurately reflects the preferences being based on lexicographic ordering (QoL only matters if longevity is equal), it implies that the standard QALY model does not hold (for more information on lexicographic orderings in general, see Keeney and Raiffa
[Keeney RL, Raiffa H. Decisions With Multiple Objectives: Preferences and Value Trade-Offs.]
). If NT misrepresents the preferences, perhaps because of the standard cTTO operationalization (see next section), then some modifications may allow estimating smaller disutilities (ie, differences between $u(FH)=1$ and $u(Q)$), which may be relevant in applications. In published cost-utility analyses, the median difference in QALYs amounted to 0.06.
[Wisløff T, Hagen G, Hamidi V, Movik E, Klemp M, Olsen J. Estimating QALY gains in applied studies: a review of cost-utility analyses published in 2010.]
Additionally, some value set construction methods depend more on detecting small disutilities. For instance, Schneider et al
[Schneider P, van Hout B, Brazier J. Fair interpersonal utility comparison in the valuation of health: a relative utilitarian preference aggregation method. GitHub.]
proposed to rescale disutilities within each individual for comparability, which magnifies even small disutilities for respondents with narrow ranges.
For severe states, at the bottom of the scale, the impact of the reduced discriminatory power is even larger. In LT-TTO, once the respondents trade off all the duration, no utility values < −1 can be elicited. Such all-in trading (AIT) occurred in 10.7% of cases in Golicki et al.
This left-censoring makes it impossible to infer the range of utilities and may distort the measurement of QALYs (for instance, when QoL improves from a severe state). When constructing value sets, the censoring is usually overcome by imposing parametric assumptions, which are vulnerable to misspecification errors. Censoring is even more troublesome if exact observations are needed to rescale the utilities (as in Schneider et al
and Jakubczyk
[Jakubczyk M. What if 0 is not equal to 0? Inter-personal health utilities anchoring using the largest health gains. Accepted for publication in Eur J Health Econ. 2022.]
).
To add to the problems with WTD states, the reliability of how cTTO captures the negative utilities overall can be questioned. Gandhi et al
[Gandhi M, Rand K, Luo N. Valuation of health states considered to be worse than death-an analysis of composite time trade-off data from 5 EQ-5D-5L valuation studies.]
found that the negative part of utility (ie, $u$ also right-truncated at 0) is not correlated with the EQ-5D-5L state severity as measured by level sum score (LSS, ie, the sum of level values). Whether this signals problems with LT-TTO is being debated, given that the truncation in data biases the estimates toward 0 (see Roudijk et al,
[Roudijk B, Donders R, Stalmeier P. A threshold explanation for the lack of variation in negative composite time trade-off values.]
for a detailed discussion and an alternative explanation of the lack of correlation). Nevertheless, the findings of Gandhi et al
are at least partially due to the large proportion of AIT cases, even for states that many respondents considered BTD. Thus, it is legitimate to ask whether AIT reveals true preferences (ie, utility ≤ −1) or it is an artifact, that is, the consequence of how the standard cTTO is constructed.
Our primary goal was to test how modifying the cTTO affects the proportion of NT and $u≤−1$ when valuing EQ-5D-5L health states. The modifications of cTTO were designed to overcome the hypothesized reasons for NT and AIT (elaborated in the section Possible Reasons for NT and AIT). In consequence, this article seeks to identify what causes these phenomena and suggest refinements to cTTO. Our secondary goal was to test whether the modifications restore the intuitive association between the negative utility and severity.
In section 2, we present the methods: the design of the arms with their respective rationale, the logistics of the study, and the approach to analysis. In section 3, we present the results. We discuss our findings in section 4, presenting the possible implications and the limitations of our approach. We conclude in section 5.

## Methods

### Possible Reasons for NT and AIT

We used 3 study arms—A, B, and C—designed following a set of hypotheses regarding possible reasons behind NT/AIT in the standard cTTO as presented below (other than simply reflecting the true preferences).
NT may trivially be caused by the limited granularity of responses in the standard cTTO, that is, minimum half-a-year steps. Alternatively, NT may stem from the following psychological phenomena. The cTTO starts with a comparison between living in FH for $T$ years and living in $Q$ for 10 years. Henceforth, for brevity, such health profiles are denoted by $(FH,T)$ and $(Q,10)$, respectively. Initially, $T=10$, and then $T$ is modified: if the former (latter) alternative is preferred, $T$ is reduced (increased), to worsen (improve) the first alternative. In this iterative process, the choices are framed as giving up time in $Q$ to gain QoL. The confirmatory message shown in the EQ-VT protocol once indifference is reached says: “[…] to avoid being in [Q] for 10 years you are willing to give up […] years out of 10. Is that correct?” Such a framing presents the QoL change as a gain (“avoid being,” ie, having FH instead of $Q$) and the duration change as a loss (“willing to give up,” ie, having $T<10$ instead of 10). Given loss aversion, a phenomenon well studied in economics
[Kahneman D, Tversky A. Prospect theory: an analysis of decision under risk.]
and in health valuation more specifically,
[Lipman S, Brouwer W, Attema A. QALYs without bias? Nonparametric correction of time trade-off and standard gamble weights based on prospect theory.]
this framing magnifies the disutility of $T$ reductions and reduces the propensity to trade (ie, give up time).
What adds to the loss aversion is the imprecision of preferences.
[Jakubczyk M, Golicki D. Elicitation and modelling of imprecise utility of health states.]
The above-presented framing makes $(Q,10)$ the status quo, especially as it is the constant alternative in the BTD part. Forgoing the status quo for $(FH,T)$ requires the respondent to be strongly convinced. In consequence, the imprecise perception of the utility of $(Q,10)$ will lead to only accepting large $T$ in $(FH,T)$. Zhao and Kling
[Zhao J, Kling C. A new explanation for the WTP/WTA disparity.]
used a similar reasoning to explain the willingness-to-pay/accept disparity.
Regarding AIT, if it results from true preferences, then assessing the actual utility values would be informative to correctly construct value sets. Alternatively, AIT may be enhanced by the discontinuity of preferences in the following sense. In LT-TTO, $(FH,T)$ is compared with a mixed profile of $(FH,10)$ followed by $(Q,10)$, denoted henceforth with the “+” symbol as $(FH,10)+(Q,10)$. The state $Q$ appears in only one of the alternatives. If a respondent derives disutility from the sole fact of having to experience $Q$, irrespective of the duration, this extra disutility may reinforce the willingness to trade.

### Study Arms

This experiment used a design with 3 arms described below.

#### Arm A

The standard cTTO was used as implemented in the EQ-VT protocol predominantly used for the valuation of EQ-5D-5L (see Stolk et al,
for a detailed description).
This arm started with a comparison of $(FH,10)$ versus $(Q,10)$. Typically, the former was preferred. Then, a task separating BTD from WTD scenarios was used (to test whether $u(Q)$ is greater than, equal to, or less than 0): immediate death versus $(Q,10)$. If the latter was preferred (ie, $Q$ was BTD), a single bisection step was used: $(FH,T)$ for $T=5$ versus $(Q,10)$. Subsequently, the former was worsened if it was preferred (by reducing $T$) or improved otherwise. One-year steps were used, unless a change in direction was needed or the comparison was near a corner ($T=10$ or $T=0$).
In the WTD case, 10 years were added to both lives to enable further trading. Hence, the following task was presented: $(FH,T)$ for $T=10$ versus $(FH,10)+(Q,10)$. This task effectively repeats the test of the sign of $u(Q)$. If the former was preferred, $T$ was set to 5, and then 1-year steps were used (reduced to half a year in some cases, as for BTD). No further trading beyond $T=0$, that is, immediate death, versus $(FH,10)+(Q,10)$ was possible. In Figure 1, we present what the standard cTTO arm looked like.
After the completion of cTTO for all the states, additional questions were asked in arm A for states for which either NT or AIT occurred. In the former case, the respondents compared 2 hypothetical worsenings (to make them both losses): of QoL from FH to $Q$ or of the duration from 10 years (in months, weeks, or days), to indicate when they would be indifferent.
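To see why finer trades matter, the answer to this additional question can be converted into a utility under the QALY model. The sketch below is illustrative only: the function name is hypothetical, and the year length of 365.25 days is an assumption, because the conversion used in the study is not stated here.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length, not specified in the study


def utility_from_days_traded(days: float, horizon_years: float = 10.0) -> float:
    """Indifference between (Q, horizon) and full health shortened by `days`.

    Under the QALY model: u(Q) = 1 - days / (horizon in days).
    """
    horizon_days = horizon_years * DAYS_PER_YEAR
    if not 0.0 <= days <= horizon_days:
        raise ValueError("days traded must lie between 0 and the horizon")
    return 1.0 - days / horizon_days


if __name__ == "__main__":
    for days in (7, 30, 182.625):
        print(days, round(utility_from_days_traded(days), 4))
```

A half-year trade, the smallest step in the standard cTTO, corresponds to a utility of 0.95, so any trade measured in weeks or days maps to a value between 0.95 and 1 that the standard task cannot record.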
For states ending in AIT, the respondents compared $(FH,5)+(Q,5)$ with $(FH,10)+(Q,10)$, to check whether $u(Q)$ is $<−1$, $=−1$, or $>−1$, while including $Q$ in both alternatives in case of discontinuous preferences. The exact wording of added questions is presented in the Appendix A in Supplemental Materials found at https://doi.org/10.1016/j.jval.2022.08.011.
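The logic of this follow-up question can be written out explicitly. The following is a worked restatement under the QALY model with $u(FH)=1$, not an additional result:

$$
(FH,5)+(Q,5) \succ (FH,10)+(Q,10) \iff 5+5\,u(Q) > 10+10\,u(Q) \iff u(Q) < -1 .
$$

Indifference between the two profiles corresponds to $u(Q)=-1$, and the reverse preference corresponds to $u(Q)>-1$.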

#### Arm B

Arm B was designed to avoid loss aversion and include $Q$ in both alternatives in the WTD scenario (to avoid the impact of preference discontinuity). The first choice task was as in arm A. However, if FH was preferred, $(Q,10)$ was improved by adding $(FH,10)$ at the beginning. Hence, in the BTD scenario, $(FH,10)$ was compared with $(FH,T)+(Q,10)$, $0≤T≤10$, and $T$ was increased (decreased) if the first (second) alternative was preferred. To resemble arm A, the first 3 choice tasks corresponded to $u(Q)=1$, $0$, and $0.5$. Then, 1-year steps were used (or half a year, as in arm A).
In the WTD part, to include $Q$ in both lives, $(FH,T)+(Q,10−T)$ was compared with a constant alternative $(FH,10)+(Q,10)$. Initially, $T=10$ in the BTD-WTD separating task, and subsequently $0.5≤T≤9.5$. Because the minimal elicitable utility value amounted to as little as $u(Q)=−19$ (for $T=0.5$), no bisection was used, but instead $T$ was changed in 1-year steps from $T=10$ (reduced to half a year, as in arm A). The same number of unique $T$ values is available in arms B and A for the BTD part, and an almost identical number for the WTD part (nevertheless, a wider range of negative utilities reduces the granularity). Arm B is illustrated in Figure 2.
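The utilities implied by the two parts of arm B follow from equating the QALYs of the two profiles. A minimal sketch, assuming the QALY model with $u(FH)=1$; the function name and the boolean flag are illustrative, not part of the study software:

```python
def arm_b_utility(t: float, worse_than_dead: bool) -> float:
    """Utility implied by the indifference point T in arm B.

    BTD part: (FH, 10) ~ (FH, T) + (Q, 10)
              10 = T + 10u               =>  u = (10 - T) / 10
    WTD part: (FH, T) + (Q, 10 - T) ~ (FH, 10) + (Q, 10)
              T + (10 - T)u = 10 + 10u   =>  u = (T - 10) / T
    """
    if worse_than_dead:
        if not 0.0 < t <= 10.0:
            raise ValueError("in the WTD part, T must satisfy 0 < T <= 10")
        return (t - 10.0) / t
    if not 0.0 <= t <= 10.0:
        raise ValueError("in the BTD part, T must satisfy 0 <= T <= 10")
    return (10.0 - t) / 10.0


if __name__ == "__main__":
    # First three BTD tasks: T = 0, 10, 5 correspond to u = 1, 0, 0.5.
    print([arm_b_utility(t, worse_than_dead=False) for t in (0.0, 10.0, 5.0)])
    # WTD grid from T = 9.5 down to T = 0.5 in half-year steps.
    print([round(arm_b_utility(0.5 * k, True), 2) for k in range(19, 0, -1)])
```

The WTD formula shows where the loss of granularity comes from: moving from $T=1$ to $T=0.5$ changes the implied utility from −9 to −19, whereas steps near $T=10$ change it only slightly.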

#### Arm C

Arm C was designed to avoid a change of framing between the BTD and WTD scenarios. In this unified framework, the respondent compared $(FH,10−2T)+(Q,T)$ with $(FH,2T)+(Q,10−T)$, starting with $T=0$, that is, $(FH,10)$ versus $(Q,10)$ just like in arms A and B. The BTD-WTD separating choice task is obtained for $T=2.5$: $(FH,5)+(Q,2.5)$ versus $(FH,5)+(Q,7.5)$, in which the alternatives differ only in time spent in $Q$. Increasing $T$ makes the first alternative less attractive and the second alternative more attractive. Half-a-year changes in $T$ were used, $0≤T≤4.5$. Because only 10 unique values of $T$ are available, no bisection was used in either BTD or WTD scenario. The minimal elicitable utility amounts to $−8$. The choice tasks are illustrated in Figure 3.
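The single formula behind arm C can be checked numerically. This is a sketch under the QALY model with $u(FH)=1$, assuming that indifference is recorded exactly at a grid value of $T$; the function name is illustrative.

```python
def arm_c_utility(t: float) -> float:
    """Utility implied by indifference at T in arm C.

    (FH, 10 - 2T) + (Q, T) ~ (FH, 2T) + (Q, 10 - T)
    10 - 2T + T*u = 2T + (10 - T)*u   =>   u = (10 - 4T) / (10 - 2T)
    """
    if not 0.0 <= t <= 4.5:
        raise ValueError("T must satisfy 0 <= T <= 4.5")
    return (10.0 - 4.0 * t) / (10.0 - 2.0 * t)


# Half-year grid: T = 0, 0.5, ..., 4.5 (10 unique values).
GRID = [0.5 * k for k in range(10)]
UTILITIES = [arm_c_utility(t) for t in GRID]

if __name__ == "__main__":
    for t, u in zip(GRID, UTILITIES):
        print(f"T = {t:3.1f}  ->  u(Q) = {u:6.3f}")
```

On this grid, $T=0$ gives $u=1$, $T=2.5$ gives $u=0$, and $T=4.5$ gives $u=−8$. Only four grid values are negative ($−0.5$, $−4/3$, $−3$, and $−8$), which illustrates how coarse the negative part of the scale is in this arm.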

### Interviewing

A convenience sample of students was randomized between the arms and interviewed online by 5 interviewers using screen sharing and a web application. More details can be found in the Appendix B in Supplemental Materials found at https://doi.org/10.1016/j.jval.2022.08.011.
Interviews started with a welcome and some general information. Then, the respondents were asked about demographics, own health, and experience with health problems.
The main part of the interview consisted of 3 warm-up tasks (living in a wheelchair and 11112 and 14554, as defined in EQ-5D-5L) and 10 actual TTO tasks. In the actual tasks, mostly mild and severe states were used, as per study objective. Two moderate states were also included to collect some evidence for the middle of the scale and to facilitate testing for logical consistency. The following states were presented (mild, severe, and then random order): (mild) 11121, 11211, 12111, and 21111; (moderate) 22222 and 33333; and (severe) 43555, 44444, 55424, and 55555, based on the data on the states with most NT and AIT in the Dutch valuation
[Versteegh M, Vermeulen K, Evers S, de Wit G, Prenger R, Stolk E. Dutch tariff for the five-level version of EQ-5D.]
.

### Analysis

We analyzed the elicited utilities per state and arm with cumulative distribution functions (cdf), means and standard deviations, and the proportion of NT, negative values, and ≤ −1 values (comparisons used Cochran-Mantel-Haenszel test stratified by state). In arm A, we also studied the utilities elicited with additional questions. Because in arms B and C much lower negative utilities can be elicited, whether all years were traded in the WTD part of the arms is not comparable. Hence, no counterpart of AIT was studied in arms B and C.
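For readers unfamiliar with stratified comparisons, the sketch below computes a Mantel-Haenszel pooled odds ratio across strata (here, health states). It is an illustration only: the counts are hypothetical, the function name is an assumption, and the text does not state that the reported odds ratios were obtained with this exact estimator.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio.

    `strata` is an iterable of (a, b, c, d) counts, one tuple per stratum:
      a = events in group 1,  b = non-events in group 1,
      c = events in group 2,  d = non-events in group 2.
    OR_MH = sum(a*d/n) / sum(b*c/n), with n = a + b + c + d per stratum.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0.0:
        raise ValueError("pooled odds ratio is undefined for these counts")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts for two states: (NT in arm B, traded in arm B,
    # NT in arm A, traded in arm A). These are NOT the study data.
    hypothetical = [(10, 90, 20, 80), (5, 95, 12, 88)]
    print(round(mantel_haenszel_or(hypothetical), 3))
```

Stratifying by state keeps the comparison within states of equal severity, so that differences between arms are not driven by the mix of mild and severe states.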
We studied the correlation between negative utility and state severity. Admittedly, the slope in such a regression is biased toward 0 compared with the regression for all (not only negative) utility values.
[Roudijk B, Donders R, Stalmeier P. A threshold explanation for the lack of variation in negative composite time trade-off values.]
Studying the relationship is still informative to gauge the effect on the association between state severity and negative utilities of the change in the BTD-WTD separating task or the alternatives offered when eliciting negative utilities. To expand the existing approach of Gandhi et al,
we also used, as measures of severity (besides LSS), the proportion of WTD and the mean positive utility (see Appendix E in Supplemental Materials found at https://doi.org/10.1016/j.jval.2022.08.011). Given that utilities in arm A are censored at −1, we introduced the same censoring for arms B and C in this analysis for comparability. To inspect data quality, we checked for interviewer effects (see Appendix E in Supplemental Materials) and compared the arms in terms of inconsistencies (assigning greater utility to an objectively worse state). In each interview, there were 35 unique pairs of different, Pareto-ranked states. We calculated the proportion of pairs with an illogical valuation ordering.
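The count of 35 Pareto-ranked pairs and the inconsistency proportion can be reproduced with a short script. The state list comes from the Interviewing section; the helper names and the example respondent are hypothetical.

```python
from itertools import combinations

STATES = ["11121", "11211", "12111", "21111", "22222",
          "33333", "43555", "44444", "55424", "55555"]


def dominates(better: str, worse: str) -> bool:
    """True if `better` is no worse on every dimension and the states differ."""
    levels_b = [int(c) for c in better]
    levels_w = [int(c) for c in worse]
    return better != worse and all(b <= w for b, w in zip(levels_b, levels_w))


def pareto_ranked_pairs(states):
    """All (better, worse) pairs in which one state dominates the other."""
    pairs = []
    for s1, s2 in combinations(states, 2):
        if dominates(s1, s2):
            pairs.append((s1, s2))
        elif dominates(s2, s1):
            pairs.append((s2, s1))
    return pairs


def inconsistency_share(utilities: dict) -> float:
    """Share of Pareto-ranked pairs where the worse state got a greater utility."""
    pairs = pareto_ranked_pairs(list(utilities))
    violations = sum(utilities[worse] > utilities[better] for better, worse in pairs)
    return violations / len(pairs)


# Hypothetical respondent: utility falls linearly with the level sum score,
# then one illogical ordering (22222 valued below 33333) is introduced.
EXAMPLE = {s: 1.0 - (sum(int(c) for c in s) - 5) / 20 for s in STATES}
EXAMPLE["22222"] = EXAMPLE["33333"] - 0.1

if __name__ == "__main__":
    print(len(pareto_ranked_pairs(STATES)))
    print(inconsistency_share(EXAMPLE))
```

Pairs such as 43555 versus 44444 are not Pareto-ranked, because each state is better on some dimension, so they are left out of the denominator.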
We also regressed observed disutilities on LSS for each respondent separately, to detect problems with understanding.
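A per-respondent check of this kind could look as follows. The sketch assumes an ordinary least squares fit of disutility ($1−u$) on LSS, which is an assumption because the estimator is not named in the text; the function names and the example utilities are hypothetical.

```python
def level_sum_score(state: str) -> int:
    """LSS: the sum of the five level digits of an EQ-5D-5L state."""
    return sum(int(level) for level in state)


def ols_slope(x, y) -> float:
    """Slope of the least-squares line of y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def disutility_slope(utilities: dict) -> float:
    """Slope of disutility (1 - u) on LSS for one respondent."""
    states = list(utilities)
    lss = [level_sum_score(s) for s in states]
    disutility = [1.0 - utilities[s] for s in states]
    return ols_slope(lss, disutility)


if __name__ == "__main__":
    # Hypothetical respondent; a non-positive slope would flag a possible
    # problem with understanding the task.
    respondent = {"11121": 0.9, "22222": 0.7, "33333": 0.5, "55555": -0.5}
    print(disutility_slope(respondent) > 0)
```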
To compare the respondents’ behavior, we analyzed the time spent on the task, the number of steps required to reach a final value, and the subjective difficulty (see Appendix E in Supplemental Materials found at https://doi.org/10.1016/j.jval.2022.08.011).

## Results

### Sample Characteristics

We collected 478 interviews in February to April 2021. Five interviews were removed because of internet problems and data loss, and 1 interview because of evident respondent inattentiveness (as judged by an interviewer). The demographic structure of the sample is presented in Table 1.
Table 1. Sample structure: number of interviews included, sex, age (mean and SD), own health as per VAS and EQ-5D-5L, and experience with various health problems.

| Characteristic | Arm A | Arm B | Arm C | Total |
| --- | --- | --- | --- | --- |
| N | 159 | 157 | 156 | 472 |
| Female, n (%) | 92 (58) | 84 (54) | 100 (64) | 276 (59) |
| Mean age (SD), years | 22.7 (3.2) | 22.7 (3.2) | 22.9 (3.1) | 22.7 (3.2) |
| Mean VAS (SD) | 81.9 (10.6) | 81.7 (11.4) | 82.5 (9.7) | 82.0 (10.6) |
| 11111, n (%) | 54 (34) | 52 (33) | 62 (40) | 168 (36) |
| 11112, n (%) | 22 (14) | 28 (18) | 32 (21) | 82 (17) |
| 11121, n (%) | 22 (14) | 15 (10) | 14 (9) | 51 (11) |
| 11122, n (%) | 16 (10) | 14 (9) | 9 (6) | 39 (8) |
| 11222, n (%) | 3 (2) | 8 (5) | 7 (5) | 18 (4) |
| 11113, n (%) | 8 (5) | 4 (3) | 4 (3) | 16 (3) |
| 11212, n (%) | 6 (4) | 5 (3) | 2 (1) | 13 (3) |
| Own health problems, n (%) | 35 (22) | 21 (13) | 27 (17) | 83 (18) |
| Family health problems, n (%) | 126 (79) | 125 (80) | 110 (71) | 361 (77) |
| Family premature death, n (%) | 76 (48) | 71 (45) | 67 (43) | 214 (45) |
| Religion, n (%) | 31 (20) | 31 (20) | 31 (20) | 93 (20) |

Note. One person in arm C did not identify with any sex. Regarding own health, only health states prevailing in at least 10 respondents in the whole sample are reported.
SD indicates standard deviation; VAS, visual analog scale.

### Utilities Per State and Arm

The main results are presented in Table 2 (means, standard deviations, and proportion of NT and $u≤−1$) and illustrated in Appendix C in Supplemental Materials found at https://doi.org/10.1016/j.jval.2022.08.011. For all the states, the difference in means between arm A and each of the other arms was highly significant (P < .001, t test).
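Table 2. Utility values per state and arm: mean, SD, % of NT, and % of $u≤−1$.

| State | Arm A: mean (SD) | Arm A: % NT | Arm A: % $u≤−1$ | Arm B: mean (SD) | Arm B: % NT | Arm B: % $u≤−1$ | Arm C: mean (SD) | Arm C: % NT | Arm C: % $u≤−1$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 11121 | 0.91 (0.11) | 21 | 0 | 0.78 (0.26) | 10 | 0 | 0.69 (0.24) | 11 | 0 |
| 11211 | 0.91 (0.09) | 23 | 0 | 0.77 (0.29) | 10 | 1 | 0.70 (0.25) | 11 | 0 |
| 12111 | 0.92 (0.11) | 28 | 0 | 0.78 (0.23) | 10 | 0 | 0.69 (0.29) | 11 | 1 |
| 21111 | 0.91 (0.11) | 26 | 0 | 0.76 (0.23) | 9 | 0 | 0.69 (0.26) | 8 | 1 |
| 22222 | 0.74 (0.25) | 3 | 0 | 0.53 (0.43) | 3 | 1 | 0.43 (0.78) | 3 | 3 |
| 33333 | 0.49 (0.36) | 0 | 2 | 0.17 (0.59) | 1 | 8 | −0.26 (1.35) | 0 | 19 |
| 43555 | −0.40 (0.52) | 0 | 23 | −1.36 (3.14) | 0 | 41 | −2.14 (2.40) | 0 | 77 |
| 44444 | −0.29 (0.54) | 0 | 20 | −1.48 (3.72) | 0 | 38 | −1.81 (2.20) | 0 | 64 |
| 55424 | −0.21 (0.56) | 0 | 20 | −1.20 (2.94) | 0 | 37 | −1.55 (2.20) | 0 | 58 |
| 55555 | −0.53 (0.46) | 0 | 30 | −2.15 (4.27) | 0 | 52 | −2.52 (2.63) | 0 | 79 |

NT indicates nontrading; SD, standard deviation.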
The experimental arms reduced the proportion of NT, with the odds ratio (OR) amounting to 0.36 (95% confidence interval [CI] 0.26-0.48) for B versus A and 0.37 (95% CI 0.28-0.51) for C versus A. Conversely, the experimental arms increased the proportion of $u≤−1$: OR = 2.53 (95% CI 1.99-3.21) for B versus A and OR = 8.04 (95% CI 6.27-10.33) for C versus A (all P < .001).
The experimental arms increased the proportion of states being WTD, $u≤0$ (or strictly WTD, $u<0$). In arms A to C, respectively, it amounted to 28%, 40%, and 44% (or 25%, 38%, and 40%), with the differences between A and the remaining arms being significant (P < .001).

### NT and AIT in Additional Questions in Arm A

With finer trades allowed, the respondents traded in 70% of NT cases (see Table 3). For 22222, some respondents traded more than what is possible in cTTO. We attribute it to the erratic behavior of some respondents combined with a very small sample for this state. Among all AIT cases in the standard cTTO part, 92% ended in the respondent strictly preferring immediate death, that is, demonstrating $u(Q)<−1$. This high proportion was confirmed in the additional question: the respondents mostly (77%) preferred $(FH,5)+(Q,5)$ over $(FH,10)+(Q,10)$, which also corresponds to $u(Q)<−1$, under the QALY model.
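Table 3. Arm A, trading when offered shorter intervals in additional question (time in days).

| State | NT cases: n, % trading | Time traded, all NT cases: mean (SD), median | Time traded, trading respondents only: mean, median |
| --- | --- | --- | --- |
| 11121 | 33, 64% | 50 (126), 7 | 78, 60 |
| 11211 | 37, 68% | 52 (132), 7 | 77, 30 |
| 12111 | 44, 68% | 42 (70), 8 | 61, 30 |
| 21111 | 42, 67% | 49 (84), 12 | 73, 30 |
| 22222 | 5, 80% | 393 (483), 150 | 492, 435 |

SD indicates standard deviation.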

### The Overall Distribution Properties

In Figure 4, we present the cdf per state and arm. In A, for severe states, there are many close-to-0 positive utility values and few close-to-0 negative values. As a result, the cdf have a kink around 0: they are flat in the range (−0.5, −0.05), rise steeply in (0, 0.3), and flatten again in (0.3, 1). Such a pattern is not discernible for arms B and C: the cdf have a convex-concave shape (when smoothed and treated as continuous), as for typical probability distributions (eg, normal or logistic).
In Figure 5, we illustrate the association of negative utilities with 3 measures of state severity: LSS, proportion of respondents for whom the state was WTD, and the mean positive utility. For arm A, the findings of Gandhi et al
are replicated (ie, no correlation) and extended to other measures of severity, whereas the correlation is present for arms B and C. The correlation is also present when only looking at the states that were considered WTD in arm A (so in this sense, it does not stem from the overall utility being decreased in arms B and C).

### Respondents’ Consistency

For each respondent, the association of LSS with disutility was positive, as expected. The mean proportion of state pairs violating logical order amounted to 1.56%, 4.79%, and 2.69% in arms A, B, and C, respectively (significantly lower in arm A: P < .001 vs B and P = .002 vs C). To account for the size of inconsistency, for each interview we calculated the minimal utility shift (ie, the aggregate absolute change in utility) that would remove all the inconsistencies. The average shift needed amounted to 0.09, 0.52, and 0.94 in arms A, B, and C, respectively (P = .001 vs B and P < .001 vs C). The increase in the proportion and size of inconsistencies in the experimental arms agrees with the perceived greater difficulty of the task (see Appendix E in Supplemental Materials found at https://doi.org/10.1016/j.jval.2022.08.011).

## Discussion

### The Ends of TTO

Regarding the upper end of the utility scale, the cTTO modifications decreased the proportion of NT. This was true both of additional questions in arm A (offering finer trades) and the changed structure of choices in arms B and C. Therefore, the lexicographic preferences seem not to be as prevalent as the standard cTTO could suggest. As such preferences violate the QALY model, this is a convenient finding from the theoretical perspective. The 2 modifications of cTTO could be combined, to further reduce NT.
Arms B and C were designed not to invoke loss aversion. We observed more trading in these arms, which suggests that loss aversion is indeed present in standard cTTO. This finding complements earlier research by Lipman et al,
who showed that correcting for loss aversion (in an approach different to ours) also decreases the utility values. We see no other mechanisms that can explain our findings. For instance, discounting would not serve this purpose given that for small trades the choices in arms A and B are similar: for example, instead of comparing $(FH,9.5)$ with $(Q,10)$ in arm A, we have $(FH,10)$ versus $(FH,0.5)+(Q,10)$ in arm B. We also do not think our findings can be explained by the ordering effect (ie, preference for the QoL improving rather than worsening over time). In arm C and in the WTD part of arm B, the ordering effect would cancel out between the 2 alternatives. In the BTD part of arm B, the aversion to QoL worsening over time would increase the elicited utility value.
Regarding the lower end of the scale, had AIT in cTTO been caused by discontinuity of preferences, the utilities elicited with B and C should have increased. Nevertheless, the experimental arms increased the frequency of $u≤−1$, providing no evidence to support our hypothesis on discontinuity.
An important conclusion from our findings is that the large number of AIT cases in the standard cTTO is not an artifact (ie, it is not caused by the design of cTTO) but it represents true preferences. This is confirmed by the answers of respondents to the additional question in arm A (see NT and AIT in Additional Questions in Arm A) and the elicited utilities in arms B and C being often $<−1$. In consequence, the standard cTTO is subject to a serious censoring problem, and the credibility with which the range of negative utilities is elicited is limited.
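
The censoring can be made concrete with the usual cTTO scoring arithmetic. The sketch below assumes the standard layout (10 years in the BTD task, and 10 years of lead time in FH followed by 10 years in Q in the WTD task). The function name is hypothetical, and the code illustrates only the scoring rule, not the interview software.

```python
def ctto_utility(years_fh: float, worse_than_dead: bool) -> float:
    """Utility implied by indifference at `years_fh` years in full health."""
    if not 0 <= years_fh <= 10:
        raise ValueError("years_fh must lie between 0 and 10")
    if worse_than_dead:
        # Lead-time TTO: (FH, x) ~ (FH, 10) + (Q, 10)  =>  x = 10 + 10 * u
        return (years_fh - 10) / 10
    # Conventional TTO: (FH, x) ~ (Q, 10)  =>  x = 10 * u
    return years_fh / 10


# The lowest value the task can record is reached when all lead time is given up.
floor_value = ctto_utility(0, worse_than_dead=True)
```

Because `years_fh` cannot fall below 0, any respondent whose utility for a state lies below −1 is recorded at −1.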
Perhaps some part of NT and AIT observed in valuation studies results from carelessness of some respondents. Then, it may not be removed by modifications of the protocol. Nevertheless, strict quality control used in the valuation studies helps to spot and reduce the impact of such problematic respondents (Ramos-Goñi et al, 2017).

We also observed important effects in the middle of the QALY utility scale. We found many more WTD states in arms B and C than in A. This is particularly striking given that arm C offered much coarser granularity of negative values, so it could be expected that fewer respondents would assign a strictly negative value. This finding is complemented by the comparison of the cdf between arms (see The Overall Distribution Properties).
We attribute these findings to a more seamless transition from the BTD to the WTD part in the experimental arms, where the structure of choices changes little (especially in arm C, see Fig. 3). In arm A, to enter the WTD part, the respondent must choose immediate death over $(Q,10)$, the only comparison in cTTO (except for the AIT case) that involves a 0-duration alternative. Even if the comparison versus immediate death may be theoretically appealing, given that it removes some confounding factors (for instance, discounting or QoL changing over time), it may be appalling to some respondents, because imagining immediate death may trigger strong emotions. Such a choice is qualitatively different from other cTTO choices (Jakubczyk et al, 2018; Flynn, 2010; Sampson et al).
Thus, the comparison versus immediate death in arm A works as a strong gatekeeper preventing respondents from declaring a state Q as WTD. In consequence, respondents who would attribute a slightly negative $u(Q)$ under other elicitation methods may not pass the gatekeeper, causing clustering at slightly positive utilities. Nevertheless, once $u(Q)$ is sufficiently negative to pass the gate, many years are traded off in the LT-TTO, resulting in the kink of the cdf and distorting the correlation between the severity and negative utilities (especially in view of the floor effect in arm A, ie, censoring the utilities at −1). The abovementioned problems are apparently avoided in arms B and C. Evidence of respondents avoiding reporting $u(Q)≤0$ in the standard cTTO was also presented by Lipman et al (2022).

### Implications for Value Sets

Our findings suggest that the standard cTTO may overestimate the utility over the whole scale and in particular for more severe states. This is not to say that arms B and C provide correct estimates, especially given that these 2 arms differ from each other. We leave aside the discussion about the meaning of the “true value” of utility, which is beyond the scope of this article, given that our results clearly show that the elicited values strongly depend on how TTO is operationalized. In addition, because arms B and C posed more difficulties to the respondents, we do not conclude that they should replace the standard cTTO. Instead, the experimental arms highlight where the standard cTTO could fail.
This likely overestimation in arm A may explain some earlier findings. Wu et al (2021) conducted a large study in which 3320 respondents completed TTO and DCE with duration (DCEd) for health states defined using the SF-6Dv2 descriptive system. Health state utilities as elicited with TTO varied between −0.28 and 1, with 5% of states valued as WTD. The utilities obtained by DCEd were lower, between −0.54 and 1, with 8.5% of states WTD. The values were highly correlated between the approaches (intraclass correlation coefficient equal to 0.98), which suggests that the difference lies mostly in the scale assigned and not in a change in preferences. An earlier pilot study by Xie et al (2020) anchors the latent utilities on the QALY scale.
We acknowledge that an opposite difference was observed when DCEd and TTO data were estimated in independent samples from the same general population. For instance, Craig and Rand (2018) report a US value set based on DCEd with $u(55555)=−0.29$ and Pickard et al (2019) a TTO-based value set with $u(55555)=−0.57$. Nevertheless, Craig and Rand (2018) accounted for discounting, which makes the results less comparable, given that such correction has been shown to increase the estimated utilities (Jonker et al, 2018).
Because of the overestimation of utilities, the health gains resulting from QoL improvements are underestimated and those resulting from life prolongation are overestimated. The low discriminatory power at the top of the utility scale may discriminate against public health interventions offering small QoL gains for small incremental costs.
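
A short worked example illustrates both directions. Suppose, purely for illustration, that the elicited utility of a state Q exceeds the underlying one by some $\delta>0$, that is $\hat{u}=u+\delta$, and consider a horizon of $T$ years under the linear QALY model without discounting (the symbols $\hat{u}$, $\delta$, and $T$ are introduced here and do not appear in the study).

$$
\text{QoL improvement from } Q \text{ to } FH:\; (1-\hat{u})\,T = (1-u)\,T - \delta T,
\qquad
\text{life extension in } Q:\; \hat{u}\,T = u\,T + \delta T .
$$

The same $\delta T$ is subtracted from the gain of a treatment that restores full health and added to the gain of a treatment that prolongs life in state Q.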

### Limitations

The very low negative utilities and averages for severe states are surprising and lack face validity (at least to those who are used to existing value sets obtained with existing methods). Using elicitation methods with no censoring creates the risk that some respondents will assign very low, negative utilities, even if only because they misunderstood the task. Careful procedures for removing extreme values should then be implemented to avoid distorting the value set. The possibility of eliciting very negative values was considered a problem with the previous approach to TTO, as used in the Measurement and Value of Health study by Dolan (1997), which allowed eliciting utilities as low as −39. Because such very low utilities were observed, first a theoretically unfounded algebraic correction was introduced, to be subsequently replaced by cTTO. Nevertheless, our results show that limiting interest narrowly to the $[−1,1]$ range is probably unwarranted.
In addition, the low values obtained for mild and moderate states in arms B and C of our study lack face validity. Some of these low values may stem from the erratic behavior of some respondents, which also resulted in the increased inconsistency rate observed in these arms (the proportion of inconsistencies is still not worrying compared with other studies; Ramos-Goñi et al, 2017). That the experimental arms pose more difficulty to the respondent is also confirmed directly by the respondents in debriefing questions (see Appendix E in Supplemental Materials found at https://doi.org/10.1016/j.jval.2022.08.011). Think-aloud studies, assuming they provoke more deliberate choices, could provide additional insight into the extent to which such low utilities are credible and what thought processes can lead to them.
Another limitation is a very coarse granularity of possible answers in arm C. This was the result of trying simultaneously to (1) include both Q and FH in the alternatives, (2) avoid time steps < 6 months, and (3) not expand the total time horizon. Nevertheless, arm C gave similar results to B in many aspects (eg, the proportion of NT or how it restored the association between negative utility and state severity). In this sense, arm C provided confirmatory value to the whole study. The very negative values in C may result from the small number of steps needed to reach these very negative values: although in cTTO after 3 steps we may reach the utility of −0.5, in arm C it can be $u=−8$.
We admit that considering nonconstant alternatives (ie, ones in which QoL changes over time) is more cognitively demanding for the respondents. In cTTO no such alternatives are used in BTD, and in WTD only one of the alternatives is nonconstant. Nevertheless, in the experimental arms, often 2 alternatives were nonconstant. This fact might be the main reason for the increased difficulty and inconsistencies. Hence, in future research, attempts should be made to refine the standard cTTO but still refrain from nonconstant profiles as much as possible.
In the context of task difficulty, we used a sample that was highly nonrepresentative of the general population: only young and educated individuals. On the one hand, arms B and C might yield even more inconsistencies in less educated samples. On the other hand, perhaps some part of the difficulty resulted from the use of an online format. Because we did not aim to produce a value set, the lack of representativeness is not a limitation per se. We expect the effects of the TTO modifications in other samples to have the same direction as in the convenience sample we used, differing only in size.
Regarding this size, we largely surveyed the Dutch (even if a substantial fraction of the students from whom we recruited was international; alas, no data on nationality were collected), and more trading can typically be expected in Dutch samples (Olsen et al, 2018). We also believe that young people, typically not accustomed to health problems, may be more willing to trade to avoid them. Therefore, it would be interesting to see whether similar effects are observed in other samples.
Finally, our study focused on mild and severe states. Hence, the performance of the experimental arms in the middle of the scale is more difficult to assess. Further studies could attempt to gather more evidence.

## Conclusions

The experimental TTO arms alleviate some issues with cTTO at the upper end of the utility scale (NT), in the middle of the scale (crowding at or above 0), and for negative utilities (removing the censoring at −1 and the counterintuitive patterns when related to state severity). This suggests that standard cTTO may overestimate the utilities. Nevertheless, the experimental arms increased inconsistencies and yielded utility values with problematic face validity. Further research is therefore needed, and it should focus on reducing the effect of loss aversion and on determining whether a state is BTD or WTD in a way more compatible with other tasks.

## Article and Author Information

Author Contributions: Concept and design: Jakubczyk, Lipman, Roudijk, Norman, Pullenayegum, Yang, Gu, Stolk
Acquisition of data: Jakubczyk, Lipman, Roudijk
Analysis and interpretation of data: Jakubczyk, Lipman, Roudijk, Norman, Pullenayegum, Yang, Gu, Stolk
Drafting of the manuscript: Jakubczyk, Norman, Stolk
Critical revision of paper for important intellectual content: Jakubczyk, Lipman, Roudijk, Norman, Pullenayegum, Yang, Gu, Stolk
Statistical analysis: Jakubczyk
Obtaining funding: Jakubczyk, Lipman, Roudijk, Norman, Yang, Gu, Stolk
Administrative, technical, or logistic support: Jakubczyk, Lipman, Roudijk
Supervision: Jakubczyk
Conflict of Interest Disclosures: Dr Jakubczyk reported receiving grants from the EuroQol Research Foundation, during the conduct of the study, and grants and personal fees from the EuroQol Research Foundation, outside the submitted work. Drs Lipman, Pullenayegum, and Yang reported receiving grants from the EuroQol Research Foundation, during the conduct of the study, and grants from the EuroQol Research Foundation, outside the submitted work. Dr Roudijk reported receiving grants from and other financial relationships with the EuroQol Research Foundation, outside the submitted work. Dr Norman reported receiving grants from the EuroQol Research Foundation, during the conduct of the study, and grants and personal fees from the EuroQol Research Foundation, outside the submitted work. Dr Ning Yan Gu is a member of the EuroQol Research Foundation and reported receiving grants from the EuroQol Research Foundation for this study. Dr Stolk reported employment at and grants from the EuroQol Research Foundation, outside the submitted work. The EuroQol Research Foundation is the sole owner of the EuroQol Valuation Technology, which is a software tool for valuing health. EuroQol also has the intellectual property rights on the modified versions of EuroQol Valuation Technology tested in this article. The views presented in this article may not be shared by the EuroQol Research Foundation or the EuroQol Group. Drs Norman and Stolk are editors for Value in Health and had no role in the peer-review process of this article.
Funding/Support: This work was supported by grant number EQ Project 20190070 from the EuroQol Research Foundation.
Role of the Funder/Sponsor: The funder had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

## Acknowledgment

The authors appreciate the funding of this study by the EuroQol Research Foundation. The views expressed by the authors in the paper do not necessarily reflect the views of the EuroQol Foundation. The authors thank Meike van der Linden, Ilse Littooij, and Angeliki Kaproulia for their help in data collection. They also appreciate the cooperation from HostLab when developing and using the software for interviews.

## Supplemental Material

• Supplementary Material

## References

• Bleichrodt H, Wakker P, Johannesson M. Characterizing QALYs by risk neutrality. J Risk Uncertainty. 1997; 15: 107-114
• Kennedy-Martin M, Slaap B, Herdman M, et al. Which multi-attribute utility instruments are recommended for use in cost-utility analysis? A review of national health technology assessment (HTA) guidelines. Eur J Health Econ. 2020; 21: 1245-1257
• Versteegh M, Vermeulen K, Evers S, de Wit G, Prenger R, Stolk E. Dutch tariff for the five-level version of EQ-5D. Value Health. 2016; 19: 343-352
• Pickard A, Law E, Jiang R, et al. United States valuation of EQ-5D-5L health states using an international protocol. Value Health. 2019; 22: 931-941
• Yang F, Katumba K, Roudijk B, et al. Developing the EQ-5D-5L value set for Uganda using the ‘lite’ protocol. Pharmacoeconomics. 2022; 40: 309-321
• Jansen BM, Oppe M, Versteegh MM, Stolk EA. Introducing the composite time trade-off: a test of feasibility and face validity. Eur J Health Econ. 2013; 14: 5-13
• Stolk E, Ludwig K, Rand K, van Hout B, Ramos-Goñi JM. Overview, update, and lessons learned from the international EQ-5D-5L valuation work: version 2 of the EQ-5D-5L valuation protocol. Value Health. 2019; 22: 23-30
• Ramos-Goñi JM, Oppe M, Cabasés JM, Serrano-Aguilar P, Rivero-Arias O. Valuation and modeling of EQ-5D-5L health states using a hybrid approach. Med Care. 2017; 55: e51-e58
• Golicki D, Jakubczyk M, Graczyk K. Valuation of EQ-5D-5L health states in Poland: the first EQ-VT-based study in Central and Eastern Europe. Pharmacoeconomics. 2019; 37: 1165-1176
• Jensen CE, Sørensen SS, Gudex C, Jensen MB, Pedersen KM, Ehlers LH. The Danish EQ-5D-5L value set: a hybrid model using cTTO and DCE data. Appl Health Econ Health Policy. 2021; 19: 579-591
• Keeney RL, Raiffa H. Decisions With Multiple Objectives: Preferences and Value Trade-Offs. Cambridge University Press, Cambridge, England; 1993
• Wisløff T, Hagen G, Hamidi V, Movik E, Klemp M, Olsen J. Estimating QALY gains in applied studies: a review of cost-utility analyses published in 2010. Pharmacoeconomics. 2014; 32: 367-375
• Schneider P, van Hout B, Brazier J. Fair interpersonal utility comparison in the valuation of health: a relative utilitarian preference aggregation method. GitHub. https://github.com/bitowaqr/eq5d_muap. Accessed October 8, 2022
• Jakubczyk M. What if 0 is not equal to 0? Inter-personal health utilities anchoring using the largest health gains. Accepted for publication in Eur J Health Econ. 2022
• Gandhi M, Rand K, Luo N. Valuation of health states considered to be worse than death: an analysis of composite time trade-off data from 5 EQ-5D-5L valuation studies. Value Health. 2019; 22: 370-376
• Roudijk B, Donders R, Stalmeier P. A threshold explanation for the lack of variation in negative composite time trade-off values. Qual Life Res. 2022; 31: 2753-2761
• Kahneman D, Tversky A. Prospect theory: an analysis of decision under risk. Econometrica. 1979; 47: 263-291
• Lipman S, Brouwer W, Attema A. QALYs without bias? Nonparametric correction of time trade-off and standard gamble weights based on prospect theory. Health Econ. 2019; 28: 843-854
• Jakubczyk M, Golicki D. Elicitation and modelling of imprecise utility of health states. Theor Decis. 2020; 88: 51-71
• Zhao J, Kling C. A new explanation for the WTP/WTA disparity. Econ Lett. 2001; 73: 293-300
• Ramos-Goñi JM, Oppe M, Slaap B, Busschbach JJ, Stolk E. Quality control process for EQ-5D-5L valuation studies. Value Health. 2017; 20: 466-473
• Jakubczyk M, Craig B, Barra M, et al. Choice defines value: a predictive modeling competition in health preference research. Value Health. 2018; 21: 229-238
• Flynn T. Using conjoint analysis to estimate health state values for cost-utility analysis: issues to consider. Pharmacoeconomics. 2010; 28: 711-722
• Sampson C, Parkin D, Devlin N. Drop dead: is anchoring at ‘dead’ a theoretical requirement in health state valuation? OHE
• Lipman S, Zhang L, Shah K, Attema A. Time and lexicographic preferences in the valuation of EQ-5D-Y with time trade-off methodology. Eur J Health Econ. 2022; https://doi.org/10.1007/s10198-022-01466-6
• Wu J, Xie S, He X, et al. Valuation of SF-6Dv2 health states in China using time trade-off and discrete-choice experiment with a duration dimension. Pharmacoeconomics. 2021; 39: 521-535
• Xie S, Wu J, He X, Chen G, Brazier J. Do discrete choice experiments approaches perform better than time trade-off in eliciting health state utilities? Evidence from SF-6Dv2 in China. Value Health. 2020; 23: 1391-1399
• Craig B, Rand K. Choice defines QALYs: a US valuation of the EQ-5D-5L. Med Care. 2018; 56: 529-536
• Jonker MF, Donkers B, de Bekker-Grob EW, Stolk EA. Advocating a paradigm shift in health-state valuations: the estimation of time-preference corrected QALY tariffs. Value Health. 2018; 21: 993-1001
• Dolan P. Modeling valuations for EuroQol health states. Med Care. 1997; 35: 1095-1108
• Olsen J, Lamu A, Cairns J. In search of a common currency: a comparison of seven EQ-5D-5L value sets. Health Econ. 2018; 27: 39-49