
Test-Retest Reliability of Discrete Choice Experiment for Valuations of QLU-C10D Health States

Open Archive. Published: March 02, 2018. DOI: https://doi.org/10.1016/j.jval.2017.11.012

      Abstract

      Background

Recently, a new cancer-specific multiattribute utility instrument, the QLU-C10D, was introduced; it is based on the widely used health-related quality of life instrument, the European Organisation for Research and Treatment of Cancer QLQ-C30. For the elicitation of utility weights, a discrete choice experiment (DCE) was designed. Our aim was to investigate the DCE in terms of individual choice consistency and utility estimate consistency by applying a test-retest design.

      Methods

We conducted the study in general population samples in Germany and France. The DCE was administered via a web-based self-complete survey using online panels. Respondents were presented with 16 choice sets comprising 11 attributes with 4 levels each. The retest was conducted 4 to 6 weeks after the first assessment. We used kappa and percentage agreement as measures of choice consistency, and both intraclass correlations and mean utility differences as measures of utility estimate consistency.

      Results

      A total of 300 German respondents (31% female, mean age 48 years [SD 14]) and 305 French respondents (46% female, mean age 47 years [SD 16]) completed test and retest assessments. Individual choice consistency was moderate to high (Germany: κ = 0.605, percentage agreement = 80.2%; France: κ = 0.411, percentage agreement = 70.6%). Utility estimate consistency was high when considering intraclass correlations (all >0.79). Mean utility differences were 0.08 in the German sample and 0.05 in the French sample.

      Conclusions

      Results indicate that the designed DCE elicits stable health state preferences rather than guesses or mood-specific or condition-specific judgments. Nevertheless, the identified mean utility differences between test and retest need to be taken into account when determining minimal important differences for the QLU-C10D in future research.

      Introduction

Quality-adjusted life years (QALYs) are regarded as one of the most important primary outcomes in health-economic evaluations [1]. They combine quality of life and survival time into one parameter that can then be employed as the outcome metric in cost-utility analysis (CUA). For the calculation of QALYs, condition-specific health states need to be assigned utility weights.
A common approach to achieve this is the use of multiattribute utility instruments (MAUIs) [2]. A MAUI consists of a set of questions covering various aspects of health and health-related quality of life (HRQOL), such as mobility, pain, and social life. Each question is answered on a rating scale with a finite number of mutually exclusive and exhaustive response categories. Through different combinations of levels within domains, a MAUI can depict a broad range of HRQOL profiles/health states. For example, the three-level version of the EuroQol five-dimensional questionnaire (EQ-5D), one of the most commonly used MAUIs, consists of five questions, each with a three-level response scale, and is thereby able to describe 3^5 (i.e., 243) health/HRQOL states. The described health states are assigned utility weights that are derived from health state preferences elicited from respondents of a valuation sample [3].
The most recent development in the field of utility elicitation is the adoption of the discrete choice experiment (DCE) for health valuations [4,5]. In a DCE in a health context, respondents are most commonly asked to choose between given (hypothetical) health profiles, each consisting of a set of so-called attributes, that is, different health/HRQOL aspects and a survival time. Thus, in contrast to the commonly used utility elicitation methods of standard gamble (SG) [6] and time trade-off (TTO) [7], in which respondents have to quantify the strength of their preference, in a DCE they simply state a preference between two options. Whereas the SG and TTO approaches allow direct calculation of utilities for the health states valued by the respondent, in the case of the DCE, utilities are determined indirectly by using a statistical model based on the answers of a sample of respondents.
An important aspect of the reliability of inferences derived from DCEs is choice consistency, which, according to Louviere et al. [8], is high when the variability in choice outcomes is mainly explained by the attributes and their associated preference weights. Choice consistency is likely to decrease with task complexity, which, among other factors, is influenced by the number of attributes and alternatives [9]. This can have a substantial impact on policy inferences: if a task is too complex, small but important effects can be obscured by noise.
Test-retest reliability is a common measure of an instrument's ability to provide consistent scores over time in a stable population, but it has rarely been applied in the context of utility elicitation. In the context of DCEs for health valuations, two different types of test-retest reliability are relevant: 1) the consistency of respondents' choices of health states from choice sets when exactly the same choice sets are presented at two time points (hereafter called "choice consistency"); and 2) the consistency of the final utilities obtained from the DCE (hereafter called "utility estimate consistency"). A similar approach to investigating DCE test-retest reliability has been used by Skjoldborg et al. [10] in a different setting.
The main aim of the present investigation was to determine the test-retest reliability of the DCE used for utility elicitation for the European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Utility–Core 10 Dimensions (QLU-C10D) [11,12], a recently developed cancer-specific MAUI based on the widely used HRQOL questionnaire EORTC Quality of Life Questionnaire–Core 30 (QLQ-C30).
Our specific aims were to
• 1. determine the consistency of respondents' choice of health state within each choice set of the DCE survey at two different time points, test and retest (choice consistency), and
• 2. determine the consistency of the estimated utilities derived from the DCE survey at the test and retest time points (utility estimate consistency).

      Methods

       Context and Design of QLU-C10D Health State Valuations

The data used for these analyses were collected as part of the work of the EORTC Quality of Life Group utility project in collaboration with the international Multi-Attribute Utility Cancer (MAUCa) Consortium, which developed the QLU-C10D [11]. The valuations within the EORTC Quality of Life Group utility project were performed using a design and methodology first employed in the Australian valuations [12]. The following provides a short outline of these valuation methods; for details on the QLU-C10D and the DCE design, please refer to King et al. [11] and Norman et al. [12]. The QLU-C10D comprises 10 HRQOL dimensions (physical, role, social, and emotional functioning; pain; fatigue; sleep disturbances; appetite loss; nausea; and bowel problems), with the severity of impairment expressed using four levels (ranging from "not at all" to "very much"), analogous to the four-point Likert scale of the parent instrument, the QLQ-C30. The QLU-C10D is therefore able to describe 4^10 (i.e., 1,048,576) health states.
      Each of the hypothetical health profiles for the DCE comprises a QLU-C10D health state and a survival time in that health state (1, 2, 5, or 10 years). Therefore, each DCE health profile comprises 11 attributes (10 HRQOL dimensions + survival time).
The experimental design underpinning the QLU-C10D DCE was constructed to achieve maximal statistical efficiency using a balanced incomplete block design [13]. The final design comprised 960 choice sets, each consisting of 2 health profiles (A and B) to be compared. Each respondent was randomly assigned 16 of the 960 choice sets. Within choice sets, we randomized the order of attributes to mitigate potential order effects, and we randomized which option was seen as option A and which as option B to mitigate any position bias. To keep the cognitive burden for the respondent at a manageable level, only five attributes differed between options A and B within each choice set (i.e., we imposed overlap between profiles for the remaining attributes).
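To make the assignment procedure concrete, the following minimal sketch mimics the randomization described above. It is illustrative only: the real 960 choice sets came from the balanced incomplete block design rather than random generation, and all names and structures here are assumptions, not the study's code.

```python
import random

DIMENSIONS = ["physical", "role", "social", "emotional", "pain",
              "fatigue", "sleep", "appetite", "nausea", "bowel"]
SURVIVAL_YEARS = [1, 2, 5, 10]

def make_choice_set(rng):
    """Build one A/B pair in which exactly 5 of the 10 HRQOL dimensions
    differ (the imposed overlap described in the text)."""
    option_a = {dim: rng.randint(1, 4) for dim in DIMENSIONS}  # levels 1-4
    option_b = dict(option_a)
    for dim in rng.sample(DIMENSIONS, 5):
        other_levels = [lvl for lvl in range(1, 5) if lvl != option_a[dim]]
        option_b[dim] = rng.choice(other_levels)
    option_a["survival"] = rng.choice(SURVIVAL_YEARS)
    option_b["survival"] = rng.choice(SURVIVAL_YEARS)
    return option_a, option_b

def assign_to_respondent(design, rng):
    """Draw 16 of the 960 choice sets, shuffle the attribute display order
    (to mitigate order effects), and randomly swap which profile is shown
    as option A (to mitigate position bias)."""
    chosen = rng.sample(design, 16)
    display_order = DIMENSIONS[:]
    rng.shuffle(display_order)
    presented = []
    for a, b in chosen:
        if rng.random() < 0.5:
            a, b = b, a
        presented.append((a, b))
    return display_order, presented

rng = random.Random(42)
design = [make_choice_set(rng) for _ in range(960)]
display_order, choice_sets = assign_to_respondent(design, rng)
```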

       Data Collection and Procedure for DCE Consistency Evaluations (Test-Retest Design)

Test-retest assessments were conducted in a German and a French sample because these are the first two European countries for which QLU-C10D utilities will be developed within the project. The DCE was administered via a web-based self-complete survey, which was managed by an international company that specializes in online DCE surveys. The company uses online panels to recruit respondents, who are invited by email. Participation was voluntary and incentivized. Quotas for sex, age, and education were set according to the distributions in the respective general populations. Respondents' eligibility in terms of quota completion was checked before the survey started; if quotas were already fulfilled, the survey was discontinued for the respective respondents. For the QLU-C10D valuations in European countries, we applied the same methodology and contracted the same survey company as in the Australian valuation study [12]. The main elements of the survey were the EORTC QLQ-C30; an example choice set; the valuation task comprising 16 choice sets (randomized as described earlier); debriefing questions on the tasks' clarity, their difficulty compared with other surveys, the difficulty of choosing between the health states, and a self-reported strategy regarding how decisions were reached; demographics; and additional self-reported health measures (the general health question of the Short Form 36 Health Survey [SF-36] and the EQ-5D). The survey has been described in more detail by Norman et al. [12]. The study conforms to the ISPOR code of ethics. The whole survey was translated by native speakers of the target languages who were fluent in English. The procedure included forward and backward translations as well as feedback from in-country persons.
The two aspects of consistency, choice consistency and utility estimate consistency, were evaluated using a test-retest design. In general, the assumptions made when assessing test-retest reliability are that the characteristic being measured does not change over a predefined time period and that the interval between the two assessments is long enough to prevent learning, carryover, or recall effects [14]. Transferred to the DCE for QLU-C10D valuation, this means we assumed that the way respondents value the attributes of the QLU-C10D is likewise stable; that is, the DCE survey should result in similar valuations when conducted at two different time points.
Hence, respondents in both France and Germany were approached to complete exactly the same survey twice. The previously described randomization of attribute order and option position was the same as in the first assessment; that is, randomization was done once in the first survey, and each respondent was presented with his or her ordering again in the second survey. Respondents were invited to the two surveys separately; that is, they were not informed about the retest when completing the first survey. On invitation to the second survey, respondents were informed that the survey might be very similar to the first one for methodological reasons, so as not to annoy them with seemingly repetitive questions. The appropriate test-retest interval depends on the construct being measured and how stable or dynamic it is. The interval may range from days to months, but for most applications about 2 weeks is suggested [15]. We decided to recontact respondents 4 to 6 weeks after the first assessment because this period was considered short enough not to expect significant change in preferences, but long enough that the typical respondent would not recall previous answers.

       Statistical Methods

Descriptive statistics are presented as percentages, means, medians, and standard deviations. First, assuming that the values underlying respondents' health preferences are basically stable but may vary with a change of health state, we investigated whether their overall health status and quality of life, as measured by the general health item of the SF-36 and the QLQ-C30 global quality of life scale, had changed between the two assessments. Had such a change of health state been observed in a significant proportion of respondents, sensitivity analyses would have been required. We also investigated the proportion of changes of ≥2 response categories on these questionnaires/scales; this cutoff was chosen because a shift of that size can be considered to reflect a real change of health status rather than a differing response resulting from chance.
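This stability check reduces to a one-line computation; the following is a minimal sketch, with illustrative variable names and toy data that are not taken from the study.

```python
import numpy as np

# Proportion of respondents whose score on a categorical item (e.g., the
# SF-36 general health question) shifted by >= 2 response categories
# between test and retest.
def prop_changed_2plus(score_test, score_retest):
    diff = np.abs(np.asarray(score_test) - np.asarray(score_retest))
    return float(np.mean(diff >= 2))

# Example: 5-point item scores for 6 respondents at test and retest.
print(prop_changed_2plus([3, 2, 4, 1, 5, 3], [3, 4, 4, 1, 3, 2]))  # 0.333...
```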
Aim 1: Determine the consistency of respondents' choice of health state within each choice set of the DCE survey at two different time points, test and retest (choice consistency).
Test-retest reliability was assessed by considering the degree of agreement between respondents' choices for each choice set at time 1 (test) and time 2 (retest) in the following ways. Exact agreement for a particular choice set and respondent was defined as choosing the same health profile in both the test and retest surveys (either option A in both or option B in both). The overall agreement (over all respondents and choice sets), the proportion of respondents with exact agreement in all 16 choice sets, and the proportion with poor agreement (<8 of 16 [<50%] choice sets) were then determined. We also calculated the kappa (κ) statistic, which provides an estimate of agreement corrected for chance [16]. Agreement was judged according to the classification of Landis and Koch [17]; that is, agreement is considered very good when κ is >0.8, good when between 0.61 and 0.8, and moderate when between 0.41 and 0.60. Overall agreement was confirmed at ≥70% agreement between item scores at test and retest, following Kazdin [18]. Kappa 95% confidence intervals [19] and confidence intervals for overall agreement were corrected for repeated observations within subjects by means of generalized estimating equation models, assuming a first-order autoregressive correlation structure, AR(1).
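For readers who want to reproduce these agreement measures, the sketch below computes percentage agreement and Cohen's κ for paired binary choices (0 = option A, 1 = option B). It omits the GEE-based correction of confidence intervals for repeated observations; all names and the toy data are assumptions.

```python
import numpy as np

def percent_agreement(test, retest):
    """Raw proportion of choice sets with the same choice at both times."""
    test, retest = np.asarray(test), np.asarray(retest)
    return float(np.mean(test == retest))

def cohens_kappa(test, retest):
    """Cohen's kappa for two binary ratings (choice at test vs. retest)."""
    test, retest = np.asarray(test), np.asarray(retest)
    p_o = np.mean(test == retest)              # observed agreement
    p_b1 = np.mean(test)                       # P(choose B) at test
    p_b2 = np.mean(retest)                     # P(choose B) at retest
    p_e = p_b1 * p_b2 + (1 - p_b1) * (1 - p_b2)  # chance agreement
    return float((p_o - p_e) / (1 - p_e))

# Toy data: 4800 paired choices with roughly 80% raw agreement.
rng = np.random.default_rng(0)
test = rng.integers(0, 2, size=4800)
retest = np.where(rng.random(4800) < 0.8, test, 1 - test)
print(percent_agreement(test, retest), cohens_kappa(test, retest))
```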
      Aim 2: Determine the consistency of the estimated utilities derived from the DCE survey at the “test” and the “retest” time points (utility estimate consistency).
The DCE data at time 1 and time 2 were analyzed independently to determine utility weights at each time, following the method of Bansback et al. [5]. This approach has also previously been used by Norman et al. [12] to analyze DCE data based on the QLU-C10D. The model for the utility of option j in choice set s for respondent i is given by
U_{isj} = \alpha \, \mathrm{TIME}_{isj} + \beta' X_{isj} \, \mathrm{TIME}_{isj} + \varepsilon_{isj},
where X_{isj} is a set of dummy variables for the levels of the health state (as defined within the QLU-C10D) and TIME_{isj} is the survival time presented in option j. The errors ε_{isj} were assumed to be independent and identically distributed random variables. The parameters α (scalar) and β (vector) were estimated by conditional logistic regression. To account for repeated observations within subjects, generalized estimating equation models with an AR(1) correlation structure were used. The regression weights obtained were converted into utility decrements as described by Bansback et al. [5]. Briefly, utility weights for use in CUA require estimating the willingness of the typical respondent to give up years of life to alleviate the health problem; in our empirical specification, the value of each β term relative to α gives that relative value. Decrements were determined both for the test data and for the retest data. To check for systematic differences between test and retest utility weights, the test and retest data were pooled and a model with additional interaction terms was fitted. The interactions considered were assessment (test, retest) × TIME, with 1 degree of freedom (d.f.), and assessment × HRQOL dimension × TIME, with 30 d.f. (10 dimensions × 3 levels). The latter interaction is included because the standard approach to analyzing these data uses interactions between TIME and the levels of the instrument's dimensions to allow the model to fit within the QALY framework [5]. To explore the effect of test/retest on these coefficients, one needs to interact the assessment with these two-factor interactions. The significance of the additional terms was tested by means of the likelihood ratio test.
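A minimal sketch of this estimation step follows, under the assumption that with only two alternatives per choice set the conditional logit can be fitted as a binary logit on within-pair attribute differences (a standard equivalence). The column layout, names, and toy data are assumptions, and the GEE correction for repeated observations is not reproduced.

```python
import numpy as np
import statsmodels.api as sm

def fit_conditional_logit(X_a, X_b, chose_a):
    """X_a, X_b: (n_sets, k) design rows for options A and B; chose_a: 1 if
    option A was chosen. With two alternatives, conditional logit is a
    binary logit on the difference, with no intercept (utilities are
    relative, so only differences are identified)."""
    diff = X_a - X_b
    return sm.Logit(chose_a, diff).fit(disp=0)

def utility_decrements(params):
    """Scale each dummy-by-TIME coefficient by the TIME coefficient alpha
    (assumed to be column 0), as in the Bansback et al. conversion."""
    alpha, betas = params[0], params[1:]
    return betas / alpha

# Toy illustration with k = 3 columns (TIME plus two dummy-by-TIME terms).
rng = np.random.default_rng(1)
n = 4800
X_a = rng.normal(size=(n, 3))
X_b = rng.normal(size=(n, 3))
true = np.array([0.5, -0.2, -0.4])
p = 1 / (1 + np.exp(-(X_a - X_b) @ true))
chose_a = (rng.random(n) < p).astype(int)
fit = fit_conditional_logit(X_a, X_b, chose_a)
print(utility_decrements(fit.params))   # approx. [-0.4, -0.8]
```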
Based on the utility decrements obtained, utilities were calculated for a random subset of 1000 of all 4^10 possible health states, for both test and retest (the same subset of health states in each case). In addition, utilities for the 960 health states used in the balanced incomplete block design were determined in the same way. To assess the level of agreement between test and retest utility estimates for these health states, Pearson correlation coefficients were estimated to assess the degree of correlation between test and retest utility values; mean differences were used to assess systematic differences; and intraclass correlations (ICCs) were used to assess the degree of variation between test-retest utilities relative to the total variation in utilities (between plus within health states). ICCs were interpreted as fair between 0.40 and 0.59, good between 0.60 and 0.74, and excellent between 0.75 and 1.00, according to Cicchetti [20].
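The following sketch illustrates how these agreement measures can be computed for paired utility vectors. The ICC is implemented as the two-way random-effects, absolute-agreement, single-measures ICC(2,1); whether the authors used this exact variant is an assumption, as are all names and the toy data.

```python
import numpy as np
from scipy.stats import pearsonr

def icc_a1(y1, y2):
    """ICC(2,1): two-way random effects, absolute agreement, single
    measures, for n health states rated twice (test, retest)."""
    Y = np.column_stack([y1, y2])
    n, k = Y.shape
    grand = Y.mean()
    ss_rows = k * np.sum((Y.mean(axis=1) - grand) ** 2)   # between health states
    ss_cols = n * np.sum((Y.mean(axis=0) - grand) ** 2)   # test vs. retest
    ss_err = np.sum((Y - grand) ** 2) - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Toy data: utilities for 1000 health states at test and retest, with a
# small systematic shift plus noise.
rng = np.random.default_rng(2)
u_test = rng.uniform(-0.2, 1.0, 1000)
u_retest = u_test + rng.normal(0.05, 0.08, 1000)
r, _ = pearsonr(u_test, u_retest)
print(r, icc_a1(u_test, u_retest), np.mean(u_retest - u_test))
```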

       Sample and Sample Size Considerations

Test-retesting the survey was planned for samples of 300 respondents each from the German and the French general populations. Sample size estimation was based on the length of the 95% confidence interval (CI) for κ. The planned sample size of 300 respondents for retest gives rise to a total of 4800 observations, as each respondent completed 16 choice sets. Using the formula for the standard error of κ by Fleiss et al. [19] and assuming that overall (summing over all choice sets) the probabilities of deciding for options A and B would be similar (i.e., both close to 0.5), a sample size of 4800 pairs of observations (test, retest) gives rise to 95% CIs of the form [κ − d, κ + d] with values of d not exceeding 0.03 over the entire range of possible κ values. This value may require correction for the fact that observations within respondents are correlated. For the German general population data, the within-subject correlations were low, giving rise to a correction factor of <1.5 for the standard errors of parameters obtained in logistic regression analyses. Even when allowing for a correction factor of 2, d would amount to only 0.06, still giving rise to sufficiently small CIs.
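This reasoning can be checked with a short calculation: with both marginal choice probabilities near 0.5, chance agreement is 0.5, so κ = 2·p_o − 1 and se(κ) = 2·se(p_o). The sketch below uses this binomial simplification as a back-of-the-envelope stand-in for the Fleiss et al. standard error [19]; it is an approximation, not the formula the authors applied.

```python
import numpy as np

def kappa_ci_halfwidth(p_obs, n_pairs, correction=1.0):
    """Half-width d of a 95% CI for kappa under the simplification above:
    se(kappa) = 2 * sqrt(p_obs * (1 - p_obs) / n_pairs)."""
    se_po = np.sqrt(p_obs * (1 - p_obs) / n_pairs)
    return 1.96 * 2 * se_po * correction

n = 300 * 16  # 4800 paired observations
# Worst case p_obs = 0.5 gives the widest interval:
print(kappa_ci_halfwidth(0.5, n))                  # ~0.028, i.e., d <= 0.03
print(kappa_ci_halfwidth(0.5, n, correction=2.0))  # ~0.057, i.e., d ~ 0.06
```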

      Results

       Sample Characteristics

      A total of 300 German (31% female, mean age 48 years [SD 14]) and 305 French respondents (46% female, mean age 47 years [SD 16]) completed the 16 choice sets at test and retest. Details are presented in Table 1. To identify potential selection bias, we compared the sociodemographic characteristics of the retest samples with the respective valuation sample. We also compared the respondents’ appraisal of the difficulty of the survey and the time required for completion. There were no differences except that among the German respondents who were willing to do the retest, fewer people perceived the survey as more difficult than other surveys. In general, DCE tasks were perceived as difficult but clear by German and French respondents. The majority (>90% in both samples) could name a strategy regarding how they reached their decisions (responses to debriefing questions are provided in Table 1).
Table 1. Sample characteristics

Characteristic | German sample (N = 301) | French sample (N = 305)
Age, years, mean (SD) | 48 (14) | 47 (16)
Sex: female, n (%) | 92 (31) | 141 (46)
Marital status, n (%)
  Unmarried | 72 (24) | 102 (33)
  Married | 136 (45) | 154 (51)
  Separated | 11 (4) | —
  Divorced | 26 (9) | 36 (12)
  Widowed | 11 (4) | 13 (4)
  Fixed partnership | 45 (15) | —
Education, n (%)
  Secondary lower education or less | 19 (6) | 21 (7)
  Apprenticeship | 50 (17) | 19 (6)
  Secondary school | 71 (24) | 70 (23)
  A-level* | 75 (25) | 126 (41)
  University level† | 84 (28) | 69 (23)
Children: yes, n (%) | 92 (31) | 105 (34)
Chronic disease in general: yes, n (%) | 123 (41) | 135 (44)
Cancer diagnosis: yes, n (%) | 25 (8) | 22 (7)
Cancer diagnosis in family: yes, n (%) | 170 (56) | 171 (56)
Self-reported health status, mean (SD; range)
  General health question at test | 2.95 (0.87; 4) | 2.91 (1.01; 4)
  General health question at retest | 2.96 (0.91; 4) | 2.94 (0.98; 4)
  QLQ-C30 global QOL at test | 68.3 (18.3; 85.7) | 73.6 (18.2; 85.7)
  QLQ-C30 global QOL at retest | 68.5 (18.2; 85.7) | 73.4 (17.5; 71.4)
Debriefing questions, n (%)
 Task difficulty compared with other surveys
  Easier | 31 (10) | 28 (9)
  Similar | 161 (54) | 186 (61)
  More difficult | 100 (33) | 91 (30)
  Don't know | 7 (3) | —
 Clarity of description of health states
  Very clear to clear | 227 (76) | 184 (60)
  Neutral | 47 (16) | 86 (28)
  Unclear to very unclear | 26 (9) | 35 (12)
 Difficulty choosing between health states
  Very easy to easy | 86 (29) | 70 (23)
  Neutral | 99 (33) | 111 (36)
  Difficult to very difficult | 115 (38) | 124 (41)
 Self-reported decision strategy
  No strategy | 23 (8) | 31 (10)
  Concentration on single attributes | 93 (31) | 50 (17)
  Concentration on attributes that differed between health states | 67 (22) | 119 (39)
  Consideration of most/all attributes | 104 (35) | 92 (30)
  Different strategy | 13 (4) | 13 (4)

* Including bachelor.
† Including university of applied sciences.
Analyses of the stability of the respondents' health state (as an indicator of potential bias through an associated change of underlying preferences) demonstrated that the samples had stable health between test and retest (Germany: SF-36 general health question mean scores 2.95 vs. 2.96; QLQ-C30 global quality of life scale scores 67.1 vs. 67.6; France: SF-36 general health question mean scores 2.91 vs. 2.94; QLQ-C30 global quality of life scale scores 73.7 vs. 73.4). The proportion of changes of ≥2 response categories, which would indicate a relevant change of subjective health in a person, was less than 10% for both the general health question and the QLQ-C30 global quality of life scores. The time required for completion was fairly similar in both samples, with mean durations of 13.15 minutes (SD 7.3) at test and 14.32 minutes (SD 7.24) at retest in the German sample, and 14.46 minutes (SD 9.4) at test and 14.63 minutes (SD 7.8) at retest in the French sample.

       Test-Retest Reliability

      Aim 1: Findings on choice consistency (test-retest reliability of individual choices).
The frequency distributions of choices in the test and retest data at the level of individual choice sets are provided in Table 2 for the German and French samples separately. Of those in the German sample who selected option A at test, 80.2% selected A again at retest; of those who initially selected B, 80.3% selected B again, giving an overall agreement of 80.2% (Table 3). The corresponding figures for the French sample were 69.2% and 71.8%, for an overall agreement of 70.6%.
Table 2. Cross-tabulation of choices at the first (test) and second (retest) DCE surveys in Germany and France*

Germany (N = 300 respondents, n = 4800 choice sets)
Preferred at test | Retest: A | Retest: B | Total
A | 1853 (38.6%) | 457 (9.5%) | 2310 (48.1%)
B | 491 (10.2%) | 1999 (41.6%) | 2490 (51.9%)
Total | 2344 (48.8%) | 2456 (51.2%) | 4800 (100%)

France (N = 305 respondents, n = 4880 choice sets)
Preferred at test | Retest: A | Retest: B | Total
A | 1667 (34.2%) | 741 (15.2%) | 2408 (49.3%)
B | 696 (14.3%) | 1776 (36.4%) | 2472 (50.7%)
Total | 2363 (48.4%) | 2517 (51.6%) | 4880 (100%)

DCE, discrete choice experiment.
* Each respondent completed 16 choice sets, resulting in a total of 4800 choices (16 × 300) for the German sample at each time point and 4880 (16 × 305) for the French.
Table 3. Measures of choice consistency in the DCE surveys in Germany and France

Country | Measure | Estimate | 95% CI
Germany | Kappa* | 0.605 | 0.578–0.631
Germany | Overall agreement* | 80.2% | 78.9–81.6%
Germany | Subjects with 100% agreement between test and retest (all 16 choice sets) | 25.3% | 20.7–30.6%
Germany | Subjects with <50% agreement (<8 of 16) | 5.7% | 3.5–8.9%
France | Kappa* | 0.411 | 0.383–0.442
France | Overall agreement* | 70.6% | 69.1–72.1%
France | Subjects with 100% agreement between test and retest (all 16 choice sets) | 9.5% | 6.7–13.4%
France | Subjects with <50% agreement (<8 of 16) | 13.4% | 10.0–17.8%

CI, confidence interval; DCE, discrete choice experiment.
* Adjusted for repeated measures.
Kappa coefficients were κ = 0.605 for the German and κ = 0.411 for the French sample (Table 3). According to the classification by Landis and Koch [17], the consistency of test and retest was at the lower end of "good" agreement for the German sample and "fair to moderate" for the French sample. The proportion of respondents with exact agreement in all 16 choice sets was 25% in the German sample but only 10% in the French sample. The majority of German respondents had exact agreement in at least half of the 16 choice sets, with only 6% agreeing exactly in fewer than 8 of 16 choice sets (no impact of age and sex). Respondents from France were less consistent: 13% had exact agreement in fewer than 8 of 16 choice sets (again with no impact of age and sex).
      Aim 2: Utility estimate consistency (test-retest reliability of derived utilities).
Utility decrements obtained for test and retest by conditional logistic regression were broadly similar in size. As an example, the utility decrements for each level of each domain of the QLU-C10D in the French sample are displayed in Table 4 (German utility decrements are shown in Appendix Table 1 in Supplemental Materials found at https://doi.org/10.1016/j.jval.2017.11.012). Testing for systematic differences between the two assessments yielded no significant differences, neither for the German (assessment × TIME: χ² = 0.69, d.f. = 1, P = 0.406; assessment × HRQOL dimensions × TIME: χ² = 28.79, d.f. = 30, P = 0.529) nor for the French (assessment × TIME: χ² = 0.61, d.f. = 1, P = 0.435; assessment × HRQOL dimensions × TIME: χ² = 23.71, d.f. = 30, P = 0.785) sample. Regarding utility decrements for individual dimensions and levels, there were no significant test-retest differences for any of the parameters in the French survey and for only 2 of 30 in the German survey (physical functioning, level 2, χ² = 4.63, P = 0.031; fatigue, level 2, χ² = 4.65, P = 0.031). After Bonferroni correction for multiple testing, none of these remained significant.
Table 4. Utility decrements for French test and retest data

Parameter | Severity level | Test: decrement (s.e.) | Retest: decrement (s.e.)
Physical functioning | 2 | −0.037 (0.048) | −0.117* (0.044)
Physical functioning | 3 | −0.167† (0.049) | −0.146† (0.047)
Physical functioning | 4 | −0.336‡ (0.052) | −0.309‡ (0.042)
Role functioning | 2 | −0.011 (0.044) | −0.041 (0.039)
Role functioning | 3 | −0.123† (0.042) | −0.055 (0.038)
Role functioning | 4 | −0.179‡ (0.039) | −0.102† (0.035)
Social functioning | 2 | −0.052 (0.040) | −0.008 (0.035)
Social functioning | 3 | 0.002 (0.046) | −0.017 (0.039)
Social functioning | 4 | −0.065 (0.039) | −0.060 (0.034)
Emotional functioning | 2 | −0.120† (0.040) | −0.055 (0.037)
Emotional functioning | 3 | −0.101* (0.041) | −0.099* (0.038)
Emotional functioning | 4 | −0.189‡ (0.043) | −0.198‡ (0.037)
Pain | 2 | −0.076 (0.042) | −0.054 (0.040)
Pain | 3 | −0.074 (0.048) | −0.114* (0.042)
Pain | 4 | −0.234‡ (0.042) | −0.241‡ (0.038)
Fatigue | 2 | 0.064 (0.048) | 0.006 (0.038)
Fatigue | 3 | 0.005 (0.045) | −0.027 (0.039)
Fatigue | 4 | −0.046 (0.037) | −0.063 (0.033)
Sleep | 2 | −0.041 (0.039) | −0.005 (0.035)
Sleep | 3 | 0.013 (0.048) | 0.031 (0.041)
Sleep | 4 | −0.068 (0.037) | −0.072* (0.034)
Appetite | 2 | −0.038 (0.041) | −0.053 (0.035)
Appetite | 3 | 0.008 (0.048) | −0.075 (0.037)
Appetite | 4 | −0.059 (0.038) | −0.041 (0.034)
Nausea | 2 | −0.019 (0.039) | −0.107† (0.033)
Nausea | 3 | −0.111† (0.038) | −0.108† (0.037)
Nausea | 4 | −0.158‡ (0.037) | −0.180‡ (0.034)
Bowel problems | 2 | −0.040 (0.038) | −0.040 (0.033)
Bowel problems | 3 | −0.039 (0.043) | −0.097* (0.036)
Bowel problems | 4 | −0.132† (0.038) | −0.148‡ (0.035)

s.e., standard error.
* P < 0.05; † P < 0.01; ‡ P < 0.001.
Various measures of the consistency of the health state utilities derived from test and retest are displayed in Table 5 and presented graphically in Figures 1 and 2. Pearson correlations between utilities obtained at test and at retest were high, with values above 0.88 for the two samples of health states in both Germany and France. The corresponding ICCs were above 0.85 for the French sample, indicating excellent agreement according to the classification used; ICCs for the German sample were somewhat lower but only slightly below 0.8 (0.790 and 0.796). Mean differences between health state utilities derived from test and retest were fairly low for the French sample (below 0.05 in absolute value for both samples of health states), whereas mean differences in the German sample of respondents were higher, approaching 0.08.
Table 5. Measures of consistency of utility estimates

Country | Measure | Random sample of 1000 health states | 960 health states used in DCE design
Germany | Pearson correlation, test vs. retest | 0.896 | 0.888
Germany | ICC, test vs. retest | 0.790 | 0.796
Germany | Mean difference, retest − test (SD) | 0.076 (0.079) | 0.078 (0.082)
Germany | Mean absolute difference (SD) | 0.094 (0.066) | 0.092 (0.065)
France | Pearson correlation, test vs. retest | 0.884 | 0.902
France | ICC, test vs. retest | 0.857 | 0.879
France | Mean difference, retest − test (SD) | −0.047 (0.101) | −0.045 (0.095)
France | Mean absolute difference (SD) | 0.089 (0.067) | 0.084 (0.063)

DCE, discrete choice experiment; ICC, intraclass correlation; SD, standard deviation.
Fig. 1. QLU-C10D utilities for a random sample of health states: test and retest (Germany).
Fig. 2. QLU-C10D utilities for a random sample of health states: test and retest (France).

      Discussion

In our investigation of the test-retest reliability of the DCE designed for the elicitation of utilities for the QLU-C10D, we found that the consistency of individual choices within the DCE was moderate based on κ values and fairly high considering the overall agreement rates of 70.6% to 80.2%. There is no consensus in the literature regarding which of these parameters is more informative [21,22,23]. Kappa allows for the possibility of guessing, but its assumptions about the independence of assessments lack scientific support, which may lead to an underestimation of real agreement [21]. Percentage agreement, in turn, tends to overestimate agreement because chance agreement is included. Because the potentially underestimating κ values were moderate and the potentially overestimating percentage agreement was high, true consistency can be assumed to lie somewhere in between and hence to be sufficient at the level of single choices.
Given that the main aim of conducting a DCE was the estimation of utility decrements in aggregate for the population, it was of greater interest to us whether inconsistency at the individual choice level would affect the consistency of the utility estimates. In both samples, we found the utility estimates to be fairly similar between the two valuation time points across the different parameters considered. Pearson correlations of >0.88, and even more importantly ICCs ≥0.79, indicate high to excellent utility estimate consistency despite the moderate κ values for choice consistency. Although there is no universal guideline for the interpretation of ICCs, other researchers have postulated similar interpretations [24,25].
Furthermore, we found that despite poorer choice consistency in the French survey, consistency at the level of utility estimates was similar to that of the German survey. Mean differences of utilities between test and retest were lower in the French survey than in the German survey and went in different directions in the two countries (larger mean values for test than for retest in the French sample, and larger values for retest than for test in the German sample). The seemingly divergent findings for choice consistency (higher κ values for the German survey than for the French survey) and utility estimate consistency (at least partly better values for the French survey) should be seen against the background that the former (κ) is based on individual observations, whereas the latter uses aggregated data, namely utility estimates based on the complete retest samples (n = 300 and 305). Shortcomings in consistency at the individual choice set level may average out at the aggregated level, which was evidently more the case for the French sample than for the German one.
It must be acknowledged that the mean differences were relatively large, given that the utility scale has a maximum of 1 and is anchored at 0, and that most QLU-C10D health states have utilities within this range. Nevertheless, given that the standard deviations of the mean utility differences were larger than the mean differences themselves, and because there was no significant interaction between assessment time and domain in either country, we conclude that the differences between test and retest surveys were not systematic. These test-retest mean differences provide useful benchmarks for the development of minimally important differences for the QLU-C10D, which logically must be larger than the differences observed in a test-retest context. The development of minimally important differences, defined as the smallest score change that patients perceive as important [26], may be the subject of future research once the QLU-C10D is ready for use in the target population, to support the interpretation of obtained scores.
As mentioned previously, choice consistency in a DCE is very likely to be negatively influenced by high task complexity [9,27,28]. At the same time, researchers have also suggested that having to answer several choice sets may allow respondents to learn about the survey, the associated context, and their own preferences, such that they become capable of making more precise and consistent decisions [29]. The number of choice sets in our DCE, at 16, was toward the upper limit of what has been shown to be manageable so far [27]. Likewise, the number of attributes, at 11, was rather high. It may be that the cognitive burden of the QLU-C10D DCE was a factor in the degree of inconsistency in the test-retest results. Although the majority of respondents in both samples rated the tasks as difficult but comparable to other surveys they knew, and considered them clear, this subjective appraisal does not necessarily reflect actual understanding of the tasks. It is a limitation of our study that we did not check for this by, for example, including a choice set with a dominant option; this was not done within this study because choice sets were assigned randomly.
Naturally, the time interval chosen between test and retest may also affect the reproducibility of results. It is a further limitation of our study that we cannot entirely rule out that a certain memorization effect occurred in some respondents [30]. The chosen test-retest interval of 4 to 6 weeks should, however, prevent such effects from being very strong. This assumption is also supported by the durations required for completion, which were very similar at test and retest in both samples; respondents were evidently not quicker in the second survey, which would be expected in the presence of memorization effects.
Given the scarcity of comparable studies, it is difficult to place our results in the existing literature. Louviere et al. [8] highlighted the lack of consistency in the results of evaluations in market research applying DCEs, and the overall lack of reliability and validity of such evaluations was criticized by Rakotonarivo et al. [31] in the context of DCEs in environmental studies. To the best of our knowledge, in the context of health, the test-retest reliability of DCEs has not yet been investigated in a comparable manner. Nevertheless, the more traditional methods for utility elicitation have also rarely been tested for test-retest reliability. Feeny et al. [32] investigated the test-retest reliability of SG-elicited utilities in patients with osteoarthritis and found ICCs varying between 0.49 and 0.62 at the individual patient level. The test-retest ICCs found by Badia et al. [33] for TTO and visual analogue scale (VAS) valuations of EQ-5D health states were clearly higher and comparable to our results (0.90 for the VAS and 0.84 for the TTO). Whereas Feeny et al. [32] chose a rather long interval of 6 months, Badia et al. [33] performed the retest 1 to 4 weeks after the first assessment, closer to the 4 to 6 weeks in our study.
Utility theory in health care is based on the assumptions that the preferences underlying the calculation of QALYs are temporally stable [34,35] and that the choices made within utility elicitation tasks are consistent [36]. As Scuffham et al. [37] state, the application of QALYs in CUAs makes consideration of how they are derived and how they can be compared especially important. In the context of health utility estimation, DCEs are a relatively new approach. In the present investigation, we have contributed to the evaluation of the validity of utilities derived from the DCE designed for QLU-C10D valuations [12,38] by investigating their test-retest reliability. In summary, our results characterize the stability over time of individual choices and the reliability of the estimated utilities arising from the DCE survey that is being used internationally to provide value sets for the QLU-C10D. We conclude that the individual choices are sufficiently stable over time to support the validity of this valuation method. We have provided important evidence about the reliability of the elicited utilities and a threshold above which the minimally important difference in QLU-C10D scores must lie. A detailed description and interpretation of the utility weights obtained will be given in forthcoming articles based on the full samples of about 1000 respondents from each country.

      Acknowledgments

      The project was funded by a grant from the European Organisation for Research and Treatment of Cancer (EORTC; Grant No. 002/2014). Professor King is supported by the Australian Government through Cancer Australia. The work of Rosalie Viney was supported by a grant from the NHMRC (Grant No. 1065395).

      Supplemental Materials

      References

1. Drummond M, Brixner D, Gold M, et al.; Consensus Development Group. Toward a consensus on the QALY. Value Health. 2009;12:S31-S35.
2. Hawthorne G, Richardson J, Day NA. A comparison of the Assessment of Quality of Life (AQoL) with four other generic utility instruments. Ann Med. 2001;33:358-370.
3. Brazier J, Ratcliffe J, Saloman J, Tsuchiya A. Measuring and Valuing Health Benefits for Economic Evaluation. Oxford: Oxford University Press; 2007.
4. Ryan M, Netten A, Skatun D, Smith P. Using discrete choice experiments to estimate a preference-based measure of outcome—an application to social care for older people. J Health Econ. 2006;25:927-944.
5. Bansback N, Brazier J, Tsuchiya A, Anis A. Using a discrete choice experiment to estimate health state utility values. J Health Econ. 2012;31:306-318.
6. Von Neumann J, Morgenstern O. Theory of Games and Economic Behavior. 3rd ed. New York, NY: Wiley; 1953.
7. Torrance GW, Thomas WH, Sackett DL. A utility maximization model for evaluation of health care programs. Health Serv Res. 1972;7:118-133.
8. Louviere JL, Islam T, Wasi N, et al. Designing discrete choice experiments: do optimal designs come at a price? J Consum Res. 2008;35:360-375.
9. Swait J, Adamowicz W. The influence of task complexity on consumer choice: a latent class model of decision strategy switching. J Consum Res. 2001;28:135-148.
10. Skjoldborg US, Lauridsen J, Junker P. Reliability of the discrete choice experiment at the input and output level in patients with rheumatoid arthritis. Value Health. 2009;12:153-158.
11. King MT, Costa DS, Aaronson NK, et al. QLU-C10D: a health state classification system for a multi-attribute utility measure based on the EORTC QLQ-C30. Qual Life Res. 2016;25:625-636.
12. Norman R, Viney R, Aaronson NK, et al. Using a discrete choice experiment to value the QLU-C10D: feasibility and sensitivity to presentation format. Qual Life Res. 2016;25:637-649.
13. Colbourn CJ, Dinitz JH. Handbook of Combinatorial Designs. Boca Raton, FL: Taylor and Francis; 2006.
14. Allen M. Introduction to Measurement Theory. 1st ed. Long Grove, IL: Waveland Press; 2001.
15. Pedhazur E, Schmelkin L. Measurement, Design, and Analysis. Hillsdale, NJ: Lawrence Erlbaum; 1991.
16. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37-46.
17. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159-174.
18. Kazdin AE. Artifact, bias, and complexity of assessment: the ABCs of reliability. J Appl Behav Anal. 1977;10:141-150.
19. Fleiss JL, Cohen J, Everitt BS. Large sample standard errors of kappa and weighted kappa. Psychol Bull. 1969;72:323-327.
20. Cicchetti DV. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol Assess. 1994;6:284-290.
21. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb). 2012;22:276-282.
22. Kottner J. Interrater reliability and the kappa statistic: a comment on Morris et al. (2008). Int J Nurs Stud. 2009;46:140-141.
23. Morris R, MacNeela P, Scott A, et al. Ambiguities and conflicting results: the limitations of the kappa statistic in establishing the interrater reliability of the Irish nursing minimum data set for mental health: a discussion paper. Int J Nurs Stud. 2008;45:645-647.
24. Portney LG, Watkins MP. Foundations of Clinical Research: Applications to Practice. Upper Saddle River, NJ: Prentice Hall; 2000.
25. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15:155-163.
26. Mokkink LB, Terwee CB, Patrick DL, et al. International consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes: results of the COSMIN study. J Clin Epidemiol. 2010;63:737-745.
27. Bech M, Kjaer T, Lauridsen J. Does the number of choice sets matter? Results from a web survey applying a discrete choice experiment. Health Econ. 2011;20:273-286.
28. Louviere JJ. What if consumer experiments impact variances as well as means? Response variability as a behavioral phenomenon. J Consum Res. 2001;28:506-511.
29. Hoeffler S, Ariely D. Constructing stable preferences: a look into dimensions of experience and their impact on preference stability. J Consum Psychol. 1999;8:113-119.
30. Brouwer R, Dekker T, Rolfe J, Windle J. Choice certainty and consistency in repeated choice experiments. Environ Resour Econ. 2010;46:93-109.
31. Rakotonarivo OS, Schaafsma M, Hockley N. A systematic review of the reliability and validity of discrete choice experiments in valuing non-market environmental goods. J Environ Manage. 2016;183:98-109.
32. Feeny D, Blanchard CM, Mahon JL, et al. The stability of utility scores: test-retest reliability and the interpretation of utility scores in elective total hip arthroplasty. Qual Life Res. 2004;13:15-22.
33. Badia X, Monserrat S, Roset M, Herdman M. Feasibility, validity and test-retest reliability of scaling methods for health states: the visual analogue scale and the time trade-off. Qual Life Res. 1999;8:303-310.
34. Brouwer R. Do stated preference methods stand the test of time? A test of the stability of contingent values and models for health risks when facing an extreme event. Ecol Econ. 2006;60:399-406.
35. Brouwer R, Logar I, Oleg S. Choice consistency and preference stability in test-retests of discrete choice experiment and open-ended willingness to pay elicitation formats. Environ Resour Econ. 2016;68:729-751.
36. Schaafsma M, Brouwer R, Liekens I, de Nocker L. Temporal stability of preferences and willingness to pay for natural areas in choice experiments. Resour Energy Econ. 2014;38:243-260.
37. Scuffham PA, Whitty JA, Mitchell A, Viney R. The use of QALY weights for QALY calculations: a review of industry submissions requesting listing on the Australian Pharmaceutical Benefits Scheme 2002–4. Pharmacoeconomics. 2008;26:297-310.
38. Norman R, Kemmler G, Viney R, et al. Order of presentation of dimensions does not systematically bias utility weights from a discrete choice experiment. Value Health. 2016;19:1033-1038.