Address correspondence to: Eva-Maria Gamper, Department for Psychiatry and Psychotherapy, Medical University of Innsbruck, Anichstraße 35, 6020 Innsbruck, Austria.
Psycho-Oncology Cooperative Research Group, Cancer Australia Chair in Cancer Quality of Life, School of Psychology, Faculty of Science, University of Sydney, Sydney, Australia
Recently, a new cancer-specific multiattribute utility instrument based on the widely used health-related quality of life instrument of the European Organisation for Research and Treatment of Cancer, the QLQ-C30, was introduced: the QLU-C10D. For the elicitation of utility weights, a discrete choice experiment (DCE) was designed. Our aim was to investigate the DCE in terms of individual choice consistency and utility estimate consistency by applying a test-retest design.
Methods
We conducted the study in general population samples in Germany and France. The DCE was administered via a web-based self-complete survey using online panels. Respondents were presented with 16 choice sets comprising 11 attributes with 4 levels each. The retest was conducted 4 to 6 weeks after the first assessment. We used kappa and percentage agreement as measures of choice consistency, and both intraclass correlations and mean utility differences as measures of utility estimate consistency.
Results
A total of 300 German respondents (31% female, mean age 48 years [SD 14]) and 305 French respondents (46% female, mean age 47 years [SD 16]) completed test and retest assessments. Individual choice consistency was moderate to high (Germany: κ = 0.605, percentage agreement = 80.2%; France: κ = 0.411, percentage agreement = 70.6%). Utility estimate consistency was high when considering intraclass correlations (all >0.79). Mean utility differences were 0.08 in the German sample and 0.05 in the French sample.
Conclusions
Results indicate that the designed DCE elicits stable health state preferences rather than guesses or mood-specific or condition-specific judgments. Nevertheless, the identified mean utility differences between test and retest need to be taken into account when determining minimal important differences for the QLU-C10D in future research.
]. They combine quality of life and survival time into one parameter that can then be employed as the outcome metric in cost-utility analysis (CUA). For the calculation of QALYs, condition-specific health states must be assigned utility weights.
A common approach to achieve this is the use of multiattribute utility instruments (MAUIs) [
]. A MAUI consists of a set of questions covering various aspects of health and health-related quality of life (HRQOL), such as mobility, pain, and social life. Each question is answered on a rating scale with a finite number of mutually exclusive and exhaustive response categories. Through different combinations of levels within domains, a MAUI can depict a broad range of HRQOL profiles/health states. For example, the widely used three-level version of the EuroQol five-dimensional questionnaire (EQ-5D), one of the most commonly used MAUIs, consists of five questions, each with a three-level response scale, and is thereby able to describe 3^5 (i.e., 243) health/HRQOL states (see the short sketch below). The described health states are assigned utility weights that are derived from health state preferences elicited from respondents of a valuation sample [
]. In a DCE in a health context, respondents are most commonly asked to make choices between given (hypothetical) health profiles, each consisting of a set of so-called attributes, that is, different health/HRQOL aspects and a survival time. Thus, in contrast to the commonly used utility elicitation methods of standard gamble (SG) [
] and time trade-off (TTO), in which respondents have to quantify the strength of their preference, in a DCE they state a preference between two options. Whereas the SG and TTO approaches allow direct calculation of utilities for the health states valuated by the respondent, in the case of the DCE, utilities are determined indirectly by using a statistical model based on the answers of a sample of respondents.
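The combinatorics mentioned above can be made concrete with a short sketch (illustrative only; instrument names and level counts are taken from the text, and uniform response scales are assumed):

```python
from itertools import product

def health_states(n_dimensions, n_levels):
    """All level combinations a MAUI with uniform response scales can describe."""
    return product(range(1, n_levels + 1), repeat=n_dimensions)

# EQ-5D-3L: 5 dimensions x 3 levels -> 3^5 = 243 describable states
print(sum(1 for _ in health_states(5, 3)))  # 243

# QLU-C10D: 10 dimensions x 4 levels -> 4^10 = 1,048,576 describable states
print(4 ** 10)  # 1048576
```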
An important aspect of the reliability of inferences derived from DCEs is choice consistency, which, according to Louviere et al. [
], is high in cases in which the variability in choice outcomes is mainly explained by attributes and associated preference weights. Choice consistency is likely to decrease with task complexity, which, among other factors, is influenced by the number of attributes and alternatives [
]. This can have a substantial impact on policy inferences. If a task is too complex, small but important effects can be obscured by noise.
Test-retest reliability is a common measure of an instrument's ability to provide consistent scores over time in a stable population, but it has rarely been applied in the context of utility elicitation. In the context of DCEs for health valuations, two different types of test-retest reliability are relevant: 1) the consistency of respondents' choices of health states from choice sets when exactly the same choice sets are presented at two times (in the following called "choice consistency"); and 2) the consistency of the final utilities obtained from the DCE (in the following called "utility estimate consistency"). A similar approach of investigating DCE test-retest reliability has been used by Skjoldborg et al. [
The main aim of the present investigation was to determine the test-retest reliability of the DCE used for utility elicitation for the European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Utility–Core 10 Dimensions (QLU-C10D) [
], a recently developed cancer-specific MAUI based on the widely used HRQOL questionnaire EORTC Quality of Life Questionnaire–Core 30 (QLQ-C30).
Our specific aims were to
1. determine the consistency of respondents' choice of health state within each choice set of the DCE survey at two different time points, test and retest (choice consistency), and
2. determine the consistency of the estimated utilities derived from the DCE survey at the "test" and the "retest" time points (utility estimate consistency).
Methods
Context and Design of QLU-C10D Health State Valuations
The data used for these analyses were collected as part of the work of the EORTC Quality of Life Group utility project in collaboration with the international Multi-Attribute Utility Cancer (MAUCa) Consortium, which developed the QLU-C10D [
]. The valuations within the EORTC Quality of Life Group utility project were performed using a design and methodology first employed in the Australian valuations [
]. The QLU-C10D comprises 10 HRQOL dimensions (physical, role, social, emotional functioning; pain; fatigue; sleep disturbances; appetite loss; nausea; and bowel problems,) with the severity of impairment expressed using four levels (ranging from “not at all” to “very much”), analogous to the four-point Likert scale of the parent instrument, the QLQ-C30. The QLU-C10D therefore is able to describe 410 health states.
Each of the hypothetical health profiles for the DCE comprises a QLU-C10D health state and a survival time in that health state (1, 2, 5, or 10 years). Therefore, each DCE health profile comprises 11 attributes (10 HRQOL dimensions + survival time).
The experimental design underpinning the QLU-C10D DCE was constructed to achieve maximal statistical efficiency using a balanced incomplete block design [
]. The final design comprised 960 choice sets, each consisting of 2 health profiles (A and B) to be compared. Each respondent was randomly assigned 16 out of the 960 choice sets. Within choice sets, we randomized the order of attributes to mitigate potential order effects, and we randomized which option was seen as option A and option B to mitigate any position bias. To keep the cognitive burden for the respondent at a manageable level, only five attributes differed between options A and B within each choice set (i.e., we imposed overlap between profiles for the remaining attributes).
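As an illustration of the overlap and randomization rules just described (a sketch under stated assumptions, not the actual design algorithm, which drew the 960 choice sets from a balanced incomplete block design; attribute names are paraphrases, and treating survival time as eligible to be one of the 5 differing attributes is an assumption):

```python
import random

HRQOL_DIMS = ["physical", "role", "social", "emotional", "pain",
              "fatigue", "sleep", "appetite", "nausea", "bowel"]

def make_choice_set(rng=random):
    levels = {dim: [1, 2, 3, 4] for dim in HRQOL_DIMS}  # "not at all" .. "very much"
    levels["survival"] = [1, 2, 5, 10]                  # years
    base = {att: rng.choice(vals) for att, vals in levels.items()}
    option_a, option_b = dict(base), dict(base)
    for att in rng.sample(list(levels), 5):             # only 5 of 11 attributes differ
        option_a[att], option_b[att] = rng.sample(levels[att], 2)
    attribute_order = rng.sample(HRQOL_DIMS, len(HRQOL_DIMS))  # mitigate order effects
    options = [option_a, option_b]
    rng.shuffle(options)                                # mitigate A/B position bias
    return attribute_order, options
```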
Data Collection and Procedure for DCE Consistency Evaluations (Test-Retest Design)
Test-retest assessments were conducted in a German and a French sample because these are the first two European countries for which QLU-C10D utilities will be developed within the project. The DCE was administered via a Web-based self-complete survey, which was managed by an international company that specializes in online DCE surveys. The company uses online panels to recruit respondents, who are invited by email. Participation was voluntary and incentivized. Quotas for sex, age, and education were introduced according to distributions in the respective general population. Respondents’ eligibility in terms of quota completion was checked before starting the survey; if quotas were already fulfilled, the survey was discontinued for the respective respondents. For the QLU-C10D valuations in European countries, we applied the same methodology and contracted the same survey company as in the Australian valuation study [
]. The main elements of the survey were the EORTC QLQ-C30; an example choice set; the valuation task comprising 16 choice sets (randomized as described earlier); debriefing questions on the tasks' clarity, their difficulty compared with other surveys, the difficulty of choosing between the health states, and a self-reported strategy regarding how a decision was reached; demographics; and additional self-reported health measures (the general health question of the Short Form 36 Health Survey [SF-36] and the EQ-5D). The survey has been described in more detail by Norman et al. [
]. The study conforms to the ISPOR code of ethics. The whole survey was translated by native speakers of the target languages who were fluent in English. The procedure included forward and backward translations as well as feedback from in-country persons.
The two aspects of consistency, choice consistency and utility estimate consistency, were evaluated using a test-retest design. In general, the assumptions made when assessing test-retest reliability are that the characteristic being measured does not change over a predefined time and that the time period between the two assessments is long enough to prevent learning, carryover effects, or recall [
]. Transferred to the DCE for QLU-C10D valuation, this means we assumed that the way respondents value the attributes of the QLU-C10D is likewise stable; that is, the DCE survey results in similar valuations when conducted at two different time points.
Hence, respondents in both France and Germany were approached to complete exactly the same survey twice. The previously described randomization of attribute order and of the position of options A and B was preserved: randomization was done in the first survey, and each respondent was presented with his or her ordering again in the second survey. Respondents were invited to the two surveys separately; that is, they were not informed about the retest when completing the first survey. On invitation to the second survey, respondents were informed that the survey might be very similar to the first one for methodological reasons, so that they would not be annoyed by seemingly repetitive questions. The appropriate test-retest interval depends on the construct being measured and how stable or dynamic it is. The interval may range from days to months, but for most applications it is suggested to be about 2 weeks [
]. We decided to recontact 4 to 6 weeks after the first assessment because this period was considered to be short enough to not expect significant change in preferences, but long enough that the typical respondent would not recall previous answers.
Statistical Methods
Descriptive statistics are presented as percentages, means, medians, and standard deviations. First, assuming that the respondents' values underlying their health preferences are basically stable but may vary with a change of health state, we investigated whether their overall health status and quality of life, as measured by the score of the general health item of the SF-36 and the score of the QLQ-C30 global quality of life scale, had changed between the two assessments. Had such a change of health state been observed in a significant proportion of respondents, sensitivity analyses would have been required. We also investigated the proportion of changes of ≥2 response categories on these questionnaires/scales; the cutoff of ≥2 categories was chosen because a shift of that size can be considered to reflect a real change of health status rather than a differing response resulting from chance.
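As a minimal sketch of this stability check (variable names hypothetical), the proportion of respondents whose ordinal item score shifted by two or more categories between test and retest can be computed as:

```python
import numpy as np

def prop_changed(test_scores, retest_scores, threshold=2):
    """Proportion of respondents whose ordinal score changed by >= threshold categories."""
    test = np.asarray(test_scores)
    retest = np.asarray(retest_scores)
    return np.mean(np.abs(retest - test) >= threshold)

# e.g., SF-36 general health item (categories 1-5) at test and retest
print(prop_changed([3, 2, 4, 3, 5], [3, 4, 4, 2, 5]))  # 0.2
```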
Aim 1: Determine the consistency of respondents' choice of health state within each choice set of the DCE survey at two different time points, test and retest (choice consistency)
Test-retest reliability was assessed by considering the degree of agreement between respondents’ choices for each choice set at time 1 (test) and time 2 (retest) in the following ways. Exact agreement for a particular choice set and respondent was defined as choosing the same health profile in both test and retest surveys (either option A on both or option B on both). The overall agreement (over all respondents and choice sets), the proportion of respondents with exact agreement in all 16 choice sets, and the proportion with poor agreement (<8 of 16 [<50%] of choice sets) were then determined. We also calculated the kappa (κ) statistic, which provides an estimate of agreement that is corrected for chance [
]; that is, agreement is considered very good when κ is >0.8, good when between 0.61 and 0.8, and moderate when between 0.41 and 0.60. Overall agreement was considered confirmed at ≥70% agreement between item scores at test and retest, according to Kazdin [
], and confidence intervals for overall agreement were corrected for repeated observations within subjects by means of generalized estimating equation models, assuming a first-order autoregressive, AR(1), correlation structure.
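A minimal sketch of these two agreement measures for paired binary choices (plain 2 × 2 computation; the GEE-based confidence-interval correction described above is not reproduced here):

```python
import numpy as np

def agreement_and_kappa(test_choices, retest_choices):
    """Percentage agreement and Cohen's kappa for paired binary choices (0 = A, 1 = B)."""
    t = np.asarray(test_choices)
    r = np.asarray(retest_choices)
    po = np.mean(t == r)                        # observed agreement
    p_t1, p_r1 = t.mean(), r.mean()             # marginal probabilities of choosing B
    pe = p_t1 * p_r1 + (1 - p_t1) * (1 - p_r1)  # agreement expected by chance
    kappa = (po - pe) / (1 - pe)                # chance-corrected agreement
    return po, kappa

# toy example: 10 paired choices
po, k = agreement_and_kappa([0, 0, 1, 1, 0, 1, 0, 1, 1, 0],
                            [0, 1, 1, 1, 0, 1, 0, 0, 1, 0])
print(round(po, 2), round(k, 2))  # 0.8 0.6
```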
Aim 2: Determine the consistency of the estimated utilities derived from the DCE survey at the “test” and the “retest” time points (utility estimate consistency).
The DCE data at time 1 and time 2 were analyzed independently to determine utility weights at each time, following the method of Bansback et al. [
] to analyze DCE data based on the QLU-C10D. The model for the utility of option j in choice set s for respondent i is given by

U_isj = α · TIME_isj + β′ · (X_isj × TIME_isj) + ε_isj,

where X_isj is a set of dummy variables related to the levels of the health state (as defined within the QLU-C10D), and TIME_isj is the survival time presented in option j. The errors ε_isj were assumed to be independent and identically distributed type I extreme value random variables, as implied by the conditional logit specification. The analysis method used to estimate the parameters α (scalar) and β (vector) was conditional logistic regression. To account for repeated observations within subjects, generalized estimating equation models with an AR(1) correlation structure were used. Regression weights obtained were converted into utility decrements as described by Bansback et al. [
]. Briefly, utility weights for use in CUA require estimation of the willingness of the typical respondent to give up years of life to alleviate the health problem. In our empirical specification, the relative importance of each β term relative to α gives that relative value. Utility decrements were determined both for the "test" data and for the "retest" data. To check for systematic differences between test and retest utility weights, test and retest data were pooled and a model with additional interaction terms was fitted to the data. The interactions considered were assessment (test, retest) × TIME with 1 degree of freedom (d.f.) and assessment × HRQOL dimension × TIME with 30 d.f. (10 dimensions × 3 levels). The latter interaction is included because the standard approach for analyzing these data uses interactions between TIME and the levels of the dimensions of the instrument to allow the model to fit within the QALY framework [
]. To explore the effect of test/retest on these coefficients, one needs to interact the assessment with these two-factor interactions. Significance of the additional terms was tested by means of the likelihood ratio test.
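A minimal sketch of the Bansback-style conversion from fitted coefficients to utility decrements and health state utilities (the coefficient values below are invented placeholders, not the study's estimates):

```python
# Hypothetical fitted conditional-logit coefficients:
# alpha is the coefficient on survival TIME; beta[(dim, level)] is the coefficient
# on the interaction of a dimension-level dummy with TIME (levels 2-4; level 1 = no problems).
alpha = 0.35
beta = {("pain", 2): -0.02, ("pain", 3): -0.06, ("pain", 4): -0.12,
        ("fatigue", 2): -0.03, ("fatigue", 3): -0.07, ("fatigue", 4): -0.10}

def utility_decrements(alpha, beta):
    """Decrement for each dimension level: its beta relative to the TIME coefficient."""
    return {key: -b / alpha for key, b in beta.items()}

def state_utility(state, decrements):
    """Utility of a health state: 1 (full health) minus the summed decrements."""
    return 1.0 - sum(decrements.get((dim, lvl), 0.0) for dim, lvl in state.items())

dec = utility_decrements(alpha, beta)
print(round(state_utility({"pain": 3, "fatigue": 2}, dec), 3))  # 0.743
```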
Based on the utility decrements obtained, utilities for a random subset of 1000 of all 4^10 possible health states were calculated for both conditions, test and retest (the same subset of health states for test and retest). In addition, utilities for the 960 health states used in the balanced incomplete block design were determined in the same way. To assess the level of agreement between test and retest utility estimates for these 1000 health states, Pearson correlation coefficients were estimated to assess the degree of correlation between test/retest utility values; mean differences were used to assess systematic differences between test/retest utility values; and intraclass correlations (ICCs) were used to assess the degree of variation between test-retest utilities relative to total variation in utilities (between plus within health states). ICCs were interpreted as fair between 0.40 and 0.59, good between 0.60 and 0.74, and excellent between 0.75 and 1.00, according to Cicchetti [
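A sketch of these agreement measures for two vectors of health state utilities (the paper does not state which ICC variant was used; this sketch computes the two-way absolute-agreement ICC(2,1) as one plausible choice):

```python
import numpy as np

def agreement_measures(u_test, u_retest):
    x, y = np.asarray(u_test, float), np.asarray(u_retest, float)
    pearson = np.corrcoef(x, y)[0, 1]
    mean_diff = np.mean(y - x)  # systematic test-retest shift
    # ICC(2,1), absolute agreement, from the two-way ANOVA decomposition
    data = np.column_stack([x, y])
    n, k = data.shape
    row_m, col_m, grand = data.mean(1), data.mean(0), data.mean()
    msr = k * np.sum((row_m - grand) ** 2) / (n - 1)  # between health states
    msc = n * np.sum((col_m - grand) ** 2) / (k - 1)  # between assessments
    mse = np.sum((data - row_m[:, None] - col_m[None, :] + grand) ** 2) / ((n - 1) * (k - 1))
    icc = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return pearson, mean_diff, icc
```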
Test-retesting the survey was planned for samples of 300 respondents each from the German and the French general population. Sample size estimation was based on the length of the 95% confidence interval (CI) for κ. The planned sample size of 300 respondents for retest gives rise to a total of 4800 observations, as each respondent completed 16 choice sets. Using the formula for the standard error of κ by Fleiss et al. [
] and assuming that overall (summing over all choice sets) the probabilities of choosing options A and B will be similar (i.e., both close to 0.5), a sample size of 4800 pairs of observations (test, retest) yields 95% CIs of the form [κ − d, κ + d] with values of d not exceeding 0.03 over the entire range of possible κ values. This value may require correction for the fact that observations within respondents are correlated. For the German general population data, the within-subject correlations were low, giving rise to a correction factor of <1.5 for standard errors of parameters obtained in logistic regression analyses. When allowing for a correction factor of 2, the value of d would amount to 0.06, which still yields sufficiently small CIs.
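The claimed bound on d can be checked with a back-of-the-envelope computation (a minimal sketch using only the leading term of the Fleiss standard-error formula, under the assumption of symmetric marginals, i.e., pe = 0.5):

```python
import math

n = 4800  # pairs of observations (300 respondents x 16 choice sets)
# With pe = 0.5 and po = (1 + kappa)/2, the leading term of the Fleiss SE,
# sqrt(po * (1 - po)) / ((1 - pe) * sqrt(n)), simplifies to sqrt((1 - kappa**2) / n),
# which is maximal at kappa = 0.
d = 1.96 * math.sqrt((1 - 0.0 ** 2) / n)
print(round(d, 3))      # 0.028 -> d does not exceed 0.03
print(round(2 * d, 3))  # 0.057 -> ~0.06 with a correction factor of 2
```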
Results
Sample Characteristics
A total of 300 German (31% female, mean age 48 years [SD 14]) and 305 French respondents (46% female, mean age 47 years [SD 16]) completed the 16 choice sets at test and retest. Details are presented in Table 1. To identify potential selection bias, we compared the sociodemographic characteristics of the retest samples with the respective valuation sample. We also compared the respondents’ appraisal of the difficulty of the survey and the time required for completion. There were no differences except that among the German respondents who were willing to do the retest, fewer people perceived the survey as more difficult than other surveys. In general, DCE tasks were perceived as difficult but clear by German and French respondents. The majority (>90% in both samples) could name a strategy regarding how they reached their decisions (responses to debriefing questions are provided in Table 1).
Analyses of the stability of the respondents' health state (as an indicator of potential bias through an associated change of underlying preferences) demonstrated that the samples had stable health between test and retest (Germany: SF-36 general health question mean scores 2.95 vs. 2.96; QLQ-C30 global quality of life scale scores 67.1 vs. 67.6; France: SF-36 general health question mean scores 2.91 vs. 2.94; QLQ-C30 global quality of life scale scores 73.7 vs. 73.4). The proportion of changes of ≥2 response categories, which would indicate relevant changes of subjective health in a person, was less than 10% for both the general health question and the QLQ-C30 global quality of life scores. The time required for test and retest was fairly similar in both samples, with mean durations for survey completion of 13.15 minutes (SD 7.3) at test and 14.32 minutes (SD 7.24) at retest in the German sample, and 14.46 minutes (SD 9.4) at test and 14.63 minutes (SD 7.8) at retest in the French sample.
Test-Retest Reliability
Aim 1: Findings on choice consistency (test-retest reliability of individual choices).
The frequency distributions of choices in the test and retest data at the level of the individual choice sets are provided in Table 2 for the German and French samples separately. For those in the German sample who originally selected option A, 80.2% of cases selected A in the retest. For those who initially selected B, 80.3% of cases selected B in the retest, giving an overall agreement of 80.2% (Table 3). The corresponding numbers for the French sample were 69.2% and 71.8%, respectively, giving an overall agreement of 70.6%.
Table 2. Cross-tabulation of choices at first (test) and second (retest) DCE surveys in Germany and France

France (N = 305 respondents; n = 4880 choice sets):

Test choice    Retest A         Retest B         Total
A              1667 (34.2%)     741 (15.2%)      2408 (49.3%)
B              696 (14.3%)      1776 (36.4%)     2472 (50.7%)
Total          2363 (48.4%)     2517 (51.6%)     4880 (100%)

DCE, discrete choice experiment.
Note: Each respondent completed 16 choice sets, resulting in a total of 4800 choices (16 × 300) for the German sample at each time point and 4880 (16 × 305) for the French.
], consistency of test and retest was at the lower end of "good" agreement for the German sample and "fair to moderate" for the French sample. The proportion of respondents with exact agreement in all 16 choice sets was 25% in the German sample but only 10% in the French sample. The majority of German respondents had exact agreement in at least half of the 16 choice sets, with only 6% having exact agreement in fewer than 8 of 16 choice sets (with no impact of age and sex). Respondents from France were less consistent: 13% had exact agreement in fewer than 8 of 16 choice sets (again with no impact of age and sex).
Aim 2: Utility estimate consistency (test-retest reliability of derived utilities).
Utility decrements obtained for test and retest by conditional logistic regression were broadly similar in size. As an example, utility decrements for each level of each domain of the QLU-C10D for the French sample are displayed in Table 4 (German utility decrements are shown in Appendix Table 1 in Supplemental Materials found at https://doi.org/10.1016/j.jval.2017.11.012). Testing for systematic differences between the two assessments yielded no significant differences, either for the German (assessment × TIME: χ2 = 0.69, d.f. = 1, P = 0.406; assessment × HRQOL dimensions × TIME: χ2 = 28.79, d.f. = 30, P = 0.529) or for the French (assessment × TIME: χ2 = 0.61, d.f. = 1, P = 0.435; assessment × HRQOL dimensions × TIME: χ2 = 23.71, d.f. = 30, P = 0.785) sample. Regarding utility decrements for individual dimensions and levels, there were no significant differences between test and retest for any of the parameters in the French survey and for only 2 of 30 in the German survey (physical functioning, level 2: χ2 = 4.63, P = 0.031; fatigue, level 2: χ2 = 4.65, P = 0.031). After Bonferroni correction for multiple testing (with 30 comparisons, the adjusted significance threshold is 0.05/30 ≈ 0.0017), neither of these differences remained significant.
Table 4. Utility decrements for French test and retest data
Various measures of the consistency of the health state utilities derived from test and retest are displayed in Table 5 and presented graphically in Figures 1 and 2. Pearson correlations between utilities obtained from test and those obtained from retest were high, with values above 0.88 for both sets of health states, for both Germany and France. The corresponding ICCs were above 0.85 for the French sample, indicating excellent agreement. ICCs were somewhat lower, but only slightly below 0.8, for the German sample (0.790 and 0.796). Regarding mean differences between health state utilities derived from test and retest, the values for the French sample were fairly low (below 0.05 for both sets of health states), whereas mean differences in the German sample were higher, approaching 0.08.
Table 5. Measures of consistency of utility estimates
Discussion
In our investigation of the test-retest reliability of the DCE designed for the elicitation of utilities for the QLU-C10D, we found that the consistency of individual choices within the DCE was moderate based on κ values, and fairly high considering the overall agreement rates of 70.6% to 80.2%. There is no consensus in the literature regarding which of these parameters is more informative [
Ambiguities and conflicting results: the limitations of the kappa statistic in establishing the interrater reliability of the Irish nursing minimum data set for mental health: a discussion paper.
]. Kappa allows consideration of the possibility of guessing, but its assumptions on the independence of assessments lack scientific support, which may lead to an underestimation of real agreement [
]. Percentages for overall agreement tend to overestimate agreement because chance agreement is included. Because the potentially underestimating κ values were moderate and the potentially overestimating percentage agreement was high, true consistency is likely to lie somewhere in between, and hence can be considered sufficient at the level of single choices.
Given that the main aim of conducting a DCE was the estimation of utility decrements in aggregate for the population, it was of greater interest to us whether inconsistency on the individual choice level would affect the consistency of utility estimates. In both samples, we found the utility estimates to be fairly similar between the two valuation time points considering different parameters. Pearson correlations of >0.89, and even more importantly ICCs >0.790, indicate high to excellent utility estimate consistency despite the moderate κ values for choice consistency. Although there is no universal guideline for the interpretation of ICCs, other researchers have postulated similar interpretations [
Furthermore, we found that despite poorer choice consistency in the French survey, the consistency on the level of utility estimates was similar to that of the German survey. Mean differences of utilities between test and retest were lower in the French survey than in the German survey and went in different directions in the two countries (larger mean values for test than for retest in the French sample and larger values for retest than for test in the German sample). The seemingly divergent findings for choice consistency (higher κ values for the German survey than for the French survey) and utility estimate consistency (at least partly better values for the French survey) should be seen against the background that the former (κ) is based on individual observations, whereas the latter uses aggregated data—namely, utility estimates based on the complete valuation samples (n = 300). Shortcomings in consistency on the individual choice set level may average out at the aggregated level, which was obviously more the case for the French sample than for the German one.
It must be acknowledged that the mean differences were relatively large, given that the utility scale is anchored at 0 (dead) and 1 (full health), with most of the QLU-C10D health states having utilities in this range. Nevertheless, given that the standard deviations of the mean utility differences were larger than the mean differences themselves, and because there was no significant interaction between assessment time and domain in either country, we conclude that differences between test and retest surveys were not systematic. These test-retest mean differences provide useful benchmarks for the development of minimally important differences for the QLU-C10D, which logically must be larger than the differences observed in a test-retest context. The development of minimally important differences, which are defined as the smallest score change that patients perceive as important [
International consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes: results of the COSMIN study.
], will need to take these benchmarks into account. A further consideration is task burden: at 16 choice sets per respondent, the valuation task was rather long. At the same time, researchers also suggest that having to answer several choice sets may allow respondents to learn about the survey, the associated context, and their own preferences, such that they become capable of making more precise and consistent decisions [
]. Likewise, the number of attributes, at 11, was rather high. It may be that the cognitive burden of the QLU-C10D DCE was a factor in the degree of inconsistency in the test-retest results. Although the majority of respondents in both samples rated the tasks as difficult but clear and as comparable to other surveys they know, this subjective appraisal does not necessarily reflect actual understanding of the tasks. It is a limitation of our study that we did not check for this by, for example, including a choice set with a dominant option; this was not possible within this study because choice sets were assigned randomly.
Naturally, the time interval chosen between test and retest may also affect the reproducibility of results. It is a further limitation of our study that we cannot entirely rule out that a certain memorizing effect may have occurred in some respondents [
]. The chosen test-retest interval of 4 to 6 weeks should, however, prevent such effects from being very strong. This assumption is also supported by the durations required for completion, which were very similar at test and retest in both samples. Evidently, respondents were not quicker in the second survey, as would be expected in the presence of memorizing effects.
Given the scarcity of comparable studies, it is difficult to compare our results with the existing literature. Louviere et al. [
] highlighted the lack of consistency in the results of evaluations of market research applying DCEs, and the overall lack of reliability and validity of such evaluations was criticized by Rakotonarivo et al. [
] in the context of DCEs in environmental studies. To the best of our knowledge, the test-retest reliability of DCEs in the context of health has not yet been investigated in a comparable manner. Nevertheless, more traditional methods for utility elicitation have also rarely been tested for test-retest reliability. Feeny et al. [
] investigated the test-retest reliability of SG-elicited utilities in patients with osteoarthritis and found ICCs varying between 0.49 and 0.62 on an individual patient level. The test-retest ICCs found by Badia et al. [
] for TTO and visual analogue scale (VAS) methods for EQ-5D health states were clearly higher and comparable to our results (0.90 for VAS and 0.84 for TTO). Whereas Feeny et al. [
Do stated preference methods stand the test of time? A test of the stability of contingent values and models for health risks when facing an extreme event.
The use of QALY weights for QALY calculations: a review of industry submissions requesting listing on the Australian Pharmaceutical Benefits Scheme 2002–4.
] state, the application of QALYs in CUAs makes consideration of how they are derived and how they can be compared especially important. In the context of health utility estimation, DCEs are a relatively new approach. In the present investigation, we have contributed to the evaluation of the validity of utilities derived from the DCE designed for QLU-C10D valuations [
] by investigating their test-retest reliability. In summary, our results characterize the stability over time of individual choices and the reliability of estimated utilities arising from the DCE survey that is being used internationally to provide value sets for the QLU-C10D. We conclude that the individual choices are sufficiently stable over time to support the validity of this valuation method. We have provided important evidence about the reliability of elicited utilities and a threshold above which the minimally important difference in QLU-C10D scores must lie. A detailed description and interpretation of the utility weights obtained will be given in forthcoming articles based on the full samples of about 1000 respondents from each country.
Acknowledgments
The project was funded by a grant from the European Organisation for Research and Treatment of Cancer (EORTC; Grant No. 002/2014). Professor King is supported by the Australian Government through Cancer Australia. The work of Rosalie Viney was supported by a grant from the NHMRC (Grant No. 1065395).