Introduction
The “Extending the QALY” project aimed to develop a new generic measure, the EQ-HWB (EQ Health and Wellbeing), that can be used in economic evaluation across health, social care, and public health to estimate quality-adjusted life-years, based on the views of users and beneficiaries of these services including informal carers. The aim was to develop a long and short version, with the latter designed to be amenable to valuation to address the need for a single measure that would be used within and across the different beneficiaries.
1Brazier J, Peasgood T, Mukuria C, et al. The EQ-HWB: Overview of the development of a measure of health and well-being and key results. Value Health. 2022, in press.
Stage 1 of this project established the domains for the measure, which were based on aspects of health and wellbeing identified as important in qualitative research by future users of the measure.
2Peasgood T, Mukuria C, Carlton J, et al. What is the best approach to adopt for identifying the domains for a new measure of health, social care and carer-related quality of life to measure quality-adjusted life years? Application to the development of the EQ-HWB?
A qualitative literature review was undertaken to identify qualitative reviews on the impact on health and wellbeing of health conditions, being an informal carer or being a social care user, and primary qualitative work used in measure development.
3Mukuria C, et al. A targeted review of qualitative evidence on domains of quality of life important for patients, social care users and informal carers to inform the development of the EQ health and wellbeing (EQ-HWB).
This resulted in 32 subdomains organized into 7 high level domains (activity, autonomy, cognition, feelings and emotions, relationships, physical sensations, and self-identity).
Stage 2 generated a list of candidate items for each subdomain and stage 3 explored the content and face validity of these items (n = 97) using a standardized interview protocol across 6 countries (Argentina, Australia, China, England, Germany, and the United States) with individuals with various physical and mental health conditions, carers, and social care users.
4Carlton J, Peasgood T, Mukuria C, Connell J, et al. Generation, selection and face validation of items for a new generic measure of quality of life, the EQ health and wellbeing (EQ-HWB). Value Health. 2022 (in press). doi:10.1016/j.jval.2021.12.007
This explored potential users’ interpretation and views of the items. During stage 2, criteria for item selection were developed that reflected the aims of the project, and these were taken into consideration at every stage to support item selection (these criteria are discussed in a separate article
5Peasgood T, Mukuria C, Carlton J, Connell J, Brazier J. Criteria for item selection for a preference-based measure for use in economic evaluation.
). These criteria aimed to identify items which would work well for both measurement and valuation (such as brief and unambiguous items). Findings from the interviews were used to identify items (n = 64) to take forward to the psychometric assessment.
Psychometric methods are widely used in the development of outcome measures and are an essential step in generating a valid and reliable questionnaire.
6Quality of Life: The Assessment, Analysis and Interpretation of Patient-Reported Outcomes.
Current best practice recommends a combination of classical test theory approaches, confirmatory factor analysis (CFA), and analysis based on modern measurement theory such as item response theory (IRT) or Rasch analysis.
7Cappelleri JC, Jason Lundy J, Hays RD. Overview of classical test theory and item response theory for the quantitative assessment of items in developing patient-reported outcomes measures.
Stage 3 provided qualitative evidence on the face validity of the proposed items, whereas the stage 4 assessment aimed to provide quantitative evidence of validity in a larger sample. This article sets out the methods, results, and discussion of stage 4: classical psychometric analyses, factor analyses, and IRT analyses.
Methods
Future users of the new measure include patients, social care users, and informal carers. Therefore, a survey was undertaken targeting these groups and healthy individuals in 6 countries (Argentina, Australia, China, Germany, the United Kingdom, and the United States). Informal carers were defined as those who looked after friends or family because they were sick, disabled, or elderly. The study was initiated in the United Kingdom, and then other researchers who were members of the EuroQol Group were invited to apply to replicate aspects of the study including face validity (stage 3) and this stage. Applications were assessed by the funder with a focus on a mix of countries with consideration of different languages and potential cultural differences.
Sample
All participants were aged 18 years or older and able to complete a questionnaire in the main language of their country. In the United Kingdom, patients were recruited from online panels (target n = 1200) and from National Health Service (NHS) Trusts and primary care (target n = 800), with the latter invited by health or other professionals in person or by post with a self-complete paper questionnaire. The invitation included an option to request interviewer assistance (to encourage frail adults to take part) or to complete the survey online. For the online UK panel, patients with cancer, depression or anxiety, asthma or chronic obstructive pulmonary disease, diabetes, arthritis, heart conditions, or irritable bowel syndrome/Crohn’s disease were targeted (n ≥ 100 in the online panel). Conditions were selected to represent both long-term physical and mental health conditions including those with the highest prevalence.
8Quality and Outcomes Framework (QOF): 2015-2016. NHS Digital.
Social care users and informal carers were recruited via both the NHS and online.
Participants in the other countries were recruited from online panels only, targeting different groups that were dependent on the context: Argentina (cancer, mental health, diabetes, carers), Australia (mental health, people experiencing pain, carers, people who use care aids), China (depression, generalized anxiety disorder, chronic hepatitis B, human immunodeficiency virus/acquired immunodeficiency syndrome, and carers), Germany (cancer, carers), and the United States (cancer). A healthy general population (defined as a visual analog score on health of >80) was also recruited online in all countries.
For both factor and IRT analysis, a sample size of 500 is considered adequate.
9A First Course in Factor Analysis.
To ensure an adequate sample was achieved, the overall target sample size in the United Kingdom was 2000 participants across different clusters, age groups, and diagnoses to allow for subgroup analysis where necessary, whereas for the other countries the target was 500 with fewer subgroups.
Data Collection
The selection of items and additional questionnaires for inclusion in the survey questionnaire balanced respondent burden with data requirements for the analysis. A survey was designed that included the EQ-HWB candidate items (n = 64). Domain and subdomain analysis to explore the performance of items is more robust if there are at least 4 items per subdomain
10Netemeyer RG, Bearden WO, Sharma S. Scaling Procedures: Issues and Applications.
so 2 additional items were included to facilitate the IRT analysis: “I felt cheerful,” which was taken from the Warwick-Edinburgh Mental Wellbeing Scale
11Tennant R, Hiller L, Fishwick R, et al. The Warwick-Edinburgh mental well-being scale (WEMWBS): development and UK validation.
and included in the happiness/sadness domain, and “How difficult is it for you to wash, toilet, dress yourself, eat or care for your appearance?” (with different response options to those tested in face validity), which was taken from the AQoL-8D
12Richardson J, Iezzi A, Khan MA, Maxwell A. Validity and reliability of the Assessment of Quality of Life (AQoL)-8D multi-attribute utility instrument.
and included in the personal care domain. Positive and negative items were retained from stage 3 despite the acknowledgment that combining both positively and negatively worded items would be challenging for preference elicitation exercises.
Response options for most of the items used frequency terms (not at all, only occasionally, some of the time, often, or most or all of the time [n = 55]) with some items using severity terms (mild, slight, moderate, severe, or very severe [n = 2] or not at all, a little bit, somewhat, quite a bit, or very much [n = 1]) and difficulty terms (no difficulty, slight, some, a lot of, unable [n = 7], or a phrase to describe difficulty [n = 1]). Countries that needed translation (Argentina, China, and Germany) used the validated translation from the face validity studies (stage 3). Items that were added or modified from the face validity study underwent back translation into English alongside modification by an independent translation company. International teams proofread and approved the final versions.
Background information and other health and wellbeing questionnaires were also included to support analysis of domain structure. The additional questionnaires were the EQ-5D 3-level version (EQ-5D-3L)
13EuroQol: the current state of play.
and EQ-5D 5-level version (EQ-5D-5L),
14Herdman M, Gudex C, Lloyd A, et al. Development and preliminary testing of the new five-level version of EQ-5D (EQ-5D-5L).
the short Warwick-Edinburgh Mental Wellbeing Scale (SWEMWBS),
15Stewart-Brown S, Tennant A, Tennant R, Platt S, Parkinson J, Weich S. Internal construct validity of the Warwick-Edinburgh mental well-being scale (WEMWBS): a Rasch analysis using data from the Scottish health education population survey.
and the Adult Social Care Outcomes Toolkit (ASCOT).
16Netten A, Burge P, Malley J, et al. Outcomes of social care for adults: developing a preference-weighted measure.
EQ-5D-3L and EQ-5D-5L are generic health measures with 5 dimensions and 3 or 5 levels of severity, respectively. The SWEMWBS is a 7-item measure covering positive mental wellbeing. ASCOT, a measure of social care-related quality of life with 9 items, was included only in the United Kingdom, the United States, and Australia because translated versions were not available. Items from the SWEMWBS and EQ-5D were used to support estimation of latent constructs for the IRT and CFA. Average scores for the measures were used to describe the samples; ASCOT was scored using UK public preferences,
16Netten A, Burge P, Malley J, et al. Outcomes of social care for adults: developing a preference-weighted measure.
SWEMWBS was scored by summing across items,
15Stewart-Brown S, Tennant A, Tennant R, Platt S, Parkinson J, Weich S. Internal construct validity of the Warwick-Edinburgh mental well-being scale (WEMWBS): a Rasch analysis using data from the Scottish health education population survey.
and the EQ-5D measures were scored with the relevant country tariffs where available.
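As an illustration of the summed scoring described for the SWEMWBS, the following is a minimal sketch only. It assumes each of the 7 items is coded 1 to 5 (an assumption of this sketch, not stated above) and does not apply any transformation of the raw total; the function name is hypothetical.

```python
from typing import Sequence

N_ITEMS = 7                         # the SWEMWBS is a 7-item measure
MIN_RESPONSE, MAX_RESPONSE = 1, 5   # assumed response coding for this sketch


def swemwbs_raw_score(responses: Sequence[int]) -> int:
    """Return the raw SWEMWBS total by summing the 7 item responses."""
    if len(responses) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} responses, got {len(responses)}")
    for r in responses:
        if not MIN_RESPONSE <= r <= MAX_RESPONSE:
            raise ValueError(f"response {r} outside {MIN_RESPONSE}-{MAX_RESPONSE}")
    return sum(responses)
```

Under this coding the raw total ranges from 7 to 35, with incomplete response sets rejected rather than imputed.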
Three versions of the survey were created with a different order of the EQ-HWB candidate items to minimize learning and order effects and further randomized as to whether EQ-5D-3L or EQ-5D-5L appeared last (to support separate analysis on a comparison between these 2 instruments) making 6 different versions. Positively (n = 19) and negatively (n = 36) worded frequency items were grouped together to minimize the number of reversals of the meaning of the response options (eg, whether “often” represents higher or lower wellbeing). Items using “difficulty” response options were also grouped together. All other questionnaires were presented after the EQ-HWB candidate items. The same versions were administered across all countries, with approved EuroQol translations for EQ-5D to German, Argentinian Spanish, and simplified Chinese and relevant translations for the SWEMWBS.
17Validation of the German Warwick–Edinburgh mental well-being scale (WEMWBS) in a community-based sample of adults in Austria: a bi-factor modelling approach.
18Dong A, Chen X, Zhu L, et al. Translation and validation of a Chinese version of the Warwick–Edinburgh mental well-being scale with undergraduate nursing trainees.
19Translation, Spanish adaptation and validation of the Warwick-Edinburgh Well-being Scale in a sample of Argentine older adults.
A single company (Accent) managed the data collection.
All participants provided informed consent. Participants recruited in the United Kingdom via NHS Trusts, primary care organizations, and other organizations were given a £5 voucher, which was sent on receipt of the questionnaire. Online participants were rewarded according to their specific panel agreements, mainly with points. Ethical approval was obtained for all the studies.
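The 6 survey versions described in this section (3 orderings of the EQ-HWB candidate items crossed with which EQ-5D version appeared last) can be sketched as follows. The ordering labels and the uniform random assignment are assumptions of this sketch; the allocation procedure actually used is not described here.

```python
import itertools
import random

# Hypothetical labels for the 3 orderings of the EQ-HWB candidate items.
ITEM_ORDERS = ("order_A", "order_B", "order_C")
# Which EQ-5D version is presented last.
EQ5D_LAST = ("EQ-5D-3L", "EQ-5D-5L")

# Crossing the two design factors gives 3 x 2 = 6 survey versions.
VERSIONS = list(itertools.product(ITEM_ORDERS, EQ5D_LAST))


def assign_version(rng: random.Random) -> tuple:
    """Pick one of the 6 versions uniformly at random (assumed allocation)."""
    return rng.choice(VERSIONS)
```

Passing a seeded `random.Random` instance keeps the allocation reproducible for auditing.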
Analysis
The main aim of the data analysis was to assess item performance from a psychometric perspective to support selection of items for a long measure and a shorter measure, which would be suitable for valuation. The analysis also sought to confirm the domain structure. Classical psychometric analysis was undertaken exploring responses (inconsistencies, missing data, distribution) and sensitivity to known group differences. Factor analysis and IRT were used to assess dimensionality and performance of items. All items were recoded such that a higher score reflected poorer health or wellbeing. An analysis protocol was developed and used to support consistent analysis across the countries with modification based on sample size and groups that were included. A summary of the analysis methods is presented in
Table 1; further detail is available in the Supplemental Technical Appendix in Supplemental Materials found at https://doi.org/10.1016/j.jval.2021.11.1361.
21Reise SP, Bonifay WE, Haviland MG. Scoring and modeling psychological measures in the presence of multidimensionality.
22Böhnke JR, Croudace TJ. Calibrating well-being, quality of life and common mental disorder items: psychometric epidemiology in public mental health research.
23Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives.
24Likelihood-based item-fit indices for dichotomous item response theory models.
25Rose M, Bjorner JB, Becker J, Fries JF, Ware JE. Evaluation of a preliminary physical function item bank supported the expected advantages of the Patient-Reported Outcomes Measurement Information System (PROMIS).
26The basics of item response theory. Second edition. ERIC.
Table 1. Analysis methods.
CCC indicates category characteristic curve; CFA, confirmatory factor analysis; CFI, comparative fit index; DIF, differential item functioning; EFA, exploratory factor analysis; EQ-5D-3L, EQ-5D 3 level version; EQ-5D-5L, EQ-5D 5 level version; EQ-VAS, EQ-visual analog scale; IRT, item response theory; RMSEA, root mean square error of approximation; SWEMWBS, short Warwick-Edinburgh Mental Wellbeing Scale; TLI, Tucker-Lewis index; UK, United Kingdom; USA, United States of America.
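The recoding step in the analysis (all items recoded so that a higher score reflects poorer health or wellbeing) can be illustrated with a minimal sketch. It assumes raw responses are coded 1 to 5 in the order the options were shown, so that only positively worded items need reversing; the function name and coding are hypothetical.

```python
def recode_item(raw: int, positively_worded: bool, n_levels: int = 5) -> int:
    """Recode a response so that a higher value means poorer health/wellbeing.

    Assumes raw codes run from 1 (first option) to n_levels (last option).
    Positively worded items (e.g. "I felt cheerful") are reversed;
    negatively worded items are left unchanged.
    """
    if not 1 <= raw <= n_levels:
        raise ValueError(f"raw response {raw} outside 1-{n_levels}")
    return (n_levels + 1 - raw) if positively_worded else raw
```

For example, answering a positively worded item with the highest frequency (raw 5) becomes 1 after recoding, whereas the same raw response on a negatively worded item stays at 5.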
To support the consultation process that was used to inform item selection for the 2 EQ-HWB measures,
1Brazier J, Peasgood T, Mukuria C, et al. The EQ-HWB: Overview of the development of a measure of health and well-being and key results. Value Health. 2022, in press.
each country team summarized all the psychometric evidence and face validity evidence
27Carlton J, et al. Patient and Public Involvement and Engagement (PPIE) Within the Development of the EQ Health and Wellbeing (EQ-HWB). Value Health. 2022. In press.
using a 4 to 1 scale (performs very well, fairly well, mixed evidence, and performs poorly) for each item. The project criteria for item selection
5Peasgood T, Mukuria C, Carlton J, Connell J, Brazier J. Criteria for item selection for a preference-based measure for use in economic evaluation.
aided item prioritization. This included judgments on whether there was evidence that items were unacceptable (eg, ambiguous, offensive), were interpreted and answered differently by different people (eg, experiencing differential item functioning [DIF]), or were too mild or extreme to be appropriate for inclusion in a generic measure.
Discussion
Overview
The results from testing the candidate items indicated that most items performed well across the patient groups. Missing data did not discriminate between items, in part because most data were collected via an online platform where skipping items was not possible. Most items achieved a good spread across the response choices. Skewed distributions were present for items in the domains of mobility, self-care, hearing, seeing, and safety reflecting expectations that most respondents would not have problems in these domains and only very few respondents would have severe problems. Pain and discomfort measured by severity also had very few respondents using the poorest response option; nevertheless, identifying patients with very severe pain can be important when evaluating interventions.
Most items were able to distinguish between respondents with physical and mental health conditions and by severity of condition where this was tested; hence, known group validity provides little basis for discriminating between items. The known group validity evidence was mixed for carers, with mostly small or insignificant effect sizes for carers versus noncarers and across high- versus low-hour carers. Although we may have expected to see high-hour carers scoring lower on some items (such as relationships and activities), without more detail on the type of caring it is hard to draw conclusions from this. Furthermore, the matching across carers based on age, gender, and long-term condition may not have adequately captured other related characteristics.
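For readers unfamiliar with known group effect sizes, a minimal sketch of one common statistic follows. The effect size statistic used in the study is not stated in this section; a standardized mean difference (Cohen's d with a pooled standard deviation) is assumed here purely for illustration, and any input data would be hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Standardized mean difference between two groups (pooled SD).

    Assumed statistic for illustration; requires at least 2 values per group.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size undefined")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)
```

With items recoded so that higher scores reflect poorer health or wellbeing, a positive value for patients versus healthy respondents would indicate that the item separates the groups in the expected direction.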
The conceptual model was generally confirmed although there was evidence of high correlation between the final factors. The data were best modeled as a bifactor model with positive and negative measurement factors along with the construct/domain factors. Although we use the terms “negatively and positively worded measurement factors,” we acknowledge that we do not have a clear understanding of what is behind these latent constructs. The CFA identified a well-fitting model with 15 domains for UK, Australian, and US data. A further 4 domains (seeing, hearing, discomfort, and sleep problems) from the original conceptual model were not included for testing in this CFA but had been identified as independent in earlier (secondary) data analysis (see Technical
Appendix in Supplemental Materials found at
https://doi.org/10.1016/j.jval.2021.11.1361). Achieving a well-fitting model for the other countries involved removing energy and merging domains, leaving 13 separate domains for Germany, 12 for Argentina, and 10 for China. The items designed to capture meaningful and valuable activity, problems with daily activities, and feelings of control and autonomy did not clearly identify these different constructs.
The CFA model relied upon controlling for 2 measurement factors; however, there is no well-established method for conducting IRT on multidimensional models. Given that most (≥69% for all domains) of the variance in items in the UK data was explained by domain factors rather than the measurement factors, it was reasonable to conduct the IRT on separate domains.
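The comparison of variance explained by domain factors versus measurement factors can be illustrated with a minimal sketch. The exact calculation used in the study is not given here; this sketch assumes standardized loadings and uncorrelated factors, so that explained variance is the sum of squared loadings. The function name and any loadings passed to it are hypothetical.

```python
from typing import Sequence


def domain_variance_share(
    domain_loadings: Sequence[float], measurement_loadings: Sequence[float]
) -> float:
    """Share of explained item variance attributable to the domain factor.

    Assumes standardized loadings on uncorrelated factors, so each factor's
    explained variance is the sum of its squared loadings.
    """
    domain_var = sum(l * l for l in domain_loadings)
    measurement_var = sum(l * l for l in measurement_loadings)
    total = domain_var + measurement_var
    if total == 0:
        raise ValueError("no explained variance: all loadings are zero")
    return domain_var / total
```

Under these assumptions, two items loading 0.8 and 0.6 on their domain factor and 0.2 and 0.4 on a measurement factor would give a domain share of about 0.83, which would fall above the 69% figure reported for the UK data.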
Strengths and Limitations
The psychometric analysis relied on large mixed samples in different countries with different languages and cultural values. A mix of patients with physical and mental health conditions and social care users and carers were targeted in the different countries, which enabled assessment of the questions in future users. Online recruitment enabled certain patient groups to be recruited in a timely and cost-effective way that was standardized across the countries. Accepted methods for assessing the psychometric performance of items were applied to inform the selection of items for the measure. Focusing on specific analysis (eg, excluding factor analysis for domains such as hearing and seeing) and using available data (eg, using items from other questionnaires to allow factor analysis to be undertaken) ensured that there was a balance between research requirements and respondent burden.
Nevertheless, there were some limitations. There were few respondents at the most severe levels of pain/discomfort, which may be important in assessments of interventions, but those in very severe pain may be harder to recruit. Although social care users and carers were included, there was some ambiguity with regard to impact on health and wellbeing and no clear markers to test sensitivity. Receiving social care has an ambiguous interpretation for quality of life: it may indicate a greater need for social care or the effectiveness of receiving services; therefore, known group assessment was not undertaken for this group. For the caring role, caring for a friend or relative may suggest a person has higher wellbeing (such as close relationships
28The positive aspects of caregiving for cancer patients: a critical review of the literature and directions for future research.
); alternatively, the caring role may reduce health and wellbeing. The caring burden was proxied here through hours of caring, which may be a weak indicator.
Data were drawn from 6 countries but these do not cover the same samples, which limits our ability to make between-country comparisons. Online recruitment meant assessment of missing data could not be fully undertaken although this was partly mitigated by the inclusion of a paper-based survey in the United Kingdom. In this study, online respondents were younger than those who completed the measure on paper. Practicalities of distributing the paper version resulted in limited randomization of the order of items, which left a risk of order effects.
The known group analysis was mainly patient versus healthy population comparison. This is only a crude test and may not discriminate between respondents (indeed we find high effect sizes for almost all items). Although there were comparisons of severity, these were based on relevant EQ-5D dimension levels for arthritis and mental health and there were no clinical severity indicators. The aim for the final measure is to be sensitive to changes in health and wellbeing from treatment or services provided to meet individual needs. The known group difference assessment did not include indicators that could assess this type of sensitivity.
There was some evidence of “negative” and “positive” measurement factors within the factor analysis suggesting that some respondents answer negatively and positively framed questions differently.
Although the psychometric analysis here suggests the potential to merge subdomains, particularly for China, Germany, and Argentina, where the best fitting model arose when subdomains were merged, this merging is data driven and could arise from a common co-occurrence of problems across conceptually distinct domains. This is a particular concern in the non-UK samples, which had a more limited diversity of patient groups. The extent to which conceptually separate domains could be treated as merged for future patient groups or users of the measure is not clear.
The initial conceptual model
3Mukuria C, et al. A targeted review of qualitative evidence on domains of quality of life important for patients, social care users and informal carers to inform the development of the EQ health and wellbeing (EQ-HWB).
was developed from qualitative evidence from a Western context. This may explain why the model achieves a poorer fit within the data from China and points to the difficulty in developing a generic measure valid in all countries and cultures.
The DIF analysis relied upon the significance of DIF and did not explore the magnitude of DIF; hence, it is unclear how problematic the identified DIF is. Additionally, some DIF analysis comparing subgroups with smaller sample sizes may have lacked power.
29Testing differential item functioning in small samples.
The focus of this analysis was on providing evidence to support item selection that would complement qualitative face validity work. Further psychometric analysis on the final EQ-HWB instrument could address many other interesting questions.
Article and Author Information
Author Contributions: Concept and design: Peasgood, Mukuria, Brazier, Pickard, Engel
Acquisition of data: Peasgood, Mukuria, Marten, Kreimeier, Luo, Mulhern, Greiner, Pickard, Augustovski, Engel, Yang, Monteiro, Kuharic, Belizan
Analysis and interpretation of data: Peasgood, Mukuria, Brazier, Marten, Kreimeier, Mulhern, Greiner, Pickard, Augustovski, Engel, Gibbons, Yang, Monteiro, Kuharic, Belizan, Bjørner
Drafting of the manuscript: Peasgood, Mukuria, Brazier, Engel, Mulhern
Critical revision of the paper for important intellectual content: Peasgood, Mukuria, Brazier, Marten, Kreimeier, Luo, Mulhern, Greiner, Pickard, Augustovski, Engel, Gibbons, Yang, Monteiro, Kuharic, Belizan, Bjørner
Statistical analysis: Peasgood, Mukuria, Mulhern, Augustovski, Engel, Gibbons, Monteiro, Kuharic, Bjørner
Provision of study materials or patients: Peasgood, Marten, Kreimeier, Greiner, Augustovski
Obtaining funding: Peasgood, Mukuria, Brazier, Marten, Kreimeier, Luo, Mulhern, Greiner, Pickard, Engel, Yang
Administrative, technical, or logistic support: Peasgood, Marten, Kreimeier, Greiner, Engel, Monteiro, Kuharic
Supervision: Pickard
Conflict of Interest Disclosures: Drs Peasgood, Brazier, Mulhern, Engel, and Yang and Ms Belizan reported receiving grants from the Medical Research Council and the EuroQol Research Foundation during the conduct of this study. Dr Mukuria reported receiving grants from the EuroQol Research Foundation during the conduct of this study and outside the submitted work and reported being a member of the EuroQol Research Association. Drs Brazier, Marten, Kreimeier, Mulhern, Greiner, Engel, and Yang reported being members of the EuroQol Group. Dr Brazier reported being a past member of the EuroQol Group Executive, reported receiving grants and personal fees from the EuroQol Research Foundation outside the submitted work, and reported receiving royalties paid to the University of Sheffield for the use of the SF-6D preference-based measure of health outside the submitted work. Drs Marten, Kreimeier, and Greiner reported receiving grants and nonfinancial support from the EuroQol Research Foundation during the conduct of this study. Dr Luo reported receiving grants and personal fees from the EuroQol Research Foundation during the conduct of the study and outside the submitted work. Drs Luo and Mulhern are editors for Value in Health and had no role in the peer-review process of this article. Drs Pickard, Augustovski, and Gibbons and Mses Monteiro and Kuharic reported receiving grants from the EuroQol Research Foundation during the conduct of the study. Ms Kuharic reported receiving a fellowship from Takeda Pharmaceuticals USA outside the submitted work. No other disclosures were reported.
Funding/Support: This work was supported by grant 170620 from the UK Medical Research Council and grants 20180460, 20180600, 20190260, 20180450, 20180580, and 20180520 from the EuroQol Research Foundation.
Role of the Funder/Sponsor: The funder had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.