The Scottish Longitudinal Study (SLS) links three Scottish censuses (1991, 2001, and 2011) to a range of administrative data sources, including health registers and vital events, for a 5.3% sample of the Scottish population19. Through data linkage, the SLS has the advantage of providing socio-demographic determinants from censuses as well as morbidity and mortality information for a representative sample of the population. For our study, the Scottish Census 2001 was linked to hospitalisation, disease registries, exits from Scotland, and mortality records, allowing us to investigate the socio-demographic determinants and outcomes of specific disease trajectories in Scotland. We follow SLS participants over a 10-year period, from April 2001 to March 2011. We select participants aged 40–74 years at the time of the 2001 Census to focus on disease trajectories from mid-adulthood, which can be deemed more preventable than those at older ages.
To accurately determine a sequence of diseases, we first need to identify the onset of each disease as precisely as possible. Our analysis focuses on the onset of three diseases that commonly occur in the population and can be identified relatively accurately from hospitalisation and disease registries: diabetes, cardiovascular disease (CVD), and cancer. We identified any record of these diseases from hospitalisation data and from mental health, diabetes, and cancer registries, using a predefined list of codes from the International Classification of Diseases version 10 (ICD10). The codes used to define and identify each disease group can vary between studies. We used the codes E10-E14 to identify a record of diabetes and the codes C00-C97 (excluding non-melanoma skin cancer, C44) to identify a record of cancer. We followed the approach of previous Scottish publications using hospitalisation records in Scotland to identify a record of CVD20,21 and used the codes I20-I25 (ischaemic heart disease), I50 (heart failure), I60-I69 (cerebrovascular diseases), I70 (atherosclerosis) and G45 (transient cerebral ischaemic attacks and related syndromes). Diagnosis records were available from 1997 onwards for diabetes and CVD and from 1980 onwards for cancer. We treated the first record of each disease as the first diagnosis and as a proxy for disease onset.
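As an illustration of the code lists above, the following minimal Python sketch maps a single ICD10 code to one of the three disease groups. The code ranges come from the text; the function name `classify_icd10`, the constant `CVD_RANGES`, and the choice to match on the three-character category are assumptions made for this example and do not reproduce the SAS routines used in the study.

```python
from typing import Optional

# ICD10 ranges as listed in the text. Names are illustrative only.
CVD_RANGES = [
    ("I20", "I25"),  # ischaemic heart disease
    ("I50", "I50"),  # heart failure
    ("I60", "I69"),  # cerebrovascular diseases
    ("I70", "I70"),  # atherosclerosis
    ("G45", "G45"),  # transient cerebral ischaemic attacks
]


def classify_icd10(code: str) -> Optional[str]:
    """Return 'diabetes', 'CVD', 'cancer', or None for an ICD10 code."""
    # Assumption: matching is done on the three-character category,
    # so "I21.9" and "I219" are both treated as "I21".
    category = code.strip().upper().replace(".", "")[:3]
    if len(category) < 3 or not category[1:].isdigit():
        return None
    if "E10" <= category <= "E14":
        return "diabetes"
    if "C00" <= category <= "C97" and category != "C44":
        return "cancer"  # C44 (non-melanoma skin cancer) is excluded
    if any(low <= category <= high for low, high in CVD_RANGES):
        return "CVD"
    return None
```

For example, `classify_icd10("E11.9")` returns `"diabetes"`, while `classify_icd10("C44")` returns `None` because non-melanoma skin cancer is excluded.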
Once a first diagnosis for each chronic condition is identified, the date of the first diagnosis is used to order diseases and their co-existence into a sequence of disease states. The resulting sequence is based on the principle of disease combination over time. If we have three diseases A, B, and C, this gives us eight possible states: no disease, A, B, C, AB, AC, BC, and ABC. With four diseases, we would have 16 possible states: no disease, A, B, C, D, AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, and ABCD. A large number of states is difficult to visualise and interpret in sequence analysis. Therefore, we restrict our analysis to multimorbidity trajectories based on three diseases. Note that once a disease is diagnosed, it is kept as present and accumulates with the next disease diagnosed. Consequently, the number of diseases in a sequence can only increase over time. To account for people who left Scotland or died, and who were thus no longer at risk of developing a new disease that could be identified in Scotland, two states are added: “exit” and “death”. The final set of states included in our sequences is as follows: (1) “no disease”, (2) “diabetes”, (3) “CVD”, (4) “cancer”, (5) “diabetes, CVD”, (6) “diabetes, cancer”, (7) “CVD, cancer”, (8) “diabetes, CVD, cancer”, (9) “exit”, and (10) “death”. Months are used as the time unit for each element of the sequence. Therefore, our sequences are made of 120 consecutive states over a 10-year follow-up period. The first element of the sequence, in April 2001, is based on past disease history. For example, if diabetes onset was identified in 2000 and there was no onset of CVD or cancer by the start of the follow-up period, the first state of the sequence is “diabetes”. Subsequent states are constructed by adding any disease with a first diagnosis up to that month. Once “exit” or “death” has occurred, the sequence keeps that state up to the end of the follow-up period (March 2011).
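The construction described above can be sketched in Python as follows. This is a hedged illustration rather than the study's SAS code: the names `disease_states`, `month_index`, and `build_sequence` are hypothetical, dates are simplified to (year, month) pairs, and we assume that the “exit” or “death” state starts in the month in which the event occurs.

```python
from itertools import combinations

DISEASES = ("diabetes", "CVD", "cancer")
N_MONTHS = 120  # April 2001 to March 2011


def disease_states(diseases):
    """Enumerate 'no disease' plus every combination of the diseases."""
    states = ["no disease"]
    for k in range(1, len(diseases) + 1):
        states += [", ".join(combo) for combo in combinations(diseases, k)]
    return states


def month_index(year, month):
    """Months elapsed since April 2001 (April 2001 -> 0, March 2011 -> 119)."""
    return (year - 2001) * 12 + (month - 4)


def build_sequence(first_dx, end_event=None):
    """Build the 120-month state sequence for one person.

    first_dx:  dict mapping a disease to the (year, month) of its first record.
    end_event: optional tuple (state, (year, month)), state being 'exit' or 'death'.
    """
    dx_month = {d: month_index(*ym) for d, ym in first_dx.items()}
    end_state, end_month = None, N_MONTHS
    if end_event is not None:
        end_state, end_month = end_event[0], month_index(*end_event[1])

    sequence = []
    for t in range(N_MONTHS):
        if t >= end_month:
            # Absorbing state: kept until the end of follow-up.
            sequence.append(end_state)
            continue
        # Diseases accumulate: any first diagnosis up to month t is present.
        present = [d for d in DISEASES if d in dx_month and dx_month[d] <= t]
        sequence.append(", ".join(present) if present else "no disease")
    return sequence
```

A person with diabetes first recorded in June 2000, CVD first recorded in April 2005, and death in April 2009 would start in “diabetes”, move to “diabetes, CVD” at month 48, and remain in “death” from month 96 onwards. `disease_states(DISEASES)` returns the eight disease states listed in the text.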
A range of sociodemographic variables were collected in the 2001 Scottish census, including age, sex, marital status, household size, and socioeconomic circumstances such as educational level and household tenure. The Scottish Index of Multiple Deprivation (SIMD), an area-based measure of socioeconomic status created in 2004, is also provided for each SLS participant based on their postcode. SIMD is categorised into quintiles. We categorise marital status into “single (never married)”, “married”, and “separated, divorced, or widowed”. The household size variable is reduced to three categories: “Household with 1 individual”, “Household with 2 individuals”, and “Household with 3 or more individuals”. Educational level is categorised into “no qualification”, “low qualification” (secondary education and first vocational qualifications) and “high qualification” (higher education, higher vocational and professional qualifications). Household tenure provides information on whether individuals lived in a household that they “own”, “private rent”, “social rent” or whether they “live rent free”.
Hospitalisation and mortality data were linked at the individual level for each SLS participant. We choose two hospitalisation outcomes as proxies for health care utilisation: the number of hospitalisations and the number of overnight stays. All-cause mortality is used as a further health outcome. To assess differences between groups, we need to consider a similar period of observation, ensuring a consistent and comparable measure of each outcome across groups. We therefore measure all outcomes over a 5-year period starting from the point of multimorbidity onset (i.e. the point of transition to two chronic diseases).
Sequence analysis is a non-parametric method commonly used in the social sciences to analyse trajectories and social processes. The method has the advantage of providing a holistic view of trajectories, describing how processes evolve over time and when transitions occur. Sequencing (the order of distinct state occurrence), duration (the length of a spell in a state) and timing (when transitions occur) are key aspects of a sequence that can be of interest17. We use single-channel sequence analysis with one sequence per person. Sequences show the accumulation and combination of the three diseases of interest for each individual, based on their diagnosis dates and as described in the sequence creation section. Multiple-channel sequence analysis with multiple sequences per person, each sequence describing the trajectory of one disease, is also feasible. However, this approach would not consider sequencing from one disease to another at the individual level but rather the concomitant trajectories of each disease.
First, we describe the sequences using descriptive statistics of the most common reduced sequences, a simplification of sequences that retains only the order of distinct states. For example, for a sequence with the states “AAABBBCCC”, the associated reduced sequence is “ABC”. Then, we assume that individual trajectories can be divided into groups forming typical trajectories. To group sequences together, we need to assess how similar they are. Optimal matching (OM) is the method most often used to assess the dissimilarity between all pairs of sequences15,17. At the OM stage, choices must be made on the costs of the three possible operations (substitution, insertion, and deletion) that allow two sequences to match. These costs are set by a substitution matrix (SM) (for substitution operations) and an indel value (for insertion and deletion operations). For this analysis, an SM with a constant value of 1 is chosen, with the assumption that “all states are equally different” (cost of 1 for substituting any state A with any other state B). Alternatives include SM costs based on theory or on a data-driven approach17. In our preliminary analyses using different forms of SM, including the popular data-driven approach with an SM based on transition rates, the final clusters obtained were similar to those presented in our results section. A single indel cost can be determined according to the value we attribute to sequencing, duration, and timing when assessing similarities between sequences. Since our interest lies mostly in the order of diseases (sequencing) rather than in when transitions occur (timing), a low indel cost would be appropriate to downplay the cost associated with time lags between two sequences. However, how fast individuals transition from one state to another (duration spent in a state) might also be of interest. Therefore, a series of indel values is chosen for sensitivity analyses: 0.5, 1, 1.5, and 5.
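To make the two operations above concrete, the sketch below computes a reduced sequence and an OM dissimilarity with a constant substitution cost and a single indel cost. It is a plain dynamic-programming illustration written for this text, not the TraMineR implementation used in the analysis; the function names `reduced_sequence` and `om_distance` and their parameters are assumptions.

```python
from itertools import groupby


def reduced_sequence(seq):
    """Collapse consecutive repeats, keeping only the order of distinct states."""
    return [state for state, _ in groupby(seq)]


def om_distance(a, b, sub_cost=1.0, indel=1.0):
    """Minimum total cost of turning sequence a into sequence b.

    Allowed operations: substitution (constant cost sub_cost between any
    two different states), insertion and deletion (cost indel each).
    """
    # prev[j] holds the cost of matching the first i-1 states of a
    # with the first j states of b.
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel]
        for j in range(1, len(b) + 1):
            substitution = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            curr.append(min(
                prev[j - 1] + substitution,  # match or substitute
                prev[j] + indel,             # delete a state from a
                curr[j - 1] + indel,         # insert a state from b
            ))
        prev = curr
    return prev[-1]
```

With these definitions, `reduced_sequence("AAABBBCCC")` gives `["A", "B", "C"]`. Comparing “AAB” with “ABB” costs 1 when the indel cost is 1 (one substitution) but only 0.5 when the indel cost is 0.25 (one deletion plus one insertion), which illustrates how a low indel cost downplays time lags between otherwise similar sequences.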
The OM stage produces a dissimilarity matrix, which can be used in cluster analysis to distinguish typical trajectories. At the cluster analysis stage, we follow a common approach using hierarchical cluster analysis applied to the dissimilarity matrix. Partitioning around medoids, together with cluster quality measures (see Additional file 1), is used to decide on the number of clusters that gives the best clustering. Once clusters are identified, a chronogram (cross-sectional distribution of states at each time t) and a sequence index plot (longitudinal order of states for each individual) are presented to visualise and characterise the typical trajectories represented in each cluster.
In addition, to understand the characteristics associated with typical multimorbidity trajectories, we explore the sociodemographic profile of each trajectory. Descriptive statistics for age, sex, marital status, household size, educational level, household tenure, and SIMD are presented by trajectory cluster. Age- and sex-adjusted and multivariable multinomial logistic regressions are also used to understand whether there are significant sociodemographic differences between clusters. We present odds ratios (ORs) and their 95% confidence intervals (CIs).
Finally, we wish to understand whether specific trajectories are associated with greater health care utilisation and worse outcomes. To account for different exposure times per individual, a 5-year denominator is created and adjusted for any exit or death over that period (adjusted person-years). Differences in hospitalisation outcomes (number of hospitalisations and number of overnight stays) are analysed using Poisson regression with robust variance and the adjusted 5-year person-years as denominator. Risk ratios (RRs) and their 95% CIs are presented adjusted for sex and age, and then additionally for the five sociodemographic variables previously described. To account for other comorbidities playing a role in the likelihood of hospitalisation, analyses are further adjusted for a comorbidity count. The comorbidity count is created from 23 Elixhauser comorbidities (excluding eight Elixhauser comorbidities already covered by the diseases at the core of our analysis, i.e. diabetes, CVD, and cancer) and is based on the ICD10 codes from the Quan et al. algorithm22,23. Cox regression is used to explore cluster differences in 5-year mortality risk, censoring at the end of the 5-year follow-up period or at exit. Cox models are also adjusted for age and sex, and subsequently for the five sociodemographic variables and the comorbidity count. Hazard ratios (HRs) and their 95% CIs are presented. The proportional hazards assumption is checked visually using Kaplan–Meier survival curves by cluster and appears satisfied.
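The idea of an adjusted person-year denominator can be illustrated with a short Python sketch. The exact construction used in the study is not detailed here, so the example assumes that each person contributes the time observed after multimorbidity onset, expressed in months and truncated at 5 years; the function names are hypothetical, and the crude rate ratio shown is only a descriptive counterpart to the regression-based RRs.

```python
def adjusted_person_years(months_observed, window_years=5.0):
    """Person-years contributed within the window.

    Assumption: follow-up stops at exit, death, or the end of the
    5-year window, whichever comes first.
    """
    return min(max(months_observed, 0) / 12.0, window_years)


def event_rate(event_counts, months_observed):
    """Events per adjusted person-year for a group of individuals."""
    total_events = sum(event_counts)
    total_py = sum(adjusted_person_years(m) for m in months_observed)
    return total_events / total_py


def crude_rate_ratio(group, reference):
    """Unadjusted ratio of two group rates; each group is (counts, months)."""
    return event_rate(*group) / event_rate(*reference)
```

For instance, two people with 2 and 1 hospitalisations who were observed for 60 and 12 months contribute 6 adjusted person-years in total, giving a rate of 0.5 hospitalisations per person-year.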
Data preparation, sequence creation, descriptive statistics, and regression analyses were done using SAS version 9.4 (SAS Institute Inc, Cary, NC, USA). Sequence analysis, optimal matching, and cluster analysis were done using the TraMineR and WeightedCluster libraries in R (version 3.4.3)24. Graphical representations were created using R.
Ethics, data access and disclosure
Ethical approval was obtained from the University Teaching and Research Ethics Committee at the University of St Andrews (reference GG14300). This study was also approved by the SLS Research Board (SLS project number 2018_012) and by the Public Benefit and Privacy Panel for Health and Social Care of NHS Scotland (reference 1819–0093). All analyses were performed in accordance with the relevant SLS guidelines and regulations. Data analysis was conducted in a secure environment, the SLS safe haven, at National Records of Scotland, by a named researcher (GC) with appropriate training and clearance. Analyses followed SLS guidelines to ensure the confidentiality of the data. In addition, results were prepared following the SLS statistical disclosure control protocol. Numerators and denominators are presented rounded to the nearest 10, and percentages are estimated from the rounded numbers. However, model estimates (odds and risk ratios) and their confidence intervals were calculated from the unrounded numbers.
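The rounding rule described above can be illustrated with a small Python sketch. This is not the SLS disclosure control protocol itself: the function names are hypothetical, and rounding counts that end in 5 upwards and reporting percentages to one decimal place are assumptions made for the example.

```python
def round_to_nearest_10(count):
    """Round a non-negative count to the nearest 10 (ties rounded up, by assumption)."""
    return int((count + 5) // 10) * 10


def rounded_percentage(numerator, denominator, decimals=1):
    """Percentage estimated from the rounded numerator and denominator."""
    rounded_num = round_to_nearest_10(numerator)
    rounded_den = round_to_nearest_10(denominator)
    if rounded_den == 0:
        return None  # no percentage can be reported
    return round(100.0 * rounded_num / rounded_den, decimals)
```

With a true numerator of 123 and a true denominator of 987, the presented values would be 120 and 990, and the presented percentage would be computed from those rounded values rather than from the underlying counts.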
Ethical approval and consent to participate
We obtained ethical approval for this study from the University Teaching and Research Ethics Committee at the University of St Andrews (reference GG14300). Our SLS study linking the Scottish censuses to health records was approved by the SLS Research Board (SLS project number 2018_012) and the Public Benefit and Privacy Panel for Health and Social Care of NHS Scotland in October 2019 (reference 1819–0093). Individual consent was not sought. SLS linked datasets are anonymised and available in a dedicated safe haven following a strict protocol on access and disclosure control to ensure the safety and confidentiality of the data. Access is restricted to named researchers with appropriate training and clearance.