A systematic review of using and reporting survival analyses in acute lymphoblastic leukemia literature

Background: Survival analysis is commonly used to determine the treatment effect among acute lymphoblastic leukemia (ALL) patients who undergo allogeneic stem cell transplantation (allo-SCT) or other treatments. The aim of this study was to evaluate the use and reporting of survival analyses in these articles. Methods: We performed a systematic review by searching the MEDLINE, EMBASE and Cochrane library databases from inception to April 2015. Clinical trials of patients with ALL comparing allo-SCT with another treatment were included. We included only studies that used survival analysis as part of the statistical methods. Results: Fourteen studies were included in the review. Sample size estimation was described in 4 (29 %) studies. Only 4 (29 %) studies reported the list of covariates assessed in the Cox regression and 6 (43 %) studies provided a description of censorship. All studies reported survival curves using the Kaplan-Meier method. The comparisons between groups were investigated using the log-rank test (two studies also used the Wilcoxon test). Crossing survival curves were observed in 11 (79 %) studies. The Cox regression model was incorporated in 10 (71 %) studies. None of the studies assessed the Cox proportional hazards assumption or goodness-of-fit. Conclusions: The use and reporting of survival analysis in adult ALL patients undergoing allo-SCT have significant limitations. Notably, crossing survival curves were common and none of the studies assessed the proportional hazards assumption. We encourage authors, reviewers and editors to improve the quality of the use and reporting of survival analysis in the hematology literature.


Background
Survival analysis measures the time from a defined starting point to the occurrence of an event of interest, in a setting where the risk of that event changes over time. It serves three purposes: (1) to estimate survival and hazard functions from survival data, (2) to compare survival and hazard functions between groups and (3) to assess the relationship between predictor variables and survival time. The essential components of survival analysis are the time to event and the binary event outcome (success or failure).
The probability of survival can be represented by generating a Kaplan-Meier (KM) curve from survival data. The KM estimate is built from the conditional probability of survival [1], calculated at each time point at which an event is recorded. The difference in survival between two or more groups (or the treatment effect, if treatment defines the groups) is commonly compared using the log-rank test [2].
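The product-limit calculation behind a KM curve can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the software used in any of the reviewed trials; the function name and the follow-up values are hypothetical.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up time of each patient
    events : 1 if the event was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Conditional probability of surviving past t, given at risk at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical follow-up in months (illustrative values only).
times = [2, 3, 3, 5, 8, 8, 12, 12]
events = [1, 1, 0, 1, 0, 1, 0, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"month {t}: S(t) = {s:.3f}")
```

Censored patients contribute to the risk set up to their last follow-up but never trigger a step in the curve, which is why the definition of censoring matters for the estimate.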
The Cox proportional hazards (PH) model is a widely used regression method for survival data. It estimates the effect of predictor variables on the hazard function without requiring a baseline hazard rate to be specified [3], which is why it is considered a semi-parametric model. The measure of effect, unadjusted or adjusted for covariates, is reported as a hazard ratio (HR), obtained by exponentiating a regression coefficient in the model. An important property of the Cox PH model is the PH assumption, which requires the hazard ratio to be constant over time [4]. Other regression models that can be used for survival analysis include an extended Cox PH model or a parametric survival model (Weibull, exponential, log-logistic, lognormal, etc.) [5].
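In standard notation, which is assumed here rather than taken from the reviewed trials, the model and the hazard ratio for a binary covariate can be written as follows, where h_0(t) is the unspecified baseline hazard and all other covariates are held fixed:

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \dots + \beta_p x_p\right),
\qquad
\mathrm{HR}_j = \frac{h(t \mid x_j = 1)}{h(t \mid x_j = 0)} = \exp(\beta_j)
```

Because h_0(t) cancels in the ratio, the hazard ratio does not depend on t; this is exactly the PH assumption.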
A recent systematic review demonstrated that survival analysis was incorporated in only 29 % of internal medicine articles [6]. However, there has been an increasing trend toward the use of survival analysis in all categories of medical journals [6].
Allogeneic stem cell transplantation (allo-SCT) is the most potent post-remission therapy in adult acute lymphoblastic leukemia (ALL), but its benefit in adult ALL remains controversial [7]. Survival analysis is generally used to determine the treatment effect among ALL patients who undergo allo-SCT or other treatments, both in terms of prolongation and increased likelihood of survival. Allo-SCT is associated with high treatment-related mortality, but patients who tolerate the treatment are more likely to have prolonged event-free survival and overall survival. Non-allo-SCT treatment, on the other hand, is less intensive but may be associated with lower long-term event-free survival. We chose the ALL literature because we expected the use and reporting of survival analysis in such articles to be complicated. To investigate whether the heterogeneity in study results is at least partly explained by more or less appropriate use of time-to-event analysis, we conducted a systematic review of clinical trials that investigated the efficacy and safety of allo-SCT in adult patients with ALL. The aim of this study was to evaluate the use and reporting of survival analyses in these articles.

Data sources
We performed a systematic review by searching the MEDLINE, EMBASE and The Cochrane library (The Cochrane Register of Controlled Trials and Cochrane Database of Systematic Reviews) databases. The reference lists of the retrieved articles were also searched. The search terms were: Bone Marrow Transplantation OR Hematopoietic Stem Cell Transplantation OR Peripheral Blood Stem Cell Transplantation AND nonmyeloblat* OR non-myeloblat* OR Precursor Cell Lymphoblastic Leukemia-Lymphoma OR lymphoblast* OR lymphoid AND (random* OR RCT OR control* OR trial). The database search was performed from inception to April 2015 with no language restrictions.

Selection criteria
Studies were included if they met the following criteria: the study was a clinical trial, controlled clinical trial or randomized controlled trial comparing allo-SCT with autologous SCT or non-transplantation therapy in patients with ALL in first complete remission. We included only studies that used survival analysis as one of the statistical methods.

Study selection and data extraction
Two investigators (CC and CH) independently identified articles using predefined inclusion criteria and independently extracted the data using a standardized data extraction form. Disagreements at either stage were resolved by consensus.
We collected the following data: study design, outcome of interest (death, relapse), number of patients and number of events, survival curves estimate, regression method to estimate the hazard rate (Cox PH model or parametric survival model), methods for comparing the survival curves, the shape of the survival curves, variable selection, model building strategy, censoring description, length of follow-up, sample size calculation, test of interaction between variables, test for time dependent covariates, test for proportionality assumption and test for goodness-of-fit.

Analytic criteria
To evaluate the quality of reporting survival analyses, we used the following list of criteria for the proper use and description of the survival analyses.

Study characteristics
A total of 881 citations were identified by the systematic search strategy. Of these, 325 were duplicates. After screening of the titles and abstracts using the predefined inclusion criteria, 541 studies were excluded; the reasons for exclusion are summarised in Fig. 1. This left 15 potential studies for full-text review. Two further studies were identified by manual review of the references. We excluded three studies because they were not clinical trials comparing allo-SCT with other treatments. Thus, 14 studies [9][10][11][12][13][14][15][16][17][18][19][20][21][22] were included in our systematic review.
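The selection counts in this paragraph can be checked with simple arithmetic. The snippet only restates the numbers reported above; the variable names are illustrative, and it assumes that the two studies found by manual reference review joined the full-text pool before the three exclusions were applied.

```python
# Counts as reported in the text; names are illustrative.
identified = 881
duplicates = 325
excluded_on_screening = 541

full_text_candidates = identified - duplicates - excluded_on_screening

from_reference_lists = 2
excluded_after_full_text = 3

# Assumption: hand-searched studies were added before the final exclusions.
included = full_text_candidates + from_reference_lists - excluded_after_full_text

print(full_text_candidates, included)  # 15 14
```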
The study characteristics are summarised in Table 1. All of the studies were clinical trials. Patients were randomized to receive either allo-SCT or other treatments (autologous SCT or consolidation chemotherapy). Patients were allocated to undergo allo-SCT if the patient had a human leukocyte antigen (HLA) matched sibling donor, otherwise, the patient received autologous SCT or consolidation chemotherapy according to the study protocols. The median follow-up ranged from 59 to 114 months. The time-to-event outcomes in the included studies were overall survival and disease-free survival.

Sample size
Sample size estimation was described in 4 of 14 studies (Table 2). The proportion of patients who experienced an event ranged from 36 to 78 %. With respect to the sample size and number of covariates assessed in the regression analysis, only four studies reported the list of covariates assessed in the Cox regression model. Of these, two studies had more than ten events per covariate (20.4 and 23.2, respectively) [11,13], whereas the other two had 8.3 [20] and 3.8 [22] events per covariate.
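The events-per-covariate ratio used here is a simple quotient, sketched below. The event and covariate counts in the example are hypothetical, chosen only to give ratios of the same order as those reported; the underlying counts of the individual trials are not reproduced here. The threshold of more than ten events per covariate is the rule of thumb applied in this review.

```python
def events_per_covariate(n_events, n_covariates):
    """Number of observed events divided by the number of covariates modelled."""
    if n_covariates <= 0:
        raise ValueError("n_covariates must be positive")
    return n_events / n_covariates


def meets_rule_of_thumb(n_events, n_covariates, threshold=10.0):
    """True if the ratio exceeds the threshold (more than ten by default)."""
    return events_per_covariate(n_events, n_covariates) > threshold


# Hypothetical trials: (events, covariates). Not taken from the reviewed studies.
examples = {"trial_x": (102, 5), "trial_y": (30, 8)}
for name, (n_events, n_covariates) in examples.items():
    ratio = events_per_covariate(n_events, n_covariates)
    print(name, round(ratio, 1), meets_rule_of_thumb(n_events, n_covariates))
```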

Censoring description
There were six studies that provided the censoring description.

Survival curves
All studies reported survival curves using the KM method. The comparisons between the groups were investigated using a log-rank test in all studies (two studies used both the log-rank test and the Wilcoxon test). With regard to the shape of the survival curves, 11 studies reported crossing survival curves [9][10][11][12][13][14][15][17][18][19][20], whereas one study reported unevenly separated survival curves [21] and one study reported evenly separated survival curves [22]. Overlapping survival curves were observed in five studies [9,12,14,17,20]. We were not able to compare survival curves in one study in which the graphs were plotted separately [16].

Statistical significance
All of the studies reported the statistical test used to measure the difference between survival curves. Of these, five studies reported a statistically significant treatment effect between groups, eight studies reported a non-significant result and one study did not report statistical significance.
One study mentioned the test for interaction and competing risk analysis [11]. None of the studies described variable selection. Only one study mentioned the strategy used for model building [18].

Check for the PH assumption
None of the studies that used the Cox PH model mentioned checking the PH assumption.

Model checking
The summary measures of the regression diagnostic and goodness-of-fit were not mentioned in any of the studies.

Discussion
Our study demonstrates that survival analyses have been used extensively in the landmark trials evaluating allo-SCT in adult patients with ALL. However, the majority of the trials reported their statistical methods and results poorly. Sample size estimation and censoring were not routinely described. Almost all the presented survival curves crossed. Moreover, the PH assumption was not assessed even when the investigators used the Cox PH model. In addition, goodness-of-fit and regression residual analyses were lacking in all of the trials.
Regarding sample size estimation, the Consolidated Standards of Reporting Trials (CONSORT) state that it is important for authors to indicate how the sample size was determined [23]. The intent of the sample size estimation is to ensure that a particular study has sufficient statistical power to detect a difference in the treatment effect between groups. Our review demonstrates that only 4 of 14 (29 %) trials described a sample size estimation. With respect to the regression analysis, only four studies provided a full list of covariates. Of these, only two studies appeared to be sufficiently powered (event-to-covariate ratio of more than 10).
In survival analysis, patients who do not experience the relevant outcome over the study period, patients who are lost to follow-up during the study period and patients who withdraw from the study are censored. There are three assumptions regarding censorship in survival analysis: that it is independent, random and noninformative [4]. The description of censorship is therefore an important aspect to report in a publication. However, only 6 of 14 (43 %) trials described their censoring. More importantly, if relapse is the outcome of interest in these studies, patients who die from any cause will be censored. In this circumstance, censoring may be considered informative because patients may die from disease progression or treatment-related causes. Consequently, the results may change depending on how censoring is defined. Providing a definition of censorship is a critical component of reporting these trials in the literature.
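As an illustration of the kind of sample size statement discussed earlier in this Discussion, the sketch below applies Schoenfeld's approximation for the number of events needed by a two-sided log-rank test comparing two groups. It is one common approach, not the method used by any of the reviewed trials; the function name, the default alpha and power, and the example hazard ratios are assumptions.

```python
from math import ceil, log
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.80, allocation=0.5):
    """Approximate number of events for a two-sided log-rank test.

    Schoenfeld's formula:
        d = (z_{1-alpha/2} + z_{power})^2 / (p * (1 - p) * ln(HR)^2)
    where p is the proportion of patients allocated to one group.
    """
    if hazard_ratio <= 0 or hazard_ratio == 1:
        raise ValueError("hazard_ratio must be positive and different from 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    denominator = allocation * (1 - allocation) * log(hazard_ratio) ** 2
    return ceil((z_alpha + z_beta) ** 2 / denominator)


# Hypothetical target effects, equal allocation, 80 % power, two-sided 5 % level.
print(required_events(0.7))
print(required_events(0.5))
```

The result is a number of events, not of patients; the number of patients to enrol additionally depends on the expected event rate and follow-up, which is one reason the proportion of events per patient is worth reporting.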
All of the studies utilized survival curves. Not surprisingly, crossing survival curves were found in 11 of 14 studies. Allo-SCT is considered the most potent post-remission therapy in adult ALL [24]. In long-term follow-up studies, the patients who underwent allo-SCT had a lower relapse rate due to a graft-versus-leukemia effect [11]. However, these patients had a higher early mortality rate from the toxicity of myeloablative chemotherapy when compared with patients who received autologous SCT or consolidation chemotherapy [11,13]. Therefore, survival curves comparing these two treatments may be expected to cross at some point. Early death from treatment-related complications (commonly found in allo-SCT) and late death from relapsed disease (commonly found in autologous SCT) should be taken into account in the treatment of ALL. Crossing survival curves make the interpretation of the treatment effects of the interventions much more complicated.
The log-rank test, which is based on the chi-square test, is the most common method used to compare survival curves [25]. It is important to note that the log-rank test may be unreliable when the survival curves cross, because it loses power and the probability of a type II error therefore increases [26]. Our study reveals that, among ten analyses with crossed survival curves, eight were non-statistically significant and two were statistically significant. We found that five studies had overlapping survival curves, which might explain the non-significant findings for the interventions. For the remaining studies, it was difficult to draw a conclusion from a log-rank test applied to crossing survival curves.
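The loss of power can be seen in the way the log-rank statistic is assembled: it sums observed minus expected events over all event times, so an early excess in one group and a late excess in the other partly cancel. The following is a minimal two-group sketch using only the Python standard library; the function name and both data sets are hypothetical and are not taken from the reviewed trials.

```python
from math import erfc, sqrt


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value) on 1 df."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted({u for u, e, _ in data if e == 1}):
        n = sum(1 for u, _, _ in data if u >= t)                # at risk overall
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)   # at risk, group A
        d = sum(1 for u, e, _ in data if u == t and e == 1)     # events overall
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group A event count at time t.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("variance is zero; the test is not defined")
    chi_square = observed_minus_expected ** 2 / variance
    p_value = erfc(sqrt(chi_square / 2))  # upper tail of chi-square with 1 df
    return chi_square, p_value


# Hypothetical data in months: group A has early deaths and then a plateau,
# group B has no early deaths but more late deaths, so the curves cross.
times_a = [1, 2, 3, 20, 20, 20, 20, 20]
events_a = [1, 1, 1, 0, 0, 0, 0, 0]
times_b = [10, 12, 14, 16, 18, 20, 20, 20]
events_b = [1, 1, 1, 1, 1, 0, 0, 0]

chi_square, p_value = logrank_test(times_a, events_a, times_b, events_b)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.2f}")
```

In this made-up example the two groups follow clearly different survival patterns, yet the test is far from significant because the early and late differences offset each other.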
Strategies have been proposed to overcome the limitation of the log-rank test when the survival curves cross. The authors may consider analysing the survival curves at a fixed point in time [27]. Another alternative includes using a weighted log-rank (Harrington-Fleming) test which gives more weight to the later events [28]. Other weighted log-rank tests that may be useful are the methods developed by Gill et al. or Pepe and Fleming [29,30]. Li et al. recently published a simulation study which investigated several statistical methods in the situation of crossing survival curves. This study showed that adaptive Neyman's smooth tests and the two-stage procedure provided greater stability and higher power as compared to the other methods [29].
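A weighted log-rank test of the kind mentioned above can be sketched by multiplying each event time's contribution by a weight. The sketch below uses the two-parameter Fleming-Harrington weight, w(t) = S(t-)^rho * (1 - S(t-))^gamma, where S is the pooled KM estimate just before t; whether this is the exact variant intended by reference [28] is an assumption. The function name and data are hypothetical. Setting rho = 0 and gamma = 0 gives the ordinary log-rank test, and gamma > 0 emphasises later events.

```python
from math import erfc, sqrt


def fleming_harrington_test(times_a, events_a, times_b, events_b, rho=0.0, gamma=0.0):
    """Weighted two-group log-rank test; returns (chi_square, p_value) on 1 df."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    pooled_survival = 1.0  # pooled Kaplan-Meier estimate just before the current time
    score = 0.0
    variance = 0.0
    for t in sorted({u for u, e, _ in data if e == 1}):
        n = sum(1 for u, _, _ in data if u >= t)
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)
        d = sum(1 for u, e, _ in data if u == t and e == 1)
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        weight = pooled_survival ** rho * (1.0 - pooled_survival) ** gamma
        score += weight * (d_a - d * n_a / n)
        if n > 1:
            variance += weight ** 2 * d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
        pooled_survival *= 1.0 - d / n  # update after using the left-hand value
    if variance == 0:
        raise ValueError("variance is zero; the test is not defined")
    chi_square = score ** 2 / variance
    return chi_square, erfc(sqrt(chi_square / 2))


# Hypothetical crossing-curve data in months (illustrative values only).
times_a = [1, 2, 3, 20, 20, 20, 20, 20]
events_a = [1, 1, 1, 0, 0, 0, 0, 0]
times_b = [10, 12, 14, 16, 18, 20, 20, 20]
events_b = [1, 1, 1, 1, 1, 0, 0, 0]

plain_chi, plain_p = fleming_harrington_test(times_a, events_a, times_b, events_b)
late_chi, late_p = fleming_harrington_test(times_a, events_a, times_b, events_b, gamma=1.0)
print(f"log-rank: chi-square = {plain_chi:.2f}; late-weighted: chi-square = {late_chi:.2f}")
```

In this made-up example the late-weighted statistic is larger than the unweighted one because it down-weights the early deaths in group A. The choice of weights should be specified in advance, since selecting them after seeing the curves inflates the type I error.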
Relapsed disease and death are the most common outcomes in the ALL literature. The conventional KM method and the Cox PH model convey no information regarding possible competing risks. A competing risk is an event that modifies the chance of the outcome of interest occurring [31]. For example, death from any cause is a competing risk for relapsed disease. Competing risk analysis is therefore considered more appropriate for treatments with a high rate of complications. We observed only one study that used competing risk analysis [11]. We encourage investigators to incorporate competing risk analysis, at least as a sensitivity analysis.
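The difference between a competing risk analysis and the conventional approach can be illustrated with the cumulative incidence function, which weights each relapse by the probability of still being alive and relapse-free, compared with the naive 1 - KM estimate that censors patients at death. This is a minimal sketch; the function names, the cause codes (0 = censored, 1 = relapse, 2 = death without relapse) and the data are hypothetical.

```python
def cumulative_incidence(times, causes, cause_of_interest):
    """Cumulative incidence of one cause in the presence of competing causes."""
    event_free = 1.0  # probability of no event of any cause so far
    incidence = 0.0
    curve = []
    for t in sorted(set(times)):
        n = sum(1 for u in times if u >= t)
        d_any = sum(1 for u, c in zip(times, causes) if u == t and c != 0)
        d_k = sum(1 for u, c in zip(times, causes) if u == t and c == cause_of_interest)
        if d_any > 0:
            incidence += event_free * d_k / n
            event_free *= 1.0 - d_any / n
            curve.append((t, incidence))
    return curve


def naive_one_minus_km(times, causes, cause_of_interest):
    """1 - KM for one cause, treating every other cause as censoring."""
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n = sum(1 for u in times if u >= t)
        d_k = sum(1 for u, c in zip(times, causes) if u == t and c == cause_of_interest)
        if d_k > 0:
            survival *= 1.0 - d_k / n
            curve.append((t, 1.0 - survival))
    return curve


# Hypothetical data: 0 = censored, 1 = relapse, 2 = death without relapse.
times = [1, 2, 3, 4, 5, 6, 7, 8]
causes = [2, 2, 1, 2, 1, 0, 1, 0]

relapse_cif = cumulative_incidence(times, causes, 1)[-1][1]
relapse_naive = naive_one_minus_km(times, causes, 1)[-1][1]
print(f"cumulative incidence of relapse: {relapse_cif:.4f}")
print(f"naive 1 - KM estimate:           {relapse_naive:.4f}")
```

In this made-up example the naive estimate is higher than the cumulative incidence, because censoring at death implicitly assumes that patients who died could still have relapsed later.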
We found that the Cox PH model was commonly used in the articles included in our review. The description of variable selection, the model-fitting strategy and the test for goodness-of-fit were substantially inadequate. As mentioned above, sample size estimation related to regression analysis was noted in only four studies. Of these, two studies were found to be underpowered based on a low event-per-covariate ratio [8]. We strongly encourage authors to describe the process of variable selection and the strategy of model building, and to provide evidence that the sample size is sufficient for regression analysis.
A lack of PH assumption checking may introduce bias to the regression analysis. Our review shows that none of the studies described an assessment of the PH assumption. The Cox PH model assumes that the hazard ratio for comparing any two groups of predictor variables is constant over time [4]. If this assumption is not met, the Cox PH model is not valid for the analysis. We observed that 11 of 14 (79 %) studies had crossing survival curves. A clear violation of the PH assumption occurs if survival curves cross [32,33]. Therefore, a hazard ratio should not be used to compare the treatment effect between groups. We suggest that authors check for the PH assumption if the Cox PH model is incorporated in the analysis. When the PH assumption is violated, authors may consider using an alternative regression analysis, such as the extended Cox PH model or parametric survival analysis (Weibull, exponential, log-logistic or lognormal model).
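One crude way to see whether a constant hazard ratio is plausible is to estimate the hazard ratio separately in an early and a late follow-up period. The sketch below does this with events per unit of person-time. It is an informal illustration only and not a substitute for formal diagnostics such as tests based on Schoenfeld residuals or log-minus-log plots, which it does not implement; the function names, the cut point and the data are hypothetical.

```python
def interval_hazard(times, events, start, stop):
    """Events per unit of person-time in the interval (start, stop]."""
    person_time = sum(min(t, stop) - start for t in times if t > start)
    n_events = sum(1 for t, e in zip(times, events) if e == 1 and start < t <= stop)
    if person_time == 0:
        raise ValueError("no follow-up in this interval")
    return n_events / person_time


def hazard_ratio_by_period(group_a, group_b, cut, end):
    """Hazard ratio (A versus B) before and after the cut point.

    Assumes each group has at least one event in each period.
    """
    ratios = []
    for start, stop in [(0, cut), (cut, end)]:
        h_a = interval_hazard(*group_a, start, stop)
        h_b = interval_hazard(*group_b, start, stop)
        ratios.append(h_a / h_b)
    return ratios


# Hypothetical data in months: group A has mostly early events, group B mostly late ones.
group_a = ([1, 2, 3, 4, 15, 24, 24, 24, 24, 24], [1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
group_b = ([5, 8, 10, 12, 14, 16, 18, 20, 24, 24], [1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

early_hr, late_hr = hazard_ratio_by_period(group_a, group_b, cut=6, end=24)
print(f"hazard ratio, months 0 to 6:  {early_hr:.2f}")
print(f"hazard ratio, months 6 to 24: {late_hr:.2f}")
```

In this made-up example the hazard ratio is above 1 in the early period and below 1 in the late period, so a single hazard ratio from a Cox PH model would average two opposite effects.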

Conclusions
Our systematic review of reporting methods for survival analysis in adult ALL patients undergoing allo-SCT shows significant shortcomings in the use and reporting of survival analysis. Sample size estimation was not routinely described and studies were frequently statistically underpowered. Censoring was often not described. Most notably, crossing survival curves were common and none of the studies checked the PH assumption. Finally, the description of variable selection, fitting procedure and model checking was neglected.
Survival analysis has been used increasingly in medical research studies [6]. We raise awareness of these limitations and encourage authors, reviewers and editors to improve the quality of the use and reporting of survival analysis in the literature.