**Survival analysis** corresponds to a set of statistical methods for investigating the time it takes for an event of interest to occur.

In this chapter, we start by describing how to fit survival curves and how to perform logrank tests comparing the survival time of two or more groups of individuals. We continue by demonstrating how to assess simultaneously the impact of multiple risk factors on the survival time using the Cox regression model. Finally, we describe how to check the validy Cox model assumptions.

We’ll use two R packages for survival data analysis and visualization :

- the
*survival*package for survival analyses, - and the
*survminer*package for ggplot2-based elegant visualization of survival analysis results

For survival analyses, the following function [in survival package] will be used:

*Surv*() to create a survival object*survfit*() to fit survival curves (Kaplan-Meier estimates)*survdiff*() to perform log-rank test comparing survival curves*coxph*() to compute the Cox proportional hazards model

For the visualization, we’ll use the following function available in the survminer package:

*ggsurvplot*() for visualizing survival curves*ggcoxzph*(),*ggcoxdiagnostics*() and*ggcoxfunctional*() for checking the Cox model assumptions.

These two packages can be installed as follow:

```
install.packages("survival")
install.packages("survminer")
```

- Objectives
- Basic concepts
- Survival time and type of events in cancer studies
- Censoring
- Survival and hazard functions
- Kaplan-Meier survival estimate

- Survival analysis in R
- Install and load required R package
- Example data sets
- Compute survival curves: survfit()
- Access to the value returned by survfit()
- Visualize survival curves
- Kaplan-Meier life table: summary of survival curves
- Log-Rank test comparing survival curves: survdiff()
- Fit complex survival curves

Read more –> Survival Analysis Basics: Curves and Logrank Tests

- The need for multivariate statistical modeling
- Basics of the Cox proportional hazards model
- Compute the Cox model in R
- Install and load required R package
- R function to compute the Cox model: coxph()
- Example data sets
- Compute the Cox model
- Visualizing the estimated distribution of survival times

Read more –> Cox Proportional Hazards Model.

- Diagnostics for the Cox model
- Assessing the validy of a Cox model in R
- Installing and loading required R packages
- Computing a Cox model
- Testing proportional Hazards assumption
- Testing influential observations
- Testing non linearity

Read more –> Cox Model Assumptions.

This analysis has been performed using **R software** (ver. 3.3.2).

Previously, we described the basic methods for analyzing survival data, as well as, the Cox proportional hazards methods to deal with the situation where several factors impact on the survival process.

In the current article, we continue the series by describing methods to evaluate the validity of the **Cox model assumptions**.

Note that, when used inappropriately, statistical models may give rise to misleading conclusions. Therefore, it’s important to check that a given model is an appropriate representation of the data.

The Cox proportional hazards model makes sevral assumptions. Thus, it is important to assess whether a fitted Cox regression model adequately describes the data.

Here, we’ll disscuss three types of diagonostics for the Cox model:

- Testing the proportional hazards assumption.
- Examining influential observations (or outliers).
- Detecting nonlinearity in relationship between the log hazard and the covariates.

In order to check these model assumptions, *Residuals* method are used. The common residuals for the Cox model include:

*Schoenfeld residuals*to check the proportional hazards assumption*Martingale residual*to assess nonlinearity*Deviance residual*(symmetric transformation of the Martinguale residuals), to examine influential observations

We’ll use two R packages:

**survival**for computing survival analyses**survminer**for visualizing survival analysis resultsInstall the packages

`install.packages(c("survival", "survminer"))`

- Load the packages

```
library("survival")
library("survminer")
```

We’ll use the lung data sets and the *coxph*() function in the survival package.

To compute a Cox model, type this:

```
library("survival")
res.cox <- coxph(Surv(time, status) ~ age + sex + wt.loss, data = lung)
res.cox
```

```
Call:
coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)
coef exp(coef) se(coef) z p
age 0.02009 1.02029 0.00966 2.08 0.0377
sex -0.52103 0.59391 0.17435 -2.99 0.0028
wt.loss 0.00076 1.00076 0.00619 0.12 0.9024
Likelihood ratio test=14.7 on 3 df, p=0.00212
n= 214, number of events= 152
(14 observations deleted due to missingness)
```

The proportional hazards (PH) assumption can be checked using statistical tests and graphical diagnostics based on the *scaled Schoenfeld residuals*.

In principle, the *Schoenfeld residuals* are independent of time. A plot that shows a non-random pattern against time is evidence of violation of the PH assumption.

The function *cox.zph*() [in the *survival* package] provides a convenient solution to test the proportional hazards assumption for each covariate included in a Cox refression model fit.

For each covariate, the function *cox.zph*() correlates the corresponding set of scaled Schoenfeld residuals with time, to test for independence between residuals and time. Additionally, it performs a global test for the model as a whole.

The proportional hazard assumption is supported by a non-significant relationship between residuals and time, and refuted by a significant relationship.

To illustrate the test, we start by computing a Cox regression model using the lung data set [in survival package]:

```
library("survival")
res.cox <- coxph(Surv(time, status) ~ age + sex + wt.loss, data = lung)
res.cox
```

```
Call:
coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)
coef exp(coef) se(coef) z p
age 0.02009 1.02029 0.00966 2.08 0.0377
sex -0.52103 0.59391 0.17435 -2.99 0.0028
wt.loss 0.00076 1.00076 0.00619 0.12 0.9024
Likelihood ratio test=14.7 on 3 df, p=0.00212
n= 214, number of events= 152
(14 observations deleted due to missingness)
```

To test for the proportional-hazards (PH) assumption, type this:

```
test.ph <- cox.zph(res.cox)
test.ph
```

```
rho chisq p
age -0.0483 0.378 0.538
sex 0.1265 2.349 0.125
wt.loss 0.0126 0.024 0.877
GLOBAL NA 2.846 0.416
```

From the output above, the test is not statistically significant for each of the covariates, and the global test is also not statistically significant. Therefore, we can assume the proportional hazards.

It’s possible to do a graphical diagnostic using the function *ggcoxzph*() [in the *survminer* package], which produces, for each covariate, graphs of the scaled Schoenfeld residuals against the transformed time.

`ggcoxzph(test.ph)`

In the figure above, the solid line is a smoothing spline fit to the plot, with the dashed lines representing a +/- 2-standard-error band around the fit.

Note that, systematic departures from a horizontal line are indicative of non-proportional hazards, since proportional hazards assumes that estimates \(\beta_1, \beta_2, \beta_3\) do not vary much over time.

From the graphical inspection, there is no pattern with time. The assumption of proportional hazards appears to be supported for the covariates sex (which is, recall, a two-level factor, accounting for the two bands in the graph), wt.loss and age.

Another graphical methods for checking proportional hazards is to plot log(-log(S(t))) vs. t or log(t) and look for parallelism. This can be done only for categorical covariates.

A violations of proportional hazards assumption can be resolved by:

- Adding covariate*time interaction
- Stratification

Stratification is usefull for “nuisance” confounders, where you do not care to estimate the effect. You cannot examine the effects of the stratification variable (John Fox & Sanford Weisberg).

To read more about how to accomodate with non-proportional hazards, read the following articles:

- Jadwiga Borucka, PAREXEL, Warsaw, Poland. Extensions of cox model for non-proportional hazards purpose. 2013.
- John Fox & Sanford Weisberg. Cox Proportional-Hazards Regression for Survival Data in R.
- Max Gordon. Dealing with non-proportional hazards in R. March 29, 2016.

To test influential observations or outliers, we can visualize either:

- the
*deviance residuals*or - the
*dfbeta*values

The function *ggcoxdiagnostics*()[in *survminer* package] provides a convenient solution for checkind influential observations. The simplified format is as follow:

`ggcoxdiagnostics(fit, type = , linear.predictions = TRUE)`

- fit: an object of class coxph.object
- type: the type of residuals to present on Y axis. Allowed values include one of c(“martingale”, “deviance”, “score”, “schoenfeld”, “dfbeta”, “dfbetas”, “scaledsch”, “partial”).
- linear.predictions: a logical value indicating whether to show linear predictions for observations (TRUE) or just indexed of observations (FALSE) on X axis.

Specifying the argument *type = “dfbeta”*, plots the estimated changes in the regression coefficients upon deleting each observation in turn; likewise, *type=“dfbetas”* produces the estimated changes in the coefficients divided by their standard errors.

For example:

```
ggcoxdiagnostics(res.cox, type = "dfbeta",
linear.predictions = FALSE, ggtheme = theme_bw())
```

(Index plots of dfbeta for the Cox regression of time to death on age, sex and wt.loss)

The above index plots show that comparing the magnitudes of the largest dfbeta values to the regression coefficients suggests that none of the observations is terribly influential individually, even though some of the dfbeta values for age and wt.loss are large compared with the others.

It’s also possible to check outliers by visualizing the deviance residuals. The deviance residual is a normalized transform of the martingale residual. These residuals should be roughtly symmetrically distributed about zero with a standard deviation of 1.

- Positive values correspond to individuals that “died too soon” compared to expected survival times.
- Negative values correspond to individual that “lived too long”.
- Very large or small values are outliers, which are poorly predicted by the model.

Example of deviance residuals:

```
ggcoxdiagnostics(res.cox, type = "deviance",
linear.predictions = FALSE, ggtheme = theme_bw())
```

The pattern looks fairly symmetric around 0.

Often, we assume that continuous covariates have a linear form. However, this assumption should be checked.

Plotting the *Martingale residuals* against continuous covariates is a common approach used to detect *nonlinearity* or, in other words, to assess the functional form of a covariate. For a given continuous covariate, patterns in the plot may suggest that the variable is not properly fit.

Nonlinearity is not an issue for categorical variables, so we only examine plots of martingale residuals and partial residuals against a continuous variable.

Martingale residuals may present any value in the range (-INF, +1):

- A value of martinguale residuals near 1 represents individuals that “died too soon”,
- and large negative values correspond to individuals that “lived too long”.

To assess the functional form of a continuous variable in a Cox proportional hazards model, we’ll use the function *ggcoxfunctional*() [in the *survminer* R package].

The function *ggcoxfunctional*() displays graphs of continuous covariates against martingale residuals of null cox proportional hazards model. This might help to properly choose the functional form of continuous variable in the Cox model. Fitted lines with lowess function should be linear to satisfy the Cox proportional hazards model assumptions.

For example, to assess the functional forme of age, type this:

`ggcoxfunctional(Surv(time, status) ~ age + log(age) + sqrt(age), data = lung)`

It appears that, nonlinearity is slightly here.

We described how to assess the valididy of the Cox model assumptions using the survival and survminer packages.

This analysis has been performed using **R software** (ver. 3.3.2).

The **Cox proportional-hazards model** (Cox, 1972) is essentially a regression model commonly used statistical in medical research for investigating the association between the survival time of patients and one or more predictor variables.

In the previous chapter (survival analysis basics), we described the basic concepts of survival analyses and methods for analyzing and summarizing survival data, including:

- the definition of hazard and survival functions,
- the construction of Kaplan-Meier survival curves for different patient groups
- the logrank test for comparing two or more survival curves

The above mentioned methods - Kaplan-Meier curves and logrank tests - are examples of *univariate analysis*. They describe the survival according to one factor under investigation, but ignore the impact of any others.

Additionally, Kaplan-Meier curves and logrank tests are useful only when the predictor variable is categorical (e.g.: treatment A vs treatment B; males vs females). They don’t work easily for quantitative predictors such as gene expression, weight, or age.

An alternative method is the Cox proportional hazards regression analysis, which works for both quantitative predictor variables and for categorical variables. Furthermore, the Cox regression model extends survival analysis methods to assess simultaneously the effect of several risk factors on survival time.

In this article, we’ll describe the Cox regression model and provide practical examples using R software.

In clinical investigations, there are many situations, where several known quantities (known as *covariates*), potentially affect patient prognosis.

For instance, suppose two groups of patients are compared: those with and those without a specific genotype. If one of the groups also contains older individuals, any difference in survival may be attributable to genotype or age or indeed both. Hence, when investigating survival in relation to any one factor, it is often desirable to adjust for the impact of others.

Statistical model is a frequently used tool that allows to analyze survival with respect to several factors simultaneously. Additionally, statistical model provides the effect size for each factor.

The cox proportional-hazards model is one of the most important methods used for modelling survival analysis data. The next section introduces the basics of the Cox regression model.

The purpose of the model is to evaluate simultaneously the effect of several factors on survival. In other words, it allows us to examine how specified factors influence the rate of a particular event happening (e.g., infection, death) at a particular point in time. This rate is commonly referred as the hazard rate. Predictor variables (or factors) are usually termed *covariates* in the survival-analysis literature.

The Cox model is expressed by the *hazard function* denoted by h(t). Briefly, the hazard function can be interpreted as the risk of dying at time t. It can be estimated as follow:

\[ h(t) = h_0(t) \times exp(b_1x_1 + b_2x_2 + ... + b_px_p) \]

where,

*t*represents the survival time- \(h(t)\) is the hazard function determined by a set of p covariates (\(x_1, x_2, ..., x_p\))
- the coefficients (\(b_1, b_2, ..., b_p\)) measure the impact (i.e., the effect size) of covariates.
- the term \(h_0\) is called the baseline hazard. It corresponds to the value of the hazard if all the \(x_i\) are equal to zero (the quantity exp(0) equals 1). The ‘t’ in h(t) reminds us that the hazard may vary over time.

The Cox model can be written as a multiple linear regression of the logarithm of the hazard on the variables \(x_i\), with the baseline hazard being an ‘intercept’ term that varies with time.

The quantities \(exp(b_i)\) are called hazard ratios (HR). A value of \(b_i\) greater than zero, or equivalently a hazard ratio greater than one, indicates that as the value of the \(i^{th}\) covariate increases, the event hazard increases and thus the length of survival decreases.

Put another way, a hazard ratio above 1 indicates a covariate that is positively associated with the event probability, and thus negatively associated with the length of survival.

In summary,

- HR = 1: No effect
- HR < 1: Reduction in the hazard
- HR > 1: Increase in Hazard

Note that in cancer studies:

- A covariate with hazard ratio > 1 (i.e.: b > 0) is called bad prognostic factor
- A covariate with hazard ratio < 1 (i.e.: b < 0) is called good prognostic factor

A key assumption of the Cox model is that the hazard curves for the groups of observations (or patients) should be proportional and cannot cross.

Consider two patients k and k’ that differ in their x-values. The corresponding hazard function can be simply written as follow

- Hazard function for the patient k:

\[ h_k(t) = h_0(t)e^{\sum\limits_{i=1}^n{\beta x}} \]

- Hazard function for the patient k’:

\[ h_{k'}(t) = h_0(t)e^{\sum\limits_{i=1}^n{\beta x'}} \]

- The hazard ratio for these two patients [\(\frac{h_k(t)}{h_{k'}(t)} = \frac{h_0(t)e^{\sum\limits_{i=1}^n{\beta x}}}{h_0(t)e^{\sum\limits_{i=1}^n{\beta x'}}} = \frac{e^{\sum\limits_{i=1}^n{\beta x}}}{e^{\sum\limits_{i=1}^n{\beta x'}}}\)] is independent of time t.

Consequently, the Cox model is a *proportional-hazards model*: the hazard of the event in any group is a constant multiple of the hazard in any other. This assumption implies that, as mentioned above, the hazard curves for the groups should be proportional and cannot cross.

In other words, if an individual has a risk of death at some initial time point that is twice as high as that of another individual, then at all later times the risk of death remains twice as high.

This assumption of proportional hazards should be tested. We’ll discuss methods for assessing proportionality in the next article in this series: Cox Model Assumptions.

We’ll use two R packages:

**survival**for computing survival analyses**survminer**for visualizing survival analysis resultsInstall the packages

`install.packages(c("survival", "survminer"))`

- Load the packages

```
library("survival")
library("survminer")
```

The function *coxph*()[in *survival* package] can be used to compute the Cox proportional hazards regression model in R.

The simplified format is as follow:

`coxph(formula, data, method)`

- formula: is linear model with a survival object as the response variable. Survival object is created using the function
*Surv*() as follow:*Surv(time, event)*. - data: a data frame containing the variables
- method: is used to specify how to handle ties. The default is ‘efron’. Other options are ‘breslow’ and ‘exact’. The default ‘efron’ is generally preferred to the once-popular “breslow” method. The “exact” method is much more computationally intensive.

We’ll use the lung cancer data in the survival R package.

```
data("lung")
head(lung)
```

```
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1 3 306 2 74 1 1 90 100 1175 NA
2 3 455 2 68 1 0 90 90 1225 15
3 3 1010 1 56 1 0 90 90 NA 15
4 5 210 2 57 1 1 90 60 1150 11
5 1 883 2 60 1 0 100 90 NA 0
6 12 1022 1 74 1 1 50 80 513 0
```

- inst: Institution code
- time: Survival time in days
- status: censoring status 1=censored, 2=dead
- age: Age in years
- sex: Male=1 Female=2
- ph.ecog: ECOG performance score (0=good 5=dead)
- ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician
- pat.karno: Karnofsky performance score as rated by patient
- meal.cal: Calories consumed at meals
- wt.loss: Weight loss in last six months

We’ll fit the Cox regression using the following covariates: age, sex, ph.ecog and wt.loss.

We start by computing univariate Cox analyses for all these variables; then we’ll fit multivariate cox analyses using two variables to describe how the factors jointly impact on survival.

Univariate Cox analyses can be computed as follow:

```
res.cox <- coxph(Surv(time, status) ~ sex, data = lung)
res.cox
```

```
Call:
coxph(formula = Surv(time, status) ~ sex, data = lung)
coef exp(coef) se(coef) z p
sex -0.531 0.588 0.167 -3.18 0.0015
Likelihood ratio test=10.6 on 1 df, p=0.00111
n= 228, number of events= 165
```

The function *summary*() for Cox models produces a more complete report:

`summary(res.cox)`

```
Call:
coxph(formula = Surv(time, status) ~ sex, data = lung)
n= 228, number of events= 165
coef exp(coef) se(coef) z Pr(>|z|)
sex -0.5310 0.5880 0.1672 -3.176 0.00149 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
sex 0.588 1.701 0.4237 0.816
Concordance= 0.579 (se = 0.022 )
Rsquare= 0.046 (max possible= 0.999 )
Likelihood ratio test= 10.63 on 1 df, p=0.001111
Wald test = 10.09 on 1 df, p=0.001491
Score (logrank) test = 10.33 on 1 df, p=0.001312
```

The Cox regression results can be interpreted as follow:

*Statistical significance*. The column marked “z” gives the Wald statistic value. It corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The wald statistic evaluates, whether the beta (\(\beta\)) coefficient of a given variable is statistically significantly different from 0. From the output above, we can conclude that the variable sex have highly statistically significant coefficients.*The regression coefficients*. The second feature to note in the Cox model results is the the sign of the regression coefficients (coef). A positive sign means that the hazard (risk of death) is higher, and thus the prognosis worse, for subjects with higher values of that variable. The variable sex is encoded as a numeric vector. 1: male, 2: female. The R summary for the Cox model gives the hazard ratio (HR) for the second group relative to the first group, that is, female versus male. The beta coefficient for sex = -0.53 indicates that females have lower risk of death (lower survival rates) than males, in these data.*Hazard ratios*. The exponentiated coefficients (exp(coef) = exp(-0.53) = 0.59), also known as*hazard ratios*, give the effect size of covariates. For example, being female (sex=2) reduces the hazard by a factor of 0.59, or 41%. Being female is associated with good prognostic.*Confidence intervals of the hazard ratios*. The summary output also gives upper and lower 95% confidence intervals for the hazard ratio (exp(coef)), lower 95% bound = 0.4237, upper 95% bound = 0.816.*Global statistical significance of the model*. Finally, the output gives p-values for three alternative tests for overall significance of the model: The likelihood-ratio test, Wald test, and score logrank statistics. These three methods are asymptotically equivalent. For large enough N, they will give similar results. For small N, they may differ somewhat. The Likelihood ratio test has better behavior for small sample sizes, so it is generally preferred.

To apply the univariate coxph function to multiple covariates at once, type this:

```
covariates <- c("age", "sex", "ph.karno", "ph.ecog", "wt.loss")
univ_formulas <- sapply(covariates,
function(x) as.formula(paste('Surv(time, status)~', x)))
univ_models <- lapply( univ_formulas, function(x){coxph(x, data = lung)})
# Extract data
univ_results <- lapply(univ_models,
function(x){
x <- summary(x)
p.value<-signif(x$wald["pvalue"], digits=2)
wald.test<-signif(x$wald["test"], digits=2)
beta<-signif(x$coef[1], digits=2);#coeficient beta
HR <-signif(x$coef[2], digits=2);#exp(beta)
HR.confint.lower <- signif(x$conf.int[,"lower .95"], 2)
HR.confint.upper <- signif(x$conf.int[,"upper .95"],2)
HR <- paste0(HR, " (",
HR.confint.lower, "-", HR.confint.upper, ")")
res<-c(beta, HR, wald.test, p.value)
names(res)<-c("beta", "HR (95% CI for HR)", "wald.test",
"p.value")
return(res)
#return(exp(cbind(coef(x),confint(x))))
})
res <- t(as.data.frame(univ_results, check.names = FALSE))
as.data.frame(res)
```

```
beta HR (95% CI for HR) wald.test p.value
age 0.019 1 (1-1) 4.1 0.042
sex -0.53 0.59 (0.42-0.82) 10 0.0015
ph.karno -0.016 0.98 (0.97-1) 7.9 0.005
ph.ecog 0.48 1.6 (1.3-2) 18 2.7e-05
wt.loss 0.0013 1 (0.99-1) 0.05 0.83
```

The output above shows the regression beta coefficients, the effect sizes (given as hazard ratios) and statistical significance for each of the variables in relation to overall survival. Each factor is assessed through separate univariate Cox regressions.

From the output above,

The variables sex, age and ph.ecog have highly statistically significant coefficients, while the coefficient for ph.karno is not significant.

age and ph.ecog have positive beta coefficients, while sex has a negative coefficient. Thus, older age and higher ph.ecog are associated with poorer survival, whereas being female (sex=2) is associated with better survival.

Now, we want to describe how the factors jointly impact on survival. To answer to this question, we’ll perform a multivariate Cox regression analysis. As the variable ph.karno is not significant in the univariate Cox analysis, we’ll skip it in the multivariate analysis. We’ll include the 3 factors (sex, age and ph.ecog) into the multivariate model.

A Cox regression of time to death on the time-constant covariates is specified as follow:

```
res.cox <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)
summary(res.cox)
```

```
Call:
coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung)
n= 227, number of events= 164
(1 observation deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.011067 1.011128 0.009267 1.194 0.232416
sex -0.552612 0.575445 0.167739 -3.294 0.000986 ***
ph.ecog 0.463728 1.589991 0.113577 4.083 4.45e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0111 0.9890 0.9929 1.0297
sex 0.5754 1.7378 0.4142 0.7994
ph.ecog 1.5900 0.6289 1.2727 1.9864
Concordance= 0.637 (se = 0.026 )
Rsquare= 0.126 (max possible= 0.999 )
Likelihood ratio test= 30.5 on 3 df, p=1.083e-06
Wald test = 29.93 on 3 df, p=1.428e-06
Score (logrank) test = 30.5 on 3 df, p=1.083e-06
```

The p-value for all three overall tests (likelihood, Wald, and score) are significant, indicating that the model is significant. These tests evaluate the omnibus null hypothesis that all of the betas (\(\beta\)) are 0. In the above example, the test statistics are in close agreement, and the omnibus null hypothesis is soundly rejected.

In the multivariate Cox analysis, the covariates sex and ph.ecog remain significant (p < 0.05). However, the covariate age fails to be significant (p = 0.23, which is grater than 0.05).

The p-value for sex is 0.000986, with a hazard ratio HR = exp(coef) = 0.58, indicating a strong relationship between the patients’ sex and decreased risk of death. The hazard ratios of covariates are interpretable as multiplicative effects on the hazard. For example, holding the other covariates constant, being female (sex=2) reduces the hazard by a factor of 0.58, or 42%. We conclude that, being female is associated with good prognostic.

Similarly, the p-value for ph.ecog is 4.45e-05, with a hazard ratio HR = 1.59, indicating a strong relationship between the ph.ecog value and increased risk of death. Holding the other covariates constant, a higher value of ph.ecog is associated with a poor survival.

By contrast, the p-value for age is now p=0.23. The hazard ratio HR = exp(coef) = 1.01, with a 95% confidence interval of 0.99 to 1.03. Because the confidence interval for HR includes 1, these results indicate that age makes a smaller contribution to the difference in the HR after adjusting for the ph.ecog values and patient’s sex, and only trend toward significance. For example, holding the other covariates constant, an additional year of age induce daily hazard of death by a factor of exp(beta) = 1.01, or 1%, which is not a significant contribution.

Having fit a Cox model to the data, it’s possible to visualize the predicted survival proportion at any given point in time for a particular risk group. The function *survfit*() estimates the survival proportion, by default at the mean values of covariates.

```
# Plot the baseline survival function
ggsurvplot(survfit(res.cox), color = "#2E9FDF",
ggtheme = theme_minimal())
```

We may wish to display how estimated survival depends upon the value of a covariate of interest.

Consider that, we want to assess the impact of the sex on the estimated survival probability. In this case, we construct a new data frame with two rows, one for each value of sex; the other covariates are fixed to their average values (if they are continuous variables) or to their lowest level (if they are discrete variables). For a dummy covariate, the average value is the proportion coded 1 in the data set. This data frame is passed to *survfit*() via the *newdata* argument:

```
# Create the new data
sex_df <- with(lung,
data.frame(sex = c(1, 2),
age = rep(mean(age, na.rm = TRUE), 2),
ph.ecog = c(1, 1)
)
)
sex_df
```

```
sex age ph.ecog
1 1 62.44737 1
2 2 62.44737 1
```

```
# Survival curves
fit <- survfit(res.cox, newdata = sex_df)
ggsurvplot(fit, conf.int = TRUE, legend.labs=c("Sex=1", "Sex=2"),
ggtheme = theme_minimal())
```

In this article, we described the Cox regression model for assessing simultaneously the relationship between multiple risk factors and patient’s survival time. We demonstrated how to compute the Cox model using the *survival* package. Additionally, we described how to visualize the results of the analysis using the *survminer* package.

- Cox DR (1972). Regression models and life tables (with discussion). J R Statist Soc B 34: 187–220
- MJ Bradburn, TG Clark, SB Love and DG Altman. Survival Analysis Part II: Multivariate data analysis – an introduction to concepts and methods. British Journal of Cancer (2003) 89, 431 – 436

This analysis has been performed using **R software** (ver. 3.3.2).

**Survival analysis** corresponds to a set of statistical approaches used to investigate the time it takes for an event of interest to occur.

**Survival analysis** is used in a variety of field such as:

*Cancer studies*for patients survival time analyses,*Sociology*for “event-history analysis”,- and in
*engineering*for “failure-time analysis”.

In cancer studies, typical research questions are like:

- What is the impact of certain clinical characteristics on patient’s survival
- What is the probability that an individual survives 3 years?
- Are there differences in survival between groups of patients?

The aim of this chapter is to describe the basic concepts of survival analysis. In cancer studies, most of survival analyses use the following methods:

*Kaplan-Meier plots*to visualize survival curves*Log-rank test*to compare the survival curves of two or more groups*Cox proportional hazards regression*to describe the effect of variables on survival. The Cox model is discussed in the next chapter: Cox proportional hazards model.

Here, we’ll start by explaining the essential concepts of survival analysis, including:

- how to generate and interpret survival curves,
- and how to quantify and test survival differences between two or more groups of patients.

Then, we’ll continue by describing multivariate analysis using Cox proportional hazards model.

Here, we start by defining fundamental terms of survival analysis including:

- Survival time and event
- Censoring
- Survival function and hazard function

There are different types of events, including:

- Relapse
- Progression
- Death

The time from ‘response to treatment’ (complete remission) to the occurrence of the event of interest is commonly called *survival time* (or time to event).

The two most important measures in cancer studies include: i) the *time to death*; and ii) the *relapse-free survival time*, which corresponds to the time between response to treatment and recurrence of the disease. It’s also known as *disease-free survival time* and *event-free survival time*.

As mentioned above, survival analysis focuses on the expected duration of time until occurrence of an event of interest (relapse or death). However, the event may not be observed for some individuals within the study time period, producing the so-called *censored* observations.

Censoring may arise in the following ways:

- a patient has not (yet) experienced the event of interest, such as relapse or death, within the study time period;
- a patient is lost to follow-up during the study period;
- a patient experiences a different event that makes further follow-up impossible.

This type of censoring, named *right censoring*, is handled in survival analysis.

Two related probabilities are used to describe survival data: the *survival probability* and the *hazard probability*.

The *survival probability*, also known as the survivor function \(S(t)\), is the probability that an individual survives from the time origin (e.g. diagnosis of cancer) to a specified future time t.

The *hazard*, denoted by \(h(t)\), is the probability that an individual who is under observation at a time t has an event at that time.

Note that, in contrast to the survivor function, which focuses on not having an event, the hazard function focuses on the event occurring.

The Kaplan-Meier (KM) method is a non-parametric method used to estimate the survival probability from observed survival times (Kaplan and Meier, 1958).

The survival probability at time \(t_i\), \(S(t_i)\), is calculated as follow:

\[S(t_i) = S(t_{i-1})(1-\frac{d_i}{n_i})\]

Where,

- \(S(t_{i-1})\) = the probability of being alive at \(t_{i-1}\)
- \(n_i\) = the number of patients alive just before \(t_i\)
- \(d_i\) = the number of events at \(t_i\)
- \(t_0\) = 0, \(S(0)\) = 1

The estimated probability (\(S(t)\)) is a step function that changes value only at the time of each event. It’s also possible to compute confidence intervals for the survival probability.

The KM survival curve, a plot of the KM survival probability against time, provides a useful summary of the data that can be used to estimate measures such as median survival time.

We’ll use two R packages:

*survival*for computing survival analyses*survminer*for summarizing and visualizing the results of survival analysisInstall the packages

`install.packages(c("survival", "survminer"))`

- Load the packages

```
library("survival")
library("survminer")
```

We’ll use the lung cancer data available in the survival package.

```
data("lung")
head(lung)
```

```
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1 3 306 2 74 1 1 90 100 1175 NA
2 3 455 2 68 1 0 90 90 1225 15
3 3 1010 1 56 1 0 90 90 NA 15
4 5 210 2 57 1 1 90 60 1150 11
5 1 883 2 60 1 0 100 90 NA 0
6 12 1022 1 74 1 1 50 80 513 0
```

- inst: Institution code
- time: Survival time in days
- status: censoring status 1=censored, 2=dead
- age: Age in years
- sex: Male=1 Female=2
- ph.ecog: ECOG performance score (0=good 5=dead)
- ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician
- pat.karno: Karnofsky performance score as rated by patient
- meal.cal: Calories consumed at meals
- wt.loss: Weight loss in last six months

We want to compute the survival probability by sex.

The function *survfit*() [in *survival* package] can be used to compute kaplan-Meier survival estimate. Its main arguments include:

- a survival object created using the function
*Surv*() - and the data set containing the variables.

To compute survival curves, type this:

```
fit <- survfit(Surv(time, status) ~ sex, data = lung)
print(fit)
```

```
Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
n events median 0.95LCL 0.95UCL
sex=1 138 112 270 212 310
sex=2 90 53 426 348 550
```

By default, the function print() shows a short summary of the survival curves. It prints the number of observations, number of events, the median survival and the confidence limits for the median.

If you want to display a more complete summary of the survival curves, type this:

```
# Summary of survival curves
summary(fit)
# Access to the sort summary table
summary(fit)$table
```

The function *survfit*() returns a list of variables, including the following components:

- n: total number of subjects in each curve.
- time: the time points on the curve.
- n.risk: the number of subjects at risk at time t
- n.event: the number of events that occurred at time t.
- n.censor: the number of censored subjects, who exit the risk set, without an event, at time t.
- lower,upper: lower and upper confidence limits for the curve, respectively.
- strata: indicates stratification of curve estimation. If strata is not NULL, there are multiple curves in the result. The levels of strata (a factor) are the labels for the curves.

The components can be accessed as follow:

```
d <- data.frame(time = fit$time,
n.risk = fit$n.risk,
n.event = fit$n.event,
n.censor = fit$n.censor,
surv = fit$surv,
upper = fit$upper,
lower = fit$lower
)
head(d)
```

```
time n.risk n.event n.censor surv upper lower
1 11 138 3 0 0.9782609 1.0000000 0.9542301
2 12 135 1 0 0.9710145 0.9994124 0.9434235
3 13 134 2 0 0.9565217 0.9911586 0.9230952
4 15 132 1 0 0.9492754 0.9866017 0.9133612
5 26 131 1 0 0.9420290 0.9818365 0.9038355
6 30 130 1 0 0.9347826 0.9768989 0.8944820
```

We’ll use the function *ggsurvplot*() [in *Survminer* R package] to produce the survival curves for the two groups of subjects.

It’s also possible to show:

- the 95%
*confidence limits*of the survivor function using the argument*conf.int = TRUE*. - the number and/or the percentage of
*individuals at risk*by time using the option*risk.table*. Allowed values for*risk.table*include:- TRUE or FALSE specifying whether to show or not the risk table. Default is FALSE.
- “absolute” or “percentage”: to show the
*absolute number*and the*percentage*of subjects at risk by time, respectively. Use “abs_pct” to show both absolute number and percentage.

- the
*p-value*of the Log-Rank test comparing the groups using*pval = TRUE*. - horizontal/vertical line at
*median survival*using the argument*surv.median.line*. Allowed values include one of c(“none”, “hv”, “h”, “v”). v: vertical, h:horizontal.

```
# Change color, linetype by strata, risk.table color by strata
ggsurvplot(fit,
pval = TRUE, conf.int = TRUE,
risk.table = TRUE, # Add risk table
risk.table.col = "strata", # Change risk table color by groups
linetype = "strata", # Change line type by groups
surv.median.line = "hv", # Specify median survival
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF"))
```

The plot can be further customized using the following arguments:

*conf.int.style = “step”*to change the style of confidence interval bands.*xlab*to change the x axis label.*break.time.by = 200*break x axis in time intervals by 200.*risk.table = “abs_pct”*to show both absolute number and percentage of individuals at risk.*risk.table.y.text.col = TRUE*and*risk.table.y.text = FALSE*to provide bars instead of names in text annotations of the legend of risk table.*ncensor.plot = TRUE*to plot the number of censored subjects at time t. As suggested by Marcin Kosinski, This is a good additional feedback to survival curves, so that one could realize: how do survival curves look like, what is the number of risk set AND what is the cause that the risk set become smaller: is it caused by events or by censored events?*legend.labs*to change the legend labels.

```
ggsurvplot(
fit, # survfit object with calculated statistics.
pval = TRUE, # show p-value of log-rank test.
conf.int = TRUE, # show confidence intervals for
# point estimaes of survival curves.
conf.int.style = "step", # customize style of confidence intervals
xlab = "Time in days", # customize X axis label.
break.time.by = 200, # break X axis in time intervals by 200.
ggtheme = theme_light(), # customize plot and risk table with a theme.
risk.table = "abs_pct", # absolute number and percentage at risk.
risk.table.y.text.col = T,# colour risk table text annotations.
risk.table.y.text = FALSE,# show bars instead of names in text annotations
# in legend of risk table.
ncensor.plot = TRUE, # plot the number of censored subjects at time t
surv.median.line = "hv", # add the median survival pointer.
legend.labs =
c("Male", "Female"), # change legend labels.
palette =
c("#E7B800", "#2E9FDF") # custom color palettes.
)
```

The Kaplan-Meier plot can be interpreted as follow:

The horizontal axis (x-axis) represents time in days, and the vertical axis (y-axis) shows the probability of surviving or the proportion of people surviving. The lines represent survival curves of the two groups. A vertical drop in the curves indicates an event. The vertical tick mark on the curves means that a patient was censored at this time.

- At time zero, the survival probability is 1.0 (or 100% of the participants are alive).
- At time 250, the probability of survival is approximately 0.55 (or 55%) for sex=1 and 0.75 (or 75%) for sex=2.
- The median survival is approximately 270 days for sex=1 and 426 days for sex=2, suggesting a good survival for sex=2 compared to sex=1

The median survival times for each group can be obtained using the code below:

`summary(fit)$table`

```
records n.max n.start events *rmean *se(rmean) median 0.95LCL 0.95UCL
sex=1 138 138 138 112 325.0663 22.59845 270 212 310
sex=2 90 90 90 53 458.2757 33.78530 426 348 550
```

The median survival times for each group represent the time at which the survival probability, S(t), is 0.5.

The median survival time for sex=1 (Male group) is 270 days, as opposed to 426 days for sex=2 (Female). There appears to be a survival advantage for female with lung cancer compare to male. However, to evaluate whether this difference is statistically significant requires a formal statistical test, a subject that is discussed in the next sections.

Note that, the confidence limits are wide at the tail of the curves, making meaningful interpretations difficult. This can be explained by the fact that, in practice, there are usually patients who are lost to follow-up or alive at the end of follow-up. Thus, it may be sensible to shorten plots before the end of follow-up on the x-axis (Pocock et al, 2002).

The survival curves can be shorten using the argument *xlim* as follow:

```
ggsurvplot(fit,
conf.int = TRUE,
risk.table.col = "strata", # Change risk table color by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF"),
xlim = c(0, 600))
```

Note that, three often used transformations can be specified using the argument *fun*:

- “log”: log transformation of the survivor function,
- “event”: plots cumulative events (f(y) = 1-y). It’s also known as the cumulative incidence,
- “cumhaz” plots the cumulative hazard function (f(y) = -log(y))

For example, to plot cumulative events, type this:

```
ggsurvplot(fit,
conf.int = TRUE,
risk.table.col = "strata", # Change risk table color by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF"),
fun = "event")
```

The **cummulative hazard** is commonly used to estimate the hazard probability. It’s defined as \(H(t) = -log(survival function) = -log(S(t))\). The cumulative hazard (\(H(t)\)) can be interpreted as the cumulative force of mortality. In other words, it corresponds to the number of events that would be expected for each individual by time t if the event were a repeatable process.

To plot cumulative hazard, type this:

```
ggsurvplot(fit,
conf.int = TRUE,
risk.table.col = "strata", # Change risk table color by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF"),
fun = "cumhaz")
```

As mentioned above, you can use the function *summary*() to have a complete summary of survival curves:

`summary(fit)`

It’s also possible to use the function *surv_summary*() [in *survminer* package] to get a summary of survival curves. Compared to the default summary() function, surv_summary() creates a data frame containing a nice summary from survfit results.

```
res.sum <- surv_summary(fit)
head(res.sum)
```

```
time n.risk n.event n.censor surv std.err upper lower strata sex
1 11 138 3 0 0.9782609 0.01268978 1.0000000 0.9542301 sex=1 1
2 12 135 1 0 0.9710145 0.01470747 0.9994124 0.9434235 sex=1 1
3 13 134 2 0 0.9565217 0.01814885 0.9911586 0.9230952 sex=1 1
4 15 132 1 0 0.9492754 0.01967768 0.9866017 0.9133612 sex=1 1
5 26 131 1 0 0.9420290 0.02111708 0.9818365 0.9038355 sex=1 1
6 30 130 1 0 0.9347826 0.02248469 0.9768989 0.8944820 sex=1 1
```

The function *surv_summary*() returns a data frame with the following columns:

- time: the time points at which the curve has a step.
- n.risk: the number of subjects at risk at t.
- n.event: the number of events that occur at time t.
- n.censor: number of censored events.
- surv: estimate of survival probability.
- std.err: standard error of survival.
- upper: upper end of confidence interval
- lower: lower end of confidence interval
- strata: indicates stratification of curve estimation. The levels of strata (a factor) are the labels for the curves.

In a situation, where survival curves have been fitted with one or more variables, surv_summary object contains extra columns representing the variables. This makes it possible to facet the output of ggsurvplot by strata or by some combinations of factors.

*surv_summary* object has also an attribute named ‘table’ containing information about the survival curves, including medians of survival with confidence intervals, as well as, the total number of subjects and the number of event in each curve. To get access to the attribute ‘table’, type this:

`attr(res.sum, "table")`

The *log-rank test* is the most widely used method of comparing two or more survival curves. The null hypothesis is that there is no difference in survival between the two groups. The log rank test is a non-parametric test, which makes no assumptions about the survival distributions. Essentially, the log rank test compares the observed number of events in each group to what would be expected if the null hypothesis were true (i.e., if the survival curves were identical). The log rank statistic is approximately distributed as a chi-square test statistic.

The function *survdiff*() [in *survival* package] can be used to compute *log-rank test* comparing two or more survival curves.

*survdiff*() can be used as follow:

```
surv_diff <- survdiff(Surv(time, status) ~ sex, data = lung)
surv_diff
```

```
Call:
survdiff(formula = Surv(time, status) ~ sex, data = lung)
N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 138 112 91.6 4.55 10.3
sex=2 90 53 73.4 5.68 10.3
Chisq= 10.3 on 1 degrees of freedom, p= 0.00131
```

The function returns a list of components, including:

- n: the number of subjects in each group.
- obs: the weighted observed number of events in each group.
- exp: the weighted expected number of events in each group.
- chisq: the chisquare statistic for a test of equality.
- strata: optionally, the number of subjects contained in each stratum.

The log rank test for difference in survival gives a p-value of p = 0.0013, indicating that the sex groups differ significantly in survival.

In this section, we’ll compute survival curves using the combination of multiple factors. Next, we’ll facet the output of ggsurvplot() by a combination of factors

- Fit (complex) survival curves using colon data sets

```
require("survival")
fit2 <- survfit( Surv(time, status) ~ sex + rx + adhere,
data = colon )
```

- Visualize the output using survminer. The plot below shows survival curves by the sex variable faceted according to the values of rx & adhere.

```
# Plot survival curves by sex and facet by rx and adhere
ggsurv <- ggsurvplot(fit2, fun = "event", conf.int = TRUE,
ggtheme = theme_bw())
ggsurv$plot +theme_bw() +
theme (legend.position = "right")+
facet_grid(rx ~ adhere)
```

Survival analysis is a set of statistical approaches for data analysis where the outcome variable of interest is time until an event occurs.

Survival data are generally described and modeled in terms of two related functions:

the survivor function representing the probability that an individual survives from the time of origin to some time beyond time t. It’s usually estimated by the Kaplan-Meier method. The logrank test may be used to test for differences between survival curves for groups, such as treatment arms.

The hazard function gives the instantaneous potential of having an event at a time, given survival up to that time. It is used primarily as a diagnostic tool or for specifying a mathematical model for survival analysis.

In this article, we demonstrate how to perform and visualize survival analyses using the combination of two R packages: *survival* (for the analysis) and *survminer* (for the visualization).

- Clark TG, Bradburn MJ, Love SB and Altman DG. Survival Analysis Part I: Basic concepts and first analyses. British Journal of Cancer (2003) 89, 232 – 238
- Kaplan EL, Meier P (1958) Nonparametric estimation from incomplete observations. J Am Stat Assoc 53: 457–481.
- Pocock S, Clayton TC, Altman DG (2002) Survival plots of time-to-event outcomes in clinical trials: good practice and pitfalls. Lancet 359: 1686– 1689.

This analysis has been performed using **R software** (ver. 3.3.2).