Easy Guides

Survival Analysis

Tue, 13 Dec 2016 01:26:54 +0100

Survival analysis corresponds to a set of statistical methods for investigating the time it takes for an event of interest to occur.

In this chapter, we start by describing how to fit survival curves and how to perform logrank tests comparing the survival time of two or more groups of individuals. We continue by demonstrating how to assess simultaneously the impact of multiple risk factors on the survival time using the Cox regression model. Finally, we describe how to check the validy Cox model assumptions.

Survival analysis toolkits in R

We’ll use two R packages for survival data analysis and visualization :

the survival package for survival analyses,
and the survminer package for ggplot2-based elegant visualization of survival analysis results

For survival analyses, the following function [in survival package] will be used:

Surv() to create a survival object
survfit() to fit survival curves (Kaplan-Meier estimates)
survdiff() to perform log-rank test comparing survival curves
coxph() to compute the Cox proportional hazards model

For the visualization, we’ll use the following function available in the survminer package:

ggsurvplot() for visualizing survival curves
ggcoxzph(), ggcoxdiagnostics() and ggcoxfunctional() for checking the Cox model assumptions.

These two packages can be installed as follow:

install.packages("survival")
install.packages("survminer")

Survival analysis basics: curves and logrank tests

Survival analysis

Objectives
Basic concepts
- Survival time and type of events in cancer studies
- Censoring
- Survival and hazard functions
- Kaplan-Meier survival estimate
Survival analysis in R
- Install and load required R package
- Example data sets
- Compute survival curves: survfit()
- Access to the value returned by survfit()
- Visualize survival curves
- Kaplan-Meier life table: summary of survival curves
- Log-Rank test comparing survival curves: survdiff()
- Fit complex survival curves

Cox proportional hazards model

The need for multivariate statistical modeling
Basics of the Cox proportional hazards model
Compute the Cox model in R
- Install and load required R package
- R function to compute the Cox model: coxph()
- Example data sets
- Compute the Cox model
- Visualizing the estimated distribution of survival times

Read more –> Cox Proportional Hazards Model.

Cox model assumptions

Diagnostics for the Cox model
Assessing the validy of a Cox model in R
- Installing and loading required R packages
- Computing a Cox model
- Testing proportional Hazards assumption
- Testing influential observations
- Testing non linearity

Infos

This analysis has been performed using R software (ver. 3.3.2).

Cox Model Assumptions

Tue, 13 Dec 2016 00:03:00 +0100

Previously, we described the basic methods for analyzing survival data, as well as, the Cox proportional hazards methods to deal with the situation where several factors impact on the survival process.

In the current article, we continue the series by describing methods to evaluate the validity of the Cox model assumptions.

Note that, when used inappropriately, statistical models may give rise to misleading conclusions. Therefore, it’s important to check that a given model is an appropriate representation of the data.

Contents

Diagnostics for the Cox model
Assessing the validy of a Cox model in R
Summary
Infos

Diagnostics for the Cox model

The Cox proportional hazards model makes sevral assumptions. Thus, it is important to assess whether a fitted Cox regression model adequately describes the data.

Here, we’ll disscuss three types of diagonostics for the Cox model:

Testing the proportional hazards assumption.
Examining influential observations (or outliers).
Detecting nonlinearity in relationship between the log hazard and the covariates.

In order to check these model assumptions, Residuals method are used. The common residuals for the Cox model include:

Schoenfeld residuals to check the proportional hazards assumption
Martingale residual to assess nonlinearity
Deviance residual (symmetric transformation of the Martinguale residuals), to examine influential observations

Assessing the validy of a Cox model in R

Installing and loading required R packages

We’ll use two R packages:

survival for computing survival analyses
survminer for visualizing survival analysis results
Install the packages

install.packages(c("survival", "survminer"))

Load the packages

library("survival")
library("survminer")

Computing a Cox model

We’ll use the lung data sets and the coxph() function in the survival package.

To compute a Cox model, type this:

library("survival")
res.cox <- coxph(Surv(time, status) ~ age + sex + wt.loss, data =  lung)
res.cox

Call:
coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)

            coef exp(coef) se(coef)     z      p
age      0.02009   1.02029  0.00966  2.08 0.0377
sex     -0.52103   0.59391  0.17435 -2.99 0.0028
wt.loss  0.00076   1.00076  0.00619  0.12 0.9024

Likelihood ratio test=14.7  on 3 df, p=0.00212
n= 214, number of events= 152 
   (14 observations deleted due to missingness)

Testing proportional Hazards assumption

The proportional hazards (PH) assumption can be checked using statistical tests and graphical diagnostics based on the scaled Schoenfeld residuals.

In principle, the Schoenfeld residuals are independent of time. A plot that shows a non-random pattern against time is evidence of violation of the PH assumption.

The function cox.zph() [in the survival package] provides a convenient solution to test the proportional hazards assumption for each covariate included in a Cox refression model fit.

For each covariate, the function cox.zph() correlates the corresponding set of scaled Schoenfeld residuals with time, to test for independence between residuals and time. Additionally, it performs a global test for the model as a whole.

The proportional hazard assumption is supported by a non-significant relationship between residuals and time, and refuted by a significant relationship.

To illustrate the test, we start by computing a Cox regression model using the lung data set [in survival package]:

library("survival")
res.cox <- coxph(Surv(time, status) ~ age + sex + wt.loss, data =  lung)
res.cox

Call:
coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)

            coef exp(coef) se(coef)     z      p
age      0.02009   1.02029  0.00966  2.08 0.0377
sex     -0.52103   0.59391  0.17435 -2.99 0.0028
wt.loss  0.00076   1.00076  0.00619  0.12 0.9024

Likelihood ratio test=14.7  on 3 df, p=0.00212
n= 214, number of events= 152 
   (14 observations deleted due to missingness)

To test for the proportional-hazards (PH) assumption, type this:

test.ph <- cox.zph(res.cox)
test.ph

            rho chisq     p
age     -0.0483 0.378 0.538
sex      0.1265 2.349 0.125
wt.loss  0.0126 0.024 0.877
GLOBAL       NA 2.846 0.416

From the output above, the test is not statistically significant for each of the covariates, and the global test is also not statistically significant. Therefore, we can assume the proportional hazards.

It’s possible to do a graphical diagnostic using the function ggcoxzph() [in the survminer package], which produces, for each covariate, graphs of the scaled Schoenfeld residuals against the transformed time.

ggcoxzph(test.ph)

Cox Model Assumptions

In the figure above, the solid line is a smoothing spline fit to the plot, with the dashed lines representing a +/- 2-standard-error band around the fit.

Note that, systematic departures from a horizontal line are indicative of non-proportional hazards, since proportional hazards assumes that estimates \(\beta_1, \beta_2, \beta_3\) do not vary much over time.

From the graphical inspection, there is no pattern with time. The assumption of proportional hazards appears to be supported for the covariates sex (which is, recall, a two-level factor, accounting for the two bands in the graph), wt.loss and age.

Another graphical methods for checking proportional hazards is to plot log(-log(S(t))) vs. t or log(t) and look for parallelism. This can be done only for categorical covariates.

A violations of proportional hazards assumption can be resolved by:

Adding covariate*time interaction
Stratification

Stratification is usefull for “nuisance” confounders, where you do not care to estimate the effect. You cannot examine the effects of the stratification variable (John Fox & Sanford Weisberg).

To read more about how to accomodate with non-proportional hazards, read the following articles:

Jadwiga Borucka, PAREXEL, Warsaw, Poland. Extensions of cox model for non-proportional hazards purpose. 2013.
John Fox & Sanford Weisberg. Cox Proportional-Hazards Regression for Survival Data in R.
Max Gordon. Dealing with non-proportional hazards in R. March 29, 2016.

Testing influential observations

To test influential observations or outliers, we can visualize either:

the deviance residuals or
the dfbeta values

The function ggcoxdiagnostics()[in survminer package] provides a convenient solution for checkind influential observations. The simplified format is as follow:

ggcoxdiagnostics(fit, type = , linear.predictions = TRUE)

fit: an object of class coxph.object
type: the type of residuals to present on Y axis. Allowed values include one of c(“martingale”, “deviance”, “score”, “schoenfeld”, “dfbeta”, “dfbetas”, “scaledsch”, “partial”).
linear.predictions: a logical value indicating whether to show linear predictions for observations (TRUE) or just indexed of observations (FALSE) on X axis.

Specifying the argument type = “dfbeta”, plots the estimated changes in the regression coefficients upon deleting each observation in turn; likewise, type=“dfbetas” produces the estimated changes in the coefficients divided by their standard errors.

For example:

ggcoxdiagnostics(res.cox, type = "dfbeta",
                 linear.predictions = FALSE, ggtheme = theme_bw())

Cox Model Assumptions

(Index plots of dfbeta for the Cox regression of time to death on age, sex and wt.loss)

The above index plots show that comparing the magnitudes of the largest dfbeta values to the regression coefficients suggests that none of the observations is terribly influential individually, even though some of the dfbeta values for age and wt.loss are large compared with the others.

It’s also possible to check outliers by visualizing the deviance residuals. The deviance residual is a normalized transform of the martingale residual. These residuals should be roughtly symmetrically distributed about zero with a standard deviation of 1.

Positive values correspond to individuals that “died too soon” compared to expected survival times.
Negative values correspond to individual that “lived too long”.
Very large or small values are outliers, which are poorly predicted by the model.

Example of deviance residuals:

ggcoxdiagnostics(res.cox, type = "deviance",
                 linear.predictions = FALSE, ggtheme = theme_bw())

Cox Model Assumptions

The pattern looks fairly symmetric around 0.

Testing non linearity

Often, we assume that continuous covariates have a linear form. However, this assumption should be checked.

Plotting the Martingale residuals against continuous covariates is a common approach used to detect nonlinearity or, in other words, to assess the functional form of a covariate. For a given continuous covariate, patterns in the plot may suggest that the variable is not properly fit.

Nonlinearity is not an issue for categorical variables, so we only examine plots of martingale residuals and partial residuals against a continuous variable.

Martingale residuals may present any value in the range (-INF, +1):

A value of martinguale residuals near 1 represents individuals that “died too soon”,
and large negative values correspond to individuals that “lived too long”.

To assess the functional form of a continuous variable in a Cox proportional hazards model, we’ll use the function ggcoxfunctional() [in the survminer R package].

The function ggcoxfunctional() displays graphs of continuous covariates against martingale residuals of null cox proportional hazards model. This might help to properly choose the functional form of continuous variable in the Cox model. Fitted lines with lowess function should be linear to satisfy the Cox proportional hazards model assumptions.

For example, to assess the functional forme of age, type this:

ggcoxfunctional(Surv(time, status) ~ age + log(age) + sqrt(age), data = lung)

Cox Model Assumptions

It appears that, nonlinearity is slightly here.

Summary

We described how to assess the valididy of the Cox model assumptions using the survival and survminer packages.

Infos

This analysis has been performed using R software (ver. 3.3.2).

Cox Proportional-Hazards Model

Mon, 12 Dec 2016 23:45:53 +0100

The Cox proportional-hazards model (Cox, 1972) is essentially a regression model commonly used statistical in medical research for investigating the association between the survival time of patients and one or more predictor variables.

In the previous chapter (survival analysis basics), we described the basic concepts of survival analyses and methods for analyzing and summarizing survival data, including:

the definition of hazard and survival functions,
the construction of Kaplan-Meier survival curves for different patient groups
the logrank test for comparing two or more survival curves

The above mentioned methods - Kaplan-Meier curves and logrank tests - are examples of univariate analysis. They describe the survival according to one factor under investigation, but ignore the impact of any others.

Additionally, Kaplan-Meier curves and logrank tests are useful only when the predictor variable is categorical (e.g.: treatment A vs treatment B; males vs females). They don’t work easily for quantitative predictors such as gene expression, weight, or age.

An alternative method is the Cox proportional hazards regression analysis, which works for both quantitative predictor variables and for categorical variables. Furthermore, the Cox regression model extends survival analysis methods to assess simultaneously the effect of several risk factors on survival time.

In this article, we’ll describe the Cox regression model and provide practical examples using R software.

Contents

The need for multivariate statistical modeling
Basics of the Cox proportional hazards model
Compute the Cox model in R
Summary
References
Infos

The need for multivariate statistical modeling

In clinical investigations, there are many situations, where several known quantities (known as covariates), potentially affect patient prognosis.

For instance, suppose two groups of patients are compared: those with and those without a specific genotype. If one of the groups also contains older individuals, any difference in survival may be attributable to genotype or age or indeed both. Hence, when investigating survival in relation to any one factor, it is often desirable to adjust for the impact of others.

Statistical model is a frequently used tool that allows to analyze survival with respect to several factors simultaneously. Additionally, statistical model provides the effect size for each factor.

The cox proportional-hazards model is one of the most important methods used for modelling survival analysis data. The next section introduces the basics of the Cox regression model.

Basics of the Cox proportional hazards model

The purpose of the model is to evaluate simultaneously the effect of several factors on survival. In other words, it allows us to examine how specified factors influence the rate of a particular event happening (e.g., infection, death) at a particular point in time. This rate is commonly referred as the hazard rate. Predictor variables (or factors) are usually termed covariates in the survival-analysis literature.

The Cox model is expressed by the hazard function denoted by h(t). Briefly, the hazard function can be interpreted as the risk of dying at time t. It can be estimated as follow:

\[ h(t) = h_0(t) \times exp(b_1x_1 + b_2x_2 + ... + b_px_p) \]

where,

t represents the survival time
\(h(t)\) is the hazard function determined by a set of p covariates (\(x_1, x_2, ..., x_p\))
the coefficients (\(b_1, b_2, ..., b_p\)) measure the impact (i.e., the effect size) of covariates.
the term \(h_0\) is called the baseline hazard. It corresponds to the value of the hazard if all the \(x_i\) are equal to zero (the quantity exp(0) equals 1). The ‘t’ in h(t) reminds us that the hazard may vary over time.

The Cox model can be written as a multiple linear regression of the logarithm of the hazard on the variables \(x_i\), with the baseline hazard being an ‘intercept’ term that varies with time.

The quantities \(exp(b_i)\) are called hazard ratios (HR). A value of \(b_i\) greater than zero, or equivalently a hazard ratio greater than one, indicates that as the value of the \(i^{th}\) covariate increases, the event hazard increases and thus the length of survival decreases.

Put another way, a hazard ratio above 1 indicates a covariate that is positively associated with the event probability, and thus negatively associated with the length of survival.

In summary,

HR = 1: No effect
HR < 1: Reduction in the hazard
HR > 1: Increase in Hazard

Note that in cancer studies:

A covariate with hazard ratio > 1 (i.e.: b > 0) is called bad prognostic factor
A covariate with hazard ratio < 1 (i.e.: b < 0) is called good prognostic factor

A key assumption of the Cox model is that the hazard curves for the groups of observations (or patients) should be proportional and cannot cross.

Consider two patients k and k’ that differ in their x-values. The corresponding hazard function can be simply written as follow

Hazard function for the patient k:

\[ h_k(t) = h_0(t)e^{\sum\limits_{i=1}^n{\beta x}} \]

Hazard function for the patient k’:

\[ h_{k'}(t) = h_0(t)e^{\sum\limits_{i=1}^n{\beta x'}} \]

The hazard ratio for these two patients [\(\frac{h_k(t)}{h_{k'}(t)} = \frac{h_0(t)e^{\sum\limits_{i=1}^n{\beta x}}}{h_0(t)e^{\sum\limits_{i=1}^n{\beta x'}}} = \frac{e^{\sum\limits_{i=1}^n{\beta x}}}{e^{\sum\limits_{i=1}^n{\beta x'}}}\)] is independent of time t.

Consequently, the Cox model is a proportional-hazards model: the hazard of the event in any group is a constant multiple of the hazard in any other. This assumption implies that, as mentioned above, the hazard curves for the groups should be proportional and cannot cross.

In other words, if an individual has a risk of death at some initial time point that is twice as high as that of another individual, then at all later times the risk of death remains twice as high.

This assumption of proportional hazards should be tested. We’ll discuss methods for assessing proportionality in the next article in this series: Cox Model Assumptions.

Compute the Cox model in R

Install and load required R package

We’ll use two R packages:

survival for computing survival analyses
survminer for visualizing survival analysis results
Install the packages

install.packages(c("survival", "survminer"))

Load the packages

library("survival")
library("survminer")

R function to compute the Cox model: coxph()

The function coxph()[in survival package] can be used to compute the Cox proportional hazards regression model in R.

The simplified format is as follow:

coxph(formula, data, method)

formula: is linear model with a survival object as the response variable. Survival object is created using the function Surv() as follow: Surv(time, event).
data: a data frame containing the variables
method: is used to specify how to handle ties. The default is ‘efron’. Other options are ‘breslow’ and ‘exact’. The default ‘efron’ is generally preferred to the once-popular “breslow” method. The “exact” method is much more computationally intensive.

Example data sets

We’ll use the lung cancer data in the survival R package.

data("lung")
head(lung)

  inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1    3  306      2  74   1       1       90       100     1175      NA
2    3  455      2  68   1       0       90        90     1225      15
3    3 1010      1  56   1       0       90        90       NA      15
4    5  210      2  57   1       1       90        60     1150      11
5    1  883      2  60   1       0      100        90       NA       0
6   12 1022      1  74   1       1       50        80      513       0

inst: Institution code
time: Survival time in days
status: censoring status 1=censored, 2=dead
age: Age in years
sex: Male=1 Female=2
ph.ecog: ECOG performance score (0=good 5=dead)
ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician
pat.karno: Karnofsky performance score as rated by patient
meal.cal: Calories consumed at meals
wt.loss: Weight loss in last six months

Compute the Cox model

We’ll fit the Cox regression using the following covariates: age, sex, ph.ecog and wt.loss.

We start by computing univariate Cox analyses for all these variables; then we’ll fit multivariate cox analyses using two variables to describe how the factors jointly impact on survival.

Univariate Cox regression

Univariate Cox analyses can be computed as follow:

res.cox <- coxph(Surv(time, status) ~ sex, data = lung)
res.cox

Call:
coxph(formula = Surv(time, status) ~ sex, data = lung)

      coef exp(coef) se(coef)     z      p
sex -0.531     0.588    0.167 -3.18 0.0015

Likelihood ratio test=10.6  on 1 df, p=0.00111
n= 228, number of events= 165

The function summary() for Cox models produces a more complete report:

summary(res.cox)

Call:
coxph(formula = Surv(time, status) ~ sex, data = lung)

  n= 228, number of events= 165 

       coef exp(coef) se(coef)      z Pr(>|z|)   
sex -0.5310    0.5880   0.1672 -3.176  0.00149 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    exp(coef) exp(-coef) lower .95 upper .95
sex     0.588      1.701    0.4237     0.816

Concordance= 0.579  (se = 0.022 )
Rsquare= 0.046   (max possible= 0.999 )
Likelihood ratio test= 10.63  on 1 df,   p=0.001111
Wald test            = 10.09  on 1 df,   p=0.001491
Score (logrank) test = 10.33  on 1 df,   p=0.001312

The Cox regression results can be interpreted as follow:

Statistical significance. The column marked “z” gives the Wald statistic value. It corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The wald statistic evaluates, whether the beta (\(\beta\)) coefficient of a given variable is statistically significantly different from 0. From the output above, we can conclude that the variable sex have highly statistically significant coefficients.
The regression coefficients. The second feature to note in the Cox model results is the the sign of the regression coefficients (coef). A positive sign means that the hazard (risk of death) is higher, and thus the prognosis worse, for subjects with higher values of that variable. The variable sex is encoded as a numeric vector. 1: male, 2: female. The R summary for the Cox model gives the hazard ratio (HR) for the second group relative to the first group, that is, female versus male. The beta coefficient for sex = -0.53 indicates that females have lower risk of death (lower survival rates) than males, in these data.
Hazard ratios. The exponentiated coefficients (exp(coef) = exp(-0.53) = 0.59), also known as hazard ratios, give the effect size of covariates. For example, being female (sex=2) reduces the hazard by a factor of 0.59, or 41%. Being female is associated with good prognostic.
Confidence intervals of the hazard ratios. The summary output also gives upper and lower 95% confidence intervals for the hazard ratio (exp(coef)), lower 95% bound = 0.4237, upper 95% bound = 0.816.
Global statistical significance of the model. Finally, the output gives p-values for three alternative tests for overall significance of the model: The likelihood-ratio test, Wald test, and score logrank statistics. These three methods are asymptotically equivalent. For large enough N, they will give similar results. For small N, they may differ somewhat. The Likelihood ratio test has better behavior for small sample sizes, so it is generally preferred.

To apply the univariate coxph function to multiple covariates at once, type this:

covariates <- c("age", "sex",  "ph.karno", "ph.ecog", "wt.loss")
univ_formulas <- sapply(covariates,
                        function(x) as.formula(paste('Surv(time, status)~', x)))
                        
univ_models <- lapply( univ_formulas, function(x){coxph(x, data = lung)})

# Extract data 
univ_results <- lapply(univ_models,
                       function(x){ 
                          x <- summary(x)
                          p.value<-signif(x$wald["pvalue"], digits=2)
                          wald.test<-signif(x$wald["test"], digits=2)
                          beta<-signif(x$coef[1], digits=2);#coeficient beta
                          HR <-signif(x$coef[2], digits=2);#exp(beta)
                          HR.confint.lower <- signif(x$conf.int[,"lower .95"], 2)
                          HR.confint.upper <- signif(x$conf.int[,"upper .95"],2)
                          HR <- paste0(HR, " (", 
                                       HR.confint.lower, "-", HR.confint.upper, ")")
                          res<-c(beta, HR, wald.test, p.value)
                          names(res)<-c("beta", "HR (95% CI for HR)", "wald.test", 
                                        "p.value")
                          return(res)
                          #return(exp(cbind(coef(x),confint(x))))
                         })
res <- t(as.data.frame(univ_results, check.names = FALSE))
as.data.frame(res)

           beta HR (95% CI for HR) wald.test p.value
age       0.019            1 (1-1)       4.1   0.042
sex       -0.53   0.59 (0.42-0.82)        10  0.0015
ph.karno -0.016      0.98 (0.97-1)       7.9   0.005
ph.ecog    0.48        1.6 (1.3-2)        18 2.7e-05
wt.loss  0.0013         1 (0.99-1)      0.05    0.83

The output above shows the regression beta coefficients, the effect sizes (given as hazard ratios) and statistical significance for each of the variables in relation to overall survival. Each factor is assessed through separate univariate Cox regressions.

From the output above,

The variables sex, age and ph.ecog have highly statistically significant coefficients, while the coefficient for ph.karno is not significant.
age and ph.ecog have positive beta coefficients, while sex has a negative coefficient. Thus, older age and higher ph.ecog are associated with poorer survival, whereas being female (sex=2) is associated with better survival.

Now, we want to describe how the factors jointly impact on survival. To answer to this question, we’ll perform a multivariate Cox regression analysis. As the variable ph.karno is not significant in the univariate Cox analysis, we’ll skip it in the multivariate analysis. We’ll include the 3 factors (sex, age and ph.ecog) into the multivariate model.

Multivariate Cox regression analysis

A Cox regression of time to death on the time-constant covariates is specified as follow:

res.cox <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data =  lung)
summary(res.cox)

Call:
coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung)

  n= 227, number of events= 164 
   (1 observation deleted due to missingness)

             coef exp(coef)  se(coef)      z Pr(>|z|)    
age      0.011067  1.011128  0.009267  1.194 0.232416    
sex     -0.552612  0.575445  0.167739 -3.294 0.000986 ***
ph.ecog  0.463728  1.589991  0.113577  4.083 4.45e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.0111     0.9890    0.9929    1.0297
sex        0.5754     1.7378    0.4142    0.7994
ph.ecog    1.5900     0.6289    1.2727    1.9864

Concordance= 0.637  (se = 0.026 )
Rsquare= 0.126   (max possible= 0.999 )
Likelihood ratio test= 30.5  on 3 df,   p=1.083e-06
Wald test            = 29.93  on 3 df,   p=1.428e-06
Score (logrank) test = 30.5  on 3 df,   p=1.083e-06

The p-value for all three overall tests (likelihood, Wald, and score) are significant, indicating that the model is significant. These tests evaluate the omnibus null hypothesis that all of the betas (\(\beta\)) are 0. In the above example, the test statistics are in close agreement, and the omnibus null hypothesis is soundly rejected.

In the multivariate Cox analysis, the covariates sex and ph.ecog remain significant (p < 0.05). However, the covariate age fails to be significant (p = 0.23, which is grater than 0.05).

The p-value for sex is 0.000986, with a hazard ratio HR = exp(coef) = 0.58, indicating a strong relationship between the patients’ sex and decreased risk of death. The hazard ratios of covariates are interpretable as multiplicative effects on the hazard. For example, holding the other covariates constant, being female (sex=2) reduces the hazard by a factor of 0.58, or 42%. We conclude that, being female is associated with good prognostic.

Similarly, the p-value for ph.ecog is 4.45e-05, with a hazard ratio HR = 1.59, indicating a strong relationship between the ph.ecog value and increased risk of death. Holding the other covariates constant, a higher value of ph.ecog is associated with a poor survival.

By contrast, the p-value for age is now p=0.23. The hazard ratio HR = exp(coef) = 1.01, with a 95% confidence interval of 0.99 to 1.03. Because the confidence interval for HR includes 1, these results indicate that age makes a smaller contribution to the difference in the HR after adjusting for the ph.ecog values and patient’s sex, and only trend toward significance. For example, holding the other covariates constant, an additional year of age induce daily hazard of death by a factor of exp(beta) = 1.01, or 1%, which is not a significant contribution.

Visualizing the estimated distribution of survival times

Having fit a Cox model to the data, it’s possible to visualize the predicted survival proportion at any given point in time for a particular risk group. The function survfit() estimates the survival proportion, by default at the mean values of covariates.

# Plot the baseline survival function
ggsurvplot(survfit(res.cox), color = "#2E9FDF",
           ggtheme = theme_minimal())

Cox Proportional-Hazards Model

We may wish to display how estimated survival depends upon the value of a covariate of interest.

Consider that, we want to assess the impact of the sex on the estimated survival probability. In this case, we construct a new data frame with two rows, one for each value of sex; the other covariates are fixed to their average values (if they are continuous variables) or to their lowest level (if they are discrete variables). For a dummy covariate, the average value is the proportion coded 1 in the data set. This data frame is passed to survfit() via the newdata argument:

# Create the new data  
sex_df <- with(lung,
               data.frame(sex = c(1, 2), 
                          age = rep(mean(age, na.rm = TRUE), 2),
                          ph.ecog = c(1, 1)
                          )
               )
sex_df

  sex      age ph.ecog
1   1 62.44737       1
2   2 62.44737       1

# Survival curves
fit <- survfit(res.cox, newdata = sex_df)
ggsurvplot(fit, conf.int = TRUE, legend.labs=c("Sex=1", "Sex=2"),
           ggtheme = theme_minimal())

Cox Proportional-Hazards Model

Summary

In this article, we described the Cox regression model for assessing simultaneously the relationship between multiple risk factors and patient’s survival time. We demonstrated how to compute the Cox model using the survival package. Additionally, we described how to visualize the results of the analysis using the survminer package.

References

Cox DR (1972). Regression models and life tables (with discussion). J R Statist Soc B 34: 187–220
MJ Bradburn, TG Clark, SB Love and DG Altman. Survival Analysis Part II: Multivariate data analysis – an introduction to concepts and methods. British Journal of Cancer (2003) 89, 431 – 436

Infos

This analysis has been performed using R software (ver. 3.3.2).

Survival Analysis Basics

Mon, 12 Dec 2016 20:27:40 +0100

Survival analysis corresponds to a set of statistical approaches used to investigate the time it takes for an event of interest to occur.

Survival analysis is used in a variety of field such as:

Cancer studies for patients survival time analyses,
Sociology for “event-history analysis”,
and in engineering for “failure-time analysis”.

In cancer studies, typical research questions are like:

What is the impact of certain clinical characteristics on patient’s survival
What is the probability that an individual survives 3 years?
Are there differences in survival between groups of patients?

Contents

Objectives
Basic concepts
Survival analysis in R
Summary
References
Infos

Objectives

The aim of this chapter is to describe the basic concepts of survival analysis. In cancer studies, most of survival analyses use the following methods:

Kaplan-Meier plots to visualize survival curves
Log-rank test to compare the survival curves of two or more groups
Cox proportional hazards regression to describe the effect of variables on survival. The Cox model is discussed in the next chapter: Cox proportional hazards model.

Here, we’ll start by explaining the essential concepts of survival analysis, including:

how to generate and interpret survival curves,
and how to quantify and test survival differences between two or more groups of patients.

Then, we’ll continue by describing multivariate analysis using Cox proportional hazards model.

Basic concepts

Here, we start by defining fundamental terms of survival analysis including:

Survival time and event
Censoring
Survival function and hazard function

Survival time and type of events in cancer studies

There are different types of events, including:

Relapse
Progression
Death

The time from ‘response to treatment’ (complete remission) to the occurrence of the event of interest is commonly called survival time (or time to event).

The two most important measures in cancer studies include: i) the time to death; and ii) the relapse-free survival time, which corresponds to the time between response to treatment and recurrence of the disease. It’s also known as disease-free survival time and event-free survival time.

Censoring

As mentioned above, survival analysis focuses on the expected duration of time until occurrence of an event of interest (relapse or death). However, the event may not be observed for some individuals within the study time period, producing the so-called censored observations.

Censoring may arise in the following ways:

a patient has not (yet) experienced the event of interest, such as relapse or death, within the study time period;
a patient is lost to follow-up during the study period;
a patient experiences a different event that makes further follow-up impossible.

This type of censoring, named right censoring, is handled in survival analysis.

Survival and hazard functions

Two related probabilities are used to describe survival data: the survival probability and the hazard probability.

The survival probability, also known as the survivor function \(S(t)\), is the probability that an individual survives from the time origin (e.g. diagnosis of cancer) to a specified future time t.

The hazard, denoted by \(h(t)\), is the probability that an individual who is under observation at a time t has an event at that time.

Note that, in contrast to the survivor function, which focuses on not having an event, the hazard function focuses on the event occurring.

Kaplan-Meier survival estimate

The Kaplan-Meier (KM) method is a non-parametric method used to estimate the survival probability from observed survival times (Kaplan and Meier, 1958).

The survival probability at time \(t_i\), \(S(t_i)\), is calculated as follow:

\[S(t_i) = S(t_{i-1})(1-\frac{d_i}{n_i})\]

Where,

\(S(t_{i-1})\) = the probability of being alive at \(t_{i-1}\)
\(n_i\) = the number of patients alive just before \(t_i\)
\(d_i\) = the number of events at \(t_i\)
\(t_0\) = 0, \(S(0)\) = 1

The estimated probability (\(S(t)\)) is a step function that changes value only at the time of each event. It’s also possible to compute confidence intervals for the survival probability.

The KM survival curve, a plot of the KM survival probability against time, provides a useful summary of the data that can be used to estimate measures such as median survival time.

Survival analysis in R

Install and load required R package

We’ll use two R packages:

survival for computing survival analyses
survminer for summarizing and visualizing the results of survival analysis
Install the packages

install.packages(c("survival", "survminer"))

Load the packages

library("survival")
library("survminer")

Example data sets

We’ll use the lung cancer data available in the survival package.

data("lung")
head(lung)

  inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1    3  306      2  74   1       1       90       100     1175      NA
2    3  455      2  68   1       0       90        90     1225      15
3    3 1010      1  56   1       0       90        90       NA      15
4    5  210      2  57   1       1       90        60     1150      11
5    1  883      2  60   1       0      100        90       NA       0
6   12 1022      1  74   1       1       50        80      513       0

inst: Institution code
time: Survival time in days
status: censoring status 1=censored, 2=dead
age: Age in years
sex: Male=1 Female=2
ph.ecog: ECOG performance score (0=good 5=dead)
ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician
pat.karno: Karnofsky performance score as rated by patient
meal.cal: Calories consumed at meals
wt.loss: Weight loss in last six months

Compute survival curves: survfit()

We want to compute the survival probability by sex.

The function survfit() [in survival package] can be used to compute kaplan-Meier survival estimate. Its main arguments include:

a survival object created using the function Surv()
and the data set containing the variables.

To compute survival curves, type this:

fit <- survfit(Surv(time, status) ~ sex, data = lung)
print(fit)

Call: survfit(formula = Surv(time, status) ~ sex, data = lung)

        n events median 0.95LCL 0.95UCL
sex=1 138    112    270     212     310
sex=2  90     53    426     348     550

By default, the function print() shows a short summary of the survival curves. It prints the number of observations, number of events, the median survival and the confidence limits for the median.

If you want to display a more complete summary of the survival curves, type this:

# Summary of survival curves
summary(fit)

# Access to the sort summary table
summary(fit)$table

Access to the value returned by survfit()

The function survfit() returns a list of variables, including the following components:

n: total number of subjects in each curve.
time: the time points on the curve.
n.risk: the number of subjects at risk at time t
n.event: the number of events that occurred at time t.
n.censor: the number of censored subjects, who exit the risk set, without an event, at time t.
lower,upper: lower and upper confidence limits for the curve, respectively.
strata: indicates stratification of curve estimation. If strata is not NULL, there are multiple curves in the result. The levels of strata (a factor) are the labels for the curves.

The components can be accessed as follow:

d <- data.frame(time = fit$time,
                  n.risk = fit$n.risk,
                  n.event = fit$n.event,
                  n.censor = fit$n.censor,
                  surv = fit$surv,
                  upper = fit$upper,
                  lower = fit$lower
                  )
head(d)

  time n.risk n.event n.censor      surv     upper     lower
1   11    138       3        0 0.9782609 1.0000000 0.9542301
2   12    135       1        0 0.9710145 0.9994124 0.9434235
3   13    134       2        0 0.9565217 0.9911586 0.9230952
4   15    132       1        0 0.9492754 0.9866017 0.9133612
5   26    131       1        0 0.9420290 0.9818365 0.9038355
6   30    130       1        0 0.9347826 0.9768989 0.8944820

Visualize survival curves

We’ll use the function ggsurvplot() [in Survminer R package] to produce the survival curves for the two groups of subjects.

It’s also possible to show:

the 95% confidence limits of the survivor function using the argument conf.int = TRUE.
the number and/or the percentage of individuals at risk by time using the option risk.table. Allowed values for risk.table include:
- TRUE or FALSE specifying whether to show or not the risk table. Default is FALSE.
- “absolute” or “percentage”: to show the absolute number and the percentage of subjects at risk by time, respectively. Use “abs_pct” to show both absolute number and percentage.
the p-value of the Log-Rank test comparing the groups using pval = TRUE.
horizontal/vertical line at median survival using the argument surv.median.line. Allowed values include one of c(“none”, “hv”, “h”, “v”). v: vertical, h:horizontal.

# Change color, linetype by strata, risk.table color by strata
ggsurvplot(fit,
          pval = TRUE, conf.int = TRUE,
          risk.table = TRUE, # Add risk table
          risk.table.col = "strata", # Change risk table color by groups
          linetype = "strata", # Change line type by groups
          surv.median.line = "hv", # Specify median survival
          ggtheme = theme_bw(), # Change ggplot2 theme
          palette = c("#E7B800", "#2E9FDF"))

Survival Analysis

The plot can be further customized using the following arguments:

conf.int.style = “step” to change the style of confidence interval bands.
xlab to change the x axis label.
break.time.by = 200 break x axis in time intervals by 200.
risk.table = “abs_pct”to show both absolute number and percentage of individuals at risk.
risk.table.y.text.col = TRUE and risk.table.y.text = FALSE to provide bars instead of names in text annotations of the legend of risk table.
ncensor.plot = TRUE to plot the number of censored subjects at time t. As suggested by Marcin Kosinski, This is a good additional feedback to survival curves, so that one could realize: how do survival curves look like, what is the number of risk set AND what is the cause that the risk set become smaller: is it caused by events or by censored events?
legend.labs to change the legend labels.

ggsurvplot(
   fit,                     # survfit object with calculated statistics.
   pval = TRUE,             # show p-value of log-rank test.
   conf.int = TRUE,         # show confidence intervals for 
                            # point estimaes of survival curves.
   conf.int.style = "step",  # customize style of confidence intervals
   xlab = "Time in days",   # customize X axis label.
   break.time.by = 200,     # break X axis in time intervals by 200.
   ggtheme = theme_light(), # customize plot and risk table with a theme.
   risk.table = "abs_pct",  # absolute number and percentage at risk.
  risk.table.y.text.col = T,# colour risk table text annotations.
  risk.table.y.text = FALSE,# show bars instead of names in text annotations
                            # in legend of risk table.
  ncensor.plot = TRUE,      # plot the number of censored subjects at time t
  surv.median.line = "hv",  # add the median survival pointer.
  legend.labs = 
    c("Male", "Female"),    # change legend labels.
  palette = 
    c("#E7B800", "#2E9FDF") # custom color palettes.
)

Survival Analysis

The Kaplan-Meier plot can be interpreted as follow:

The horizontal axis (x-axis) represents time in days, and the vertical axis (y-axis) shows the probability of surviving or the proportion of people surviving. The lines represent survival curves of the two groups. A vertical drop in the curves indicates an event. The vertical tick mark on the curves means that a patient was censored at this time.

At time zero, the survival probability is 1.0 (or 100% of the participants are alive).
At time 250, the probability of survival is approximately 0.55 (or 55%) for sex=1 and 0.75 (or 75%) for sex=2.
The median survival is approximately 270 days for sex=1 and 426 days for sex=2, suggesting a good survival for sex=2 compared to sex=1

The median survival times for each group can be obtained using the code below:

summary(fit)$table

      records n.max n.start events   *rmean *se(rmean) median 0.95LCL 0.95UCL
sex=1     138   138     138    112 325.0663   22.59845    270     212     310
sex=2      90    90      90     53 458.2757   33.78530    426     348     550

The median survival times for each group represent the time at which the survival probability, S(t), is 0.5.

The median survival time for sex=1 (Male group) is 270 days, as opposed to 426 days for sex=2 (Female). There appears to be a survival advantage for female with lung cancer compare to male. However, to evaluate whether this difference is statistically significant requires a formal statistical test, a subject that is discussed in the next sections.

Note that, the confidence limits are wide at the tail of the curves, making meaningful interpretations difficult. This can be explained by the fact that, in practice, there are usually patients who are lost to follow-up or alive at the end of follow-up. Thus, it may be sensible to shorten plots before the end of follow-up on the x-axis (Pocock et al, 2002).

The survival curves can be shorten using the argument xlim as follow:

ggsurvplot(fit,
          conf.int = TRUE,
          risk.table.col = "strata", # Change risk table color by groups
          ggtheme = theme_bw(), # Change ggplot2 theme
          palette = c("#E7B800", "#2E9FDF"),
          xlim = c(0, 600))

Survival Analysis

Note that, three often used transformations can be specified using the argument fun:

“log”: log transformation of the survivor function,
“event”: plots cumulative events (f(y) = 1-y). It’s also known as the cumulative incidence,
“cumhaz” plots the cumulative hazard function (f(y) = -log(y))

For example, to plot cumulative events, type this:

ggsurvplot(fit,
          conf.int = TRUE,
          risk.table.col = "strata", # Change risk table color by groups
          ggtheme = theme_bw(), # Change ggplot2 theme
          palette = c("#E7B800", "#2E9FDF"),
          fun = "event")

Survival Analysis

The cummulative hazard is commonly used to estimate the hazard probability. It’s defined as \(H(t) = -log(survival function) = -log(S(t))\). The cumulative hazard (\(H(t)\)) can be interpreted as the cumulative force of mortality. In other words, it corresponds to the number of events that would be expected for each individual by time t if the event were a repeatable process.

To plot cumulative hazard, type this:

ggsurvplot(fit,
          conf.int = TRUE,
          risk.table.col = "strata", # Change risk table color by groups
          ggtheme = theme_bw(), # Change ggplot2 theme
          palette = c("#E7B800", "#2E9FDF"),
          fun = "cumhaz")

Survival Analysis

Kaplan-Meier life table: summary of survival curves

As mentioned above, you can use the function summary() to have a complete summary of survival curves:

summary(fit)

It’s also possible to use the function surv_summary() [in survminer package] to get a summary of survival curves. Compared to the default summary() function, surv_summary() creates a data frame containing a nice summary from survfit results.

res.sum <- surv_summary(fit)
head(res.sum)

  time n.risk n.event n.censor      surv    std.err     upper     lower strata sex
1   11    138       3        0 0.9782609 0.01268978 1.0000000 0.9542301  sex=1   1
2   12    135       1        0 0.9710145 0.01470747 0.9994124 0.9434235  sex=1   1
3   13    134       2        0 0.9565217 0.01814885 0.9911586 0.9230952  sex=1   1
4   15    132       1        0 0.9492754 0.01967768 0.9866017 0.9133612  sex=1   1
5   26    131       1        0 0.9420290 0.02111708 0.9818365 0.9038355  sex=1   1
6   30    130       1        0 0.9347826 0.02248469 0.9768989 0.8944820  sex=1   1

The function surv_summary() returns a data frame with the following columns:

time: the time points at which the curve has a step.
n.risk: the number of subjects at risk at t.
n.event: the number of events that occur at time t.
n.censor: number of censored events.
surv: estimate of survival probability.
std.err: standard error of survival.
upper: upper end of confidence interval
lower: lower end of confidence interval
strata: indicates stratification of curve estimation. The levels of strata (a factor) are the labels for the curves.

In a situation, where survival curves have been fitted with one or more variables, surv_summary object contains extra columns representing the variables. This makes it possible to facet the output of ggsurvplot by strata or by some combinations of factors.

surv_summary object has also an attribute named ‘table’ containing information about the survival curves, including medians of survival with confidence intervals, as well as, the total number of subjects and the number of event in each curve. To get access to the attribute ‘table’, type this:

attr(res.sum, "table")

Log-Rank test comparing survival curves: survdiff()

The log-rank test is the most widely used method of comparing two or more survival curves. The null hypothesis is that there is no difference in survival between the two groups. The log rank test is a non-parametric test, which makes no assumptions about the survival distributions. Essentially, the log rank test compares the observed number of events in each group to what would be expected if the null hypothesis were true (i.e., if the survival curves were identical). The log rank statistic is approximately distributed as a chi-square test statistic.

The function survdiff() [in survival package] can be used to compute log-rank test comparing two or more survival curves.

survdiff() can be used as follow:

surv_diff <- survdiff(Surv(time, status) ~ sex, data = lung)
surv_diff

Call:
survdiff(formula = Surv(time, status) ~ sex, data = lung)

        N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 138      112     91.6      4.55      10.3
sex=2  90       53     73.4      5.68      10.3

 Chisq= 10.3  on 1 degrees of freedom, p= 0.00131

The function returns a list of components, including:

n: the number of subjects in each group.
obs: the weighted observed number of events in each group.
exp: the weighted expected number of events in each group.
chisq: the chisquare statistic for a test of equality.
strata: optionally, the number of subjects contained in each stratum.

The log rank test for difference in survival gives a p-value of p = 0.0013, indicating that the sex groups differ significantly in survival.

Fit complex survival curves

In this section, we’ll compute survival curves using the combination of multiple factors. Next, we’ll facet the output of ggsurvplot() by a combination of factors

Fit (complex) survival curves using colon data sets

require("survival")
fit2 <- survfit( Surv(time, status) ~ sex + rx + adhere,
                data = colon )

Visualize the output using survminer. The plot below shows survival curves by the sex variable faceted according to the values of rx & adhere.

# Plot survival curves by sex and facet by rx and adhere
ggsurv <- ggsurvplot(fit2, fun = "event", conf.int = TRUE,
                     ggtheme = theme_bw())
   
ggsurv$plot +theme_bw() + 
  theme (legend.position = "right")+
  facet_grid(rx ~ adhere)

Survival Analysis

Summary

Survival analysis is a set of statistical approaches for data analysis where the outcome variable of interest is time until an event occurs.

Survival data are generally described and modeled in terms of two related functions:

the survivor function representing the probability that an individual survives from the time of origin to some time beyond time t. It’s usually estimated by the Kaplan-Meier method. The logrank test may be used to test for differences between survival curves for groups, such as treatment arms.
The hazard function gives the instantaneous potential of having an event at a time, given survival up to that time. It is used primarily as a diagnostic tool or for specifying a mathematical model for survival analysis.

In this article, we demonstrate how to perform and visualize survival analyses using the combination of two R packages: survival (for the analysis) and survminer (for the visualization).

References

Clark TG, Bradburn MJ, Love SB and Altman DG. Survival Analysis Part I: Basic concepts and first analyses. British Journal of Cancer (2003) 89, 232 – 238
Kaplan EL, Meier P (1958) Nonparametric estimation from incomplete observations. J Am Stat Assoc 53: 457–481.
Pocock S, Clayton TC, Altman DG (2002) Survival plots of time-to-event outcomes in clinical trials: good practice and pitfalls. Lancet 359: 1686– 1689.

Infos

This analysis has been performed using R software (ver. 3.3.2).