This vignette is an introduction to performing survival analysis in mlr3proba.

## A very quick introduction to survival analysis

Survival analysis is a sub-field of supervised machine learning in which the aim is to predict the survival distribution of a given individual. Arguably the main feature of survival analysis is that, unlike classification and regression, learners are trained on two targets: 1. the time until the event takes place, 2. the event indicator: whether the event (e.g. death) was observed or the observation was censored. At a particular time-point, an individual is either alive, dead, or censored. Censoring occurs when it is unknown whether an individual is alive or dead. For example, say we are interested in patients in hospital and every day we record whether they are alive or dead; after a patient leaves the hospital we no longer know whether they are alive or dead, hence they are censored.
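As a concrete sketch of these two targets (assuming mlr3proba is installed; the simsurv task generator is the same one used later in this vignette), the observed times and censoring indicators of a task can be inspected directly:

```r
library(mlr3proba)

# Generate a small simulated survival task; each row has a time and a status.
task = tgen("simsurv")$generate(5)

# truth() returns a survival::Surv object: observed times, with a "+" suffix
# marking censored observations (i.e. the event was not observed).
task$truth()
```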

If there is no censoring but a predicted probability distribution is still the goal, then probabilistic regression learners are advised instead.

## Evaluate - crank, lp, and distr

Every PredictionSurv object can predict one or more of:

• lp - Linear predictor calculated as the fitted coefficients multiplied by the test data.
• distr - Predicted survival distribution, either discrete or continuous. Implemented in distr6.
• crank - Continuous risk ranking.

lp and crank can be used with measures of discrimination such as the concordance index. Whilst lp is a specific mathematical prediction, crank is any continuous ranking that identifies who is more or less likely to experience the event. So far the only implemented learner that returns a continuous ranking alone is surv.svm. If a PredictionSurv returns an lp, then crank is identical to it; otherwise crank is calculated as the expectation of the predicted survival distribution. Note that for linear proportional hazards models, the ranking (but not necessarily the crank score itself) given by lp and by the expectation of distr is identical.
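For example (a minimal sketch, assuming the same simsurv setup used in the code below), a Cox model returns an lp, so its crank is simply copied from it:

```r
library(mlr3proba)

task = tgen("simsurv")$generate(30)
prediction = lrn("surv.coxph")$train(task)$predict(task)

# Because surv.coxph returns an lp, crank is identical to it.
all(prediction$crank == prediction$lp)
#> [1] TRUE
```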

## Composition

Finally we take a look at the PipeOps implemented in mlr3proba, which are used for composition of predict types. For example, if a learner only returns a linear predictor, then PipeOpDistrCompositor can be used to estimate a survival distribution. Or, if a learner returns a distr then PipeOpCrankCompositor can be used to estimate crank from distr. See mlr3pipelines for full tutorials and details on PipeOps.

### PipeOpDistrCompositor

library(mlr3pipelines)

# PipeOpDistrCompositor - Train one model with a baseline distribution,
# (Kaplan-Meier or Nelson-Aalen), and another with a predicted linear predictor.
task = tgen("simsurv")$generate(30)
learner_lp = lrn("surv.gbm", bag.fraction = 1, n.trees = 50L)
learner_distr = lrn("surv.kaplan")
prediction_lp = learner_lp$train(task)$predict(task)
prediction_distr = learner_distr$train(task)$predict(task)
prediction_lp$distr
#> NULL

# Doesn't need training. Base = baseline distribution. ph = Proportional hazards.

pod = po("distrcompose", param_vals = list(form = "ph", overwrite = FALSE))
prediction = pod$predict(list(base = prediction_distr, pred = prediction_lp))$output

# Now we have a predicted distr!

prediction$distr
#> WeightDisc1 WeightDisc2 ... WeightDisc29 WeightDisc30

# This can all be simplified by using the distrcompositor wrapper
gbm.distr = distrcompositor(learner = lrn("surv.gbm", bag.fraction = 1, n.trees = 50L),
                            estimator = "kaplan", form = "aft", overwrite = FALSE)
gbm.distr$train(task)$predict(task)
#> <PredictionSurv> for 30 observations:
#>    row_id     time status       crank                distr          lp
#>         1 5.000000  FALSE -0.19367047 <VectorDistribution> -0.19367047
#>         2 3.568860   TRUE -0.66748168 <VectorDistribution> -0.66748168
#>         3 1.128017   TRUE -0.05205574 <VectorDistribution> -0.05205574
#>       ---
#>        28 1.635774   TRUE -0.17106619 <VectorDistribution> -0.17106619
#>        29 5.000000  FALSE -0.47381121 <VectorDistribution> -0.47381121
#>        30 3.155294   TRUE -0.66748168 <VectorDistribution> -0.66748168

### PipeOpCrankCompositor

Note that a PredictionSurv will always return crank, but this may either be the same as the lp or the expectation of distr. This compositor allows you to change the estimation method.

# PipeOpCrankCompositor - Only one model required.
learner = lrn("surv.coxph")
prediction = learner$train(task)$predict(task)

# Doesn't need training - Note: no overwrite option as crank is always
# present, so the compositor, if used, will always overwrite it.
poc = po("crankcompose", param_vals = list(method = "mean"))
composed_prediction = poc$predict(list(prediction))$output

# Note that whilst the actual values of lp and crank are different,
# the rankings are the same, so discrimination measures are unchanged.
prediction$crank[1:10]
#>  [1]  0.19248733 -0.43871848  0.39422978  0.24340349 -0.20020759  0.15094767
#>  [7]  0.60484819 -0.56638731  0.01629422  0.42356422
composed_prediction$crank[1:10]
#>  [1] 1.6122990 1.0944552 1.7651577 1.6525095 1.2873051 1.5788921 1.8987710
#>  [8] 0.9969524 1.4680414 1.7856538

all(order(prediction$crank) == order(composed_prediction$crank))
#> [1] TRUE

cbind(Original = prediction$score(), Composed = composed_prediction$score())
#>                Original  Composed
#> surv.harrellC 0.5854342 0.5854342

# Again a wrapper can be used to simplify this
crankcompositor(lrn("surv.coxph"), method = "mean")$train(task)$predict(task)
#> <PredictionSurv> for 30 observations:
#>    row_id     time status    crank                distr          lp
#>         1 5.000000  FALSE 1.612299 <VectorDistribution>  0.19248733
#>         2 3.568860   TRUE 1.094455 <VectorDistribution> -0.43871848
#>         3 1.128017   TRUE 1.765158 <VectorDistribution>  0.39422978
#>       ---
#>        28 1.635774   TRUE 1.232586 <VectorDistribution> -0.26655353
#>        29 5.000000  FALSE 1.436710 <VectorDistribution> -0.02125514
#>        30 3.155294   TRUE 1.073284 <VectorDistribution> -0.46592203

## All Together Now

Putting all of this together we can perform an (overly simplified) benchmark experiment to find the best learner for making predictions on a simulated dataset.

library(mlr3pipelines); library(mlr3); library(mlr3tuning); library(paradox)
set.seed(42)
task = tgen("simsurv")$generate(50)

composed_lrn_gbm = distrcompositor(lrn("surv.gbm", bag.fraction = 1, n.trees = 50L),
                                   "kaplan", "ph")

lrns = lapply(paste0("surv.", c("kaplan", "coxph", "parametric")), lrn)
lrns[[3]]$param_set$values = list(dist = "weibull", type = "ph")

design = benchmark_grid(tasks = task, learners = c(lrns, list(composed_lrn_gbm)),
                        resamplings = rsmp("cv", folds = 2))

bm = benchmark(design)
bm$aggregate(lapply(c("surv.harrellC", "surv.graf", "surv.grafSE"), msr))[, c(4, 7:9)]
#>                           learner_id surv.harrellC surv.graf surv.grafSE
#> 1:                       surv.kaplan     0.5000000 0.2107141  0.02003976
#> 2:                        surv.coxph     0.5003190 0.2259178  0.03210873
#> 3:                   surv.parametric     0.5066986 0.4481043  0.06730402
#> 4: surv.kaplan.surv.gbm.distrcompose     0.4811204 0.2774313  0.04296744