This calibration method is defined by calculating the following statistic: $$s = \frac{B}{n} \sum_{i=1}^{B} \left(P_i - \frac{n}{B}\right)^2$$ where \(B\) is the number of 'buckets' (that equally divide \([0,1]\) into intervals), \(n\) is the number of predictions, and \(P_i\) is the observed number of observations in the \(i\)th interval, so that \(n/B\) is the count expected in each bucket under perfect calibration. An observation is assigned to the \(i\)th bucket if its predicted survival probability at the time of event falls within the corresponding interval. This statistic assumes that censoring time is independent of death time.

A model is well-calibrated if the observations are spread uniformly over the \(B\) buckets, which is tested with `chisq.test` (\(p > 0.05\) if well-calibrated).
Model \(i\) is better calibrated than model \(j\) if \(s(i) < s(j)\),
meaning that *lower values* of this measure are preferred.
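The statistic can be reproduced by hand in base R. The following is a minimal sketch, not the package's implementation: it assumes that \(P_i\) is read as the number of observations in the \(i\)th bucket, it ignores censoring entirely, and the vector `surv_at_event` is hypothetical simulated data standing in for predicted survival probabilities at the event times.

```r
set.seed(1)
B <- 10

# Hypothetical predicted survival probabilities at the observed event times
# (simulated here; all observations are treated as uncensored).
surv_at_event <- runif(200)

# Assign each observation to one of B equal-width buckets on [0, 1].
buckets <- cut(surv_at_event,
               breaks = seq(0, 1, length.out = B + 1),
               include.lowest = TRUE)
counts <- as.vector(table(buckets))
n <- sum(counts)

# Statistic from the formula: s = (B / n) * sum((P_i - n / B)^2).
s <- B / n * sum((counts - n / B)^2)

# chisq.test() with its default uniform null yields the same statistic.
test <- chisq.test(counts)
print(c(by_hand = s, chisq_test = unname(test$statistic)))
```

Both values agree because `chisq.test` applied to a vector of counts tests against equal expected counts of \(n/B\) per bucket.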

## Details

This measure can return either the test statistic or the p-value from `chisq.test`.
The former is useful for model comparison, whereas the latter is useful for determining whether a model
is well-calibrated. If `chisq = FALSE` and `s` is the returned statistic, you can compute the
p-value manually with `pchisq(s, B - 1, lower.tail = FALSE)`.
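As a small illustration of the manual conversion, the sketch below turns a statistic into a p-value and applies the \(p > 0.05\) rule mentioned above. The value `s <- 3` is a hypothetical score, not a result from any real model.

```r
# Hypothetical statistic returned by the measure with chisq = FALSE.
s <- 3
B <- 10

# Upper-tail probability of a chi-squared distribution with B - 1 degrees
# of freedom, as given in the text.
p_value <- pchisq(s, df = B - 1, lower.tail = FALSE)

# The null hypothesis of D-calibration is not rejected when p > 0.05.
well_calibrated <- p_value > 0.05
print(well_calibrated)
```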

NOTE: This measure is still experimental both theoretically and in implementation. Results should therefore only be taken as an indicator of performance and not for conclusive judgements about model calibration.

## Dictionary

This measure can be instantiated via the dictionary `mlr_measures` or with the associated sugar function `msr()`.
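A minimal sketch of the sugar-function route follows. It assumes that the mlr3proba package is installed and that the measure is registered under the key `"surv.dcalib"`; the guard and the placeholder variable `dcal` are illustrative only, and the parameter names are the ones listed in the table below.

```r
# Placeholder so the sketch also runs when mlr3proba is not installed.
dcal <- NULL

# Loading the mlr3proba namespace registers its survival measures in the
# mlr_measures dictionary (assumption: the package is installed).
if (requireNamespace("mlr3proba", quietly = TRUE)) {
  # Assumed dictionary key for this measure: "surv.dcalib".
  dcal <- mlr3::msr("surv.dcalib", B = 10, chisq = TRUE)
  print(dcal$param_set$values)
}
```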

## Parameters

| Id | Type | Default | Levels | Range |
|---|---|---|---|---|
| B | integer | 10 | - | \([1, \infty)\) |
| chisq | logical | FALSE | TRUE, FALSE | - |
| truncate | numeric | Inf | - | \([0, \infty)\) |

## Parameter details

- `B` (`integer(1)`)
  Number of buckets to test for uniform predictions over. The default of `10` is recommended by Haider et al. (2020). Changing this parameter affects `truncate`.

- `chisq` (`logical(1)`)
  If `TRUE`, returns the p-value of the corresponding `chisq.test` instead of the measure. The default is `FALSE`, which returns the statistic `s`. You can get the p-value manually by executing `pchisq(s, B - 1, lower.tail = FALSE)`. The null hypothesis is that the model is D-calibrated.

- `truncate` (`double(1)`)
  This parameter controls the upper bound of the output statistic when `chisq` is `FALSE`. The default is `truncate = Inf`, but \(10\) may be sufficient for most purposes; with \(B = 10\) buckets this corresponds to a p-value of 0.35 for the `chisq.test`. Values \(> 10\) translate to even lower p-values and thus indicate less calibrated models. If the number of buckets \(B\) changes, you will probably want to change the `truncate` value as well so that it corresponds to the same p-value. Note that truncation may severely limit automated tuning with this measure.
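The correspondence between `truncate` and the p-value can be checked with base R. The sketch below first recovers the p-value of about 0.35 for a cut-off of 10 with \(B = 10\), then derives a cut-off with the same p-value for a different number of buckets. The choice `B_new <- 20` and the variable names are hypothetical.

```r
# p-value at a cut-off of 10 with B = 10 buckets (9 degrees of freedom).
p_ref <- pchisq(10, df = 10 - 1, lower.tail = FALSE)
print(round(p_ref, 2))

# Hypothetical new bucket count; find the cut-off with the same p-value.
B_new <- 20
truncate_new <- qchisq(p_ref, df = B_new - 1, lower.tail = FALSE)
print(truncate_new)
```

Using `qchisq` in this way keeps the truncation point at the same significance level when `B` changes, which is the adjustment the description of `truncate` asks for.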

## References

Haider, Humza, Hoehn, Bret, Davis, Sarah, Greiner, Russell (2020).
“Effective Ways to Build and Evaluate Individual Survival Distributions.”
*Journal of Machine Learning Research*, **21**(85), 1–63.
https://jmlr.org/papers/v21/18-772.html.

## See also

Other survival measures:
`mlr_measures_surv.calib_alpha`, `mlr_measures_surv.calib_beta`, `mlr_measures_surv.chambless_auc`, `mlr_measures_surv.cindex`, `mlr_measures_surv.graf`, `mlr_measures_surv.hung_auc`, `mlr_measures_surv.intlogloss`, `mlr_measures_surv.logloss`, `mlr_measures_surv.mae`, `mlr_measures_surv.mse`, `mlr_measures_surv.nagelk_r2`, `mlr_measures_surv.oquigley_r2`, `mlr_measures_surv.rcll`, `mlr_measures_surv.rmse`, `mlr_measures_surv.schmid`, `mlr_measures_surv.song_auc`, `mlr_measures_surv.song_tnr`, `mlr_measures_surv.song_tpr`, `mlr_measures_surv.uno_auc`, `mlr_measures_surv.uno_tnr`, `mlr_measures_surv.uno_tpr`, `mlr_measures_surv.xu_r2`

Other calibration survival measures:
`mlr_measures_surv.calib_alpha`, `mlr_measures_surv.calib_beta`

Other distr survival measures:
`mlr_measures_surv.calib_alpha`, `mlr_measures_surv.graf`, `mlr_measures_surv.intlogloss`, `mlr_measures_surv.logloss`, `mlr_measures_surv.rcll`, `mlr_measures_surv.schmid`

## Super classes

`mlr3::Measure` -> `mlr3proba::MeasureSurv` -> `MeasureSurvDCalibration`