Feedback on the censored package for survival analysis with tidymodels

hannah · November 2, 2021, 9:18pm

Survival analysis is an important field in modelling and while there are many R packages available implementing various models, tidymodels so far has not been as feature-rich for survival analysis as we'd like it to be.

We have been working on the new censored package which, along with parsnip, offers several new models, engines, and prediction types. The censored package is not on CRAN yet but we are looking for early adopters to try it out and give us feedback! You can install it via

devtools::install_github("tidymodels/censored")

More details on what we already have and what we plan to do next are in the blog post Survival Analysis in tidymodels!

We'd love to hear from you! Some questions we have include: Would you find this useful? Are there particular models/methods you'd like to see? Do you use time-dependent effects? Do you use interactions? Which types of censoring do you use?

Please let us know in the replies to this post!

rafazam · November 3, 2021, 3:56pm

Congratulations on this great package !! It would be a good idea to incorporate assumption testing printing, python's lifelines package has a good implementation of this Testing the proportional hazard assumptions — lifelines 0.26.3 documentation?

Max · November 3, 2021, 8:06pm

We could add these bits later on. Some of the main survival infrastructure packages are not easily extended so it may take some work.

schelhorn · November 4, 2021, 9:16am

Hi there, this is a hugely welcome addition to the R stack for the pharma community! Thanks so much for working on this.

Apart from the issues which are still on your ToDo list, I would like to highlight a particular model which has been shown to work very well in benchmarks as well as in practice, Multi-task Logistic Regression (MTLR).

I submitted a feature request on your GitHub page.

hannah · November 4, 2021, 9:43am

Thank you for the great issue you posted on GitHub! We'll look into MTLR.

Frequentist · November 4, 2021, 5:23pm

If you still plan the roadmap, please consider adding the advanced and cutting edge tools, necessary in advanced analysis (I use them on regular basis, cannot imagine completing my daily work without them), especially in Pharma (my field is the clinical trials):

Fine-Gray sub-distribution hazards for competing risks + Aalen-Johansen estimator of the CIF
counting processes: Andersen-Gill, Prentice-Williams-Peterson, Wei-Lin-Weissfeld models for recurrent events (each has own properties making it relevant in certain scenarios)
Multi-State models for handling both competing risks and recurrent events at the same time. Actually, lots of analyses of real data, especially in pharma, involve both competing risks and recurrent events (of many types) at the same time. MSM constitutes a true workhorse here, along with joint frailty model
frailty models (e.g. via frailtypack), including: nested, shared and joint models. Joint frailty model handles both competing risks and recurrent events.
interval censoring - happened a few times I needed it
AFT (accelerated failure time) parametric models (usually I use Weibull; would be great to assess the fit vs it and other distributions too)
advanced Cox with: stratified and piecewise, with clustering and time-varying covariates + analysis of deviance (e.g. via emmeans package which simplifies testing specific contrasts) and weighting
censored quantile mixed-effect regression to model quantiles of the survival time with covariates + the appropriate analysis of deviance (ANCOVA) to obtain the main and interaction effects + contrasts
modern tests for comparing unadjusted survival curves, like the one proposed by the Group for the Non-Proportional Hazards (an initiative by the FDA) "Max-Combo" test (a combo of Fleming-Harrington tests for 4 combination of parameters), which is likely IMO to replace Log-rank soon. Other tests would be welcomed too, as often we need to plan the kind of effect to detect (early, delayed, crossed, diminishing), so having Tharone-Ware, Peto-Peto (+Andersen), Wilcoxon, Gehan-Breslow, Renyi's at my disposal would be very convenient.

PS: would be nice to have an easy way to adjust p-values, e.g. Bonferroni, Holm, Hochberg, fallback-sequential (the last 3 ones are especially important in trials with multiple and ordered hypotheses), but can be done easily manually to, e.g. via emmeans and other packages (e.g. Mediana, onlineFDR, multcomp, etc). But that's minor thing.

Restricted Mean Survival Time along with confidence intervals and tests for comparing them
confidence intervals for quantile of the survival time + tests
support for the Mean Cumulative Function + CIs
landmark analysis would be welcomed too
visualising interactions on various scales, e.g. HR, hazard rate would be nice!
Pepe-Mori, Gray tests
Beta Product Confidence Procedure for small samples in case of heavy censoring
Gamma and Risk Score imputation (for informative censoring)

I hope I didn't miss anything

If you need more details, please don't hesitate to either ask here or email me. I'm available also on the LinkedIn.

hannah · November 5, 2021, 7:23pm

Thanks for the comprehensive list! Competing risks and recurring events are definitely things we've been hearing in a few places now. Stratified Cox models are already available in censored, as well accelerated failure time models:

library(survival)
library(censored)
#> Loading required package: parsnip

# via survival
# (Chapter 4 of https://cran.r-project.org/web/packages/survival/vignettes/survival.pdf)
fit_s <- survreg(Surv(time, status) ~ age + sex + ph.karno, data = lung, dist = "weibull")
predict(fit_s, lung[1:3,])
#>        1        2        3 
#> 354.5663 374.0385 416.2499
fit_strata_s <- survreg(Surv(time, status) ~ sex + age + ph.karno + strata(sex),  data = lung)
predict(fit_strata_s, lung[1:3,])
#>        1        2        3 
#> 364.0834 379.2643 411.5515

# via censored
fit_c <- survival_reg(engine = "survival", dist = "weibull") %>%
  fit(Surv(time, status) ~ age + sex + ph.karno, data = lung)
predict(fit_c, lung[1:3,], type = "time")
#> # A tibble: 3 × 1
#>   .pred_time
#>        <dbl>
#> 1       355.
#> 2       374.
#> 3       416.
fit_strata_c <- survival_reg(engine = "survival", dist = "weibull") %>%
  fit(Surv(time, status) ~ age + sex + ph.karno + strata(sex), data = lung)
predict(fit_strata_c, lung[1:3,], type = "time")
#> # A tibble: 3 × 1
#>   .pred_time
#>        <dbl>
#> 1       364.
#> 2       379.
#> 3       412.

^{Created on 2021-11-05 by the reprex package (v2.0.1)}

As for the survival curves: This is a by-product of how it's plotted, I've updated the blog post to reflect that it's an approximation. The probability estimates are based on the step function: predictions for time points between event times are the estimate of the previous event time, i.e., the value of the corresponding horizontal stretch of the survival curve.

Frequentist · November 6, 2021, 11:08pm

You're very welcome! I'm glad if it helped a bit This is a list of really (not just theoretically) used tools, coming from a real practice in a fully R-based CRO, where I work.

The frailty models (joint) and Multi-State models are absolute must-have, as they handle both repetitions and competitions at once. Fine-Gray and Andersen-Gill are definitely nice to have, but in more advanced trials we often have both multiple competing events (SAEs, deaths for different reason, PI's decision about patient's withdrawal, etc.) and often find recurrences (recurrent fractures, changes in therapy, change in the stage of a disease (e.g. cancer), surgeries, etc) except "progression-free survival" or "overall survival" , where we search for the first occurrence; for death - the first and only.

AFT are nice for non-proportional hazards and if we're interested truly in the time to event to be modelled. Time-varying covariates, weighting and piece-wise Cox is invaluable in non-PH and non-constant risks over time. Non-constatnt effects can be handled via strata(), but it's already done, as you pointed out. Clustering helps for natural clusters like in mixed-models, when we have additional sources of within-subject (co)variance.

Regarding the survival curves - I'm not sure why this issue happened. There are several ways to generate proper survival curves. You can start with SurvMiner::ggsurvplot, which does everything for you (for doesn't allows for >2 faceting, which is quite common) or ggquickeda with a set of "geom_" functions: geom_step, geom_kmticks for censored observations + geom_kmband for the CIs.

It's also simple to generate the proper survival curves in ggplot2 without any 3rd party package, just with geom_step. This is my preferred way. I don't use any external packages for drawing survival curves, as the packages are usually too limited for my needs (and often too bloated, with tons of unresolved issues, introducing too much risk to my analysis), so I prefer to draw it on my own to control everything I need. Luckily, it's not that complicated

Let me show you two examples from my real analyses (data changed a bit).

The first one is made with the geom_xxx from the ggquickeda package. With a small trick (geom-text placed at some bottom area filled with white rectangle) I was able to add the simple table with patients at risk and events - practically always mandatory in publications. It supports faceting (facet_wrap / grid) and I was able to add the simple table with number of events at each panel.

I can share the code with you, if you wish, so maybe you can add it (or take some ideas?) to your toolbox?

And the second one is made manually, using only the ggplot2
geom_step + patchwork to combine the curves with the survival table (cowplot is still a worth alternative, as patchwork has some issues):

/ sorry, the forum doesn't allow me to put 2 screens I'll add it in a 2nd comment below /

The key moments in the code are as follows (this is from another graph, where I compare the naive KM estimator and 1-CIF for the curve adjusted for competing risk).

Note: I could create appropriate data.frames with all stuff pre-calculated, so I don't need to put stuff into ggplot2 calls, but I'm in a hurry a bit I hope it's still readable The bind_rows just adds a missing 0-timepoint, nothing more.

Confidence intervals are easy to add via geom_ribbon (the CIs are already in the fit object)

# classic survival
fit <- survfit( Surv(time_last, event) ~ 1, data = result_all, conf.type = "log-log")

# with competing risks via Multi-State model

fit_comp <- survfit( Surv(time_last, event, type="mstate") ~ 1, data = result_com, conf.type = "log-log")

ggplot() +

  # steps
  geom_step(data=with(fit, rbind(c(time=0, surv=1), data.frame(time, surv))), 
            aes(x=time, y=surv, linetype = "Date of discontinuation", col="All events combined"), width=1)  +

  # censoring ticks
    geom_linerange(data=with(fit, rbind(c(time=0, surv=1, n.censor=0), data.frame(time, surv, n.censor))) %>% filter(n.censor>0),
                 aes(x=time, y=surv, ymin=surv-tick_size, ymax=surv+tick_size, col = "All events combined"), show.legend = FALSE) +

   # 1-CIF for the "Decision" curve adjusted for competing risks. Can be ignored, but wanted to show you it's easy too.

    geom_step(data=with(fit_disc_com, rbind(c(time=0, surv=1, n.censor=0), data.frame(time, surv=1-pstate[,2]))), 
              aes(x=time, y=surv, linetype = "Date of discontinuation", col="1-CIF for PI's decision adjusted for death & SAE"), width=1)

Curves that have non-vertical steps aren't survival curves, making no statistical sense. If published, would be questioned and rejected. The probability drops instantly when an event occurs. There is no any intermediate probabilities making "sloped/skewed" steps, this has no interpretation.

I hope this helps a bit. I'll be happy to discuss it in greater details if you need.

Would you consider correcting the blog post to reflect the above, and obtain the proper survival curves? I can offer some support. We can contact also via email or schedule a TC, if you wish

Frequentist · November 6, 2021, 11:09pm

That's the second graph done manually, using a similar code to the one given in my post above:

hannah · November 8, 2021, 7:56pm

Thank you for the extended feedback. Our goal is to provide an interface to easily generate predictions of survival probability at various time points, across various models. I've changed the visualization of the predictions from the penalized Cox model to a step function with ggplot2::geom_step(). For more detailed survival curve plotting we recommend using a dedicated package, for example, those suggested by you here.

Frequentist · November 8, 2021, 8:16pm

Thank you! Perfect and quick reaction to a feedback. Now it looks all good!

/ You might consider starting the survival probability axis from 0, as it's the traditional way of presenting it, and allows one to quicker realize whether the median survival time is reached or not. It's contrary to CIF curves, which often don't reach 1, but it's not an important issue. Everyone can adjust the axes per needs, so please treat this only as a suggestion. /

hannah · November 9, 2021, 10:39am

Cheers! Good point about the axis, I'll do that in future.

mattwarkentin · November 9, 2021, 8:04pm

I'm so excited that the tidymodels team is dedicating some focus on survival analysis, which has a special place in my work/heart. I'm mindful of not wanting to request the moon, so here is a relatively focused list of my preferences:

My list of features I would love to see:

Time-dependent performance metrics
- This is probably the area that I would love to see a consensus API be formed. I can think of probably 5+ time-dependent AUC packages off the top of my head. Each one generally implements one or many of the available methods for defining tdAUC, and the expected inputs can vary (some require artificial censoring at the chosen horizon, some do this internally). Some use IPCW, some don't; it's all over the place. The inclusion of time-dependent metrics API for yardstick would be an amazing addition
- Candidate time dependent metrics: AUC, C-index, Brier, Sens/Spec/Recall/Precision/etc. (basically any/all of the standard classification metrics would be amenable to time-dependent flavours)
- Simply implementing some of the artificial censoring algorithms (e.g. C/D) and leveraging existing yardstick functions would go a long way
Support for competing risks models (e.g. Fine-Gray, cause-specific hazards modeling)
- I think {flexsurv} and {riskRegression} lead the way in this respect from what I've seen/used. I'm also quite fond of the prodlim package for various non-parametric estimators like Kaplan-Meier and Aalen-Johansen.
I echo the previous sentiment about frailty models, AFT, interval censoring, etc. but these are really just flavours of standard models and should probably come somewhat for free with more general model implementations
As far as predictions, I would say you would get most of your mileage from 'time', 'survival', 'risk', 'cumhaz', and 'hazard'. Others like restricted mean survival or quantiles are nice.
Multi-state models - this may be out of scope for the tidymodels world because of how weird multi-state modeling can get. Thankfully (AFAIK), there are really two main packages for multi state models in R and they are both rich in features. mstate and msm. Building an API for multi-state models would be a bit daunting, so this is my longshot suggestion.

Anyway. I am happy to provide any insights into helping to fully integrate the flexsurv package. I am a long-time user and also a contributor to that package. I added the predict, tidy, glance. and augment methods. Please do not hesitate to let me know if there is anything I can help with or if those features need any extra work for tidymodels support.

mattwarkentin · November 9, 2021, 8:44pm

Oh and while I'm thinking of it, I would suggest supporting flexsurv::flexsurvspline() in addition to flexsurv::flexsurvreg() for survival_reg(). The latter is based on known parametric distributions (e.g. Weibull, etc.), while the former supports arbitrarily flexible baseline (cumulative) hazards with natural cubic splines (Royston-Parmar model).

Maybe some sentinel function like dist_spline(k, knots, bknots, scale = 'hazard', timescale = 'log') that defines the spline model, where k is mutually exclusive with knots/bknots. An API like:

survival_reg(engine = 'flexsurv', dist = dist_spline(k = tune()))

This might require adding a dials::knots() function, which would be very similar to the dials::deg_free() function.

Max · November 9, 2021, 11:25pm

About the ROC metrics... there are a lot of them but almost all are designed for the case where you want to measure the ROC AUC at time points to measure the effectiveness of one or more biomarkers. In other words, they are not really designed to measure the effectiveness of predicted values. They could be used to measure model performance but it seems like we would not be using the right tool for the job.

However, the "alternate estimator" of Chambless and Diao (2006, Sect 3.3) seems ready-made for our use case and the second author gave us some prototype code that we will adapt. The papers that I've seen indicate that the statistical characteristics of this estimate are good.

This may be the only ROC metric that we expose for the foreseeable future.

mattwarkentin · November 10, 2021, 4:37pm

there are a lot of them but almost all are designed for the case where you want to measure the ROC AUC at time points to measure the effectiveness of one or more biomarkers . In other words, they are not really designed to measure the effectiveness of predicted values.

Hmm. Could you explain the difference? To me, it seemed like if the one "biomarker" was a model-predicted probability then it would work just fine. Maybe I'm missing something.

Max · November 10, 2021, 5:25pm

A biomarker, or some other generic predictor, has very little known about it (e.g., distributional assumptions, sampling properties, etc.). Because of this, things need to be derived from the training set (or by some other methods or assumptions).

Most importantly, more ROC methods for censored data are not predictive in the sense that we have some other data held aside to measure performance (so it is a lot like re-predicting the training set). The same data are being used to derive distributional properties and evaluate the model. While I have not evaluated this, usage of a low bias black box method might generate overly optimistic estimates of performance.

A predicted survival probability (or other estimate) is a different type of quantity and we know more about it. Doing these operations listed above is, at best, inefficient and might lead to a poor estimate of the ROC statistics. There is a lot more research that could be done on this.

Is it a big deal? Maybe. However, I think that we should start with a technique that is designed for this specific task.

runjin326 · December 1, 2021, 1:32pm

Thanks so much for putting together this super helpful package! We were interested in the boost_tree function so we tried running the following

fit_results <- boost_tree(mtry=i, trees=j, min_n=p) %>% 
          set_engine("mboost") %>% 
          set_mode("censored regression") %>% 
          fit.model_spec(tidymodel.formula,
                         data = data.train)

pred_results <- predict(fit_results, 
                                type = 'survival', 
                                new_data = data.test, 
                                time=seq(0, max(input_data$OS_days), 1))

The question is, in neither output, is there a place where we can get an aggregated hazard/risk score that we can use for C-index calculation? Thanks so much!

Max · December 1, 2021, 7:11pm

The censored and parsnip packages are focused on model fitting. We will be adding metrics to the yardstick package in time.

hannah · January 5, 2022, 7:12pm

This topic was automatically closed 35 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.