SAVI 2026 — Anytime-valid log-rank testing for randomised trials

This page hosts materials for the poster Anytime-valid log-rank testing for randomised trials, presented at SAVI 2026 (Sequential Anytime-Valid Inference).

Authors. Joren Brunekreef¹, Renée X. Menezes², Rianne de Heide³,⁴

¹ Netherlands Cancer Institute, Department of Radiotherapy and AI for Oncology, Amsterdam, NL · ² Netherlands Cancer Institute, Biostatistics Centre and Department of Psychosocial Research and Epidemiology, Amsterdam, NL · ³ University of Twente, Enschede, NL · ⁴ Centrum Wiskunde & Informatica, Amsterdam, NL

Poster

Download the poster (PDF)

Summary

Anytime-valid (AV) log-rank tests (Ter Schure et al. 2024) monitor survival trials continuously with Type-I error control. We characterise the fixed-δ AV log-rank test by simulation, generalising classical fixed-n power to a power-over-time curve. At the classical trial end, AV’s power sits just below the classical test’s empirically realised power, the small price of anytime-validity. Past that end time, a simple continuation rule extends the trial only when the evidence is suggestive but not yet decisive, and AV’s total rejection rate closes the gap and overtakes classical power, at a bounded and visible cost in extra follow-up. We then deploy the framework retrospectively on a real randomised trial.

Further detail

A few things that didn’t fit on the poster.

Continuation-rule design space

The poster shows one continuation rule with two pre-specified cutoffs: a futility threshold τ on the e-value at the classical trial end, and an extended end time that caps how long monitoring may continue. Their joint setting controls the four-way outcome split shown in the stacked-bar figure (reject early, reject late, futility stop, inconclusive).

The futility cutoff τ filters which trials are allowed to continue past the classical trial end. A low τ admits most borderline trials into the extension, recovering more late rejections at the cost of more inconclusive continuations. A high τ stops aggressively at the classical trial end and gives up rejections that would have crossed under continuation.

The extended end time bounds how long the extension can run. A short extension limits the inconclusive tail but cuts off late rejections that would have crossed given more time. A long extension accepts a larger fraction of futile continuations in exchange for more late rejections.

Choosing the futility cutoff and the extended end time is therefore a deliberate trade between late rejections and wasted continuations.
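The four-way split can be made concrete with a minimal sketch. This is an illustration, not the simulation code behind the poster: the function name, the list-of-e-values input, and the index-based encoding of the classical and extended end times are all hypothetical, and the sketch assumes that rejection means the e-value reaching 1/α and that a trial continues only if its e-value at the classical trial end is at least τ.

```python
from typing import Sequence


def classify_trial(e_values: Sequence[float], end_index: int,
                   tau: float, alpha: float = 0.05) -> str:
    """Classify one e-value trajectory under a (tau, extended end) rule.

    Hypothetical encoding: e_values[i] is the e-value at monitoring step i,
    end_index marks the classical trial end, and the last element marks
    the extended end time.
    """
    threshold = 1.0 / alpha
    # Reject as soon as the e-value reaches 1/alpha at or before the classical end.
    if any(e >= threshold for e in e_values[: end_index + 1]):
        return "reject early"
    # Evidence at the classical end is below tau: stop for futility.
    if e_values[end_index] < tau:
        return "futility stop"
    # Otherwise keep monitoring until the extended end time.
    if any(e >= threshold for e in e_values[end_index + 1:]):
        return "reject late"
    return "inconclusive"


# Toy trajectories, not simulation output.
print(classify_trial([1, 3, 5, 22], end_index=2, tau=2))
print(classify_trial([1, 0.5, 0.8, 30], end_index=2, tau=2))
```

In the second toy trajectory the e-value would have crossed 1/α during the extension, but the futility cutoff stops the trial first. That is the kind of rejection a high τ gives up.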

Extended outcome decomposition under a sweep of (τ, m) continuation rules, where τ is the futility cutoff and m sets the length of the extension, on the same misspec_sweep_06 scenario as the poster (assumed HR 0.75, true HR 0.81, n=1000, t_max=400, 10,000 reps). Dashed line: classical log-rank's empirical rejection rate.

The “stop at t_max” baseline (no continuation) sits at AV’s reject-early mass alone. Each (τ, m) bar carves the remaining grey into late rejections (light green), futility stops (grey), and inconclusive continuations (red). A low τ admits most borderline trials into the extension and grows both the late-rejection and inconclusive masses. A high τ converts most of that into early futility stops. A longer extension (m=1.5 vs m=1.25) buys a few more late rejections at any τ.

What AV log-rank doesn’t give you

The fixed-δ AV log-rank deployed here is one concrete construction. Several limitations are worth stating.

Cohort-complete monitoring only. The test martingale requires every patient to be enrolled before monitoring starts, so that the risk set is fixed at the moment the martingale begins. During-accrual monitoring (looking at e-values while new patients are still being randomised) cannot be done within this construction, because the risk set changes as patients enter and the martingale cannot accommodate that without breaking Type-I error control. It is not yet known whether this can be resolved with a more sophisticated mathematical construction. One natural direction is a cohort-batched variant that updates the martingale only between accrual blocks, but we have not pursued it. This restriction has practical consequences for head-to-head comparisons against real trials; see Event time vs calendar time below.

Fixed-δ is one choice among many. Our framework uses a single assumed hazard ratio δ throughout, and the resulting e-value is GROW (growth-rate optimal in the worst case) against that simple point alternative. It is not growth-optimal if the true hazard ratio differs from δ, and a misspecified δ costs power, as the nominal-versus-empirical power curves on the poster show. The wider literature offers several alternatives we did not deploy. Ter Schure et al. (2024) construct e-values that learn from the accumulating data; our initial explorations indicated these do not generally lead to higher power than the misspecified fixed-δ construction used here, but the discrepancy is worth investigating in more detail. Baas, ter Schure & van Rosmalen (2026) construct design-optimal e-values that target specific objectives (maximising power at a horizon, minimising expected sample size) rather than the design-agnostic GROW or fixed-δ defaults. Composite alternatives (where δ ranges over an interval rather than a point) are another direction we have not yet pursued.
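To make the fixed-δ construction concrete, here is a minimal sketch of how such an e-value can be accumulated event by event. It reflects our reading of the partial-likelihood-ratio form for the case without tied event times, and it is not the code used for the poster: the function name and the (at-risk in treatment, at-risk in control, event-in-treatment flag) input format are hypothetical.

```python
from typing import Iterable, List, Tuple


def av_logrank_e_values(events: Iterable[Tuple[int, int, bool]],
                        delta: float) -> List[float]:
    """Running fixed-delta e-value, one multiplicative update per event.

    Each event is (y_treat, y_ctrl, in_treat): the numbers at risk in each
    arm just before the event, and whether the event occurred in the
    treatment arm. Assumes no tied event times.
    """
    e = 1.0
    trajectory = []
    for y_treat, y_ctrl, in_treat in events:
        # Ratio of the conditional probability of the observed arm under
        # hazard ratio delta to the same probability under hazard ratio 1.
        factor = (y_treat + y_ctrl) / (y_ctrl + delta * y_treat)
        if in_treat:
            factor *= delta
        e *= factor
        trajectory.append(e)
    return trajectory


# Toy input: a control-arm event, then a treatment-arm event.
print(av_logrank_e_values([(10, 10, False), (10, 9, True)], delta=0.5))
```

With δ < 1, an event in the control arm multiplies the e-value by a factor above 1 and an event in the treatment arm by a factor below 1. Under the null each factor has conditional expectation 1, which is what makes the running product a test martingale.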

No estimation, just testing. The output is a valid sequential test. Anytime-valid confidence sequences for the hazard ratio exist in the broader literature but are not what we deploy here.

One-sided. Our construction is framed against δ < 1 (treatment helps). A two-sided test would need a different construction.

No free lunch on power. At the classical trial end, AV’s rejection rate sits below the classical test’s empirically realised power. The continuation rule closes that gap and then overtakes classical power, but the gain is paid for in additional follow-up time.

Event time vs calendar time

The martingale’s natural clock is events, but trials are experienced in calendar time. An event-count saving is therefore only half the picture. Translating a saving in events into a saving in weeks or months requires the accrual schedule and the event-rate curve over calendar time.

The cohort-complete restriction (see above) has a real cost here. Classical interim triggers like Haybittle-Peto act on every event observed up to the interim, including events accumulated during the accrual window. Our martingale, by contrast, only starts counting once the last patient is randomised, with subsequent times shifted so that events are measured in patient-time rather than calendar-time. Even if AV and a classical interim end up crossing at the same event count, the classical interim will typically cross earlier in calendar time. This is a genuine handicap on head-to-head calendar-time comparisons.

For a real deployment, or for a retrospective that can claim calendar-time savings, we need individual patient data with enrolment and event dates preserved. That is a natural next step.

Behind the retrospective

Why this trial. Individual patient data for the Intergroup adjuvant colon trial (Moertel 1990) are publicly available through R’s survival::colon dataset. The two-arm subset (Levamisole + 5-FU vs Observation, n=619) is clean and sizeable, and both the classical log-rank and our AV log-rank reject the null on it. We use it as a concordance demonstration that the framework is sensible on real trial data.

The interim trigger. Moertel et al. report a three-arm group-sequential design with O’Brien-Fleming boundaries. The 192-event count we cite for the interim trigger is the projection of the trial’s second interim into the two-arm subset (114 Observation + 78 Lev+5FU deaths). It is the closest comparable event count we can extract from the original paper.

AV’s rejection point. Our AV log-rank crosses 1/α at 207 events. That is 84 events earlier than the trial’s 291-event follow-up cap, and 15 events later than the interim trigger. The first number is the saving; the second is the cost of not having a pre-specified analysis schedule.
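The event-count arithmetic behind these numbers is short enough to check directly. The variable names below are ours; the counts are the ones quoted in the text.

```python
# Event counts quoted in the text for the two-arm colon subset.
interim_events = 114 + 78    # Observation + Lev+5FU deaths at the projected interim
av_rejection_events = 207    # events observed when the AV e-value crosses 1/alpha
follow_up_cap_events = 291   # the trial's follow-up cap

saving_vs_cap = follow_up_cap_events - av_rejection_events
delay_vs_interim = av_rejection_events - interim_events
print(saving_vs_cap, delay_vs_interim)  # 84 15
```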

Calendar time? The 84-event saving is measured in patient-time from randomisation. Translating it into a calendar-time saving requires enrolment dates that the public colon dataset does not carry, and the cohort-complete restriction means the classical interim would in any case cross earlier in calendar time than the event count alone suggests. See Event time vs calendar time above for the general point.

What we learn. AV reaches the same conclusion as the interim trigger without any pre-specified analysis schedule and saves 84 events compared with the cap. That is the value proposition: the same scientific verdict from continuous monitoring instead of locked analysis points.

Caveat. This trial was a clear positive. AV’s behaviour on an inconclusive or marginal trial is a separate question.

Kaplan-Meier survival curves for the two arms (Lev+5FU in NKI blue, Observation in lapis) with the AV e-value trajectory in green on a log scale. Vertical guides mark the interim trigger (192 events), the AV rejection point (207 events), and the trial's 291-event follow-up cap.

References

The poster cites the safe-test / e-value / AV log-rank literature; the most directly relevant references are:

Ter Schure, J., Pérez-Ortiz, M.F., Ly, A., & Grünwald, P. (2024). The Anytime-Valid Logrank Test: Error Control Under Continuous Monitoring with Unlimited Horizon. The New England Journal of Statistics in Data Science, 2(2), 190–214.
Grünwald, P., de Heide, R., & Koolen, W.M. (2024). Safe testing. Journal of the Royal Statistical Society B, 86(5), 1091–1128.
Ramdas, A., Grünwald, P., Vovk, V., & Shafer, G. (2023). Game-theoretic statistics and safe anytime-valid inference. Statistical Science, 38(4), 576–601.
Baas, S., ter Schure, J., & van Rosmalen, J. (2026). Adaptive clinical trials based on design-optimal e-values with automatic curtailment: An application to single-arm trials with binary data. arXiv:2605.28653.
Moertel, C.G., Fleming, T.R., Macdonald, J.S., et al. (1990). Levamisole and fluorouracil for adjuvant therapy of resected colon carcinoma. New England Journal of Medicine, 322, 352–358.

Contact

For questions or follow-ups, reach me at j.brunekreef@nki.nl.

Disclaimer: the text on this page was partially drafted by Claude Opus 4.8 and subsequently edited by me. I take responsibility for the final version.