Real-world data-driven machine learning in CLL

Identifying patients at risk of infection and treatment in CLL for a clinical trial testing pre-emptive CLL treatment.

Contributors: Carsten Utoft Niemann MD PhD, Rudi Agius PhD.

Link to Paper:

Until a few years ago, we did not realize that a quarter of patients diagnosed with chronic lymphocytic leukemia (CLL) were suffering from infections, and that 10% of those were even dying, prior to any CLL-specific treatment (see Figure 1) [1].

Figure 1. Cumulative incidence for infection, treatment and death in CLL: Aalen-Johansen cumulative incidence estimates for the three outcomes, stacked on top of each other. Each patient could have only one event, namely whichever came first. Thus, infections subsequent to treatment, and vice versa, were not included. Time zero is the time of diagnosis for all patients.
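To make the caption concrete, the following is a minimal sketch of an Aalen-Johansen cumulative incidence estimate for competing first events. It is not the code behind Figure 1: the function name `cumulative_incidence` and the six toy patients are assumptions made purely for illustration.

```python
from collections import Counter


def cumulative_incidence(times, causes):
    """Aalen-Johansen cumulative incidence for competing first events.

    times  : time from diagnosis to the first event or to censoring
    causes : 0 if censored, otherwise a label such as "infection"
    Returns {cause: [(time, cumulative incidence), ...]}.
    """
    event_labels = sorted({c for c in causes if c != 0})
    at_risk = len(times)
    event_free = 1.0  # probability of no event of any kind just before t
    cif = {c: 0.0 for c in event_labels}
    curves = {c: [] for c in event_labels}

    for t in sorted(set(times)):
        at_t = Counter(c for ti, c in zip(times, causes) if ti == t)
        n_events = sum(n for c, n in at_t.items() if c != 0)
        for c in event_labels:
            # Increment: still event-free before t, then cause-c hazard at t.
            cif[c] += event_free * at_t.get(c, 0) / at_risk
            curves[c].append((t, cif[c]))
        event_free *= 1.0 - n_events / at_risk
        at_risk -= sum(at_t.values())  # events and censorings leave the risk set
    return curves


# Toy data (hypothetical): one first event or censoring per patient.
times = [1, 2, 2, 3, 4, 5]
causes = ["infection", "treatment", 0, "infection", "death", 0]
curves = cumulative_incidence(times, causes)
final = {cause: points[-1][1] for cause, points in curves.items()}
```

Because each patient contributes only their first event, the three curves can be stacked, and their sum never exceeds one.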

Realizing this, we wanted to assess whether pre-emptive treatment can re-modulate immune dysfunction in CLL for these patients. To do so, we first needed to identify the patients at the highest risk of serious infection and/or CLL treatment. With access to extensive multidimensional time series of health data through the Danish National CLL Registry [2] and the data lake, we started modelling CLL-TIM at the Medical Information Services Program (MISPCAMP). Here, professionals from a broad range of disciplines, from medicine and biology to bioinformatics and data analysis, were brought together for a week to tackle, amongst other projects, the problem of predicting the risk of infection for patients with CLL. The complexity and beauty of the data at our disposal can be better appreciated from the spaghetti plot of just one variable for 1000 CLL patients (see Figure 2).

Figure 2. Lymphocyte counts visualized for 1000 patients diagnosed with CLL. Time of diagnosis is aligned in the middle of the x-axis for all patients, x-axis depicts time, y-axis depicts lymphocyte counts. The lymphocyte trajectory for a single patient is highlighted in yellow. Image produced by Michael A. Andersen.

What quickly became evident during MISPCAMP was the diverse set of propositions put forward for the modelling approach. Each person had their go-to model and go-to features. Even the definition of an infection as an outcome sparked lengthy discussions. Inevitably, the concept of ensembling was brought up. Through ensembling, we had a means to combine several different viewpoints within one model. After MISPCAMP, the methodology moved towards a more theoretical grounding for building ensembles. The other concern raised during MISPCAMP was the necessity of having a model that is usable in the clinic. We addressed this with a design that can make predictions for all CLL patients (irrespective of missing data) and that provides uncertainty estimates and personalized risk factors for each patient prediction.
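The ensembling idea described above can be sketched in a few lines. This is an illustration only and not the CLL-TIM implementation: the base learners, feature names, thresholds and risk values are all hypothetical, and the spread across base models is used as a deliberately crude stand-in for an uncertainty estimate.

```python
from statistics import mean, pstdev


def make_base_model(feature, threshold, risk_above, risk_below):
    """Hypothetical base learner: a one-variable rule on a single feature."""
    def predict(patient):
        value = patient.get(feature)
        if value is None:  # feature was never measured for this patient
            return None
        return risk_above if value > threshold else risk_below
    return predict


# Each base model encodes one "viewpoint"; all numbers are made up.
BASE_MODELS = [
    make_base_model("lymphocytes", 50.0, 0.7, 0.3),
    make_base_model("age", 70.0, 0.6, 0.4),
    make_base_model("crp", 10.0, 0.8, 0.2),
]
PRIOR_RISK = 0.5  # fallback when no base model can score the patient


def ensemble_predict(patient):
    """Average the base models that are able to score this patient.

    Returns (risk, spread, n_models_used).
    """
    scores = [s for s in (m(patient) for m in BASE_MODELS) if s is not None]
    if not scores:
        return PRIOR_RISK, None, 0
    return mean(scores), pstdev(scores), len(scores)


full = ensemble_predict({"lymphocytes": 80.0, "age": 65.0, "crp": 12.0})
partial = ensemble_predict({"lymphocytes": 80.0})  # two features missing
empty = ensemble_predict({})                       # nothing measured
```

Skipping base models whose inputs are missing is one simple way to return a prediction for every patient; a real system would also need calibrated uncertainty rather than the raw spread used here.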

Figure 3. CLL group in MISPCAMP

Besides the successful development of CLL-TIM, one important revelation was that modelling patients using a single snapshot around the time of diagnosis may not be sufficient for accurate results. This is evidenced by certain patients whose risks could only be modelled successfully once medical history from several years prior to diagnosis was included. Additionally, we found that a variable range of risk factors accounts for a patient's correct risk assignment, emphasizing the need for a complex approach to modelling a complex disease like CLL. In turn, this enabled us, for the first time, to provide personalized risk factors for each CLL patient. Modelling patient data prior to CLL diagnosis, however, brought its own issues. Namely, external cohorts with the level of patient history and data that we had at our disposal were not available. We hope that this work may serve as inspiration for current data management strategies to bring together more detailed patient histories. In fact, data availability, more so than a lack of methods, is the current bottleneck in developing prognostic models.
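The difference between a snapshot at diagnosis and a longer medical history can be illustrated by summarising one laboratory variable over several lookback windows. The sketch below is an assumption-laden example rather than the CLL-TIM feature pipeline: the function name `lookback_features`, the window lengths and the toy lymphocyte values are hypothetical.

```python
def lookback_features(measurements, diagnosis_day, windows=(90, 365, 1825)):
    """Summarise one lab variable over several lookback windows.

    measurements  : list of (day, value) pairs for one patient
    diagnosis_day : day of CLL diagnosis on the same time axis
    windows       : lookback lengths in days (illustrative values)
    """
    ordered = sorted(measurements)  # chronological order
    features = {}
    for w in windows:
        values = [v for day, v in ordered
                  if diagnosis_day - w <= day <= diagnosis_day]
        prefix = f"last_{w}d"
        features[f"{prefix}_n"] = len(values)
        # Missing history is kept as None rather than imputed.
        features[f"{prefix}_mean"] = sum(values) / len(values) if values else None
        features[f"{prefix}_change"] = (
            values[-1] - values[0] if len(values) > 1 else None
        )
    return features


# Toy lymphocyte history (hypothetical), days relative to diagnosis at day 0.
history = [(-1500, 4.0), (-400, 9.0), (-200, 15.0), (-30, 40.0), (0, 55.0)]
feats = lookback_features(history, diagnosis_day=0)
```

In this toy history, the 90-day window sees only the last two measurements, whereas the five-year window captures the whole rise in lymphocyte count, which is the kind of information a single snapshot cannot provide.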

Finally, modelling the joint outcome of infection and treatment served as an eye-opening exercise. By doing so, we observed a synergy that benefited the prediction of both outcomes, despite the outcomes not being clearly linked at the patient level. For future approaches, modelling several clinically relevant outcomes together in one model may not only improve the model itself, but also open a path to identifying common mechanisms of action behind apparently unlinked outcomes. CLL-TIM is now being prospectively tested to assign patients to the investigator-initiated, international PreVent-ACaLL clinical trial (NCT03868722) within the HOVON and Nordic CLL Study Groups, where we aim to improve immune dysfunction in patients with CLL through 12 weeks of treatment with acalabrutinib and venetoclax.

1. Andersen MA, Eriksen CT, Brieghel C, et al: Incidence and predictors of infection among patients prior to treatment of chronic lymphocytic leukemia: a Danish nationwide cohort study. Haematologica 103:e300-e303, 2018

2. da Cunha-Bang C, Geisler CH, Enggaard L, et al: The Danish National Chronic Lymphocytic Leukemia Registry. Clin Epidemiol 8:561-565, 2016