Disease world models are essential for personalised medicine [Work in progress]
LLM based healthcare AI is NOT enough for personalised medicine:
LLMs are not built for personalised medicine: It is becoming more evident every day that current LLMs will one day be as good as human doctors at making diagnoses. However, personalised medicine goes far beyond making a diagnosis. Unfortunately, I do not think that current LLM systems or their future versions are designed for accurate personalised medicine. This is because they can only access partial clinical representations of patients, rather than looking holistically at the whole picture at both the individual and population levels.
World Models for personalised medicine: Recent progress in World Models has shed some light on how to realise personalised medicine for humanity using AI. World Models, such as the Google Genie series, can create physical worlds with interactive virtual environments. By analogy, we could create a Disease World Model that is fully aware of patients’ clinical representations, with an interactive environment for virtual interactions. Such a model would be able to perform virtual screening, helping patients through prevention. The Disease World Model should also be able to provide virtual personal disease progression modelling for optimal treatment planning, and it could even help personalised drug development through virtual interventions.
Contents of this blog: This blog first lays out the definitions of a Disease World Model, which point to a few future research directions and opportunities. Some early use cases with promising results on lung fibrosis are discussed afterwards.
Invitation to collaborators: This is a work in progress, so please feel free to contact me if you find a typo or a mistake, or would like to discuss or collaborate: xumoucheng28@gmail.com.
Figure 1: Comparison between Disease World Models and Large Language Models in AI driven healthcare. LLMs can do diagnosis but Disease World Models can do more, even personalised medicine.
Setup for Disease World Model
Before we dive into the formal definitions, we need to define a few terms:
- $X$: synthetic artifacts, i.e. high-dimensional clinical representations (e.g., medical images, medical images + lab reports) of an individual $i \in \{1, \dots, N\}$.
- $G$: a generative system that produces a temporal sequence of artifacts $\{X_i^t\}_{t=0}^{t=D_i}$, where $X_i^t$ is the artifact for individual $i$ at time stamp $t$, and $D_i$ is the death time point of individual $i$.
- $Y$: real artifacts of an individual $i \in \{1, \dots, N\}$, written $\{Y_i^t\}_{t=S}^{t=E}$, where the real clinical representations are available during the time period between the starting point $t=S$ and the end point $t=E$.
- $M$: a diagnostic agent (e.g., human doctors, clinical LLM) that can map the clinical representations ($X$ and $Y$) to diagnosis. The accuracy of the diagnosis is measured by a metric $\mathcal{L}$.
- $I$: an identification function to recognise the identities of all sequences of the medical artifacts.
Definition 1: Disease World Model
A generative system $G$ is a Disease World Model if its generated artifact sequences $\{X_i^t\}_{t=0}^{t=D_i}$ satisfy the following two properties for all individuals $i$.
- Clinical Reliability: Each generated artifact $X_i^t$ is clinically reliable for all $t$.
- Individual Characterizability: Each generated artifact sequence $\{X_i^t\}_{t=0}^{t=D_i}$ is individually characterizable.
Definition 1.1: Clinical Reliability
An artifact $X_i^t$ is clinically reliable if, for a given diagnostic agent $M$ and accuracy metric $\mathcal{L}$, the following conditions hold:
- Diagnostic Verifiability: The accuracy of a diagnosis derived from the synthetic artifact should be at least a threshold $\alpha$:
$\mathcal{L}(M(X_i^t)) \ge \alpha, \quad t \in \{S, \dots, E\}$
In the strictest scenario, the diagnostic accuracy from the synthetic artifact should be at least as good as that from the real artifact; more formally, we can set $\alpha \ge \mathcal{L}(M(Y_i^t)), \quad t \in \{S, \dots, E\}$. In less strict scenarios, a clinically acceptable threshold $\alpha$ needs to be determined. This condition ensures that the diagnostic utility of the synthetic artifacts is usable and verifiable.
- Temporal Consistency: The diagnostic accuracy is a non-decreasing function of time. For any two time points $t_1$ and $t_2$ such that $t_1 < t_2$, we have:
$\mathcal{L}(M(X_i^{t_1})) \le \mathcal{L}(M(X_i^{t_2})), \quad t_1 < t_2$
This condition ensures the diagnostic utility of the synthetic artifacts is reliable, even for future unseen timepoints.
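The two conditions of Definition 1.1 can be sketched as a simple check. This is only an illustration under stated assumptions: the function name `clinical_reliability` and the list `scores` are hypothetical, and `scores[t]` stands in for an already computed value of $\mathcal{L}(M(X_i^t))$ for $t = 0, \dots, D_i$. How those scores are obtained (which agent $M$, which metric $\mathcal{L}$) is left open here.

```python
from typing import Sequence, Tuple


def clinical_reliability(
    scores: Sequence[float], alpha: float, start: int, end: int
) -> Tuple[bool, bool]:
    """Check Definition 1.1 for one individual.

    scores[t] is assumed to hold the diagnostic accuracy obtained from the
    synthetic artifact at time t (hypothetical, precomputed values).
    start and end are the indices S and E where real artifacts exist.
    Returns (diagnostic_verifiability, temporal_consistency).
    """
    # Diagnostic verifiability: accuracy reaches alpha for every t in {S..E}.
    verifiable = all(s >= alpha for s in scores[start : end + 1])
    # Temporal consistency: accuracy never decreases from one time to the next.
    consistent = all(a <= b for a, b in zip(scores, scores[1:]))
    return verifiable, consistent


# Toy, made-up accuracies for five time points; real artifacts at t = 1..3.
toy_scores = [0.70, 0.82, 0.85, 0.85, 0.91]
print(clinical_reliability(toy_scores, alpha=0.8, start=1, end=3))  # (True, True)
```

Returning the two flags separately makes it visible which of the two conditions fails for a given individual.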
Definition 1.2: Individual Characterizability
A sequence of artifacts $\{X_i^t\}_{t=0}^{t=D_i}$ is individually characterizable if there exists an identification function $I: \mathcal{X} \to \{1, \dots, N\}$ (where $\mathcal{X}$ is the space of all possible sequences) that can identify the individual $i$ from their generated sequence with a high probability $\beta$ close to 1.
$\mathbb{P}(I(\{X_i^t\}_{t=0}^{t=D_i}) = i) \ge \beta$
This ensures that the sequence contains a unique signature of the individual’s disease progression, preventing the model from generating generic or wrong sequences. An example of such an identification function $I$ is an unsupervised contrastive clustering algorithm at a very granular level.
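As a rough illustration of how the probability in Definition 1.2 could be estimated empirically, the sketch below swaps in a much simpler identification function than the contrastive clustering mentioned above: a nearest-reference rule on feature vectors. Everything here is an assumption made for illustration: the names `identify` and `identification_rate`, the idea that each generated sequence is summarised by one feature vector, and the toy numbers.

```python
import math
from typing import Dict, List


def identify(features: List[float], references: Dict[int, List[float]]) -> int:
    """Hypothetical I: return the individual whose reference vector is nearest."""
    return min(references, key=lambda i: math.dist(features, references[i]))


def identification_rate(
    generated: Dict[int, List[float]], references: Dict[int, List[float]]
) -> float:
    """Empirical estimate of P(I(sequence_i) = i) over all individuals."""
    hits = sum(identify(feat, references) == i for i, feat in generated.items())
    return hits / len(generated)


# Toy, made-up feature vectors: one reference and one generated summary each.
references = {1: [0.0, 0.0], 2: [1.0, 1.0], 3: [0.0, 2.0]}
generated = {1: [0.1, -0.1], 2: [0.9, 1.2], 3: [0.8, 1.1]}

rate = identification_rate(generated, references)
beta = 0.95
print(rate >= beta)  # False: individual 3 is mistaken for individual 2
```

In this toy case the generated sequence for individual 3 sits closer to individual 2's reference, which is the kind of generic or wrong sequence the condition is meant to rule out.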
How to train such a Disease World Model
Figure 2: A proposal for the training strategy of a disease world model.
An early use case: lung fibrosis disease progression modelling
4D VQ-GAN: We built an AI model called 4D VQ-GAN for disease progression modelling of lung fibrosis that partially satisfies the clinical reliability condition of a disease world model. In the context of disease progression, we can treat temporal medical imaging as the environment, by analogy with the use of videos as environments in physical world models. As a proof of concept, our early attempt focuses on only one interactive action with the virtual progression trajectories, namely time. The technical details of this preliminary study can be found in the published conference paper: 4D VQGAN. Given two 3D CT scans of an Idiopathic Pulmonary Fibrosis (IPF) patient at irregular time points, 4D VQ-GAN can generate synthetic 3D images at any desired time point, effectively modelling a virtual continuous disease progression trajectory for each individual. More importantly, we found that biomarkers derived from the generated CT volumes correlate strongly with survival outcome, partially satisfying the clinical reliability condition as defined above and highlighting the potential of 4D VQ-GAN for personalised treatment planning and other personalised medicine tasks.
Figure 3: An example of a generated synthetic imaging artifact sequence compared against the real sequence. Note that the real sequence only contains scans at year 0, year 2 and year 3.5, whereas the generated sequence covers more time points: year 0, year 1.5, year 2, year 3.5 and year 5.5.
Figure 4: Highlighted key visual features of lung fibrosis in generated imaging artifacts. A zoomed region of the left lower lobe (yellow box) in the real and generated CT scans shows comparable amounts of architectural distortion, patterned ground glass opacification and reticulation, all hallmarks of lung fibrosis. The availability of our scans is not uniform across time or across patients, so the model is trained on scans at irregular time points.
Verifiable Clinical Reliability: We explore our model’s clinical utility using a survival-analysis-based approach that mimics clinical workflows. Radiologists track prognostic imaging biomarkers in IPF over time to assess disease progression. Though we lack comprehensive visual scores for all cases, we propose a method that mirrors this workflow: selecting key prognostic biomarkers, analysing their longitudinal changes, and comparing their prognostic value in synthesised vs. real scans. The extracted imaging biomarkers, along with the covariates, are input into a Cox model to assess their prognostic value on the test dataset. This analysis evaluates the consistency of biomarker trajectories between real and synthetic scans, and explores the potential utility of synthetic scans in tracking changes over time. As the results below show, the generated CT images can yield survival predictions that are sometimes even more accurate than those derived from the real images. However, these results may be overestimated due to the limited sample size.
Figure 5: For the cross-sectional imaging biomarker, the C-index using the generated third scans is 0.943. In comparison, using biomarkers derived from real CT scans for survival prediction yields a slightly lower C-index of 0.914. Next, we compute the longitudinal biomarker by evaluating the change in these top five significant biomarkers over the course of one year, both for real and generated CT scans. By inputting these longitudinal biomarkers along with covariates into the Cox model, we obtain C-indices of 1.0 for both real and generated CT scans.
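For readers unfamiliar with the C-index reported above, here is a minimal sketch of Harrell's concordance index computed from risk scores. It is not the evaluation code used in the study: the function name `concordance_index` and the toy data are assumptions, and the risk scores are taken as given rather than produced by a fitted Cox model.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float], events: Sequence[int], risks: Sequence[float]
) -> float:
    """Harrell's C-index: share of comparable pairs that are ordered correctly.

    A pair (i, j) is comparable when subject i has the strictly shorter time
    and an observed event (events[i] == 1). The pair is concordant when the
    subject with the shorter survival has the higher risk score; tied risk
    scores count as half. Pairs with tied times are skipped in this sketch.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy, made-up data: survival times in years, event flags (0 = censored),
# and risk scores where a higher value means a worse predicted outcome.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.8, 0.1]
print(concordance_index(times, events, risks))  # 0.8
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking of every comparable pair, which is easier to reach when the test set contains few comparable pairs.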
Although the existing 4D VQ-GAN is a step towards a disease world model with promising results, further study is still required to verify whether it also satisfies the individual characterizability condition; this has not yet been done owing to limited time and computational resources.