ML has shown promise for a wide range of applications in the field of medicine due to its ability to process and analyze large volumes of data in a relatively short amount of time [69]. Today, medical data is being generated at an unprecedented rate, providing further incentive for the use of ML, which has already been applied to analyze electronic health records, medical imaging, genomic data, and wearable sensor data [146, 128, 29]. ML can assist in administrative tasks, such as organizing and analyzing patient data [163], categorizing medical records, and optimizing hospital bed allocation [24]. These applications can lead to more efficient hospital operations and improved patient care. ML has also been used for medical applications of natural language processing to extract relevant information from clinical notes and other unstructured data sources [158, 60]. Automated processing of medical data, such as symptoms, diagnoses, and treatments, has the potential to improve efficiency in the clinic by replacing the manual labor involved in such tasks [143]. It could also be used to improve clinical documentation and medical education, and to reduce errors in medical coding [76].
Furthermore, ML has tremendous potential in medical imaging applications due to the speed at which it can process and analyze data. The availability of medical imaging tools and improved early detection techniques has increased the use of all imaging modalities; however, this rise in medical imaging data has not been matched by an equivalent increase in the number of specialized radiation oncologists [51]. As a result, there is a pressing need for more effective and resource-efficient radiotherapy and diagnostic radiology workflows that can alleviate the burden on medical staff.
ML algorithms can automate certain aspects of medical imaging analysis, such as computer-aided detection and diagnosis. By taking over time-consuming and repetitive tasks, they free up medical staff and allow them to focus on more critical aspects of patient care. Overall, the potential applications of medical imaging and ML in healthcare are vast and exciting, with the potential to revolutionize the field and improve patient outcomes [98].
The popularity of ML solutions for medical applications has also increased due to the general hype around the ML field. Publicly available datasets and ML challenges with prizes (such as publications in high-impact journals or cash) have created an attractive environment for researchers to test and develop new ML algorithms, driving innovation in the field, but at the same time raising concerns about the quality of the published data [167].
ML applications in radiotherapy
ML has potential applications in nearly all stages of the radiotherapy workflow [61, 11]. In connection with medical imaging, ML techniques can improve the speed and quality of data collection, manipulate the collected images to gain more information about the underlying anatomy, or support various downstream processing tasks.
In the data collection stage, ML methods that speed up scanning or improve signal quality are inherently linked to the scanning equipment. In MRI, the imaging process can be accelerated by acquiring a smaller portion of k-space and restoring the higher-frequency details using ML. Here, an active decision has to be made when setting up the scanner parameters: faster but lower-quality images are acquired, knowing that ML solutions will restore their overall quality [111]. Some MRI scanners also have built-in ML-based solutions that improve the SNR of the image. In the case of CT, ML can be used to generate high-dose-quality scans from low-dose acquisitions [62].
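As a toy illustration of the undersampling idea, the NumPy sketch below simulates acquiring only the central, low-frequency region of k-space; the zero-filled reconstruction is blurry, and it is exactly this missing high-frequency content that a trained ML model would be expected to restore. The image, the retained fraction, and the function name are illustrative assumptions, not a specific published method:

```python
import numpy as np

def undersample_kspace(image: np.ndarray, keep_fraction: float = 0.25) -> np.ndarray:
    """Keep only the central (low-frequency) block of k-space and
    reconstruct with zero-filling, mimicking an accelerated acquisition."""
    kspace = np.fft.fftshift(np.fft.fft2(image))
    ny, nx = kspace.shape
    ky = int(ny * keep_fraction / 2)
    kx = int(nx * keep_fraction / 2)
    mask = np.zeros_like(kspace, dtype=bool)
    mask[ny // 2 - ky : ny // 2 + ky, nx // 2 - kx : nx // 2 + kx] = True
    return np.abs(np.fft.ifft2(np.fft.ifftshift(kspace * mask)))

# A random array stands in for a real MRI slice; the blurry result is what
# an ML restoration model would receive as input.
slice_2d = np.random.rand(256, 256)
blurry = undersample_kspace(slice_2d)
```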
Some imaging artefacts can also be corrected during the post-processing stage, independently of the scanner, which means these solutions are expected to work on a wide range of scanners and their respective settings, unless their limitations are clearly stated. ML is effective at removing image artefacts, such as bias fields, motion artefacts, implants, and image noise, all of which can degrade the quality of MRI images [150] and, if left uncorrected, affect the accuracy of downstream automated processes.
More common data manipulation tasks come in the form of modality transfer, i.e., converting the signal into a modality that could not have been acquired with the given scanner, or that would have required additional scans (e.g. sCT, as described in Part II). ML techniques can also be leveraged to provide additional information, such as a changed image contrast or quantitative MRI (qMRI) measurements [65], which can reveal underlying tissue properties. As an example, the Sweden-based company SyMRI offers a workflow for acquiring the quantitative properties of the brain, which can be used to construct synthetic MRI scans with custom scanner settings.
One of the most powerful modality transfers enabling MRI-only radiotherapy is sCT generation from MRI images [114, 16]. By implementing this in clinical practice, the need for an additional CT scan for radiotherapy is eliminated. In fact, the company Spectronic Medical AB has already taken a major step in this direction with a commercially available sCT model that takes a single weighted MRI scan as input. This modality transfer offers a more streamlined and efficient approach to radiotherapy planning, reducing both the radiation exposure of patients and imaging costs.
Downstream image processing tasks in radiotherapy with ML-based solutions are wide-ranging, from the automated segmentation of clinically relevant structures [123] to the prediction of the probability of a medical condition, the registration of CT and MRI images [104], and the recommendation of deliverable treatment plans [52].
Commercial solutions
The popularity and effectiveness of ML have attracted large companies to the field, with research groups such as FAIR and OpenAI, backed by Facebook and Microsoft, respectively. Apart from research, there are a large number of commercially available solutions performing a single task with an underlying model, trained on a valuable, well-curated, but not publicly available dataset [170].
Keeping the dataset private increases the commercial value of the product, but it leaves users in the dark about the generalization capability and the possible limitations of the underlying model. This is an often-discussed topic in other fields, such as that of Large Language Models (LLMs); in medical imaging solutions, however, the impact is far more explicit and dangerous, which lends urgency to addressing data availability and transparency [142].
Research areas
With the fast pace of progress in the field, there is fierce competition to be considered state-of-the-art [141]. This can lead to published methods being kept private, making them hard to reproduce, which harms the scientific community as a whole. It can also lead to methods being built on other recently published components that might not have been appropriately validated. The frequent use of batch normalization without a proper justification for its good performance [137, 154] shows that the majority of ML research is still exploratory; the lack of empirical foundations for such components is a common criticism of DL [141, 107].
Public challenges and datasets provide a solution to data availability and transparency concerns. Popular challenges, often connected to well-known conferences, provide a competitive yet transparent way of comparing solutions using thoroughly reviewed, high-quality datasets, whereas other, smaller challenges might provide data that is insufficiently described and therefore hard to navigate. The ease of creating challenges and public datasets can make it difficult to assess the quality of the provided data.
Challenges most often tackle a single task and therefore require a single trained model [174, 53, 108], and the leaderboards of such challenges represent a good benchmark of the performance of publicly available solutions. Recently, more complex challenges have also been introduced that require ML at multiple stages, with the goal of assessing the overall achievable level of automation.
For different applications, the underlying physics of the task at hand is often considered when designing the network architecture, loss function, evaluation metric, etc. Such methods are often referred to as physics-informed approaches [150, 65, 71].
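As one hedged example of such a physics-informed component, the PyTorch sketch below implements a k-space data-consistency loss term for MRI reconstruction: wherever k-space was actually sampled, the Fourier transform of the predicted image is penalized for deviating from the measurement. The tensor shapes, the function name, and the weighting against an image-space loss are all illustrative assumptions:

```python
import torch

def data_consistency_loss(pred_image: torch.Tensor,
                          measured_kspace: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """Penalize the predicted image wherever its Fourier transform
    disagrees with the actually sampled k-space locations.

    pred_image:      (B, H, W) real-valued network output
    measured_kspace: (B, H, W) complex-valued acquired samples
    mask:            (B, H, W) 1.0 where k-space was sampled, 0.0 elsewhere
    """
    pred_kspace = torch.fft.fft2(pred_image.to(torch.complex64))
    diff = (pred_kspace - measured_kspace) * mask.to(pred_kspace.dtype)
    return torch.mean(torch.abs(diff) ** 2)

# Combined with an image-space loss during training, e.g.:
# loss = l1_loss(pred_image, target) + 0.1 * data_consistency_loss(pred_image, kspace, mask)
```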
Concerns
The expanding field introduces a long list of possibilities, but also numerous concerns. These can be categorized into concerns around data, interpretability and reproducibility.
Data concerns
ML solutions are by design data-driven, making the careful selection and comprehensive description of training and evaluation data crucial, especially in medical settings. The optimization process makes a model perform well on the given training data, while the variation within this data allows the model to generalize better and be useful in a wider range of scenarios. For instance, a model trained to segment structures in MRI images is not expected to perform well when applied to CT data. However, even variations in finer details, such as the scanned area, field of view, voxel sizes, orientation matrix, and potential artefacts within scans, can significantly impact the effectiveness of the model. Evaluating on validation and testing datasets can help in understanding the generalization of ML models; however, all of these details of the training dataset need to be thoroughly described in order to assess whether the model can be expected to work well in a given scenario.
Data balance
It is not only important to include wide variation in the training data, but also to ensure proper balance across the different classes or categories within the dataset. It is worth considering whether the best option is to balance the classes according to their real-world distributions. As an example, proposed ML solutions have been shown to work sub-optimally for pediatric patients [145], who are often not represented in the training data, but who also constitute an underrepresented class in the real world. Underrepresented classes can be oversampled to have a larger impact on the trained model, or they can be used separately to train a dedicated model, resulting in a model intended to work well only on pediatric data.
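A minimal PyTorch sketch of such oversampling, using the built-in WeightedRandomSampler so that minority-class examples are drawn more often during training; the dataset, class proportions, and batch size are toy assumptions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy dataset: label 1 marks the underrepresented class (e.g. pediatric scans).
images = torch.randn(1000, 1, 64, 64)
labels = torch.cat([torch.zeros(950), torch.ones(50)]).long()

# Draw each sample with probability inversely proportional to its class frequency.
class_counts = torch.bincount(labels).float()
sample_weights = (1.0 / class_counts)[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

loader = DataLoader(TensorDataset(images, labels), batch_size=32, sampler=sampler)
# Batches drawn from this loader now contain roughly balanced classes.
```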
Medical imaging datasets are often heavily altered during pre-processing: out-of-distribution images, or entire volumes containing artefacts, are often removed. Thresholding algorithms are commonly applied to the image slices to mask out background noise, and other artefact removal methods, such as N4ITK bias field correction [159], are frequently implemented. The reasoning for curating the available data is to improve the reported performance of the model on the constructed, well-balanced dataset. This is a fair approach; however, when using such ML models, the same pre-processing steps must be applied to ensure the suitable quality of the new data. This underlines that a detailed description of the training dataset is essential.
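For illustration, a typical curation step of this kind can be scripted with SimpleITK, masking out the background via Otsu thresholding and then applying N4ITK bias field correction; the file names below are placeholders, and the histogram-bin setting follows common usage of this filter rather than any specific pipeline:

```python
import SimpleITK as sitk

# Load an MRI volume (the file names are placeholders).
image = sitk.ReadImage("subject01_t1.nii.gz", sitk.sitkFloat32)

# Mask out background noise with Otsu thresholding (foreground voxels = 1).
head_mask = sitk.OtsuThreshold(image, 0, 1, 200)

# N4ITK bias field correction, restricted to the foreground mask.
corrector = sitk.N4BiasFieldCorrectionImageFilter()
corrected = corrector.Execute(image, head_mask)

sitk.WriteImage(corrected, "subject01_t1_n4.nii.gz")
```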
Data imbalance is often addressed during evaluation by testing the model in unfamiliar circumstances, such as with a different scanner or a different pre-processing pipeline [116]. Alternatively, it can be addressed during training by using data augmentation or by adjusting the loss function to account for the imbalance [75], as sketched below.
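One common way to adjust the loss is to weight each class inversely to its frequency; the short PyTorch sketch below does this with a weighted cross-entropy over voxel-wise labels. The weights, shapes, and two-class setup are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Weight each class inversely to its (assumed) frequency: here the background
# dominates, while the structure of interest covers only a few percent of voxels.
class_weights = torch.tensor([0.1, 4.9])
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2, 64, 64)           # (batch, classes, height, width)
target = torch.randint(0, 2, (8, 64, 64))    # voxel-wise ground-truth labels
loss = criterion(logits, target)
```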
Data leakage
Using a validation dataset that closely resembles the training data can give a good understanding of how well the model performs. However, incorrect data handling when setting up the training and validation datasets can lead to data leakage, which in turn results in a falsely optimistic validation score. The possible causes of data leakage depend heavily on the task, but as a rule of thumb, all pre-processing must be done after the data has been split into training and validation sets, and all final model evaluations must be computed on the testing dataset.
For medical imaging, the data split is most commonly performed patient-wise. Even if the proposed method works on individual slices, having one slice in the validation data while its neighbouring slices are used during training would introduce an unfair similarity between two datasets that should be independent. While how to split the data is usually straightforward, failing to raise these questions can lead to mistakes, and data leakage can seriously undermine the impact of the work [94].
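A patient-wise split of slice-level data can be made explicit with scikit-learn's group-aware splitters, as in the sketch below; the slice and patient arrays are toy stand-ins for a real dataset:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# One entry per 2D slice; several slices belong to the same patient.
slice_ids = np.arange(12)
patient_ids = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])

# Splitting by group guarantees that no patient appears in both sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, val_idx = next(splitter.split(slice_ids, groups=patient_ids))

assert not set(patient_ids[train_idx]) & set(patient_ids[val_idx])
```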
Data privacy
The privacy of medical data is both a blessing and a curse with regard to the popularity of ML in the field. Scientific reports may contain limited information about medical datasets due to ethical approval constraints. Additionally, large, high-quality medical datasets are extremely valuable, which consequently leads to a decline in research transparency and in thorough model evaluations [64].
The increased popularity of public datasets and challenges has advanced transparency [93, 82, 108], as they raise awareness of data shortcomings through quality assessment and documentation, and by minimizing pre-processing steps. The level of anonymization differs between datasets, from removing sensitive information from DICOM images to blurring faces in TotalSegmentator [170]. Public datasets are particularly valuable in domains where data collection is challenging or impossible. They drive advancements in tasks like rare image analysis, drug discovery, disease diagnosis, and personalized treatment planning (e.g. the Human Connectome Project with the iBEAT V2.0 Cloud software for segmentation maps), propelling the scientific field forward at a remarkable pace. Additionally, public challenges provide platforms for sharing ideas and solutions, learning as a community, and enabling continuous refinement of datasets and tasks, such as KiTS and BraTS. These challenges often offer attractive rewards, including exposure at scientific venues and monetary prizes (e.g. the Grand Challenge on Medical Image Analysis), further motivating participation.
Interpretability concerns
While public challenges have gained popularity, the adoption of public solutions in the literature remains uncommon. Reproducible methods and shared trained weights are crucial for enabling the widespread use of proposed solutions in the field of ML. However, it is equally important to ensure that these models are utilized correctly and that their limitations are understood. For instance, as discussed regarding data balance in Part I, when a model is used on data that it was not trained or validated on (e.g. different hardware, scanner settings, or patient anatomy), the model predictions must be carefully validated and verified.
Reliability
One crucial aspect of current ML models is that, technically, they can be used with input data that has not been pre-processed correctly, and the predictions of the model may not indicate that the input format is incorrect. This can lead to severely or subtly flawed results that may go unnoticed upon visual assessment. In clinical settings, where accuracy is paramount, thorough, community-driven guidelines are necessary to ensure appropriate pre-processing steps and avoid such issues.
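One pragmatic safeguard is to validate inputs against the documented training configuration before running inference. The sketch below is a minimal illustration of this idea; the expected spacing and intensity range are hypothetical values that, in practice, would come from the model's documentation:

```python
import numpy as np

def validate_input(volume: np.ndarray, spacing: tuple) -> None:
    """Reject inputs that deviate from the documented training configuration.

    The expected spacing and intensity range below are hypothetical values;
    in practice they should come from the model's documentation.
    """
    expected_spacing = (1.0, 1.0, 1.0)  # mm, as assumed during training
    if not np.allclose(spacing, expected_spacing, atol=0.1):
        raise ValueError(f"Unexpected voxel spacing {spacing}; resample first.")
    if volume.min() < 0.0 or volume.max() > 1.0:
        raise ValueError("Intensities outside [0, 1]; apply the training normalization.")
```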
Ensuring the quality of artificially generated images is crucial for diagnosis and treatment decisions. However, decisions based on such images also present legal concerns, whether in medical imaging or when using artificially enhanced images in court hearings [39].
Unsupervised methods such as GANs have a particularly concerning property known as hallucination: the generation of realistic-looking images that do not accurately represent the patient's condition. In medical imaging, it is essential to validate the outputs of unsupervised methods with appropriate techniques and to ensure that such hallucinations do not lead to incorrect diagnoses or treatment decisions. In sensitive scenarios (which constitute the majority of medical applications), additional steps must be taken to improve confidence in the model outputs by improving their explainability.
Explainability
For ML models with millions of parameters, it is impossible to fully understand why a given input leads to a certain model prediction. This issue is easiest to illustrate for classification tasks, but it is equally important for image-to-image tasks (e.g. sCT generation) [20]. The extreme complexity of the trained models is what leads to the analogy of a "black box" [130], which harms public trust in DL. However, various tools can shed light on the inner workings of the models, which is essential for building general trust in ML.
Uncertainty metrics can play a vital role in explainability. One such technique is Monte Carlo Dropout, which estimates uncertainty by keeping dropout layers active during inference. It generates multiple predictions, each made by a different random subset of the trained neurons, effectively sampling different configurations of the final model. By analyzing the variation between the predictions, we can determine the model's level of confidence in the predicted output and identify how much it depends on only a few neurons. The use of such uncertainty estimation techniques (alternatives include Bayesian uncertainty [101] and variational dropout [80]) can help enhance the interpretability of ML models and build trust in their predictions [43].
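A minimal PyTorch sketch of Monte Carlo Dropout: the model is put in evaluation mode, the dropout layers alone are switched back to training mode so they stay stochastic, and the mean and standard deviation over repeated forward passes serve as the prediction and its uncertainty. The toy architecture and number of samples are illustrative:

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 20):
    """Keep dropout stochastic at inference time and aggregate repeated passes."""
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()  # re-enable only the dropout layers
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)  # prediction and its uncertainty

# Toy network standing in for a trained model:
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 2))
mean, uncertainty = mc_dropout_predict(model, torch.randn(4, 16))
```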
In addition to uncertainty metrics, another valuable tool for improving model explainability is the use of saliency maps [147], which highlight the input image pixels that have the most significant impact on the model's prediction. These maps can be generated by calculating the gradient of the model's output with respect to the input image and visualizing the gradient values as heatmaps. By visualizing the salient areas of the input image, we can enhance the interpretability of the model's predictions and build trust among researchers.
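A basic gradient saliency map can be computed in a few lines of PyTorch, as sketched below for a classification-style output; the toy model, input size, and target class are illustrative assumptions:

```python
import torch
import torch.nn as nn

def saliency_map(model: nn.Module, image: torch.Tensor, target_class: int) -> torch.Tensor:
    """Absolute gradient of the target-class score w.r.t. the input pixels."""
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]
    score.backward()
    return image.grad.abs().squeeze()  # large values = influential pixels

# Toy classifier standing in for a trained medical imaging model:
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 2))
heatmap = saliency_map(model, torch.randn(1, 1, 64, 64), target_class=1)
```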
General solutions
To explore the unknown limitations of ML methods, recent advancements have focused on developing more general and abstract solutions, which provide better control over what the model learns and improve generalization to real-world scenarios. Pixel-based loss functions have recently been replaced by higher-level losses such as shape descriptors [74], and training on real-world data is often replaced by training on synthetic data and evaluating on real datasets [42, 149]. Evaluating model robustness on a wider range of publicly available data is also becoming more popular [81, 33]. These trends have shown promising results, suggesting that ML models can be made more reliable and applicable to a wider range of real-world scenarios.
Reproducibility concerns
Concerns about reproducibility can be found in all research fields. However, they become especially important in computer science, and even more so in ML, where the rate of advancement is high and the impact of small changes can be large [141, 64].
There are multiple levels of reproducibility [23], and it is important to note the distinction between the reproducibility of the conclusions and the reproducibility of the methods.
Reproducibility of the conclusions
This entails that the conclusions of a scientific report can be reproduced by other researchers using their own data and methods. However, the complexity of the methods and the number of variables are generally so large that focusing on this form of reproducibility requires the field of ML to grow and mature further. These general, enormous methodological issues surrounding the scientific field make up what is called the reproducibility crisis.
Reproducibility of the methods
This is a more attainable goal, as it focuses on the methods proposed in a scientific report being reproducible using the report and its additional materials. The most popular, and most efficient [126], way of ensuring this form of reproducibility is sharing the source code used for the report or publication.
Previously discussed concerns about data privacy, unfamiliarity with publishing source code, and the potential commercial value of the proposed solutions all hinder the reproducibility of the field of ML. There are clear advancements towards promoting reproducible research, such as the FAIR principles and the reproducibility guidelines put in place at Nature and at an increasing number of ML conferences (NeurIPS [122], MICCAI [10]). However, these guidelines must be promoted until an established empirical rigor is followed by the scientific community, so that higher-level issues of reproducibility can be properly addressed.
The reproducibility of the methods is often limited by the use of privately acquired training data, and the performance of the proposed methods is often driven by large, high-quality datasets. With reproducible methods, the cross-examination of these solutions becomes possible, and the effect of the individual proposed components, e.g. novel loss functions and model architectures, can be investigated in more depth.
Due to the concerns around code sharing, the pre-trained weights of proposed models are also difficult to find. Providing the weights in a standalone format would allow the easiest comparison with other methods in further research. Without the weights, the entire training process must be reproduced for comparison, which is a time-consuming process with many caveats [17]. This means that when comparing against previously established methods, it is inherently difficult to use them appropriately, always giving an advantage to the currently proposed model.
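In PyTorch, for instance, sharing the weights as a standalone state_dict file keeps them portable and easy to reload into a re-implementation of the published architecture; the sketch below uses a toy stand-in model and a hypothetical file name:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 2))  # stand-in for a published architecture

# Save only the weights (state_dict), not the full pickled model object:
# this standalone format is portable and easy to redistribute.
torch.save(model.state_dict(), "proposed_model_weights.pt")

# A third party re-creates the architecture and loads the shared weights:
reloaded = nn.Sequential(nn.Linear(16, 2))
reloaded.load_state_dict(torch.load("proposed_model_weights.pt"))
reloaded.eval()
```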
Open questions, future directions
As the use of ML continues to expand, it has become increasingly accessible to individuals without a background in artificial intelligence. While this is a significant advancement towards making ML more transparent, it also emphasizes the need for careful design and maintenance of the models by researchers. The myriad concerns around implementing ML methods need to be popularized, and empirical rigor must be employed to improve the quality of the implementations.
In order for ML to continue to grow, it is not enough to merely address the issues that arise. Instead, researchers must proactively take these issues into consideration when designing a project. This includes carefully selecting appropriate datasets, choosing the right algorithms, continually testing and validating the models to ensure that they are reliable and produce accurate results, and sharing their code and trained models whenever possible. By taking these precautions, researchers can ensure that the use of ML remains ethical, safe, and beneficial to society as a whole.
Under careful supervision, the use of ML solutions could benefit virtually all stages of the radiotherapy workflow [61]. By processing the large amounts of collected medical data at a much faster rate than the already overworked medical staff possibly could, ML promises to be an extremely effective and valuable asset in healthcare, motivating further exploration and development.
Current AI solutions have crucial limitations that make the supervised and widespread use of ML models extremely difficult. To move away from the "black box" analogy of ML, the models need to be distributed with thorough documentation of the significant decisions made during training. The following are the most important aspects to always address [69]:
- The data used for training and evaluation: models are not expected to perform well on data that is widely different from what the model has been trained or evaluated on (e.g. different medical imaging hardware, scanner settings, or anatomy).
- The pre-processing steps: The model should be evaluated on data that has been pre-processed correctly, as the predictions of the model will likely not indicate if the input format is incorrect.
- Training approach: supervised learning introduces different artefacts than unsupervised approaches, such as GANs. In the case of image segmentation, the former may introduce errors around the edges of the segmentations, while the latter may hallucinate non-existent anatomies.
- An interpretation of the model predictions, together with other limitations and potential biases.
By addressing the limitations of current ML solutions, we can enhance their reliability, interpretability, and effectiveness in real-world applications. The projects presented below tackle medical imaging problems, and they all focus on reproducibility and on the generalization of the methods to previously unseen data. Given the increasing quantity and quality of public datasets, the growing popularity of public code repositories for sharing implementations, the growing body of empirical research [17, 16, 13, 129], and the initiatives towards highly generalizable solutions, I believe the future of ML in medical imaging is bright.