Artificial intelligence and machine learning for medical imaging: a technology review

Artificial intelligence (AI) has recently become a very popular buzzword, as a consequence of disruptive technical advances and impressive experimental results, notably in the field of image analysis and processing. In medicine, specialties where images are central, like radiology, pathology or oncology, have seized the opportunity, and considerable research and development efforts have been deployed to transfer the potential of AI to clinical applications. With AI becoming a more mainstream tool for typical medical imaging analysis tasks, such as diagnosis, segmentation, or classification, the key to safe and efficient use of clinical AI applications lies, in part, in informed practitioners. The aim of this review is to present the basic technological pillars of AI, together with the state-of-the-art machine learning methods and their application to medical imaging. In addition, we discuss the new trends and future research directions. This will help the reader to understand how AI methods are now becoming a ubiquitous tool in any medical image analysis workflow and pave the way for the clinical implementation of AI-based solutions.

1. Introduction

For the last decade, the term Artificial Intelligence (AI) has progressively flooded many scientific journals, including those of image processing and medical physics. Paradoxically, though, AI is an old concept, which began to be formalized in the 1940s, while the term artificial intelligence itself was coined in 1956 by John McCarthy. In short, AI refers to computer algorithms that can mimic features that are characteristic of human intelligence, such as problem solving or learning. The latest success of AI has been made possible by tremendous growth in both computational power and data availability. In particular, AI applications based on machine learning (ML) algorithms have experienced unprecedented breakthroughs during the last decade in the field of computer vision. The medical community has taken advantage of these extraordinary developments in order to build AI applications that get the most out of medical images, automating different steps of the clinical practice or providing support for clinical decisions. Papers relying on AI and ML report promising results in a wide range of medical applications [1–7]. Disease diagnosis, image segmentation or outcome prediction are some of the tasks that are experiencing a disruptive transformation thanks to the latest progress of AI.

More recently, ML tools have become mature enough to fulfill clinical requirements and, thus, research and clinical teams, as well as companies, are working together to develop clinical AI solutions. Today, we are closer than ever to the clinical implementation of AI and, therefore, getting to know the basics of this technology becomes a “must” for every professional in the medical field. Helping the medical physics community to acquire a solid background knowledge of AI and learning methods, including their evolution and current state of the art, will certainly result in higher quality research, facilitate the first steps of new researchers in this field, and inspire novel research directions.

The goal of this review article is to briefly walk the reader through some basic AI concepts with a focus on medical image processing (Section 2), followed by a presentation of the state-of-the-art methods and current trends in the domain (Section 3). To finish, we discuss the future research directions that will enable the next generation of AI-based solutions for medical imaging applications (Section 4).

2. Building blocks of AI methods for medical imaging

The field of AI evolves rapidly, with new methods published at a high pace. However, several central concepts have become firmly established. This section presents a brief overview of these building blocks of AI methods, with a focus on medical imaging. For more detailed descriptions we refer to relevant books [8–11] and publications [12,13].

2.1. Artificial intelligence, machine learning, and deep learning

As mentioned previously, AI broadly refers to any method or algorithm that mimics human intelligence. Historically, AI has been approached from two directions: computationalism and connectionism. The former attempts to mimic formal reasoning and logic directly, regardless of its biological implementation. Mostly based on hardcoded axioms and rules that are combined to deduce new conclusions, computationalism is conceptually similar to computers, storing and processing symbols. Connectionism, on the other hand, rather follows a bottom-up approach, starting from models of biological neurons that are interconnected in large numbers and from which intelligence is intended to emerge by learning from experience. Expert systems [14–16], which became very popular in the 1980s, are a classical example of computationalism. Some famous applications of expert systems to the medical field are MYCIN (diagnosis of bacterial infection in the blood) [17], PUFF (interpretation of pulmonary function data) [18], or INTERNIST-1 (diagnosis for internal medicine) [19]. However, the bottleneck of expert systems is the complexity of acquiring the required knowledge in the form of production rules and, thus, interest in computationalist algorithms began to fade in the 1990s in favor of connectionist approaches [20,21]. The appeal of connectionism and learning-based AI lies in the fact that it delegates the responsibility for accuracy and exhaustiveness to data instead of human experts, who might be scarce, or prone to error, bias, or subjectivity. The ever-growing abundance of data, including medical images, then typically tilts the scales in favor of learning techniques, and the community has focused successively on two nested subfamilies ( Figure 1 ): machine learning and deep learning.

Figure 1.

Artificial intelligence, machine learning, and deep learning can be seen as matryoshkas nested in each other. Artificial intelligence gathers both symbolic (top down) and connectionist (bottom up) approaches. Machine learning is the dominant branch of connectionism, combining biological (neural networks) and statistical (data-driven learning theory) influences. Deep learning focuses mainly on large-size neural networks, with functional specificities to process images, sounds, videos, etc.

The defining characteristic of machine learning (ML) is that it is driven by data, which gives machines (computers) "the ability to learn without being explicitly programmed", as formulated in 1959 by Arthur Samuel, a pioneer of the ML field. ML typically works in two phases, training and inference. Training allows patterns to be found in previously collected data, whereas inference compares these patterns to new unseen data to then carry out a certain task like prediction or decision making. Since the 1990s, ML algorithms have continuously evolved and improved, becoming more sophisticated and including hierarchical structures, which gave rise to the popular Deep Learning. The term Deep Learning (DL) was first coined by Aizenberg et al. [22] in the 2000s, and refers to a subset of ML algorithms that are organized hierarchically, on multiple levels (hence the term "deep"), to automatically extract meaningful features from data.

Although ML encompasses DL, DL is often opposed to classical “shallow” ML, the latter relying on algorithms that have a flatter architecture and depend on previous feature engineering to extract data representations. This distinction also reflects the evolution from ML to DL, namely, from specific feature engineering to generic feature learning. While ML generally relies on domain knowledge and expertise to define relevant features, DL involves generic, trainable features. In other words, despite the modeling power of ML, global performance remains limited by the adequacy of manually picked features. Alternatively, DL replaces these fixed specialized features with generic, trainable, low-level features that are involved in the learning procedure, thereby offering better performance guarantees. Sophistication is here achieved by stacking layers of simple features, leading to a hierarchical model structure. As training in DL concerns both the low-level features and the higher-level model, DL is often referred to as an end-to-end approach. For image data, this approach typically allows DL to learn optimal filters.

Today, ML models have reached important milestones, in some cases being able to accomplish tasks with an accuracy that is similar to or even better than human experts. For instance, the diagnostic performance of DL models has been shown to be equivalent to that of health-care professionals for certain applications [23], such as skin-cancer detection [24] or breast-cancer detection [25]. In particular, the latter reported a DL model that not only reached an excellent performance in mammogram classification, but also outperformed five out of five full-time breast-imaging specialists, with an average increase in sensitivity of 14% [25]. Image segmentation is another task that has experienced a transformation with the advent of ML algorithms. For instance, a recent study described a DL model that can perform organ segmentation in the head and neck region from CT images with performance comparable to experienced radiographers [26]. For more detailed examples of the performance of state-of-the-art ML and DL methods for medical applications we refer to Section 3.

2.2. Learning frameworks and strategies

Machine learning can be broadly split into two complementary categories, supervised and unsupervised, which are inspired by human learning ( Table 1 ). Supervised learning is the simplest and provides the tightest framework with the strongest guarantees. It formalizes learning with a parent or teacher, who provides the inputs and controls the outputs. In supervised learning, the training data thus consists of labelled or annotated (input, output) pairs, and the model is trained to yield the desired output when presented with a given input. When data are not annotated, unsupervised learning, also known as self-organization, aims to discover patterns in the data ( Figure 2 ).

Figure 2.

Three classical learning frameworks in artificial intelligence: supervised, semi-supervised, and unsupervised learning. Supervised learning relies on known input-output pairs. If some output labels are difficult or expensive to get, semi-supervised learning can apply. If no labels are available, unsupervised learning allows for a more exploratory approach to the data.

Table 1.

Different learning frameworks and strategies, together with some of the most popular algorithms or techniques used for each of them, as well as a few examples of common applications in the field of medical imaging. The table is divided into three parts: the basic learning frameworks (supervised, unsupervised and reinforcement learning), the hybrid learning frameworks blending supervised and unsupervised learning, and finally common learning strategies that solve consecutive learning problems or combine several models.

Learning style | Common algorithms / methods | Examples

BASIC LEARNING FRAMEWORKS
Supervised learning | Linear or logistic regression; decision trees and random forests; support vector machines; convolutional neural networks; recurrent neural networks | Cancer diagnosis [78–81]; organ segmentation [26,82–86]; radiotherapy dose denoising [33]; radiotherapy dose prediction [87,88]; conversion between image modalities [89,90]
Unsupervised learning | (Variational) autoencoders; dimensionality reduction (e.g., principal component analysis); clustering (e.g., k-means) | Domain adaptation tasks [35–37,91,92]; classification of patient groups [93]; image reconstruction [94]
Reinforcement learning | Q-learning; Markov decision processes | Tumor segmentation [54,55]; image reconstruction [95]; treatment planning [50–53,96]

HYBRID LEARNING FRAMEWORKS
Semi-supervised learning | Generative adversarial networks | Tumor classification [45,46]; organ segmentation [46]; synthetic image generation [97,98]
Self-supervised learning | Pretext tasks: distortion (e.g., rotation), color- or intensity-based transformations, patch extraction | Image classification or segmentation [76]

LEARNING STRATEGIES
Transfer learning | Inductive; transductive; unsupervised | Radiotherapy toxicity prediction [58]; adaptation to different clinical practices [62]; improving model generalization [99]
Ensemble learning | Bagging (Bootstrap AGGregatING, e.g., random forests); boosting (e.g., AdaBoost, gradient boosting) | Radiotherapy dose prediction [100,101]; estimation of uncertainty [102]; stratification of patients [103]

Typical supervised tasks involve function approximation, like regression and classification. Classification can be binary, like in determining whether a pathology is present or not in an image [25,27], involve multiple classes, as in determining a particular pathology among several labels [28–30], or concern not the whole image but each pixel, as done for image segmentation [31,32]. On the regression side, typical pixel-wise tasks include image enhancement (e.g., improving a low-quality image, the input, by mapping it to its higher-quality counterpart, the output label or annotation) [33] and image-to-image mapping (e.g., mapping a CT image, the input, to the corresponding dose distribution, the output) [34]. More examples of clinical applications of supervised learning and the common ML methods used within this learning framework are presented in Table 1 .
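As a minimal illustration of this supervised setting, the following sketch (using scikit-learn on synthetic feature vectors that merely stand in for image-derived features; all names and values are ours and purely illustrative) trains a binary classifier and a regressor from (input, output) pairs and evaluates them on held-out data.

```python
# Minimal supervised-learning sketch: a binary classifier (e.g., "pathology
# present vs absent") and a regressor (e.g., a continuous dose value), both
# trained on synthetic feature vectors standing in for image-derived features.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                   # 500 samples, 20 features each
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)    # surrogate binary label
y_reg = X @ rng.normal(size=20) + 0.1 * rng.normal(size=500)  # surrogate target

# Classification: predict the class label from the input features
Xtr, Xte, ytr, yte = train_test_split(X, y_class, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(Xtr, ytr)
print("classification accuracy:", accuracy_score(yte, clf.predict(Xte)))

# Regression: predict a continuous value from the same kind of features
Xtr, Xte, ytr, yte = train_test_split(X, y_reg, test_size=0.3, random_state=0)
reg = Ridge().fit(Xtr, ytr)
print("regression MAE:", mean_absolute_error(yte, reg.predict(Xte)))
```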

In contrast, most unsupervised tasks relate to probability density estimation, like clustering (finding separated groups of similar data items), outlier or anomaly detection (isolated items), or even manifold learning and dimensionality reduction (subspaces on which data concentrate). The use of unsupervised learning has been, so far, much more limited than its supervised counterpart, although useful applications for medical imaging exist, such as domain adaptation (e.g., adapting a segmentation model trained on one image modality to work on a different image modality) [35–37], data generation (e.g., generating realistic artificial images) [38–40] or even image segmentation [41]. Table 1 presents some of the main ML methods that work in an unsupervised framework.
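To make the contrast with the supervised setting concrete, the short sketch below (scikit-learn on synthetic, unlabelled feature vectors; the two latent groups are our own toy construction) applies dimensionality reduction and clustering without ever seeing a label.

```python
# Unsupervised-learning sketch: discover structure in unlabelled feature vectors
# with PCA (dimensionality reduction) followed by k-means (clustering).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two latent patient groups, unknown to the algorithm
group_a = rng.normal(loc=0.0, scale=1.0, size=(100, 30))
group_b = rng.normal(loc=3.0, scale=1.0, size=(100, 30))
X = np.vstack([group_a, group_b])

X_2d = PCA(n_components=2).fit_transform(X)   # project onto 2 principal components
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)
print("cluster sizes:", np.bincount(labels))  # ideally around 100 / 100
```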

Semi-supervised learning is a hybrid framework halfway between supervised and unsupervised, and thus involves data for which the desired outputs are only partly known. Groups identified as clusters by unsupervised learning can be used as possible class labels [42] ( Figure 2 ). Some examples of clinical applications of semi-supervised learning include the generation or translation of images from one class to another (e.g., generation of synthetic CTs from MR images) [43,44], and segmentation or classification of images with partially labelled data [45,46].

So far, supervised learning has been the most used learning framework for medical imaging applications, as its learning objective is unambiguous and models are comparatively easy to train. However, it is well known that data labelling in the medical domain is an extremely time-consuming task, requiring costly inspection by human experts. Therefore, more and more researchers are now exploring semi-supervised learning techniques, which are an excellent way to complement small sets of carefully labelled data with large amounts of cheap unlabelled data collected automatically [42,47]. In fact, many of the current limitations of ML/DL algorithms come from the use of labelled data (e.g., errors in labels [48], limited-size labelled databases, etc.) and thus, although the use of fully unsupervised learning in the medical field is still very limited, we believe that future research will focus on unsupervised techniques in order to unlock the full potential of ML. Very recently, unsupervised models have started to achieve better performance than supervised models on computer vision tasks [49], and the same is likely to happen for medical imaging applications.

Yet another type of learning, known as reinforcement learning, proceeds by interacting with an environment from which an agent gets feedback on its actions over the course of time. After each action towards a new state, the environment can either reward or punish the agent, which then has to predict the longer-term consequences of future actions in a trial-and-error fashion. The use of reinforcement learning for medical imaging is still not widespread, but it has increased in the last couple of years, with promising applications that allow mimicking physician behaviour for typical tasks such as the design of a treatment [50–55], among others ( Table 1 ).
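The following toy sketch illustrates the reward-driven, trial-and-error nature of reinforcement learning with tabular Q-learning on a deliberately simple, non-medical environment (a one-dimensional chain of states); all parameters are arbitrary choices made for illustration.

```python
# Toy tabular Q-learning: an agent learns, by trial and error with delayed
# reward, to walk right along a 1-D chain of states to reach the goal state.
import numpy as np

n_states, n_actions = 6, 2             # actions: 0 = left, 1 = right; goal = state 5
Q = np.zeros((n_states, n_actions))    # action-value table, updated by experience
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != n_states - 1:
        greedy = np.flatnonzero(Q[s] == Q[s].max())      # break ties randomly
        a = rng.integers(n_actions) if rng.random() < epsilon else int(rng.choice(greedy))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0       # reward only at the goal
        # Q-learning update: move Q(s, a) toward the observed reward plus the
        # discounted value of the best action available in the next state
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("greedy policy (0 = left, 1 = right):", np.argmax(Q, axis=1))
```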

On top of these three basic learning frameworks (supervised, unsupervised and reinforcement learning), there are other strategies that enable us to reuse previously trained models (transfer learning) or combine models (ensemble learning). Transfer learning [56,57] reuses blocks and layers from a model that was pre-trained on some data for a certain task (source domain and task) and fine-tunes them for different data and/or a different task (target domain and task). For example, a classification model pre-trained on ImageNet (a large collection of natural images) can be partly reused and fine-tuned for medical imaging applications, such as organ segmentation or treatment outcome prediction [58–60]. Transfer learning allows us to exploit knowledge from different but related domains, mitigating the need for a large dataset for the target task and improving model performance [60–62]. Ensemble learning methods are also a way to improve the overall performance and stability of the model, by combining the outputs of multiple models or algorithms to perform a task [63]. Some examples of medical applications include the mapping of patient anatomy to dose distribution for radiotherapy treatments [64], image segmentation [65], or classification [66].
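A minimal transfer-learning sketch is given below, assuming PyTorch and a recent torchvision (>= 0.13): an ImageNet-pretrained ResNet-18 backbone is frozen and only a new, task-specific output layer is fine-tuned; the batch, labels and two-class target task are dummy placeholders.

```python
# Transfer-learning sketch: reuse a ResNet-18 pre-trained on ImageNet (source
# domain), freeze its feature-extraction layers, and fine-tune a new output
# layer for a two-class target task.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # pre-trained on natural images

for param in model.parameters():                   # freeze the pre-trained backbone
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 2)      # new head for the target task

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)   # train the head only
criterion = nn.CrossEntropyLoss()

x = torch.randn(4, 3, 224, 224)                    # dummy batch standing in for images
y = torch.randint(0, 2, (4,))                      # dummy labels
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print("one fine-tuning step done, loss =", float(loss))
```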

Last but not least, self-supervised learning is a recent hybrid framework that has become state-of-the-art in natural language processing [67–69]. It is gaining attention for computer vision tasks [70–73] and could play an important role in future research directions for medical imaging applications. Self-supervised learning can be seen as a variant of unsupervised learning, in the sense that it works with unlabelled data. However, the trick here is to exploit labels that come for "free" with the data, namely, those that can be extracted from the structure of the data itself. Self-supervised algorithms work in two steps. First, the model is pre-trained to solve a "pretext task", where the aim is to obtain those supervisory signals from the data. Second, the acquired knowledge is transferred and the model is fine-tuned to solve the main or "downstream task". The literature on self-supervision for medical imaging applications is still scarce [74–77], but, for instance, a recent work used context restoration as a pretext task [76]. Specifically, small patches in the image were randomly selected and swapped to obtain a new image with altered spatial information, and the pretext task consisted of restoring the original version of the image. The authors then used this knowledge to fine-tune the model for image classification of 2D fetal ultrasound images, organ localization on abdominal CT images, and segmentation of brain MR images.
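The context-restoration pretext task described above can be sketched in a few lines: patches are swapped to corrupt the spatial context while preserving the intensity distribution, and the resulting (corrupted, original) pairs provide free supervision for pre-training. The snippet below (NumPy, synthetic image, hypothetical patch sizes) only generates such pairs; the restoration network and downstream fine-tuning are omitted.

```python
# Sketch of the "context restoration" pretext task: randomly swap small patches
# within an image so that intensities are preserved but spatial context is broken;
# a network would then be pre-trained to restore the original image.
import numpy as np

def swap_random_patches(image, patch=8, n_swaps=10, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    corrupted = image.copy()
    h, w = image.shape
    for _ in range(n_swaps):
        y1, x1 = rng.integers(0, h - patch), rng.integers(0, w - patch)
        y2, x2 = rng.integers(0, h - patch), rng.integers(0, w - patch)
        a = corrupted[y1:y1 + patch, x1:x1 + patch].copy()
        b = corrupted[y2:y2 + patch, x2:x2 + patch].copy()
        corrupted[y1:y1 + patch, x1:x1 + patch] = b
        corrupted[y2:y2 + patch, x2:x2 + patch] = a
    return corrupted

image = np.random.default_rng(0).random((64, 64))   # stand-in for an image slice
corrupted = swap_random_patches(image)
# (corrupted, image) pairs provide "free" supervision for pre-training a
# restoration network, whose encoder is later fine-tuned for the downstream task.
print("pixels altered:", int((corrupted != image).sum()))
```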

The existence of several hybrid learning frameworks shows that the boundaries between supervised and unsupervised learning have been progressively blurred to accommodate hybrid frameworks and combined strategies ( Table 1 ), which can address real-world problems and data sets pragmatically (see Figure 3 ).

Figure 3.

The tight framework of supervised learning can be hybridized with unsupervised learning to make room for practical cases and problems, as well as to accommodate temporality. Delaying supervision to later times leads towards reinforcement learning. Incompletely labelled data fosters semi-supervised learning, whereas small data sets encourage reusing (parts of) models trained previously on similar but bigger data sets, as in transfer learning. In self-supervision, pretraining relies on solving dummy supervised problems, where fake labels are created based on the inherent structure of image or sound data.

2.3. Typical AI-based medical imaging analysis workflow

Reviewing past works in the AI and ML literature shows that common blocks are used in most workflows for medical imaging processing ( Figure 4 ). As ML is driven by data, the preliminary steps are to extract and select relevant features from the data, that is, quantitative characteristics that summarize the information conveyed by the data into vectors or arrays. Then, this information is fed to generic predictive models, like classifiers or regressors, which learn to perform a certain task. An example of this strategy is the field of radiomics [104,105], where "-omics-like" features are extracted from radiological images in order to predict some indicator of interest, like a disease grade or a patient's survival.

Figure 4.

General ML pipeline for supervised learning: supervised predictive models are fed with features that are extracted and/or selected beforehand in an unsupervised way. Feature selection can, however, be embedded in some models, using regularization, for instance; selection then becomes supervised and therefore often improved. Classical (shallow) models tend to critically depend on unsupervised feature extraction and selection to preprocess data. In contrast, deep learning drops unsupervised feature extraction and selection; instead, it embeds multiple trainable layers of feature extractors and selectors, allowing the full pipeline to be supervised, end to end.

Feature engineering, extraction, and selection

Feature engineering, extraction, and selection are key steps to channel data to an AI method [8]. Feature engineering refers to crafting features by hand, either in an ad hoc fashion or by relying on generic features from the literature. For images, the former could be gray-level or color statistics, or shape descriptors (volume, diameter, curvature, sphericity, …). Image features are often classified into local or low-level features (specific to a small group of pixels in the image) and global or high-level features (characterizing the full image). For the latter, generic features would, for instance, result from applying Gabor or Laplace filters, edge detectors like Sobel operators, texture descriptors, Zernike moments, or popular transforms like Fourier or wavelet bases. In radiomics, all the above-mentioned features can be used together, like ad hoc tumor shape and intensity descriptors, as well as textural descriptors (typically, Haralick's gray level co-occurrence matrix [106]).
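As a simple illustration of hand-crafted feature extraction, the sketch below computes a few first-order intensity statistics and a Sobel edge-strength descriptor on a synthetic 2-D array standing in for an image slice (NumPy/SciPy; real radiomics pipelines use far richer descriptor sets).

```python
# Illustrative hand-crafted features: first-order intensity statistics and a
# Sobel edge-magnitude descriptor, computed on a synthetic 2-D "image".
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
image = rng.random((128, 128))                        # stand-in for a CT/MR slice

edge_magnitude = np.hypot(ndimage.sobel(image, axis=0),   # gradient along rows
                          ndimage.sobel(image, axis=1))   # gradient along columns

features = {
    "mean_intensity": float(image.mean()),            # global first-order statistics
    "std_intensity": float(image.std()),
    "p90_intensity": float(np.percentile(image, 90)),
    "mean_edge_magnitude": float(edge_magnitude.mean()),  # low-level edge descriptor
}
print(features)
```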

Alternatively, or as a second step, higher-level features can be extracted in a more data-driven way, using dimensionality reduction. Methods like Principal Component Analysis [107], Linear Discriminant Analysis [108], or auto-associative networks [109] can reduce the number of input variables according to some unsupervised or supervised criterion. For images more specifically, the convolutional filters involved in CNNs bear similarity to the filters above: they extract local features, but their parameters are learnt from data, and stacking them allows global, higher-level features to emerge. When features are not extracted in a supervised, data-driven way, some of them may be redundant or irrelevant. To address this issue, a feature selection step can discard them to focus on a reduced set of features. Feature selection can follow several strategies, by either selecting or discarding features. Wrappers [110] use a supervised predictive model to score subsets of features. To avoid the burden of a full-fledged predictive model, feature filters [111], not to be confused with the image filters above, use an unsupervised surrogate to score feature subsets, like their correlation or mutual information. Embedded methods [112] are directly integrated into the predictive model. For instance, feature weight regularization can favor sparse configurations, where irrelevant features get null weights. Examples of feature selection in radiomics can be found in [113–119], for instance. Deep neural networks typically rely on this last approach with regularization. In the example of radiomics, embedded feature selection can be implemented with deep neural networks [120] and regularization, hence allowing for end-to-end learning instead of combining manually engineered features with shallow predictive models.
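The three selection strategies (filter, wrapper, embedded) can be sketched with scikit-learn on synthetic data, as below; the feature counts and regularization strength are arbitrary illustrative choices.

```python
# Feature-selection sketch: a filter (cheap univariate scoring), a wrapper
# (recursive elimination around a predictive model), and an embedded approach
# (L1 regularization driving irrelevant feature weights to zero).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           n_redundant=5, random_state=0)

# Filter: score each feature with a cheap surrogate (here, mutual information)
filt = SelectKBest(mutual_info_classif, k=5).fit(X, y)
print("filter keeps:  ", np.flatnonzero(filt.get_support()))

# Wrapper: score feature subsets with a full predictive model
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("wrapper keeps: ", np.flatnonzero(wrap.support_))

# Embedded: L1 regularization inside the model sets irrelevant weights to zero
emb = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("embedded keeps:", np.flatnonzero(emb.coef_[0] != 0))
```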

Predictive models

Common tasks in AI are regression and classification. The AI/ML model then attempts to predict either continuous values (e.g., a dose or a survival time) or class probabilities (e.g., benign vs malignant) starting from input features. In the following, we describe the main methodological aspects of the basic predictive ML models; state-of-the-art ML/DL methods and examples of their clinical applications for medical imaging are presented in the next section.

Regression is the most generic task in supervised learning. Linear regression is well known, but other mathematical models can involve exponential or polynomial functions. ML generalizes this concept to universal approximators that can fit data sampled from almost any smooth function, possibly with many input and output variables. Artificial neural networks (NNs) are the most iconic universal approximators ( Figure 5 ). They consist of interconnected formal models of neurons, a mathematical 'cell' combining several 'dendritic' inputs into a weighted sum that triggers an 'axonal' output through a nonlinear activation function, like a step, a sigmoid, or a hinge (Rectified Linear Unit, ReLU).

Figure 5.

Artificial neural networks in a nutshell. (a) The formal neuron, processing several dendritic inputs through a nonlinear activation function f, to produce its axonal output. (b) The neurons can be interconnected in a feed-forward way, into successive layers; as soon as a nonlinear 'hidden' layer is inserted in between the inputs and outputs, the network can potentially approximate any function; specific activation functions can be fitted in the output layer to achieve either regression or classification. (c) Examples of nonlinear activation functions in the hidden layers: the step function, from biological inspiration; the sigmoid, its continuous and differentiable surrogate; and the rectified linear unit (ReLU), which improves training of deep layers.

As soon as a hidden layer of neurons with nonlinear activation functions is inserted between the input and the output layers, a NN becomes a universal approximator [121]. However, a notion of capacity is associated with the NN architecture: the more neurons the hidden layer contains, the more complex the functions that can be approximated. The capacity is roughly proportional to the number of synaptic weights (parameters) in the NN and is analogous to the polynomial order in polynomial regression (the number of weights in the terms). Deep NNs are obviously also universal approximators [121,122]. Their interest lies in trading the width of a single hidden layer for depth, as stacks of hidden layers allow functional differentiation (e.g., convolutional neurons for image data) and thus hierarchical processing, explaining the later success of deep networks compared to shallow ones. Most NNs are feed-forward, meaning that data flows unidirectionally from inputs to outputs. Recurrent NNs (RNNs) add feedback loops to feedforward connections, allowing them to process sequences of data (text, videos) and to keep a memory of past inputs, which then gives context to new inputs.

Training of NNs relies on minimizing a loss function between the desired output and the one provided by the NN in its current parameter configuration. The partial derivatives, or gradient, of the loss function with respect to these parameters indicate the direction in which tuning the parameters is likely to decrease the loss. In a feedforward NN, this derivative information flows back from layer to layer, and the procedure is therefore called gradient backpropagation.

For regression, typical loss functions are the mean squared error or the mean absolute error. With a suitable change of the output layer (softmax, or normalized exponential) and loss function (the cross-entropy), the NN can approximate class probabilities.
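These ingredients (a hidden layer with nonlinear activations, a loss function, and gradient backpropagation) come together in the minimal PyTorch sketch below, fitted to a synthetic regression problem; swapping the loss for the cross-entropy (and letting the output layer produce class scores) would turn the same network into a classifier.

```python
# Minimal feed-forward network: one ReLU hidden layer, trained by minimizing the
# mean squared error with gradients obtained through backpropagation.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)                                      # synthetic inputs
y = X.sum(dim=1, keepdim=True) + 0.1 * torch.randn(256, 1)    # regression target

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()                   # use nn.CrossEntropyLoss() for classification

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()                      # backpropagate the gradient of the loss
    optimizer.step()                     # adjust weights along the negative gradient

print("final training loss:", float(loss))
```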

Classification is the other prominent task in ML. Classifiers are simply algorithms that can sort data into groups or categories, and a large variety of them exists [123]. Some of the most popular ones are very intuitive and easy to interpret, such as decision trees [124], where input data is classified by going through a hierarchical, tree-like process including different branching tests of the data features ( Figure 6.a ). Growing several complementary decision trees together, in an ensemble learning strategy, leads to random forests ( Figure 6.b ; see also Section 3). Other simple algorithms for classification include the linear classifier, the Bayesian classifier, or the Perceptron ( Figure 5.a ). More sophisticated algorithms can actually be used for both regression and classification tasks. Some examples are NNs ( Figure 5.b ), which can yield class probabilities with suitable output layers; or support vector machines [125], which can be seen as an improved linear classifier that works in a higher-dimensional space and tries to fit the separating (hyper)plane with the thickest margin between the points of the two classes ( Figure 7 ).
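For illustration, the short sketch below compares two of the classifiers just mentioned, a decision tree and a linear support vector machine, on a synthetic two-class problem (scikit-learn; dataset and hyperparameters are arbitrary).

```python
# Quick comparison of two classical classifiers: a decision tree and a linear
# support vector machine (maximum-margin separating hyperplane).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(Xtr, ytr)
svm = SVC(kernel="linear", C=1.0).fit(Xtr, ytr)

print("decision tree accuracy:", tree.score(Xte, yte))
print("linear SVM accuracy:   ", svm.score(Xte, yte))
```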

Figure 6.

(a) Decision trees assign labels (leaves) to a given sample by going through a multi-level structure where different features (nodes) and outcomes (branches) are tested. (b) In a Random Forest algorithm, decision trees are combined, following an ensemble learning approach, which enables more accurate predictions than a single tree. Each individual tree in the forest outputs a class prediction, and the class with the most votes becomes the model's final prediction.

Figure 7.

Principle of the linear support vector machine, which lifts the indeterminacy of separable classification by fitting the thickest margin, anchored on a few 'support vectors'. The principle can be extended to nonlinear class separation by using Mercer kernels [125].

3. State-of-the-art AI methods for medical image analysis

In the last decade, intensive research in AI methods for medical applications, and specifically in ML/DL ( Figure 8 , left), has yielded thousands of publications reporting the performance of new algorithms and/or original variants of existing ones. The number of publications using some of the most popular ML/DL methods is presented in Figure 8 . In particular, in recent years, attention has moved from ML methods such as SVMs and Random Forests to Convolutional Neural Networks ( Figure 8 , right). In addition, since 2018, the use of other DL methods, such as Generative Adversarial Networks or reinforcement learning algorithms, has been rapidly increasing. Note that this section is not intended to be an exhaustive review of the application of AI methods to the medical field, but rather an illustration of the potential of these methods. Thus, in the following, we describe the basic methodological aspects of two of the most widely used algorithms (Random Forests and CNNs), as well as the increasingly popular GANs, and we provide some examples of recent applications of these methods to the field of medical image processing.

Figure 8.

Number of publications from 2010 to 2020 in the PubMed repository containing keywords related to AI/ML/DL methods in the title and/or abstract.

Random Forests (RFs)

Random forests (RFs) [126,127] use an ensemble of uncorrelated binary decision trees (multiple learning models) to find the best predictive model ( Figure 6 ). Each decision tree can be seen as a base model (binary classifier) with its respective decision, where a combination of such decisions leads to the final output. This is achieved in RFs by using two distinctive mechanisms, i.e., internal feature selection and voting [128]. The RF algorithm extracts a multitude of low-level (simple) data representations and uses the feature selection mechanism on all collected features to find the most informative ones. After feature selection, a majority vote over the selected classifiers yields the final decision. For a fully detailed description of the RF algorithm we refer to [128].
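The voting and internal feature-selection mechanisms can be illustrated with a few lines of scikit-learn on synthetic data, as sketched below; the number of trees and features are arbitrary illustrative values.

```python
# Random-forest sketch: many decorrelated trees grown on bootstrapped samples
# and random feature subsets, combined by majority vote; feature importances
# expose the internal feature-selection mechanism.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=25, n_informative=6,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
print("cross-validated accuracy:", cross_val_score(rf, X, y, cv=5).mean())

rf.fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("most informative features:", top)
```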

The earliest applications of RFs date from a decade ago, for organ localization [129] and delineation [130]. Since then, RFs have been applied to numerous tasks, including detection and localization, segmentation, and image-based prediction [128]. For some specific applications, RFs have demonstrated improved performance over other classical ML methods. For instance, Deist et al. [131,132] compared six different classification algorithms (decision tree, RFs, NNs, SVM, elastic net logistic regression, LogitBoost) on 12 datasets with a total of 3496 patients, for outcome and toxicity prediction in (chemo)radiotherapy. They concluded that RFs achieved the highest discriminative performance (on 6 of the 12 datasets). This is in line with the findings of more fundamental ML research studies, which have reported RFs as one of the best classical learning algorithms [133]. However, many other works in the medical field have also compared the accuracy of RFs against more complex or simpler ML classifiers, and it is well known that their performance may vary for different applications [103,113,132,134–139] and even for different datasets within the same application [131,132]. This makes it hard to conclude on the absolute superiority of the RF algorithm over other ML classifiers. Nevertheless, the work of Deist et al. included one of the largest datasets investigated so far for radiotherapy outcome prediction, which is a strong argument in favor of considering RFs as one of the first options to investigate for this kind of application. In addition, RFs keep achieving very promising results in recent applications related to outcome prediction [135,139–143], but also in other domains like image classification [113,144] or automatic treatment planning [100,145–147]. Regarding other tasks where RFs were among the state-of-the-art methods a few years ago, like image synthesis [148–150] or segmentation [151,152], the community has now fully switched its attention to CNNs [5,153,154]. Nevertheless, in favor of RFs one could argue that they are easy to implement and less computationally expensive than CNNs (i.e., they can run on a regular CPU). Therefore, they still deserve an important place in the ML toolbox for medical imaging.

Convolutional Neural Networks (CNNs)

Convolutional neural networks (CNNs) are inspired by the human visual system and exploit the spatial arrangement of data within images. Their remarkable capacity to extract hierarchical data representations has made CNNs the most popular architecture for current medical image processing applications.

Traditionally, CNNs stack successive layers of convolutions and down-sampling, followed by fully connected layers towards the output ( Figure 9 ). Sequential applications of multiple convolutions enable the network to first extract simple features, like edges, in the earliest layers, which are then combined and refined in deeper layers into richer, more complex, hierarchical features, like full organs. Within each convolutional layer, feature saliency is determined by scanning a fixed-size convolution kernel (typically 3x3) all over the image to yield a feature map. This allows for an economy of parameters (weight sharing) and hence easier training. Downsampling layers are inserted between convolutional layers to reduce the size of the feature maps, typically by applying a max-pooling operation, which keeps the maximum pixel value of each non-overlapping 2-by-2 block in the feature map. To some extent, successive max-pooling allows for some shift invariance with respect to image content, as the salient maximum might stem from anywhere in the block. Downsampling trades resolution for number, as more convolution filters can be applied to smaller feature maps within the same memory footprint. Eventually, fully connected layers, where all neurons are interconnected, generate the outputs.
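The convolution/pooling/fully-connected pattern described above can be written compactly in PyTorch, as in the illustrative sketch below (single-channel 64x64 inputs, two output classes, and all layer sizes are arbitrary choices of ours).

```python
# Minimal CNN: stacked 3x3 convolutions and 2x2 max-pooling, followed by fully
# connected layers that produce the final class scores.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
            nn.Linear(64, n_classes),              # class scores (softmax in the loss)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
logits = model(torch.randn(4, 1, 64, 64))          # dummy batch of 4 images
print("output shape:", tuple(logits.shape))        # (4, 2)
```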

Figure 9.

Typical architecture for a (deep) Convolutional Neural Network (CNN). Different convolutional kernels scan the input images, leading to several feature maps. Then, down-sampling operations, such as max-pooling (i.e., taking the maximum value of a block of pixels), are applied to reduce the size of the feature maps. These two operations, convolution and pooling, are applied multiple times to extract higher-level features. At the end, the feature maps are flattened and passed through fully connected layers of neurons (see Figure 5 ), to obtain a final prediction. The embedded (automatic, trainable) feature extraction ( Figure 4 ) is what enables CNNs to remove all handcrafted operations and makes them so powerful.

Fully convolutional networks (FCNs) [155] were proposed to efficiently perform image-to-image tasks like segmentation. In CNNs, repeated convolution and max-pooling layers lead to low-resolution, abstract outputs. In order to return to full-resolution images, the fully connected layers of CNNs are replaced in FCNs with upsampling operations that revert convolution and max-pooling. Along the same line, U-net [156] was presented for biomedical image segmentation and is now widely used in medical imaging. It is an encoder-decoder style network, where the encoder can be seen as a feature extraction block and the decoder as an output generation block. Within medical imaging, FCNs are used in both supervised and unsupervised settings, depending on the architecture. In supervised training, FCNs are mostly used for discriminative tasks, such as detection, localization, classification, segmentation, and denoising. Note that the terms CNN and FCN are often used interchangeably.
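A stripped-down encoder-decoder of this kind is sketched below in PyTorch (without the skip connections that characterize the actual U-net, and with arbitrary toy layer sizes): pooling shrinks the feature maps, and transposed convolutions bring them back to full resolution for a pixel-wise output.

```python
# Tiny encoder-decoder (U-net-like, but without skip connections): convolutions
# and pooling reduce resolution, transposed convolutions restore it to produce
# a per-pixel segmentation map.
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, n_classes, 2, stride=2),   # per-pixel class scores
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinyEncoderDecoder()
out = model(torch.randn(1, 1, 64, 64))
print("segmentation output shape:", tuple(out.shape))         # (1, 2, 64, 64)
```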

For certain applications, such as image segmentation [154,157,158] or synthesis [5], CNNs are now considered the state-of-the-art methods [4]. Although the comparison of different algorithms on the same dataset is not so common, an excellent way to track the evolution of the state-of-the-art algorithms is to look at the challenges and competitions organised around specific topics. In certain cases, CNNs have clearly surpassed the performance of more classical methods. A good example is the Challenge on Liver Ultrasound Tracking (CLUST): the winning team in the first edition (2014) achieved a tracking error of 1.51 ± 1.88 mm using an approach based on image registration algorithms [159], whereas the current best-performing algorithm, based on CNNs, achieves sub-millimetre accuracy (0.69 ± 0.67 mm), demonstrating a more robust model [160]. Another example is the database from the MICCAI Head and Neck Auto-segmentation Challenge 2015 [161], for which the most recent methods based on CNNs [26,162–164] have improved the Dice coefficients obtained at that time with model- and atlas-based algorithms by more than 3% on average. In particular, the work of Nikolov et al. [26] has recently reported a U-Net architecture with an accuracy equivalent to that of experienced radiographers. These are just two of the many competitions organised around medical imaging tasks [165], but year after year CNNs are becoming the backbone of the best-performing algorithms.

Some of the latest methodological improvements in the architecture of CNNs that have contributed to more robust and accurate models include the coarse-to-fine cascading of two CNNs [166] to address class-imbalance issues; the addition of squeeze-and-excitation (SE) blocks, which allow the network to model channel and spatial information separately [167], increasing the model capacity; and the implementation of attention mechanisms, which enable the network to focus only on the most relevant features [168–170].

Besides image segmentation, other recent successful applications of CNNs include classification [171,172], outcome prediction [120,173,174], automatic treatment planning [62,87,175], motion tracking [176,177] or image enhancement [33]. In numerous applications, CNNs have either demonstrated an accuracy similar to that of human experts [26,80,178,179], decreased the interobserver variability [180,181], or reduced the physician's workload for a specific task [157].

Generative adversarial networks (GANs)

Generative adversarial networks (GANs) [182] are popular architectures used for generative modeling. GANs consist of two networks: a generator ℊ and a discriminator D ( Figure 10 ). The intuition is that ℊ iteratively tries to map a random input distribution to a given data distribution to generate new data, which D evaluates. Depending on the feedback from D, ℊ tends to minimize the loss between the two distributions, thus generating samples similar to the input data. The goal is to trick D into classifying generated data as real. Both networks are trained simultaneously to get better at their respective tasks: while ℊ is learning to fool D, D is concurrently learning to better distinguish generated data from real input data. Note that both D and ℊ are generally CNNs trained in an adversarial setup.
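The adversarial training loop can be sketched schematically as below (PyTorch, simple fully connected networks and a two-dimensional toy data distribution rather than images; purely illustrative): the discriminator is updated to tell real from generated samples, and the generator is updated to fool it.

```python
# Schematic GAN training loop on toy 2-D data: G maps noise to samples, D scores
# samples as real or fake, and the two are updated in alternation.
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # noise -> sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def real_batch(n=64):                       # target distribution: a shifted Gaussian
    return torch.randn(n, 2) * 0.5 + torch.tensor([2.0, -1.0])

for step in range(1000):
    # Discriminator step: classify real samples as 1 and generated samples as 0
    fake = G(torch.randn(64, 8)).detach()
    loss_d = bce(D(real_batch()), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: try to make the discriminator classify generated samples as real
    fake = G(torch.randn(64, 8))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print("generated sample mean:", G(torch.randn(1000, 8)).mean(dim=0).tolist())
```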

Figure 10.

Structure of Generative Adversarial Networks (GANs). Starting from random noise, the generator (G) uses the feedback from the discriminator (D) and learns to create images that are similar to the provided ground truth.

Unlike CNNs, which have relatively old foundations dating back to the 1980s (Section 2), adversarial learning is a rather new concept. However, it has rapidly taken root in the medical imaging field, leading to numerous publications in the last few years [183,184]. The initially proposed architecture for GANs [182] suffered from several drawbacks, such as very unstable training, but intensive research in the field of computer vision has led to substantial improvements, by either changing the architecture of D and ℊ, or investigating new loss functions [183,185]. A way to better control the data generation process in GANs is to provide extra information about the desired properties of the output (e.g., examples of the desired real images, or labels). This is known as conditional GANs (cGANs) [186], and it can be categorized as a form of supervised learning since it requires aligned training pairs. However, we believe that the real strength of GANs lies in their ability to learn in a semi-supervised or fully unsupervised manner. Specifically, in the medical imaging field, where aligned and properly annotated image pairs are seldom available, GANs are starting to play a very important role. In this context, the cycleGAN [187] is probably one of the most famous architectures, allowing bidirectional mapping between two domains using unpaired input data.

So far, in the medical imaging field, GANs have been mostly applied to synthetic image generation for data augmentation [188–190] and multi-modality image translation (e.g., MR to CT [90,191–193] or CBCT to CT [97,194], among others [5,183]). Regarding data augmentation applications, we believe that GAN-based models have the potential to better sample the whole data distribution and generate more realistic images than traditional approaches (e.g., rotation, flipping, etc.), which may contribute to a higher generalizability of the model [188] and more efficient training [195]. For multi-modality image translation, although cGANs have achieved good results [90,191,193,196], cycleGANs usually outperform them in terms of accuracy, in addition to overcoming the issues related to paired image training (i.e., inaccurate alignment or labelling) [39,97,194,197]. Besides image translation, GANs have also been applied to other tasks, such as segmentation [198–203], radiotherapy dose prediction [204–207], or artifact reduction [208], among others [183].

All the above-mentioned applications have exploited the generative capacity of GANs, but we believe that their discriminative capacity may also have some potential, since the discriminator can be used as a regularizer or as a detector of abnormal images [209], which might be an excellent application for quality assurance tasks in radiation oncology, for instance.

4. Discussion and concluding remarks: where do we go next?

This article provided an overview of AI with a focus on medical imaging analysis, paying attention to key methodological concepts and highlighting the potential of the state-of-the-art ML and DL methods to automate and improve different steps of the clinical practice. Incorporating such knowledge into the clinical practice and making it accessible to the medical community will definitely help to demystify this technology, inspire new and high quality research directions, and facilitate the adoption of AI methods in the clinical environment.

Looking at the evolution of AI methods, one can certainly conclude that shifting from computationalism to connectionism, together with the transition from shallow to deep architectures, has brought a disruptive transformation to the medical field. However, an important part of the research so far has focused on simply translating the latest ML/DL advances in the field of computer vision to medical applications, in order to demonstrate the potential of these methods and the feasibility of using them to improve clinical practice. This is the case for some of the papers cited in this manuscript, such as the first proofs of concept of the use of CNNs for organ segmentation [32] and for dose prediction for radiotherapy treatments [34], or the use of GANs for conversion between image modalities [97]. Although the technological transfer from computer science to the medical field will certainly continue to bring important progress, the next generation of AI methods for medical applications will only emerge if the medical community steps up to embrace AI technology and integrates domain-specific knowledge into the state-of-the-art AI methods [210,211]. This can be done in several ways, such as adding extra information in the input channels of the models or using dedicated loss functions during model training. Some groups have already started to explore these research directions. For instance, instead of using generic loss functions from computer vision tasks, like the mean squared error, one could use loss functions that better target the specificities of the medical problem at hand, such as including mutual information for the conversion between image modalities [90,193] or dose-volume histograms for radiotherapy dose predictions [212]. Regarding the injection of domain-specific knowledge as input to the models, some examples include the addition of electronic health records and clinical data, like text and laboratory results, to the image data [213–215], or providing first-order priors or approximations of the expected output [175,216–219].

Integrating domain-specific knowledge can not only serve to improve the performance of state-of-the-art AI models, but also increase the interpretability of the results, which is one of the well-acknowledged limitations of current ML/DL methods [220–223]. This is the idea behind the so-called Expert Augmented Machine Learning (EAML), whose goal is to develop algorithms capable of extracting human knowledge from a panel of experts and using it to establish constraints on the model's predictions [224]. This "human-in-the-loop" approach is also useful to train AI models more efficiently. Indeed, some preliminary studies have reported that blindly increasing the size of the training database will not bring much improvement to a model's performance [225]. In contrast, active learning [226] is a type of iterative supervised learning that follows this human-in-the-loop concept, where the algorithm itself queries the user for new data points where they are most needed, in order to build up an optimally balanced training dataset [227,228]. Nevertheless, although a human-centered approach to AI models is certainly the way to go in the near future, parallel research should focus on implementing strategies that alleviate the problem of data labelling with semi-supervised, unsupervised, or the increasingly popular self-supervised learning.
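As an illustration of the human-in-the-loop idea, the sketch below implements pool-based active learning with uncertainty sampling (scikit-learn, synthetic data, and an automatic "oracle" that plays the role of the expert; batch sizes and rounds are arbitrary).

```python
# Pool-based active learning with uncertainty sampling: train on a small labelled
# set, then repeatedly ask the "expert" (here, an oracle array of known labels)
# to annotate the unlabelled samples the model is least certain about.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
labelled = np.arange(20)                        # start with a tiny labelled set
pool = np.arange(20, 1000)                      # unlabelled pool

for round_ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])
    proba = model.predict_proba(X[pool])[:, 1]
    uncertainty = -np.abs(proba - 0.5)          # closest to 0.5 = least certain
    query = pool[np.argsort(uncertainty)[-10:]] # ask the expert for 10 new labels
    labelled = np.concatenate([labelled, query])
    pool = np.setdiff1d(pool, query)
    print(f"round {round_}: accuracy on the full set = {model.score(X, y):.3f}")
```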

The quality of the data itself is certainly another important aspect worth discussing. Data collection and curation are indeed of paramount importance, since errors, biases, or variability in the training databases are often directly reflected in the model behavior and can have dramatic consequences on the model's performance and its clinical outcome. Some examples of these issues include gender imbalance [229], racial bias [230], or data heterogeneity due to changes in treatment protocols over time [225]. Despite progress in AI methods, data collection remains poorly automated, and the time dedicated to data collection and curation is often overly long. In fact, most state-of-the-art AI algorithms can be trained in a few hours, whereas building a large-scale, well-curated database can take months. Therefore, in the same way that physicians are familiar with planning protocols or delineation guidelines, clinical teams should become familiar with guiding principles for data management and curation in the era of AI. The FAIR (Findability, Accessibility, Interoperability, and Reusability) Data Principles [231] are the most popular and general ones, but the medical community should focus efforts on adapting those principles to the specificities of the medical domain [232–234]. Only in this way will we achieve a safe and efficient clinical implementation of AI methods. In addition, federated learning approaches [235–237] can be used to train AI models across institutions while ensuring data privacy protection, sharing clinical knowledge, and reaping the advantages of collaborative AI solutions.

Investing time and collaborative effort in high-quality databases is certainly the way to move forward. So far, two aspects have played an important role in the recent development of AI and ML, namely, data repositories [238] and contests [239,240], the former feeding the latter. Competitions foster emulation among actors in the domain and allow state-of-the-art models to be benchmarked. A few examples have been cited in this manuscript, but there are multiple competitions every year that lead to public data repositories [241–243]. A very recent example is the breakthrough of AI in the CASP competition [244]. However, the results and rankings from competitions must be interpreted carefully when transferring the acquired knowledge into clinical applications [245,246]. Due to the high stakes in the medical domain, the community should devote even stronger efforts, and international organizations should issue recommendations for data collection and curation, as well as for the design of contests and competitions. In the long term, this would lead to a much more structured and uniform clinical practice, with reduced differences between centres. Bigger, more homogeneous data could then potentially allow for another level of AI, by extracting much finer information at the level of large populations. If data for AI is still in its infancy, so are the methods. In spite of amazing progress and impressive results, current AI remains cast within tight frameworks. AI for images has been dealt with here. Other application domains focus on natural language and speech processing, with related but quite different approaches. So far, computer vision and audition follow different specialized approaches, although some works attempt to bridge the gap, like automatic image captioning. Nevertheless, these building blocks remain mostly separate, lack integration, and are still considered weak AI. For instance, typical CNNs can boil down to just big filter banks, without any notion of time, and thus no memory and no experience. Strong AI will emerge only when AI for images and speech, as well as active learning [227,228], are combined into a sort of Frankensteinian brain, in which specialized lobes for the different senses are interconnected. This will allow for richer interaction, explainability through speech, reference to past experience, and continuous improvement. Such a path has already been paved for autonomous driving, with its different levels of automation. Confidence in ever more complex AI will grow only if AI can become more anthropomorphic, at least from a functional point of view.

In conclusion, artificial intelligence methods, and in particular, machine and deep learning methods, have reached important milestones in the last few years, demonstrating their potential to improve and automate the medical practice. However, a safe and full integration of these methods into the clinical workflow still requires a multidisciplinary effort (computer science, IT, medical experts, …) to enable the next generation of strong AI methods, ensuring robust and interpretable AI-based solutions.