Learning from deep learning

Deep learning is a machine learning technique that allows computers to do what comes naturally to humans: learn by example. With enough data, computing power, and a well-designed experiment, these artificial intelligence (AI) networks clearly outperform competing techniques. Image analysis is one area with particularly convincing results, demonstrating the high performance of convolutional neural networks (CNNs) in the field of medical diagnostics. However, the main concern is that the systems appear opaque, and the basis of their predictions is not traceable by humans. This “black box” nature of the systems prevents us from learning what the systems have discovered as important features.

In medical diagnostics, an important task is predicting disease outcome for cancer patients based on images of pathological specimens. Accurate prediction of tumour malignancy enables the clinician and patient to make a joint, and more appropriate, treatment decision. With the anticipated widespread adoption of digital pathology, AI systems are expected to play a rapidly increasing role in pathology, not least in prediction tasks based on histopathological images derived from tumours. Increased knowledge of the basis for the systems’ classifications then becomes essential.

Revealing the basis of a network’s predictions or classifications is a challenging task. However, recent developments have provided methods that enhance our ability to detect and visualise areas of images of particular importance to network predictions. Still, understanding the image characteristics that neural networks utilise for outcome predictions is complicated by vast and diverse images as well as limited knowledge about what clues we are looking for. Thus, it appears unlikely that satisfactory understanding can be obtained without supplementing the information in the standard histopathological Haematoxylin and Eosin-stained images (HE-images) with concrete biomedical information, including biochemical measurements on the cellular level. This information may relate to factors like the presence of mitotic cells, chromatin organisation, and protein expression, and must be provided on an image basis and aligned to the HE-images in order to facilitate a direct comparison with the network predictions.

A basic idea behind the present project is to enable the necessary comprehensive approach for understanding CNNs by utilising the competencies and results acquired over many years in the biomedical and informatics milieu at the Institute of Cancer Genetics and Informatics (ICGI) and fostered by the DoMore! lighthouse project. The essential components of this approach are described below.

The DoMore v1 network is built around multiple instance learning and designed for classifying supersized heterogeneous images of histopathological sections. When imaging HE-stained sections of resected tumours on a microscope scanner, the resulting whole-slide images (WSIs) are too large to be processed all at once with regular deep learning methods using current hardware. Instead of downscaling the images, thereby losing important image information, the WSI can be partitioned into smaller non-overlapping image patches, called tiles. When assessing the DoMore-v1-CRC classifier, each tile at two resolutions (referred to as 10x and 40x tiles) is processed by five CNNs. Each CNN produces a tile representation, and all tile representations associated with a patient and CNN are aggregated and reduced to a single continuous score. The DoMore-v1-CRC marker is a combination of the scores produced by the ten networks, with a high score indicating poor prognosis. This new biomarker has been extensively evaluated in large, independent series of CRC patients; it correlates with, but outperforms, established molecular and morphological prognostic markers, and gives consistent results across tumour and nodal stage (Skrede et al., Lancet 2020;395:350-360). The biomarker stratified stage II and III CRC patients into well-separated prognostic groups and may guide the selection of adjuvant treatment by avoiding therapy in very low-risk groups and by identifying patients who might benefit from more intensive treatment regimens.
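As a rough illustration of this tile-based multiple instance learning pipeline, the following Python sketch partitions an image array into non-overlapping tiles and aggregates per-tile scores from several CNNs into a single patient-level score. The `cnn_stub` function and the mean-based aggregation are simplifying assumptions for illustration, not the actual DoMore v1 architecture.

```python
import numpy as np

def tile_partition(wsi, tile=512):
    """Partition a whole-slide image array into non-overlapping tiles."""
    H, W = wsi.shape[:2]
    tiles = []
    for y in range(0, H - tile + 1, tile):
        for x in range(0, W - tile + 1, tile):
            tiles.append(wsi[y:y + tile, x:x + tile])
    return tiles

def cnn_stub(tile):
    """Placeholder for a CNN producing a scalar tile representation."""
    return float(tile.mean()) / 255.0

def patient_score(wsi, n_cnns=5):
    """Aggregate all tile representations per CNN, then combine across CNNs."""
    tiles = tile_partition(wsi)
    per_cnn = [np.mean([cnn_stub(t) for t in tiles]) for _ in range(n_cnns)]
    return float(np.mean(per_cnn))
```

In the real system, each tile resolution (10x and 40x) has its own set of five CNNs, and the aggregation of tile representations is learned rather than a plain mean.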

Explainable deep learning for image classification can roughly be put into three categories: model distillation methods, intrinsic (interpretable-by-design) methods, and input attribution methods.

Initial assessments indicate that the model distillation methods rely on internal feature representations, which may be hard to interpret in our particular case. The intrinsic methods require alternative networks; thus, we mainly aim to use different input attribution methods to untangle decisions made by our already well-performing classification networks.

A saliency map of a classified image is an image with the same spatial dimensions as the classified image. A pixel value in the saliency map reflects the influence of the corresponding input pixel on the network prediction. A first natural approach is to compute the partial derivative of the prediction score for the predicted class with respect to each input pixel (Simonyan et al., 2013;arXiv:1312.6034[cs.CV]). With this approach, the saliency map visualises how sensitive the network prediction is to a change in a single pixel value. One challenge, however, is that the partial derivative of the prediction score with respect to some feature might be of low magnitude even if the network depends on this feature for its prediction (Sturmfels et al., 2020;https://distill.pub/2020/attribution-baselines/).
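A minimal sketch of this gradient-based saliency, assuming a toy differentiable "network" (a linear score through a sigmoid) and using finite differences in place of backpropagation:

```python
import numpy as np

def predict(img, w):
    """Toy 'network': linear score followed by a sigmoid."""
    z = float((img * w).sum())
    return 1.0 / (1.0 + np.exp(-z))

def saliency_map(img, w, eps=1e-5):
    """Finite-difference estimate of d(prediction)/d(pixel), per pixel."""
    base = predict(img, w)
    sal = np.zeros(img.shape, dtype=float)
    for idx in np.ndindex(img.shape):
        bumped = img.astype(float)
        bumped[idx] += eps
        sal[idx] = (predict(bumped, w) - base) / eps
    return np.abs(sal)
```

For a real CNN, the same quantity is obtained in one backward pass rather than one forward pass per pixel; the sketch only makes the definition concrete.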

An alternative set of methods is Layer-wise Relevance Propagation (LRP), which creates saliency maps based on the relevance of each input pixel to the prediction score for the predicted class (Bach et al., PLOS ONE 2015;10:e0130140). The relevance of a pixel is an estimate of the strength of the relation between the input pixel and the output score. Another option is based on Integrated Gradients (IG) (Sundararajan et al., Proceedings of the 34th International Conference on Machine Learning 2017;70:3319-3328). Instead of measuring the gradient of the prediction score with respect to the actual input, these methods accumulate gradient contributions as the input is interpolated from some baseline image to the actual image. With this approach, we can find which pixels contributed the most to the change in prediction from the baseline (representing a random prediction) to the actual input (giving the actual prediction) for the predicted class of the actual image.
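A small numerical sketch of Integrated Gradients for the same kind of toy sigmoid-linear model (an illustrative assumption; real use would backpropagate through the CNN). The completeness property, that the attributions sum to the difference between the prediction at the input and at the baseline, provides a useful sanity check:

```python
import numpy as np

def predict(x, w):
    """Toy model: linear score through a sigmoid."""
    return 1.0 / (1.0 + np.exp(-float((x * w).sum())))

def grad(x, w):
    """Analytic gradient of the toy model."""
    s = predict(x, w)
    return s * (1.0 - s) * w

def integrated_gradients(x, w, baseline=None, steps=200):
    """Accumulate gradients along the straight path from baseline to input."""
    if baseline is None:
        baseline = np.zeros_like(x)
    diff = x - baseline
    total = np.zeros_like(x, dtype=float)
    for k in range(steps):
        # midpoint Riemann approximation of the path integral
        point = baseline + ((k + 0.5) / steps) * diff
        total += grad(point, w)
    return diff * total / steps
```

The zero baseline used here is itself a modelling choice; for histopathology, choosing a baseline that represents "no prognostic information" is part of the research question.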

To use these and similar approaches to explain a prognostic network analysing medical images is more challenging than to explain a network classifying natural images, which is the setting where these methods typically have been developed and tested. The output of the prognostic network is a score reflecting the prognosis, which is not an intrinsic attribute of the input image, but rather a consequence of something that is indicated by the information contained within the input image. As this is the case, we cannot expect to understand the patterns in the saliency maps without complementary information. Thus, when the saliency maps emphasise any given cellular compartment, organelle, or part of the cancer tissue, we need to complement those images with additional relevant biomedical data. For instance, the saliency maps may indicate that part or all of the cell nucleus is important for the network prediction, but it may be that only certain nuclei matter, such as those exhibiting properties shown to impact the prognosis, like the cell’s DNA content as reflected in integrated optical density (Danielsen et al., Nature Reviews Clinical Oncology 2016;13:291-304) or the chromatin organisation as revealed by texture analysis (Kleppe et al., Lancet Oncology 2018;19:356-369). Without information about relevant biomedical features, it appears unlikely that we can reliably distinguish between objects that the saliency maps indicate to be important and those indicated as not important. Without the possibility to make this distinction, we are unable to learn much about how the prognostic network provides its accurate prediction of patient outcome. Furthermore, it will be difficult to derive new knowledge about the disease processes driving metastasis, as well as about how we may improve the input images to obtain even better classification performance.
Thus, we need to simultaneously measure and align biomedical features known to be involved in disease progression, such as abnormal DNA content, genomic instability, the expression of particular genes, and specific mutations. In order to be able to classify the objects highlighted by the saliency maps, it would also be relevant to acquire and align information about cell type.

In spite of the extraordinary classification results presented in our recent Lancet contribution, the most intriguing question still remains: How do neural networks utilise plain microscopic images of cancer tissue to predict the outcome for the patient, years later, doing so better than all established molecular and morphological prognostic markers? What features of the cancer tissue are actually revealing the patient’s outcome? The aim of this project is to provide, at least in part, answers to these questions.

The basis for the activity is the DoMore v1 network, supplemented by work using the well-established Inception v3 network. These networks were trained on tumour tissue sections from 828 CRC patients with distinctly good or poor disease outcome from four cohorts, tuned based on 1645 patients with non-distinct disease outcome, tested on 920 patients with tumour sections prepared in the UK, and finally independently validated according to a predefined protocol in 1122 patients with sections prepared in Norway. These patient cohorts will also be used in the new tissue profiling described below and will constitute a solid biological basis for the research project.

As described above, the networks were trained on WSIs partitioned into smaller non-overlapping tiles. However, these tiles do not have boundaries coinciding with biologically natural structures. Thus, one part of the project will be to retrain the classification networks using tiles with higher resolution (63x) as well as with images of segmented cell nuclei instead of tiles. In order to facilitate this, scanner development is also a prioritised part of the project.

The project will be organised as seven separate work packages. In WP1-WP4, we will develop the methods and tools necessary to address the visualisation approaches, including measuring a range of biomedical prognostic markers. In WP5-WP6, we will attempt to identify the image characteristics learned and utilised by the deep learning networks, and in WP7, we will make the connection between image features and biomedical markers.

WP1 - Tissue profiling by multiparametric biomarker assessments

The simultaneous assessment of a range of biomarkers is required to reveal potential important biological factors that can explain the prognostic impact of the AI method. In order to identify a high number of biomarkers on the same tile or within the same cell (and consequently clusters of tiles/cells in the tissue), each tissue section will have to be sequentially stained against several different targets. A number of markers can be quantified from the HE-stained WSI scan before de-staining and re-staining with Feulgen and Immunohistochemistry (IHC).

In a prostate cancer pilot study, we have successfully managed eight sequential stainings: HE, Feulgen and six rounds of IHC, as shown in the figure above. So far, we have quantified 13 markers in a single tissue section. All markers are automatically quantified using advanced AI-based image analysis to allow for an objective assessment and the high throughput required to obtain robust conclusions.

Sequential staining in a larger context as part of Learning from deep learning

Our sequential staining repertoire will be expanded to include new and an increased number of protein targets, as well as mRNA in situ hybridisation analysis and chromogenic in situ hybridisation (CISH). Feulgen staining provides the basis for ploidy analysis, investigating large chromosomal aberrations, and Nucleotyping, investigating chromatin organisation. CISH will provide additional information on smaller chromosomal rearrangements. mRNA in situ hybridisation will expand our gene expression data and enable us to compare mRNA and protein expression directly in the tissue section on a cell-by-cell basis. Samples from a total of 512 CRC patients with distinct or near distinct disease outcome will be included.

WP2 - Pseudo IHC

Protein expression can be studied with IHC and provides biological information about cells and cellular events, such as cell type, proliferation, and apoptosis. IHC requires special preparation of the tissue section. The principal HE-stain highlights cellular and structural information that is associated with some of the more specific molecular characteristics obtained with IHC.

We have developed an AI-based method to identify cells expressing given features or proteins directly in routine HE-sections (Histopathological Image Analysis, Patent application number PT/EP2018/074155). So far, we have successfully identified cells in S-phase in HE-sections by training a deep learning system on WSI of the same sections re-stained with an antibody that binds to the Ki-67 protein (a marker of proliferation). Cells expressing the Ki-67 protein are identified in the IHC-sections, and the information is transferred to the HE-sections, where the cells’ Ki-67 status is used as ground-truth for training.
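The label-transfer step in this pseudo-IHC approach can be sketched as below: binary ground-truth labels are derived from the co-registered IHC image and paired with the HE tiles for training. Thresholding a DAB stain channel is a simplifying stand-in here for the actual cell-level Ki-67 detection, and all names are illustrative.

```python
import numpy as np

def ki67_mask(dab_channel, threshold=0.3):
    """Binary mask of Ki-67-positive pixels from a co-registered IHC DAB channel."""
    return (dab_channel > threshold).astype(np.uint8)

def training_pairs(he_tiles, dab_tiles, threshold=0.3):
    """Pair each HE tile with ground-truth labels from its registered IHC tile."""
    return [(he, ki67_mask(dab, threshold)) for he, dab in zip(he_tiles, dab_tiles)]
```

The resulting (HE tile, label mask) pairs would then feed a segmentation or detection network that learns to predict Ki-67 status from the HE appearance alone.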

We will expand this toolbox to include other proteins relevant for the tumourigenesis of CRCs such as the proteins (MLH1, MSH2, MSH6, PMS2) expressed in cells harbouring microsatellite instability (MSI).

The DoMore network uses information from HE-stained tissue sections, where the tumour tissue contains several other cell types in addition to the epithelial tumour cells. Consequently, we aim to characterise the distribution and location of cell types in the tissue. We hypothesise that these different cell types are sufficiently distinct to train a deep learning network to recognise them in HE-stained tissue sections. The cell types are characterised by the expression of specific proteins, and we will use the pseudo-IHC approach described above to identify cell types in HE-sections based on IHC markers for each of the following cell types:

Epithelium (AE1/AE3), fibroblasts (Vimentin), muscle tissue (Vimentin & Desmin), endothelial cells (CD31/PECAM-1), nerve cells (S100), T-cells (CD4 & CD8), B-cells (CD20), neutrophil granulocytes (myeloperoxidase), monocytes/macrophages (CD68), mast cells (mast cell tryptase), natural killer cells/neutrophils/monocytes (CD56). A total of 150 CRC patient samples will be included for each of these 16 markers.

The pseudo-IHC approach will provide information about tissue composition and cellular features essential to understand and identify the results of WP5/WP6, and successful features will be added to the list of markers in WP7. The proteins and cell types that can be quantified and identified using the pseudo-IHC approach strongly depend on whether the feature of interest is visible and can be measured in the HE-stained tissue section. Features for which quantification in HE-sections is not feasible will be included in the framework of WP1 and quantified in IHC-sections.

WP3 - MicroTracker, a computational framework for the alignment of cellular, morphological and AI-features

A tool is being developed for alignment of cellular, morphological and AI-features in different WSIs and microscopy images (A method and apparatus for aligning or superposing microscope images; https://patents.google.com/patent/GB2434651A). This tool registers all the different WSIs of the same tissue section and identifies the same cells/tiles in each WSI, and hence allows for the required cell-to-cell/tile-to-tile comparison for each and all markers.
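At its simplest, the registration underlying such cell-to-cell comparison can be seen as fitting a transform between corresponding landmark points in two WSIs. The numpy sketch below (an illustration, not the MicroTracker implementation) recovers a 2D affine transform by least squares and maps coordinates between scans:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine transform (3x2 matrix) mapping src landmarks to dst."""
    A = np.hstack([src, np.ones((len(src), 1))])
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return M

def apply_affine(M, pts):
    """Map an (n, 2) array of points through the fitted transform."""
    return np.hstack([pts, np.ones((len(pts), 1))]) @ M
```

Real sequentially stained sections additionally exhibit local non-rigid deformation, so an affine fit would only be the global initialisation of a full registration pipeline.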

Most of the required functionality has been developed; what remains is the report and analysis functionality.


WP4 - High-resolution scanner

The commercially available scanners are designed for viewing images on a screen, mimicking the pathologist’s experience with light microscopy, as opposed to image analysis and AI. We presented the first WSI in 1996 and prototyped the Hamamatsu Nanozoomer, one of the first scanners on the market. We have recently started to build a high-resolution scanner specified and equipped for image analysis. This will allow us to work with significantly higher resolution and smaller tiles, which is a critical requisite for WP5-WP7.

WP5 - Discovering image regions with prognostic information

The goal here is two-fold. The first is to generate heatmaps from predictions of survival time and selected biomarkers; these attribution maps will serve as input to WP6 for the collection of morphological structures relevant to a prediction task. The second goal is to establish quantitative measurements of heatmap quality and to compare them across several attribution methods.

We intend to compare established explanations such as Integrated Gradients and hybrid LRP with the recently emerged contrastive explanations such as contrastive LRP (Gu et al, ACCV 2019), soft-max gradient LRP (Iwana et al, ICCVW 2019) and relative attribution propagation (Nam et al, AAAI 2020). The question is whether contrastive explanations offer added value in histopathology over conventional ones. This comparison between contrastive and non-contrastive attributions has not yet been done in a quantitative manner.

The evaluation of attribution methods may yield different results in histopathology than on general imaging datasets such as ImageNet or fine-grained classification problems like CUB-2011, because, unlike classes in the latter datasets, cases with low morphological grades in histopathology might not be defined by the presence of certain visual features but by the absence of higher-grade features.

Existing image ablation methods like random patches (Samek et al. TNNLS, 2017) or blur masks (Fong et al, ICCV 2017) may effectively measure attribution map quality relative to out-of-distribution images. Instead, we need a measurement relative to images with fewer features from the current prognostic class. As part of these efforts, we are interested in developing new approaches for comparing and evaluating attribution map quality. We propose to train generators for counterfactuals which are class-aware and therefore do not attempt to create a counterfactual by adding cues for another class. This can be achieved, for example, by GAN inpainting with loss constraints.
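As an example of such a quantitative quality measurement, a region-perturbation ("pixel flipping") curve in the spirit of Samek et al. removes the most relevant patches first and records the score drop; a good attribution map makes the score fall quickly. The toy scorer and mean-value in-filling below are stand-in assumptions:

```python
import numpy as np

def perturbation_curve(img, attribution, score_fn, patch=2, fill=0.0):
    """Remove patches in order of decreasing attribution; record score after each removal."""
    H, W = img.shape
    coords = [(y, x) for y in range(0, H, patch) for x in range(0, W, patch)]
    # rank patches by their summed attribution, most relevant first
    coords.sort(key=lambda c: -attribution[c[0]:c[0] + patch, c[1]:c[1] + patch].sum())
    work = img.astype(float)
    curve = [score_fn(work)]
    for y, x in coords:
        work[y:y + patch, x:x + patch] = fill
        curve.append(score_fn(work))
    return curve
```

The choice of `fill` is exactly the weakness discussed above: a constant or blurred fill pushes the image off-distribution, which motivates the class-aware counterfactual generators we propose instead.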

The finally selected attribution method will provide insight into which scan regions are deemed important for the classification, and this information may be compared to biochemical and cell type information (WP7). It will be used in WP7 to measure overlaps between the attribution maps and selected sets of nuclei, e.g. nuclei types like lymphocytes or fibroblasts, or nuclei selected by an IHC for a certain property or by a model prediction for a certain property obtained with the methods developed in WP2.

The current networks have been trained on tiles (10x and 40x) from within a segmented tumour. We will retrain the outcome classification networks with higher resolution tiles (63x) as well as with images of segmented cell nuclei instead of tiles. If this is successful, we will have an outcome classification network that is able to predict prognosis based on high-resolution images. We will use the same saliency map techniques as described above to discover regions in the nucleus important for the classification network.

A potential extension is the refinement of the digital staining at the level of nuclei resulting from WP2/WP3. This refinement can be approached by intersection with the pixel-level attribution maps obtained in this WP from patch-based classifiers for molecular properties.

WP6 - Image similarity and prognosis relevancy clustering

In the current work package, we aim to identify clusters or image features to better understand how the developed DoMore-v1-CRC marker and the corresponding Inception v3 marker (Skrede et al., Lancet 2020;395:350-360) outperform previously known biomedical features based solely on WSIs. The analysis will be based on quantifying the similarity of image tiles and network embeddings, using the attribution maps from WP5. Using methods based on content-based image retrieval (CBIR), we will identify images that are similar in their prognostic relevance, while the attribution maps will allow us to pinpoint similarities at the level of ROIs in WSIs. Existing metrics, such as the Euclidean distance in feature space and the prognostic score per WSI, have shortcomings: the Euclidean distance in feature space is agnostic to the selected prognostic class of interest, whereas the prognostic score does not allow us to pinpoint ROIs that are similar between pairs of WSIs. We propose to use the attribution maps to obtain channel weights for selected feature spaces of the DoMore-v1-CRC marker. This yields a similarity measure that is class-specific and at the same time spatially resolved. It can be used to identify the regions in a pair of WSIs which contribute most to the similarity, either by aligned matching or by cross-matching. We intend to implement this in a scalable manner using GPUs and neural network toolboxes, in order to efficiently mine similar patches/ROIs on the order of thousands.
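A minimal sketch of the channel-weighted similarity idea: attribution-derived channel weights re-scale two embeddings before a cosine comparison, so that channels relevant to the prognostic class of interest dominate the match. The names and the weighting scheme are illustrative assumptions:

```python
import numpy as np

def weighted_cosine(feat_a, feat_b, channel_weights):
    """Cosine similarity between two embeddings after attribution-based channel weighting."""
    a = feat_a * channel_weights
    b = feat_b * channel_weights
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```

Setting a channel's weight to zero removes it from the comparison entirely, which is how class-irrelevant channels would be suppressed; computed over spatial feature maps rather than pooled vectors, the same weighting becomes spatially resolved.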

The results will then be combined by mapping the combined higher-dimensional representation to a lower-dimensional embedding for clustering, using t-SNE or other similar techniques.

The obtained clustering results can subsequently be quantitatively evaluated by comparison on final prognosis score and will be further analysed in WP7. The clusters will furthermore be visualised directly, as well as being mapped back to the original images in order to allow expert evaluation.

The analysis will be performed for 10x and 40x tiles as well as on the results obtained from the analysis of high-resolution tiles. The evaluation may be performed over all patients and within single patients to gain insight into the similarities and differences between scans and within a single scan. Finally, we will evaluate similarities and differences between individual models of the ensemble networks as well as between networks based on different magnifications, to identify whether the networks obtain different information at different resolutions, similar to human observers.

WP7 - Relating network predictions to biomedical features

In this work package, the biomedical features quantified as a part of WP1 and WP2, and co-registered as a part of WP3, will be compared to network predictions in an attempt to reveal how the networks obtain their prediction of patient outcome and gain novel insight into mechanisms involved in tumour progression. The comparisons will be carried out across all patients in the QUASAR 2 cohort, which was used for validation of the DoMore-v1-CRC marker (Skrede et al., Lancet 2020;395:350-360), and specifically for patients with distinct outcomes (cancer-specific death <2.5 years or survived without recurrence >6 years after surgery).

For the images of single-cell nuclei, the prediction score and classification of a network trained in WP5 can be compared directly to measurements of protein expressions, presence of mitosis, DNA ploidy, chromatin organisation (Nucleotyping), and size and number of nucleoli in the same nucleus, e.g. using a nonparametric correlation like Spearman’s coefficient. We will expand on these basic associations by comparing the network predictions and classifications to pairs and larger tuples of biomedical features, which could reveal that a network is utilising combinations of biomedical features in a non-linear fashion. These comparisons will be made for all images of single nuclei, for only the images where the network strongly indicates good or poor patient outcome, and for only the images which are found to be similar in terms of image similarity or prognostic relevance in WP6.
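For the nonparametric correlation, a tie-free Spearman coefficient can be computed as the Pearson correlation of ranks. This sketch assumes no tied values; real analyses would use average ranks for ties, e.g. via `scipy.stats.spearmanr`:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation, assuming no ties in x or y."""
    rx = np.argsort(np.argsort(x)).astype(float)  # rank of each value
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

Because it operates on ranks, the coefficient captures any monotone relation between a network score and a biomedical feature, not just a linear one.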

For the 10x and 40x tiles used by the DoMore-v1-CRC and the corresponding Inception v3 marker, as well as the 63x tiles used in WP5 to train similar markers, we will compare the average saliency value of pixels within automatically segmented cell nuclei to their biomedical features, in a similar manner as for the images of single-cell nuclei. Additionally, we will characterise the entire tiles where a network strongly indicates good or poor patient outcome, as well as entire tiles that are found to be similar in terms of image similarity or prognostic relevance in WP6. In both cases, we will focus on regions within these tiles where the saliency value for either good or poor patient outcome is high. This characterisation will include a description of the distribution of biomedical features for cell nuclei in the tile or region, as well as biomedical features not specific to single-cell nuclei, e.g. stroma fraction and pathological evaluations such as tumour differentiation.
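The per-nucleus comparison could be set up as below: averaging the saliency map over each labelled nucleus in a segmentation mask. This is a sketch under the assumption that the nucleus segmentation is available as an integer label image (0 = background):

```python
import numpy as np

def mean_saliency_per_nucleus(saliency, label_img):
    """Average saliency inside each segmented nucleus (labels > 0; 0 is background)."""
    means = {}
    for lab in np.unique(label_img):
        if lab == 0:
            continue
        means[int(lab)] = float(saliency[label_img == lab].mean())
    return means
```

The resulting per-nucleus values are what would then be correlated with the co-registered biomedical measurements (ploidy, Nucleotyping, protein expression) for the same nuclei.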

The insight derived from analyses of various networks will be compared in order to investigate which biomedical features are utilised similarly and differently by various networks. One aspect of this would be to study how different resolutions and physical extents influence the biomedical features exploited by the networks. We will then attempt to combine the findings to form an overarching impression of how CNNs are able to accurately predict patient outcome directly from histopathology images, which we anticipate will reveal novel insight about mechanisms for tumour progression.

This text was last modified: 17.09.2021


Chief Editor: Prof. Håvard E. Danielsen
Copyright Oslo University Hospital. Visiting address: The Norwegian Radium Hospital, Ullernchausséen 64, Oslo. Tel: 22 78 23 20