Current pathology workflow involves staining of thin tissue slices, which would otherwise be transparent, followed by manual investigation under the microscope by a trained pathologist. While the hematoxylin and eosin (H&E) stain is a well-established and cost-effective method for visualizing histology slides, its color variability across preparations and its subjective interpretation across clinicians remain unaddressed challenges. To mitigate these challenges, we recently demonstrated that spatial light interference microscopy (SLIM) can provide a path to intrinsic, objective markers that are independent of preparation and human bias. Additionally, the sensitivity of SLIM to collagen fibers yields information relevant to patient outcome that is not available in H&E. Here, we show that deep learning and SLIM can form a powerful combination for screening applications: training on 1660 SLIM images of colon glands and validating on 144 glands, we obtained an area under the receiver operating characteristic curve of 0.98 on the validation dataset and 0.99 on the test dataset, with an overall benign vs. cancer classification accuracy of 97%. We envision that the SLIM whole slide scanner presented here, paired with artificial intelligence algorithms, may prove valuable as a pre-screening method, economizing the clinician's time and effort.
Quantitative phase imaging (QPI)1 has emerged as a powerful label-free method for biomedical applications.2 More recently, due to its high sensitivity to tissue nanoarchitecture and quantitative output, QPI has been proven valuable in pathology.3,4 Combining spatial light interference microscopy (SLIM)5,6 and dedicated software for whole slide imaging (WSI) allowed us to demonstrate the value of the tissue refractive index as an intrinsic marker for diagnosis and prognosis.7–14 So far, we have used various metrics derived from the QPI map to obtain clinically relevant information. For example, we found that translating the data into tissue scattering coefficients can be used to predict disease recurrence after prostatectomy. SLIM’s sensitivity to collagen fibers proved useful in the diagnosis and prognosis of breast cancer.12,14,15 While this approach of “feature engineering” has the advantage of providing physical significance to the computed markers, it only covers a limited range of parameters available from our data. In other words, it is likely that certain useful parameters are never computed at all. This restricted analysis is likely to limit the ultimate performance of our procedure.
Recently, artificial intelligence (AI) has received significant scientific interest from the biomedical community.16–20 In image processing, AI provides an exciting opportunity for boosting the amount of information extracted from a given set of data with high throughput.20 In contrast to feature engineering, a deep convolutional neural network computes an exhaustive number of features associated with an image, which is bound to improve the performance of the task at hand. Here, we apply, for the first time to our knowledge, SLIM and AI to classify colorectal tissue as cancerous or benign.
Genetic mutations over the course of 5–10 yr lead to the development of colorectal cancer from benign adenomatous polyps.21 Early diagnosis reduces disease-specific mortality: cancers diagnosed early, while still localized, have an 89.8% 5-yr survival rate, compared to a 12.9% 5-yr survival rate for patients with distant metastasis or late-stage disease.22 Colonoscopy is the preferred form of screening in the U.S. From 2002 to 2010, the percentage of persons in the age group of 50–75 yr who underwent colorectal cancer screening increased from 54% to 65%.23 Of all individuals undergoing a colonoscopy, the prevalence of adenoma is 25%–27%, and the prevalence of high-grade dysplasia and colorectal cancer is 1%–3.3%.24,25 As current screening methods cannot distinguish adenoma from a benign polyp with high accuracy, a biopsy or polyp removal is performed in 50% of all colonoscopies.26 A pathologist examines the excised polyps to determine whether the tissue is benign, dysplastic, or cancerous.
New technologies for quantitative and automated tissue investigation are necessary to reduce the dependence on manual examination and provide large-scale screening strategies. As a successful precedent, the Papanicolaou test (Pap smear) for cervical cancer screening has been augmented by the benefits of computational screening tools.27 The staining procedure, which is critical to the proper operation of such systems, is designed to match calibration thresholds.28
We used a SLIM-based tissue scanner in combination with AI to classify cancerous and benign cases. We demonstrate the clinical value of the new method by performing automatic colon screening using intrinsic tissue markers. Importantly, such a measurement requires neither staining nor calibration. Therefore, in contrast to current staining markers, signatures developed from the phase information can be shared across laboratories and instruments without modification.
RESULTS AND METHODS
SLIM whole slide scanner
Our label-free SLIM scanner, consisting of dedicated hardware and software, is described in more detail in Ref. 29. Figure 1 illustrates the SLIM module (Cell Vista SLIM Pro, Phi Optics, Inc.), which outfits an existing phase contrast microscope. In essence, SLIM works by making the ring in the phase contrast objective pupil tunable. To achieve this, the image output by the phase contrast microscope is Fourier transformed at the plane of a spatial light modulator (SLM), which produces pure phase modulation. At this plane, the image of the phase contrast ring is matched precisely to the SLM phase mask, which is shifted in increments of 90° (Fig. 1). From the four intensity images that correspond to the ring phase shifts, the quantitative phase image is retrieved uniquely at each point in the field of view. Figure 2 shows examples of SLIM images associated with tissue cores and glands for cancer and normal colon cases.
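The retrieval step can be illustrated with the standard four-step phase-shifting formula. The NumPy sketch below is a simplified version: the full SLIM reconstruction also accounts for the amplitude ratio between the scattered and unscattered fields, which this illustration omits.

```python
import numpy as np

def retrieve_phase(i0, i90, i180, i270):
    """Simplified four-step phase-shifting retrieval.

    i0..i270 are the intensity images recorded with the SLM ring shifted
    by 0, 90, 180, and 270 degrees.  For I_k = A + B*cos(phi + d_k), the
    differences isolate 2B*sin(phi) and 2B*cos(phi), so the phase follows
    from a single arctangent at each pixel.
    """
    return np.arctan2(i270 - i90, i0 - i180)
```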
The SLIM tissue scanner can acquire the four intensity images, process them, and display the phase image, all in real time. This is possible due to the novel acquisition software that seamlessly combines central processing unit (CPU) and graphical processing unit (GPU) processing.29 The SLIM phase retrieval computation occurs on a separate thread while the microscope stage moves to the next position. Scanning large fields of view, e.g., entire microscope slides, and assembling the resulting images into single files required the development of new dedicated software tools.29 The final SLIM images (Fig. 2) are displayed in real time at up to 15 frames/s, limited by the spatial light modulator refresh rate, which must run 4× faster because four modulator frames are required per phase image.
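The pipelining described above can be sketched as a producer/consumer pattern, with a placeholder average standing in for the actual phase-retrieval kernel (the real software uses CPU/GPU processing; this Python sketch only illustrates the threading structure):

```python
import queue
import threading

frame_queue = queue.Queue()
results = []

def process_frames():
    # Worker thread: retrieves phase while the "stage" thread keeps moving.
    while True:
        frames = frame_queue.get()
        if frames is None:  # sentinel: acquisition finished
            break
        # Placeholder for the real phase-retrieval computation
        results.append(sum(frames) / len(frames))

worker = threading.Thread(target=process_frames)
worker.start()
for position in range(3):  # "acquire" four frames at each stage position
    frame_queue.put([position, position + 1, position + 2, position + 3])
frame_queue.put(None)
worker.join()
```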
Deep learning model
Our dataset is based on a total of 131 patients who underwent colon resection for treatment of colon cancer at the University of Illinois at Chicago (UIC) from 1993 to 1999. For each patient, tissue cores of 0.6 mm diameter corresponding to tumor, normal mucosa, dysplastic mucosa, and hyperplastic mucosa were retrieved. The tissue cores comprise primary colon cancer (127 patients) and mucosa of normal (131 patients), dysplastic (33 patients), and hyperplastic (86 patients) colon. The cores were then transferred into a high-density array for imaging.
Two 4-µm thick sections were cut from each of the four tissues mentioned above. The first section was deparaffinized, stained with hematoxylin and eosin (H&E), and imaged using the NanoZoomer (bright-field slide scanner, Hamamatsu Corporation). A pathologist made a diagnosis for all tissue cores in the TMA set, which was used as the "ground truth" in our training. A second, adjacent section was prepared in a similar way, but without the staining step. This slide was then sent to our laboratory for imaging. These studies followed the protocols outlined in the procedures approved by the Institutional Review Board at the University of Illinois at Urbana-Champaign (IRB Protocol No. 13900). Prior to imaging, the tissue slices were deparaffinized and cover slipped with an aqueous mounting medium. The tissue microarray image was assembled from mosaic tiles acquired using a conventional microscope (Zeiss Axio Observer, 40×/0.75, AxioCam MRm 1.4 MP CCD). Overlapping tiles were composited on a per-core basis using ImageJ's stitching functionality. For comparison with other approaches, an updated version of this instrument (Cell Vista SLIM Pro, Phi Optics, Inc.) can acquire a core within four seconds: a 1.2 × 1.2 mm² region consisting of 4 × 4 mosaic tiles can be acquired at 0.4 µm resolution. In this estimate, we allow 100 ms for stage motion, 30 ms for SLM stabilization, and 10 ms for exposure per frame. The resulting large image file was then cropped into 176 images of 10 000 × 10 000 pixels, corresponding to a 1587.3 × 1587.3 µm² field of view.
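As a sanity check on the quoted acquisition time, a straightforward serial estimate using these per-step times gives roughly four seconds per core (in the real system, computation overlaps with stage motion, so this is an upper-bound sketch):

```python
# Per-step times quoted in the text (seconds)
stage_move = 0.100  # stage motion per tile
slm_settle = 0.030  # SLM stabilization per frame
exposure = 0.010    # camera exposure per frame

frames_per_tile = 4  # SLIM needs four phase-shifted frames per tile
tiles = 4 * 4        # a 1.2 x 1.2 mm^2 core is covered by 4 x 4 tiles

per_tile = stage_move + frames_per_tile * (slm_settle + exposure)
total = tiles * per_tile
print(f"{per_tile:.2f} s per tile, {total:.2f} s per core")
```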
We performed manual segmentation on the cropped images using ImageJ, and each segmented image was classified as either "cancer" or "normal." This manual segmentation resulted in a total of 1844 grayscale colon gland images, of which 922 were classified as cancer glands and the remaining 922 as normal glands. These 1844 colon gland images were split into a training dataset (1660 images), a validation dataset (144 images), and a "hidden" test dataset (40 images). The split was stratified to guarantee that each of the three datasets (train, validation, and test) has a balanced number of cancer and normal images.
We used a transfer learning approach, built on pretrained convolutional networks (convnets), to construct our deep learning classifier. This approach is especially useful when there is a limited amount of data to train a model. Among the long list of pretrained models, such as ResNet, Inception, Inception-ResNet, Xception, and MobileNet, we selected the VGG16 network, trained on the large ImageNet dataset (over 1.6 M images of various sizes and 1000 classes), due to its rich feature extraction capabilities (see the network in Fig. 3 and Ref. 30). VGG16 has 138 M parameters, occupying over 528 MB of storage, in only 16 layers. We reuse VGG16's parameters in the convolutional layers, the first five blocks of the network, to extract the rich features hidden within each gland image. The first two fully connected layers have 2048 units (neurons) each, and the third layer contains 256 units. Each of these fully connected layers is followed by a ReLU nonlinear activation function. A dropout layer (after the first fully connected layer) with a rate of 0.5 was used as a regularizer to reduce network overfitting. The final sigmoid layer, which follows the 256-unit fully connected layer, predicts the probability of an input image being cancerous or normal. The network is trained by feeding the predicted probability into a binary cross-entropy loss function and minimizing it. Once the loss function reaches a plateau, a second and final training step follows, in which all the parameters in the convolutional layers are unfrozen. We used an early stopping criterion, defined as the training step where the validation loss reaches its minimum, to determine when to end the training. VGG16 takes color (RGB) images as input, but the gland images are all grayscale. To overcome this limitation, we generated a three-channel input by replicating each image across three identical channels, thus mimicking an RGB image. We examined several hyperparameters during the training of our network; the selection of fully connected layers, activation function, and dropout described above yielded the best validation F1 score. Once the network performance on the validation data was deemed acceptable, we used the hidden test dataset to report the final performance.
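The architecture described above can be sketched in Keras as follows. The layer sizes (2048, 2048, 256), dropout rate, and sigmoid output follow the text; the input size and optimizer are illustrative assumptions, and `weights=None` is used here only so the sketch runs without downloading the pretrained weights (in practice, `weights="imagenet"` loads the pretrained convolutional blocks).

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_classifier(input_shape=(224, 224, 3), pretrained=False):
    # The five convolutional blocks of VGG16, frozen for the first training step
    base = VGG16(weights="imagenet" if pretrained else None,
                 include_top=False, input_shape=input_shape)
    base.trainable = False
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(2048, activation="relu"),
        layers.Dropout(0.5),                    # regularizer after the first FC layer
        layers.Dense(2048, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # P(input image is cancerous)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

def to_rgb(gray):
    """Replicate a stack of grayscale images across three channels to mimic RGB."""
    return np.repeat(gray[..., np.newaxis], 3, axis=-1)
```

For the second, fine-tuning step, `base.trainable` would be set to `True` and the model recompiled (typically with a smaller learning rate) before resuming training.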
Model accuracy and loss
Model accuracy and losses are shown in Fig. 4. The shape of the loss curves is a good proxy for distinguishing "underfitting," "overfitting," and "just-right" models. A deep learning model underfits when it does not make efficient use of the training data; in this case, the training loss curve plateaus at a nonzero value beyond a certain epoch. Similarly, a model overfits the training data when the training loss keeps decreasing while the validation loss stalls and then starts to increase. Both underfitting and overfitting are signs of poor generalizability. In a just-right model, on the other hand, the training and validation losses follow each other closely and converge toward zero or very small values. We stopped the training at epoch 60 (training by early stopping), where the network was no longer able to generalize, i.e., the validation loss had started to increase. The early stopping criterion is implemented by saving the trained weights of the network at the point of lowest validation loss over the entire training cycle. In our training, the "best" model was saved at epoch 36, where the validation accuracy reached its highest value of 0.98.
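The early stopping logic described above amounts to tracking the per-epoch validation loss and restoring the weights saved at its minimum. A minimal sketch (the patience value is an illustrative assumption, not taken from the text):

```python
def early_stop_epoch(val_losses, patience=10):
    """Return the epoch whose saved weights should be restored: the one
    with the lowest validation loss, stopping the scan once the loss has
    failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # validation loss has stopped improving
    return best_epoch
```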
ROC, AUC, and classification reports for validation and test
Receiver operating characteristic (ROC) curves and area under the curve (AUC) scores are the two metrics used to report the performance of our network on the validation and test datasets (Fig. 5). The ROC curve displays the performance of our deep learning classifier at various decision thresholds, plotting the true positive rate along the y-axis against the false positive rate along the x-axis. Figures 5(a) and 5(c) show the ROC curves for the validation and test sets, respectively. The AUC is 0.98 on the validation set and 0.99 on the test set, as also shown in Figs. 5(b) and 5(d). The accuracy for both the validation and test datasets was 97%.
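These metrics can be computed with scikit-learn, as sketched below on illustrative labels and scores (not the actual study data):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative ground-truth labels (0 = normal, 1 = cancer) and
# classifier sigmoid outputs; not the actual study data.
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.3, 0.6, 0.4, 0.8, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # ROC x- and y-axis values
auc = roc_auc_score(y_true, y_score)               # area under that curve
```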
Confusion matrix for validation and test
The confusion matrix provides a quantitative measure of the performance of the binary classifier. There are two classes in our confusion matrix: "normal" and "cancer" glands. A confusion matrix exhibits two types of errors: a type I error is a false positive, where "normal" is classified as "cancer"; a type II error is a false negative, where "cancer" is classified as "normal." The confusion matrix of a perfect classifier is diagonal, containing only true negatives and true positives.
The confusion matrices for the validation and test datasets are shown in Figs. 6(a) and 6(b), respectively. In the first row of Fig. 6(a), 69 of the 72 "normal" instances are correctly predicted (true negatives), and 3 of the 72 are wrongly predicted as "cancer" (false positives, type I error). In the second row of Fig. 6(a), 71 of the 72 "cancer" instances are correctly predicted as "cancer" (true positives), and 1 of the 72 is wrongly predicted as "normal" (a false negative, type II error). In the first row of Fig. 6(b), all 20 "normal" instances are correctly predicted as "normal," and none is wrongly predicted as "cancer." In the second row of Fig. 6(b), 19 of the 20 instances are correctly predicted as "cancer," and only 1 of the 20 is wrongly predicted as "normal" (a false negative, type II error).
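For illustration, the validation-set matrix of Fig. 6(a) can be reconstructed from the counts above using scikit-learn (rows are true classes, columns are predicted classes):

```python
from sklearn.metrics import confusion_matrix

# Validation counts quoted above: 72 normal glands (69 correct, 3 false
# positives) and 72 cancer glands (71 correct, 1 false negative).
y_true = [0] * 72 + [1] * 72                      # 0 = normal, 1 = cancer
y_pred = [0] * 69 + [1] * 3 + [0] * 1 + [1] * 71  # predicted labels

cm = confusion_matrix(y_true, y_pred)
```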
SUMMARY AND DISCUSSION
In summary, we showed that applying AI (deep transfer learning) to SLIM images yields excellent performance in classifying cancerous and benign tissue. The 97% classification accuracy, together with the 0.98 (validation dataset) and 0.99 (test dataset) areas under the ROC curve, suggests that this approach may prove valuable, especially for screening applications. The SLIM module can be added to existing microscopes already in use in pathology laboratories around the world. Thus, it is likely that this new tool can be easily adopted at a large scale as a prescreening tool, enabling the pathologist to screen through cases quickly. This approach can be applied to more difficult tasks in the future, such as quantifying the aggressiveness of the disease,12 and can be used for other types of cancer, with proper optimization of the network.
It has been shown in a different context that the inference step can be implemented into the SLIM acquisition software.31 Because the inference is faster than the acquisition time of a SLIM frame and can also be performed in parallel, we anticipate that the classification can be performed in real time. The overall throughput of the SLIM tissue scanner is comparable with that of commercial whole slide scanners that only perform bright field imaging on stained tissue sections.29 In principle, it is possible to have the result of classification, with areas of interest highlighted for the clinician, all done as soon as the scan is complete, in a couple of minutes. In the next phase of this project, we plan to work with clinicians to further assess the performance of our classifier against experts.
The data that support the findings of this study are available from the corresponding author upon reasonable request.
We are grateful to Mikhail Kandel, Shamira Sridharan, and Andre Balla for imaging, annotating, and diagnosing the tissues used in this study. This work was funded by the NSF (Grant No. 0939511) and the NIH (Grant Nos. R01 GM129709, R01 CA238191, and R43GM133280-01).