In this paper, we investigate two deep learning approaches to recovering initial temperature profiles from thermographic images in non-destructive material testing. First, we trained a deep neural network (DNN) in an end-to-end fashion by directly feeding the surface temperature measurements to the DNN. Second, we turned the surface temperature measurements into virtual waves (a recently developed concept in thermography), which we then fed to the DNN. To demonstrate the effectiveness of these methods, we implemented a data generator and created a dataset comprising a total of 100 000 simulated temperature measurement images. With the objective of determining a suitable baseline, we investigated several state-of-the-art model-based reconstruction methods, including Abel transformation, curvelet denoising, and time- and frequency-domain synthetic aperture focusing techniques. Additionally, a physical phantom was created to support evaluation on completely unseen real-world data. The results of several experiments suggest that both the end-to-end and the hybrid approach outperformed the baseline in terms of reconstruction accuracy. The end-to-end approach required the least amount of domain knowledge and was the most computationally efficient one. The hybrid approach required extensive domain knowledge and was more computationally expensive than the end-to-end approach. However, the virtual waves served as meaningful features that convert the complex task of the end-to-end reconstruction into a less demanding undertaking. This in turn yielded better reconstructions with the same number of training samples compared to the end-to-end approach. Additionally, it allowed more compact network architectures and use of prior knowledge, such as sparsity and non-negativity. The proposed method is suitable for non-destructive testing (NDT) in 2D where the amplitudes along the objects are considered to be constant (e.g., for metallic wires). 
To encourage the development of other deep-learning-based reconstruction techniques, we release both the synthetic and the real-world datasets along with the implementation of the deep learning methods to the research community.

## I. INTRODUCTION

Analysis of structural imperfections of materials, spare parts, or components of a system is important in preventing the malfunction of devices. This can be achieved by active thermography, which has—in addition to non-destructive testing—several other applications, such as structural health monitoring,^{1} material characterization,^{2,3} and thermal imaging in medicine.^{4–6,45} In thermographic imaging, the specimen is heated by flashlamps, lasers, etc., and the corresponding temperature evolution is then measured on the surface. The resulting thermal pattern is used to reconstruct the heat distribution inside the material, which provides the main information for defect detection.

In recent decades, thermographic data evaluation was dominated by 1D methods, which become inaccurate, for example, when applied to anisotropic heat flow. For more accurate thermographic data evaluation, multidimensional heat flow must be considered. Some research groups have focused on solving multidimensional inverse heat conduction problems (IHCPs) using various thermal stimulation methods.^{7,8} The virtual wave concept^{9} is another approach that considers multidimensional heat flow without directly solving an IHCP. Since it does not add information for detecting defects, for a single one-dimensional (1D) reconstruction the spatial resolution does not improve compared to direct inversion of the heat diffusion equation. The advantage of the virtual wave concept in 1D is that more advanced regularization techniques that incorporate *a priori* information, such as sparsity or positivity, can be utilized^{10,11} in the reconstruction process. The main benefit in 2D or 3D is that for the calculated virtual waves, any conventional ultrasound reconstruction method, such as the synthetic aperture focusing technique (SAFT),^{12,13} can be used in a second step to reconstruct the defects.^{9,14} The signal-to-noise ratio (SNR) is significantly enhanced because the lateral heat flow perpendicular to the surface is also taken into account in the reconstructions. This is similar to averaging numerous 1D measurements when imaging a layered structure, but with the essential advantage that the virtual wave concept can be used for any 2D or 3D structure to be imaged. Here, for the first time, after the virtual waves have been calculated, acoustic reconstruction as the second step is performed by deep learning, which shows additional advantages: artifacts caused by limited view or by discretization are suppressed, which results in more accurate reconstruction.

Thermal reconstruction can be modeled mathematically by the heat diffusion equation. Since heat diffusion is an irreversible process, part of the information is inevitably lost, which implies the ill-posed nature of the problem.^{15} In this case, maximum likelihood solutions typically provide numerically unstable results, and regularization is needed to avoid over-fitting of the data.^{16} This can be achieved by direct methods, variational image reconstruction, and deep learning. The first approach is based on the approximate inversion of the problem and is fast because it involves only a few matrix-vector multiplications. However, this comes at the price of reconstruction accuracy, which is low for noisy measurements. The variational approach provides a good compromise between over- and under-fitting of the data. In this case, *a priori* information, such as smoothness, sparsity, group sparsity, and non-negativity, can be incorporated into the model to penalize unfeasible solutions. Although these algorithms are more robust against noise, the penalty function must satisfy certain conditions, such as convexity, which limits the eligible prior knowledge. Another drawback is that the considerably higher computational complexity usually prevents real-time imaging. Furthermore, in many applications, it is not obvious how a suitable regularization parameter is to be chosen.

Deep learning provides a good alternative to direct and variational methods mainly for two reasons. First, the main computational load of these algorithms is shifted to the offline training phase. This permits real-time imaging in the testing phase, even with embedded hardware. Second, since *a priori* information is implicitly encoded during the training phase, it need not be identified and explicitly incorporated into the method. Most successful industrial applications of deep learning have used it in a supervised manner. This approach, however, requires large amounts of labeled data, which is not easy to obtain in thermography, as it requires, for instance, production of physical phantoms with varying material properties, including defects with various positions, sizes, and shapes. We tackled this problem by using synthetic data to train a deep neural network (DNN), which was then applied to real measurement data in the testing phase. In this paper, we present two training strategies. The first is carried out in an end-to-end fashion, which means that the surface temperature data [see, e.g., Figs. 1(a) and 1(b)] is fed directly to the network used to predict the initial temperature profile inside the material [Fig. 1(d)]. The second approach uses the virtual wave concept^{9} as a feature extraction step [Fig. 1(c)] before the neural network is applied. This hybrid solution incorporates domain knowledge via virtual waves and automatically explores the internal structure of the data, such as sparsity.

We demonstrate empirically that this pre-processing method removes irrelevant information from the surface temperature data. Compared to the end-to-end approach, we achieved the same performance with either fewer training samples or a more compact architecture with fewer trainable parameters. We extended our recent work^{17} on this topic by investigating the end-to-end approach and various ways of extracting virtual waves and by testing the proposed methods on both synthetic and real-world measurement data. For the latter, we performed a qualitative analysis of the reconstructed temperature profiles and compared them to those of other model-based approaches. Furthermore, we provide a quantitative assessment of the baseline, hybrid, and end-to-end approaches in 2D thermographic reconstruction.

The paper is organized as follows. In Sec. II, we review the two-stage reconstruction process and investigate various model-based methods that can be applied in each step. The data, neural network architectures, and training regime are presented in Secs. III A–III C, respectively. We then describe the end-to-end and the hybrid deep learning approach in Secs. III C 1 and III C 2, where we also investigate the effect of various training data sizes. The results of the experiments on unseen synthetic and real-world data are presented in Secs. IV and V, respectively. This is followed by a discussion of the results and of the validity of the proposed methods in Sec. VI. Finally, Sec. VII concludes the paper and outlines future research directions.

## II. MODEL-BASED APPROACH USING TWO-STAGE RECONSTRUCTION

Thermal reconstruction can be modeled mathematically by the heat diffusion equation,
$$\left(\nabla^2 - \frac{1}{\alpha}\frac{\partial}{\partial t}\right) T(\mathbf{r},t) = -\frac{1}{\alpha}\, T_0(\mathbf{r})\,\delta(t), \tag{1}$$
where $\alpha$ stands for the thermal diffusivity, $T$ is the temperature as a function of space $\mathbf{r}$ and time $t$, and $T_0$ denotes the initial temperature profile at $t=0$, which is given by the temporal Dirac delta function $\delta$ on the right-hand side. Except for some special cases, such as adiabatic boundary conditions (see Sec. III A), there is no exact solution to Eq. (1), and numerical approximations must be applied. Therefore, rather than solving Eq. (1) directly, we consider the problem of undamped acoustic wave propagation described by the wave equation,
$$\left(\nabla^2 - \frac{1}{c^2}\frac{\partial^2}{\partial t^2}\right) p(\mathbf{r},t) = -\frac{1}{c^2}\, p_0(\mathbf{r})\,\frac{\mathrm{d}\delta(t)}{\mathrm{d}t}, \tag{2}$$
where $c$ is the speed of sound, $p$ describes the acoustic pressure in the medium under investigation, and $p_0$ is the initial pressure distribution at $t=0$ just after the Dirac-like excitation impulse. According to Burgholzer *et al.*,^{9} the thermographic reconstruction in Eq. (1) can be converted to an ultrasound reconstruction problem by substituting $p_0 \propto T_0$ in Eq. (2). The solution $p \propto T_{\mathrm{virt}}$ to this equation is called the virtual wave, and its relation to the original temperature distribution $T$ can be formulated as a Fredholm integral equation of the first kind,
$$T(\mathbf{r},t) = \int_{0}^{\infty} K(t,t')\, T_{\mathrm{virt}}(\mathbf{r},t')\, \mathrm{d}t', \tag{3}$$
where the local transformation kernel $K$ can be written as
$$K(t,t') = \frac{c}{\sqrt{\pi \alpha t}}\, \exp\!\left(-\frac{c^2 t'^2}{4 \alpha t}\right). \tag{4}$$
A two-stage reconstruction process can now be defined in which the first step is to solve Eq. (3), followed by an ultrasonic evaluation method for reconstructing $T0$ from $Tvirt$ in the second step.

In this paper, we consider 2D reconstruction problems (see, e.g., Fig. 1), where the temperature evolution $T(\mathbf{r},t)=T(y,z,t)$ in time $t$ is measured on the surface $z=0$ by an infrared camera at different locations along the $y$ axis. The corresponding measurement data form a 2D image that is stored in a matrix $T \in \mathbb{R}^{N_t \times N_y}$ for estimating the initial temperature profile $T_0 \in \mathbb{R}^{N_z \times N_y}$, where $N_t, N_y,$ and $N_z$ denote the number of discretization points along the temporal, lateral, and depth dimensions, respectively. The discrete analog of the two-stage reconstruction process can be defined by the following regularized linear inverse problems:
$$\tilde{v} = \arg\min_{v} \; \Vert \mathbf{K} v - \mathbf{d} \Vert_2^2 + \lambda\, \Omega(v), \tag{5}$$

$$\tilde{u} = \arg\min_{u} \; \Vert \mathbf{M} u - \tilde{v} \Vert_2^2 + \mu\, \Omega(u), \tag{6}$$
where $\Omega(\cdot)$ stands for the penalty function and $\lambda, \mu \geq 0$ are regularization parameters. Here, the bold-face notation indicates that we are working in the dimension of the vectorized 2D measurements, that is, $\mathbf{d}=\mathrm{vec}(T) \in \mathbb{R}^{N_t N_y \times 1}$ and $\mathbf{K} \in \mathbb{R}^{N_t N_y \times N_t N_y}$. $\mathbf{K}=\mathrm{diag}(K,K,\dots,K)$ is a block-diagonal matrix, where the kernel in Eq. (4) is evaluated at discrete time instances, which means that $K_{i,j}=K(t_i,t_j)$ for $i,j=0,\dots,N_t-1$. The virtual wave vector $\tilde{v} \approx \mathrm{vec}(T_{\mathrm{virt}})$ represents the approximated solution to Eq. (2), with $p_0 \propto T_0$ and an arbitrarily chosen dimensionless virtual wave speed $c$ that we set to $1$. In Eq. (6), $\mathbf{M} \in \mathbb{R}^{N_t N_y \times N_z N_y}$ represents the matrix form of conventional ultrasound reconstruction methods, such as the SAFT techniques, which are used to estimate the vectorized initial temperature profile $\tilde{u} \approx \mathrm{vec}(T_0) \in \mathbb{R}^{N_z N_y \times 1}$.
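As an illustration, the discrete kernel and its block-diagonal extension can be assembled as follows. This is a sketch, not the authors' implementation: the analytic kernel expression follows the virtual wave literature cited above, and the time sampling (starting the measurement grid at $\Delta t$ to avoid the $t=0$ singularity) is an assumption of this example.

```python
import numpy as np
from scipy.linalg import block_diag

def local_kernel(Nt, dt, alpha, c=1.0):
    """Discrete kernel K[i, j] = K(t_i, t'_j).

    The analytic form below follows the virtual wave literature; starting
    the measurement-time grid at dt (not 0) is an assumption of this sketch.
    """
    t = (np.arange(Nt) + 1) * dt        # measurement times t_i > 0
    tp = np.arange(Nt) * dt             # virtual times t'_j
    T, TP = np.meshgrid(t, tp, indexing="ij")
    return c / np.sqrt(np.pi * alpha * T) * np.exp(-((c * TP) ** 2) / (4 * alpha * T))

def block_kernel(K, Ny):
    # block-diagonal extension acting on the vectorized 2D measurement d = vec(T)
    return block_diag(*([K] * Ny))
```

One column of the measurement matrix is then mapped independently of the others, which is exactly what the block-diagonal structure expresses.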

The solutions to Eqs. (5) and (6) depend on many factors, such as the numerical solver, the penalty function, the optimization constraints, and the regularization parameters. In Secs. II A and II B, we investigate these aspects in order to find the most suitable method for the two-stage thermographic reconstruction process in 2D.

### A. First stage: Virtual wave reconstruction

In the first stage, the virtual waves $\tilde{v}$ are to be estimated via Eq. (5), which is derived from the discretization of the Fredholm integral equation in Eq. (3). This leads to a discrete ill-posed inverse problem.^{18,19} Regularization including side constraints, such as sparsity and non-negativity, is, therefore, inevitable in order to obtain feasible solutions.

Since the virtual waves represent pressure signals, they can have both positive and negative values. We thus need to find an invertible linear transformation that maps $\tilde{v}$ onto a space where non-negativity constraints apply. According to Thummerer *et al.*,^{10} this can be done by spherical projections that correspond to time integration in 3D and to Abel transformation in 2D. Let us denote these transformations by $R$, and define $\hat{\mathbf{K}}=\mathbf{K}R^{-1}$. The non-negative and sparse variation of Eq. (5) can then be written in the form,
$$\hat{v} = \arg\min_{v \geq 0} \; \Vert \hat{\mathbf{K}} v - \mathbf{d} \Vert_2^2 + \lambda\, \Vert v \Vert_1. \tag{7}$$
Once the minimizer $\hat{v}$ has been found, the original virtual waves $\mathrm{vec}(T_{\mathrm{virt}})$ can be approximated by $R^{-1}\hat{v}$.

Another way to sparsify the virtual wave vector is to use multiscale and multidirectional representations, such as curvelets. Unlike wavelets, this transformation applies a polar tiling pattern in the frequency domain, where each wedge represents the frequency support of the corresponding curvelet element. In the spatial domain, the basis functions look like elongated ridges, which are smooth in one direction and oscillatory in the other. Due to their similar structure, acoustic waves become sparse in the curvelet domain, which is illustrated in Fig. 2.

Motivated by their usefulness in seismology,^{20} we utilize curvelets to recover the virtual wave field as follows:
$$\tilde{v} = \arg\min_{v} \; \Vert \mathbf{K} v - \mathbf{d} \Vert_2^2 + \lambda\, \Vert C v \Vert_1. \tag{8}$$
Here, the curvelet transformation matrix $C$ is not explicitly formed; instead, we apply the fast discrete curvelet transformation (FDCT),^{21} which requires $O(n^2 \log n)$ flops for images of size $n \times n$. Therefore, this approach is also suitable for large-scale problems if the virtual waves are reconstructed from high-resolution thermal images.

We analyzed the performance of the virtual wave reconstruction methods utilizing Abel and curvelet transformation. To this end, we used a test set from our previous work,^{17} which consists of $1000$ samples in ten versions for various SNRs, thus comprising 10 000 samples in total. These data simulate thermally isolated specimens in 2D and can be modeled by assuming adiabatic boundary conditions in Eq. (1). Both the virtual waves $T_{\mathrm{virt}}$ and the initial temperature distribution $T_0$ can thus be calculated analytically. For instance, the thermal and virtual wave images in Fig. 1 were simulated this way. Based on recent results,^{10} we used the alternating direction method of multipliers (ADMM)^{22} in conjunction with the L-curve^{23} method to solve the corresponding sparse approximation problems and to estimate $\lambda$ in Eqs. (7) and (8), respectively. The reconstruction error between the ground truth $T_{\mathrm{virt}}$ and the reconstructed virtual waves (i.e., $\mathrm{vec}(\tilde{T}_{\mathrm{virt}})=R^{-1}\hat{v}$ for the Abel and $\mathrm{vec}(\tilde{T}_{\mathrm{virt}})=\tilde{v}$ for the curvelet transformation) was measured in terms of the mean squared error (MSE),
$$\mathrm{MSE} = \frac{1}{N_t N_y}\, \big\Vert T_{\mathrm{virt}} - \tilde{T}_{\mathrm{virt}} \big\Vert_F^2. \tag{9}$$
Figure 3(a) shows the MSEs averaged over 1000 sample images for each method and SNR. Except in the worst-case scenario with $-20\,\mathrm{dB}$ SNR, the sparse virtual wave reconstruction with Abel transformation outperformed the curvelet-based approach. Therefore, we conclude that in this case, the non-negativity is a much tighter constraint than the higher level of sparsity assumed in the curvelet domain.
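The non-negative sparse recovery of Eq. (7) can be sketched with a generic ADMM solver. This is a simplified stand-in, not the implementation used in the paper: it works on an explicit small matrix, uses a fixed penalty parameter `rho` and a fixed `lam` (instead of L-curve selection), and omits the Abel transformation.

```python
import numpy as np

def admm_nonneg_l1(A, b, lam=1e-2, rho=1.0, iters=300):
    """ADMM for min_x 0.5*||A x - b||^2 + lam*||x||_1  subject to  x >= 0."""
    n = A.shape[1]
    AtA, Atb = A.T @ A, A.T @ b
    # factorize once; each x-update is then a cheap triangular solve
    L = np.linalg.cholesky(AtA + rho * np.eye(n))
    x = z = u = np.zeros(n)
    for _ in range(iters):
        x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
        z = np.maximum(0.0, x + u - lam / rho)  # prox of l1 plus non-negativity
        u = u + x - z
    return z
```

The `z`-update is the key step: the proximal operator of the combined $\ell_1$ and non-negativity constraint reduces to a one-sided soft threshold.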

### B. Second stage: Initial temperature profile reconstruction

The outcome of the first reconstruction step is an approximation to $T_{\mathrm{virt}}$, which represents undamped acoustic waves that would be measured on the surface right after temporal excitation. Therefore, we must deal with an ultrasound reconstruction problem in the second stage, which is described by Eq. (6).

In ultrasonography, a point source has a hyperbolic response that depends on both the depth and the speed of sound in the specimen. Therefore, the forward operator $\mathbf{M} \in \mathbb{R}^{N_t N_y \times N_z N_y}$ is designed such that these diffraction hyperbolas are assigned to the corresponding point sources, while the inverse operator $\mathbf{M}^{+}$ collapses these hyperbolas back to point sources. In this work, $\mathbf{M}^{+}$ is provided by the well-known Stolt’s f-k migration^{24} and the T-SAFT^{25} methods, which allow reconstruction in the frequency and the time domain, respectively. The initial temperature profile can then be estimated directly by $\tilde{u} \approx \mathbf{M}^{+}\tilde{v}$. We hereafter refer to these algorithms as *fkmig* and *tsaft*.
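The delay-and-sum principle behind the time-domain variant can be illustrated with a minimal sketch. Everything here is an assumption for illustration (the grids, the `point_response` helper, and the plain triple loop); it is not the T-SAFT implementation referenced above.

```python
import numpy as np

def point_response(ys, zs, Ny, Nt, dy, dt, c=1.0):
    # hypothetical helper: ideal, noise-free hyperbolic response of one point source
    data = np.zeros((Nt, Ny))
    for iy in range(Ny):
        it = int(np.rint(np.hypot(iy * dy - ys, zs) / (c * dt)))
        if it < Nt:
            data[it, iy] = 1.0
    return data

def tsaft_sketch(data, Nz, dy, dz, dt, c=1.0):
    # time-domain SAFT: for each pixel, sum the samples on its diffraction hyperbola
    Nt, Ny = data.shape
    image = np.zeros((Nz, Ny))
    for iz in range(Nz):
        for iy in range(Ny):
            acc = 0.0
            for ir in range(Ny):
                it = int(np.rint(np.hypot((ir - iy) * dy, iz * dz) / (c * dt)))
                if it < Nt:
                    acc += data[it, ir]
            image[iz, iy] = acc
    return image
```

Applied to a single point response, the delay-and-sum image peaks at the true source position, which is the collapsing behavior described above.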

One difficulty in applying variational methods here is the lack of a proper inverse for $\mathbf{M}^{+}$. However, it has been shown that the adjoint operator can be used for this purpose.^{16,26} Therefore, in Eq. (6), we chose $\mathbf{M}$ to be equal to the adjoint of the T-SAFT matrix.^{25} Recall that the block-diagonal structure of $\mathbf{K}$ allows application of the singular value decomposition (SVD) of the kernel $K$ in Eq. (5), which enables estimation of the optimal $\lambda$ via the L-curve method^{23} in the first stage. However, in the second stage, calculating the SVD is unfeasible because the matrix $\mathbf{M}$ is too large in the case of T-SAFT or is not explicitly formed, as in the case of f-k migration. Hence, we use the IRfista numerical solver, which has recently been developed for large-scale linear inverse problems,^{27} where regularization is achieved via semi-convergence of the iterations. Due to physical constraints, we assume that the initial temperature is non-negative (i.e., $u \geq 0$), and we set $\Omega(u)=\Vert u \Vert_2$ in Eq. (6). We hereafter refer to this setup as *reg tsaft*.
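Regularization by early stopping under a non-negativity constraint, in the spirit of IRfista, can be sketched as a projected FISTA iteration. This generic version on an explicit matrix is an illustrative assumption, not the IRfista package itself.

```python
import numpy as np

def fista_nonneg(M, v, iters=200):
    """Accelerated projected gradient for min_u 0.5*||M u - v||^2 with u >= 0.

    With noisy data, the iteration count acts as the regularizer
    (semi-convergence), so `iters` plays the role of the parameter mu.
    """
    step = 1.0 / np.linalg.norm(M, 2) ** 2  # 1 / Lipschitz constant of the gradient
    u = y = np.zeros(M.shape[1])
    t = 1.0
    for _ in range(iters):
        u_new = np.maximum(0.0, y - step * (M.T @ (M @ y - v)))  # gradient + projection
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = u_new + (t - 1.0) / t_new * (u_new - u)              # momentum step
        u, t = u_new, t_new
    return u
```

For clean data the iteration converges to the non-negative least-squares solution; for noisy data one stops early, before the noise is fitted.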

Note that nonzero elements of $u$ represent either noise or defects, but that the latter show up in groups. Assuming that the volume of the defects is negligible compared to the full volume of the material, sparsity and group sparsity can also be imposed on $u$. This naturally raises the question of why not to use a penalty, such as $\Omega(u)=\Vert u\Vert_1$, that utilizes this knowledge in Eq. (6). The answer is twofold. First, the approximation of the virtual wave vector $\tilde{v}$ is influenced by both the measurement noise and the regularization error of the first step in Eq. (5). As a consequence, estimating the optimal regularization parameter for each measurement is an intractable task. Second, group sparse optimizers^{28,29} usually require *a priori* knowledge of the group boundaries, and their performance is also limited by the number of groups.^{30} None of these variational approaches provides the necessary level of freedom for detecting defects with arbitrary shapes, locations, and overlap in large-scale linear inverse problems.

Figure 4 illustrates the previously mentioned reconstruction techniques: we first approximated the ground truth virtual waves $T_{\mathrm{virt}}$ by applying ADMM in conjunction with Abel transformation as described in Sec. II A. This was followed by reconstruction of the initial temperature distribution $T_0$ in the second stage using *tsaft*, *reg tsaft*, and group sparse *grp.-tsaft*, as shown in Figs. 4(b)–4(d), respectively. Among these methods, the group sparse approximation seemed to be the best, but this approach required more than $150$ iterations using the SPGL1 sparse numerical solver.^{31,32} Note that for this single example, we tried several parameter setups to find the optimal values for the regularization parameter $\lambda$ and for the group sizes. Since this procedure would be intractable for large datasets, we omitted this approach from our extensive comparative study.

We tested and compared the performances of the previously mentioned model-based approaches on the dataset^{17} described in Sec. II A. Since the Abel-transformed ADMM showed the best performance in the first stage, we used this method to extract the virtual waves. We then applied *fkmig*, *tsaft*, and *reg tsaft* to approximate $T_0$ in the case of varying noise levels. Figure 3(b) shows that *fkmig* achieved the lowest reconstruction error in terms of MSE. Therefore, we chose this model-based reconstruction procedure as the baseline for the proposed deep learning approach.

## III. APPROACHES USING DEEP LEARNING

In this section, we describe two approaches to tackling thermal reconstruction that build on the same architectures to allow direct comparison (see Fig. 5). First, we trained deep neural networks in an end-to-end fashion. That is, we directly fed the surface temperature data to the network. Second, we utilized the virtual wave concept^{9} as a feature extraction step. In this case, we fed the resulting mid-level representation to the neural networks.

### A. Data

Deep learning approaches require vast amounts of data to learn the target distribution. This also applies to thermography, where the data depend on many factors, such as thermal diffusivity, parameters of the defects, and measurement setup. As covering all possible variations is impossible, we created the training set by using simulated data only. In this work, we considered Eq. (1) assuming adiabatic boundary conditions, since there is an analytic solution to this particular case,
$$\hat{T}(k_y, k_z, t) = \hat{T}_0(k_y, k_z)\, e^{-\alpha (k_y^2 + k_z^2)\, t}, \tag{10}$$
where $\hat{T}$ and $\hat{T}_0$ denote the Fourier cosine transforms of $T$ and $T_0$ in the $yz$-plane, and $k_y$ and $k_z$ are the corresponding spatial frequencies. In our experiments, the amplitude along the defects was considered to be constant. Therefore, to simulate the surface temperature data, we first generated $T_0$ as a binary image of size $N_z \times N_y$. This was used to calculate $\hat{T}$ by Eq. (10) and then $T$ by taking the inverse Fourier cosine transform of $\hat{T}$. Evaluating $T(y,z,t)$ at $z=0$ gave the simulated surface temperature data.
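A minimal sketch of this forward simulation, using the type-II discrete cosine transform as the discrete Fourier cosine transform (a common choice for adiabatic, i.e., Neumann, boundaries). The exact transform variant and frequency discretization used for the dataset are assumptions here.

```python
import numpy as np
from scipy.fft import dctn, idctn

def surface_temperature(T0, alpha, dy, dz, times):
    """Evaluate the analytic cosine-domain solution at the surface z = 0."""
    Nz, Ny = T0.shape
    That0 = dctn(T0, type=2, norm="ortho")
    kz = np.pi * np.arange(Nz) / (Nz * dz)  # assumed cosine-frequency grid
    ky = np.pi * np.arange(Ny) / (Ny * dy)
    decay = kz[:, None] ** 2 + ky[None, :] ** 2
    out = np.zeros((len(times), Ny))
    for i, t in enumerate(times):
        T = idctn(That0 * np.exp(-alpha * decay * t), type=2, norm="ortho")
        out[i] = T[0, :]  # the camera sees only the surface row
    return out
```

Two sanity checks follow directly from the physics: at $t=0$ the surface row of $T_0$ is returned unchanged, and for very large $t$ the adiabatic specimen relaxes to the mean of $T_0$.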

Our complete data are based on 10 000 different samples with up to five square-shaped defects with side lengths between two and six pixels.

First, we generated a binary target mask by randomly positioning the defects for each sample. Then, the corresponding surface temperature measurements were simulated by assuming adiabatic boundary conditions (i.e., no heat can flow in or out of the specimen; see, e.g., Sec. III in Ref. 9). The discrete Fourier number $\Delta \mathrm{Fo} = \alpha \cdot \Delta t / \Delta z^2$ was chosen to be $0.45$, where $\Delta t$ is the temporal resolution and $\Delta y = \Delta z$ is the spatial resolution along $y$ and $z$. These surface temperature measurements were later used as input to the end-to-end approach. Note that generating new training data for different thermal diffusivities was not necessary, because we could easily rescale the temporal and spatial resolution to meet the discrete Fourier number of the training data. For our hybrid method, we computed the virtual waves from the temperature measurements as described in Sec. II A (i.e., by using ADMM with Abel transformation^{10}). We highlight that the estimated virtual wave vector $\tilde{v}$ is independent of the thermal diffusivity $\alpha$ but depends on the dimensionless speed of sound $c$, which was chosen to be $1$.

The end result for each sample comprised three single-channel images 256 by 64 pixels in size: the temperature measurements, the virtual waves, and the target mask.

Additionally, we used ten different versions of each sample, representing SNRs from $-20\,\mathrm{dB}$ to $70\,\mathrm{dB}$ in $10\,\mathrm{dB}$ steps. On the one hand, this is a form of data augmentation that is supposed to increase robustness against changes in the level of SNR during training. On the other hand, having multiple versions of a sample allows more detailed performance evaluation and baseline comparison.
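Such noisy versions can be generated by scaling white Gaussian noise to a target SNR; the exact noise model of the dataset is an assumption in this small sketch.

```python
import numpy as np

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise so that 10*log10(P_signal / P_noise) = snr_db."""
    p_signal = np.mean(signal ** 2)
    sigma = np.sqrt(p_signal / 10 ** (snr_db / 10))
    return signal + rng.normal(0.0, sigma, signal.shape)
```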

We divided these data into three disjoint subsets as follows: our **training data** consisted of 8000 samples. We normalized the samples to zero mean and unit standard deviation using statistics computed from the training data only. Given that we had ten different versions of each sample representing different SNR levels, we ended up with 80 000 training samples.

For development and validation purposes, such as architecture engineering, hyperparameter tuning, and model selection, we always used the same 1000 samples. With ten SNR versions of each, we had a total of 10 000 samples in our **validation data**.

In order to estimate the generalization capabilities and to ensure a fair comparison to the baseline, we evaluated on an additional dataset. This **test dataset** was unseen in every regard and was also normalized based solely on the training data. It likewise consisted of 1000 samples in ten versions for different SNRs, thus comprising a total of 10 000 samples.
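To avoid information leakage, the normalization statistics come from the training split alone and are then reused for the validation and test splits, e.g.:

```python
import numpy as np

def normalize_splits(train, val, test):
    # statistics are computed on the training data only and reused everywhere
    mu, sigma = train.mean(), train.std()
    return tuple((x - mu) / sigma for x in (train, val, test))
```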

### B. Network architectures

We propose two carefully designed u-net architectures, where we used the implementation from Ref. 33 as a starting point. The first architecture was developed to be very compact and computationally inexpensive. The resulting model is best suited to embedded and/or real-time settings. The second architecture was developed with the goal of maximizing performance. While the computational cost was ignored in this case, it is still relatively small by modern standards.

The u-net architecture attracted considerable attention by winning the International Symposium on Biomedical Imaging (ISBI) cell tracking challenge in 2015. Initially, it was proposed for various medical image segmentation tasks.^{34} One major advantage of this architecture is that it can also produce adequate results with moderate amounts of training data. The underlying basic principle is that of a fully convolutional (i.e., without any fully connected layers) auto-encoder with a contracting path (i.e., encoder) and an expansive path (i.e., decoder). In the contracting path, the feature map resolution is successively reduced due to the effect of pooling. In the expansive path, pooling operators are replaced by upsampling operators, which successively increases the resolution.

The ability to localize is realized via skip-connections from the higher-resolution features of the contracting path to the upsampled output of the expansive path. For a more detailed description, we refer to the original paper.^{34}

The compact architecture has a depth of three in both the contracting and the expansive path. It has just 16 filters in the first (single-channel) layer, which results in about 109 000 weights. The second, larger architecture is very similar, except that it has an increased depth of five, which amounts to about $1.8 \times 10^6$ weights. Further increasing the number of filters and/or the depth of the network did not lead to improved performance in terms of validation loss.
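A sketch of such a compact u-net in PyTorch. The block layout and channel widths are our assumptions based on the description above (depth three, 16 first-layer filters, no batch normalization, on the order of $10^5$ weights); this is not the exact architecture from Ref. 33.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # two 3x3 convolutions with ReLU, no batch normalization
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class CompactUNet(nn.Module):
    """Depth-three u-net: two poolings, two upsamplings, skip connections."""
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = conv_block(1, 16), conv_block(16, 32)
        self.bottleneck = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec2 = conv_block(64, 32)
        self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = conv_block(32, 16)
        self.out = nn.Conv2d(16, 1, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))  # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return torch.sigmoid(self.out(d1))  # per-pixel defect probability
```

The sigmoid output pairs naturally with the BCE loss used for training, and the input and output resolutions match the 256 by 64 pixel images described above.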

### C. Training

To train the weights, we used the same procedure for both models, minimizing the *binary cross entropy* (BCE) loss as follows. We used the Adam optimizer^{35} with a learning rate of $1 \times 10^{-3}$ and no weight decay. We trained for 100 epochs; fast convergence rendered a learning-rate schedule unnecessary. Furthermore, we did not apply batch normalization, as several experiments had shown that it was not useful.

The only data augmentation we applied during training was a simple left-right flip of both the input image and the target mask with a probability of 50%. Note that many augmentations known to be helpful in other image-related tasks are not useful here. For instance, stretching or shearing an input and target image would not result in a physically consistent input-target pair.
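The paired flip can be implemented in a few lines; the essential point is that input and target are flipped together so the pair stays consistent.

```python
import numpy as np

def random_lr_flip(image, mask, rng):
    """Flip the (input, target) pair jointly with probability 0.5."""
    if rng.random() < 0.5:
        # flip along the lateral axis; .copy() avoids negative-stride views
        return image[:, ::-1].copy(), mask[:, ::-1].copy()
    return image, mask
```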

#### 1. Training end-to-end

In this section, we investigate the performance of an end-to-end approach. We fed the temperature measurements directly to the neural networks. This approach has the least computational load, but at the same time the highest task complexity.

For additional insights, we trained both architectures with four different amounts of training data, starting with 10 000 samples. We then increased the size of the training dataset repeatedly by a factor of two until we ended up with the full set comprising 80 000 samples. In order to increase the meaningfulness of the results, we repeated each training five times and report the means and standard deviations. Note that we always used the exact same validation data as described in Sec. III A.

The outcome of this experiment is summarized in Fig. 6, which shows a side-by-side comparison of the two architectures. We refer to the compact and to the larger architecture as **cmp** and **lrg**, respectively. As can be seen, increasing the number of training samples resulted in models with lower loss and a smaller generalization gap (i.e., the difference between training and validation loss). Additionally, the compact model fit the training and validation data less well than the larger model but also produced less overfitting. This is due to the low number of learnable weights, which acts as a regularizer. The compact architecture seems to give consistent results, as the standard deviation of both the training and validation results was relatively small (see cmp train/cmp val). The generalization gap vanished when 40 000 and 80 000 training samples were used.

The results of the larger architecture are similar, except that the loss was generally lower than for the compact architecture. Since the generalization gap also remained when training with 80 000 samples, increasing the training set size even further might be useful.

#### 2. Training with virtual waves

In this section, we investigate the performance of a hybrid approach. Here, the goal was to learn the reconstruction process from $\tilde{v}$ to $\tilde{u}$, which also incorporates the previously mentioned *a priori* knowledge. The input to the network was provided by the first regularization step in Eq. (5). The original end-to-end reconstruction problem from $\mathbf{d}$ to $\tilde{u}$ was thus converted to a much easier task, that is, recognition of hyperbolas in $\tilde{v}$.

The setup of the experiment was the same as with the end-to-end approach, except that we used virtual waves instead of temperature measurements as input. Note that, to achieve the most meaningful comparison possible, all the samples used for training and validation matched exactly those used in the end-to-end approach. Again, we repeated each training five times and report the means and standard deviations.

Figure 7 shows a side-by-side comparison of the two architectures. In general, the results reflect model behavior similar to that of the end-to-end approach, except that the loss was consistently lower. This was expected, as the hybrid approach takes some of the computational burden away from the neural network compared to the end-to-end approach. Therefore, virtual waves seem to be a useful mid-level representation for learning the reconstruction process.

### D. Model selection for baseline comparison

After training, we selected the best models for each architecture and both the hybrid and the end-to-end approach according to the validation data results. The resulting four models were then evaluated on the unseen test data against the baseline.

## IV. SIMULATION RESULTS

In this section, we discuss the results of the end-to-end and the hybrid approaches on the unseen test set and compare them to the baseline (see Sec. II B).

We start by providing intuitive insights by comparing the reconstructions of our approaches with the baseline method. For this, we take one of the hardest examples from the 0 dB SNR results. In addition to the low SNR, two other points make this example challenging. First, several defects are in close proximity and risk becoming merged together. Second, there is a defect located in between and underneath two other defects, which makes it difficult to detect.

Figure 8 shows the target (mask) in the top-left and the baseline reconstruction *fkmig* in the top-right subplot. Compared to the baseline, all four of our proposed approaches produced reconstructions that were much closer to the desired result, even *e2e cmp*, the computationally cheapest one. Furthermore, only the baseline method seemed to merge several defects together. Detecting the “shadowed” defect near the lower right corner was difficult for all methods. In general, only the larger models were able to detect the defects deeper in the material (the lower, the deeper). Furthermore, our deep-learning-based methods produced almost no artifacts, making it easier to find a good threshold for the final binary *defect/no defect* decision.

Notice that *e2e lrg* identifies the “shadowed” defect closer to the expected area without any false positive artifacts, although the MSE is larger than that of *hybrid lrg*. This might seem counterintuitive at first, but can be attributed to the fact that a visual—hence subjective—inspection of the results is prone to missing smaller differences from the target that still contribute to the objective error metric. For instance, in *hybrid lrg* of Fig. 8, three out of five defects have almost perfect reconstruction. In the case of *e2e lrg*, the same defects have blurred edges and slightly different amplitudes, which increase the overall MSE of the reconstructed image.

Depending on the specific application, it might be the case that either the absence of artifacts, or the correct position, shape, or size of the defect is the most important criterion for evaluation. Therefore, a different evaluation metric might be required. Combined with other techniques, such as uncertainty estimates of the predictions,^{36} a practically useful and efficacious system can then be engineered.

Next, we present the results of a more objective evaluation. Figure 9 shows the results of the baseline *fkmig* and of our proposed methods in terms of the MSE for various SNRs. Unsurprisingly, for all methods lower SNRs led to worse reconstructions and thus to higher losses. As can be seen, even the computationally cheapest method *e2e cmp* provided a substantial improvement over the baseline. Overall, the results confirm two findings from the validation results. First, the larger models *hybrid lrg* and *e2e lrg* performed consistently better in terms of MSE than their compact counterparts *hybrid cmp* and *e2e cmp*. Second, the hybrid models *hybrid cmp* and *hybrid lrg* performed consistently better in terms of MSE than their end-to-end counterparts *e2e cmp* and *e2e lrg*.

Interestingly, between 30 dB and 70 dB SNR, both hybrid models exhibited improving performance, whereas both end-to-end models stayed approximately at the same level of loss. This is another indicator that the virtual waves represent the information of the temperature measurements in a way that is easier to process by the neural networks. In order to explain this phenomenon in more detail, we present Fig. 10, depicting measurements (left column) and their corresponding virtual waves (right column) from the same sample, but assuming different SNR conditions. The plots at the bottom represent their corresponding squared difference. In order to make the squared differences visible to the naked eye, we had to upscale their magnitude by a factor of 10 000 and 10 for the measurements and the virtual waves, respectively.
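Since the training set relies on an AWGN model at prescribed SNRs, the noise-injection step can be sketched as follows (a minimal sketch; the function name and example signal are ours, not taken from the paper's codebase):

```python
import numpy as np

def add_awgn(signal: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Add white Gaussian noise so that the result has the requested SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))  # SNR = P_signal / P_noise
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Example: corrupt a synthetic 1D signal at two of the SNR levels used in the experiments.
clean = np.sin(np.linspace(0, 4 * np.pi, 1000))
noisy_0db = add_awgn(clean, 0.0)
noisy_30db = add_awgn(clean, 30.0)
```

Sweeping `snr_db` over the range used in the experiments (−20 dB to 70 dB) yields inputs comparable to those on the SNR axis of Fig. 9.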

Clearly, the *virtual waves* differ, with the 70 dB representation looking less blurry than the 30 dB representation, especially toward the bottom of the image. Additionally, the squared difference shows a meaningful pattern—information which potentially enables the neural network to produce ever better results as the SNR conditions improve.

On the other hand, the squared difference of the *measurements* does not show such an obvious pattern and is significantly lower in magnitude by a factor of approximately 1000 (see the MSE in brackets in Fig. 10). As the input to the end-to-end approach is extremely similar, the neural network also produces similar output, which is why the end-to-end results do not improve above 30 dB SNR anymore. This further demonstrates the effectiveness of the hybrid approach that incorporates domain knowledge.

## V. REAL-WORLD EXAMPLE

The experiments described so far were based on synthetic data only. However, we cannot simply assume that these results translate to the physical world, especially when we also use synthetic data for training, as was the case with the deep-learning-based methods. The generative process for the synthetic data might not incorporate all relevant aspects, rendering the results less relevant to practical applications.

Additionally, confounding factors could contribute to the problem that a seemingly good performance of a machine learning system might not be the consequence of actually understanding the underlying concepts that determine the target.^{37–40} Therefore, we considered it extremely important to provide additional results using real-world specimens.

However, for the task at hand, real-world data are hard to come by, mainly for two reasons. First, we are often blind to the objective ground truth of a real-world specimen, as obtaining it would require destroying the specimen itself. Thus, we do not report the MSE as we did with the synthetic data, but demonstrate the differences between the methods by showing their outputs along with an estimate of the ground truth. Second, creating data in the physical world is more complex than collecting and annotating samples for, say, an image classification task. Therefore, the following results are based on a single specimen, and we took measurements from various angles to increase their meaningfulness.

First, a physical epoxy resin phantom with two embedded steel rods was created as our real-world specimen. The rods were heated by eddy current induction for 2 s using an induction generator that provided 3 kW of power at a frequency of 200 kHz. The resulting temperature evolution was then measured on the surface by means of an infrared camera. The thermal diffusivity of the material was $\alpha = 8.14 \times 10^{-8}\,\mathrm{m^2/s}$ [see Eq. (4)]. Additional physical parameters of the specimen and the measurement setup are shown in Figs. 11(a) and 11(b), respectively. Note that the heated rods act as internal heat sources within the test sample, and since the epoxy resin is opaque in the infrared spectral range, we can only measure the surface temperature and must reconstruct the internal heat sources from it.

Figure 12(a) shows the measured temperature data at the sample surface with the highest contrast. It is clearly visible that the lower regions of the rods cannot be detected due to the low SNR. Figure 12(b) presents a snapshot of the measurement along the vertical diameter of the specimen at a spatial resolution of $\Delta x = 0.04/238\,\mathrm{m}$ and a temporal resolution of $\Delta t = 0.04\,\mathrm{s}$. As can be seen, some vertical and horizontal lines, caused by the pixel crosstalk of the infrared camera,^{41} distort the original image. This pattern noise was removed by eliminating the characteristic frequencies indicated by the red lines in Fig. 12(c). Then, the filtered measurement data were used to extract the virtual waves and to reconstruct the initial temperature profile $T_0$ in each 2D cross section of the specimen. Finally, all 2D estimates were composited into a 3D reconstruction. For the end-to-end approach, the measurement data were rescaled to obtain a discrete Fourier number of $0.45$, which enabled us to apply the generated training data.
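Assuming the usual definition of the discrete Fourier number, $\mathrm{Fo} = \alpha\,\Delta t/\Delta x^2$, the rescaling factor mentioned above can be derived as follows (the variable names are illustrative):

```python
ALPHA = 8.14e-8   # thermal diffusivity of the epoxy phantom, m^2/s [see Eq. (4)]
DX = 0.04 / 238   # spatial resolution of the measurement, m
DT = 0.04         # temporal resolution of the measurement, s

def fourier_number(alpha: float, dt: float, dx: float) -> float:
    """Discrete Fourier number Fo = alpha * dt / dx**2."""
    return alpha * dt / dx ** 2

fo = fourier_number(ALPHA, DT, DX)  # Fourier number of the raw measurement
scale = 0.45 / fo                   # temporal rescaling factor toward the training data
```

With these values, the raw measurement has $\mathrm{Fo} \approx 0.115$, so the effective temporal step must be increased by roughly a factor of four to reach the target of $0.45$.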

As in Secs. II–IV, we chose the reconstruction processes that achieved the lowest MSE, that is, the hybrid and end-to-end approaches with the large architectures *hybrid lrg* and *e2e lrg*, and applied them to the real-world data. We considered four variations of the measurement data. In the first case, the cross sections were perpendicular to the orientation of the steel rods, while in the other examples we rotated the specimen by $10^{\circ}$, $25^{\circ}$, and $45^{\circ}$.

The results of the model-based *fkmig* approach can be seen in Figs. 13–16. We reconstructed the volume of the steel rods (red) and also show the estimated ground truth (blue) together with the orientation of the cross sections (green). As can be seen, the two steel rods deeper inside the specimen no longer appear as separate defects. This effect becomes more pronounced in the rotated examples, and artifacts begin to appear at the edge of the specimen, which is especially visible in Fig. 16, where the specimen was rotated by $45^{\circ}$. Note that the results of the *fkmig* approach were so noisy that postprocessing was necessary: we zeroed out those pixels in the reconstructed image whose values were below a certain threshold. Tuning this threshold was difficult for the *fkmig* approach. For the sake of simplicity, we set a uniform threshold of 200 for all reconstructions in Figs. 13–24.
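The postprocessing described above is a simple element-wise cutoff; a minimal sketch (the function name is ours):

```python
import numpy as np

def threshold_reconstruction(recon: np.ndarray, thresh: float = 200.0) -> np.ndarray:
    """Zero out all pixels below the threshold; leave the remaining values unchanged."""
    out = recon.copy()
    out[out < thresh] = 0.0
    return out
```

A final binary defect/no-defect map is then simply `out > 0`.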

The end-to-end approach *e2e lrg* yielded meaningful 3D reconstructions, as can be seen in Figs. 17–20, where the two defects stay clearly separated except at the edges of the specimen. Again, thresholding was necessary to eliminate some artifacts from the figures, but the end-to-end method turned out to be relatively insensitive to the value of the threshold. Compared to the model-based *fkmig* approach, the results seem to be affected less by rotating the specimen, and fewer artifacts appear. However, the volume of the steel rods was underestimated. Interestingly, this approach also provided good results for the test case with the largest rotation angle ($45^{\circ}$; see Fig. 20).

The results of the hybrid approach can be seen in Figs. 21–24. As with the end-to-end approach, thresholding was necessary to eliminate some artifacts from the figures. Overall, the results seem to be similar to those from the end-to-end approach, and both steel rods appear as separate defects. However, the volume of the steel rods seems to be closer to the estimated ground truth. Note that the results for the specimen with the greatest rotation ($45^{\circ}$) show signs of deterioration, and thresholding did not help to improve the quality of the reconstruction. We discuss these limitations and the validity of the proposed deep learning approaches in Sec. VI.

## VI. DISCUSSION

Both deep-learning-based methods outperformed the model-based baseline substantially on synthetic data. They also seem to perform very well on real-world data, even though no real-world measurements were used in the training process. The networks successfully generalized the learned reconstruction algorithm to real-world data even under previously unseen conditions (rotated specimen). Compared to the model-based *fkmig* approach, the results are more consistent with the estimated ground truth, and thresholding was effortless.

Which method is preferable depends on the particular use case. If computational complexity must be kept as low as possible, the end-to-end approach is likely to be the better choice, although the results might not reflect the actual size and/or shape of a defect. If the quality of the results is more important than low complexity, the hybrid approach seems to be the better choice. The reason for the results of the hybrid approach deteriorating for the $45^{\circ}$ specimen is given in Sec. VI A.

### A. Validity of the model

Although the first stage of the reconstruction process (i.e., the virtual wave method) is valid in all dimensions, the second stage is applicable in 2D only. In fact, the hybrid deep learning approach was trained on synthetic 2D data assuming adiabatic boundary conditions. This means that the proposed real-world example worked well as long as the 2D slices were perpendicular or nearly perpendicular to the orientation of the steel rods. The deviation from the perpendicular orientation should not be greater than the angle between the steel rods and the measurement surface, which was about $10^{\circ}$. Note that the trained u-net nevertheless successfully generalized to rotated specimens. Interestingly, the reconstruction was very accurate even for a $25^{\circ}$ rotation.

Due to its ill-posed nature, various noise assumptions must be fulfilled in order to provide feasible solutions to the thermal reconstruction problem.^{42} In NDT applications, the temperature change is small for short integration times, and thus the noise, which otherwise follows a Poisson distribution,^{41} is considered to be additive white Gaussian (AWGN). In this work, we assumed AWGN with a range of variances such that the training set SNRs matched real-world experiments. In order to check this assumption, we estimated the SNRs and the peak signal-to-noise ratios (PSNRs) of each cross section $d$ in our test specimen [cf. Fig. 12(b)] as follows:

$$\mathrm{SNR} = 10\log_{10}\!\left(\frac{\frac{1}{N}\sum_{i=1}^{N} d_i^2}{\epsilon^2}\right), \qquad \mathrm{PSNR} = 10\log_{10}\!\left(\frac{\max_i d_i^2}{\epsilon^2}\right),$$

where $N$ denotes the overall number of elements in $d$, and $\epsilon = 0.025\,\mathrm{K}$ is the noise-equivalent differential temperature (NEDT) of our infrared camera. Figure 25 shows the estimated SNR and PSNR values for each vertical cross-section image of the measurement data. The range of SNRs ($-20$ dB to 70 dB) of the training set covered the estimated SNRs of the measurement data.
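One plausible way to compute such estimates, consistent with the quantities defined in the text (mean, respectively peak, signal power relative to the NEDT noise power; the function names are ours):

```python
import numpy as np

NEDT = 0.025  # noise-equivalent differential temperature of the camera, in K

def estimate_snr_db(d: np.ndarray, eps: float = NEDT) -> float:
    """SNR estimate: mean signal power over the NEDT noise power, in dB."""
    return 10.0 * np.log10(np.mean(d ** 2) / eps ** 2)

def estimate_psnr_db(d: np.ndarray, eps: float = NEDT) -> float:
    """PSNR estimate: peak signal power over the NEDT noise power, in dB."""
    return 10.0 * np.log10(np.max(d) ** 2 / eps ** 2)
```

Applying these estimators to each vertical cross section of the measurement reproduces curves of the kind shown in Fig. 25.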

### B. Computational complexity

From a computational point of view, training deep neural networks is an expensive task. However, once training is completed and the weights are fixed, inference on unseen data is computationally cheap in comparison. Inspired by Sovrasov's work,^{43} we computed the number of multiply-accumulate operations (MACs) in our u-net architectures and found that *e2e cmp* required $0.4$ GMAC, while *e2e lrg* needed $0.76$ GMAC to process one input image of size $256\times64$. In the case of the hybrid and the model-based approaches, the computational complexity of the virtual wave extraction, which was $65.5$ GMAC, must be added to the overall number of MACs. Note that the virtual wave extraction works on raw data, which explains the high computational load compared to the other three algorithms in Table I. This first stage can be sped up by processing the measurement frames in parallel.

| Method | cmp | lrg | fkmig | vwave extraction |
|---|---|---|---|---|
| Input | 256 × 64 | 256 × 64 | 256 × 64 | 256 × 2000 |
| GMACᵃ | 0.4 | 0.76 | 0.0077 | 65.5 |


ᵃEstimated number of overall giga multiply-accumulate operations.
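The per-layer MAC counts behind Table I can be approximated analytically: a 2D convolution performs one multiply-accumulate per kernel tap, input channel, output channel, and output position. A sketch with illustrative layer shapes (not the actual u-net configuration):

```python
def conv2d_macs(h_out: int, w_out: int, c_in: int, c_out: int,
                k_h: int, k_w: int) -> int:
    """MACs of one 2D convolution layer (stride/padding already folded into h_out, w_out)."""
    return h_out * w_out * c_in * c_out * k_h * k_w

# Example: a 3x3 convolution mapping 64 -> 128 channels on a 256 x 64 feature map.
macs = conv2d_macs(256, 64, 64, 128, 3, 3)
gmacs = macs / 1e9  # about 1.2 GMAC for this single (hypothetical) layer
```

Summing such terms over all layers gives totals comparable to the GMAC row of Table I.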

In order to give an impression of the reconstruction speed, we measured the execution time on an Intel(R) Core(TM) i9-9900K CPU system at 3.60 GHz equipped with an NVIDIA GeForce GTX TITAN X GPU. According to our experiments, the 3D reconstruction of the real-world test specimen took approximately 2 s for *e2e lrg*, while it was around 800 s for the large hybrid and the model-based *fkmig* approaches, including the virtual wave extraction. Note that in our setup, the difference in inference time between the compact and the large u-net model was negligible due to our GPU's ability to parallelize matrix multiplications. However, it will be more pronounced in settings where no GPU is available, for instance, on an embedded device.

## VII. CONCLUSIONS

We have proposed an end-to-end and a hybrid deep learning approach for thermographic image reconstruction. The latter uses the recently developed virtual wave concept,^{9} which proved to be an efficient feature extraction technique for our hybrid deep learning approach.^{17} We studied each step of the reconstruction chain by means of quantitative and qualitative tests. In doing so, we developed a framework^{44} for generating data samples, which can be used to train, validate, and test machine learning algorithms for thermographic imaging. In addition to these synthetic examples, we made a physical phantom from epoxy and used it for experiments on real-world data.^{45}

Our experiments showed that, in terms of MSE, the hybrid method performs better on synthetic data than the model-based and the end-to-end approaches. Both our deep learning methods performed well on real-world data compared to their model-based counterpart. The hybrid approach produced only a few artifacts and achieved the best reconstruction for the test cases in which the 2D model was valid (i.e., with $0^{\circ}$ and $10^{\circ}$ of rotation), while the end-to-end method gave meaningful results on the real-world measurement data up to the highest rotation angle ($45^{\circ}$). Overall, we conclude that the proposed hybrid method outperforms the model-based and the end-to-end approaches in terms of reconstruction error, while the end-to-end approach might perform better in online applications, where execution speed is crucial. As a first approximation, the end-to-end approach provides a fast reconstruction, which can be replaced by the hybrid method if the results are not satisfactory.

Further, we found that the virtual wave concept serves as an efficient feature extraction technique by which prior knowledge, such as non-negativity and sparsity, can be incorporated into the training process to improve the generalization properties of the neural network solutions and to reduce the size of the training set. In our simulations, the amplitude along the objects/defects was considered to be constant, mimicking, for instance, a metallic wire with constant cross section inside an insulating mass. In accordance with this approach, we designed a real-world experiment that fits the assumptions we made for the training set well. In fact, the amplitude along the steel rods did not change, the infrared camera we used fulfilled the AWGN assumption,^{42} and the SNR range of the measurement was covered by the training set. The applicability of the proposed deep learning approaches can be extended by augmenting the training set with additional data and by assuming weaker constraints, such as inhomogeneous defects and non-Gaussian noise. This will be part of our future work.

## AUTHORS’ CONTRIBUTIONS

P. Kovács and B. Lehner contributed equally to this work and both should be considered first authors of this manuscript.

## ACKNOWLEDGMENTS

This work was supported by Silicon Austria Labs (SAL), owned by the Republic of Austria, the Styrian Business Promotion Agency (SFG), the federal state of Carinthia, the Upper Austrian Research (UAR), and the Austrian Association for the Electric and Electronics Industry (FEEI); and by the COMET-K2 “Center for Symbiotic Mechatronics” of the Linz Center of Mechatronics (LCM), funded by the Austrian Federal Government and the Federal State of Upper Austria.

Financial support by the Austrian Federal Ministry for Digital and Economic Affairs, the National Foundation for Research, Technology and Development, and the Christian Doppler Research Association is gratefully acknowledged. Financial support was also provided by the Austrian Research Funding Association (FFG) within the scope of the COMET programme within the research project “Photonic Sensing for Smarter Processes (PSSP)” (Contract No. 871974). This programme is promoted by BMK, BMDW, the federal state of Upper Austria, and the federal state of Styria, represented by SFG.

Additionally, parts of this work were supported by the Austrian Science Fund (FWF) (Project Nos. P 30747-N32 and P 33019-N).

## DATA AVAILABILITY

The data and code that support the findings of this study are available at Ref. 46.

## REFERENCES

*Proceedings of the 45th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*

*International Conference on Medical Image Computing and Computer-Assisted Intervention*

*Proceedings of the 3rd International Conference for Learning Representations (ICLR)*

*Proceedings of the 2nd International Conference on Advances in Signal Processing and Artificial Intelligence (ASPAI)*

*Proceedings of the 3rd International Conference on Learning Representations (ICLR)*, edited by Y. Bengio and Y. LeCun

*Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*

*Proceedings of SPIE 10851, Photonics in Dermatology and Plastic Surgery*