Fringe projection profilometry (FPP) has become an increasingly prevalent technique in intelligent manufacturing, defect detection, and other important applications. In FPP, efficiently recovering the absolute phase has always been a great challenge. Stereo phase unwrapping (SPU) technologies based on geometric constraints can eliminate phase ambiguity without projecting any additional patterns, which maximizes the efficiency of absolute phase retrieval. Inspired by recent successes of deep learning in phase analysis, we demonstrate that deep learning can be an effective tool that organically unifies phase retrieval, geometric constraints, and phase unwrapping into a comprehensive framework. Driven by extensive training datasets, the neural network can gradually "learn" to transform a single high-frequency fringe pattern into the "physically meaningful" and "most likely" absolute phase, instead of proceeding "step by step" as in conventional approaches. Based on the properly trained framework, high-quality phase retrieval and robust phase ambiguity removal can be achieved from only a single-frame projection. Experimental results demonstrate that, compared with traditional SPU, our method can more efficiently and stably unwrap the phase of dense fringe images in a larger measurement volume with fewer camera views. Limitations of the proposed approach are also discussed. We believe that the proposed approach represents an important step forward in high-speed, high-accuracy, motion-artifact-free absolute 3D shape measurement for complicated objects from a single fringe pattern.
Optical non-contact three-dimensional (3D) shape measurement techniques have been widely applied in many fields, such as intelligent manufacturing, reverse engineering, and heritage digitalization.1 Fringe projection profilometry (FPP)2 is one of the most popular optical 3D imaging techniques due to its simple hardware configuration, flexibility in implementation, and high measurement accuracy.
With the development of imaging and projection devices, it has become possible to realize high-speed 3D shape measurement based on FPP.3–7 Meanwhile, the acquisition of high-quality 3D information in high-speed scenarios is increasingly crucial to many applications, such as online quality inspection, stress deformation analysis, and rapid reverse molding.8,9 To achieve 3D measurement in high-speed scenarios, efforts usually focus on reducing the number of images required per reconstruction to improve the measurement efficiency. The ideal case is to obtain 3D data from a single frame. Recently, we have realized high-accuracy phase acquisition from a single fringe pattern by using deep learning.10,11 However, these works only obtain a single-shot wrapped phase. To realize 3D measurement, phase unwrapping is required, which is one of the operations in FPP that affects the measurement efficiency the most. The most commonly used phase unwrapping methods are temporal phase unwrapping (TPU) algorithms,12,13 which recover the absolute phase with the assistance of Gray-code patterns or multi-wavelength fringes. However, the requirement of additional patterns decreases the measurement efficiency. The stereo phase unwrapping (SPU)14 method based on geometric constraints can solve the phase ambiguity problem through the spatial relationships between multiple cameras and one projector without projecting any auxiliary patterns. Although it requires more cameras (at least two) than traditional methods, SPU indeed maximizes the efficiency of FPP. However, conventional SPU is generally insufficient to robustly unwrap the phase of dense fringe images, whereas increasing the fringe frequency is essential to measurement accuracy. To resolve this trade-off, several auxiliary algorithms have been proposed, which generally fall into four directions.
(1) The first direction utilizes spatial phase unwrapping methods15 to reduce the phase unwrapping errors of SPU.14,16 Owing to the inherent disadvantages of spatial phase unwrapping, these methods cannot handle discontinuous or disjoint phases. (2) The second direction enhances the robustness of SPU by embedding auxiliary information in the fringe patterns.17,18 Since the assistance is based on intensity information, sensitivity to ambient light noise and large surface reflectivity variations of objects can cause these methods to fail. (3) The third direction increases the number of perspectives and recovers the absolute phase through more geometric constraints.19 This approach adapts better to complex scenes but comes at increased hardware cost. Besides, simply increasing the number of views is insufficient to unwrap the phase of dense fringe images, so it needs to be combined with (4) the depth constraint strategy.20–22 However, the conventional depth constraint strategy can only unwrap the phase in a narrow depth range, and setting a suitable depth constraint range is also difficult. The adaptive depth constraint (ADC)5,23 strategy can enlarge the measurement volume and automatically select the depth constraint range, but only if the correct absolute phase is obtained for the first measurement. In addition, since the stability of SPU relies on the similarity of the phase information of matching points in different perspectives,19 on the one hand, SPU requires high-quality system calibration and is more difficult to implement algorithmically than other phase unwrapping methods, such as TPU; on the other hand, it places high demands on the quality of the wrapped phase, so the wrapped phase in SPU is usually acquired by the phase-shifting (PS) algorithm,24 a multi-frame phase acquisition method with high spatial resolution and high measurement accuracy.
However, the use of multiple fringe patterns reduces the measurement efficiency of SPU. The other commonly used phase acquisition technologies are Fourier transform (FT) methods,25,26 which are single-shot in nature but not suitable for SPU due to the poor imaging quality around discontinuities and isolated areas in the phase map.
From the above discussion, it is clear that although SPU is best suited to 3D measurement in high-speed scenes, it still has some defects, such as a limited measurement volume, the inability to robustly unwrap the phase of high-frequency fringe images, a loss of measurement efficiency due to reliance on multi-frame phase acquisition methods, and the complexity of algorithm implementation. Inspired by the successes of deep learning in FPP10,11,27,28 and the advances in geometric constraints, and building on our previous deep-learning-based works, we further push deep learning into phase unwrapping and incorporate geometric constraints into the neural network. In our work, geometric constraints are implicit in the neural network rather than directly using calibration parameters, which simplifies the entire process of phase unwrapping and avoids the complex adjustment of various parameters. With extensive data training, the network can "learn" to obtain the "physically meaningful" absolute phase from a single-frame projection without the conventional "step-by-step" calculation. Compared with traditional SPU, our approach more robustly unwraps the phase of higher-frequency fringes with fewer perspectives over a larger range. In addition, the limitations of the proposed approach are analyzed in Sec. IV.
A. Phase retrieval and unwrapping with PS and SPU
As shown in Fig. 1, a typical SPU-based system consists of one projector and two cameras. The fringe images are projected by the projector, then modulated by the object, and finally captured by two cameras. For the N-step PS algorithm, the fringe patterns captured by camera 1 can be expressed as

I_n(uc, vc) = A(uc, vc) + B(uc, vc) cos[Φ(uc, vc) − 2πn/N],   (1)
where I_n represents the (n + 1)th captured image, n = 0, 1, …, N − 1, (uc, vc) is the camera pixel coordinate, A is the average intensity map, B is the amplitude intensity map, Φ is the absolute phase map, and 2πn/N is the phase shift. With the least-squares method,29 the wrapped phase φ can be obtained as

φ = arctan(M/D),  M = Σ_{n=0}^{N−1} I_n sin(2πn/N),  D = Σ_{n=0}^{N−1} I_n cos(2πn/N),   (2)
where (uc, vc) is omitted for convenience, and M and D represent the numerator and denominator of the arctangent function, respectively. The absolute and wrapped phases satisfy the following relation:

Φ = φ + 2πk,   (3)
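As a minimal illustration of the least-squares phase retrieval of Eq. (2) (a sketch, not the authors' code; it assumes the phase-shift convention Φ − 2πn/N of Eq. (1)):

```python
import numpy as np

def wrapped_phase(images):
    """Least-squares wrapped phase from N-step phase-shifted fringe images.

    images: array of shape (N, H, W), the captured patterns I_n with phase
    shifts 2*pi*n/N. Returns (phi, M, D): the wrapped phase in (-pi, pi]
    plus the numerator M and denominator D of the arctangent (Eq. (2)).
    """
    images = np.asarray(images, dtype=float)
    N = images.shape[0]
    n = np.arange(N).reshape(-1, 1, 1)
    M = np.sum(images * np.sin(2 * np.pi * n / N), axis=0)  # numerator
    D = np.sum(images * np.cos(2 * np.pi * n / N), axis=0)  # denominator
    return np.arctan2(M, D), M, D
```

For N ≥ 3 the background term A cancels in both sums, so the wrapped phase is insensitive to uniform ambient intensity.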
where k is the fringe order, k ∈ [0, K − 1], and K denotes the number of the used fringes. The fringe order k can be obtained by using SPU based on geometric constraints. For an arbitrary point in camera 1, there are K possible fringe orders corresponding to K absolute phases, with which K 3D candidate points can be reconstructed using the calibration parameters between camera 1 and the projector. The retrieved 3D candidates can then be projected into camera 2 to obtain the corresponding 2D candidates. Among these 2D candidates, there must be a correct matching point whose wrapped phase is more similar to that of the point in camera 1 than those of the other candidates. With this feature, the matching point can be determined through a phase similarity check, and the phase ambiguity of the point in camera 1 can then be eliminated. However, due to calibration errors and ambient light interference, some wrong 2D candidates may have a phase value more similar to that of the point in camera 1 than the correct matching point does. Furthermore, the higher the frequency of the used fringes, the more candidates there are, and the more likely such a situation is to occur. Therefore, to alleviate this issue, a multi-step PS algorithm with higher measurement accuracy and robustness toward ambient illumination is preferred, and high-frequency fringe patterns are not recommended.
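To make the candidate-selection logic concrete, the phase similarity check can be sketched as follows (a toy illustration: the geometric step that reconstructs the K 3D candidates and projects them into camera 2 is assumed to be given, and all names are ours, not the authors'):

```python
import numpy as np

def select_fringe_order(phi1, candidates_uv2, phase2_wrapped):
    """Phase similarity check of SPU (toy sketch).

    phi1: wrapped phase of one camera-1 pixel.
    candidates_uv2: K camera-2 pixel coordinates (u2, v2), one per
        fringe-order hypothesis k (assumed precomputed from the K 3D
        candidates and the calibration parameters).
    phase2_wrapped: camera-2 wrapped phase map.
    Returns the fringe order k whose candidate has the most similar
    wrapped phase to phi1.
    """
    best_k, best_err = 0, np.inf
    for k, (u2, v2) in enumerate(candidates_uv2):
        # circular phase difference, robust to the 2*pi wrap
        d = phase2_wrapped[v2, u2] - phi1
        err = abs(np.angle(np.exp(1j * d)))
        if err < best_err:
            best_k, best_err = k, err
    return best_k
```

As the text notes, with denser fringes the K hypotheses crowd together and calibration or noise errors can make a wrong candidate win this comparison.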
To enhance the stability of SPU, the common methods are either to increase the number of views or to apply the depth constraint strategy. The former, at increased hardware cost, further projects the 2D candidates of camera 2 into a third or even a fourth camera for the phase similarity check to exclude more wrong 2D candidates. The latter, at the cost of increased algorithm complexity, can eliminate in advance some wrong 3D candidates outside the depth constraint range. However, the conventional depth constraint algorithm is only effective in a narrow volume. Generally, SPU with at least three cameras assisted by ADC (the most advanced and complex depth constraint algorithm) can achieve robust phase unwrapping, on the premise that the correct absolute phase is obtained for the first measurement.5,23 However, such complex systems and algorithms make this strategy difficult to implement.
B. Phase retrieval and unwrapping with deep learning
The ideal SPU should use only two cameras and a single projected pattern to achieve robust phase unwrapping of dense fringe images in a large measurement volume without any complicated auxiliary algorithms. To this end, inspired by recent successes of deep learning techniques in phase analysis, we combine deep neural networks and SPU to develop a deep-learning-enabled geometric constraints and phase unwrapping method. The flowchart of our approach is shown in Fig. 2. We construct two four-path convolutional neural networks (CNN1 and CNN2) with the same structure (except for different inputs and outputs) to learn to obtain high-quality phase information and to unwrap the wrapped phase. The detailed architectures of the networks are provided in Appendix A. Next, we describe the steps of our algorithm. Step 1: To achieve high-quality wrapped phase retrieval, the physical model of the conventional PS algorithm is considered. We separately input the single-frame fringe images captured by camera 1 and camera 2 into CNN1, and the outputs are the numerators M and denominators D of the arctangent function corresponding to the two fringe patterns, rather than directly linked wrapped phases, since this strategy bypasses the difficulties associated with reproducing abrupt 2π phase wraps and provides a high-quality phase estimate.10 Step 2: After predicting the numerator and denominator terms, high-accuracy wrapped phase maps of camera 1 and camera 2 can be obtained according to Eq. (2). Step 3: To realize phase unwrapping, enlightened by the geometry-constraint-based SPU described in Sec. II A, which can remove phase ambiguity through the spatial relationships between multiple perspectives, the fringe patterns of the two perspectives are fed into CNN2.
Meanwhile, we integrate the idea of assisting phase unwrapping with reference plane information30 into our network and add the data of a reference plane to the inputs, allowing CNN2 to more effectively acquire the fringe orders of the measured object. Thus, the raw fringe patterns captured by the two cameras, as well as the reference information (containing two fringe images of the reference plane captured by the two cameras and the fringe order map of the reference plane in the perspective of camera 1), are fed into CNN2. It is worth mentioning that the reference plane information is obtained in advance and does not need to be re-acquired in subsequent experiments, which means only one set of reference information is necessary for the whole setup. The output of CNN2 is the fringe order map of the measured object in camera 1. Step 4: From the wrapped phases and the fringe orders obtained in the previous steps, the high-quality unwrapped phase can be recovered by Eq. (3). Step 5: After acquiring the high-accuracy absolute phase, the 3D reconstruction can be carried out with the calibration parameters31 between the two cameras (see Appendix B for details).
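Once the network outputs are available, Steps 2 and 4 above reduce to a few array operations; a minimal sketch (illustrative names, not the authors' code):

```python
import numpy as np

def absolute_phase(M, D, k):
    """Steps 2 and 4 of the pipeline: recover the wrapped phase from the
    CNN1 outputs (numerator M and denominator D, Eq. (2)), then unwrap it
    with the fringe-order map k predicted by CNN2 (Eq. (3))."""
    phi = np.arctan2(M, D)       # Step 2: wrapped phase in (-pi, pi]
    return phi + 2 * np.pi * k   # Step 4: absolute phase
```

The remaining Step 5 (triangulation between the two cameras) is a standard stereo reconstruction once the absolute phase identifies the matching pixels.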
To verify the effectiveness of the proposed approach, we construct a dual-camera system, which includes a LightCrafter 4500Pro (912 × 1140 resolution) and two Basler acA640-750 μm cameras (640 × 480 resolution). 48-period PS fringe patterns are used in our experiments. The size of the measuring field is about 240 mm × 200 mm.
To train our networks, we collect training datasets from 1001 different scenarios. After hundreds of training epochs, the training and validation losses of the networks converge without overfitting. Further details on the collection of training data and the training process of the neural networks are provided in Appendix C.
A. Qualitative evaluation
To test the effectiveness of our approach, we first measure four static scenarios, containing single or multiple isolated objects with complex shapes, none of which appear in the training and validation datasets. We use four methods to measure these scenes. The first method uses PS to obtain the wrapped phase and triple-camera SPU with ADC to obtain the absolute phase (whose results are taken as the ground truth); the second method uses PS to obtain the wrapped phase and dual-camera SPU with the conventional depth constraint strategy to obtain the absolute phase; the third method uses PS to obtain the wrapped phase and directly uses the reference phase to unwrap it; the fourth method is our approach. The measurement results are shown in Fig. 3. It can be seen from the results of the second method that conventional dual-camera SPU with depth constraints is insufficient to unwrap the phase of high-frequency fringes. The parts marked by the black dotted boxes in Fig. 3 show the phase unwrapping errors of the third method, from which we can see that the reference plane can only unwrap the wrapped phase in a limited range, namely between −π and π of the absolute phase of the reference plane, whereas with our approach, the ambiguity of the wrapped phase can be accurately eliminated over a large depth range. In addition, our deep-learning-assisted approach can yield high-quality reconstruction results, almost of the same quality as those obtained by the conventional PS, triple-camera SPU, and ADC methods.
We also test four continuously moving scenarios to demonstrate the superiority of our approach in dynamic target measurement (note that all our training and validation datasets are collected in static scenes). The measurement results are shown in Fig. 4 (Multimedia view). It can be seen from the left three columns of Fig. 4 (Multimedia view) that the multi-frame imaging characteristics of the PS algorithm lead to obvious motion-induced artifacts in the reconstruction results when encountering moving objects. In addition, due to its sensitivity to phase errors, the results acquired by SPU are visibly worse. Because of the single-shot nature of our approach, the measurement can be performed uninterruptedly without being affected by motion artifacts in dynamic scenarios, as shown in the rightmost column of Fig. 4 (Multimedia view).
B. Quantitative evaluation
To quantitatively estimate the reconstruction accuracy of our approach, we measure two standard spheres with calibrated radii of 25.3989 mm and 25.4038 mm (calibration errors of 1.8 µm and 3.5 µm, respectively) and a center-to-center distance of 100.0532 mm with an uncertainty of 1.1 µm. The measurement result is shown in Fig. 5(b). We perform sphere fitting on the measured results of the two spheres, and the error distributions are shown in Fig. 5(c). The radii of the reconstructed spheres are 25.4616 mm and 25.4648 mm, with deviations of 52.7 µm and 61.0 µm, respectively. The measured center distance is 99.9878 mm, with an error of 65.3 µm. This experiment validates that our method can provide high-quality 3D measurements with fewer cameras, fewer projected patterns, and simpler algorithms.
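For reference, the sphere fitting used in such an evaluation can be sketched with a standard linear least-squares (algebraic) fit; this is a generic method, not necessarily the authors' exact procedure:

```python
import numpy as np

def fit_sphere(pts):
    """Algebraic least-squares sphere fit.

    Expanding (x-a)^2 + (y-b)^2 + (z-c)^2 = r^2 gives the linear system
    x^2 + y^2 + z^2 = 2ax + 2by + 2cz + d  with  d = r^2 - (a^2+b^2+c^2).
    pts: (N, 3) array of measured 3D points. Returns (center, radius).
    """
    pts = np.asarray(pts, dtype=float)
    A = np.c_[2 * pts, np.ones(len(pts))]
    b = (pts ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = sol[:3]
    radius = np.sqrt(sol[3] + center @ center)
    return center, radius
```

The fitted radius and the distance between the two fitted centers are then compared against the calibrated values to obtain the deviations reported above.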
IV. CONCLUSIONS AND DISCUSSIONS
In this work, we present a deep-learning-enabled geometric constraints and phase unwrapping approach for single-shot absolute 3D shape measurement. Our approach avoids the shortcomings of many traditional methods, such as the trade-off between efficiency and accuracy in conventional phase retrieval methods, and the trade-off of SPU among phase unwrapping robustness, a large measurement range, and the use of high-frequency fringe patterns. On the premise of a single-frame projection, our method can solve the phase ambiguity problem of dense fringe images in a larger measurement range with fewer perspectives and simpler algorithms. We believe that the proposed approach provides important guidance for high-accuracy, motion-artifact-free absolute 3D shape measurement of complicated objects in high-speed scenarios.
Traditional methods usually proceed step by step based on prior knowledge. For example, SPU first finds the 3D candidates, second uses depth constraints to remove unreliable candidate points, third projects them into another perspective, and finally performs the phase similarity check. Due to this step-by-step process, the available information, such as spatial and temporal information, is not effectively utilized; comprehensive utilization of all valid data requires strong, specialized prior knowledge, which is very difficult to achieve. Deep learning, however, can accomplish this. Through data-driven training, these problems can be effectively integrated into a comprehensive framework. In our work, this framework organically incorporates phase acquisition, geometric constraints, and phase unwrapping: these methods are no longer reproduced step by step as in traditional approaches but are integrated together. However, since the data sources of our method are 2D images, deep learning is by no means always reliable when the image itself is ambiguous. For example, when a large depth discontinuity of the object results in missing fringe orders and continuity artifacts from the camera view (Fig. 6), such inherent ambiguity in the captured fringe pattern cannot be resolved by deep learning techniques without additional auxiliary information, such as fringe images of different frequencies. In the future, we will further integrate the physical model into deep-learning-based FPP and construct FPP driven by both data and physics.
This study was supported by the National Natural Science Foundation of China (Grant Nos. 61722506, 61705105, and 11574152), National Key R&D Program of China (Grant No. 2017YFF0106403), Outstanding Youth Foundation of Jiangsu Province (Grant No. BK20170034), Fundamental Research Funds for the Central Universities (Grant Nos. 30917011204 and 30919011222), and Leading Technology of Jiangsu Basic Research Plan (Grant No. BK20192003).
APPENDIX A: ARCHITECTURE OF THE NEURAL NETWORKS
We take CNN1 as an example to reveal the internal structure of the constructed networks, as shown in the upper right part of Fig. 2. A 3D tensor of size (H, W, C0) is used as the input of the network, where (H, W) is the size of the input images and C0 represents the number of input images. For each convolutional layer, the kernel size is 3 × 3 with a convolution stride of one, zero-padding is used to control the spatial size of the output, and the output is a 3D tensor of shape (H, W, C), where C = 64 represents the number of filters used in each convolutional layer. In the first path of CNN1, the input is processed by a convolutional layer, followed by a group of residual blocks (containing four residual blocks) and another convolutional layer. Each residual block consists of two convolutional layers, each activated by a rectified linear unit (ReLU), stacked one above the other,32 which alleviates the accuracy degradation of deeper networks and eases the training process. In the other three paths, the data are down-sampled by pooling layers by factors of two, four, and eight, respectively, for better feature extraction, and then up-sampled by the upsampling blocks to match the original size. The outputs of the four paths are concatenated into a tensor with quadruple channels. Finally, two channels are generated in the last convolutional layer (one channel in CNN2). Except for the last convolutional layer, which is activated linearly, the rest use ReLU as the activation function. The mean squared errors of the outputs with respect to the ground truth are used as the loss function, and adaptive moment estimation33 is utilized to tune the parameters to find the minimum of the loss function.
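A PyTorch sketch of one path of the described architecture follows (our illustrative reconstruction, not the authors' code: it shows the residual blocks and 3 × 3 convolutions of the first path; the three pooled/up-sampled paths and the channel concatenation are omitted):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 conv + ReLU layers with an identity skip connection."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)

class PathCNN(nn.Module):
    """First path of the four-path network: conv -> 4 residual blocks ->
    conv, with a linear (non-activated) last layer. For CNN1 the output
    has 2 channels (numerator M and denominator D); for CNN2, 1 channel
    (the fringe order map)."""
    def __init__(self, in_ch, out_ch=2, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            *[ResBlock(ch) for _ in range(4)],
            nn.Conv2d(ch, out_ch, 3, padding=1),  # linear activation
        )

    def forward(self, x):
        return self.net(x)
```

Zero-padding keeps the spatial size (H, W) constant throughout, matching the description above.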
APPENDIX B: SYSTEM CALIBRATION AND 3D RECONSTRUCTION
After acquiring the high-accuracy absolute phase, the matching points of the two cameras can be uniquely identified. Then, the 3D reconstruction can be carried out with the pre-calibrated parameters between the two cameras. The reason why we utilize two cameras for reconstruction instead of one camera and one projector is that the multi-camera system can automatically cancel nonlinearity errors.34 The calibration parameters, which contain the intrinsic, extrinsic, and distortion parameters of the cameras, are calibrated based on the MATLAB Calibration Toolbox and optimized with bundle adjustment.31,35
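Once a matching pixel pair is identified via the absolute phase, the reconstruction reduces to standard two-view triangulation. A minimal linear (DLT) sketch, ignoring lens distortion and the bundle-adjustment refinement mentioned above (illustrative, not the authors' implementation):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of a matched pixel pair.

    P1, P2: 3x4 camera projection matrices (intrinsics @ [R | t]).
    x1, x2: matched pixel coordinates (u, v) in cameras 1 and 2,
        identified through the absolute phase.
    Returns the 3D point in world coordinates.
    """
    # Each pixel contributes two linear constraints on the homogeneous
    # world point X: u * (P[2] @ X) = P[0] @ X, and similarly for v.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]               # null vector of A (least-squares sense)
    return X[:3] / X[3]      # dehomogenize
```

In practice, the measured pixels would first be undistorted with the calibrated distortion parameters before triangulation.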
The reconstructed 3D coordinates are in the world coordinate system, the 0-depth plane of which corresponds to the position of the first calibration pose. For example, when the relationship between the position of a pair of standard spheres and the first calibration pose (the 0-depth plane of the world coordinate system) is as shown in Fig. 7, where Fig. 7(a) is the front view of the standard spheres and the calibration board in the first calibration pose and Fig. 7(b) is their top view, the depths of points p1 and p2 are z1 and z2, respectively.
APPENDIX C: TRAINING THE NEURAL NETWORKS
To collect the training datasets, different types of simple and complex objects are arbitrarily combined and rotated through 360° to generate 1001 diverse scenes. Figure 8 shows six representative scenarios from the 1001 training scenes, the first of which is the reference plane. Considering the subsequent comparative experiments (verifying that our approach using only two perspectives can outperform SPU using three cameras in dynamic scenes), we collect data from three views, where each set consists of 3-step PS fringe patterns captured by three cameras. Within each set of data, we calculate the ground-truth numerator M and denominator D by the 3-step PS algorithm and obtain the fringe order maps by using triple-camera SPU and ADC (note that the fringe orders can also be acquired through only a single camera by projecting multiple fringe patterns of different frequencies and using TPU). Before being fed into the networks, the fringe images are divided by 255 for normalization, and the fringe order maps are divided by the number of the used fringes (48) for normalization, which makes the learning process easier for the network. When training the CNNs, 800 sets of data are used for training and 200 sets for validation. The training and validation datasets have been uploaded to figshare (https://doi.org/10.6084/m9.figshare.11926809; https://figshare.com/s/f150a36191045e0c1bef).
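The normalization described above can be sketched in a couple of lines (illustrative helper; the factor 48 is the number of fringes used here):

```python
import numpy as np

def normalize_sample(fringe_imgs, order_map, num_fringes=48):
    """Scale 8-bit fringe images to [0, 1] and divide the fringe-order
    map (k in [0, K-1]) by the number of fringes K, as done before
    feeding samples to the networks."""
    return (np.asarray(fringe_imgs, dtype=float) / 255.0,
            np.asarray(order_map, dtype=float) / num_fringes)
```

Keeping inputs and labels on comparable scales avoids one loss term dominating during training.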
The constructed neural networks are trained on a GTX Titan graphics card (NVIDIA). Figure 9 shows the loss curves of the CNNs. For CNN1, the loss curves converge after about 200 epochs, and the training of 400 epochs takes 25.56 h; for CNN2, the loss curves converge after 120 epochs, and the training of 300 epochs takes 19.25 h. Note that the loss scales of the two networks differ because their outputs are not on the same scale: the numerator M and denominator D can reach the hundreds, while the fringe orders k are normalized.