Phase retrieval, the problem of recovering lost phase information from measured intensity alone, is an inverse problem that is widely faced in various imaging modalities ranging from astronomy to nanoscale imaging. The current process of phase recovery is iterative in nature. As a result, the image formation is time consuming and computationally expensive, precluding real-time imaging. Here, we use 3D nanoscale X-ray imaging as a representative example to develop a deep learning model to address this phase retrieval problem. We introduce 3D-CDI-NN, a deep convolutional neural network and differential programing framework trained to predict 3D structure and strain, solely from input 3D X-ray coherent scattering data. Our networks are designed to be “physics-aware” in multiple aspects; in that the physics of the X-ray scattering process is explicitly enforced in the training of the network, and the training data are drawn from atomistic simulations that are representative of the physics of the material. We further refine the neural network prediction through a physics-based optimization procedure to enable maximum accuracy at lowest computational cost. 3D-CDI-NN can invert a 3D coherent diffraction pattern to real-space structure and strain hundreds of times faster than traditional iterative phase retrieval methods. Our integrated machine learning and differential programing solution to the phase retrieval problem is broadly applicable across inverse problems in other application areas.

## I. INTRODUCTION

Phase retrieval, the problem of recovering lost phases from measured intensities alone, is the underlying basis for a variety of imaging modalities in astronomy,^{1} Lorentz transmission electron microscopy (Lorentz-TEM),^{2} super-resolution optical imaging,^{3} and of particular relevance for this article, electron and X-ray coherent diffraction imaging (CDI) techniques including ptychographic methods.^{4,5} In CDI, for instance, the object of interest is illuminated with a coherent beam and the resulting scattered intensities are measured in the far field. In the purest form, these measured intensities correspond to the modulus of the complex Fourier transform of the measured sample. While scattered intensities can be measured, the phase information contained in the scattered wavefield is lost. Consequently, the image cannot be retrieved with a simple inverse Fourier transform.

Coherent imaging techniques are acutely sensitive to material properties that influence the phase of the scattered wave. When measured at a Bragg peak, local distortions of the lattice within the sample will directly impact the relative phases within the scattered wavefield. The coherent diffraction interference pattern will then encode the lattice distortion within the sample.^{6} Recovering the object structure (and hence, also phase) from the scattered intensities provides a 3D image of both the object's structure as well as the distortion of the lattice (represented as relative phase within the complex image) with sensitivity on the order of a few picometers.^{7} This ability to obtain nanoscale structure and picometer sensitivity to distortions caused by strain has been successfully used by the materials and chemistry communities to study a variety of dynamic processes resolved in time using X-ray Bragg CDI. Some examples include grain growth and annealing,^{8} defect migration in battery electrodes,^{9} ultra-fast phonon dynamics,^{10–12} *in situ* catalysis,^{13,14} and mechanical deformation.^{15,16} While X-ray CDI has grown to be an extremely powerful means of characterizing the *in situ* and operando response of materials, the process of phase retrieval is computationally expensive. Iterative phase retrieval methods typically require thousands of iterations and multiple runs from random initialization to converge to a solution of high accuracy, taking several minutes, even on modern graphical processing units (GPU). Furthermore, convergence of the algorithms is sensitive to optimization hyper-parameters, such as the choice of algorithms, algorithmic parameters, data thresholds, and data initialization.^{17,18} These challenges preclude real-time phase retrieval and feedback, which is highly desirable across a broad range of imaging modalities.

Neural network solutions have been proposed to quickly solve various inverse problems including in magnetic resonance imaging (MRI),^{19} inverse design of opto-electronic devices,^{20} and phase retrieval problems.^{21–24} While these results show the promise of deep learning in providing rapid solutions, more general concerns about the susceptibility of neural networks to sudden failures remain. These include their inability to extrapolate and generalize to inputs outside of the training distribution and their susceptibility to subtle biases in the training data. For instance, it was shown that a deep neural network that was approved for use as a medical device in Europe to detect skin melanomas was often making its predictions based on the presence of surgical markers in the dermoscopic images and not from any skin features.^{25} What is needed then is a means of correcting predictions from neural networks in the event of errors, regardless of their magnitude. Additionally, while generative models have been widely applied to generate 2D images, generation of 3D structures is a nascent field.^{26} The data requirements to model 3D structures are larger than for 2D, and the addition of an extra dimension means that there are more symmetries that need to be learned.

Here, we introduce a framework that uses a 3D convolutional encoder–decoder network (3D-CDI-NN) in conjunction with a physics-based optimization procedure to solve the inverse problem in 3D, using coherent imaging as a representative example. We use the reverse-mode automatic differentiation (AD) method both to make the 3D-CDI-NN model physics-aware during the training phase, and to refine the predicted image during the testing phase. We demonstrate that such an integrated approach of using a physics-based refinement stage on the 3D-CDI-NN prediction maximizes the speed and the accuracy of the inversion procedure. Our approach is applicable to several inverse problems and is predicated on knowledge of the forward model, from which both the training dataset and the refinement through optimization are derived.

## II. RESULTS

### A. Approach

Figure 1 illustrates our approach for inverting 3D coherent imaging data to a real-space structure and strain field. The workflow consists of two stages: first, there is a computationally intensive offline training stage that involves training the 3D-CDI-NN model on data generated from large-scale atomistic simulations. Second, the trained 3D-CDI-NN is used in a fast online prediction stage that enables real-time predictions of 3D structure and strain. These predictions can then be refined using a gradient-based optimization procedure based on automatic differentiation.^{27}

Below, we briefly describe the training dataset and neural network architecture used in our offline training stage. Subsequently, we demonstrate the efficacy of the approach of using the trained model and subsequent refinement in making online predictions using simulated data as well as experimental data.

### B. Physics-informed training set

Effective training of a neural network model hinges on the availability of training data that are sufficiently diverse and representative. To obtain training data that are representative of experimental data, we derive them from a physics-informed data preparation pipeline using atomistic structures (Fig. 2). Each example in the training set is created as follows: First, a polyhedral shape is generated by clipping a cube-shaped Face-Centered Cubic (FCC) lattice crystal along randomly selected high symmetry orientations (see Sec. IV A). A random combination of compression, tension, and shear stresses is applied on the atomistic object to create a strain field in the material. The structure is then energetically relaxed using Large-Scale Atomic/Molecular Massively Parallel Simulator (LAMMPS),^{28} a parallel molecular dynamics (MD) simulation package. This energy-minimized atomic configuration is then used to calculate atom densities and displacements, which are spatially voxelized into a 32 × 32 × 32–sized grid (length of each voxel is ≈2 lattice units) and used to compute the 3D coherent diffraction patterns about the (111) Bragg peak, which gives a diffraction pattern array size of 32 × 32 × 32 (see Sec. IV B). Note that since there is no additional padding added to the simulated crystals, the simulated diffraction patterns are likely not oversampled (i.e., oversampling ratio of less than 2).

### C. Neural network architecture

3D-CDI-NN is a feed-forward neural network model (Fig. 3) which consists of a convolutional autoencoder and two identically structured deconvolutional decoders. The encoder takes a 32 × 32 × 32–sized input image of 3D diffraction pattern magnitude and encodes it via a series of rectified linear unit (ReLu) convolutional layers and max pooling layers into a latent space that represents the underlying features. The same encoded data are passed through a series of ReLu convolutional layers and upsampling layers in two separate decoders to obtain 32 × 32 × 32-sized output images that map the encoded diffraction pattern to the corresponding shape and phase of the real-space image. A kernel of size 3 × 3 × 3 is used as the convolution, max pooling, and upsampling windows. The network is trained in a supervised manner, where the output images for the training diffraction data are known *a priori*.

Dropout layers are added to the input layer and convolutional layers to help train a robust network. Dropout, which is the practice of randomly suppressing the output of various neurons during training, helps to train more robust neural networks by forcing the network to learn multiple representations of the same input data.^{29} The use of dropout layers on the input layer can be thought of as adding blankout noise to the training images, which is analogous to the process of augmenting simulated images by Gaussian or Poisson noise to minimize overfitting to idealized data. The convolutional and max pooling operations (max pooling is a binning/down-sampling operation using the maximum value over a prespecified pixel neighborhood) serve to transform the data (in this case the diffraction magnitudes) into feature space, while the deconvolutional and upsampling operations serve to transform back from feature space into image space. In addition, the physics of the forward model is enforced via a custom objective function that minimizes the mean absolute error (MAE) between the magnitude of input diffraction pattern and that obtained from Fourier transform of the recombined predicted shape and phase images (see Sec. IV C).

### D. 3D-CDI-NN performance on simulated data

Figure 4 shows examples demonstrating the performance of 3D-CDI-NN on simulated data outside of the training dataset. From three, 32 × 32 × 32–sized input simulated diffraction patterns, 3D-CDI-NN predicts the corresponding real-space images in the same number of volume elements (voxels). Thus, for a 32 × 32 × 32–sized input diffraction pattern, 3D-CDI-NN makes 65 536 predictions that correspond to the amplitude and phase in each voxel in the sample space. As seen in Fig. 4, 3D-CDI-NN learns to predict sample structure and strain from input diffraction data alone. 3D-CDI-NN accurately predicts details, such as facets and edges of the objects, and it can make such predictions even from diffraction patterns that are not oversampled, which is in contrast to traditional phase retrieval reconstructions, where oversampled diffraction pattern is required for proper convergence.^{6,15}

Due to symmetries in a diffraction pattern, there are two equivalent solutions (twin) for every crystal prediction, i.e., the predicted crystal and its inversion in real space with complex conjugated phase. Crystal 1 in Fig. 4 illustrates such an example. Iterative reconstruction algorithms typically converge to one of these two solutions. To account for this effect in the error distribution plot, we compute MAE for the twin crystals and pick the smaller error of the two. Furthermore, we account for translational invariant in the 3D-CDI-NN predictions by centering the crystals to their center of mass prior to the MAE calculation. 3D-CDI-NN tends to overpredict phases of the real-space image when the object is weakly strained, which we partly resolved by adding examples of crystals with no strains to the training data (see Sec. IV B).

### E. Deep Learning (DL) Prediction refinement

To improve the quality of reconstruction obtained through our approach, we refine 3D-CDI-NN's structure and strain prediction by iteratively minimizing an objective function that measures the fit between the target diffraction pattern and the diffraction pattern obtained from the current structure and strain guesses. The minimization step uses the Adam adaptive gradient descent algorithm^{30} along with the shrink-wrap routine^{17} (see Sec. IV E). Here, the gradient updates are calculated using the same software package as the 3D-CDI-NN model (in our case, Google's Tensorflow) by using the reverse-mode automatic differentiation (AD) technique. When using the AD technique, we only need to specify the physics-based forward model that generates the diffraction patterns from the structure and strain guesses, and do not need to manually specify the full gradient expression for the objective function.^{27} We note that the NN training step also relies on the AD technique, with the exception that the 3D-CDI-NN forward model consists of both the physical forward model (for the custom physics-aware objective function) and the feed-forward neural network.

To illustrate the effectiveness of the refinement process, we select the same three simulated crystals shown in Fig. 4 and perform AD refinement on the 3D-CDI-NN model predictions. The refinement is done via feeding the direct input and output of 3D-CDI-NN model (32 × 32 × 32–sized image arrays) to AD. One has the option to invoke the AD step in different ways. For instance, the 32-sized array used in the NN prediction could be used, or a version of the data that was not down-sampled. We try both strategies in this work, first by applying the 32-sized data in the simulation set, and a 64-sized data block in the experimental validation. A comparison of the 3D-CDI-NN model prediction and refined prediction vs the ground truth (simulated target) is shown in Fig. 5. Interestingly, despite using data that are not oversampled (32 × 32 × 32–sized image arrays) in the AD refinement process, it is still able to significantly improve upon the initial 3D-CDI-NN model predictions, especially in recovering structure of the local strain within the crystal. Although phase retrieval using AD typically requires oversampled data for convergence from a random initial guess, we hypothesize that by starting with a guess that is very close to the true solution, and the use of an accurate and tight support constraint, we are able to converge close to the true solution even with data that are not oversampled. We note that we have also achieved similar refinement results using oversampled data (i.e., neural network predictions padded to 64 × 64 × 64 and the corresponding re-sampled diffraction patterns).

### F. Experimental BCDI measurement

A sample containing gold nanoparticles on a Si substrate was prepared by dewetting a thin film of gold at 900 °C. A nanoparticle was chosen at random and illuminated by a coherent beam focused to ≈500 × 500 nm, which was large enough to fully illuminate the nanoparticle. We measured the resulting 3D coherent X-ray diffraction pattern about the crystals (111) Bragg peak. To obtain this 3D dataset, we rotated the sample through 0.6° in steps of 0.005°, which resulted in 120, 2D slices through the diffraction pattern. These slices were stacked in the 3rd dimension to give a dataset of 151 × 133 × 103 reciprocal space voxels. The 55 *μ*m pixels on the detector were sufficient to oversample the measured diffraction data by a factor of 2 or more at the detector distance of 0.9 m.

### G. Model validation on experimental data

To evaluate the performance of the trained 3D-CDI-NN model on real data, we prepare input data by down-sampling the 3D coherent diffraction pattern of the gold crystal obtained from the X-ray diffraction experiment. The down-sampling to 32 × 32 × 32–sized data is done via cropping followed by re-sampling using block-wise discrete cosine transform (DCT) (i.e., DCT → cropping → inverse DCT on blocks).^{31} Since 3D-CDI-NN is mostly trained on data that are not oversampled, the amount of cropping is determined such that the subsequent scaling of the diffraction pattern to 32 × 32 × 32–sized data yields an oversampling ratio of about 1.5 in the shortest dimension. The crystal output of 3D-CDI-NN is returned to the original oversampling ratio of the experimental input image via a combination of cropping and zero padding, followed by scaling to a 64 × 64 × 64–sized image. This scaled image is used for visual comparison and as initial guess for the AD refinement step. The target for visual comparison is prepared via traditional reconstruction of the original 3D coherent diffraction pattern followed by scaling, normalization, and binning to 64 × 64 × 64–sized images. In addition, a 64 × 64 × 64–sized crop of the experimental input image is used as the AD refinement target.

Figure 6 shows the performance and computational efficiency of the methods. The 3D-CDI-NN model accurately predicts the shape and facets of the target crystal on a sub-second timescale (≈145 ms/prediction) but underestimates the surface details and its local strain, possibly due to the low 32 × 32 × 32–sized resolution. We envision future addition of a larger and more diverse dataset, such as the inclusion of diffraction pattern sampling at multiple resolutions, crystals of different lattice types, and crystals with defects/dislocations, etc., will further improve the size, shape, and local strain predictions of the 3D-CDI-NN model.

As seen in Fig. 6(c), refinement of the 3D-CDI-NN prediction using AD with oversampled data can recover the surface details and the inhomogeneous distribution of strain within the crystal. Benchmarking on the same CPU processor core shows that the combination of 3D-CDI-NN and AD is still ≈4 times faster than the traditional iterative phase retrieval method.

## III. DISCUSSION

In conclusion, we have demonstrated the use of machine learning to invert 3D coherent imaging data rapidly and accurately. Once trained, 3D-CDI-NN is hundreds of times faster than traditional iterative phase retrieval methods. While 3D-CDI-NN demonstrates excellent performance on simulated test data, there is much scope to improve its performance on experimental test data. We expect this gap in performance can be addressed in several ways, including through transfer learning and a neural architecture search. Transfer learning is a powerful means of training large neural networks in the absence of sufficient amounts of training data. The neural network is first pre-trained using a large dataset on a similar problem before being refined using the smaller dataset corresponding to the target problem. We can apply the same method to 3D-CDI-NN by pre-training on simulated data, before refining its training on experimentally phased datasets. We expect this new network to significantly perform better on fresh experimental data. Another important means of improving network accuracy which we have not explored is by optimizing the architecture of the network (which was hand-engineered and kept fixed for this study) (Fig. 3). Automated approaches to neural network design are now widely used and can generate network architectures that surpass the best designed human ones.^{32,33}

We anticipate that modern data analytical approaches to coherent diffraction inversion will be critical to CDI at the coming fourth generation synchrotron sources, such as the recently commissioned Extremely Brilliant Source (EBS) at the European Synchrotron Radiation Facility and the coming Advanced Photon Source Upgrade. At these sources, the coherent flux of the beam is expected to increase by a factor of 50 – 200 times over current sources. This vast increase in flux can be used to measure both higher resolution data (corresponding to much larger datasets), or measure at current resolutions, but at significantly higher rates, just tens of seconds per measurement.

The current phase retrieval methods will not keep up, in either larger datasets or faster data rates, due to limitations of modern GPU devices in both compute cores and memory. In fact, our 3D-CDI-NN approach has shown to produce high-fidelity images from very limited data. This image is then refined through a gradient-based minimization procedure. In the case of extremely large data, which precludes full iterative phase retrieval on current GPU devices due to limited onboard memory, we envision 3D-CDI-NN being used to initialize an AD or other iterative solutions that is based on the entire data volume. Additionally, since CDI-NN is not doing phase retrieval, the typical oversampling of reciprocal space is not required for inversion to real-space images. The neural network can be used on only a limited volume of data, perhaps very close to a Bragg peak of the lattice. One will then optimize the measurement-inversion process to have the neural network working on the subset of data before the total volume of data are even finished acquiring. AD has been shown to scale effectively to large compute resources, contrary to current phase retrieval algorithms, and can even operate on the partial dataset as it is being acquired.

## IV. METHODS

### A. Atomistic model of gold crystals

Each gold crystal is cut from a ≈20 × 20 × 20 nm^{3} FCC lattice of ≈500 k atoms, where the direction (normal vector) of each clip plane is uniformly sampled using the hypersphere point picking method. A random number of selected high-symmetry orientation clip planes (between 4 and 20) positioned at random distances from the crystal geometric center is used to obtain faceted gold crystals of diverse shapes and sizes. The crystal is minimized in LAMMPS using the embedded-atom method (EAM) interatomic potential to obtain the initial gold crystal structure. The final gold crystal structure is obtained by applying a combination of compression, tension, and shear stresses (up to 1% strain) to the initial structure followed by another minimization. The stresses are applied to the structure via deformation of the simulation box with atom coordinates remapped accordingly. Both the initial and final structure of gold crystal are scaled by the inverse lattice constant of gold (1/4.078 Å) such that the lattice constant is normalized to 1. The final structure is used to compute the crystal shape, whereas the difference between the initial and final structures is used to compute the crystal phases (see Sec. IV B). To avoid potential artifact from boundary-related problems, a ≈5 lattice unit padding (i.e., ≈20 Å before lattice normalization) is added to each side of the normalized (lattice constant = 1) simulation box. We note that while the training objects generated using this method are significantly smaller than experimentally measured particles, the approach is still valid given the scale-independent and pixel-normalized nature of CNNs.

### B. Training data

The training dataset is a combination of two datasets. The first dataset consists of 107 180 diffraction patterns generated from atomistic models of gold crystals, where 100 000 of them are used for training and 7180 are set aside for testing. The second dataset consists of the same 100 000 and 7180 gold crystals in the training and testing sets but with the material strain field removed (i.e., the testing set remains entirely independent from the training set). The second dataset serves as control that helps alleviate the tendency of 3D-CDI-NN in overpredicting strains. Target output images of the crystal shape are obtained from the number density of atoms calculated using a bin size of ≈2 lattice units (≈8 Å before normalization) and normalized by the maximum whereas target output images of the crystal phases is obtained from binning of local phases that is computed from the atom displacement field of the final and initial crystal structure projected along [111] and scaled by 2*π*. The binning process convert atomistic model into 32 × 32 × 32–sized images (i.e., length of each voxel corresponds to ≈7.5 Å before lattice constant normalization). For each crystal, the shape (magnitude) and phase images are combined to form a 3D array of complex numbers which is then used to obtain the corresponding diffraction pattern via Fourier transform. The magnitude of the 3D diffraction pattern is used as the input for the 3D-CDI-NN training.

### C. Objective function

Errors between the 3D-CDI-NN prediction and the target are quantified using the mean absolute error (MAE) between the magnitude of input diffraction pattern and that obtained from Fourier transform of the recombined predicted shape and phase images. For input and output images of size *N × N × N*, the loss function is defined as:

where *A* and *P* in Eq. (3) correspond to the shape and phase images predicted by 3D-CDI-NN (voxel intensities in the range of [0,1]), *C* is the complex exponentiation of *A* and *P*, *X _{predicted}* in Eq. (2) is the magnitude of the Fourier transform of

*C*, which is normalized by the maximum voxel intensity in the MAE calculation as shown in Eq. (1), since voxel intensities of

*X*are in the range of [0,1]). MAE is used as an accuracy measure in order to assign equal weights to all errors of all voxels within the images, regardless of their magnitudes. This is important because the images are normalized to the range of [0,1] (i.e., error of each image pixel/voxel is always ≤ 1). Square error–based measures (e.g., mean square error) would largely decrease the error values and can contribute to the vanishing gradient problem in neural network training. Furthermore, pixels/voxels with intensity approaching 0.5 would have an error value of at most 0.5 (about half of the maximum error in high/low intensity pixels/voxels), which means that square error–based measures would overly de-emphasize errors in mid-intensity regions of the images (e.g., fringes of a diffraction pattern, surfaces and edges of a crystal shape, and intermediate strain regions of a crystal, etc.).

_{target}### D. 3D-CDI-NN training

Training was performed in parallel on four NVIDIA V100 GPUs using the Keras package running the Tensorflow backend.^{34,35} We trained the networks for 50 epochs each using a batch size of 256. The training for each network took less than half an hour when trained in parallel across the four GPUs. At each step, we used adaptive moment estimation (ADAM)^{30} to update the weights while minimizing the per-pixel mean absolute error. A 10% dropout rate is applied to all dropout layers. We computed the performance of the network at the end of each training epoch using the validation set, which is 10% of the shuffled training dataset. Since the network architecture of the 3D-CDI-NN model consists of a common encoder shared by two separate decoders, we adapted a systematic approach in training the model weights. We first trained the encoder and shape decoder. We subsequently fixed their weights while performing the initial training of the phase decoder. This was followed by a further training step which involved unfixing the encoder weights and additional training of the phase decoder. The final step was the simultaneous refinement in the weights of all branches of the network. We found that this sequential training approach was necessary to stabilize a network involving multiple branches which tends be unstable (fluctuate wildly) in the beginning due to random initialization of weights and the inability of a single weighted sum objection to handle the case where improvements in one branch are canceled by other branches.

### E. Iterative phase retrieval

To perform phase retrieval, we used standard iterative phase retrieval algorithms that switched between error reduction (ER) and hybrid input–output (HIO).^{36} A total of 620 iterations were performed using a shrink-wrapped support in real space.^{17} The final 20 iterations were averaged over to obtain the final result. The only difference in the phased data was that oversampling was required so the DCT down-sampling was not performed.

### F. Automatic differentiation DL structure refinement

Before the structure refinement, we first applied a thresholding procedure to identify the support in the DL prediction, then dilated this support by two pixels to generate an initial, slightly loose support for the refinement step. For the actual refinement, we used the physics-based objective function:

where *O* denotes the complex-valued structure, *F* denotes the 3D DFT operator, and *y* denotes the experimentally measured diffraction magnitudes. We used the AD procedure to calculate the gradients of *f* with respect to the real and imaginary parts of *O*, then used these gradients within the Adam optimization algorithm to minimize *f.*^{27,30} We initialized the Adam algorithm with a step size of 1.0, then performed 500 steps of iterative minimization, with the shrink-wrapped support updated every 50 steps, to generate the final refined structure.

### G. Data Availability

All the data, codes, Jupyter notebooks, and trained models developed in this study will be made available in a public GitHub repository at https://github.com/hyiprc/3D-CDI-NN.

## ACKNOWLEDGMENTS

We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory. This work was performed at the Center for Nanoscale Materials and the Advanced Photon Source. The use of the Center for Nanoscale Materials and Advanced Photon Source, both Office of Science user facilities, was supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, under Contract No. DE-AC02-06CH11357. This work was also supported by Argonne LDRD 2018-019-N0: A.I C.D.I.: Atomistically Informed Coherent Diffraction Imaging and by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences Data, Artificial Intelligence and Machine Learning at DOE Scientific User Facilities program under Award Number 34532. Development of the automatic differentiation refinement was supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Science and Engineering Division. An award of computer time was provided by the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program. Youssef Nashed was partially supported by the US Department of Energy, Office of Science, Office of Basic Energy Sciences under Contract No. DE-AC02-76SF00515.