In this work, we explore the application of deep neural networks to the optimization of atomic layer deposition (ALD) processes. In particular, we focus on a one-shot optimization problem, where we try to predict the optimal dose time that leads to saturation everywhere in the reactor based on thickness values measured at different points of an ALD reactor after a single trial growth. In order to tackle this problem, we introduce a dataset designed to train neural networks to predict saturation times based on these inputs for a cross-flow ALD reactor. We then explore the predictive ability of artificial neural networks of different depths and sizes using a separate testing dataset to evaluate their accuracies. The results obtained show that networks trained using stochastic gradient descent methods can accurately predict saturation times without requiring any additional information on the surface kinetics. This provides a viable approach to minimize the number of experiments required to optimize new ALD processes in a known reactor, and it highlights the way machine learning can be leveraged for thin film growth and manufacturing. While the datasets and training procedure depend on the reactor geometry, the trained neural networks provide a general surrogate model connecting thickness values and trial dose times with optimal saturation times that can be reused for different ALD processes within the same reactor.

## I. INTRODUCTION

Machine learning and, in particular, artificial neural networks have revolutionized how we think about and work with data. While artificial neural networks are not new and have long been explored as a way of modeling complex, nonlinear relationships between different variables, the development of powerful tools capable of implementing and optimizing large networks, combined with increasingly powerful computing capabilities, has led to a veritable explosion in the range of applications and types of architectures.

At their core, feedforward artificial neural networks are universal function approximators, capable of modeling connections between inputs and outputs of arbitrary complexity.^{1,2} This makes them a useful tool to develop surrogate models without having to carry out complex calculations. One of the challenges, at least in the context of physical sciences, is that they require large amounts of data for training. For instance, the MNIST dataset, one of the most commonly used entry-level machine learning datasets, is composed of 60 000 pictures of handwritten digits for training, plus an additional 10 000 samples for testing.^{3}

In this work, we explore the application of deep neural networks in the context of the optimization of atomic layer deposition (ALD) processes. In particular, we focus on a very practical question: given thickness measurements from a set of samples distributed inside a reactor, can we predict the dose time that would lead to saturation everywhere inside the reactor? The experimental measurement of thickness profiles using a set of substrates or witness coupons is a common approach used in research labs and in industry as part of the qualification of a new ALD process or reactor. By training artificial neural networks to develop a surrogate model capable of predicting saturation times from growth profiles, we can minimize the number of experiments required.

Specifically, we introduce datasets designed to train neural networks to predict saturation times from the dose time and a growth (thickness) profile, that is, a set of thickness values measured at different points of the reactor for a single experimental condition. We then explore the ability of different artificial neural network configurations to learn to predict optimal dose times when trained using stochastic gradient descent methods, focusing on the impact of depth (number of hidden layers) and size (number of neurons in each layer). Finally, we evaluate the minimum number of experimental data points and the dataset size required to achieve high prediction accuracies. The dataset and the code implementing the networks and the training and evaluation processes are available online at https://github.com/aldsim/saturationdataset.

## II. METHODOLOGY

The fundamental question that we are exploring in this work is whether the knowledge of the film thickness at different points of the reactor obtained during a single growth (for instance, through the use of multiple witness coupons) and the dose time used during growth are enough to help us predict the optimal saturation time for that specific process. Dose times are key variables in an ALD process: dose times that are too short result in unsaturated growth conditions, leading to experimental values of the growth per cycle that are lower than the saturation value as well as thickness gradients across the reactor if the precursor reactivity is high enough. On the other hand, due to ALD’s self-limited nature, dose times that are too long do not contribute to further growth, resulting in both lower precursor utilization and a lower throughput.

Expressed in mathematical terms, if $x$ is the collection of $N$ thickness values at different positions inside a reactor, $t_d$ is the dose time, and $t_{sat}$ is the saturation time, we would like to find a surrogate model $f$ that takes a single $(x, t_d)$ pair as input and returns the saturation time $t_{sat}$ as output,

$$t_{sat} = f(x, t_d).$$

This is essentially a one-shot optimization problem, where we are trying to find the optimum condition from just a single experiment.

### A. ALD dataset

In order to explore the application of neural networks to this optimization problem, we have created a new ALD dataset. As mentioned in Sec. I, one of the challenges of neural networks is that they usually require large datasets for training. This is obviously a challenge from an experimental standpoint, particularly if the data are reactor specific, as in this particular case. In a prior work, we demonstrated that computational fluid dynamics models provided excellent quantitative agreement with experimental growth profiles in our own ALD reactors.^{4} In this work, we have used 1D models to generate our dataset: each dataset sample comprises a list of thickness values at predetermined locations inside the simulated reactor ($x$), the dose time used in the simulation to generate that profile ($t_d$), and the computed saturation time for that process ($t_{sat}$), defined as the time required to reach at least 99% of the saturated growth per cycle.

To create such a dataset, we have used irreversible first-order Langmuir kinetics to model the self-limited surface kinetics of an ALD process. The dataset was computed for a cylindrical horizontal viscous flow reactor configuration analogous to the custom-built reactors in our laboratory (Fig. 1).^{5} For each sample of the dataset, we randomly selected the key input parameters used in the simulation, resulting in a completely different growth condition and a different tuple of growth profile, dose time, and saturation time $(x, t_d, t_{sat})$. It is important to emphasize that the goal is not to model a specific ALD process, but to sample a diverse enough set of conditions to cover any possible ALD process. An example of two such growth profiles is shown in Fig. 1(b) for the case of $N=20$ data points per growth condition.
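To make the 99% criterion concrete, the following sketch considers the simplest limiting case: a surface exposed to a constant precursor flux with irreversible first-order Langmuir kinetics, for which the fractional coverage follows $1 - \exp(-t/\tau)$. This ignores precursor transport and depletion along the reactor, which the 1D model used to build the dataset does account for, so it is only an illustration. The function names and the characteristic time `tau` are hypothetical.

```python
import math


def coverage(t, tau):
    """Fractional coverage for irreversible first-order Langmuir kinetics
    under a constant precursor flux (no transport or depletion effects)."""
    return 1.0 - math.exp(-t / tau)


def saturation_time(tau, threshold=0.99):
    """Time needed to reach `threshold` of the saturated growth per cycle."""
    return -tau * math.log(1.0 - threshold)


tau = 0.05  # hypothetical characteristic reaction time, in seconds
t_sat = saturation_time(tau)
print(f"t_sat = {t_sat:.3f} s, coverage at t_sat = {coverage(t_sat, tau):.3f}")
```

In this idealized limit, the saturation time is simply $\tau \ln 100 \approx 4.6\,\tau$; in a real reactor, precursor depletion makes downstream positions saturate later, which is why the full profile carries information about $t_{sat}$.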

The main parameters used to generate this dataset include precursor pressure, sticking probability, and the trial dose time for that specific condition, expressed as a percentage of the saturation time. Other secondary parameters include molecular mass, process temperature, and growth per cycle. The ranges of the primary parameters used in generating the dataset presented in this work are summarized in Table I.

| Parameter | Range |
|---|---|
| Precursor pressure | $5 \times 10^{-4}$–0.5 Torr |
| Sticking probability | $10^{-5}$–$10^{-1}$ |
| Dose time | 0.2–0.9 $t_{sat}$ |
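A minimal sketch of how one set of primary parameters could be drawn from the ranges in Table I. The sampling distributions are an assumption: the text only gives the ranges, so log-uniform sampling is used here for pressure and sticking probability because they span several orders of magnitude, and uniform sampling for the dose-time fraction. The function and key names are hypothetical.

```python
import math
import random

# Ranges taken from Table I.
PRESSURE_TORR = (5e-4, 0.5)
STICKING_PROB = (1e-5, 1e-1)
DOSE_FRACTION = (0.2, 0.9)  # trial dose time as a fraction of t_sat


def log_uniform(rng, low, high):
    """Draw a value uniformly in log space between low and high."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_condition(rng):
    """Draw one random growth condition (distributions are assumed)."""
    return {
        "pressure_torr": log_uniform(rng, *PRESSURE_TORR),
        "sticking_prob": log_uniform(rng, *STICKING_PROB),
        "dose_fraction": rng.uniform(*DOSE_FRACTION),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
conditions = [sample_condition(rng) for _ in range(5)]
for condition in conditions:
    print(condition)
```

Each sampled condition would then be passed to the reactor model to produce one $(x, t_d, t_{sat})$ tuple.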


Through this procedure, we have constructed a series of datasets comprising 100 000 independent samples for training, plus 10 000 independent samples for testing. Having separate training and testing datasets is a standard approach in machine learning, and it helps ensure that the surrogate models are able to generalize well beyond the data used during training.

We have created independent datasets for the following numbers of points $N$ in the reactor (the separation between consecutive positions is in parentheses): 20 (2 cm), 16 (2.5 cm), 10 (4 cm), 8 (5 cm), 5 (8 cm), and 4 (10 cm). We further extracted smaller datasets from the training dataset, ranging from 1000 to the original 100 000 samples, each comprising 20 independent points.

### B. Model

We have used the datasets described in Sec. II A to explore the application of deep neural networks to learn the functional relationship,

$$t_{sat} = f(x, t_d),$$

between growth profiles and dose time (the experimental observables) and the saturation dose time (our optimization target).

While there are many machine learning approaches that can be used, from Gaussian processes to random forests, here, we have focused on deep neural networks.^{1,2} Neural networks are in essence a chain of layers, where each layer can be viewed as a function that carries out a specific transformation of its input and returns an output. In their simplest form, these layers comprise a linear transformation of an input vector followed by a nonlinear transformation, such as a sigmoid, hyperbolic tangent, or a rectified linear function. The coefficients and independent terms of the linear transformation are usually referred to as weights and biases by analogy with biological neurons. A network is trained using a stochastic gradient descent method or a similar algorithm using a loss function that quantifies the difference between the output of the network and the expected output based on the training dataset. Training is usually expressed in terms of the number of samples (instances of the training dataset) or epochs (instances used during training normalized to the length of the training dataset). In order to improve the stability of the gradient descent method, training takes place using batches of samples to compute the gradients used to update the weights and biases of the network.^{1,2}

In this work, we have explored three different neural networks, shown in Fig. 2: a shallow network and two different deep networks with one and two hidden layers. All networks use the vector of thickness values of size $N$ and the dose time as input values, providing the predicted saturation time as output. In order to achieve high accuracies over a range of times spanning more than two orders of magnitude, we used the logarithm of the dose and saturation times in seconds as inputs and predicted targets.
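The input and target encoding described above can be sketched as follows. The choice of the natural logarithm and the helper names are assumptions; the text only states that the logarithm of the times in seconds is used.

```python
import math


def encode_sample(thickness, t_dose, t_sat):
    """Build the network input (N thickness values + log dose time)
    and the training target (log saturation time). Times are in seconds."""
    features = list(thickness) + [math.log(t_dose)]
    target = math.log(t_sat)
    return features, target


def decode_prediction(log_t_sat):
    """Convert a network output back to a saturation time in seconds."""
    return math.exp(log_t_sat)


# Hypothetical profile with N = 4 normalized thickness values.
features, target = encode_sample([1.0, 0.9, 0.6, 0.2], t_dose=0.05, t_sat=0.2)
print(len(features), decode_prediction(target))
```

Working in log space means that a small absolute error on the network output translates into a relative error of about the same size on the time itself, so short and long saturation times are weighted comparably by the same loss.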

In all cases, layers are connected with all-to-all linear functions so that the output $y_i$ of layer $i$ is given by

$$y_i = \mathrm{ReLU}(W_i y_{i-1} + b_i),$$

where $y_{i-1}$ is the input to the layer, $W_i$ and $b_i$ are its weights and biases, and $\mathrm{ReLU}(\cdot)$ represents a rectified linear function [Fig. 2(d)]. One of the motivations to use all-to-all instead of convolutional layers is that they encompass the case where samples inside the reactor are not equidistant from each other, but may be located downstream or upstream of a substrate of interest or in arbitrary positions. For the deep networks, the sizes of the hidden layers, $M$ for the case of a single hidden layer and $M_1$ and $M_2$ for the two-hidden-layer network, are additional parameters that can be explored.
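As an illustration, the one-hidden-layer network could be written in PyTorch as below. This is a sketch under assumptions rather than the code in the repository: the class name is hypothetical, the input is taken to be the $N$ thickness values plus the dose time, and the final layer is left linear here so that negative log-times can be represented.

```python
import torch
from torch import nn


class OneHiddenLayerNet(nn.Module):
    """Fully connected network: (N + 1) inputs -> M hidden neurons -> 1 output."""

    def __init__(self, n_points, hidden_size):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_points + 1, hidden_size),  # all-to-all linear layer
            nn.ReLU(),                             # rectified linear function
            nn.Linear(hidden_size, 1),             # assumed linear output layer
        )

    def forward(self, x):
        return self.layers(x)


model = OneHiddenLayerNet(n_points=20, hidden_size=30)
batch = torch.randn(64, 21)  # random stand-in for 64 (profile, dose time) inputs
output = model(batch)
print(output.shape)
```

A shallow network corresponds to dropping the hidden layer, and the two-hidden-layer variant adds a second `nn.Linear` and `nn.ReLU` pair of sizes $M_1$ and $M_2$.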

### C. Implementation, training, and testing

We implemented the neural networks and carried out the training and testing in PyTorch, a free, open-source framework for deep learning.^{6} Each network was trained against the training dataset using a stochastic gradient descent method with a mean square error (MSE) loss function, the Adam optimizer,^{7} and a learning rate of $10^{-3}$. Each network was trained for 100 epochs using batches of 64 samples. The resulting networks were then tested against the testing dataset. The implementation and training script can be found online at https://github.com/aldsim/saturationdataset.
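The training settings listed above (MSE loss, Adam, learning rate of $10^{-3}$, batches of 64) map onto a short PyTorch loop. The sketch below is not the script from the repository: it uses random stand-in tensors instead of the ALD dataset, a hypothetical `train` helper, and only a few epochs so that it runs quickly.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

N_POINTS, HIDDEN = 20, 30
# Random stand-in data; the real inputs and targets come from the ALD dataset.
inputs = torch.randn(1024, N_POINTS + 1)
targets = inputs.mean(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(inputs, targets), batch_size=64, shuffle=True)

model = nn.Sequential(
    nn.Linear(N_POINTS + 1, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 1)
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)


def train(model, loader, loss_fn, optimizer, epochs):
    """Run mini-batch training and return the mean loss of each epoch."""
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            total += loss.item() * len(x)  # weight by batch size
        history.append(total / len(loader.dataset))
    return history


history = train(model, loader, loss_fn, optimizer, epochs=5)  # the text uses 100
print(history)
```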

While the MSE function was directly used as a loss function during training, for analysis and visualization, we used the relative difference of the predicted saturation time, defined as

$$\epsilon = \frac{t_{pred} - t_{sat}}{t_{sat}},$$

where $t_{pred}$ is the saturation time predicted by the network and $t_{sat}$ is the true value from the dataset.

For a highly performing, unbiased network, we expect this error to have an average close to zero. The standard deviation of $\epsilon$, $\sigma_\epsilon$, therefore, provides a good estimator of the prediction error.
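The error statistics can be computed in a few lines. The sketch below assumes that the relative difference is taken as (predicted minus true) divided by true, and it uses made-up numbers purely to show the calculation; none of the values come from the paper.

```python
import statistics


def relative_errors(predicted, true):
    """Relative difference between predicted and true saturation times."""
    return [(p - t) / t for p, t in zip(predicted, true)]


true_t_sat = [0.10, 0.50, 1.00, 2.00]      # hypothetical values, in seconds
pred_t_sat = [0.101, 0.495, 1.010, 1.980]  # hypothetical predictions

eps = relative_errors(pred_t_sat, true_t_sat)
mean_eps = statistics.mean(eps)      # close to zero for an unbiased model
sigma_eps = statistics.pstdev(eps)   # spread of the relative error
print(f"mean = {mean_eps:.4f}, sigma = {sigma_eps:.4f}")
```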

## III. RESULTS

In Fig. 3, we show a sample of the prediction errors for the three different types of networks explored in this work: a shallow network, a network with one hidden layer, and a network with two hidden layers. The networks are trained to predict saturation times from the training dataset comprising 20 thickness values and then their accuracy calculated using the testing dataset. Data points in Fig. 3 are colored based on how close the dose time used as input was to the actual saturation dose, with darker points being closer to the saturation conditions.

It is apparent that the shallow network is not capable of accurately predicting the saturation times, with the predicted times deviating by about $\pm 20\%$ from the true saturation value. In contrast, the deep network with one hidden layer [Fig. 3(b)] shows a much smaller dispersion and an excellent agreement with the true saturation times. It is interesting to note that the error seems to increase for the deep network with two hidden layers [Fig. 3(c)]. We attribute this to an overfitting of the training dataset due to the larger number of free parameters available in the network with two hidden layers. This emphasizes the importance of having separate training and testing datasets.

As mentioned in Sec. II C, we can use the mean and standard deviation of the relative error $\epsilon$ to quantify the network’s accuracy. In the case of the profiles shown in Fig. 3, the standard deviation values are 0.10, 0.007, and 0.015 for the shallow, one-hidden-layer, and two-hidden-layer networks, respectively. This shows that the deep network with one hidden layer can predict saturation times within a $3\sigma$ error margin of 3% ($3 \times 0.007 \approx 0.02$), while the two-hidden-layer network stays within about 5%.

A fundamental question when qualifying a growth process is how many different data points within a reactor or a single wafer need to be measured to accurately capture any inhomogeneities or gradients in film thickness. In order to understand the impact that this choice has on the ability to accurately predict saturation times, we trained our networks against a collection of datasets comprising different numbers of points $N$ inside the reactor. In Fig. 4, we show the mean error and the standard deviation $\sigma_\epsilon$ of the predicted saturation time for the three networks shown in Fig. 3 when trained on datasets comprising a different number of points.

The results show that the networks with one and two hidden layers can accurately predict the saturation time with as few as $N=8$ thickness values for the reactor configuration shown in Fig. 1(a). Using fewer values still produces results that are much more accurate than those obtained using a shallow neural network, but the standard deviation in the predicted saturation times starts to significantly increase. In Fig. 5, we show the dispersion in the predicted saturation values for the datasets comprising growth profiles with $N=4$, 5, and 10 data points.

We have also explored the impact of network size on its performance. In Fig. 6, we show the prediction performance of a neural network with one hidden layer as a function of the number of independent points in the growth profile for different hidden layer sizes. The results show that at least 20 neurons are needed in the hidden layer to maximize the network’s ability to predict saturation times. This corresponds to $20 \times N + 61$ free parameters, where $N$ is the number of independent points in each thickness profile.
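The number of free parameters can be checked by summing weights and biases layer by layer. The sketch below assumes $N + 1$ inputs (the thickness values plus the dose time), one bias per neuron, and a single output; with this counting, a hidden layer of 30 neurons and $N = 20$ gives the 691 parameters quoted in Sec. IV. The function name is hypothetical.

```python
def count_parameters(n_points, hidden_sizes):
    """Weights plus biases of a fully connected network with
    (n_points + 1) inputs, the given hidden layers, and one output."""
    sizes = [n_points + 1] + list(hidden_sizes) + [1]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


print(count_parameters(20, [20]))  # 461: one hidden layer, M = 20
print(count_parameters(20, [30]))  # 691: one hidden layer, M = 30
print(count_parameters(20, []))    # 22: shallow network
```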

Finally, we have focused on the impact that the size of the dataset has on the training accuracy. In order to achieve high accuracies in tasks, such as image classification, deep neural networks rely on large datasets. This is a challenge for applications where data are expensive to acquire. It is, therefore, important to explore how large the datasets need to be to produce accurate surrogate models.

In Fig. 7, we show the impact of dataset size on the accuracy for a network with one hidden layer with 30 neurons. We extracted subsets of the original dataset and trained the network on each of them. The testing dataset remained identical in all cases. All networks were trained using the same number of samples, which means that smaller datasets were trained for a higher number of epochs. The results show a weak dependence of the prediction error on dataset size whenever the number of independent samples exceeds 2000.
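Keeping the total number of training samples fixed implies that the number of epochs scales inversely with the dataset size. The sketch below assumes that the fixed budget equals 100 epochs over the full 100 000-sample training set, as in Sec. II C; the text does not state the budget used for this comparison, so the absolute numbers are only indicative.

```python
# Assumed budget: 100 epochs on the full training set of 100 000 samples.
TOTAL_SAMPLES_SEEN = 100 * 100_000


def epochs_for(dataset_size, total=TOTAL_SAMPLES_SEEN):
    """Number of epochs needed to present `total` samples to the network."""
    return total // dataset_size


for size in (1_000, 2_000, 10_000, 100_000):
    print(size, epochs_for(size))
```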

## IV. CONCLUSIONS

In this work, we have explored the potential of machine learning to accelerate the optimization of ALD processes through the use of surrogate models that rely on easily obtainable data. In particular, we have shown how deep neural networks can be used to predict the behavior of reactive transport systems without any prior knowledge of the surface kinetics, something that could help accelerate the optimization of manufacturing processes based on thin film deposition and surface modification techniques. Compared to the networks used in traditional machine learning domains, such as image classification, the number of free parameters required to achieve a good agreement is significantly smaller, 691 for the case of a network with a single hidden layer with $M=30$ neurons.

The trained neural networks can be interpreted as surrogate models capturing the underlying physics of the reactive transport of precursors inside an ALD reactor. While the long-time behavior of the solutions of the differential equations modeling precursor transport cannot be expressed in closed form, the training process is able to capture this functional relation from the pre-existing dataset, sidestepping the need to solve the transport models in real time.

Our results also show that the dataset does not need to be exceedingly large and that a collection of the order of 2000 independent samples leads to highly accurate predictions. This means that for the case of a full 3D simulation model similar to the ones we have used in prior works,^{4} it would take fewer than 100 core-days to generate a dataset of that size. While this may seem like a long time, it represents less than a week on a machine with 32 cores. 1D models, such as those used in this work, are two orders of magnitude faster. Moreover, the advantage of developing such surrogate models is that they can be reused in the optimization of as many ALD processes as needed. Also, while in this work we have focused on deep neural networks, there are multiple other machine learning approaches that could be used that may be less data intensive or more capable of generalizing from fewer examples.
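The compute estimate can be made explicit with a short calculation. It assumes that the samples are independent and can be simulated in parallel with ideal scaling across cores; the 100 core-day figure is the upper bound quoted above, and the function name is hypothetical.

```python
def wall_clock_days(core_days, n_cores):
    """Wall-clock time assuming ideal parallel scaling across cores."""
    return core_days / n_cores


budget_core_days = 100  # upper bound quoted in the text
n_samples = 2000        # dataset size of the order discussed in the text

days = wall_clock_days(budget_core_days, 32)
core_hours_per_sample = budget_core_days * 24 / n_samples
print(f"{days:.2f} days on 32 cores, {core_hours_per_sample:.1f} core-hours per sample")
```

Under these assumptions, the budget corresponds to roughly three days of wall-clock time and about 1.2 core-hours per simulated sample.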

The results also have some important implications from an experimental perspective: first, in order to take full advantage of the methodology explored in this work, it is critical to keep records of the exact positions at which different thickness measurements are taken when characterizing or qualifying a process. Consistency is also paramount if we want to make the most of the optimization process since the networks have to be trained to reproduce the thickness values at predefined points in the reactor. Second, while in this work we relied on simulations of the reactive transport of ALD precursors to generate the datasets, the same approach can be used with datasets based on experimental data. For this case, there are approaches for data augmentation, which could help further reduce the number of experiments required. This is particularly relevant for conditions where simulations can only provide approximate solutions.

Finally, it is important to mention that the results of the training process are reactor specific. Consequently, each experimental reactor will require its own specific dataset. Moreover, based on the results presented in this work, the accuracy seems to be limited primarily by how well the kinetics of the process we are trying to optimize is represented in the training dataset. While here, we have focused on purely self-limited processes, it is easy to envision more ambitious datasets incorporating nonidealities, such as a partial self-limiting behavior. Departures with respect to the expected behavior could also help users identify processes that do not behave according to the surface kinetics models considered in the training dataset.

## ACKNOWLEDGMENT

This research is based on the work supported by the Laboratory Directed Research and Development (LDRD) funding from the Argonne National Laboratory, provided by the Director, Office of Science, of the U.S. DOE under Contract No. DE-AC02-06CH11357.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Angel Yanguas-Gil:** Conceptualization (lead); Data curation (lead); Formal analysis (lead); Funding acquisition (supporting); Methodology (lead); Software (lead); Writing – original draft (lead); Writing – review & editing (lead). **Jeffrey W. Elam:** Conceptualization (supporting); Funding acquisition (lead); Methodology (supporting); Project administration (supporting); Writing – original draft (supporting); Writing – review & editing (supporting).

## DATA AVAILABILITY

The data that support the findings of this study are openly available in GitHub at https://github.com/aldsim/saturationdataset.

## REFERENCES

*et al.*, “PyTorch: An imperative style, high-performance deep learning library,” in *Advances in Neural Information Processing Systems 32*, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Curran Associates, 2019), pp. 8024–8035.