Here, we report a case study on the inverse design of quantum dot optical spectra, using a deep reinforcement learning algorithm to reach a desired target optical property of semiconductor Cd_{x}Se_{y}Te_{x−y} quantum dots. Machine learning models were trained to predict the optical absorption and emission spectra using a training dataset generated by time-dependent density functional theory simulation. We show that the trained deep deterministic policy gradient inverse design agent can infer the molecular structure with an accuracy of less than 1 Å at a fixed computational time of milliseconds, up to 100–1000 times faster than the conventional heuristic particle swarm optimization method. Most effective inverse design approaches based on surrogate machine learning and reinforcement learning models have focused on the field of nano-photonics; few attempts have been made in a similar manner for quantum optical systems. To our knowledge, our results provide the first concrete evidence that, for computationally challenging tasks, a well-trained deep reinforcement learning agent can replace existing quantum simulation and heuristic optimization tools, enabling fast and scalable simulations of the optical properties of nanometer-sized semiconductor quantum dots.

Semiconductor colloidal quantum dots (QDs) have attracted significant experimental and theoretical research attention for decades owing to their extraordinary low-dimensional optical and electronic properties.^{1–6} Compared with the large body of experimental results on colloidal quantum dots, rigorous theoretical simulation of their low-dimensional optical properties is usually restricted by the heavy computational cost of the large simulation supercell.^{7–9} Moreover, even when the optical property of a given structure can be simulated, the result is only partially useful to materials engineers, because in practice the more frequent request is to find appropriate material candidates for a targeted optical property. Conventionally, the latter task is carried out by human-guided design based purely on physical insight or on intuition gained from a large number of trials and errors.^{10} The tasks of predicting the property from a structure and of designing a material with a suitable structure for a targeted property correspond exactly to the “forward design” and “inverse design” proposed in the emerging field of materials informatics (MI).^{11,12} For prediction-related forward design in MI, the basic idea is to train a prediction model on a finite subset of known solutions obtained by solving the governing mathematical equations.^{13,14} On the other hand, machine learning or reinforcement learning based inverse design in MI attempts to invert the problem and come up with a design that produces the desired properties. This generally amounts to a nonlinear one-to-many problem, which is even more difficult to solve, although various approaches have been proposed.^{15–17} So far, most effective inverse design approaches based on surrogate machine learning and reinforcement learning models have focused on the field of nano-photonics.^{15,17–27} Few attempts have been made in the field of quantum physical systems, such as quantum simulation, quantum computing, or quantum optics systems.^{28} In this work, we propose an inverse design approach for quantum optical simulation by introducing a joint deep reinforcement learning optimization scheme coupled with surrogate machine learning models for time-dependent density functional theory (TD-DFT) simulation. As a case study, we take the optimization of the colloidal semiconductor Cd_{x}Se_{y}Te_{x−y} quantum dot structure for a targeted optical absorption/emission spectrum, applying the deep deterministic policy gradient (DDPG) algorithm to learn continuous control of the atomic positions.

We first describe the surrogate machine learning scheme for TD-DFT to predict the optical properties of QDs. TD-DFT as implemented in the Gaussian 16 package was used to perform the structure optimization and the optical absorption/emission spectra simulation.^{29} The ground-state structures of Cd_{x}Se_{y}Te_{x−y} were created randomly by varying the number of Cd, Se, and Te atoms. The total number of atoms ranges from 8 to 30 to cover various configurations of quantum dots. The exchange-correlation functional CAM-B3LYP was chosen for the QD geometry optimization because of its performance in the benchmark study.^{30} The same CAM-B3LYP functional was also used in the TD-DFT optical simulation to better include the long-range Coulomb interaction and thus predict the central peak energy more accurately in both the absorption and emission spectra. The core electrons of Cd, Se, and Te atoms were treated within the LanL2DZ pseudopotential framework, and with the LanL2DZ basis set, the valence electron wavefunction was constructed from the 4*d*^{10}5*s*^{2} electrons of each Cd atom, the 4*s*^{2}4*p*^{4} electrons of each Se atom, and the 5*s*^{2}5*p*^{4} electrons of each Te atom. Kasha’s rule and the Franck–Condon principle were applied in the calculation of the absorption and fluorescence emission spectra.^{7,31–33} We note that, for simplicity and because of limited computational resources, all simulations were conducted in the gas phase; solvent and ligand effects are left for future work.

Figure 1 shows the calculated optical absorption and emission spectra of a typical Cd_{15}Se_{12}Te_{3} molecular QD using TD-DFT. Figure 1(a) presents the simulation details regarding optical absorption and emission. As mentioned before, we focus the calculation only on singlet excitation and recombination, and up to six excited states were taken into consideration. As shown in the oscillator strength results, the strongest optical excitation corresponds to the energy transition $e_{2} \rightarrow e_{0}'$, i.e., from the third ground state to the lowest unoccupied molecular orbital (LUMO) of the excited states. For the optical emission process, all the excited states are geometrically reoptimized to account for the non-radiative recombination during fluorescent emission. From the calculated oscillator strengths, it can be seen that the transition occurs between the relaxed excited zeroth state $e_{0r}'$ and the ground zeroth state $e_{0r}$. Figures 1(b) and 1(c) show the simulated absorption and emission spectra, respectively. The first excited-state structures were optimized with TD-DFT calculations; thus, there is a structural difference between the excited state $e_{0}'$ and the relaxed excited state $e_{0r}'$. The insets in Figs. 1(b) and 1(c) show the optimized geometric structures. A clear difference between them can be seen, which accounts for the non-radiative recombination and is an important factor for the proper evaluation of fluorescence quantum yields of quantum clusters.^{7} By repeating the calculation process shown in Fig. 1, a final set of 240 absorption spectra and 208 emission spectra was prepared for training the machine learning prediction model, taking the limits of the computational resources into account. The whole dataset is not presented here owing to space limitations; instead, we have uploaded all the datasets to the “training_data” folder in GitHub for easy access.^{34}

After the training data were prepared, the structural information was further processed by the so-called random Coulomb matrix technique.^{13,14} In place of solving the governing Schrödinger equation, the ML surrogate model plays the role of the mapping $\{Z_{i}, R_{i}\} \xrightarrow{\mathrm{ML}} E$, which maps the location *R*_{i} and charge information *Z*_{i} of each atom to its eigenenergy. The “Coulomb matrix” developed by Rupp *et al.* is defined as follows:

$$C_{ij} = \begin{cases} a Z_{i}^{b}, & i = j, \\ \dfrac{Z_{i} Z_{j}}{|R_{i} - R_{j}|}, & i \neq j. \end{cases}$$

*C*_{ij} has dimension *M* × *M* for *M* atoms.^{13} Here, *a* and *b* are dimensionless hyperparameters, empirically set to 0.5 and 2.4, respectively, in this work. An image representing the Coulomb matrix for a given quantum dot molecular structure is shown in Fig. 2(a) as an example. Since the Coulomb matrix depends only on the inverse interatomic distances, the matrix representation is unique and invariant with respect to translation and rotation of the QD molecule. These intrinsic properties are expected to filter out redundant samples that share common geometric features and thus play a role similar to that of feature extraction and pretraining techniques, which are used to attain better generalization ability while avoiding overfitting.^{13,14}
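To make the descriptor concrete, the following is a minimal sketch of a Coulomb matrix calculation. It assumes the standard Rupp *et al.* form (diagonal *a* *Z*_{i}^{b}, off-diagonal *Z*_{i}*Z*_{j}/|*R*_{i} − *R*_{j}|) with *a* = 0.5 and *b* = 2.4. The function names, the zero-padding to 30 × 30, and the toy three-atom geometry are illustrative assumptions and are not taken from the code used in this work; the random sorting step of the random Coulomb matrix technique is omitted.

```python
import math

def coulomb_matrix(charges, positions, a=0.5, b=2.4, size=30):
    """Return a zero-padded size x size Coulomb matrix as nested lists.

    Diagonal:     a * Z_i ** b
    Off-diagonal: Z_i * Z_j / |R_i - R_j|
    """
    m = len(charges)
    if m > size:
        raise ValueError("more atoms than the padded matrix can hold")
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(m):
        for j in range(m):
            if i == j:
                matrix[i][j] = a * charges[i] ** b
            else:
                distance = math.dist(positions[i], positions[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix

def flatten(matrix):
    """Flatten the matrix row by row into a single feature vector."""
    return [value for row in matrix for value in row]

# Toy three-atom cluster; coordinates are illustrative, not an optimized geometry.
charges = [48, 34, 52]  # nuclear charges of Cd, Se, Te
positions = [(0.0, 0.0, 0.0), (2.6, 0.0, 0.0), (0.0, 2.8, 0.0)]
features = flatten(coulomb_matrix(charges, positions))

# Rigidly translating the cluster leaves all interatomic distances unchanged.
shifted = [(x + 1.0, y - 2.0, z + 3.0) for x, y, z in positions]
shifted_features = flatten(coulomb_matrix(charges, shifted))
```

Because only nuclear charges and interatomic distances enter the matrix, `features` and `shifted_features` agree to floating-point precision, which is the translation invariance noted above.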

Regarding the ML model in this work, three typical machine learning models (ML-NN, ML-AE, and ML-RF), representing both parametric and non-parametric models, were selected as candidates for the final TD-DFT learning scheme (Fig. 2).^{35} The three models were trained to predict the optical absorption and emission spectra using the training data generated by the TD-DFT simulation. All these ML models were constructed using the standard Scikit-learn package.^{36} For the ML-NN model, the number of input layer neurons was fixed at 900 to accommodate the largest Coulomb matrix, with *M* = 30 (30 × 30 = 900 elements). The number of output layer neurons was set to 500, corresponding to the discretized absorption spectra (0–800 nm, resolution 1.6 nm) and the discretized emission spectra (300–3000 nm, resolution 5.4 nm). In addition to the conventional NN model, we also applied the “dropout” technique (ML-dropout), which avoids overfitting by randomly dropping neurons according to a predetermined proportion. The ML-AE model keeps a neural network framework similar to that of ML-NN, except that it was pretrained by an autoencoder to extract features of the input Coulomb matrix and thereby enhance learning efficiency. For all the ML models, the standard cross-validation library in Scikit-learn was applied directly with five iterations. The training data were split in an 80:20 ratio, i.e., 80% for training and 20% for testing.
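The input and output dimensions quoted above follow from simple arithmetic, which the sketch below reproduces. It is only an illustration under stated assumptions: the bins are taken as equal-width with left edges, and the 80:20 split is done by a seeded random shuffle. The helper names `wavelength_grid` and `split_indices` are hypothetical, and the actual Scikit-learn calls used in this work are not reproduced here.

```python
import random

N_BINS = 500      # output neurons (discretized spectrum points)
MAX_ATOMS = 30    # largest quantum dot in the dataset

def wavelength_grid(start_nm, stop_nm, n_bins=N_BINS):
    """Left edges of n_bins equal-width bins covering [start_nm, stop_nm)."""
    step = (stop_nm - start_nm) / n_bins
    return [start_nm + k * step for k in range(n_bins)], step

def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

n_inputs = MAX_ATOMS * MAX_ATOMS  # flattened 30 x 30 Coulomb matrix
absorption_grid, absorption_step = wavelength_grid(0.0, 800.0)
emission_grid, emission_step = wavelength_grid(300.0, 3000.0)

# Dataset sizes from the text: 240 absorption and 208 emission spectra.
train_abs, test_abs = split_indices(240)
train_em, test_em = split_indices(208)
```

With these assumptions, the absorption set splits into 192 training and 48 test spectra, and the emission set into 166 and 42.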

Figures 3(a) and 3(b) show the training and prediction results for the optical absorption and emission spectra based on the machine learning models described above. For the evaluation of all four learning models, we adopted the criteria used in Ref. 15, where both the root mean squared error (RMSE) and the mean absolute error (MAE) were utilized. The MAE reflects the average error magnitude over all the data, whereas the RMSE, because it squares each error, is more strongly influenced by samples with a large absolute error. As shown in Fig. 3(a), ML-AE showed poor ability to suppress the RMSE but good ability to suppress the MAE, while ML-dropout showed the opposite tendency. ML-NN was found to reach both a low RMSE and a low MAE. The training error curves of the non-parametric tree-based algorithm ML-RF are shown in Figs. 3(a-②) and 3(b-②) for absorption and emission, respectively. ML-RF showed test accuracy comparable to that of the ML-NN model but required a longer computation time. The high prediction accuracy of ML-NN can be verified from Figs. 3(a-③) and 3(b-③), which show correlation plots of the peak energy predicted by ML-NN (y-axis) against the target peak energy simulated by TD-DFT (x-axis). Owing to space limits, we only show the best results, from ML-NN. For visual comparison, Figs. 3(a-④) and 3(b-④) plot the absorption and emission spectrum profiles predicted by ML-NN together with the target spectra simulated by TD-DFT.
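The difference between the two error criteria can be illustrated with a small numerical example. The sketch below uses the textbook definitions of RMSE and MAE on two made-up error patterns with the same total absolute error; the numbers are purely illustrative and are not taken from Fig. 3.

```python
import math

def rmse(predicted, target):
    """Root mean squared error: squares each error before averaging."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target))

def mae(predicted, target):
    """Mean absolute error: averages the error magnitudes."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

target = [0.0, 0.0, 0.0, 0.0]
uniform_error = [1.0, 1.0, 1.0, 1.0]   # the same small error at every point
single_outlier = [0.0, 0.0, 0.0, 4.0]  # one large error, same total error

print(mae(uniform_error, target), rmse(uniform_error, target))    # 1.0 1.0
print(mae(single_outlier, target), rmse(single_outlier, target))  # 1.0 2.0
```

Both patterns have the same MAE, but concentrating the error in a single point doubles the RMSE, so a model can score well on one criterion and poorly on the other.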

Next, we present and discuss the optimization algorithm for the inverse design of an optimal structure for a target optical property, using the trained ML-NN model in place of the TD-DFT simulation. The inverse design approach contains both a training loop, shown in Fig. 4(a), and a deployment loop, shown in Fig. 4(b). The DDPG algorithm was coded in-house based on the conventional actor-critic algorithm. It is worth noting that open source DDPG implementations such as the one in OpenAI Gym are not suitable for the training purpose of this work because of their extremely slow training speed.^{37} More details regarding the DDPG algorithm can be found in the literature and in our previous studies.^{38–40} The reward function in this work is defined directly from the profile overlap between the target spectrum and the designed spectrum, $R \equiv a(E) - \hat{a}(E)$. Here, $a(E)$ is the target absorption/emission spectrum, and $\hat{a}(E)$ is the learnt (predicted) absorption/emission spectrum. The trained DDPG agent with fixed NN weights is deployed for validation and prediction testing, as shown in Fig. 4(b). During the test process, a structure with the same atomic species and atomic ratio is prepared with arbitrary initial atomic positions. The final spectra are calculated by the TD-DFT simulation and compared with the target spectra. To obtain statistically significant results, the validation process was carried out using 100 sets of arbitrary atomic positions.
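One possible way to turn the spectral difference into a scalar reward is sketched below. This is an assumption made for illustration only: the text defines the reward through the difference between the target and predicted spectra but does not specify the reduction to a single number, so the negative mean absolute difference used here, the Gaussian toy spectra, and the names `gaussian_spectrum` and `spectrum_reward` are all hypothetical.

```python
import math

def gaussian_spectrum(grid_nm, peak_nm, width_nm=30.0):
    """Toy single-peak spectrum with unit height (illustrative only)."""
    return [math.exp(-0.5 * ((x - peak_nm) / width_nm) ** 2) for x in grid_nm]

def spectrum_reward(target, predicted):
    """Negative mean absolute difference between two spectra on the same grid.

    Assumed scalar reduction: zero for a perfect match, more negative
    as the predicted profile departs from the target profile.
    """
    if len(target) != len(predicted):
        raise ValueError("spectra must share the same grid")
    return -sum(abs(t - p) for t, p in zip(target, predicted)) / len(target)

# Emission grid from the text: 500 points from 300 nm in steps of 5.4 nm.
grid = [300.0 + 5.4 * k for k in range(500)]
target = gaussian_spectrum(grid, 600.0)
near = gaussian_spectrum(grid, 620.0)  # peak shifted by 20 nm
far = gaussian_spectrum(grid, 900.0)   # peak shifted by 300 nm

reward_near = spectrum_reward(target, near)
reward_far = spectrum_reward(target, far)
```

Under this assumed definition, `reward_near` is larger (closer to zero) than `reward_far`, so an agent maximizing the reward is pushed toward structures whose predicted spectrum overlaps the target.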

Four types of molecular QDs (Cd_{6}Se_{6}, Cd_{5}Te_{5}, Cd_{4}Se_{4}, and Cd_{5}Se_{1}Te_{4}) with a stable structure and well-defined optical properties were selected as the benchmark test samples for DDPG based inverse design. Since the inverse design process is similar for absorption and emission spectra, we focus here only on the emission spectra, for which abundant experimental results exist. Figure 5 shows general information regarding the whole DDPG based inverse design process, taking molecular Cd_{6}Se_{6} as an example. The boxes shown in Fig. 5(a) indicate the movement boundary for the atoms to be optimized. Atoms without a boundary cuboid are assumed to have fixed positions and are not subjected to optimization. The purpose of inserting a boundary cuboid is mainly to improve the training efficiency. In addition, by changing the number of boundary cuboids, the difficulty of the task can be adjusted in a controllable manner. In this work, two types of cuboids with volumes of 4 × 4 × 4 Å^{3} and 10 × 10 × 10 Å^{3} were chosen to verify the learning efficiency of the proposed approach. The number of atoms to be optimized was also varied from one to three. The state of the DDPG agent is defined as the Cartesian coordinates of the atoms at the *i*th step: (*x*_{i}, *y*_{i}, *z*_{i}). The action of the DDPG agent during the training loop is updated by following the learnt policy and choosing the optimal moving direction, defined by the polar and azimuthal angle pair (*θ*_{i}, *φ*_{i}) of the spherical coordinates (*r*, *θ*_{i}, *φ*_{i}). The radius *r* in the spherical coordinates (*r*, *θ*_{i}, *φ*_{i}) represents the step width and is treated as a hyperparameter that is defined before running the DDPG algorithm and kept constant (here, *r* = 0.4 Å) during the whole training loop.
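The state and action definitions above can be sketched as a single update step. The conversion from the angle pair to a Cartesian displacement uses the usual spherical-coordinate convention (*θ* measured from the *z* axis); clipping the new position to the boundary cuboid and centering the 4 × 4 × 4 Å^{3} cuboid at the origin are assumptions made for illustration, since the text does not state how the boundary is enforced. The function name `apply_action` is hypothetical.

```python
import math

STEP_RADIUS = 0.4  # angstrom; fixed step width r quoted in the text

# Assumed 4 x 4 x 4 cubic-angstrom cuboid centered at the origin.
BOX_MIN = (-2.0, -2.0, -2.0)
BOX_MAX = (2.0, 2.0, 2.0)

def apply_action(position, theta, phi, box_min, box_max, r=STEP_RADIUS):
    """Move one atom by a step of length r in the direction (theta, phi).

    theta is the polar angle from the z axis and phi the azimuthal angle.
    The result is clipped to the cuboid (an illustrative assumption).
    """
    dx = r * math.sin(theta) * math.cos(phi)
    dy = r * math.sin(theta) * math.sin(phi)
    dz = r * math.cos(theta)
    moved = (position[0] + dx, position[1] + dy, position[2] + dz)
    return tuple(min(max(c, lo), hi) for c, lo, hi in zip(moved, box_min, box_max))

# A step along +x from the cuboid center, and one that hits the wall.
state = (0.0, 0.0, 0.0)
next_state = apply_action(state, math.pi / 2, 0.0, BOX_MIN, BOX_MAX)
at_wall = apply_action((1.9, 0.0, 0.0), math.pi / 2, 0.0, BOX_MIN, BOX_MAX)
```

Because *r* is fixed, the agent only chooses a direction at each step, which keeps the action space two-dimensional per atom.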

Figure 5(b) exemplifies the learning curves of the DDPG agent by taking the reward function defined previously under different numbers of atoms to be optimized. It can be seen that the accumulated rewards for all the cases under investigation increase with the increase in episodes, indicating the successful implementation of the proposed algorithm. Moreover, due to the increased search complexity and difficulty, it is found that the reward for designing the optimal location of one atom is much higher than the case for the design of two and three atoms. The calculated spectra based on the geometric structure generated by the trained DDPG agent presented in Fig. 5(c) showed consistent tendency with the reward learning curve shown in Fig. 5(b). Figure 5(c) shows the obtained geometrical structures for two and three atoms, and it can be clearly seen here that the accuracy of the designed atom locations decreases with the increase in the number of atoms to be optimized.

The summarized results for all four types of molecular QDs (Cd_{4}Se_{4}, Cd_{5}Te_{5}, Cd_{5}Se_{1}Te_{4}, and Cd_{6}Se_{6}) under different boundary cuboids are shown in Fig. 6. Figures 6(a) and 6(b) correspond to the results obtained for the two boundary cuboids. Since it is difficult to display all the parameter dependences in a single figure, Fig. 6 plots only the central peak position of the target spectrum and the peak position of the spectrum calculated from the structure generated by the DDPG inverse design agent. As mentioned before, to obtain statistically significant validation results, 100 sets of arbitrary initial positions of the atoms to be optimized were prepared, and accordingly, 100 central peaks were determined from the 100 generated spectra. To help interpret the prediction results summarized here, we first explain the sensitivity of the TD-DFT simulation, taking the quantum dot Cd_{6}Se_{6} as an example. As shown in Fig. 6(c), when the atomic positions are intentionally displaced from their original positions by less than 0.4 Å, the averaged central peak deviation is within 20 nm; as the displacement increases from 0.4 to 1 Å, the deviation of the central spectral peak increases dramatically, up to 200 nm. Although this sensitivity is based on TD-DFT simulation, it serves as a good reference and guideline for evaluating the prediction results of our trained DDPG inverse design agent. From the results plotted in Figs. 6(a) and 6(b), several features can be extracted: (1) There is a general trend that the deviation from the target increases with the number of atoms to be optimized. It is also found from the averaged distance deviation $\Delta \bar{d}$ over the three atoms, estimated using the reference shown in Fig. 6(c), that the inferred optimal structure has reached an accuracy of less than 1 Å. (2) The boundary cuboid with a volume of 10 × 10 × 10 Å^{3} tends to generate slightly worse results. We assume that this negative effect is related to the larger exploration space, which hampers the agent in finding a better solution. (3) There is no apparent difference among the four QD samples. This is a good sign, indicating that the proposed approach is unlikely to suffer from overfitting (overlearning), a critical problem in the machine learning field.

Finally, we compare our results with those obtained by the conventional meta-heuristic particle swarm optimization (PSO) algorithm.^{41} To perform PSO based inverse design, we simply replaced the DDPG algorithm shown in Fig. 4(a) with PSO for searching the optimal atomic positions. There is no need for a deployment or test loop like that shown in Fig. 4(b), since there are no trained parameters in the PSO algorithm. Figure 7(a) shows the results optimized by the PSO approach. Although there is a slight improvement in the optimized peak positions for two and three atoms, the profiles of the final spectra deviate to some extent from the target spectra, and the degree of deviation is qualitatively comparable to that obtained by the DDPG inverse design agent shown in Fig. 5(c) (here, we have taken Cd_{6}Se_{6} as an example). A dramatic difference between the PSO method and the DDPG algorithm can be clearly identified by comparing the computational cost, as shown in Fig. 7(b). For all the conditions investigated in this work, the DDPG algorithm was roughly 100 times faster than PSO when using the 10 × 10 × 10 Å^{3} boundary cuboid and 1000 times faster when using the 4 × 4 × 4 Å^{3} boundary cuboid. The significant reduction in computational cost arises simply because the optimal solutions from the DDPG agent are generated by inference through the learnt weights of the trained neural network. In contrast, heuristics such as the PSO method lack this advantage and must search for the optimal solution by repeating the full calculation every time. One might argue that the training cost of the AI based approach is usually high and should not be ignored. In this work, for the DDPG agent trained under the two types of boundary cuboids, the training cost is between 3 and 4 h, which is indeed high compared with the calculation time of the PSO method. However, since the DDPG RL algorithm proposed here is principally designed for offline applications, the heavy training cost is not considered a dominating factor. Nevertheless, because of this drawback, it is wise to reserve the AI based approach for computationally challenging tasks such as large-scale molecular optimization problems, an approach that has been demonstrated successfully in the field of games and elsewhere.^{40,42}
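To illustrate why a heuristic search pays its full cost on every run, the following is a minimal global-best PSO sketch for a single atom confined to a cuboid. It is a generic textbook variant, not the implementation used in this work: the inertia and acceleration coefficients, the swarm size, and the toy objective (squared distance to a hidden target position, standing in for the spectral mismatch evaluated by the surrogate model) are all illustrative assumptions.

```python
import random

def make_objective(target):
    """Toy objective: squared distance to a hidden target position.

    Stands in for the spectral mismatch; also counts how often it is called.
    """
    calls = {"n": 0}
    def objective(point):
        calls["n"] += 1
        return sum((x - t) ** 2 for x, t in zip(point, target))
    return objective, calls

def pso_minimize(objective, box_min, box_max, n_particles=20, n_iters=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best particle swarm search inside an axis-aligned box."""
    rng = random.Random(seed)
    dim = len(box_min)
    pos = [[rng.uniform(box_min[d], box_max[d]) for d in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]             # personal best positions
    best_val = [objective(p) for p in pos]     # personal best values
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]  # global best
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Keep the particle inside the boundary cuboid.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], box_min[d]), box_max[d])
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val

objective, calls = make_objective(target=(0.5, -0.3, 1.0))
best_position, best_value = pso_minimize(objective, (-2.0,) * 3, (2.0,) * 3)
```

With these settings, a single search calls the objective 20 × (100 + 1) = 2020 times, and the same number of evaluations is needed again for every new target. A trained agent instead spends a fixed, small number of network evaluations per query.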

In conclusion, we performed inverse design of the optical properties of semiconductor Cd_{x}Se_{y}Te_{x−y} quantum dots using the deep reinforcement learning DDPG algorithm. Machine learning models were trained to predict the optical absorption and emission spectra using the training data generated by the TD-DFT simulation. A trained ML model was then used to predict the emission spectra during the training loop of the DDPG based inverse design approach. Four types of molecular QDs (Cd_{4}Se_{4}, Cd_{5}Te_{5}, Cd_{5}Se_{1}Te_{4}, and Cd_{6}Se_{6}) under two types of boundary cuboids were tested with 100 trials to obtain statistically significant results. Comparing the central peak of the target spectrum with that of the spectrum from the inferred structure, the designed structure showed reasonable agreement with the target structure, at an accuracy of less than 1 Å. Moreover, we show that the trained DDPG inverse design agent can infer results at a fixed computational cost and up to 100–1000 times faster than the PSO method. Our results provide evidence that, for computationally challenging tasks, a trained NN based RL agent can replace existing heuristic optimization tools, enabling fast and scalable simulations of the optical properties of nanometer-sized colloidal QDs with thousands of atoms.

The authors gratefully acknowledge the funding from the New Energy and Industrial Technology Development Organization (NEDO) (Grant No. JPNP20015) and the Ministry of Economy, Trade and Industry (METI), Japan.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Hibiki Yoshida**: Investigation (equal); Methodology (equal); Software (equal); Validation (equal). **Katsuyoshi Sakamoto**: Methodology (equal). **Naoya Miyashita**: Methodology (equal); Validation (equal). **Koichi Yamaguchi**: Methodology (equal); Validation (equal). **Qing Shen**: Investigation (equal); Validation (equal). **Yoshitaka Okada**: Investigation (equal); Supervision (equal); Validation (equal). **Tomah Sogabe**: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Funding acquisition (equal); Project administration (equal); Supervision (equal); Validation (equal).

## DATA AVAILABILITY

The data that support the findings of this study are openly available in GitHub, Ref. 34.