One endeavor of modern physical chemistry is to use bottom-up approaches to design materials and drugs with desired properties. Here, we introduce an atomistic structure learning algorithm (ASLA) that utilizes a convolutional neural network to build 2D structures and planar compounds atom by atom. The algorithm takes no prior data or knowledge of atomic interactions but queries a first-principles quantum mechanical program for thermodynamic stability. Using reinforcement learning, the algorithm accumulates knowledge of chemical compound space for a given number and type of atoms and stores this in the neural network, ultimately learning the blueprint for the optimal structural arrangement of the atoms. ASLA is demonstrated to work on diverse problems, including grain boundaries in graphene sheets, organic compound formation, and a surface oxide structure.
I. INTRODUCTION
The discovery of new materials for energy capture and energy storage and the design of new drugs with targeted pharmacological properties stand out as the ultimate goals of contemporary chemistry research and materials science.1,2 The design spaces for these problems are immense3 and do not lend themselves to brute force or random sampling.4 This has led to the introduction of screening strategies in which computational chemistry is utilized to identify the more promising candidates before experimental synthesis and characterization are carried out.5 Machine learning techniques have further been introduced to speed up such screening approaches.6 Unsupervised machine learning may serve to categorize substances whereby further search efforts can be focused on representative members of different categories.7 Supervised machine learning may direct the searches even further as properties of unknown materials are predicted based on the properties of known materials.8 In recent years, there has been a move toward even more powerful use of machine learning, namely, the employment of active learning schemes such as reinforcement learning.9,10
In active learning, the computational search is self-directed. It thereby becomes independent of both the availability of large datasets as in big data approaches11 and the formulation by the research team of a valid guiding hypothesis. In the chemistry domain, the power of active learning has been demonstrated in drug design. Schemes in which molecules may be built from encoding strings, e.g., the simplified molecular-input line-entry system (SMILES) strings, have led to artificial-intelligence-driven protocols, where a neural network, based on its accumulated experience, spells out the string for the most promising next drug to be tested.12 For instance, generative adversarial networks (GANs) have been utilized for this.13
The strength of reinforcement learning was recently demonstrated by Google DeepMind in a domain accessible to the general public.14–16 Here, Atari 2600 computer games and classical board games such as Chess and Go were tackled with techniques from computer vision and image recognition and performances were achieved that surpassed those of all previous algorithms and human champions.
In the present work, we aim at bringing the image recognition approach into the domain of quantum chemical guided materials search. That is, we formulate a protocol for deep reinforcement learning in which a convolutional neural network (CNN) starts with zero knowledge and builds up its knowledge of the optimal organization of matter via self-guided, recurring inquiries about its candidate structures. By optimal organization of matter, we mean in this first application primarily the thermodynamic stability of a configuration as revealed by a quantum mechanical simulation. By “playing” against itself, the network strives to beat the properties of its ever best candidate. As a result, the blueprint for favorable atomic arrangements is directly searched for and learned by the network.
This deviates in two ways from existing approaches that, based on intricate descriptors of the atomic environments,17,18 attempt to facilitate the calculation of quantum mechanical energy landscapes at a low cost19–25 and use them in traditional global structure optimization methods.26–28 First, our method is constructed to directly solve the configurational problem of assembling atoms into bonding complexes. Second, by using image recognition methodology, our approach is more general than methods based on the use of SMILES strings, which are restricted to molecular compounds. To stress the latter point, several of the examples given in this paper address periodic systems (graphene and a surface oxide structure on silver) for which SMILES may not be used.
II. ATOMISTIC REINFORCEMENT LEARNING
The goal of an atomistic structure learning algorithm (ASLA) structure search is to generate structures of a given stoichiometry that optimize a target structural property. This is done by repeatedly generating candidate structures, evaluating a reward based on that property, and learning from these experiences, or episodes, how to generate the globally optimal structure. The three phases of an ASLA structure search, building, evaluation, and training, are depicted in Fig. 1 and further explained in Secs. II A–II C.
The three phases of structure search in ASLA. In each step t of the building phase for episode i, the atomistic structure is input to the CNN that outputs Q-values (blue raster plot) estimating the expected reward for the final structure. N consecutive actions, a1, …, aN, are taken according to an ε-greedy policy, thereby building the final structure. In the evaluation phase, a structure property is calculated. All episode steps are stored in the replay buffer from which a training batch is extracted. Properties contained in the batch are then converted to rewards which represent Qtarget-values that the CNN is trained toward predicting.
A. Building
In the building phase, atoms are placed one at a time on a real space grid to form a structure candidate. N atoms are placed through a sequence of actions a1, a2, …, aN. At each step of the candidate generation, ASLA attempts to predict the expected reward of the final structure given the incomplete current structure. This information is stored in a CNN, which takes as input the incomplete current configuration st, encoded in a simple one-hot representation as seen in Fig. 2, and outputs the expected reward (Q-value) for each available action.
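To make the state representation concrete, the following minimal Python sketch maps 2D atom positions onto such a one-hot grid. The function name, the grid resolution, and the treatment of the cell are illustrative assumptions, not details taken from the ASLA implementation.

```python
import numpy as np

def one_hot_grid(atom_positions, cell_lengths, grid_shape):
    """Map 2D atom positions onto a one-hot occupancy grid:
    1 at grid points holding an atom, 0 elsewhere (cf. Fig. 2)."""
    state = np.zeros(grid_shape, dtype=np.float32)
    for x, y in atom_positions:
        i = int(round(x / cell_lengths[0] * grid_shape[0])) % grid_shape[0]
        j = int(round(y / cell_lengths[1] * grid_shape[1])) % grid_shape[1]
        state[i, j] = 1.0
    return state

# Example: two carbon atoms in a 10 x 10 Angstrom cell resolved on a 20 x 20 grid
s_t = one_hot_grid([(2.0, 5.0), (3.4, 5.0)], (10.0, 10.0), (20, 20))
```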
Layout of the CNN architecture for one atom species. The input is atom positions mapped to a grid in a one-hot encoding with 1 at atom positions and 0 elsewhere. The input is forwarded through three convolution layers with 10 filters in each layer, arriving at the Q-values. Only convolution layers are used (i.e., no pooling or fully connected layers) such that the network maps each Q-value to a real space coordinate. The input is padded appropriately, taking periodic boundary conditions into account if needed. Notice that the network accepts grids of arbitrary size since the output size scales accordingly. The input can readily be expanded in the third dimension to accommodate systems of several atomic species.
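A minimal TensorFlow/Keras sketch of a purely convolutional Q-network with the layout described in the caption (three convolution layers, 10 filters each, one Q-value per grid point and species) is given below. The kernel size, the activations, the use of 'same' padding in place of explicit periodic padding, and the tanh output squashing Q into [−1, 1] are assumptions made for the example; the actual settings are listed in Table S1.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_q_network(n_species=1, n_filters=10, kernel=3):
    """Purely convolutional Q-network: the output carries one Q-value per grid
    point (and per species), so the input grid size is arbitrary.  'same'
    padding is used here for brevity; ASLA pads the input explicitly,
    periodically if needed."""
    inp = layers.Input(shape=(None, None, n_species))
    x = inp
    for _ in range(3):                         # three convolution layers, 10 filters each
        x = layers.Conv2D(n_filters, kernel, padding='same', activation='relu')(x)
    q = layers.Conv2D(n_species, 1, activation='tanh')(x)   # Q-values bounded in [-1, 1]
    return Model(inp, q)

model = build_q_network()
q_map = model(tf.zeros((1, 24, 24, 1)))        # e.g. a 24 x 24 grid, one species
```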
The action is chosen according to a modified ε-greedy policy (Fig. 3), i.e., it is either the action with the highest Q-value (greedy action), a random action within the ν-fraction of highest Q-values, or a random action among all available actions (ε-action).
Illustration of the pseudorandom ε-greedy policy. From the entire space of actions, the action with the highest Q-value is chosen with 1 − ε probability, while a pseudorandom action is chosen with ε = ε0 + εν probability. There is a probability ε0 that it is chosen completely randomly, while the action is chosen among a fraction ν of the highest Q-values with εν probability.
To ensure that greedy-policy builds are included in the training, every Np = 5th episode is performed without ε-actions. Additionally, atoms cannot be placed within some minimum distance of other atoms (defined as a factor c1 times the average of the involved covalent radii). This avoids numerical complications with the density functional theory (DFT) implementations used.
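In Python, the policy of Fig. 3 together with the minimum-distance mask could be sketched as follows; the function name and the default values of ε0, εν, and ν are placeholders (the values actually used are given in Table S1).

```python
import numpy as np

def select_action(q_values, valid_mask, eps0=0.05, eps_nu=0.05, nu=0.05, rng=np.random):
    """Modified epsilon-greedy selection over a flattened Q-value map.

    q_values:   1D array with one Q-value per candidate grid point.
    valid_mask: boolean array; False where placing an atom would violate the
                minimum-distance constraint (c1 times the mean covalent radius)."""
    q = np.where(valid_mask, q_values, -np.inf)
    u = rng.rand()
    if u < eps0:                          # completely random valid action
        return int(rng.choice(np.flatnonzero(valid_mask)))
    if u < eps0 + eps_nu:                 # random action among the nu-fraction highest Q-values
        n_top = max(1, int(nu * valid_mask.sum()))
        return int(rng.choice(np.argsort(q)[-n_top:]))
    return int(np.argmax(q))              # greedy action
```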
B. Evaluation
After the completion of a structure (episode i), the property of that final structure is calculated in the evaluation phase, e.g., as the potential energy according to density functional theory (DFT), in which case ASLA will eventually identify the thermodynamically most stable structure at T = 0 K.
In some cases, the agent might settle in a local optimum where a highly improbable series of ε-actions is needed in order to continue exploring for other optima. To nudge the agent onward, we introduce a punishment which is activated when the agent builds the same structure repeatedly, indicating that it is stuck (similar to count-based exploration strategies in reinforcement learning29). The punishment is implemented by modifying the energy,

E_i → E_i + A Σ_m exp[−d(s_N^(i), s_N^(m))² / (2σ²)],   (1)

where m runs over the structures, s_N^(m), to be punished (repeated structures), where A and σ are positive constants, and where the distance d between two structures is evaluated with the Bag of Bonds descriptor as in Ref. 17.
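A sketch of this penalty, assuming the Gaussian form of Eq. (1) and leaving the Bag of Bonds featurization to a separate routine, could read as follows; A and sigma are placeholder values.

```python
import numpy as np

def punished_energy(energy, descriptor, punished_descriptors, A=1.0, sigma=1.0):
    """Add a Gaussian penalty for every previously punished (repeated) structure.
    The descriptors stand in for Bag of Bonds feature vectors; their Euclidean
    distance plays the role of d in Eq. (1)."""
    descriptor = np.asarray(descriptor, dtype=float)
    penalty = sum(A * np.exp(-np.linalg.norm(descriptor - np.asarray(d_m, dtype=float)) ** 2
                             / (2.0 * sigma ** 2))
                  for d_m in punished_descriptors)
    return energy + penalty
```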
C. Training
The Q-values sought to guide the search for structures should eventually fulfill the Bellman optimality equation,30

Q(st, at) = r(st+1) + γ max_a Q(st+1, a),   (2)
where r(st+1) is the reward for state st+1, and where a runs over all possible actions in the (t + 1)th step while γ is a discount factor. We choose a zero reward for all intermediate states until the final step, where the completed structure and its property are obtained. That property is converted into a reward r ∈ [−1, 1], where 1 is considered the better reward (reflecting, e.g., the lowest potential energy). When solving Eq. (2), the CNN parameters are updated via backpropagation, which lowers the root mean square error between predicted Q-values and some appropriately chosen Qtarget-values. The latter may be set up in a number of ways, e.g., as in traditional Q-learning. In Q-learning, bootstrapping is employed, meaning that estimates of Q-values of successor states are used in evaluating the target values, and Q-learning hence uses Eq. (2) for Qtarget,

Qtarget(st, at) = r(st+1) + γ max_a Q(st+1, a).   (3)
However, with the combination of sparse rewards and potential bias from bootstrapping, the procedure from Monte Carlo control,

Qtarget(st^(i), at^(i)) = r(sN^(i)),   (4)
[where (i) enumerates the episodes] appears more appealing in the present context. With Monte Carlo control, an averaging over stochastic outcomes, e.g., caused by an unpredictable opponent when playing a game, is effectively done once state-action pairs appear repeatedly in the training data. However, in our domain, where the reward of placing atoms according to a given policy is deterministic, we may consistently update Q-values toward the highest observed rewards among steps from previous episodes,

Qtarget(st, at) = max_{i ∈ I(st, at)} r(sN^(i)),   (5)
where i runs over the set of episodes I(st, at) = {i1, i2, …} that contain the given state-action pair, (st, at). A training batch of Nbatch state-action pairs, (st, at), is selected, including the most recent episode, the best episode, and, via experience replay,14 random state-action pairs from older episode steps. The reward function scales all batch property values linearly to the interval [−1, 1], where 1 is considered the better reward. If the property values span a large interval, all values beyond a fixed cutoff are set to −1 in order to maintain a high resolution of rewards close to 1. The batch is augmented by rotation of structures on the grid. Rotational equivariance is thus learned by the CNN (see Fig. 4) instead of being encoded in a molecular descriptor as is done traditionally.17,18,21,31 This allows the CNN to learn and operate directly on real space coordinates. We furthermore take advantage of the smoothness of the potential energy surface by smearing out the learning of Q-values to nearby grid points (spillover). Spillover is incorporated into the backpropagation cost function by also minimizing the difference between Qtarget and the predicted Q-values of grid points within some radius rspillover of the grid point corresponding to the action at. In all of the present work, rspillover = 0.3 Å. The complete cost function is thus

C = (1/Nb) Σ_{n=1}^{Nb} Σ_j ηj [Qtarget^(n) − Q(st^(n), aj)]²,   (6)
where Nb is the augmented batch size and ηj is a weight factor determining the amount of spillover for a grid point j within the radius rspillover (ηj = 1 for the center grid point and ηj = 0.01 for the neighboring grid points in the present work). Thus, the predicted Q-values of adjacent grid points are also converged toward the Qtarget value but at a different rate determined by the factor ηj. To enable some spillover for systems where the grid spacing used is larger than rspillover, we instead double the batch size by adding instances (st, ãt, Qtarget), where the action ãt deviates from the actually tested action, at, by ± one pixel in one or both of the x- and y-directions.
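The reward conversion and the Eq. (5)-type targets could be sketched as follows; the cutoff is interpreted here as flooring the reward of structures on the unfavorable (high-energy) side of a chosen value, and all names are illustrative.

```python
import numpy as np

def properties_to_rewards(energies, e_clip=None):
    """Linearly map property values (here potential energies) to rewards in [-1, 1],
    with 1 for the most stable structure; values beyond an optional cutoff are
    floored at -1 to keep resolution near the optimum."""
    e = np.asarray(energies, dtype=float)
    e_best = e.min()
    e_worst = e.max() if e_clip is None else min(e.max(), e_clip)
    if e_worst == e_best:
        return np.ones_like(e)
    r = 1.0 - 2.0 * (e - e_best) / (e_worst - e_best)
    return np.clip(r, -1.0, 1.0)

def q_target(episode_rewards, episode_ids):
    """Eq. (5): the target for a stored state-action pair is the highest reward
    observed among the episodes I(s_t, a_t) that contain that pair."""
    return max(episode_rewards[i] for i in episode_ids)
```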
Q-values in a graphene edge problem. The top row demonstrates the exact translational equivariance of the Q-values directly inherited from the architecture of the CNN. The bottom row demonstrates the approximate rotational equivariance learned by the agent from augmenting the training data with rotated copies.
CNN parameters are optimized with the ADAM gradient descent optimizer using TensorFlow.32
III. BUILDING GRAPHENE
As a first demonstration, an instance of ASLA (an agent) learns how to build a sheet of pristine graphene. One carbon atom is present in an initial template structure, s1, and N = 23 carbon atoms are placed by the agent on a grid with periodic boundary conditions. Figure 5(a) presents one example of predicted Q-values for t = 1, 10, and 20 during the building phase of one agent [for statistics involving 50 independent agents, see Figs. S1(a) and S1(b)]. Values are shown for the tabula rasa agent (the untrained agent) and for the agent trained for 200, 1000, and 2000 episodes. At first, the untrained CNN outputs random Q-values stemming from the random initialization, and consequently, a rather messy structure is built. After 200 episodes, the agent has found the suboptimal strategy of placing all the carbon atoms with reasonable interatomic distances but more or less on a straight line. The angular preference at this point is thus to align all bonds. After 1000 episodes, the agent has realized that bond directions must alternate and now arrives at building the honeycomb pattern of graphene. After 2000 episodes, the agent still builds graphene, now in a slightly different sequence.
ASLA learns to build a sheet of graphene. (a) One example of predicted Q-values for t = 1, 10, and 20 after 0, 200, 1000, and 2000 episodes. (b) An example of an agent which builds the globally optimal structure in a different sequence.
The evolution over the episodes of the overall ring-shaped patterns in the Q-values for t = 1 shows how the agent becomes aware of the correct interatomic distances. The barely discernible six-fold rotationally symmetric modulation of the Q-values that develops within the rings ends up being responsible for the agent choosing bond directions that are compatible with accommodating graphene within the given periodic boundary conditions.
Interestingly, the agent may discover the optimal structure through different paths due to the random initialization and the stochastic elements of the search. In Fig. 5(b), another agent has evolved into building the graphene structure in a different sequence. This time, the agent has learned to first place all next-nearest atoms (half of the structure) in a simple, regular pattern. The remaining atoms can then be placed in the same pattern but shifted. One can imagine how this pattern is easily encoded in the CNN.
IV. TRANSFER LEARNING
Moving on to a more complex problem, we consider a graphene grain boundary which has been studied by Zhang et al.33 Here, an ASLA agent must place 20 atoms in the region between two graphene sheets misaligned by 13.17°. The complete structure under consideration comprises 80 atoms per cell. We first run an ensemble of 50 tabula rasa agents. Within ∼2500 episodes, half of the agents have found the optimal structure33 that involves a pentagon-heptagon (5-7) defect [Figs. 6 and S1(d)].
Demonstration of transfer learning. CPU hours required to solve the graphene grain boundary problem without and with transfer learning (left and right parts, respectively). The transferred agents spend a small amount of computational resources on the cheap graphene edge problem to get a head start on the grain boundary problem, which they then solve 50% of the time using 57.2 CPU hours or less. The tabula rasa agents, in contrast, need 143.0 CPU hours to achieve the same.
The number of required episodes can be lowered significantly with the use of transfer learning. Transfer learning is a widely used approach in many domains where knowledge is acquired for one problem and applied to another.34,35 Our CNN architecture allows us to utilize transfer learning since each Q-value is evaluated based on the atomic-centered environment, which is independent of the cell size. The CNN may learn elements of how to discriminate favorable and unfavorable atomic arrangements in simple systems and retain this knowledge when used in larger, more computationally expensive systems. To demonstrate this, we choose a simpler problem, namely, that of placing two atoms at the edge of a graphene nanoribbon (see the upper right corner of Fig. 6). Using our DFT setup, this simple problem is cheaper to compute by a factor of ∼5 due to the smaller number of atoms. The agents trained on this problem are transferred to build the grain boundary. The pretrained agents hold a clear advantage over the tabula rasa agents. The average energies during the search reveal that the transferred agents produce structures close in energy to the globally optimal structure even before further training [Fig. S1(f )], and now, only about 1000 episodes are required for half of the agents to find the 5-7 defect.
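Because the Q-network contains only convolutions, weights learned on the small edge grid are size-agnostic and can be reused directly on the larger grain-boundary grid; a minimal sketch of this idea, with a hypothetical architecture mirroring the one in Sec. II A, is shown below.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def make_agent():
    # Same fully convolutional layout as the sketch in Sec. II A (three layers, 10 filters).
    inp = layers.Input(shape=(None, None, 1))
    x = inp
    for _ in range(3):
        x = layers.Conv2D(10, 3, padding='same', activation='relu')(x)
    return Model(inp, layers.Conv2D(1, 1, activation='tanh')(x))

edge_agent = make_agent()
# ... training on the cheap graphene-edge grid would take place here ...
edge_agent.save_weights('edge_agent.weights.h5')

# Purely convolutional weights can be loaded into an identical architecture and
# applied to the larger grain-boundary grid without any architectural change.
boundary_agent = make_agent()
boundary_agent.load_weights('edge_agent.weights.h5')
q_map = boundary_agent(tf.zeros((1, 48, 64, 1)))   # a larger grid, prior to further training
```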
The graphene edge problem is too simple for the agent to immediately infer the full solution to the grain boundary problem. Figure 7(a) shows how the transferred agent initially closes six-membered rings, i.e., follows the optimal strategy for the edge problem. However, the agent spends 18 atoms closing rings, after which there are no more rings to close and, as a consequence, no more high Q-values, but still two atoms to place. After further training, the agent has adapted its strategy to accommodate the defect induced by the misalignment [Fig. 7(b)]. The full building processes are shown in Figs. S2 and S3.
A snapshot of the building process at the 18th and 19th atom placements. The Q-values are shown in blue. (a) A transferred agent trained only on the edge problem attempts to close six-membered rings. This strategy fails in the final stages of the building process when the agent cannot find any rings to close. (b) After further training, the agent learns to accommodate the defect.
To inspect the agents’ motivation for predicting given Q-values, we employ a variant of the layerwise relevance propagation method.36 Layerwise relevance propagation seeks to decompose the CNN output, i.e., the Q-value, and distribute it onto the CNN input in a manner that assigns a higher share, or relevance, to more influential input pixels (Fig. 8). The outputs are distributed layer by layer while conserving relevance such that the sum of the relevance values in each layer equals the output. The relevances are computed by backward propagating the output through the CNN. We compute the relevance of neuron i in layer l by

Ri^(l) = Σ_j [ xi wij / (Σ_i′ xi′ wi′j + τ) ] Rj^(l+1),   (7)
Q-values (left) and relevance values (right) given by an agent trained on the graphene edge. (a) The agent has learned that closing a six-membered carbon ring results in low potential energy and assigns high Q-values for these positions. (b) Relevance values for the action (×) reveal that nearby atoms are responsible for the high Q-values. (c) If an atom is too close to an action (×), this atom is responsible for low Q-values. (d) The agent recognizes similar arrangements of atoms in the grain boundary (after 3 atoms have been added) and assigns high Q-values for corresponding actions. [(e) and (f)] Although the agent has never seen this configuration, it applies the local knowledge learned in the graphene edge problem.
Here, xi is the activation of node i connected to node j by the weight wij, while τ is a small number to avoid numerical instability when the denominator becomes small. Equation (7) distributes relevance of neuron j to neuron i in the preceding layer according to the contribution to the pre-activation of neuron j from neuron i, weighted by the total pre-activation of neuron j. Equation (7) implies that bias weights are not used whenever relevance values need to be calculated (i.e., in the graphene grain boundary and edge systems).
The relevance value of an atom indicates how influential that atom was in the prediction of the Q-value, where 1 is positive influence, −1 is negative influence, and 0 is no influence. We show the Q- and relevance values for the graphene edge and grain boundary problems in Fig. 8, as given by one of the agents that has only been trained on the graphene edge problem. In addition to solving the graphene edge problem by closing a six-membered ring, the agent recognizes similar local motifs (unclosed rings) in the grain boundary problem which lead to favorable actions even without further training [Figs. 8(b) and 8(e)]. The relevance values in Figs. 8(c) and 8(f) also reveal that the agent has learned not to place atoms too close no matter the local environment.
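For a single bias-free layer, the redistribution rule of Eq. (7) amounts to the following sketch (a convolution can be treated likewise after unrolling); all names and the value of τ are illustrative.

```python
import numpy as np

def lrp_epsilon(x, W, relevance_next, tau=1e-6):
    """Redistribute relevance from layer l+1 back to layer l according to Eq. (7).

    x:              activations of layer l, shape (n_in,)
    W:              weights from layer l to l+1, shape (n_in, n_out)
    relevance_next: relevances of layer l+1, shape (n_out,)"""
    z = x @ W                        # pre-activations of layer l+1
    s = relevance_next / (z + tau)   # stabilized relevance per unit pre-activation
    return x * (W @ s)               # relevance of layer l neurons
```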
V. MOLECULAR COMPOUNDS
Multicomponent structures are easily accommodated in ASLA by adding to the CNN input a dimension representing the atomic type. The CNN then correspondingly outputs a map of Q-values for each atomic type. When building a structure, a greedy action now becomes that of choosing the next atom placement according to the highest Q-value across all atomic types.
Figure 9(a) presents an example of the building strategy of a self-trained ASLA agent assembling three carbon, four hydrogen, and two oxygen atoms. The agent identifies the combination of a carbon dioxide and an ethylene molecule as its preferred building motif. The reason for this build is the thermodynamic preference for these molecules with the chosen computational approach. Depending on the initial conditions and the stochastic elements of ASLA, we find that other agents may train themselves into transiently building a plethora of other chemically meaningful molecules given the present 2D constraint. The list of molecules built includes acrylic acid, vinyl formate, vinyl alcohol, formaldehyde, formic acid, and many other well known chemicals [see Figs. 9(b), 9(c), and S4]. To avoid getting stuck in local minima, we employ the punishment strategy previously described.
Predicted Q-values in the process of building C3H4O2 compounds. (a) Building ethylene and carbon dioxide. (b) Building acrylic acid. (c) Building vinyl formate.
Figure 10(a) reveals the distribution of molecules built when 25 independent agents are run for 10 000 episodes and shows that ASLA's prolific production of different molecules has a strong preference for the thermodynamically most stable compounds (blue bars). This is in accordance with the reward function depending solely on the stability of the molecules. We find, however, that the building preference of ASLA may easily be nudged in particular directions by exploiting the bottom-up approach taken by the method. To do this, we modify the property function slightly, rewarding structures that contain a given atomic motif: the property values are modified by a function f that artificially strengthens the potential energy of structures fulfilling some criterion g, by an amount controlled by a parameter λ. In the present work, λ = 0.2 is used. The histogram in Fig. 10(b) shows the frequencies with which molecules are built once the reward is increased for structures that have an oxygen atom binding to two carbon atoms. Two atoms are considered to bind if they are within the sum of their covalent radii +0.2 Å. No constraints on the precise bond lengths and angles are thus imposed, and ASLA consequently produces both ethers (structures 6, 18, 22, and 24) and an epoxy molecule (structure 19) at high abundance, while the building of other molecules is suppressed.
Building C3H4O2 compounds. (a) Distribution of molecular compounds of stoichiometry C3H4O2, sorted by formation energy, as found by 25 agents during 10 000 episodes. For all relaxed structures, see Fig. S4. (b) Search results when rewarding O binding to 2×C, coding for ether and epoxy groups. (c) Search results when rewarding C binding to C and 2×O, coding for epoxy and acidic groups. (d) Search results when rewarding C binding to O and 2×C, coding mainly for cyclic C3 groups.
The production of the epoxy species intensifies, as evidenced in Fig. 10(c), once the rewarding motif becomes that of a carbon atom bound to one carbon and two oxygen atoms. This motif further codes very strongly for acidic molecules (structures 2, 14, and 23). In fact, every agent started now eventually ends up building acrylic acid (structure 2) within the allowed number of episodes.
Switching to a rewarding motif consisting of a carbon atom binding to one oxygen and two carbon atoms, we find the results of Fig. 10(d). Now, the search builds almost twice as many 2-hydroxyacrylaldehyde molecules (structure 4) and adopts a clear preference for molecules with three carbon atoms in a ring, such as the aromatic cyclopropenone and related molecules (structures 16, 20, and 26).
The ability of ASLA to direct its search in response to variations in the reward function represents a simple illustration of how automated structure search may lead to the design of molecules with certain properties. The present approach may be used in search for larger organic compounds with some given backbone building blocks or some assemblies of functional groups. It might also be directly applicable when searching for organometallic compounds where the malleable bonds mean that the desired motif may only be specified loosely as done here. However, for searches that target, e.g., electric dipole moments and polarizabilities, HOMO-LUMO gaps, deprotonation energies, and other global properties of molecules, we envisage that more elaborate strategies for directing the searches must be implemented in order to truly conduct molecular design.
VI. SURFACE OXIDE ON Ag(111)
As a final, nontrivial example, the structure of oxidized silver, known to be an efficient catalyst for, e.g., epoxidation reactions,37 is considered. For this system, one particular surface oxide phase, Ag(111)-p(4 × 4)-O, has been the subject of many studies since the 1970s, resulting in several proposed structures.37–41 Initially, we prepare a grid [left of Fig. 11(a)] on which an agent can build a Ag4O2 2D structure. The property of a specific build is its DFT potential energy once it is placed as an overlayer structure on a single, fixed Ag(111) layer representing the silver crystal [right of Fig. 11(a)]. To encourage exploration, we utilize the punishment term.
Learning to build a Ag4O2 island structure. (a) Placement of the ASLA 2D structure as the top surface layer in a DFT slab calculation. (b) Predicted Q-values for an episode in which the initially untrained agent has, after 1825 episodes, self-learned to build the globally optimal island structure. (c) The agent has learned to respond reasonably when presented with structures of different stoichiometries.
The converged agent [Fig. 11(b)] is capable of making reasonable decisions when presented with structures it has never seen before. This is shown in Fig. 11(c), where the agent is tested on three substructures of the Ag(111)-p(4 × 4)-O global minimum. Although the agent cannot infer the correct solution immediately, the transferred knowledge allows the agent to avoid unfavorable parts of the configuration space.
Finally, the agent is employed to solve the full problem using transfer learning. Key experimental observations include STM images39–41 [Fig. 12(a)] of a regular rhombic surface unit cell containing 12 Ag atoms and 6 O atoms per unit cell with void regions amounting to approximately 25% of the cell.41 Using this information, we prepare a grid [Figs. 12(b) and 12(c)] on which an agent can build a Ag12O6 overlayer structure. ASLA proves capable of learning by itself to build the intricate structure shown in Fig. 12(d). In hindsight, this structure might appear highly intuitive to a trained inorganic surface chemist, yet despite many efforts,37 this correct structure remained elusive for more than half a decade after the first characterization by STM imaging in 2000 until it was finally identified in 2006.40,41 It is intriguing how such a puzzling structure may now emerge from a fully autonomous reinforcement learning setup and a CNN of a relatively modest size.
Learning to build a Ag12O6 overlayer structure. (a) Scanning tunneling microscopy image of the Ag(111)-p(4 × 4)-O phase (adapted from Ref. 40). (b) Experimental information on the super cell, absence of atoms in corner regions, and stoichiometry is used as input to ASLA. (c) Placement of ASLA 2D structure as the top surface layer in a DFT slab calculation. (d) Predicted Q-values after 3990 episodes where the agent transferred from the Ag4O2 problem [Figs. 11(b) and 11(c)] has self-learned to build the globally optimal overlayer structure.
VII. OUTLOOK
In the present work, we have demonstrated how deep reinforcement learning can be used in conjunction with first-principles computational methods to automate the prediction of 2D molecular and inorganic compounds of optimal stability. The work is readily generalized to handle 3D atomistic structures by turning to 3D convolution kernels and adding one more spatial dimension to the input grid and to the output Q-value tensor. Preliminary investigations in which the C3H4O2 problem is extended from 2D (Figs. 9 and 10) to 3D (not shown) reveal that the computational load, mainly from the Q-value backpropagation step in each episode, increases with the number of grid points added in the third dimension but remains manageable with our computer hardware. Enabling 3D structure search with ASLA does, however, also open up more complex problems, which will require more episodes to converge.
As more complex 2D or even 3D problems are tackled with the method, we foresee the need for introducing further means of directing the global optimization. The reinforcement protocol will be slow in detecting structures that require several favorable ε-moves to form and will suffer from conservatism, exploring mainly close to previously built structures. Here, viable strategies may include adding, e.g., a beam-search strategy or Monte Carlo tree search, and possibly combining the method with a property-predictor network that allows screening several orders of magnitude more structures than can be handled with the first-principles method.
Finally, we note that agents trained to act on a variable number of atoms are desirable as they will be more versatile and capable of solving more diverse problems. Such agents may be obtained by introducing chemical potentials of atomic species. Notwithstanding such possible extensions, we consider the present work to represent a first peek into a rich research direction that will eventually bring about first principles computer codes for the intelligent, bottom-up design and direct manipulation of atoms in chemical physics and materials science.
VIII. METHODS
A. DFT calculations
All calculations are carried out with the grid-based projector augmented-wave (GPAW42) software package and the atomic simulation environment (ASE).43 The Perdew-Burke-Ernzerhof (PBE) exchange-correlation functional44 is used together with a linear combination of atomic orbitals (LCAO) basis for the graphene and Ag2xOx surface reconstruction problems, and a finite-difference method for the C3H4O2 compounds. All systems are modeled in vacuum. For further details on the system cell and grid dimensions, see Table S1.
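A minimal GPAW/ASE single-point evaluation of a built candidate could look like the sketch below (LCAO mode, PBE); the specific molecule, vacuum size, basis set, and file name are assumptions for illustration, not the production settings of Table S1.

```python
from ase import Atoms
from gpaw import GPAW

# Hypothetical candidate: a linear CO2 molecule centered in a vacuum box.
atoms = Atoms('CO2', positions=[[0.0, 0.0, 0.0], [1.16, 0.0, 0.0], [-1.16, 0.0, 0.0]])
atoms.center(vacuum=5.0)

atoms.calc = GPAW(mode='lcao', basis='dzp', xc='PBE', txt='candidate.txt')
energy = atoms.get_potential_energy()   # the property returned to ASLA for the reward
```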
B. Hardware
All calculations were done on a Linux cluster with nodes having up to 36 central processing unit (CPU) cores. Neural network and DFT calculations were parallelized over the same set of cores on one node. The CPU times given for the transfer learning problem presented in Fig. 6 were recorded when 4 CPU cores were used.
C. Hyperparameters
For all relevant system dimensions, computational settings, and hyperparameters, see Table S1.
SUPPLEMENTARY MATERIAL
See supplementary material for the statistics on the graphene problems, the 30 most stable molecular compounds found, the full Ag12O6 build, and the hyperparameters.
ACKNOWLEDGMENTS
We acknowledge support from VILLUM FONDEN (Investigator Grant, Project No. 16562).