To accelerate the exploration of chemical space, it is necessary to identify the compounds that will provide the most additional information or value. A large-scale analysis of mononuclear octahedral transition metal complexes deposited in an experimental database confirms an under-representation of lower-symmetry complexes. From a set of around 1000 previously studied Fe(II) complexes, we show that the theoretical space of synthetically accessible complexes formed from the relatively small number of unique ligands is significantly (∼816k) larger. For the properties of these complexes, we validate the concept of ligand additivity by inferring heteroleptic properties from a stoichiometric combination of homoleptic complexes. An improved interpolation scheme that incorporates information about cis and trans isomer effects predicts the adiabatic spin-splitting energy to around 2 kcal/mol and the HOMO level to less than 0.2 eV. We demonstrate a multi-stage strategy to discover leads from the 816k Fe(II) complexes within a targeted property region. We carry out a coarse interpolation from homoleptic complexes that we refine over a subspace of ligands based on the likelihood of generating complexes with targeted properties. We validate our approach on nine new binary and ternary complexes predicted to be in a targeted zone of discovery, suggesting opportunities for efficient transition metal complex discovery.
I. INTRODUCTION
In recent years, virtual high-throughput screening (VHTS)1–8 with first-principles density functional theory (DFT) and machine learning (ML) models9–17 has greatly accelerated the discovery of new molecules and materials.18–23 Nevertheless, the theoretical space of all possible compounds or materials is so large as to challenge even the most accelerated methods, with 1030–1060 theoretical drug-like molecules being enumerated24–29 from a relatively small number of elements and atoms.30,31 Within the space of theoretical transition metal complexes, additional variables emerge, such as the metal identity, spin, and oxidation state, as well as denticity of the ligands.4,9 Indeed, significant analysis has been carried out by Fey and co-workers in understanding the role of privileged (e.g., phosphine) ligands in determining transition metal complex properties.32–34 Jensen and co-workers devised elegant strategies to explore the space of favored complexes, e.g., by adjusting the denticity during complex optimization carried out with efficient semi-empirical- or force-field-based scoring.35–38 One key challenge for high-throughput screening with DFT of transition metal complex space is that the smallest non-trivial mononuclear octahedral complex consists of at least seven heavy atoms and nearly 100 electrons, also challenging the speed of conventional simulation techniques in comparison to readily computed datasets of closed-shell transition metal39 and organic molecules.40,41 Further compounding the challenges of exploring transition metal chemical space are potential issues with convergence success or the presence of multi-reference character.42–44 Thus, it is attractive to identify the minimal set of explicit first-principles calculations that can be used to build a model of the properties of the full dataset.45–47
Recently, toward the goal of exploring a more diverse transition metal chemical space in comparison to complexes comprising of frequently studied ligands, we devised a strategy for enumerating hypothetical, small (i.e., 1–2 heavy atoms per coordination site) ligands for mononuclear octahedral transition metal complexes.48 These ligands sampled a diverse combination of coordinating atoms and their bonding environments,48 and only a small fraction were represented in prior databases of organic molecules.25,49,50 We showed how incorporating these molecules could improve the fidelity of artificial neural network (ANN) models48,51 when applied to larger, realistic complexes present in the Cambridge Structural Database (CSD).52 That study was limited to the properties of homoleptic combinations of those ligands (i.e., all ligands are the same) and therefore did not capture effects of mixing ligands that give rise to the compelling properties of many heteroleptic complexes and catalysts. Enumerating combinations of these ligands, however, would give rise to a combinatorial explosion, motivating strategies to understand which combinations are likely to be valuable or informative for a specific application.
Despite the challenge of combinatorial explosion, there are some established precedents of ligand additivity53 that suggest that the properties of heteroleptic complexes can be inferred from combinations of homoleptic complexes.54 For example, ligand additivity has been demonstrated in force field and DFT energetics54 and DFT errors.55 It has also been used in correction schemes, such as the DBLOC method.56–59 We also recently exploited additivity to learn the degree of multireference character in a complex from the multireference character in its constituent ligands.60 Additivity is also exploited heavily in fragmentation methods61,62 and locally correlated methods.63,64 In the present work, we carry out a survey of the symmetry classes and ligand diversity present in the CSD to confirm that the theoretical chemical space is orders of magnitude larger than the number that have been characterized. Motivated by the need to devise efficient but accurate methods for the exploration of chemical space, we introduce improved interpolation schemes for heteroleptic compounds to incorporate cis and trans effects. Finally, we demonstrate how these approaches can be used for efficient but accurate discovery of transition metal complexes with targeted properties.
II. COMPUTATIONAL DETAILS
A. Curation from the Cambridge Structural Database
A set of 85 575 mononuclear octahedral transition metal complexes were curated from the Cambridge Structural Database52 (CSD) version 5.41 (November 2019). This procedure employed both the Conquest graphical interface to the CSD and the Python application programming interface, in all cases applied to the v5.41 dataset with complexes from the November 2019 dataset with both the March 2020 and May 2020 data updates (supplementary material, Text S1). For the complexes identified as octahedral, equatorial planes and axial positions were assigned based on prior reported rules.65 To identify the symmetry of the ligands, unique ligands were identified by removing the metal atom to create independent molecular graphs for each ligand. Each ligand was identified as chemically unique within a given octahedral complex if it differed from all other ligands in the complex by (1) heavy atom chemical symbols, (2) metal-connecting-atom element, or (3) more than three hydrogens (supplementary material, Text S2). The symmetry of the complex was identified by distinguishing ligand denticity overall and in the equatorial plane along with the total number of unique ligands and whether ligands that were trans to each other were identical (supplementary material, Text S2). This led to a nomenclature for 66 ligand symmetry classes (supplementary material, Text S2).
To identify the set of unique ligands in each complex, a dummy atom with identical connectivity to the metal with an atomic number of 0 was introduced to preserve the connectivity of the ligands to the metal without preserving the metal identity. For this ligand and dummy atom combination, the atomic-number and bond-order weighted connectivity matrix determinant was calculated as described in Ref. 65. We also computed the determinant of the atomic-number and bond-order weighted connectivity matrix where the off-diagonal elements, ZiZj (i ≠ j), were set to the CSD-assigned bond order for each ligand. Ligands with both distinct atomic-number-weighted connectivity matrix determinants and bond-order-weighted connectivity matrix determinants were identified as distinct ligands across monometallic transition metal complexes in the CSD. A second search was carried out by requiring that oxidation states and charges be assigned by the uploader along with no disorder (i.e., as judged by the CSD flags) or missing hydrogen atoms in the structure (i.e., none were added by the CSD algorithm), leading to 17 085 unique “computation-ready” complexes. Finally, we curated a subset of 1202 Fe(II)-containing “computation-ready” complexes based on the oxidation state reported by the uploader. From the ligands identified in this Fe(II) complex set, heteroleptic calculations from CSD ligands were carried out using a previously developed procedure60 that enabled the assignment of the per-ligand charge. The goal of this curation is for subsequent analysis outlined in Sec. III. We note that complementary datasets have been developed, including the tmQM dataset of 86k closed-shell transition metal complexes,66 the cell2mol set of 31k inferred transition metal complex charges from the crystal unit cell along with 13k ligand charges,66 and our set of over 5k transition metal complex ligands with confident charge assignment.60 Our rationale for curating another set of CSD transition metal complexes is threefold: (i) to focus on open-shell transition metal chemistry absent from some of the aforementioned sets, (ii) to leverage our recently developed ligand-derived charge scheme,60 and (iii) to primarily analyze the theoretical vs observed propensities of ligand types, denticities, and complex symmetries in the CSD.
B. Electronic structure calculations
DFT geometry optimizations were performed using a development version of TeraChem v1.9.67,68 The B3LYP70–71 global hybrid functional was employed with the LANL2DZ72 effective core potential for transition metals and the 6-31G* basis73 for all other atoms. All transition metal complexes were studied with Fe(II) centers in low-spin singlet and high-spin quintet multiplicities. Singlet calculations were carried out in a spin-restricted formalism, while quintet calculations were unrestricted. Level shifting74 was employed to aid self-consistent field convergence with the majority-spin and minority-spin virtual orbitals each shifted by 0.25 Ha. Geometry optimizations were carried out in the gas phase in translation rotation internal coordinates75 using the BFGS algorithm. Default tolerances of 4.5 × 10−4 hartree/bohr and 10−6 hartree were applied in the convergence criteria for the maximum gradient and energy difference between steps, respectively.
For the representative model complexes of CH3CN, H2O, CO, and NH3, initial structures were generated with molSimplify,51,76,77 which uses OpenBabel78,79 as a backend. The same protocol was applied to generate homoleptic complexes of the 20 neutral ligands derived from monodentate-only, non-homoleptic Fe(II) complexes obtained from the CSD52 after discarding one bulky ligand, OP(Ph)3, that could not form a stable homoleptic structure for steric reasons. For the 36 homoleptic Fe(II) complexes in the CSD, the structures were directly extracted for subsequent geometry optimization. Heteroleptic complexes of the 12 representative ligands were also generated with molSimplify and optimized following the same procedure. The curated CSD database and structures, as well as the scripts to generate these databases and structures, along with the properties of all transition metal complexes studied with DFT are provided on Zenodo with DOI 10.5281/zenodo.7224793.80
III. RESULTS AND DISCUSSION
A. Symmetry classes and theoretical complex space
The diversity present in the chemical space of transition metal complexes is derived from the variability in the metal, its oxidation and spin state, and the chemistry of the coordinating ligands. Our experimental knowledge of this chemical space is unevenly distributed. We thus first examined the structures deposited in the CSD to uncover trends in the arrangement of ligands in previously characterized complexes (see Sec. II). From 85 575 mononuclear octahedral transition metal complexes of which 17 085 are identified as unique and computation-ready (e.g., have user-defined charges), the vast majority (95%–98%) contain no more than three unique ligands (Fig. 1 and supplementary material, Table S1). In fact, 28% of all unique computation-ready complexes are homoleptic, and a majority (76%) contain no more than two ligand types (Fig. 1 and supplementary material, Table S1). Because the metal identity and ligand diversity are expected to be coupled, we also evaluated statistics on a subset of 1202 unique Fe(II) complexes and confirm that the preference for complexes with no more than two ligands is preserved and even strengthened (Fig. 1 and supplementary material, Table S1). We, nevertheless, further distinguish the denticity of these ligand types because monodentate ligands are theoretically compatible with a wider range of symmetries than higher denticity ligands (Fig. 1). In subsequent analysis in this paper, we will only focus on monodentate ligands for this reason and note, therefore, that while two is the most common number of unique ligands, many of those ligands are bidentate and their study is motivated in the future.
To identify how the structures sampled in the CSD compare to the theoretical space of all hypothetical complexes, we enumerate the overall pool of theoretical complexes and the number in each symmetry class to compare to the most frequently characterized symmetry classes in the CSD. We applied Pólya’s enumeration theorem to octahedral coordination geometries to obtain all possible symmetry classes from the cycle index (i.e., theoretical sum) of a symmetry group (supplementary material, Text S3).82–83 The total number of complexes depends on both the number of ways the same stoichiometry can be arranged and the number of ways the ligands can be combined (supplementary material, Text S3). For example, in (L1)4(L2)2, there is one occurrence of a four-substitution site and one occurrence of a two-substitution site, whereas in (L1)2(L2)2(L3)2, there are three occurrences of a two-substitution site. In total, the theoretical number of ways distinct ligands can be enumerated together leads to a large number of hypothetical complexes even for a small pool of ligands (Table I).
Name . | Configuration . | Isomers . | Cardinality . | Complexes . |
---|---|---|---|---|
HO | x6 | 1 | N | 12 |
5+1 | x1x5 | 1 | N(N-1) | 132 |
TS/CS | x4x2 | 2 | N(N-1) | 264 |
FS/MS | x3x3 | 2 | N(N-1)/2! | 132 |
CA/TA | x4x1x1 | 2 | N(N-1)(N-2)/2! | 1 320 |
FA/MAC/MAT | x3x2x1 | 6 | N(N-1)(N-2)/2! | 3 960 |
EA/DCS/DTS | x2x2x2 | 5 | N(N-1)(N-2)/3! | 1 100 |
Up to two ligands | 540 | |||
Up to three ligands | 6 920 | |||
Full | 1/48 (N6 + 3N5 + 9N4 + 13N3 + 14N2 + 8N) | 82 160 |
Name . | Configuration . | Isomers . | Cardinality . | Complexes . |
---|---|---|---|---|
HO | x6 | 1 | N | 12 |
5+1 | x1x5 | 1 | N(N-1) | 132 |
TS/CS | x4x2 | 2 | N(N-1) | 264 |
FS/MS | x3x3 | 2 | N(N-1)/2! | 132 |
CA/TA | x4x1x1 | 2 | N(N-1)(N-2)/2! | 1 320 |
FA/MAC/MAT | x3x2x1 | 6 | N(N-1)(N-2)/2! | 3 960 |
EA/DCS/DTS | x2x2x2 | 5 | N(N-1)(N-2)/3! | 1 100 |
Up to two ligands | 540 | |||
Up to three ligands | 6 920 | |||
Full | 1/48 (N6 + 3N5 + 9N4 + 13N3 + 14N2 + 8N) | 82 160 |
For the trivial case of homoleptic (HO) complexes, N ligands produce N complexes (i.e., 12 for N = 12, Table I). For up to two unique ligand types, the five unique symmetry classes consist of monoheteroleptic (5+1) M(L1)5(L2)1 complexes, trans symmetric (TS) or cis symmetric (CS) M(L1)4(L2)2 complexes, and fac symmetric (FS) or mer symmetric (MS) M(L1)3(L2)3 complexes (Fig. 2 and Table I). Both 5+1 and CS/TS complexes each form N(N-1) complexes for N ligands (i.e., 132 each for N = 12, Table I). The degeneracy of the stoichiometry in FS and MS complexes gives rise to N(N-1)/2! complexes (i.e., 66 each for N = 12, Table I). Thus, from N = 12 ligands, a total of 540 complexes may be formed with up to two unique ligand types.
Expanding to up to three unique ligand types introduces eight additional symmetry classes (Fig. 3). These include M(L1)4L2L3 cis asymmetric (CA) and trans asymmetric (TA) complexes; the three types of M(L1)2(L2)2(L3)2 configurations in equatorial asymmetric (EA), double cis symmetric (DCS), or double trans symmetric (DTS) symmetries; and three M(L1)2(L2)3L3 in fac asymmetric (FA), mer asymmetric trans (MAT), or mer asymmetric cis (MAC) symmetries (Fig. 3). For the EA complexes, there are three occurrences of a two-substitution site that fulfill the EA definition. This combines with the DCS and DTS isomers to form a total of five isomers with N(N-1)(N-2)/3! possible combinations of ligands (i.e., 1100 for N = 12, Table I). The less degenerate CA/TA complexes form a total of N(N-1)(N-2)/2! complexes for each of the two isomers (i.e., 1320 for N = 12, Table I). Similarly, FA/MAC/MAT complexes can each form N(N-1)(N-2)/2! complexes in two isomers each (i.e., 3960 for N = 12, Table I).
In total, 6920 complexes can be formed from N = 12 ligands for the three symmetry classes considered here. For the same ligand pool, there is a much larger set of 82 160 theoretical complexes that could be created from a greater number of unique ligands. This analysis does not consider cases where the ligand chemistry prevents the formation of a complex, e.g., only monodentate and pentadentate ligands are likely to form the 5+1 symmetry class, whereas only monodentate, bidentate, tridentate, or hexadentate ligands can form HO complexes.
Returning to the diversity observed in complexes deposited in the CSD, we qualitatively observe that relatively little of the theoretical space has been sampled. As we have shown for a representative example, for any set of unique ligands, a much larger theoretical number of binary and ternary complexes can be formed in comparison to homoleptic complexes. Nevertheless, there are far fewer binary and especially ternary complexes in the CSD (supplementary material, Table S1). Within the binary complexes, 5+1 and TS complexes are overrepresented in comparison to those for FS, MS, or CS based on our theoretical enumeration (Fig. 4 and supplementary material, Table S2). All ternary complexes are underrepresented, but those with equal stoichiometry (i.e., EA, DCS, or DTS) have very few examples in the CSD (Fig. 4 and supplementary material, Table S2). If we simplify our analysis by focusing only on Fe(II) complexes, the same trends hold although Fe(II) complexes have an even greater relative number of 5+1 and TS complexes and a lower number of CS complexes (Fig. 4 and supplementary material, Table S2).
An additional factor in analysis of symmetry classes is the extent to which ligand denticity plays a role. While enumeration is straightforward for monodentate ligands, higher-denticity ligands (e.g., pentadentates) may only be compatible with some of the symmetry classes. Indeed, over both the computation-ready CSD set and the Fe(II) subset, a significant number of complexes consist of higher-denticity ligands that would be incompatible with full enumeration (Fig. 1 and supplementary material, Tables S3–S4). Some multidentate ligands are also restricted in which symmetry classes they can form due to rigidity, while other more flexible ligands are less restricted. Nevertheless, we can still conclude that there is vastly higher sampling of homoleptic structures. For example, tridentate ligands can be present in a homoleptic complex and any binary FS/MS or ternary TA/FA/MAC/MAT complex. Nearly an order-of-magnitude more unique tridentate ligands have been characterized in homoleptic Fe(II) complexes in comparison to either the binary or ternary cases (supplementary material, Table S4). Similar trends hold for all unique computation-ready complexes (supplementary material, Table S3). Overall, the number of unique ligands for binary or ternary complexes is still lower than the number of unique complexes, but the gap is smaller than could be expected from enumeration alone (supplementary material, Tables S3–S4).
To simplify a quantitative comparison between the theoretical space and the enumerated space, we focus on Fe(II) complexes with only monodentate ligands in any of the 14 symmetry classes considered. For this set, we identify 40 unique monodentate ligands present in 40 HO complexes (supplementary material, Table S5). There are additional 48 ligands present in binary or ternary complexes not observed in homoleptics for which we can confidently assign a charge to the ligand (supplementary material, Table S5). While the majority of HO complex monodentate ligands were neutral, over half of the additional binary or ternary ligands have a non-zero charge (supplementary material, Table S6). Thus, although we identify 88 unique monodentate ligands in previously synthesized Fe(II) complexes, ligands not sampled in HO complexes may be incompatible with the HO symmetry class if they give rise to high overall complex charges.
Taking the set of 88 ligands from any HO, binary, or ternary Fe(II) complex as the theoretical space for which compatibility across symmetry classes should be maximal, we then quantified the theoretical vs actual coverage of Fe(II) complex chemical space. A significant number (39%) of all HO complexes have been characterized. In comparison, the binary symmetry classes have been much less well explored, with TS complexes being the highest at 0.3% (i.e., 26 of 7656 theoretical complexes and supplementary material, Table S6). The number of theoretical ternary complexes grows rapidly with this ligand pool, ranging from 109 736 theoretical DTS/DCS complexes to 329 208 FA/MAC/MAT/CA/TA complexes (supplementary material, Table S6). In total, the 1202 Fe(II) complexes represent a tiny fraction of the theoretical 3 213 056 homoleptic, binary, or ternary complexes that could form from 88 experimentally synthesized ligands. Of the 88 ligands, we exclude one [i.e., OP(Ph)3] from further analysis due to its large bulk that prevents building a homoleptic complex that remains intact after geometry optimization. Even if we restrict ourselves to the 56 ligands that are neutral, closed-shell singlets, and amenable to homoleptic complex construction, the theoretical space of homoleptic, binary, or ternary complexes is still large (i.e., 816 256 complexes). Thus, we conclude that efficient strategies to infer heteroleptic properties from homoleptic properties are necessary to “fill in” the remainder of this unexplored space.
B. Ligand additivity for interpolating properties of transition metal complexes
We next aimed at determining the extent to which the properties of lower-symmetry heteroleptic transition metal complexes could be inferred from those of higher-symmetry complexes. We constructed complexes with two to three unique ligand types from small sets of ligands that spanned a large range of ligand field strengths: weak-field water, strong-field carbonyl, and either strong-field methyl isocyanide or weak-field ammonia. For the calculation of the adiabatic high-spin [e.g., quintet Fe(II)] to low-spin [e.g., singlet Fe(II)] splitting, ΔEH-L, it can be expected that the weak-field ligands will lead to homoleptic complexes that favor high-spin states, whereas strong-field ligands will make homoleptic complexes that favor low-spin states. Thus, heteroleptic combinations of these ligands are expected to reside between the two limits. One simple way to obtain estimates of the spin splitting of the heteroleptic complexes (e.g., with up to three unique ligand types) is to take a weighted average of the spin splitting of the parent homoleptic complexes,
Indeed, we observe that heteroleptics reside between the homoleptic limits for spin splitting, and the linear averaging roughly holds for complexes with methyl isocyanide (CH3CN), H2O, and CO ligands (Fig. 5 and supplementary material, Table S7). Similar observations can be made on combinations with H2O, CO, and NH3 (supplementary material, Fig. S1). Nevertheless, there are significant outliers in the interpolated vs actual ΔEH-L, which are particularly evident when comparing heteroleptic complexes with the same stoichiometry and, therefore, the same prediction from a simple linear model, but distinct ligand symmetry (Fig. 5). For example, the CS complex Fe(II)(H2O)4(CO)2 ΔEH-L is predicted accurately (predicted: −7.9 kcal/mol vs calculated: −9.9 kcal/mol) from homoleptic interpolation (Fig. 5 and supplementary material, Table S8). The same prediction significantly overestimates the TS complex with the same stoichiometry (predicted: −7.9 kcal/mol vs calculated: −19.3 kcal/mol), which instead behaves much more similarly to the 5+1 complex, as observed in prior work84 (Fig. 5). Overall, for this combination of ligands, CS or FS complexes that have one minority ligand in the axial position and another in the equatorial plane are much better predicted than the equivalent MS or TS complexes (Fig. 5). For the case of CO, H2O, and NH3, mixing between weak-field NH3 and H2O is relatively accurately predicted from homoleptic averaging, whereas for NH3 and CO, it is the MS and TS complexes that are more accurately predicted than the FS or CS counterparts (supplementary material, Fig. S2 and Tables S8–S9).
Because adiabatic spin splitting involves geometry optimizations in two distinct spin states, we also evaluated HOMO levels of each singlet complex as a property that only depends on a single geometry. Overall, interpolation of HOMO energies of heteroleptic complexes of CH3CN, CO, and H2O from homoleptic complexes reproduces trends between the homoleptic limits (Fig. 6 and supplementary material, Tables S7 and S10–S11). As with spin splitting, there are key differences for complexes with identical stoichiometry that cannot be captured by interpolation from homoleptic complexes alone (Fig. 6). Interestingly, for the same complexes for which the CS complex had a higher (i.e., more off-parity) ΔEH-L and the TS complex was more like the equivalent 5+1 complex, we find more varied results for the HOMO level, with some cases occurring where the TS and FS complexes are less accurately predicted (Fig. 6 and supplementary material, Table S10). Nevertheless, over the full range of data, the outliers in the HOMO level prediction are more modest than what was observed for ΔEH-L. These trends also hold for the complexes with NH3, H2O, and CO (supplementary material, Fig. S2 and Tables S10–S11).
Given the differences between CS and TS complexes despite having identical stoichiometry, we next identified strategies for improving the interpolation. First, we employed CS and TS complexes along with homoleptic complexes to predict the spin-splitting energetics of the other binary heteroleptic complexes as follows:
where E(5+1) is the interpolated energy of the M(L1)5L2 complex from the M(L1)6 HO energy and the M(L1)4(L2)2 TS energy, and we select the weights of the averaging here and throughout to reflect the stoichiometry of the final complex (i.e., here, one L2 ligand and five L1 ligands).
Similarly, we estimate the FS and MS energies as
where the weights are again chosen stoichiometrically, but we assume that fac complexes, which contain more cis interactions than mer complexes, are a better predictor of cis symmetric interactions and vice versa for the trans case. In practice, this corresponds to computing six energies and interpolating four remaining energies for each pair of ligands for a computational saving of 40%. Indeed, we observe reduced mean absolute error (MAE) over the remaining points that are interpolated (2.4 kcal/mol vs 5.1 kcal/mol) in comparison to the HO-only interpolation (Fig. 5 and supplementary material, Tables S8–S9). In particular, FS complexes of CO and H2O are now correctly predicted to be much more low-spin-directing than the MS complex of the same stoichiometry (Fig. 5). The HOMO levels for these and other complexes are also improved (supplementary material, Tables S10–S11). For example, the modified interpolation is able to capture the fact that MS Fe(II)(CO)4(H2O)2 has a shallower HOMO level than its FS counterpart (Fig. 6). Nevertheless, not all points are uniformly improved by this interpolation. For the case of NH3, CO, and H2O where the interpolation already performed well, some points such as ΔEH-L for 5+1 Fe(II)(CO)5(NH3) are slightly worsened in the modified interpolation (supplementary material, Fig. S1). For the same complex, the HOMO level is equivalently predicted by both interpolation schemes, and most HOMO level estimates are improved (supplementary material, Fig. S2). Overall, errors are on average significantly lower for all properties and sets of ligands considered when the modified interpolation expressions are employed.
Although the CS/TS-derived interpolation schemes greatly reduce errors in estimating the energetics of heteroleptic complexes, they still require significant computational overhead. Thus, we next aimed at identifying if FS and MS complexes, which contain three ligands cis to each other or two ligands cis and two sets of ligands trans, respectively, could be used instead in the interpolation (see Fig. 2). If the FS and MS complexes impart sufficient information, using them in an interpolation scheme along with homoleptic complex properties corresponds to evaluating four properties (e.g., energies) to predict six properties for a computational savings of 60%. In this interpolation scheme, we estimated the complex properties from FS and MS complexes as follows:
where E(5+1) is the interpolated energy of the M(L1)5L1 complex from the M(L1)6 HO energy and the M(L1)3(L2)3 FS energy, and the weights are selected based on stoichiometry. Similarly, we obtain expressions for the CS and TS complexes as
where the weights are chosen to match the stoichiometry of the final complexes and the fac complex is again chosen to better mimic the cis complex. Indeed, using this approach, we achieve errors only slightly larger than that for the CS/TS averaging scheme, with the added benefit of requiring fewer energies to obtain the same fidelity of the interpolation (Fig. 5). For example, the higher spin-splitting energy of CS complexes relative to TS complexes is captured here because it can be directly derived from the strength of the high-spin directing character of the FS complex relative to that of the MS complex (Fig. 5). Mixing of the weak-field NH3 and H2O ligands, which had slightly worsened with CS/TS interpolation, is also significantly improved in this scheme, and the other sets of complexes are of comparable accuracy (supplementary material, Fig. S1 and Tables S8–S9). While, CS/TS interpolation tended to underestimate the HOMO level, FS/MS interpolation slightly overestimates HOMO levels, but errors are much smaller than for the HO-only interpolation (Fig. 6 and supplementary material, Tables S10–S11). For the set of ligands including ammonia, almost all points are predicted to comparable or slightly improved values (supplementary material, Fig. S2). Thus, HO-only interpolation provides a highly efficient scheme for predicting heteroleptic transition metal complexes, but the best trade-off in accuracy and computational cost for interpolation is likely achieved through estimating properties using information from FS and MS complexes.
We next investigated whether we could generalize our observations to heteroleptics with three unique ligand types (Fig. 3). This extension is motivated by the fact that 96% of all mononuclear transition metal complexes in the CSD contain no more than three ligand types, and over 99% of Fe(II) complexes contain three or fewer ligand types (Fig. 1 and supplementary material, Table S1). Expansion to three ligand types introduces 29 complex energies that need computation for any set of three ligands. HO-only averaging performs poorly here for complexes that are mixtures of CH3CN, CO, and H2O, with a large difference between CA and TA complexes of H2O and CO being treated completely equivalently in this scheme (Fig. 7). Similar differences in TA and CA HOMO levels are also missed in this averaging scheme (Fig. 8). Generally, the CA complex spin-splitting energies are better predicted by the homoleptic averaging than the TA are for both ternary complexes with CH3CN and with NH3 (Fig. 7 and supplementary material, Fig. S3 and Tables S12–S13). For the HOMO level, results are more varied, with the HOMO energies of the CH3CN ternary complexes being underestimated, while those of NH3 ternary complexes are overestimated (Fig. 8 and supplementary material, Fig. S4 and Tables S14–S15).
Thus, we next investigated interpolation schemes that reincorporated CS/TS or FS/MS complex energies of the binary heteroleptics. Interpolation of these 29 energies from 12 CS/TS and 3 HO energies or 6 FS/MS and 3 HO energies would still represent significant computational savings. The binary complex energies can also be reused for ternary complexes that share a pair of ligand types, as is the case in the two worked examples presented here. Next, we obtained expressions for all eight symmetry classes of heteroleptics with three ligand types from energies derived from only either FS/MS or CS/TS complexes. Specifically, the FS/MS-interpolated expressions for these heteroleptics are as follows:
where the third ligand in the EA complex is the one that is trans to itself, so energies involving that ligand are derived from the binary MS complexes, whereas the remaining components are derived from FS complexes. Analogous expressions were also obtained for the CS/TS energetics (supplementary material, Text S3). We again obtained these expressions by identifying the most representative interactions (i.e., cis or trans and fac or mer) in the final ternary complex and matching the stoichiometry of the resulting complex. Overall, both interpolation schemes significantly improve the estimation of both spin splitting and HOMO level energies for both sets of ternary complexes in comparison to HO-only interpolation (Figs. 5 and 6 and supplementary material, Figs. S3 and S4 and Tables S14–S16). Differences in CA and TA complex properties are much better predicted by both interpolation schemes, with the CS/TS scheme performing best for the combination that includes CH3CN (Figs. 5 and 6). While this may be expected as a generalization of observations on the binary heteroleptics, it is noteworthy that trends in the fully equivalent stoichiometries (i.e., all three EA isomers, DCS, or DTS complexes) are reasonably well predicted by the improved schemes, whereas they were indistinguishable with simple linear interpolation because they differ solely by cis vs trans positioning effects (Figs. 5 and 6). Overall, errors for the interpolation scheme using FS/MS ligands are sufficiently low to warrant its use in chemical space exploration (supplementary material, Table S16). We have thus demonstrated how over a set of three ligands, nine explicit calculations can be used to interpolate the properties of 18 binary and 29 ternary complexes to around 2 kcal/mol accuracy in spin-splitting energies or 0.1–0.2 eV accuracy in orbital energy levels.
C. Interpolation of chemical space from experimentally characterized complexes
Given the strength of the interpolative trends we observed, we chose Fe(II) complexes from the CSD to explore the potential of interpolation from previously synthesized complexes. For these 1202 complexes, the majority (661) are homoleptics, followed closely by binary complexes (407) and ternary complexes (129), with only six having more ligand types (Fig. 4 and supplementary material, Table S4). We restrict our analysis of chemical space interpolation to monodentate ligands from complexes of each symmetry type that only contain monodentate ligands to simplify the interpolation process because higher-denticity ligands impose geometric constraints that make them incompatible with certain symmetry types. Nevertheless, we note that there are a significant number of monodentate ligands that only appear in combination with higher-denticity ligands (supplementary material, Tables S4–S5). From the set of monodentate-only complexes, we identified 40 unique ligands already present in homoleptic complexes along with 48 additional unique ligands present in monodentate-only binary and ternary complexes for which we could assign a charge following the scheme introduced in Ref. 60. While the procedure is outlined in detail in Ref. 60, we reiterate that the procedure only assigns ligand charges if there are sufficient copies of that ligand in multiple complexes from which a consistent, single charge can be obtained. In practice, this leads to consistency with the octet rule as well.60 Of the set of 88 ligands, only 56 are assigned a neutral charge and closed-shell electronic structure and deemed sterically feasible for homoleptic calculations [i.e., excluding only the neutral OP(Ph)3 ligand].
We then computed the adiabatic spin-splitting energies and singlet HOMO levels of all 56 homoleptic Fe(II) complexes (see Sec. II). This set is strongly biased toward high-spin structures, with only a few ligands giving rise to low-spin ground states (Fig. 9 and supplementary material, Fig. S5 and Table S17). The majority of spin-splitting energies are in the range of −30 to 0 kcal/mol, whereas a minority of complexes are low-spin (Fig. 9). Only one complex, Fe(II)(CO)6, which was present in our study in Sec. III B, has a HOMO level deeper than −16 eV, and no complexes have intermediate HOMO levels (ca.−14 eV) while also sampling near-degenerate spin states (Fig. 9). While the preference for high spin could be attributed to the strong sensitivity of ΔEH-L to the choice of the functional,86–94 reducing the amount of Hartree–Fock exchange (here, to 10% to exaggerate the effect in comparison to 15% recommended in Ref. 88) will not significantly alter the observation that there are regions of the HOMO level and ΔEH-L values that homoleptic complexes rarely sample (supplementary material, Fig. S6).
We next interpolated estimates for the 816 200 binary and ternary complexes from the 56 homoleptic complexes using the HO-only averaging scheme (Fig. 9). By definition, this interpolation scheme provides rough estimates of which ligand combinations will enrich ΔEH-L/HOMO energetic pairings not observed in the homoleptic set (Fig. 9). Nevertheless, this approach provides only a coarse estimate of the energetics in comparison to interpolation schemes that use knowledge of binary complex energetics. However, even for the most data-efficient approach of FS/MS-derived interpolation, a pool of 56 ligands would require 3080 explicit FS/MS calculations to achieve high-fidelity predictions for 816 200 complexes (see Table I). Thus, we identified a way to use the HO-only interpolation to reduce the number of unique complexes required to achieve high fidelity within a target region.
First, we select a targeted region of ΔEH-L values in the range of −4.0 to 4.0 kcal/mol and HOMO levels in the range of −14 to −13 eV. This region was selected because no homoleptic complexes were present in this range, but interpolation of the space predicted that heteroleptic complexes would be found there. These complexes could be of interest in chemical discovery applications that target spin-crossover (SCO) candidates (i.e., with near-degenerate spin states) with good oxidative stability (i.e., deep HOMO levels). We primarily select these two metrics as an illustrative example for simultaneous screening because few SCOs have deep HOMO levels. Next, we identify 12 common ligands from the parent homoleptic complexes that are predicted to give rise most frequently to binary and ternary complexes in the targeted zone on the basis of the simple HO-only model (Fig. 10 and supplementary material, Table S18 and Fig. S7). These ligands consist of common ligands from our original set (i.e., CO, H2O, CH3CN, and NH3) but also introduce new chemistry (e.g., S-coordinating dimethylthioformamide, DMTF, and 2-chloropyrazine, ClPyz, supplementary material, Table S18). From this set, only 132 additional FS/MS complexes need to be studied beyond the 12 homoleptic complexes we already computed to infer the 6776 additional HOMO level or ΔEH-L properties at higher fidelity than HO-only averaging (see Table I). We carried out geometry optimizations of these 132 complexes and used their properties to evaluate the revised interpolated HOMO level and ΔEH-L values over this subset. Evaluating this subset also provides a validation of the accuracy of the homoleptic averaging over a larger set of ligands. Over this set, we observe that errors are comparable to the earlier tests (i.e., MAE of 4 kcal/mol), and homoleptic averaging is generally a good predictor of ΔEH-L (supplementary material, Fig. S8 and Table S19). For HOMO level predictions, there are more points that differ from the interpolated values, leading to higher MAEs (0.8 eV) than observed over our representative test ligands (supplementary material, Fig. S8 and Table S19).
Nevertheless, the calculations on FS and MS complexes also highlight the limits of homoleptic averaging for seeking targeted properties. Of the 132 FS or MS calculations carried out, three are found to be in our targeted zone, MS Fe(II) (MeCN)3(NH3)3 and both FS and MS Fe(II)(CH3CN)3(MeOH)3 (supplementary material, Table S20). None of these three complexes were predicted to be in the zone from homoleptic averaging alone, with the MS Fe(II)(MeCN)3(NH3)3 ΔEH-L underestimated (predicted: −7.3 kcal/mol vs actual −3.6 kcal/mol), while FS/MS Fe(II)(CH3CN)3(MeOH)3 ΔEH-L was overestimated (predicted: 9.0 kcal/mol vs 2.9 and −0.8 kcal/mol, respectively, supplementary material, Table S20). For these same complexes, the homoleptic averaging predicted HOMO levels very well, within around 0.1 eV (i.e., lower error than for ΔEH-L, supplementary material, Table S20).
Returning to the properties predicted from FS/MS-interpolation, we identified a total of three additional binary complexes predicted to be in the targeted zone from FS/MS-interpolation but not in the targeted zone according to HO-only interpolation and we computed their properties (Fig. 10). These complexes are 5+1 Fe(II)(MeCN)5(CO) and CS/TS Fe(II) (ClPyz)4(CO)2 (Fig. 11 and supplementary material, Table S20). Indeed, all three of these complexes have HOMO levels in the targeted zone, while the ΔEH-L values are close to [4.9–5.7 kcal/mol for CS/TS Fe(II)(ClPyz)4(CO)2] or in the targeted zone [1.5 kcal/mol for 5+1 Fe(II)(MeCN)5(CO), supplementary material, Table S20]. For CS/TS Fe(II)(ClPyz)4(CO)2, homoleptic averaging had underestimated both the ΔEH-L and HOMO level (i.e., out of the zone at −15.32 eV) and could not distinguish between CS and TS isomers (supplementary material, Table S20). Although FS/MS interpolation slightly underestimated the ΔEH-L values for CS/TS Fe(II)(ClPyz)4(CO)2, the errors are within what we had observed over other sets. In addition to shifting which compounds are predicted to fall within the target zone, the difference between the HO-only to FS/MS interpolation schemes causes some regions of property space to be predicted to be enriched, while others are predicted to be depleted (Fig. 10).
As a control, we also identified two complexes predicted to be in the zone by the HO-only interpolation but out of zone by FS/MS-interpolation, CS/TS Fe(II)(MeOH)4(CH3CN)2 (supplementary material, Table S20). The homoleptic averaging predicts the ΔEH-L value for both complexes to be −2.6 kcal/mol, while the calculated ΔEH-L values are significantly lower (−9.1 and −13.9 kcal/mol). The FS/MS interpolation captures well the relative CS/TS energetics in both this case and the in-zone example and only slightly overestimates the ΔEH-L value (supplementary material, Table S20). Overall, FS/MS-interpolation MAEs for these binary complexes are low (2.5 kcal/mol for ΔEH-L and 0.1 eV for HOMO levels) and less than half that observed from HO-only averaging (supplementary material, Table S20).
As a more stringent test of our FS/MS interpolation scheme, we also selected six representative ternary complexes predicted to be in the targeted zone from FS/MS interpolation but out of the targeted zone when estimated with homoleptic interpolation (Fig. 11 and supplementary material, Table S21). All complexes were generally predicted to be more high-spin favoring by homoleptic averaging than predicted from FS/MS interpolation, and several were also predicted to have deeper HOMO levels than the targeted region (supplementary material, Table S21). Over this set, explicit DFT calculations show that five of the complexes have ΔEH-L values in the targeted zone and four have HOMO levels in the targeted zone (supplementary material, Table S21). These include FA Fe(II)(ClPyz)3(CO)2(DMTF), which was predicted to be strongly HS by homoleptic averaging (ΔEH-L = −9.0 kcal/mol) but was much closer to the FS/MS-interpolated value (calculated ΔEH-L = −2.2 kcal/mol vs FS/MS ΔEH-L = −0.1 kcal/mol, supplementary material, Table S21). For the worst-performing example, FA Fe(II)(DMTF)3(CO)2(H2O), FS/MS interpolation overestimates ΔEH-L by 5 kcal/mol (−3.0 kcal/mol vs −8.2 kcal/mol) and predicts a deeper HOMO level (−13.20 eV vs −12.55 eV, supplementary material, Table S21). This could be due to the weak coordination of the metal by water, which leads to a more stabilized high-spin state. Performance of FS/MS interpolation on the remaining CA and TA complexes ranges from good (∼2–3 kcal/mol errors) to exceptional in the case of CA Fe(II)(ClPyz)4(CNH)(NH3) where errors on ΔEH-L are below 0.1 kcal/mol and the HOMO level are around 0.05 eV (supplementary material, Table S21). Overall errors are low from FS/MS interpolation at around 2.5 kcal/mol for ΔEH-L and 0.4 eV for the HOMO level, roughly half their values from homoleptic averaging. These results demonstrate that coarse interpolation of large spaces with homoleptic complexes can be followed up by improved interpolation using selected FS/MS compounds to identify the most promising binary or ternary complexes for explicit calculation. Overall, the demonstrated approach represents a data-efficient strategy to infer properties across large compound spaces with systematically improvable fidelity.
IV. CONCLUSIONS
A large-scale analysis of the mononuclear octahedral complexes deposited in the CSD revealed a propensity toward specific, higher-symmetry classes. In addition, few complexes contained more than three unique ligand types. To assess the relative diversity of these complexes compared to the enumerated chemical space, we obtained expressions for the theoretical number of complexes of the five binary and eight ternary symmetry classes for octahedral complexes. We showed that even for a relatively small number of neutral, monodentate ligands present in Fe(II) complexes, the total theoretical space of 816 200 binary and ternary complexes far exceeded those that had been characterized in the CSD.
An aim of identifying which uncharacterized compounds are most likely to be valuable or informative motivated our evaluation of interpolative schemes to determine the extent to which heteroleptic complex properties could be inferred from parent homoleptic complexes. Over representative test cases, we observed that a linear weighted averaging of homoleptic properties could reasonably (to ∼4 kcal/mol for ΔEH-L and 0.24 eV for the HOMO level) predict properties of binary and ternary heteroleptic complexes. We demonstrated a refinement of the approach to be able to distinguish isomers (e.g., CS vs TS or CA vs TA) by using expressions that also incorporated either CS/TS or FS/MS binary complexes at a slightly higher computational cost but with errors that were half as large (∼2 kcal/mol for ΔEH-L and 0.15 eV for the HOMO level). The most data-efficient approach required four FS/MS and three homoleptic energies (i.e., 7 total) to infer 18 binary and 29 ternary complex properties.
Finally, we demonstrated a two-stage discovery approach to leverage and validate our interpolative schemes. We first used 56 homoleptic Fe(II) complexes composed of neutral, closed-shell monodentate ligands to infer the properties of 816 200 binary or ternary complexes of these ligands using HO-only averaging. We then defined a targeted zone of the HOMO level and ΔEH-L that contained none of the homoleptic complexes. To avoid explicit calculations for all (∼3000) FS/MS complexes needed to achieve high fidelity over the full range of ligands, we then refined our analysis to the top 12 most frequently occurring ligands predicted to be in the targeted zone. From this set, we studied 66 each of FS and MS complexes to refine our interpolation of 6776 complexes. This approach helped us to identify 3 FS/MS complexes in the targeted zone that had not been predicted by homoleptic averaging alone. It also had a higher validation rate for binary and ternary complexes than homoleptic averaging, with all FS/MS-interpolation predicted complexes residing in the targeted zone or just outside it. Overall, errors for ΔEH-L of around 5 kcal/mol with homoleptic averaging and 2 kcal/mol with FS/MS-interpolation are also comparable to prior machine learning (i.e., artificial neural network) model predictions on similar datasets.51 Thus, this approach represents a promising multi-stage strategy for efficient chemical space exploration at low cost: an initial coarse interpolation from homoleptic complexes can be systematically refined by incorporating cis and trans isomer effects over a smaller subspace of ligands. While demonstrated here for magnetic and orbital energy properties, this approach is expected to have similar applicability in predicting other properties where ligands can be expected to behave in an approximately additive manner, such as in redox potentials or catalysis. This observed additivity could also be integrated into machine learning model property predictions or used synergistically to augment datasets for machine learning.
SUPPLEMENTARY MATERIAL
See the supplementary material for the details of mononuclear octahedral complex curation from the CSD; details of the complex symmetry class assignment; statistics of CSD complexes by a number of unique ligands; statistics of CSD complexes by symmetry classes; denticity of ligands in computation-ready CSD complexes by symmetry classes; denticity of ligands in Fe(II) CSD complexes by symmetry classes; number of unique monodentate ligands in Fe(II) complexes; sampled vs theoretical number of complexes with monodentate ligands; homoleptic complex properties, ΔEH-L and the HOMO level; interpolated vs calculated ΔEH-L for transition metal complexes with pairs of NH3/H2O/CO; interpolated and actual ΔEH-L for pairs of H2O/CO and NH3 or CH3CN; interpolated vs calculated HOMO for transition metal complexes with pairs of NH3/H2O/CO; MAEs for ΔEH-L binary complex estimates from interpolation schemes; interpolated and actual HOMO for pairs of H2O/CO and NH3 or CH3CN; MAEs for HOMO binary complex estimates from interpolation schemes; interpolated vs calculated ΔEH-L for transition metal complexes with three ligands: NH3/H2O/CO; interpolated and actual ΔEH-L for ternary H2O/CO/CH3CN complexes; interpolated and actual ΔEH-L for ternary H2O/CO/NH3 complexes; interpolated vs calculated HOMO for transition metal complexes with three ligands: NH3/H2O/CO; interpolated and actual HOMO for ternary H2O/CO/CH3CN complexes; interpolated and actual HOMO for ternary H2O/CO/NH3 complexes; CS/TS-derived interpolation expressions for ternary complexes; MAEs of HOMO and ΔEH-L for ternary complexes; properties of 56 homoleptic Fe(II) complexes; scatter of HOMO and ΔEH-L for 56 homoleptic Fe(II) complexes; 12 most common ligands in the target zone for Fe(II) complexes; homoleptic-only interpolated target zone for Fe(II) complexes; parity plot for predicted vs calculated 132 FS/MS Fe(II) complexes; MAEs for FS and MS complexes from HO-only averaging; binary complexes in the validation set for interpolation; and ternary complexes in the validation set for interpolation.
ACKNOWLEDGMENTS
This work was primarily supported by the Office of Naval Research under Grant No. N00014-20-1-2150. A.N., N.A., and D.B.K.C. were partially supported by the National Science Foundation Graduate Research Fellowships under Grants No. 1122374 (to A.N.) and Grant No. 1745302 (to N.A. and D.B.K.C.). C.D. was partially supported by a seed fellowship from the Molecular Sciences Software Institute under NSF Grant No. OAC-1547580. The authors thank Adam H. Steeves for a critical reading of the manuscript.
The authors declare no competing financial interest.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
N.A., S.G., and M.G.T. contributed equally to this work.
Naveen Arunachalam: Data curation (equal); Formal analysis (equal); Writing – original draft (equal); Writing – review & editing (equal). Stefan Gugler: Conceptualization (supporting); Formal analysis (equal); Methodology (equal). Michael G. Taylor: Data curation (equal); Formal analysis (equal); Writing – original draft (equal); Writing – review & editing (equal). Chenru Duan: Data curation (equal); Formal analysis (equal); Methodology (equal); Validation (equal); Writing – review & editing (equal). Aditya Nandy: Supervision (equal); Writing – review & editing (equal). Jon Paul Janet: Data curation (supporting); Formal analysis (supporting); Methodology (supporting); Writing – original draft (supporting); Writing – review & editing (supporting). Ralf Meyer: Data curation (supporting); Formal analysis (supporting); Methodology (supporting); Writing – review & editing (equal). Jonas Oldenstaedt: Formal analysis (supporting); Methodology (supporting); Writing – review & editing (equal). Daniel B. K. Chu: Formal analysis (supporting); Methodology (supporting); Writing – review & editing (supporting). Heather J. Kulik: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Project administration (equal); Supervision (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are available within the article and its supplementary material.