The success of deep learning has driven the proliferation and refinement of numerous non-convex optimization algorithms. Despite this growing array of options, the field of nanophotonic inverse design continues to rely heavily on quasi-Newton optimizers such as L-BFGS and basic momentum-based methods such as Adam. A systematic survey of these and other algorithms in the nanophotonics context remains lacking. Here, we compare 24 widely used machine learning optimizers on inverse design tasks. We study two prototypical nanophotonics inverse design problems—the mode splitter and wavelength demultiplexer—across various system sizes, using both hand-tuned and meta-learned hyperparameters. We find that Adam derivatives, as well as the Fromage optimizer, consistently outperform L-BFGS and standard gradient descent, regardless of system size. While meta-learning has a negligible-to-negative impact on Adam and Fromage, it significantly improves others, particularly AdaGrad derivatives and simple gradient descent, such that their performance is on par with Adam. In addition, we observe that the most effective optimizers exhibit the lowest correlation between initial and final performance. Our results and codebase (github.com/Ma-Lab-Cal/photonicsOptComp) provide a valuable framework for selecting and benchmarking optimizers in nanophotonic inverse design.
I. INTRODUCTION
Nanophotonic devices, which manipulate light inside of integrated circuits using nanoscale structures, have a large and growing range of applications, from medical imaging1 and data center switches2 to all-optical logic units,3 physical neural networks,4 and photonic quantum computers.5 Traditional methods for designing nanophotonic devices require a human engineer to determine the overall geometry, with only a handful of parameters machine optimized.6,7 In contrast, with inverse design (InvDes), the designer simply specifies desired performance and fabrication constraints and allows a computer algorithm to form the device layout. The design is treated as an optimization problem, with the target functionality and constraints encoded in a scalar cost function and an optimizer used to iteratively update parameters defining the structure based on electromagnetic simulation results.
Driven by a need for components that are smaller, better performing, and more multifunctional, packed into unconventional footprints in the ever-shrinking confines of integrated circuits, nanophotonic InvDes has rapidly matured in the past decade.8–10 With refined methods for parameterization and ensuring compliance with foundry design rules, alongside new graphics processing unit (GPU)-accelerated electromagnetic simulators, modern InvDes algorithms can reliably yield designs with superior performance-per-area than their rationally designed counterparts.11–13
Much less attention, however, has been paid to the optimizers. Historically, quasi-Newton methods have dominated the literature, with the L-BFGS algorithm being the most popular.14–16 These optimizers remain extremely common, although the Adam optimizer and simple gradient descent have also gained increasing traction due to their success in neural network training.8,13,17
There remains little diversity, however, in the choice of optimizer for nanophotonic inverse design, and choosing the best optimizer is largely a matter of intuition. Yet the optimizer is of critical importance, as the optimization landscape is highly non-convex.18 Results are strongly initialization-dependent, as the local optimum found by the algorithm depends upon where in the design space the search begins. A good optimizer should balance exploration of the design space, to avoid low-performing local optima, with exploitation, to quickly isolate high-performing local optima.19,20 This potentially conflicts with the common strategy of accepting or rejecting initializations after a few optimization iterations in order to quickly survey the design space,18 as the best optimizers should be the least dependent upon initialization. It also raises the question of optimal system size. Just as there is an optimal network complexity in many deep learning problems,21,22 there may be an optimal system size in InvDes problems, balancing higher achievable performance against the difficulty of finding good designs.
A broad survey of both classic and newly developed optimizers for InvDes, similar to those performed for other machine learning tasks,19,20 would be of great use in developing a more systematic approach to optimizer selection. Similarly, a study of design performance as a function of system size and of the correlation between design performance at different stages of optimization across different optimizers is needed to find better methods of initialization selection.
Here, we apply 24 different gradient-based and quasi-Newton optimizers to InvDes tasks. We take two benchmark design problems—the mode splitter and the wavelength demultiplexer—and compare the optimizers’ relative performance on each. We find that, without meta-learning, most Adam optimizers perform the best across all system sizes, with Fromage also performing well and L-BFGS lagging. Meta-learning, optimizing optimizer hyperparameters in tandem with the design parameters, benefits most other optimizers and leaves many on par with Adam but has a negligible-to-negative impact on Adam’s performance. Larger design regions yield better results for the wavelength demultiplexer, but intermediate system sizes are ideal for the simpler mode splitter. We find a weak correlation between design performance very early in the optimization and final performance, with the best optimizers having the weakest correlation after long optimizations.
We now begin with a brief description of the InvDes workflow and introduce the different optimizers, followed by the design tasks and survey results.
II. METHODS
Solving Eq. (1) requires three components: a parameterization strategy, which determines how ϕ is mapped to the structure and what terms are included in L throughout the optimization; a simulator, which solves f(ϕ, e) = 0 given ϕ; and an optimizer, which iteratively updates ϕ to minimize L. We detail our choices for each of these components below.
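Although Eq. (1) itself is not reproduced in this section, its structure is implied by the definitions above; the following is a plausible reconstruction, written as an assumption rather than a quotation of the original, with Lp denoting the performance term and Lf the fabrication penalty introduced later:

```latex
% Hedged reconstruction of Eq. (1): minimize the scalar cost over the design
% parameters phi, with the fields e constrained to satisfy the discretized
% Maxwell equations f(phi, e) = 0.
\min_{\phi} \; L(\phi, e) = L_p(e) + L_f(\phi)
\quad \text{subject to} \quad f(\phi, e) = 0.
```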
A. Parameterization strategy
We use the three-field method44 for our parameterization strategy. We break the optimization into four phases (Fig. 1). First, we use a continuum phase, in which we map ϕ to a continuum of permittivities between that of our cladding material (silicon dioxide) and that of our design material (silicon). This allows the algorithm to easily tunnel between designs that might be inaccessible in a binarized device.
Next, we use a discretization phase, in which we gradually force ϕ to map to either silicon dioxide or silicon, rather than allowing intermediate permittivities. At the end of this phase, the device is a binary silicon dioxide–silicon structure. This is followed by a settling phase, in which the binary design is further optimized.
Until now, we have neglected the fabrication penalty term Lf. A large number of schemes have been developed for both enforcing fabrication constraints during the optimization process45–49 and during post-processing.50 We adopt the popular three-field scheme.44 During the final DRC (design rules check) phase, we add Lf to L and let the algorithm alter the device to improve fabricability, sacrificing performance as necessary. For more details on the implementation and the explicit form of Lf, see Zhou et al.44 and our codebase.51
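As an illustration of how such a parameterization can be implemented, the following is a minimal sketch in JAX, assuming a Gaussian blur for the filtering step and a tanh projection whose steepness β is ramped during the discretization phase; the exact kernels, thresholds, and permittivity interpolation used in our codebase may differ.

```python
import jax
import jax.numpy as jnp
from jax.scipy.signal import convolve

EPS_SIO2, EPS_SI = 2.1, 12.1  # approximate relative permittivities near 1550 nm

def gaussian_kernel(radius_px=3, sigma_px=1.5):
    # Normalized 2D Gaussian smoothing kernel used for the filtering step.
    x = jnp.arange(-radius_px, radius_px + 1)
    xx, yy = jnp.meshgrid(x, x)
    k = jnp.exp(-(xx**2 + yy**2) / (2 * sigma_px**2))
    return k / k.sum()

def three_field(phi, beta, eta=0.5):
    """Latent field -> filtered field -> projected field -> permittivity map."""
    latent = jax.nn.sigmoid(phi)                          # densities in [0, 1]
    filtered = convolve(latent, gaussian_kernel(), mode="same")
    projected = (jnp.tanh(beta * eta) + jnp.tanh(beta * (filtered - eta))) / (
        jnp.tanh(beta * eta) + jnp.tanh(beta * (1.0 - eta)))
    return EPS_SIO2 + (EPS_SI - EPS_SIO2) * projected     # interpolate permittivity
```

During the continuum phase, β is small, so intermediate permittivities are permitted; ramping β during the discretization phase pushes the projected field toward a binary silicon dioxide–silicon structure.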
B. Simulator
We use the finite difference time domain (FDTD)52 method for our simulations via the fdtdz53 and pjz54 packages. In all cases, a 40 nm cubic lattice was used, with each time step lasting 0.08 fs. Simulations were run long enough to reach a steady state before extracting electric fields. For a 4 μm2 or smaller design region, simulation lasts for 200 fs (38 periods of 1550 nm light); for a 16 μm2 or smaller region, simulation lasts for 240 fs (46 periods); and for larger structures (up to 36 μm2), simulation lasts for 320 fs (61 periods). Due to the scale invariance of Maxwell’s equations, the results would be the same, provided both wavelength and cell size are scaled by the same factor.
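As a quick arithmetic check of the quoted period counts (nothing here beyond the numbers already stated above):

```python
# One optical period of 1550 nm light is ~5.17 fs; the simulation windows above
# therefore contain roughly 38, 46, and 61 full periods, respectively.
c = 299_792_458.0                       # speed of light, m/s
period_fs = 1550e-9 / c * 1e15          # ~5.17 fs
for t_fs in (200, 240, 320):
    print(f"{t_fs} fs -> {int(t_fs / period_fs)} periods")
```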
The design region was 240 nm thick silicon, embedded in 2 μm of silicon dioxide cladding in all directions [see Fig. 2(a)]. In the x and y directions, within the plane of the design region, adiabatic absorption boundary conditions were used, extending another 2 μm beyond the cladding. In the z direction, perfectly matched layers (PMLs) were used, embedded in the last six layers of cladding. Single-mode input and output waveguides were 400 nm wide, and dual-mode waveguides were 720 nm wide. The target minimum feature size for both silicon and silicon dioxide structures was 90 nm during the DRC phase.
Mode splitter results. (a) Design target: a dual-mode input is split between two outputs, with the TE00 component diverted to the lower output and the TE10 mode converted to TE00 and sent to the upper output. Perfect performance corresponds to a cost function of Lp = −2. (b) Cost function, averaged over all initializations, as a function of wall-clock time for select optimizers acting on a 100 × 100 design space without (left) and with (right) meta-learning. (c) Total cost function at the end of optimization for every optimizer and initialization without (upper) and with (lower) meta-learning. For each optimizer, the three sub-columns are 50 × 50, 100 × 100, and 150 × 150 design regions, respectively.
C. Optimizer
Optimizers seek out the minimum of L by iteratively updating ϕ based on the simulation results. We focus on gradient descent (GD) and quasi-Newton (QN) optimizers, both first-order methods that require an explicit first derivative dL/dϕ to determine descent direction and do not require an explicit second derivative (as would be needed by second-order methods such as the Newton optimizer). Such first-order gradients are efficiently obtained with the adjoint method,16 a special case of the backpropagation method used to train neural networks. GD methods use either a fixed learning rate or a heuristic based on past gradients and loss evaluations to dynamically choose the learning rate. QN methods approximate the second derivative using the first derivative to make an educated guess of the optimal learning rate, then evaluate the cost function at multiple points along the descent direction to confirm or refine this choice in a process called “line search.”
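To make the distinction concrete, the following is a minimal sketch of a fixed-learning-rate GD update alongside an Adam-style adaptive update; in practice, the gradient would come from the adjoint method, and this toy form is an illustration rather than the codebase's implementation.

```python
import jax.numpy as jnp

def gd_step(phi, grad, lr):
    # Plain GD: a fixed step along the negative gradient.
    return phi - lr * grad

def adam_step(phi, grad, m, v, t, lr, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: per-parameter steps scaled by running first/second moment estimates.
    m = b1 * m + (1 - b1) * grad          # first moment (momentum)
    v = b2 * v + (1 - b2) * grad**2       # second moment (per-parameter scale)
    m_hat = m / (1 - b1**t)               # bias corrections for early iterations
    v_hat = v / (1 - b2**t)
    return phi - lr * m_hat / (jnp.sqrt(v_hat) + eps), m, v
```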
For InvDes tasks, which typically have many fewer parameters than modern deep learning models, QN methods (particularly L-BFGS43) are believed to perform well due to their consistency and robustness against numerical artifacts, while momentum-assisted GD methods (particularly Adam23) are rapidly growing in popularity thanks to their speed, tunability, and ability to evade shallow minima.55 The deep learning community has, however, developed a wide range of GD algorithms, many of them derivatives of Adam or other “classic” optimizers such as AdaGrad,31 but many quite distinct, such as Lion42 or LARS.36 These have each been shown to outperform their peers on specific deep learning tasks (see the references in Table I) but have never been explored for nanophotonic inverse design.
Optimizer parameters. “LR” refers to the learning rate.
| Family | Optimizer | Parameters | Description | Ref. |
| --- | --- | --- | --- | --- |
| Adam family | Adam | LR = 1, b1 = 0.9, b2 = 0.999 | Uses current grad and moving average of first and second moments of previous grads to modify the learning rate of each parameter individually | 23 |
| | AdaBelief | LR = 0.1, b1 = 0.9, b2 = 0.999 | Takes smaller steps when momentum and grad disagree | 24 |
| | Adamax | LR = 1, b1 = 0.9, b2 = 0.999 | Second moment replaced with max of second moment and current grad | 23 |
| | AdamaxW | LR = 1, b1 = 0.9, b2 = 0.999 | Combination of Adamax and AdamW | 25 |
| | AdamW | LR = 0.1, b1 = 0.9, b2 = 0.999 | Shifts weight decay regularization term out of moving average | 26 |
| | AMSGrad | LR = 1, b1 = 0.9, b2 = 0.999 | Running average replaced by running maximum | 27 |
| | NAdam | LR = 0.1, b1 = 0.9, b2 = 0.999 | Uses Nesterov momentum, advancing parameters along the momentum direction a bit before computing the grad | 26 |
| | NAdamW | LR = 1, b1 = 0.9, b2 = 0.999 | Combination of NAdam and AdamW | 26 |
| | Novograd | LR = 10, b1 = 0.9, b2 = 0.25 | Second moment replaced with norm of second moment | 28 |
| | RAdam | LR = 10, b1 = 0.9, b2 = 0.999 | Rectifies the momentum to reduce variance in early phases | 29 |
| | Yogi | LR = 10, b1 = 0.9, b2 = 0.999 | Reduces the learning rate when grads vary considerably between iterations | 30 |
| AdaGrad family | AdaGrad | LR = 100 | Modifies the learning rate for each parameter by the square root of the element-wise sum of the outer product of all previous grads | 31 |
| | AdaDelta | LR = 100, ρ = 0.9 | Uses previous grads to approximate a second derivative | 32 |
| | AdaFactor | LR = 0.1, decayRate = 0.8 | Stores only row/column sums of past grads to reduce memory use | 33 |
| | RMSProp | LR = 0.1, decay = 0.9 | Replaces the sum of past grads with an exponential moving average | 34 |
| | SM3 | LR = 10, decay = 0.9 | Stores only the extrema of the past grads to reduce memory use | 35 |
| LARS family | LARS | LR = 100, weightDecay = 0.1, ɛ = 10⁻⁹ | Uses a uniform learning rate, scaled by the L2 norm of a running weight divided by the L2 norm of the current grad | 36 |
| | Fromage | LR = 0.1 | Rescales the LR to prevent it from increasing without bound | 37 |
| | Lamb | LR = 0.1, b1 = 0.9, b2 = 0.999, weightDecay = 0.1 | Incorporates the second moment of Adam into the LARS algorithm | 38 |
| GD family | GD | LR = 100 | Each update is simply a user-chosen learning rate times the grad | 39 |
| | OptimGD | LR = 100, α = 1, β = 1 | Uses current/previous grad to predict the next grad and adjusts the LR accordingly | 40 |
| | NoisyGD | LR = 100, η = 10⁻⁴, γ = 2 | Injects artificial Gaussian noise into the grad | 41 |
| Lion family | Lion | LR = 0.1, b1 = 0.9, b2 = 0.99, weightDecay = 0.001 | Itself created by a ML algorithm, uses the element-wise sign of the first moment to adapt the learning rate of each parameter | 42 |
| Quasi-Newton | L-BFGS | N/A | Approximates the second derivative using all previous grads and performs a line search to optimize the learning rate each iteration | 43 |
We classify our library of optimizers into six families, each consisting of variations of the main algorithm. Each algorithm, along with a brief description and our choice of hyperparameters, can be found in Table I. The learning rates were chosen based on an order-of-magnitude sweep, testing learning rates from 0.001 to 100 for each optimizer on a wavelength demultiplexer optimization (see Fig. S1 of the supplementary material).
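Most of the algorithms in Table I have reference implementations in the optax library; the sketch below shows how a few representative entries could be instantiated with the hand-tuned learning rates and how the order-of-magnitude sweep might be organized. The use of optax and the `run_design` helper are assumptions for illustration, not a description of our codebase.

```python
import optax

# A few representative optimizers with the hand-tuned settings from Table I.
optimizers = {
    "Adam":    optax.adam(learning_rate=1.0, b1=0.9, b2=0.999),
    "Fromage": optax.fromage(learning_rate=0.1),
    "AdaGrad": optax.adagrad(learning_rate=100.0),
    "GD":      optax.sgd(learning_rate=100.0),
}

def sweep_learning_rates(make_opt, run_design,
                         lrs=(1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0)):
    """Order-of-magnitude sweep; `run_design(opt)` is a hypothetical helper that
    runs one demultiplexer optimization and returns the final cost L."""
    return {lr: run_design(make_opt(lr)) for lr in lrs}

# Example: results = sweep_learning_rates(lambda lr: optax.adam(lr), run_design)
```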
D. Survey procedure
In order to survey the performance of our optimizers across system sizes with and without meta-learning, we use two InvDes problems: an “easy” design task in the two-mode splitter [Fig. 2(a)] and a “hard” design task in the triple wavelength demultiplexer [Fig. 5(a)]. For each problem, we randomly generate seven initializations with spatially correlated Gaussian noise of varying amplitude for each of the three design region sizes (see Figs. 3 and 6 for examples of initializations). We apply each optimizer to the same seven initializations at each system size and determine the onset and duration of each optimization phase by elapsed wall-clock time.
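One simple way to generate such spatially correlated Gaussian noise is to low-pass filter white noise; the following is a hedged sketch, with the correlation length and amplitude list chosen purely for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def correlated_noise_init(shape, amplitude, corr_px, seed):
    """White Gaussian noise smoothed to the desired correlation length,
    rescaled to the requested amplitude and centered on a 0.5 density."""
    rng = np.random.default_rng(seed)
    if amplitude == 0:
        return np.full(shape, 0.5)
    noise = gaussian_filter(rng.standard_normal(shape), sigma=corr_px)
    noise *= amplitude / noise.std()
    return np.clip(0.5 + noise, 0.0, 1.0)

# Seven illustrative initializations for a 100 x 100 design region.
inits = [correlated_noise_init((100, 100), amp, corr_px=5, seed=i)
         for i, amp in enumerate((0.0, 0.05, 0.1, 0.1, 0.2, 0.2, 0.4))]
```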
When meta-learning was performed, the meta-optimizer was unmodified GD with a learning rate of 1. Only the learning rate hyperparameter was meta-learned, allowed to vary exponentially by up to an order of magnitude in either direction, with other hyperparameters held constant. Four iterations of structure optimization were performed for every one meta-optimization iteration. Meta-learning is not applied to L-BFGS, as the learning rate is set by the line search.
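A hedged sketch of this meta-learning loop follows, using a toy differentiable loss in place of the simulated cost and an unrolled four-step inner loop to obtain the hypergradient of the log-learning-rate; the codebase may compute the hypergradient differently.

```python
import jax
import jax.numpy as jnp

def loss(phi):                      # placeholder for the simulated cost L(phi)
    return jnp.sum((phi - 1.0) ** 2)

def inner_unroll(log10_lr, phi, n_steps=4):
    """Run n_steps of GD on phi with learning rate 10**log10_lr; return final loss."""
    lr = 10.0 ** log10_lr
    for _ in range(n_steps):
        phi = phi - lr * jax.grad(loss)(phi)
    return loss(phi), phi

base_log10_lr = 0.0                 # log10 of the hand-tuned LR from Table I
log10_lr = base_log10_lr
phi = jnp.zeros(16)
meta_lr = 1.0                       # meta-optimizer: unmodified GD with LR = 1

for meta_step in range(10):
    # Hypergradient of the post-unroll loss with respect to the log-learning-rate.
    (final_loss, phi), meta_grad = jax.value_and_grad(
        inner_unroll, has_aux=True)(log10_lr, phi)
    log10_lr = log10_lr - meta_lr * meta_grad
    # Keep the LR within one order of magnitude of the hand-tuned value.
    log10_lr = jnp.clip(log10_lr, base_log10_lr - 1.0, base_log10_lr + 1.0)
```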
When meta-learning is not performed, cosine-annealing with warmup56,57 is used for each of the four optimization phases independently, with both the initial and final learning rates one tenth the maximum learning rate shown in Table I. Warm-up lasts for 10% of the duration of each phase.
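This per-phase schedule maps directly onto a warmup-plus-cosine-decay schedule; a sketch assuming optax, with an illustrative number of steps per phase:

```python
import optax

def phase_schedule(peak_lr, steps_per_phase):
    """Cosine annealing with warmup for one phase: start and end at peak_lr / 10,
    with warmup over the first 10% of the phase."""
    return optax.warmup_cosine_decay_schedule(
        init_value=peak_lr / 10,
        peak_value=peak_lr,
        warmup_steps=int(0.1 * steps_per_phase),
        decay_steps=steps_per_phase,
        end_value=peak_lr / 10,
    )

# Example: Adam (peak LR = 1 from Table I) over an illustrative 200-step phase.
opt = optax.adam(learning_rate=phase_schedule(peak_lr=1.0, steps_per_phase=200))
```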
Both optimizer and meta-optimizer learning rates are relatively large,58 as this was observed to significantly accelerate convergence without sacrificing final performance, and meta-learning consistently favored large learning rates throughout most of the optimization process (Fig. S1 of the supplementary material). The larger ideal learning rates, compared with those typical of deep learning problems, may be explained by the fact that inverse design uses full gradient information at every iteration rather than batches of training data; larger strides therefore introduce less jitter than they would in a deep learning problem.
E. Hardware
All mode splitter results were obtained using an Nvidia GeForce RTX 4070, requiring on average 0.76/1.1/1.5 s per iteration for a 50 × 50/100 × 100/150 × 150 cell design region, respectively. All wavelength demultiplexer results were obtained using an Nvidia GeForce RTX 4080 Super, requiring on average 1.8/2.4/2.6 s per iteration for a 90 × 90/115 × 115/140 × 140 cell design region, respectively.
III. RESULTS: MODE SPLITTER
A. Design task
B. Optimizer performance
The training curves for select optimizers and the final performance of all optimizers for the three design region sizes can be seen in Fig. 2, and images of the final designs for a 100 × 100 design region can be found in Fig. 3.
Mode splitter structures at the end of optimization for each initialization and select optimizers with and without meta-learning. Colors indicate the performance. A system size of 100 × 100 is shown. v refers to the variance of the Gaussian noise used to generate the initialization.
The Adam23 optimizer and its variants generally produce the best results. Meta-learning has a negative impact on consistency, but the best single devices (“hero” devices) were generally found with meta-learning. The best-performing device was yielded by the meta-learned Yogi30 optimizer, while Novograd28 had the best average and median performance overall. The differences between the best meta-learned design and the best non-meta-learned design are small, however. All Adam variants appear robust and high-performing and are excellent candidates for InvDes optimization.
We attribute the general failure of meta-learning to improve Adam to conflict with the momentum-based gradient updates. Optimizing hyperparameters in this way is not always better than fixed parameters or schedules such as cosine annealing,59 being highly sensitive to the parameters of the meta-optimizer60 and vulnerable to overfitting,61 with the optimal set of hyperparameters in one batch of four inner loops being very different from the optimal set in the next batch.
AdaGrad31 optimizers generally perform worse than the Adam family. They are more prone to noise during the descent, likely due to their longer memory of the gradient compared to Adam, which negatively impacts convergence during the continuum phase. Both their average performance and their hero devices are inferior to Adam’s without meta-learning. Meta-learning has a very positive impact on their performance, such that AdaGrad, RMSProp,34 and AdaFactor33 perform similarly to Adam but neither find the hero devices of meta-learned Yogi nor reach the consistency of cosine-annealed Novograd. We theorize that the longer gradient memory might yield more generalizable hyperparameters at the completion of each outer loop compared to Adam.
The LARS family of optimizers has mixed results. LARS36 and Lamb38 have highly inconsistent performance and fail to reach the best minima found by other optimizers. LARS, in particular, yielded designs with extremely poor performance, most of which are off the scale of Fig. 2. Cosine-annealed Fromage,37 however, is one of the best InvDes optimizers. While its hero devices are inferior to those of Yogi, Fromage’s results are more concentrated on the best-performing devices. LARS and Lamb are nominally designed for very large networks with extremely non-convex landscapes and allow update rates that can differ by orders of magnitude between parameters. However, they are known to cause divergence of the design parameters,37 with the magnitude of the parameters rapidly growing throughout the optimization. This was observed in our data, with the parameters increasing by many orders of magnitude. Fromage rectifies this, enjoying the advantages of LARS in extremely non-convex spaces without the often fatal divergence flaw.
Lion42 lacks both the consistency and hero devices of Adam and Fromage, even with the benefits it derives from meta-learning.
The performance of static learning rate GD is similar to that of Lion, although it is better able to find hero devices when meta-learning is used to dynamically alter its learning rate. Artificial noise injection (NoisyGD41) appears to have a negligible-to-negative impact on GD performance.
L-BFGS43 has highly inconsistent performance. The additional function evaluations demanded by the line search reduce the total iterations that can be performed in a fixed wall-clock time, and the algorithm struggles to right itself during and after discretization and DRC. The mode splitter is a small and relatively simple problem that can be solved in a few iterations, a best-case scenario for L-BFGS, and the optimizer is able to find a core of high-performing devices, although with a number of very poor outliers. As will be seen with the wavelength demultiplexer, however, the cost of line search becomes clear for harder problems with longer optimizations.
Because each pixel is 40 nm, in theory even pixel-level features are fabricable with e.g., GlobalFoundries’s SOI processes.62 However, to generate more robust designs, DRC successfully enforces at least an 80 nm minimum feature size across most designs. The number of designs violating DRC could be reduced by increasing the weight on the Lf term of the cost function, at the cost of performance. The distribution of minimum feature sizes computed using the imageruler package50,63 across all optimizers and sizes can be found in Fig. S3 of the supplementary material.
C. Initialization and size dependence
As illustrated by the final structures in Fig. 3, even with meta-learning and heuristics like momentum to more thoroughly explore the design space, InvDes remains highly initialization dependent. This is indicative of a highly nonconvex design space, and such sensitivity is not limited to gradient-based photonics InvDes, with topology optimization problems in mechanics64 and acoustics,65 as well as neural network training,66,67 exhibiting similar sensitivity. Within the limited statistics of our study, uniform and low-amplitude noise patterns appear to lead to higher performance, while high-amplitude noise yielded worse designs. Yet, even from the same initialization, there is wide diversity in the final designs. This highlights again the abundance of local minima of varying quality in our optimization landscape and, in turn, the importance of choosing and tuning the optimizer.
The performance of the initial structure generally correlates reasonably well with the performance of the final design in this simple problem. Because optimization is short, none of the algorithms have time to travel far from the initialization in the optimization landscape, making initial performance a generally acceptable, if rather inconsistent, predictor of final performance across the board [see Fig. S2(a) of the supplementary material].
The performance of the design after just two iterations is more predictive of final performance, although this is by no means consistent either, while the performance at the end of the continuum phase is generally highly correlated with final performance. This suggests that at least a few iterations should be performed on an initialization before accepting or rejecting it when pre-sampling initial conditions and supports the use of a continuum phase to find better discretized devices.
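The correlation analysis used in Fig. 4(a) reduces to computing Pearson coefficients between cost values at different stages; a minimal sketch, with the stage-wise cost arrays as hypothetical placeholders:

```python
import numpy as np

def stage_correlations(L0, L2, LC, LD, LF):
    """Pearson correlation of the cost at each intermediate stage (initialization,
    second iteration, end of continuum, end of discretization) with the final cost LF."""
    stages = {"rho(0,F)": L0, "rho(2,F)": L2, "rho(C,F)": LC, "rho(D,F)": LD}
    return {name: np.corrcoef(vals, LF)[0, 1] for name, vals in stages.items()}
```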
Both the best average performance and the best hero devices were found in the 100 × 100 design space [Fig. 4(b)], indicating a possible trade-off between a sparsity of deep minima in small systems and an overabundance of shallow minima in large systems, with moderately sized systems proving the easiest InvDes targets.
Analysis of mode splitter data. (a) Pearson correlation coefficients ρ between the final L (state F) and the L of the initialization [ρ(0, F)], the second iteration [ρ(2, F)], the end of the continuum phase [ρ(C, F)], and the end of the discretization phase [ρ(D, F)]. med(F) is the median final L, and the −M suffix on an optimizer indicates meta-learning. Data from all system sizes and initializations have been aggregated. (b) Final L as a function of system size. Points are averages over all initializations, and error bars indicate the min/max L.
IV. RESULTS: WAVELENGTH DEMULTIPLEXER
A. Design task
Wavelength demultiplexer results. (a) Design target: a polychromatic input is split between three outputs, with 1500, 1550, and 1600 nm light sent to different channels. Perfect performance corresponds to a cost function of Lp = −3. (b) Cost function, averaged over all initializations, as a function of wall-clock time for select optimizers acting on a 115 × 115 design space without (left) and with (right) meta-learning. (c) Total cost function at the end of optimization for every optimizer and initialization without (upper) and with (lower) meta-learning. For each optimizer, the three sub-columns are 90 × 90, 115 × 115, and 140 × 140 design regions, respectively.
B. Optimizer performance
The training curves and the final performance of select optimizers for the three design region sizes can be seen in Fig. 5. The results follow the general trends of the mode splitter but with distinct features. All Adam optimizers perform very well, with minimal differences in performance among them. Meta-learned RAdam29 yielded the best hero device, Yogi had the best median performance, and meta-learned NAdam26 had the best average performance overall. For most Adam optimizers, meta-learning has a negligible-to-negative effect, although it was again necessary for the best hero devices.
AdaGrad and its family, with the exception of SM3,35 were more consistent than in the mode splitter case. We attribute this to the longer optimization time: as shown in Figs. 5(b) and 5(c), AdaGrad follows a noisier path during the early stages than most other optimizers, and the longer optimization appears necessary for these fluctuations to damp out. AdaGrad optimizers are again generally improved by meta-learning.
As in the splitter case, cosine-annealed Fromage performs as well as the Adam optimizers. Simple GD likewise performs as well as Adam, so long as meta-learning is used. L-BFGS once again underperforms the first-order methods, even for the smallest device size, with the handicap of fewer iterations in the same wall-clock time having a much greater impact during these longer optimizations.
The distribution of minimum feature sizes computed using the imageruler package50,63 across all optimizers and sizes can be found in Fig. S4 of the supplementary material.
C. Initialization and size dependence
As illustrated by the final structures in Fig. 6, the wavelength demultiplexer is also highly initialization dependent. Reduced noise amplitude in the initialization is again correlated with higher performance.
Wavelength demultiplexer structures at the end of optimization for each initialization and select optimizer, with and without meta-learning. Colors indicate the performance. A system size of 115 × 115 is shown. v refers to the variance of the Gaussian noise used to generate the initialization.
Correlation between final performance and performance at different stages of the design process again supports the use of the continuum phase, as good continuum designs are highly correlated with good discretized designs. However, compared to the mode splitter, performance at initialization and after a small number of iterations is a much poorer predictor of final performance [Fig. 7(a)]. We additionally observe that optimizers with better median final performance tend to have a lower correlation between initial and final performance [see also Fig. S2(b) of the supplementary material]. We believe this indicates that they are better explorers, traveling far from the initialization and rejecting poor minima rather than immediately isolating the nearest local extremum. In the shorter mode splitter optimization, this difference in behavior was not as clear, as there was not enough time for optimizers to explore the design space, and the simpler problem meant there were many high-performing minima close to a given initialization.
The average performance as a function of system size also differs from the mode splitter, with the best results derived from the largest, 140 × 140 cell design region [Fig. 7(b)]. This is consistent with the prediction that a more difficult design task will have a larger optimal design region size.
Analysis of wavelength demultiplexer data. (a) Pearson correlation coefficients ρ between the final L (state F) and the L of the initialization [ρ(0, F)], the second iteration [ρ(2, F)], the end of the continuum phase [ρ(C, F)], and the end of the discretization phase [ρ(D, F)]. med(F) is the median final L, and the −M suffix on an optimizer indicates meta-learning. Data from all system sizes and initializations have been aggregated. (b) Final L as a function of system size. Points are averages over all initializations, and error bars indicate the min/max L.
V. CONCLUSION AND DISCUSSION
We study the performance of a large number of the most common machine learning optimizers when applied to nanophotonic inverse design tasks, across different initializations and as a function of system size. We find the Adam family of optimizers to have the best performance overall. Meta-learning seems to interfere with momentum in many cases, resulting in more poor-performing outliers, but also the highest-performing hero devices. Meta-learned Yogi produced the best hero devices in the mode splitter problem, while meta-learned RAdam produced the best wavelength demultiplexer design. AdaGrad optimizers generally underperformed Adam optimizers, although with meta-learning they had roughly equivalent performance. Fromage, with cosine-annealing, was the only high-performing LARS optimizer, outperforming Adam in some cases. L-BFGS generally underperformed first-order methods, likely due to the increased computational burden associated with line search.
We show that larger system sizes do not always yield better results; there appears to exist an optimal system size for a given design target, related to its complexity, after which increasing system size makes it more difficult for currently available optimizers to locate high-performing minima. Future optimizers or samplers68 might mitigate this issue.
We find that the performance at initialization or after a few iterations is a poor indicator of final performance as problem complexity increases, particularly for the best optimizers. Nonetheless, performance at the end of a sufficiently long continuum optimization is strongly correlated with final performance, allowing poor-performing continuum designs to be safely rejected.
We, therefore, suggest Adam optimizers be the preferred first choice for nanophotonic InvDes, with Fromage and, in some cases, meta-learned AdaGrad derivatives offering strong alternatives. In contrast, quasi-Newton methods such as L-BFGS, although widely used, may be less effective except for the smallest design regions.
Finally, we note that there remains a vast array of distinct InvDes tasks beyond the two we examined, along with a wide phase space of hyperparameters for both optimizers and meta-optimizers. Most InvDes tasks utilize qualitatively similar cost functions, ultimately dependent on field magnitude at particular input and output points, and are, therefore, expected to have similarly nonconvex design spaces. The similarities in the results obtained for the two tasks presented here suggest that other InvDes tasks, both in nanophotonics and other fields, are likely to exhibit similar behavior.
Nevertheless, future research should expand to include a broader range of tasks, parameter settings, and system sizes to fully understand the strengths and limitations of various existing machine-learning optimization strategies. This could also pave the way for the development of new optimizers specifically tailored to nanophotonic InvDes.
SUPPLEMENTARY MATERIAL
The supplementary material contains the following figures: Fig. S1 depicts the performance vs learning rate for each optimizer, Fig. S2 depicts the correlation between the quality of the initial guess and the quality of the final design as a function of the quality of the final design, and Figs. S3 and S4 depict the relative ability of different optimizers to perform DRC by providing histograms of the number of final designs possessing a given minimum length scale.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Nathaniel Morrison: Conceptualization (equal); Formal analysis (equal); Software (equal); Visualization (equal); Writing – original draft (equal). Eric Y. Ma: Conceptualization (equal); Formal analysis (equal); Project administration (equal); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The code and supporting data for this article are publicly available through the Ma Lab GitHub organization.51