The success of deep learning has driven the proliferation and refinement of numerous non-convex optimization algorithms. Despite this growing array of options, the field of nanophotonic inverse design continues to rely heavily on quasi-Newton optimizers such as L-BFGS and basic momentum-based methods such as Adam. A systematic survey of these and other algorithms in the nanophotonics context remains lacking. Here, we compare 24 widely used machine learning optimizers on inverse design tasks. We study two prototypical nanophotonics inverse design problems—the mode splitter and wavelength demultiplexer—across various system sizes, using both hand-tuned and meta-learned hyperparameters. We find that Adam derivatives, as well as the Fromage optimizer, consistently outperform L-BFGS and standard gradient descent, regardless of system size. While meta-learning has a negligible-to-negative impact on Adam and Fromage, it significantly improves others, particularly AdaGrad derivatives and simple gradient descent, such that their performance is on par with Adam. In addition, we observe that the most effective optimizers exhibit the lowest correlation between initial and final performance. Our results and codebase (github.com/Ma-Lab-Cal/photonicsOptComp) provide a valuable framework for selecting and benchmarking optimizers in nanophotonic inverse design.

Nanophotonic devices, which manipulate light inside of integrated circuits using nanoscale structures, have a large and growing range of applications, from medical imaging1 and data center switches2 to all-optical logic units,3 physical neural networks,4 and photonic quantum computers.5 Traditional methods for designing nanophotonic devices require a human engineer to determine the overall geometry, with only a handful of parameters machine optimized.6,7 In contrast, with inverse design (InvDes), the designer simply specifies desired performance and fabrication constraints and allows a computer algorithm to form the device layout. The design is treated as an optimization problem, with the target functionality and constraints encoded in a scalar cost function and an optimizer used to iteratively update parameters defining the structure based on electromagnetic simulation results.

Driven by a need for components that are smaller, better performing, and more multifunctional, packed into unconventional footprints in the ever-shrinking confines of integrated circuits, nanophotonic InvDes has rapidly matured in the past decade.8–10 With refined methods for parameterization and for ensuring compliance with foundry design rules, alongside new graphics processing unit (GPU)-accelerated electromagnetic simulators, modern InvDes algorithms can reliably yield designs with better performance-per-area than their rationally designed counterparts.11–13

Much less attention, however, has been paid to the optimizers. Historically, quasi-Newton methods have dominated the literature, with the L-BFGS algorithm being the most popular.14–16 These optimizers remain extremely common, although the Adam optimizer and simple gradient descent have also gained increasing traction due to their success in neural network training.8,13,17

There remains little diversity, however, in the choice of optimizer for nanophotonic inverse design, and choosing the best optimizer is largely a matter of intuition. Yet the optimizer is of critical importance, as the optimization landscape is highly non-convex.18 Results are very initialization-dependent, as the local optimum found by the algorithm depends upon where in the design space the search begins. A good optimizer should balance exploration of the design space, to avoid low-performing local optima, with exploitation, to quickly isolate high-performing local optima.19,20 This potentially conflicts with the common strategy of accepting or rejecting initializations after a few optimization iterations in order to quickly survey the design space,18 as the best optimizers should be the least dependent upon initialization. It also raises the question of optimal system size. Just as there is an optimal network complexity in many deep learning problems,21,22 there may be an optimal system size in InvDes problems, balancing higher achievable performance against the difficulty of finding good designs.

A broad survey of both classic and newly developed optimizers for InvDes, similar to those performed for other machine learning tasks,19,20 would be of great use in developing a more systematic approach to optimizer selection. Similarly, a study of design performance as a function of system size and of the correlation between design performance at different stages of optimization across different optimizers is needed to find better methods of initialization selection.

Here, we apply 24 different gradient-based and quasi-Newton optimizers to InvDes tasks. We take two benchmark design problems—the mode splitter and the wavelength demultiplexer—and compare the optimizers' relative performance on each. We find that, without meta-learning, most Adam optimizers perform the best across all system sizes, with Fromage also performing well and L-BFGS lagging. Meta-learning, optimizing optimizer hyperparameters in tandem with the design parameters, benefits most other optimizers and leaves many on par with Adam but has a negligible-to-negative impact on Adam’s performance. Larger design regions yield better results for the wavelength demultiplexer, but intermediate system sizes are ideal for the simpler mode splitter. We find a weak correlation between design performance very early in the optimization and final performance, with the best optimizers having the weakest correlation after long optimizations.

We begin with a brief description of the InvDes workflow and the different optimizers, followed by the design tasks and survey results.

Similar to neural network training, nanophotonic InvDes can be reduced to the minimization of a scalar-valued function,
$$\min_{\phi}\; L(\phi, e) = L_p(\phi, e) + L_f(\phi), \tag{1}$$
$$\text{subject to}\quad f(\phi, e) = 0, \tag{2}$$
where L = Lp + Lf is the cost function encoding both the desired performance (Lp) and a penalty for fabrication design rule violation (Lf), ϕ is the set of design parameters that map to the physical structure of the device (a permittivity distribution for nanophotonics), e is the set of physical fields generated by the device (typically electric fields for nanophotonics), and f(ϕ, e) = 0 is the governing equation that maps structure ϕ to fields e (Maxwell’s equations for nanophotonics).

Solving Eq. (1) requires three components: a parameterization strategy, which determines how ϕ is mapped to the structure and what terms are included in L throughout the optimization; a simulator, which solves f(ϕ, e) = 0 given ϕ; and an optimizer, which iteratively updates ϕ to minimize L. We detail our choices for each of these components below.
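
To make this division of labor concrete, the sketch below assembles the three components into a generic optimization loop. It is a minimal illustration only: parameterize, simulate, and cost are placeholders standing in for the three-field mapping, the FDTD solve, and Eq. (1), not the interfaces of our codebase.

```python
import jax
import jax.numpy as jnp
import optax

# Hypothetical stand-ins for the three components described above.
def parameterize(phi):
    """Map raw design parameters to a permittivity distribution (placeholder)."""
    return jax.nn.sigmoid(phi)  # squashes phi into [0, 1], i.e., an interpolated permittivity

def simulate(eps):
    """Solve f(phi, e) = 0 for the fields e (placeholder for a differentiable FDTD solve)."""
    return jnp.sum(eps)

def cost(phi):
    """Scalar cost L = Lp + Lf evaluated from the simulated fields (placeholder)."""
    e = simulate(parameterize(phi))
    return -e  # a real Lp would score transmission into the target ports

optimizer = optax.adam(learning_rate=1.0)
phi = jnp.zeros((100, 100))            # e.g., a 100 x 100 design region
opt_state = optimizer.init(phi)

for step in range(10):
    loss, grad = jax.value_and_grad(cost)(phi)   # adjoint-style gradient via autodiff
    updates, opt_state = optimizer.update(grad, opt_state, phi)
    phi = optax.apply_updates(phi, updates)
```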

We use the three-field method44 for our parameterization strategy. We break the optimization into four phases (Fig. 1). First, we use a continuum phase, in which we map ϕ to a continuum of permittivities between that of our cladding material (silicon dioxide) and that of our design material (silicon). This allows the algorithm to easily tunnel between designs that might be inaccessible in a binarized device.

FIG. 1.

Typical nanophotonics inverse design workflow.


Next, we use a discretization phase, in which we gradually force ϕ to map to either silicon dioxide or silicon, rather than allowing intermediate permittivities. At the end of this phase, the device is a binary silicon dioxide–silicon structure. This is followed by a settling phase, in which the binary design is further optimized.
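
The continuum, discretization, and settling phases can be pictured as a projection whose sharpness is gradually increased. The snippet below is a simplified illustration of that idea using a standard tanh-style projection with an increasing steepness β; it is not the exact filter-and-project chain of the three-field method,44 whose full form is given in Zhou et al.44 and our codebase,51 and the permittivity values are typical textbook numbers for SiO2 and Si near 1550 nm.

```python
import jax.numpy as jnp

def project(rho, beta, eta=0.5):
    """Smoothed Heaviside projection: beta -> infinity forces rho toward {0, 1}."""
    num = jnp.tanh(beta * eta) + jnp.tanh(beta * (rho - eta))
    den = jnp.tanh(beta * eta) + jnp.tanh(beta * (1.0 - eta))
    return num / den

def to_permittivity(rho, beta, eps_sio2=2.1, eps_si=12.1):
    """Interpolate between cladding (SiO2) and device (Si) permittivities."""
    return eps_sio2 + project(rho, beta) * (eps_si - eps_sio2)

# Continuum phase: small beta leaves intermediate permittivities accessible.
# Discretization phase: ramping beta drives the design toward a binary structure.
for beta in (1.0, 8.0, 64.0):
    eps = to_permittivity(jnp.linspace(0.0, 1.0, 5), beta)
```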

Until now, we have neglected the fabrication penalty term Lf. A large number of schemes have been developed for enforcing fabrication constraints, both during the optimization process45–49 and during post-processing.50 We adopt the popular three-field scheme.44 During the final DRC (design rules check) phase, we add Lf to L and let the algorithm alter the device to improve fabricability, sacrificing performance as necessary. For more details on the implementation and the explicit form of Lf, see Zhou et al.44 and our codebase.51

We use the finite difference time domain (FDTD)52 method for our simulations via the fdtdz53 and pjz54 packages. In all cases, a 40 nm cubic lattice was used, with each time step lasting 0.08 fs. Simulations were run long enough to reach a steady state before extracting electric fields. For a 4 μm2 or smaller design region, simulation lasts for 200 fs (38 periods of 1550 nm light); for a 16 μm2 or smaller region, simulation lasts for 240 fs (46 periods); and for larger structures (up to 36 μm2), simulation lasts for 320 fs (61 periods). Due to the scale invariance of Maxwell’s equations, the results would be the same, provided both wavelength and cell size are scaled by the same factor.
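
As a quick sanity check of the quoted run lengths, one optical period at 1550 nm is λ/c ≈ 5.17 fs, so 200, 240, and 320 fs correspond to roughly 38, 46, and 61 periods, matching the values above:

```python
# One period of 1550 nm light, in femtoseconds, and the period counts quoted above.
period_fs = 1550e-9 / 299_792_458.0 * 1e15
print([int(t / period_fs) for t in (200.0, 240.0, 320.0)])  # -> [38, 46, 61]
```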

The design region was 240 nm thick silicon, embedded in 2 μm of silicon dioxide cladding in all directions [see Fig. 2(a)]. In the x̂ and ŷ directions, within the plane of the design region, adiabatic absorption boundary conditions were used, extending another 2 μm beyond the cladding. In the ẑ direction, perfectly matched layers (PMLs) were used, embedded in the last six layers of cladding. Single-mode input and output waveguides were 400 nm wide, and dual-mode waveguides were 720 nm wide. The target minimum feature size for both silicon and silicon dioxide structures was 90 nm during the DRC phase.

FIG. 2.

Mode splitter results. (a) Design target: a dual-mode input is split between two outputs, with the TE00 component diverted to the lower output and the TE10 mode converted to TE00 and sent to the upper output. Perfect performance corresponds to a cost function of Lp = −2. (b) Cost function, averaged over all initializations, as a function of wall-clock time for select optimizers acting on a 100 × 100 design space without (left) and with (right) meta-learning. (c) Total cost function at the end of optimization for every optimizer and initialization without (upper) and with (lower) meta-learning. For each optimizer, the three sub-columns are 50 × 50, 100 × 100, and 150 × 150 design regions, respectively.


Optimizers seek out the minimum of L by iteratively updating ϕ based on the simulation results. We focus on gradient descent (GD) and quasi-Newton (QN) optimizers, both first-order methods that require an explicit first derivative dL/dϕ to determine the descent direction but not an explicit second derivative (as would be needed by second-order methods such as the Newton optimizer). Such first-order gradients are efficiently obtained with the adjoint method,16 a special case of the backpropagation method used to train neural networks. GD methods use either a fixed learning rate or a heuristic based on past gradients and loss evaluations to dynamically choose the learning rate. QN methods approximate the second derivative using the first derivative to make an educated guess of the optimal learning rate, then evaluate the cost function at multiple points along the descent direction to confirm or refine this choice in a process called “line search.”
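
The two update styles can be contrasted in a few lines. In the sketch below, cost is a placeholder for the simulation-backed objective, the gradient is supplied by automatic differentiation (playing the role of the adjoint method), and L-BFGS is taken from SciPy, which performs its own line search internally.

```python
import jax
import jax.numpy as jnp
import numpy as np
from scipy.optimize import minimize

def cost(phi):
    """Placeholder scalar cost; in practice this wraps the FDTD simulation."""
    return jnp.sum((phi - 0.3) ** 2)

grad = jax.grad(cost)
phi = jnp.zeros(16)

# Gradient descent: a fixed (or scheduled) learning rate times the gradient.
lr = 0.1
for _ in range(50):
    phi = phi - lr * grad(phi)

# Quasi-Newton: L-BFGS approximates curvature from past gradients and
# line-searches along the descent direction, at the price of extra cost evaluations.
result = minimize(
    lambda x: float(cost(jnp.asarray(x))),
    x0=np.zeros(16),
    jac=lambda x: np.asarray(grad(jnp.asarray(x)), dtype=float),
    method="L-BFGS-B",
)
```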

For InvDes tasks, which typically have far fewer parameters than modern deep learning models, QN methods (particularly L-BFGS43) are believed to perform well due to their consistency and robustness against numerical artifacts, while momentum-assisted GD methods (particularly Adam23) are rapidly growing in popularity thanks to their speed, tunability, and ability to evade shallow minima.55 The deep learning community has, however, developed a wide range of GD algorithms, many of which are derivatives of Adam or of other “classic” optimizers such as AdaGrad,31 while others, such as Lion42 or LARS,36 are quite distinct. Each has been shown to outperform its peers on specific deep learning tasks (see references in Table I), but none has been explored for nanophotonic inverse design.

TABLE I.

Optimizer parameters. “LR” refers to the learning rate.

Family | Optimizer | Parameters | Description | Ref.
Adam family | Adam | LR = 1, b1 = 0.9, b2 = 0.999 | Uses current grad and moving average of first and second moments of previous grads to modify the learning rate of each parameter individually | 23
Adam family | AdaBelief | LR = 0.1, b1 = 0.9, b2 = 0.999 | Takes smaller steps when momentum and grad disagree | 24
Adam family | Adamax | LR = 1, b1 = 0.9, b2 = 0.999 | Second moment replaced with max of second moment and current grad | 23
Adam family | AdamaxW | LR = 1, b1 = 0.9, b2 = 0.999 | Combination of Adamax and AdamW | 25
Adam family | AdamW | LR = 0.1, b1 = 0.9, b2 = 0.999 | Shifts weight decay regularization term out of moving average | 26
Adam family | AMSGrad | LR = 1, b1 = 0.9, b2 = 0.999 | Running average replaced by running maximum | 27
Adam family | NAdam | LR = 0.1, b1 = 0.9, b2 = 0.999 | Uses Nesterov momentum, advancing parameters along momentum direction a bit before computing grad | 26
Adam family | NAdamW | LR = 1, b1 = 0.9, b2 = 0.999 | Combination of NAdam and AdamW | 26
Adam family | Novograd | LR = 10, b1 = 0.9, b2 = 0.25 | Second moment replaced with norm of second moment | 28
Adam family | RAdam | LR = 10, b1 = 0.9, b2 = 0.999 | Rectifies the momentum to reduce variance in early phases | 29
Adam family | Yogi | LR = 10, b1 = 0.9, b2 = 0.999 | Reduces the learning rate when grads vary considerably between iterations | 30
AdaGrad family | AdaGrad | LR = 100 | Modifies the learning rate for each parameter by the square root of the element-wise sum of the outer product of all previous grads | 31
AdaGrad family | AdaDelta | LR = 100, ρ = 0.9 | Uses previous grads to approximate a second derivative | 32
AdaGrad family | AdaFactor | LR = 0.1, decayRate = 0.8 | Stores only row/column sums of past grads to reduce memory use | 33
AdaGrad family | RMSProp | LR = 0.1, decay = 0.9 | Replaces the sum of past grads with an exponential moving average | 34
AdaGrad family | SM3 | LR = 10, decay = 0.9 | Stores only the extrema of the past grads to reduce memory use | 35
LARS family | LARS | LR = 100, weightDecay = 0.1, ɛ = 10^−9 | Uses a uniform learning rate, scaled by the L2 norm of a running weight divided by the L2 norm of the current grad | 36
LARS family | Fromage | LR = 0.1 | Rescales the LR to prevent it from increasing without bound | 37
LARS family | Lamb | LR = 0.1, b1 = 0.9, b2 = 0.999, weightDecay = 0.1 | Incorporates the second moment of Adam into the LARS algorithm | 38
GD family | GD | LR = 100 | Each update is simply a user-chosen learning rate times the grad | 39
GD family | OptimGD | LR = 100, α = 1, β = 1 | Uses current/previous grad to predict next grad and adjusts LR accordingly | 40
GD family | NoisyGD | LR = 100, η = 10^−4, γ = 2 | Injects artificial Gaussian noise into grad | 41
Lion family | Lion | LR = 0.1, b1 = 0.9, b2 = 0.99, weightDecay = 0.001 | Itself created by a ML algorithm, uses the element-wise sign of the first moment to adapt the learning rate of each parameter | 42
Quasi-Newton | L-BFGS | N/A | Approximates the second derivative using all previous grads and performs a line search to optimize the learning rate each iteration | 43

We classify our library of optimizers into six families, each consisting of variations of the main algorithm. Each algorithm, along with a brief description and our choice of hyperparameters, can be found in Table I. The learning rates were chosen based on an order-of-magnitude sweep, testing learning rates from 0.001 to 100 for each optimizer on a wavelength demultiplexer optimization (see Fig. S1 of the supplementary material).
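
Most of the first-order optimizers in Table I have implementations in the optax library used by our codebase; the sketch below shows how a subset of them could be instantiated at the table's learning rates and swept over an order-of-magnitude grid. The run_design function is a placeholder for a full optimization run, and the exact optax constructor names should be checked against the installed version.

```python
import optax

# A subset of Table I, instantiated at the hand-tuned learning rates (LR column).
OPTIMIZERS = {
    "Adam":    optax.adam(learning_rate=1.0),
    "Yogi":    optax.yogi(learning_rate=10.0),
    "AdaGrad": optax.adagrad(learning_rate=100.0),
    "RMSProp": optax.rmsprop(learning_rate=0.1, decay=0.9),
    "Fromage": optax.fromage(learning_rate=0.1),
    "Lamb":    optax.lamb(learning_rate=0.1, b1=0.9, b2=0.999, weight_decay=0.1),
    "GD":      optax.sgd(learning_rate=100.0),
    "Lion":    optax.lion(learning_rate=0.1, b1=0.9, b2=0.99, weight_decay=0.001),
}

# Order-of-magnitude learning-rate sweep of the kind used to pick the values in Table I.
def sweep(make_optimizer, run_design, rates=(1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2)):
    """run_design(optimizer) -> final cost; a placeholder for a full optimization."""
    return {lr: run_design(make_optimizer(learning_rate=lr)) for lr in rates}
```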

In order to survey the performance of our optimizers across system sizes with and without meta-learning, we use two InvDes problems: an “easy” design task in the two-mode splitter [Fig. 2(a)] and a “hard” design task in the triple wavelength demultiplexer [Fig. 5(a)]. For each problem, we randomly generate seven initializations with spatially correlated Gaussian noise of varying amplitude for each of the three design region sizes (see Figs. 3 and 6 for examples of initializations). We apply each optimizer to the same seven initializations at each system size and determine the onset and duration of each optimization phase by elapsed wall-clock time.
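
One simple way to generate such initializations, shown below, is to low-pass filter white Gaussian noise; the correlation length, the noise amplitudes, and the seeds here are illustrative assumptions rather than the exact settings used for our seven initializations.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def correlated_init(shape, amplitude, corr_sigma, seed=0):
    """Spatially correlated Gaussian noise around a half-filled design."""
    rng = np.random.default_rng(seed)
    noise = gaussian_filter(rng.standard_normal(shape), sigma=corr_sigma)
    noise *= amplitude / (noise.std() + 1e-12)      # rescale to the target amplitude
    return np.clip(0.5 + noise, 0.0, 1.0)           # keep densities in [0, 1]

# Seven initializations of varying noise amplitude for a 100 x 100 design region.
inits = [correlated_init((100, 100), amplitude=a, corr_sigma=5, seed=s)
         for s, a in enumerate((0.0, 0.1, 0.1, 0.3, 0.3, 0.5, 0.5))]
```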

When meta-learning was performed, the meta-optimizer was unmodified GD with a learning rate of 1. Only the learning rate hyperparameter was meta-learned, allowed to vary exponentially by up to an order of magnitude in either direction, with other hyperparameters held constant. Four iterations of structure optimization were performed for every one meta-optimization iteration. Meta-learning is not applied to L-BFGS, as the learning rate is set by the line search.
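
A minimal sketch of this meta-learning loop is given below: the log learning rate is treated as an extra differentiable parameter, updated by plain GD (the meta-optimizer) once for every four design updates, with the drift clipped to one order of magnitude as described above. The one-step-unroll meta-gradient and the placeholder cost are simplifying assumptions for illustration, not our exact implementation.

```python
import jax
import jax.numpy as jnp

def cost(phi):
    """Placeholder for the simulated cost L(phi)."""
    return jnp.sum((phi - 0.3) ** 2)

def inner_update(phi, log_lr):
    """One plain-GD design update with the (meta-learned) learning rate."""
    return phi - jnp.exp(log_lr) * jax.grad(cost)(phi)

phi = jnp.zeros(16)
log_lr0 = jnp.log(1.0)          # log of the hand-tuned learning rate
log_lr = log_lr0
meta_lr = 1.0                   # meta-optimizer: unmodified GD with LR = 1

for outer in range(25):
    # Meta-gradient: how the cost after one design step depends on log_lr.
    meta_grad = jax.grad(lambda llr: cost(inner_update(phi, llr)))(log_lr)
    log_lr = log_lr - meta_lr * meta_grad
    # Allow at most one order of magnitude of drift in either direction.
    log_lr = jnp.clip(log_lr, log_lr0 - jnp.log(10.0), log_lr0 + jnp.log(10.0))
    # Four design (inner) updates per meta-update.
    for _ in range(4):
        phi = inner_update(phi, log_lr)
```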

When meta-learning is not performed, cosine-annealing with warmup56,57 is used for each of the four optimization phases independently, with both the initial and final learning rates one tenth the maximum learning rate shown in Table I. Warm-up lasts for 10% of the duration of each phase.
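
When a fixed schedule is used instead, optax provides a ready-made warmup-plus-cosine-decay schedule; the sketch below builds one per optimization phase with the 10% warmup and the one-tenth initial/final learning rates described above (the phase length in iterations is an illustrative value).

```python
import optax

def phase_schedule(peak_lr, phase_steps):
    """Cosine annealing with warmup: start and end at peak_lr / 10."""
    return optax.warmup_cosine_decay_schedule(
        init_value=peak_lr / 10.0,
        peak_value=peak_lr,
        warmup_steps=int(0.1 * phase_steps),   # warmup over 10% of the phase
        decay_steps=phase_steps,
        end_value=peak_lr / 10.0,
    )

# Example: Adam at its Table I peak learning rate, annealed over a 200-iteration phase.
optimizer = optax.adam(learning_rate=phase_schedule(peak_lr=1.0, phase_steps=200))
```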

Both optimizer and meta-optimizer learning rates are relatively large,58 as this was observed to significantly accelerate convergence without sacrificing final performance, and meta-learning consistently favored large learning rates throughout most of the optimization process (Fig. S1 of the supplementary material). Larger ideal learning rates compared with those typical of deep learning problems may be explained by the fact that inverse design uses full gradient information every iteration, rather than batches of training data, and larger strides, therefore, result in less jitter than they would in a deep learning problem.

All mode splitter results were obtained using an Nvidia GeForce RTX 4070, requiring on average 0.76/1.1/1.5 s per iteration for a 50 × 50/100 × 100/150 × 150 cell design region, respectively. All wavelength demultiplexer results were obtained using an Nvidia GeForce RTX 4080 Super, requiring on average 1.8/2.4/2.6 s per iteration for a 90 × 90/115 × 115/140 × 140 cell design region, respectively.

The mode splitter targets mode separation and conversion at 1550 nm [see Fig. 2(a)]. When a TE00 mode is injected into the 720 nm-wide input waveguide (port 0), it must be entirely converted to TE00 mode in the bottom 400 nm output waveguide (port 1). When a TE10 mode is injected into the input waveguide, it must be entirely converted to TE00 mode in the top 400 nm output waveguide (port 2). In terms of scattering parameters computed from the electric fields e, where Sij(λ; mi, mj) denotes the proportion of energy at wavelength λ injected at port i with mode mi transferred to port j with mode mj, the performance component of the cost function is
$$L_p = -\big[S_{01}(1550\ \mathrm{nm};\ \mathrm{TE}_{00}, \mathrm{TE}_{00}) + S_{02}(1550\ \mathrm{nm};\ \mathrm{TE}_{10}, \mathrm{TE}_{00})\big]. \tag{3}$$
Three design region sizes (excluding cladding and waveguide leads) of 50 × 50 cells (2 × 2 μm2), 100 × 100 cells (4 × 4 μm2), and 150 × 150 cells (6 × 6 μm2) were used.
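
Written as code, this performance term is simply the negated sum of the two target transmissions described above; scattering_params is a hypothetical container for the power fractions extracted from the simulated fields, shown only to make the bookkeeping explicit.

```python
def mode_splitter_cost(scattering_params):
    """L_p for the mode splitter; perfect routing gives -2.
    scattering_params[(i, j, mi, mj)] is the power fraction S_ij(1550 nm; mi, mj)."""
    s01 = scattering_params[(0, 1, "TE00", "TE00")]   # TE00 input -> lower output
    s02 = scattering_params[(0, 2, "TE10", "TE00")]   # TE10 input -> TE00, upper output
    return -(s01 + s02)
```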

The training curves for select optimizers and the final performance of all optimizers for the three design region sizes can be seen in Fig. 2, and images of the final designs for a 100 × 100 design region can be found in Fig. 3.

FIG. 3.

Mode splitter structures at the end of optimization for each initialization and select optimizers with and without meta-learning. Colors indicate the performance. A system size of 100 × 100 is shown. v refers to the variance of the Gaussian noise used to generate the initialization.


The Adam23 optimizer and its variants generally produce the best results. Meta-learning has a negative impact on consistency, but the best single devices (“hero” devices) were generally found with meta-learning. The best-performing device was yielded by the meta-learned Yogi30 optimizer, while Novograd28 had the best average and median performance overall. The differences between the best meta-learned design and the best non-meta-learned design are small, however. All Adam variants appear robust and high-performing and are excellent candidates for InvDes optimization.

We attribute the general failure of meta-learning to improve Adam to conflict with the momentum-based gradient updates. Optimizing hyperparameters in this way is not always better than using fixed parameters or schedules such as cosine annealing:59 it is highly sensitive to the parameters of the meta-optimizer60 and vulnerable to overfitting,61 with the optimal set of hyperparameters in one batch of four inner loops often being very different from the optimal set in the next batch.

AdaGrad31 optimizers generally perform worse than the Adam family. They are more prone to noise during the descent, likely due to their longer memory of the gradient compared to Adam, which negatively impacts convergence during the continuum phase. Without meta-learning, both their average performance and their hero devices are inferior to Adam’s. Meta-learning has a very positive impact on their performance, such that AdaGrad, RMSProp,34 and AdaFactor33 perform similarly to Adam, although they neither find the hero devices of meta-learned Yogi nor reach the consistency of cosine-annealed Novograd. We theorize that the longer gradient memory might yield more generalizable hyperparameters at the completion of each outer loop compared to Adam.

The LARS family of optimizers has mixed results. LARS36 and Lamb38 have highly inconsistent performance and fail to fall into the best minima found by other optimizers. LARS, in particular, yielded designs with extremely poor performance, most of which are off the scale of Fig. 2. Cosine-annealed Fromage,37 however, is one of the best InvDes optimizers. While its hero devices are inferior to those of Yogi, Fromage’s results are more concentrated on the best-performing devices. LARS and Lamb are nominally designed for very large networks with extremely non-convex landscapes and allow update rates that can differ by orders of magnitude between parameters. However, they are known to cause divergence of the design parameters,37 with the magnitude of the parameters rapidly growing throughout the optimization. This was observed in our data, with the parameters increasing by many orders of magnitude. Fromage rectifies this, enjoying the advantages of LARS in extremely non-convex spaces without the often fatal divergence flaw.

Lion42 lacks both the consistency and hero devices of Adam and Fromage, even with the benefits it derives from meta-learning.

The performance of GD with a static learning rate is similar to that of Lion, although GD is better able to find hero devices when meta-learning is used to dynamically alter its learning rate. Artificial noise injection (NoisyGD41) appears to have a negligible-to-negative impact on GD performance.

L-BFGS43 has highly inconsistent performance. The additional function evaluations demanded by the line search reduce the total number of iterations that can be performed in a fixed wall-clock time, and the algorithm struggles to right itself during and after discretization and DRC. The mode splitter is a small and relatively simple problem that can be solved in a few iterations, a best-case scenario for L-BFGS, and the optimizer is able to find a core of high-performing devices, although with a number of very poor outliers. As will be seen with the wavelength demultiplexer, however, the cost of line search becomes clear for harder problems with longer optimizations.

Because each pixel is 40 nm, in theory even pixel-level features are fabricable with, e.g., GlobalFoundries’ SOI processes.62 However, to generate more robust designs, DRC successfully enforces at least an 80 nm minimum feature size across most designs. The number of designs violating DRC could be reduced by increasing the weight on the Lf term of the cost function, at the cost of performance. The distribution of minimum feature sizes computed using the imageruler package50,63 across all optimizers and sizes can be found in Fig. S3 of the supplementary material.
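
The feature-size histograms of Fig. S3 can be reproduced with the imageruler package; the snippet below assumes its minimum_length_scale entry point (returning solid and void length scales in pixels), which should be checked against the installed version, and uses the 40 nm pixel pitch to convert to nanometers.

```python
import numpy as np
import imageruler

PIXEL_NM = 40  # design-grid pitch

def min_feature_nm(design):
    """Minimum feature size of a binary design, in nanometers.
    Assumes imageruler.minimum_length_scale(x) -> (solid_px, void_px)."""
    solid_px, void_px = imageruler.minimum_length_scale(np.asarray(design, dtype=bool))
    return min(solid_px, void_px) * PIXEL_NM
```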

As illustrated by the final structures in Fig. 3, even with meta-learning and heuristics like momentum to more thoroughly explore the design space, InvDes remains highly initialization dependent. This is indicative of a highly nonconvex design space, and such sensitivity is not limited to gradient-based photonics InvDes, with topology optimization problems in mechanics64 and acoustics,65 as well as neural network training,66,67 exhibiting similar sensitivity. Within the limited statistics of our study, uniform and low-amplitude noise patterns appear to lead to higher performance, while high-amplitude noise yielded worse designs. Yet, even from the same initialization, there is wide diversity in the final designs. This highlights again the abundance of local minima of varying quality in our optimization landscape and, in turn, the importance of choosing and tuning the optimizer.

The performance of the initial structure generally correlates reasonably well with the performance of the final design in this simple problem. Because optimization is short, none of the algorithms have time to travel far from the initialization in the optimization landscape, making initial performance a generally acceptable, if rather inconsistent, predictor of final performance across the board [see Fig. S2(a) of the supplementary material].

The performance of the design after just two iterations is more predictive of final performance, although this is by no means consistent either, while the performance at the end of the continuum phase is generally highly correlated with final performance. This suggests that at least a few iterations should be performed on an initialization before accepting or rejecting it when pre-sampling initial conditions and supports the use of a continuum phase to find better discretized devices.
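
The correlation analysis summarized in Fig. 4(a) amounts to Pearson coefficients between cost snapshots taken at different stages; a minimal version, assuming the per-run costs have already been collected into arrays, is shown below.

```python
import numpy as np

def stage_correlations(costs):
    """Pearson correlation of each intermediate stage with the final cost.
    costs: dict mapping stage name ('init', 'iter2', 'continuum', 'discrete',
    'final') to an array of L values, one entry per (initialization, size) run."""
    final = np.asarray(costs["final"])
    return {
        stage: float(np.corrcoef(np.asarray(values), final)[0, 1])
        for stage, values in costs.items()
        if stage != "final"
    }
```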

Both the best average performance and the best hero devices were found in the 100 × 100 design space [Fig. 4(b)], indicating a possible trade-off between a sparsity of deep minima in small systems and an overabundance of shallow minima in large systems, with moderately sized systems proving the easiest InvDes targets.

FIG. 4.

Analysis of mode splitter data. (a) Pearson correlation coefficients ρ between the final L (state F) and L of the initialization [ρ(0, F)], second iteration [ρ(2, F)], end of the continuum phase [ρ(C, F)], and end of the discretization phase [ρ(D, F)]. med(F) is the median final L, and the −M suffix on an optimizer indicates meta-learning. Data from all system sizes and initializations have been aggregated. (b) Final L as a function of system size. Points are averages over all initializations, and error bars indicate the min/max L.

The triple wavelength demultiplexer targets wavelength-dependent separation of a polychromatic input [see Fig. 5(a)]. In particular, we demand that 1500, 1550, and 1600 nm light be directed to the three output waveguides. In terms of the scattering parameters, the performance component of the cost function is
$$L_p = -\big[S_{01}(1500\ \mathrm{nm};\ \mathrm{TE}_{00}, \mathrm{TE}_{00}) + S_{02}(1550\ \mathrm{nm};\ \mathrm{TE}_{00}, \mathrm{TE}_{00}) + S_{03}(1600\ \mathrm{nm};\ \mathrm{TE}_{00}, \mathrm{TE}_{00})\big]. \tag{4}$$
Three design region sizes (excluding cladding and waveguide leads) of 90 × 90 cells (3.6 × 3.6 μm2), 115 × 115 cells (4.6 × 4.6 μm2), and 140 × 140 cells (5.6 × 5.6 μm2) were used. A reduced number of optimizers were used for the wavelength demultiplexer due to the much longer optimization time.
FIG. 5.

Wavelength demultiplexer results. (a) Design target: a polychromatic input is split between three outputs, with 1500, 1550, and 1600 nm light sent to different channels. Perfect performance corresponds to a cost function of Lp = −3. (b) Cost function, averaged over all initializations, as a function of wall-clock time for select optimizers acting on a 115 × 115 design space without (left) and with (right) meta-learning. (c) Total cost function at the end of optimization for every optimizer and initialization without (upper) and with (lower) meta-learning. For each optimizer, the three sub-columns are 90 × 90, 115 × 115, and 140 × 140 design regions, respectively.


The training curves and the final performance of select optimizers for the three design region sizes can be seen in Fig. 5. The results follow the general trends of the mode splitter but with distinct features. All Adam optimizers perform very well, with minimal differences in performance among them. Meta-learned RAdam29 yielded the best hero device, Yogi had the best median performance, and meta-learned NAdam26 had the best average performance overall. For most Adam optimizers, meta-learning has a negligible-to-negative effect, although it was again necessary for the best hero devices.

AdaGrad and its family, with the exception of SM3,35 were more consistent than in the mode splitter case. We attribute this to the longer optimization time: as shown in Figs. 5(b) and 5(c), AdaGrad follows a noisier path during the early stages than most other optimizers, and the longer optimization appears necessary for these fluctuations to damp out. AdaGrad optimizers are again generally improved by meta-learning.

As in the splitter case, cosine-annealed Fromage performs as well as the Adam optimizers. Simple GD likewise performs as well as Adam, so long as meta-learning is used. L-BFGS once again underperforms the first-order methods, even for the smallest device size, as the handicap of fewer iterations in the same wall-clock time has a much greater impact during these longer optimizations.

The distribution of minimum feature sizes computed using the imageruler package50,63 across all optimizers and sizes can be found in Fig. S4 of the supplementary material.

As illustrated by the final structures in Fig. 6, the wavelength demultiplexer is also highly initialization dependent. Reduced noise amplitude in the initialization is again correlated with higher performance.

FIG. 6.

Wavelength demultiplexer structures at the end of optimization for each initialization and select optimizer, with and without meta-learning. Colors indicate the performance. A system size of 115 × 115 is shown. v refers to the variance of the Gaussian noise used to generate the initialization.


Correlation between final performance and performance at different stages of the design process again supports the use of the continuum phase, as good continuum designs are highly correlated with good discretized designs. However, compared to the mode splitter, performance at initialization and after a small number of iterations is a much poorer predictor of final performance [Fig. 7(a)]. We additionally observe that optimizers with better median final performance tend to have a lower correlation between initial and final performance [see also Fig. S2(b) of the supplementary material]. We believe this indicates they are better explorers, traveling far from the initialization and rejecting poor minima rather than immediately isolating the nearest local optimum. In the shorter mode splitter optimization, this difference in behavior was not as clear, as there was not enough time for the optimizers to explore the design space, and the simpler problem meant there were many high-performing minima closer to a given initialization.

The average performance as a function of system size also differs from the mode splitter, with the best results derived from the largest, 140 × 140 cell design region [Fig. 7(b)]. This is consistent with the prediction that a more difficult design task will have a larger optimal design region size.

FIG. 7.

Analysis of wavelength demultiplexer data. (a) Pearson correlation coefficients ρ between the final L (state F) and L of the initialization [ρ(0, F)], second iteration [ρ(2, F)], end of the continuum phase [ρ(C, F)], and end of the discretization phase [ρ(D, F)]. med(F) is the median final L, and the −M suffix on an optimizer indicates meta-learning. Data from all system sizes and initializations have been aggregated. (b) Final L as a function of system size. Points are averages over all initializations, and error bars indicate the min/max L.


We study the performance of a large number of the most common machine learning optimizers when applied to nanophotonic inverse design tasks, across different initializations and as a function of system size. We find the Adam family of optimizers to have the best performance overall. Meta-learning seems to interfere with momentum in many cases, resulting in more poor-performing outliers, but also the highest-performing hero devices. Meta-learned Yogi produced the best hero devices in the mode splitter problem, while meta-learned RAdam produced the best wavelength demultiplexer design. AdaGrad optimizers generally underperformed Adam optimizers, although with meta-learning they had roughly equivalent performance. Fromage, with cosine-annealing, was the only high-performing LARS optimizer, outperforming Adam in some cases. L-BFGS generally underperformed first-order methods, likely due to the increased computational burden associated with line search.

We show that larger system sizes do not always yield better results; there appears to exist an optimal system size for a given design target, related to its complexity, after which increasing system size makes it more difficult for currently available optimizers to locate high-performing minima. Future optimizers or samplers68 might mitigate this issue.

We find that the performance at initialization or after a few iterations is a poor indicator of final performance as problem complexity increases, particularly for the best optimizers. Nonetheless, performance at the end of a sufficiently long continuum optimization is strongly correlated with final performance, allowing poor-performing continuum designs to be safely rejected.

We, therefore, suggest Adam optimizers be the preferred first choice for nanophotonic InvDes, with Fromage and, in some cases, meta-learned AdaGrad derivatives offering strong alternatives. In contrast, quasi-Newton methods such as L-BFGS, although widely used, may be less effective except for the smallest design regions.

Finally, we note that there remains a vast array of distinct InvDes tasks beyond the two we examined, along with a wide phase space of hyperparameters for both optimizers and meta-optimizers. Most InvDes tasks utilize qualitatively similar cost functions, ultimately dependent on field magnitude at particular input and output points, and are, therefore, expected to have similarly nonconvex design spaces. The similarities in the results obtained for the two tasks presented here suggest that other InvDes tasks, both in nanophotonics and other fields, are likely to exhibit similar behavior.

Nevertheless, future research should expand to include a broader range of tasks, parameter settings, and system sizes to fully understand the strengths and limitations of various existing machine-learning optimization strategies. This could also pave the way for the development of new optimizers specifically tailored to nanophotonic InvDes.

The supplementary material contains the following figures: Fig. S1 depicts the performance vs learning rate for each optimizer, Fig. S2 depicts the correlation between the quality of the initial guess and the quality of the final design as a function of the quality of the final design, and Figs. S3 and S4 depict the relative ability of different optimizers to perform DRC by providing histograms of the number of final designs possessing a given minimum length scale.

The authors have no conflicts to disclose.

Nathaniel Morrison: Conceptualization (equal); Formal analysis (equal); Software (equal); Visualization (equal); Writing – original draft (equal). Eric Y. Ma: Conceptualization (equal); Formal analysis (equal); Project administration (equal); Writing – original draft (equal); Writing – review & editing (equal).

The code and supporting data for this article are publicly available through the Ma Lab GitHub organization.51 

1. S. Zhang, C. L. Wong, S. Zeng, R. Bi, K. Tai, K. Dholakia, and M. Olivo, “Metasurfaces for biomedical applications: Imaging and sensing from a nanophotonics perspective,” Nanophotonics 10, 259–293 (2020).
2. S. J. Ben Yoo, “Prospects and challenges of photonic switching in data centers and computing systems,” J. Lightwave Technol. 40, 2214–2243 (2022).
3. B. Neşeli, Y. A. Yilmaz, H. Kurt, and M. Turduev, “Inverse design of ultra-compact photonic gates for all-optical logic operations,” J. Phys. D: Appl. Phys. 55, 215107 (2022).
4. Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić, “Deep learning with coherent nanophotonic circuits,” Nat. Photonics 11, 441–446 (2017).
5. S. Slussarenko and G. J. Pryde, “Photonic quantum information processing: A concise review,” Appl. Phys. Rev. 6, 041303 (2019).
6. A. F. Koenderink, A. Alù, and A. Polman, “Nanophotonics: Shrinking light-based technology,” Science 348, 516–521 (2015).
7. H. Altug, S.-H. Oh, S. A. Maier, and J. Homola, “Advances and applications of nanophotonic biosensors,” Nat. Nanotechnol. 17, 5–16 (2022).
8. A. Y. Piggott, J. Lu, K. G. Lagoudakis, J. Petykiewicz, T. M. Babinec, and J. Vučković, “Inverse design and demonstration of a compact and broadband on-chip wavelength demultiplexer,” Nat. Photonics 9, 374–377 (2015).
9. S. Molesky, Z. Lin, A. Y. Piggott, W. Jin, J. Vucković, and A. W. Rodriguez, “Inverse design in nanophotonics,” Nat. Photonics 12, 659–670 (2018).
10. A. D. White, L. Su, D. I. Shahar, K. Y. Yang, G. H. Ahn, J. L. Skarda, S. Ramachandran, and J. Vučković, “Inverse design of optical vortex beam emitters,” ACS Photonics 10, 803–807 (2023).
11. R. Deng, W. Liu, and L. Shi, “Inverse design in photonic crystals,” Nanophotonics 13, 1219–1237 (2024).
12. R. E. Christiansen and O. Sigmund, “Inverse design in photonics by topology optimization: Tutorial,” J. Opt. Soc. Am. B 38, 496–509 (2021).
13. A. Y. Piggott, E. Y. Ma, L. Su, G. H. Ahn, N. V. Sapra, D. Vercruysse, A. M. Netherton, A. S. P. Khope, J. E. Bowers, and J. Vučković, “Inverse-designed photonics for semiconductor foundries,” ACS Photonics 7, 569–575 (2020).
14. M. Minkov, I. A. D. Williamson, L. C. Andreani, D. Gerace, B. Lou, A. Y. Song, T. W. Hughes, and S. Fan, “Inverse design of photonic crystals through automatic differentiation,” ACS Photonics 7, 1729–1741 (2020).
15. G. Zhang, D.-X. Xu, Y. Grinberg, and O. Liboiron-Ladouceur, “Topological inverse design of nanophotonic devices with energy constraint,” Opt. Express 29, 12681–12695 (2021).
16. T. W. Hughes, M. Minkov, I. A. D. Williamson, and S. Fan, “Adjoint method and inverse design for nonlinear nanophotonic devices,” ACS Photonics 5, 4781–4787 (2018).
17. M. F. Schubert, A. K. C. Cheung, I. A. D. Williamson, A. Spyra, and D. H. Alexander, “Inverse design of photonic devices with strict foundry fabrication constraints,” ACS Photonics 9, 2327–2336 (2022).
18. S. Gertler, Z. Kuang, C. Christie, and O. D. Miller, “Many physical design problems are sparse QCQPs,” arXiv:2303.17691 (2023).
19. D. Soydaner, “A comparison of optimization algorithms for deep learning,” Int. J. Pattern Recognit. Artif. Intell. 34, 2052013 (2020).
20. D. K. R. Gaddam, M. D. Ansari, S. Vuppala, V. K. Gunjan, and M. M. Sati, “A performance comparison of optimization algorithms on a generated dataset,” in ICDSMLA 2020 (Springer, Singapore, 2022), pp. 1407–1415.
21. L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural networks using DropConnect,” in Proceedings of the 30th International Conference on Machine Learning (PMLR, 2013), pp. 1058–1066.
22. M. M. Bejani and M. Ghatee, “A systematic review on overfitting control in shallow and deep neural networks,” Artif. Intell. Rev. 54, 6391–6438 (2021).
23. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980 (2017).
24. J. Zhuang, T. Tang, Y. Ding, S. Tatikonda, N. Dvornek, X. Papademetris, and J. S. Duncan, “AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients,” in Proceedings of the 34th Conference on Neural Information Processing Systems (Curran Associates, Inc., 2020).
25. I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in 7th International Conference on Learning Representations (ICLR, 2019).
26. T. Dozat, “Incorporating Nesterov momentum into Adam” (2016).
27. S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of Adam and beyond,” in International Conference on Learning Representations (2018).
28. B. Ginsburg, P. Castonguay, O. Hrinchuk, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, H. Nguyen, Y. Zhang, and J. M. Cohen, “Stochastic gradient methods with layer-wise adaptive moments for training of deep networks,” arXiv:1905.11286 (2020).
29. L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, “On the variance of the adaptive learning rate and beyond,” in Proceedings of the 8th International Conference on Learning Representations (ICLR, 2020).
30. M. Zaheer, S. Reddi, D. Sachan, S. Kale, and S. Kumar, “Adaptive methods for nonconvex optimization,” in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2018), Vol. 31.
31. J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” J. Mach. Learn. Res. 12, 2121–2159 (2011).
32. M. D. Zeiler, “ADADELTA: An adaptive learning rate method,” arXiv:1212.5701 (2012).
33. N. Shazeer and M. Stern, “Adafactor: Adaptive learning rates with sublinear memory cost,” in Proceedings of the 35th International Conference on Machine Learning (PMLR, 2018).
34. A. Graves, “Generating sequences with recurrent neural networks,” arXiv:1308.0850 (2014).
35. R. Anil, V. Gupta, T. Koren, and Y. Singer, “Memory-efficient adaptive optimization,” in Proceedings of the 37th International Conference on Neural Information Processing Systems (Curran Associates, Inc., 2019).
36. Y. You, I. Gitman, and B. Ginsburg, “Large batch training of convolutional networks,” arXiv:1708.03888 (2017).
37. J. Bernstein, A. Vahdat, Y. Yue, and M.-Y. Liu, “On the distance between two neural networks and the stability of learning,” in Advances in Neural Information Processing Systems 33 (Curran Associates, Inc., 2020), pp. 21370–21381.
38. Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C.-J. Hsieh, “Large batch optimization for deep learning: Training BERT in 76 minutes,” arXiv:1904.00962 (2020).
39. I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in Proceedings of the 30th International Conference on Machine Learning (PMLR, 2013), pp. 1139–1147.
40. A. Mokhtari, A. Ozdaglar, and S. Pattathil, “A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach,” in Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (PMLR, 2020).
41. A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens, “Adding gradient noise improves learning for very deep networks,” arXiv:1511.06807 (2015).
42. X. Chen, C. Liang, D. Huang, E. Real, K. Wang, H. Pham, X. Dong, T. Luong, C.-J. Hsieh, Y. Lu, and Q. V. Le, “Symbolic discovery of optimization algorithms,” in NIPS’23: Proceedings of the 37th International Conference on Neural Information Processing Systems (Curran Associates, Inc., 2024), pp. 49205–49233.
43. D. C. Liu and J. Nocedal, “On the limited memory BFGS method for large scale optimization,” Math. Program. 45, 503–528 (1989).
44. M. Zhou, B. S. Lazarov, F. Wang, and O. Sigmund, “Minimum length scale in topology optimization by geometric constraints,” Comput. Methods Appl. Mech. Eng. 293, 266–282 (2015).
45. A. M. Hammond, A. Oskooi, S. G. Johnson, and S. E. Ralph, “Photonic topology optimization with semiconductor-foundry design-rule constraints,” Opt. Express 29, 23916–23938 (2021).
46. Y. Zhou, C. Mao, E. Gershnabel, M. Chen, and J. A. Fan, “Large-area, high-numerical-aperture, freeform metasurfaces,” Laser Photonics Rev. 18, 2300988 (2024).
47. E. Khoram, X. Qian, M. Yuan, and Z. Yu, “Controlling the minimal feature sizes in adjoint optimization of nanophotonic devices using b-spline surfaces,” Opt. Express 28, 7060–7069 (2020).
48. Y. Zhou, Y. Shao, C. Mao, and J. A. Fan, “Inverse-designed metasurfaces with facile fabrication parameters,” J. Opt. 26, 055101 (2024).
49. E. W. Wang, D. Sell, T. Phan, and J. A. Fan, “Robust design of topology-optimized metasurfaces,” Opt. Mater. Express 9, 469–482 (2019).
50. M. Chen, R. E. Christiansen, J. A. Fan, G. Işiklar, J. Jiang, S. G. Johnson, W. Ma, O. D. Miller, A. Oskooi, M. F. Schubert, F. Wang, I. A. D. Williamson, W. Xue, and Y. Zhou, “Validation and characterization of algorithms and software for photonics inverse design,” J. Opt. Soc. Am. B 41, A161–A176 (2024).
51. N. Morrison, “photonicsOptComp,” UC Berkeley (2024), https://github.com/Ma-Lab-Cal/photonicsOptComp
52. U. S. Inan and R. A. Marshall, Numerical Electromagnetics: The FDTD Method (Cambridge University Press, 2011).
53. J. Lu, Spinsphotonics/fdtdz, SPINS Photonics (2024).
54. J. Lu, Spinsphotonics/pjz, SPINS Photonics (2024).
55. N. Ye, F. Roosta-Khorasani, and T. Cui, “Optimization methods for inverse problems,” in 2017 MATRIX Annals, edited by J. de Gier, C. E. Praeger, and T. Tao (Springer International Publishing, Cham, 2019), pp. 121–140.
56. A. Gotmare, N. S. Keskar, C. Xiong, and R. Socher, “A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation,” arXiv:1810.13243 (2018).
57. Z. Liu, “Super convergence cosine annealing with warm-up learning rate,” in CAIBDA 2022; 2nd International Conference on Artificial Intelligence, Big Data and Algorithms (IEEE, Nanjing, China, 2022), pp. 1–7.
58. L. N. Smith and N. Topin, “Super-convergence: Very fast training of neural networks using large learning rates,” Proc. SPIE 11006, 369–386 (2019).
59. T. Yu and H. Zhu, “Hyper-parameter optimization: A review of algorithms and applications,” arXiv:2003.05689 (2020).
60. D. Maclaurin, D. Duvenaud, and R. P. Adams, “Gradient-based hyperparameter optimization through reversible learning,” in Proceedings of the 32nd International Conference on Machine Learning (JMLR, 2015).
61. L. Yang and A. Shami, “On hyperparameter optimization of machine learning algorithms: Theory and practice,” Neurocomputing 415, 295–316 (2020).
62. L. Gwennap, “FD-SOI offers alternative to FinFET,” in Microprocessor Report (TechInsights, Mountain View, CA, 2016).
63. A. Oskooi, NanoComp/imageruler, Nanostructures and Computation Group (2024).
64. S. Yang, S. Lee, and K. Yee, “Inverse design optimization framework via a two-step deep learning approach: Application to a wind turbine airfoil,” Eng. Comput. 39, 2239–2255 (2023).
65. B. Pan, X. Song, J. Xu, D. Sui, H. Xiao, J. Zhou, and J. Gu, “Accelerated inverse design of customizable acoustic metaporous structures using a CNN-GA-based hybrid optimization framework,” Appl. Acoust. 210, 109445 (2023).
66. J. Jiang and J. A. Fan, “Global optimization of dielectric metasurfaces using a physics-driven neural network,” Nano Lett. 19, 5366–5372 (2019).
67. M. V. Narkhede, P. P. Bartakke, and M. S. Sutaone, “A review on weight initialization strategies for neural networks,” Artif. Intell. Rev. 55, 291–322 (2022).
68. J. Robnik, G. B. De Luca, E. Silverstein, and U. Seljak, “Microcanonical Hamiltonian Monte Carlo,” J. Mach. Learn. Res. 24, 311:14696–311:14729 (2024).