In the limit of small trial moves, the Metropolis Monte Carlo algorithm is equivalent to gradient descent on the energy function in the presence of Gaussian white noise. This observation was originally used to demonstrate a correspondence between Metropolis Monte Carlo moves of model molecules and overdamped Langevin dynamics, but it also applies to training a neural network: making small random changes to the weights of a neural network, accepted with the Metropolis probability and with the loss function playing the role of energy, has the same effect as training by explicit gradient descent in the presence of Gaussian white noise. We explore this correspondence in the context of a simple recurrent neural network. We also explore regimes in which the correspondence breaks down, where the gradient of the loss function becomes very large or very small. In these regimes the Metropolis algorithm can still effect training, and so it can be used as a probe of the loss function of a neural network in regimes in which gradient descent struggles. We also show that training can be accelerated by making purposely designed Monte Carlo trial moves of neural-network weights.
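The training procedure described above can be written in a few lines. The sketch below is a minimal illustration, not the authors' code: it trains a toy one-hidden-layer network on a sine-fitting task using Gaussian trial moves of the weights accepted with the Metropolis probability, with the loss playing the role of energy. The network size, the move scale sigma, and the fictitious temperature T are assumed, illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: learn y = sin(x) (illustrative choice).
x = np.linspace(-np.pi, np.pi, 64)[:, None]
y = np.sin(x)

# One-hidden-layer network; shapes and initialization are assumptions.
params = [rng.normal(0, 0.5, (1, 16)), np.zeros(16),
          rng.normal(0, 0.5, (16, 1)), np.zeros(1)]

def loss(p):
    """Mean-squared error of the network defined by parameters p."""
    w1, b1, w2, b2 = p
    h = np.tanh(x @ w1 + b1)
    return np.mean((h @ w2 + b2 - y) ** 2)

sigma, T = 1e-2, 1e-5   # trial-move scale and temperature (assumed values)
current = loss(params)
for step in range(20000):
    # Gaussian trial move of every weight and bias.
    trial = [p + rng.normal(0, sigma, p.shape) for p in params]
    new = loss(trial)
    # Metropolis acceptance: downhill moves always accepted, uphill moves
    # accepted with probability exp(-(L_new - L_old)/T).
    if new < current or rng.random() < np.exp(-(new - current) / T):
        params, current = trial, new

print(f"final loss after Monte Carlo training: {current:.4f}")
```

For small sigma, the accepted displacements of the weights have the same statistics as an explicit gradient-descent step plus Gaussian white noise, which is the correspondence discussed in the abstract.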

1. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, "Equation of state calculations by fast computing machines," J. Chem. Phys. 21, 1087–1092 (1953).
2. J. E. Gubernatis, "Marshall Rosenbluth and the Metropolis algorithm," Phys. Plasmas 12, 057303 (2005).
3. M. N. Rosenbluth, "Genesis of the Monte Carlo algorithm for statistical mechanics," AIP Conf. Proc. 690, 22–30 (2003).
4. M. Helene Whitacre and A. Wright Rosenbluth, Tech. Rep. (Los Alamos National Lab. (LANL), Los Alamos, NM, USA, 2021).
5. G. Bhanot, "The Metropolis algorithm," Rep. Prog. Phys. 51, 429 (1988).
6. I. Beichl and F. Sullivan, "The Metropolis algorithm," Comput. Sci. Eng. 2, 65–69 (2000).
7. D. Frenkel and B. Smit, Understanding Molecular Simulation: From Algorithms to Applications (Academic Press, 2001), Vol. 1.
8. K. Kikuchi, M. Yoshida, T. Maekawa, and H. Watanabe, "Metropolis Monte Carlo method as a numerical technique to solve the Fokker-Planck equation," Chem. Phys. Lett. 185, 335–338 (1991).
9. K. Kikuchi, M. Yoshida, T. Maekawa, and H. Watanabe, "Metropolis Monte Carlo method for Brownian dynamics simulation generalized to include hydrodynamic interactions," Chem. Phys. Lett. 196, 57–61 (1992).
10. S. Whitelam, V. Selin, S.-W. Park, and I. Tamblyn, "Correspondence between neuroevolution and gradient descent," Nat. Commun. 12, 6317 (2021).
11. The original paper (Ref. 1) drew random numbers using the middle-square method, and Refs. 8 and 9 use uniform random numbers.
12. E. Sanz and D. Marenduzzo, "Dynamic Monte Carlo versus Brownian dynamics: A comparison for self-diffusion and crystallization in colloidal fluids," J. Chem. Phys. 132, 194102 (2010).
13. For strong, short-ranged potentials the required value of σ may be too small to be convenient, in which case we need to make collective Monte Carlo moves in order to approximate a realistic dynamics (Ref. 45).
14. S. Whitelam, V. Selin, I. Benlolo, C. Casert, and I. Tamblyn, "Training neural networks using Metropolis Monte Carlo and an adaptive variant," Mach. Learn.: Sci. Technol. 3, 045026 (2022).
15. The difference in speed does not scale linearly with the number of weights N, however, as is sometimes stated.
16. R. S. Sexton, R. E. Dorsey, and J. D. Johnson, "Beyond backpropagation: Using simulated annealing for training neural networks," J. Organ. End User Comput. 11, 3–10 (1999).
17. L. R. Rere, M. I. Fanany, and A. M. Arymurthy, "Simulated annealing algorithm for deep learning," Procedia Comput. Sci. 72, 137–144 (2015).
18. R. Tripathi and B. Singh, "RSO: A gradient free sampling based approach for training deep neural networks," arXiv:2005.05955 (2020).
19. J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks 61, 85–117 (2015).
20. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature 521, 436–444 (2015).
21. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
22. L. Metz, C. D. Freeman, S. S. Schoenholz, and T. Kachman, "Gradients are not all you need," arXiv:2111.05803 (2021).
23. J. H. Holland, "Genetic algorithms," Sci. Am. 267, 66–72 (1992).
24. D. B. Fogel and L. C. Stayton, "On the effectiveness of crossover in simulated evolutionary optimization," BioSystems 32, 171–182 (1994).
25. D. J. Montana and L. Davis, "Training feedforward neural networks using genetic algorithms," in IJCAI'89, 1989, pp. 762–767.
26. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv:1312.5602 (2013).
27. G. Morse and K. O. Stanley, "Simple evolutionary optimization can rival stochastic gradient descent in neural networks," in Proceedings of the Genetic and Evolutionary Computation Conference, 2016, pp. 477–484.
28. T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, "Evolution strategies as a scalable alternative to reinforcement learning," arXiv:1703.03864 (2017).
29. U. Wolff, "Collective Monte Carlo updating for spin systems," Phys. Rev. Lett. 62, 361 (1989).
30. R. H. Swendsen and J.-S. Wang, "Nonuniversal critical dynamics in Monte Carlo simulations," Phys. Rev. Lett. 58, 86 (1987).
31. J. Liu and E. Luijten, "Rejection-free geometric cluster algorithm for complex fluids," Phys. Rev. Lett. 92, 035504 (2004).
32. K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," arXiv:1409.1259 (2014).
33. J. Chung, C. Gulcehre, K. H. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv:1412.3555 (2014).
34. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput. 9, 1735–1780 (1997).
35. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, Vol. 30.
36. Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. Neural Networks 5, 157–166 (1994).
37. R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in International Conference on Machine Learning (PMLR, 2013), pp. 1310–1318.
38. J. Collins, J. Sohl-Dickstein, and D. Sussillo, "Capacity and trainability in recurrent neural networks," arXiv:1611.09913 (2016).
39. M. Arjovsky, A. Shah, and Y. Bengio, "Unitary evolution recurrent neural networks," in International Conference on Machine Learning (PMLR, 2016), pp. 1120–1128.
40. I. J. Goodfellow, O. Vinyals, and A. M. Saxe, "Qualitatively characterizing neural network optimization problems," arXiv:1412.6544 (2014).
41. D. Jiwoong Im, M. Tao, and K. Branson, "An empirical analysis of the optimization of deep network loss surfaces," arXiv:1612.04010 (2016).
42. W. Scott, T. Powers, J. Hershey, J. Le Roux, and L. Atlas, "Full-capacity unitary recurrent neural networks," in Advances in Neural Information Processing Systems, 2016, Vol. 29.
43. Z. Mhammedi, A. Hellicar, A. Rahman, and J. Bailey, "Efficient orthogonal parametrisation of recurrent neural networks using Householder reflections," in International Conference on Machine Learning (PMLR, 2017), pp. 2401–2409.
44. K. Helfrich and Q. Ye, "Eigenvalue normalized recurrent neural networks for short term memory," in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, Vol. 34, pp. 4115–4122.
45. S. Whitelam and P. L. Geissler, "Avoiding unphysical kinetic traps in Monte Carlo simulations of strongly attractive particles," J. Chem. Phys. 127, 154101 (2007).