Reinforcement learning is a foundational machine-learning architecture in which an outstanding problem is achieving an optimal balance between exploration and exploitation. Specifically, exploration enables the agent to discover optimal policies in unknown domains of the environment and thereby gain potentially large future rewards, while exploitation relies on already acquired knowledge to maximize the immediate rewards. We articulate an approach to this problem by treating the dynamical process of reinforcement learning as a Markov decision process that can be modeled as a nondeterministic finite automaton and by defining a subset of states in the automaton to represent the preference for exploring unknown domains of the environment. Exploration is prioritized by assigning higher transition probabilities to these states. We derive a mathematical framework that systematically balances exploration and exploitation by formulating it as a mixed integer programming (MIP) problem that optimizes the agent's actions and maximizes the discovery of novel preferential states. Solving the MIP problem yields a trade-off point between exploiting known states and exploring unexplored regions. We validate the framework computationally with a benchmark system and argue that the articulated automaton is effectively an adaptive network with a time-varying connection matrix, where the states of the automaton are the nodes and the transitions among the states are the edges. The network is adaptive because the transition probabilities evolve over time. The established connection between the adaptive automaton arising from reinforcement learning and adaptive networks opens the door to applying the theory of complex dynamical networks to frontier problems in machine learning and artificial intelligence.
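As a purely illustrative complement to the abstract, the sketch below shows one way an exploration-exploitation balance of this kind could be posed as a small mixed integer program in Python: binary variables select which automaton states the agent targets, and the objective weighs known (exploitation) rewards against novel preferential (exploration) states under a visit budget. This is a minimal sketch under assumed data; the state values, the weight `lam`, the `budget`, and the use of `scipy.optimize.milp` are hypothetical choices for illustration, not the formulation developed in the paper.

```python
# Minimal, hypothetical sketch of an exploration-exploitation MIP.
# All numbers and names (known_reward, is_preferential, lam, budget) and the
# choice of solver are illustrative assumptions, not the paper's formulation.
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

n_states = 6
known_reward = np.array([0.9, 0.7, 0.0, 0.0, 0.1, 0.0])     # exploitation value of each state
is_preferential = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0])  # 1 = novel "preferential" state
lam = 0.6     # weight on exploration (0 = pure exploitation, 1 = pure exploration)
budget = 3    # at most this many states can be targeted per decision step

# Binary variable x_i = 1 if state i is targeted. Maximize
# (1 - lam) * known_reward @ x + lam * is_preferential @ x; milp minimizes, so negate.
c = -((1.0 - lam) * known_reward + lam * is_preferential)

visit_budget = LinearConstraint(np.ones((1, n_states)), lb=0, ub=budget)
result = milp(
    c=c,
    constraints=visit_budget,
    integrality=np.ones(n_states),  # every variable is integer...
    bounds=Bounds(0, 1),            # ...and restricted to {0, 1}
)

selected = np.flatnonzero(result.x > 0.5)
print("States targeted at the exploration-exploitation trade-off:", selected)
```

Sweeping `lam` from 0 to 1 in such a toy model traces how the selected set shifts from high-reward known states toward unexplored preferential states, which is the qualitative trade-off the MIP formulation described in the abstract is designed to resolve.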
