In this paper, we present a proof-of-concept distributed reinforcement learning framework for wind farm energy capture maximization. The proposed algorithm applies Q-Learning in a wake-delayed wind farm environment and considers time-varying, though not yet fully turbulent, wind inflow conditions. These modifications form the Gradient Approximation with Reinforcement Learning and Incremental Comparison (GARLIC) framework for optimizing wind farm energy capture in time-varying conditions, which we compare against a static lookup table wind farm controller baseline derived from the FLOw Redirection and Induction in Steady State (FLORIS) model.

## I. INTRODUCTION

Wind farms extract energy from the wind and convert it to electrical energy. When multiple wind turbines operate together in a wind farm, the energy extracted from the upstream wind creates a wake that propagates downstream. Turbines within this wake experience lower wind speeds and consequently produce less power. Wind farm wake interactions create several engineering challenges. Most importantly for this work, wake interactions can decrease the total production of a wind farm, as greedy extraction of energy by upstream turbines can disproportionately affect downstream turbines.^{1} We showed in Ref. 2 that a distributed form of reinforcement learning (RL) could be used to control a wind turbine's yaw misalignment, thus deflecting its wake away from downstream turbines in simpler wind environments. This use of reinforcement learning proved to be adaptable to yaw angle sensor faults by incorporating the farm power output into a feedback path. This provides a key advantage over the current state-of-the-art method of optimizing farm yaw angles ahead of time using a wake deflection model like the FLOw Redirection and Induction in Steady State (FLORIS) model^{3} and assembling a static lookup table (LUT).^{4} The difference between a static LUT control algorithm and an adaptable RL algorithm is shown in Fig. 1. An example of FLORIS output illustrating the wake interactions that necessitate wake steering is shown in Fig. 2.

There are a number of ideas in the literature to mitigate the negative impacts of wake interaction. Axial induction control is often used to de-rate individual turbines in order to achieve power reserve maximization as well as power reference tracking.^{5,6} Tilt control, as in Refs. 7 and 8, involves tilting the rotor plane forward or backward to deflect the slower moving wake above or below downstream turbines, although existing MW-scale turbines are not capable of active tilt actuation. Finally, a common method for achieving wind farm power maximization is through yaw misalignment, the intentional misalignment of upstream turbines with the prevailing wind direction to deflect the wake laterally away from downstream turbines.^{9–12}

Typically, yaw angle or axial control is accomplished by optimizing simulated wind farm output using a wake velocity and deflection model, such as the FLORIS^{3} model and its optimization scheme.

Using such methods can produce a (typically fixed) LUT based on wind farm parameters like wind speed and direction, as in Ref. 4. This LUT can then determine the proper turbine yaw angles for a given wind speed and wind direction. However, this method presumes that the model is sufficiently accurate to characterize the turbines, wakes, and environment; modeling inaccuracies could therefore cause significant issues for an open-loop LUT relying on this assumption. Other model-based real-time optimization approaches have been proposed, but these remain susceptible to modeling inaccuracy. An increasingly common way to address these inaccuracies in modeling turbine wake dynamics is to use data-driven or model-free approaches that take measurements in real time and use them to create or update dynamic system models, allowing better control inputs to be selected.^{13,14} One such solution is the game-theory-based approach described in Ref. 15, which can dynamically adjust to wind farm conditions. The method in Ref. 15 reduces the computational complexity of the wind farm optimization problem by treating wind turbines as agents that optimize a local reward function and does not depend on a model, but it either assumes steady-state conditions and instantaneous wake propagation or requires large time delays to allow the wake to propagate through the entire wind farm. Gebraad *et al.*^{16} proposed incorporating a time delay into a distributed gradient-descent routine, which allows the algorithm to adjust in real time to unpredictable errors and achieve faster convergence by considering only a turbine's nearest downstream neighbor. The approach in Ref. 16 successfully adapts a model-free, agent-based control method to a system with inherent time delays but must reconverge every time wind conditions change.
An example of the concept of neighborhoods changing as conditions change can be found in the dynamic clustering approach described in Ref. 17.

Although promising in its ability to adapt to wind farm sensor faults, the approach in Ref. 2 was significantly limited in terms of applicability to a real wind farm because it used unrealistic stepwise constant wind conditions. This work builds on the proof-of-concept wind farm optimization algorithm detailed in Ref. 2. Specifically, this paper modifies the algorithm further so that it can optimize in wind conditions that are not stepwise constant. This is a step toward demonstrating that an RL algorithm can optimize wind energy capture in time-varying wind speeds relative to a static LUT, using offline learning to speed convergence while retaining the ability to react to sensor errors in the field.

This paper is organized as follows. Section II provides background on reinforcement learning and how it was adapted to a wind farm application in Ref. 2. Section III describes how the RL algorithm summarized in Sec. II is adapted to an environment with wake delay and time-varying wind speed. Section IV compares the performance of the modifications described in Sec. III with the FLORIS baseline controller and also examines different scenarios in which RL could achieve the dynamic adjustment of a yaw angle schedule to fit the environment. Finally, Sec. V summarizes the work and provides several possible paths forward needed to transform this proof-of-concept to field readiness.

## II. OVERVIEW OF Q-LEARNING

At its most basic level, RL is a model-free control algorithm that involves an autonomous agent taking actions, receiving some sort of reward signal from the environment, and then modifying its future action selections based on this reward signal.^{18} In this work, we use Q-Learning, a popular type of RL algorithm. Here, an agent is a turbine, and the actions that it takes are yaw angle misalignments, which will be described in Sec. II A.

### A. Algorithm summary

In Ref. 2, we introduced a distributed variant of the Q-Learning algorithm for RL, which we summarize here. The objective of Q-Learning is to assemble a state-action value mapping known as $Q(s,\alpha)$, with *s* representing a discrete system state (condition) and *α* representing an action that an agent takes within its environment. $Q(s,\alpha)$ quantifies the expected future gains of taking an action, *α*, from a state, *s*, and then proceeding optimally from there. The Q-Learning process uses the following update equation:
$$Q_{i,t+1}(s,\alpha) = Q_{i,t}(s,\alpha) + l_{i,t}\left[r_{i,t+1} + \beta_i \max_{\alpha'} Q_{i,t}(s',\alpha') - Q_{i,t}(s,\alpha)\right], \tag{1}$$
where $Q_{i,t}(s,\alpha)$ is the estimated expected value of agent *i* taking action *α* in state *s*, $l_{i,t} \in (0,1]$ is a learning rate determining how quickly the algorithm updates at time *t* for agent *i*, $r_{i,t+1}$ is an environmental reward signal for agent *i*, $\beta_i \in [0,1)$ is a discount factor for agent *i* that makes convergence possible over an infinite time horizon, and *t* is the simulation iteration number. As in Ref. 19, we use the definition of learning rate in the following equation:
$$l_{i,t} = \frac{k_1}{k_2 + k_3\, n(S_{i,t})}, \tag{2}$$
where $k_1$, $k_2$, and $k_3$ are constants that can be tuned to determine how quickly the learning rate changes, and $n(S_{i,t})$ is the number of times the state given by $S_{i,t}$ has been visited. Thus, as a state is visited more frequently, $n(S_{i,t})$ increases and the learning rate decreases. Because a state is represented by a set of indices $S_{i,t}$ in a discrete state space, the state-action mapping $Q(s,\alpha)$ must necessarily be tabular. This Q-table is discussed in more detail in Sec. II C.
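As a concrete illustration, one tabular update with a visit-count-decaying learning rate can be sketched in a few lines of Python. The learning-rate form $k_1/(k_2 + k_3\, n(S_{i,t}))$ and the constants below are illustrative assumptions; the text only requires that the rate decrease as visits accumulate, and the two-action space matches Sec. II A.

```python
def q_update(Q, n_visits, s, a, r, s_next, k=(1.0, 1.0, 0.1), beta=0.8):
    """One tabular Q-Learning update with a learning rate that decays as
    the state is visited more often.

    Q        : dict mapping (state, action) -> value estimate
    n_visits : dict mapping state -> visit count
    The learning-rate form k1 / (k2 + k3 * n(s)) is an illustrative choice."""
    k1, k2, k3 = k
    n_visits[s] = n_visits.get(s, 0) + 1
    lr = k1 / (k2 + k3 * n_visits[s])                      # l_{i,t} in (0, 1]
    # max over the two actions (decrease / increase) in the next state
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in (0, 1))
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + lr * (r + beta * best_next - old)    # Eq. (1)-style update
    return Q[(s, a)]
```

Repeated visits to the same state shrink `lr`, so later updates perturb a converged table less and less.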

The action space, *A*, selected in Ref. 2 was given by
$$A = \{\alpha_0, \alpha_1, \alpha_2\} = \{-\Delta\gamma,\ 0,\ +\Delta\gamma\}. \tag{3}$$
This action space guarantees that an individual turbine will only be able to move by a small yaw angle change $\Delta\gamma$ each time it initiates a control action. While this action space had three possible actions in Ref. 2, it has been reduced to two for this work: increase and decrease, meaning that the yaw angle can only go up or down. The third action mentioned in Ref. 2 is not needed to obtain a yaw angle set point, as will be discussed in Sec. III D 1, and is therefore no longer included, leaving $\alpha \in \{0, 2\}$. The agent's state space is defined as the vector $x = [U, \varphi, \gamma]^T$, with *U* representing wind speed, $\varphi$ representing wind direction, and *γ* representing the yaw angle. Although $\varphi$ is included for completeness, only wind speed *U* is fully investigated in this work to simplify the model.

To determine what action an agent (turbine) should take, a reward must be defined. Reward signals are calculated by use of the reward function given in the following equation:
$$r_{i,t+1} = \operatorname{sign}\left(V_{i,t+1} - V_{i,t}\right), \tag{4}$$
with $V_{i,t}$ representing the sum of the power produced by turbines within a rectangular area extending downwind of a given turbine *i* at time *t*, as in Fig. 2. This neighborhood does not necessarily always encompass the furthest downwind turbine.
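A minimal sketch of this neighborhood reward, assuming a rectangular region with illustrative dimensions and a sign-clipped reward consistent with the ±1 rewards in Table I (the geometry parameters are hypothetical, not values from the paper):

```python
def neighborhood_power(powers, positions, i, downwind_len, crosswind_halfwidth):
    """Sum power of turbines inside a rectangle extending downwind of turbine i.

    powers    : list of turbine power outputs
    positions : list of (x, y) coordinates, x increasing downwind
    The rectangle dimensions are illustrative parameters."""
    xi, yi = positions[i]
    total = 0.0
    for j, (xj, yj) in enumerate(positions):
        if 0 <= xj - xi <= downwind_len and abs(yj - yi) <= crosswind_halfwidth:
            total += powers[j]
    return total

def clipped_reward(V_now, V_prev):
    """+1 if neighborhood power rose, -1 if it fell, 0 if unchanged."""
    return (V_now > V_prev) - (V_now < V_prev)
```

A turbine far downwind (outside the rectangle) simply does not contribute to $V_{i,t}$, mirroring the note that the neighborhood need not reach the furthest turbine.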

In Ref. 2, we outlined a stochastic action selection method, which randomly selects actions before evaluating the resulting reward. The stochastic action selection uses the current state *s* and determines the probability of selecting a given action *α* using the following equation:
$$P(\alpha\,|\,s) = \frac{e^{Q(s,\alpha)}}{\sum_{\alpha' \in A} e^{Q(s,\alpha')}}. \tag{5}$$
For this work, we introduce a second, deterministic action selection method using a gradient approximation for comparison to the stochastic method. The gradient algorithm uses the first-order gradient approximation given in the following equation:
$$\hat{\nabla} r_{i,t} = \frac{r_{i,t} - r_{i,t-1}}{\gamma_{i,t} - \gamma_{i,t-1}}. \tag{6}$$
Using this gradient approximation, the yaw angle is thus determined by the following equation:
$$\gamma_{i,t+1} = \gamma_{i,t} + \Delta\gamma \operatorname{sign}\left(\hat{\nabla} r_{i,t}\right), \tag{7}$$
meaning that the yaw angle keeps moving in the same direction as long as the reward (aggregate power) is increasing.

### B. Agent coordination via locking

Guestrin *et al.*^{20} noted that an optimization involving multiple agents can be made more efficient by exploiting natural hierarchies in the multi-agent system to force agents to optimize in a given order, a process known as coordination. If this hierarchy is not known ahead of time, coordination can still be achieved by locking all but one agent or group of agents at a time, giving the remaining agent an optimization window in which no other agents act. When this window expires, these agents are themselves locked, and a new group of agents is allowed to act. A wind farm is naturally hierarchical, with downstream turbines depending on upstream turbines' wake deflections. Therefore, coordination can be implemented by allotting each turbine (or row of turbines, relative to the wind direction) an optimization window. When a given window finishes, the turbines are locked and the next row of turbines either further upstream or further downstream begins optimizing. Figure 3 compares two methods of coordination: upstream first, in which turbines that are further upstream optimize before downstream turbines, and downstream first, which optimizes in the opposite order.
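The locking scheme can be sketched as a simple round-robin schedule over rows of turbines; the window length and row labels below are hypothetical placeholders:

```python
def coordination_schedule(rows, window, t, downstream_first=False):
    """Return which row of turbines is unlocked (free to optimize) at time t.

    rows   : list of row labels ordered upstream -> downstream
    window : length of each row's optimization window (time steps)
    All other rows are considered locked during that window."""
    order = list(reversed(rows)) if downstream_first else list(rows)
    return order[(t // window) % len(order)]
```

Flipping `downstream_first` switches between the two orderings compared in Fig. 3 without changing the locking mechanics.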

As shown in Fig. 3, the gradient action selection method [Eq. (7)] converges much faster than the stochastic action selection method [Eq. (5)]. However, the gradient approximation requires more coordination among turbines, which would be difficult to achieve in more rapidly changing wind conditions, especially with respect to wind direction. As will be discussed in Sec. III E, these two methods will be combined to achieve a framework that both optimizes quickly and is adaptable using the best features of each.

Because wind field wake effects do not propagate instantaneously (i.e., they take a nonzero period of time to reach downwind turbines), upstream turbines must delay the calculation of their own reward function until it is known that a wake has propagated downstream, thus allowing a given yaw angle control action to be associated with a corresponding change in the reward function. This was accomplished in Ref. 2 by preventing a turbine from calculating reward until the wake had completely propagated through its neighborhood. Dynamic wake effects are simulated using a modified version of the FLORIS model developed by NREL.^{3} This quasi-dynamic simulation model is described in Ref. 21; that work uses a different RL algorithm to achieve its control optimization but verifies its results using the same augmented FLORIS employed in this paper.

### C. Q-table interpretation

This paper uses a Q-table containing the state-action pairs $(s,\alpha)$ to determine yaw set points. To illustrate how a Q-table can be read, we use the representative Q-table shown in Fig. 4 with select states labeled. This Q-table belongs to the upwind turbine, turbine 0, and visualizes the Q-table values after RL has taken place.

The matrix depicted in Fig. 4 is a subset of the full agent Q-table for one specific wind speed, but for ease of notation, we will still refer to this subset as a Q-table. The full Q-table (in this case) is of dimension 4: one dimension for each of the states (wind speed, wind direction, and yaw angle) and an additional dimension for the actions. Because wind speed and wind direction are not controllable, Q-table subsections are tables that are located at specific indices (specified by the discrete indices of the wind speed and wind direction observations) of the full agent Q-table. For instance, in Fig. 4, the vertical axis (yaw angle) is the third state in a tuple $s_i = (8, 0, \gamma_i)$, with the first two entries of the tuple representing the wind speed and wind direction. So, a state such as *s*_{26} represents a full set of discrete state values, with the first two state values being taken *a priori* (in this case 8 m/s wind speed and 0° wind direction) and the third (controllable) state value being $\gamma_{26} = 26°$. The horizontal axis of the Q-table represents the action space, with two possible actions.

In this paper, the yaw angle state is chosen to be discrete rather than continuous as the tabular approach is not conducive to a continuous state space.
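The resulting table layout can be sketched with NumPy. The bin counts below are hypothetical; only the four-dimensional structure (wind speed × wind direction × yaw angle × action) follows the text.

```python
import numpy as np

# Hypothetical discretization: 5 wind-speed bins, 8 wind-direction bins,
# 41 yaw-angle bins, 2 actions (decrease / increase). Bin counts are
# illustrative; only the 4-D layout follows the paper.
N_U, N_PHI, N_GAMMA, N_ACTIONS = 5, 8, 41, 2
Q = np.zeros((N_U, N_PHI, N_GAMMA, N_ACTIONS))

# The 2-D "Q-table" drawn in Fig. 4 is the slice at fixed wind conditions:
u_idx, phi_idx = 2, 0          # e.g., the bins containing 8 m/s and 0 deg
q_slice = Q[u_idx, phi_idx]    # shape (N_GAMMA, N_ACTIONS): yaw x action
```

Indexing by the uncontrollable wind observations first leaves a small (yaw × action) table, which is exactly what the figure visualizes.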

Using these definitions of $s_i$ and $\alpha_j$, the value approximation at each box in Fig. 4 is given by $Q(s_i, \alpha_j)$, or the amount of value that an agent assigns to taking action $\alpha_j$ in state $s_i$. In Fig. 4, the numerical quantity stored in the Q-table at $Q(s_i, \alpha_j)$ is represented by color. A quantity toward the upper range of the color bar (yellow) is larger, while a quantity in the lower range of the color bar (purple) is smaller. Importantly, $Q(s_i, \alpha_j)$ does not represent the reward that is expected from taking an action in a given state. It is instead a quantification of the reward an agent would expect to receive if it started in state $s_i$, took action $\alpha_j$, and proceeded optimally from there for the rest of the simulation. The expected reward can be calculated over an infinite time horizon since the discount term $\beta_i$ in Eq. (1) lies in the interval $[0,1)$, thus creating a convergent geometric series. Reward signals [Eq. (4)] are not explicitly shown in this figure.

The progression of states is such that the gradient turns around (takes a different action) whenever a negative reward is encountered. Using the gradient algorithm and the information in Fig. 4, an agent would proceed as follows (Table I). Assume that the agent is currently in state $s_{26}$ and that it had previously taken the increase action $\alpha_1$ and received a reward of +1 according to Eq. (4). According to the gradient algorithm [Eq. (7)], which continues to move in the same direction when the reward is positive, the agent selects the same action $\alpha_1$ again, moving into state $s_{27}$. This action again produces a reward of +1, and the agent once again selects action $\alpha_1$ to move to state $s_{28}$. However, upon moving to state $s_{28}$, the agent receives a reward of −1. It is for this reason that the color representing $Q(s_{27}, \alpha_1)$ is so dark, corresponding to a low expected value: the agent will always receive a negative reward moving from $s_{27}$ to $s_{28}$. Upon receiving the negative reward, the agent reverses course, taking action $\alpha_0$ into state $s_{27}$ and receiving a reward of +1. The agent repeats this action, once again taking $\alpha_0$ to move into $s_{26}$, this time receiving a reward of −1. The agent once again reverses, repeating the cycle until the training period completes. Note that the agent never takes action $\alpha_1$ in $s_{28}$ or action $\alpha_0$ in $s_{26}$, which is why $Q(s_{28}, \alpha_1)$ and $Q(s_{26}, \alpha_0)$ are the same color as the other unexplored state-action pairs in the Q-table.

| Start state | Action | End state | Reward |
|---|---|---|---|
| $s_{26}$ | $\alpha_1$ | $s_{27}$ | $+1$ |
| $s_{27}$ | $\alpha_1$ | $s_{28}$ | $-1$ |
| $s_{28}$ | $\alpha_0$ | $s_{27}$ | $+1$ |
| $s_{27}$ | $\alpha_0$ | $s_{26}$ | $-1$ |

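The turn-around cycle traced in Table I can be reproduced with a short script in which rewards are scripted from a known optimum at $s_{27}$ (the optimum location is assumed for illustration; the turn-around rule follows the gradient algorithm of Sec. II A):

```python
def gradient_walk(gamma0, action0, gamma_opt, steps):
    """Replay the Table I cycle: keep the current direction after a +1 reward,
    reverse after a -1. Rewards are scripted from a known optimum yaw bin.

    action: +1 = increase, -1 = decrease (one bin per step)."""
    trace = []
    gamma, action = gamma0, action0
    for _ in range(steps):
        new_gamma = gamma + action
        # +1 if the move brought the agent closer to the optimum, else -1
        reward = 1 if abs(new_gamma - gamma_opt) < abs(gamma - gamma_opt) else -1
        trace.append((gamma, action, new_gamma, reward))
        gamma = new_gamma
        if reward < 0:
            action = -action          # turn around on a negative reward
    return trace
```

Four steps of this walk from $\gamma = 26$ reproduce the four rows of Table I exactly.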

### D. Dynamic LUT with most-visited optimization

As described in Sec. I, the current state-of-the-art wind farm control method is a static LUT that maps wind conditions such as wind speed and wind direction to a yaw angle schedule using simulations such as FLORIS or Simulator fOr Wind Farm Applications (SOWFA).^{22} This static LUT method seems at odds with the trial-and-error-based stochastic nature of Q-learning. However, it is possible to treat the Q-table as a LUT itself. For example, if an agent *i* keeps track of a table $n_i$, where $n_i \in \mathbb{R}^{M \times N \times \cdots}$ has one dimension for each of the system states (matching the lengths of the discrete state vectors) and records how many times a given system state has been visited, then a LUT can be approximated as follows.

Assuming that there are *X* different discrete state vectors, one for each of the expected system states, the state of turbine *i* can be characterized by a set $S_{i,t} = \{S_{i,t,0}, S_{i,t,1}, \ldots, S_{i,t,X-1}\}$ of state indices at each simulation time step. Using the definition of the table $n_i$ given above, $n_i(\{S_{i,t,0}, S_{i,t,1}, \ldots, S_{i,t,X-1}\}) = n_i(S_{i,t})$ represents the number of times that a state characterized by $S_{i,t}$ has been visited. If we further assume that the yaw angle is the last state defined in the discrete state list, then for a given set of wind conditions, a yaw angle set point, $\gamma_{i,t,s}$, for agent *i* can be given by
$$\gamma_{i,t,s} = \bar{\gamma}\left(\underset{j}{\arg\max}\; n_i(S_{i,t,0}, S_{i,t,1}, \ldots, S_{i,t,X-2}, j)\right), \tag{8}$$
with $\bar{\gamma}$ representing the discrete yaw angle vector and $\bar{\gamma}(i)$ representing the entry in $\bar{\gamma}$ at index *i*.

In other words, to retrieve a set point from a Q-table according to Eq. (8), we select the most visited yaw angle for a given set of wind conditions. This creates an important advantage over a purely static LUT: because $Q(s,\alpha)$ and *n* can fluidly change as the reinforcement learning process progresses, using *n* creates a yaw angle schedule that adapts to its environment, provided that the number of state visitations is tracked. This will henceforth be referred to as a Q-LUT, a deterministic LUT that is acquired from a Q-table. With this capability, a turbine can quickly ramp to a yaw angle set point in changing wind speeds. This idea is elaborated further in Secs. III B and III D.
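A sketch of this most-visited readout; the visit-count array shape is hypothetical, with the yaw angle as the last axis as the text assumes:

```python
import numpy as np

def most_visited_yaw(n_table, wind_idx, yaw_bins):
    """Read a yaw set point from the visit-count table: for the given
    wind-condition indices, pick the yaw bin visited most often.

    n_table  : visit counts with yaw as the last axis, e.g. (N_U, N_PHI, N_GAMMA)
    wind_idx : tuple of wind-condition indices, e.g. (u_idx, phi_idx)
    yaw_bins : 1-D array of discrete yaw angles (the vector gamma-bar)"""
    counts = n_table[wind_idx]            # 1-D slice over yaw bins
    return yaw_bins[int(np.argmax(counts))]
```

Because `n_table` keeps accumulating during learning, the returned set point drifts with the agent's experience rather than staying fixed like a static LUT entry.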

## III. ADAPTATIONS TO RL ALGORITHM FOR TIME-VARYING WIND

In the process of developing the RL and gradient methods summarized in Sec. II, we discovered a number of problems related to performance in time-varying wind. This section discusses four of those problems and the solutions that were developed, which are all combined into our full Gradient Approximation with Reinforcement Learning and Incremental Comparison (GARLIC) algorithm described in Sec. III E. Table II summarizes these problems and how they are addressed in this section. As will be shown in Sec. IV, when the RL agent is allowed to learn in time-varying wind using the modifications described in Sec. III, additional power gain can be discovered.

| Problem | Solution |
|---|---|
| Discrete state information separation | Use a gradient approximation to quickly train the agents in each discrete state (Sec. III A) |
| Time-varying state observation | Constantly record discrete wind field measurements and choose the most commonly occurring (Sec. III B) |
| Noisy power measurement | Measure power output during a given interval and take the average (Sec. III C) |
| Overfitting | Convolve the Q-table with a Gaussian kernel to smooth out noise (Sec. III D) |


### A. Gradient approximation for discrete state separation

The first problem addressed in Table II is discrete state information separation. The implementation of Q-Learning used in this paper is tabular, meaning that a Q-LUT is maintained instead of a continuous function approximation. Alternatively, many applications have begun to use neural networks to approximate the Q-learning value function.^{23,24} A tabular approximation like our Q-LUT is, in general, simpler than a neural network approximation and also more interpretable (meaning that a human observer can understand why the algorithm selects the actions that it does following the interpretation procedure described in Sec. II C). However, in the context of a wind farm application, a tabular approximation has the disadvantage that information learned in one state does not provide any information about other states, as is illustrated in Fig. 5. Although the algorithm trained for 8 m/s (left Q-LUT in Fig. 5) and could use this information to determine an optimal yaw angle set point when the mean wind speed is 8 m/s, it has no information about 9 m/s since the two wind speeds are located in completely different sections of the Q-table. Each added dimension of the state space (for instance, wind direction) compounds this problem. Additionally, as already demonstrated in Ref. 2, convergence time in a wake-delayed environment can be extremely slow. Even under stepwise conditions, the time it would take the wind farm to search its entire state space would be prohibitively long.

This length-of-learning problem is solved through the use of the gradient approximation [Eqs. (6) and (7)] described in Sec. II A. As stated in Sec. II A, the gradient control algorithm optimizes very efficiently, but only under tightly controlled conditions. While it is impossible to achieve this level of control in the field, it is possible in simulation. For every discrete bin in the state space that is expected to be encountered in the field, we first train the wind farm offline using the coordinated gradient algorithm described in Sec. II A. During this training phase, RL is not used to choose actions because the gradient algorithm is completely deterministic. However, during the course of the gradient optimization, each agent continues to measure rewards and update its Q-table. In this way, upon implementation of the Q-LUT in the field, the agents already have an idea of the best actions to take for a given set of wind conditions. This progression is shown in Fig. 6.

### B. Bin counting for time-varying state observations

The second problem noted in Table II is time-varying state observation. Time-varying wind inflow, even with a constant mean, can occasionally spike upward or downward enough that a turbine registers the change as a move to a different discrete wind speed bin. These spikes could trigger a new control action. If the turbine were configured to ramp to a new set point based on its Q-table whenever the wind speed changes, using the dynamic LUT method described in Sec. II D, it might begin yawing only to yaw back to its original set point when the spike ends. If a turbine constantly yawed to follow the time-varying wind, learning would be interrupted very frequently, and subsequent convergence would be very slow.

This second problem—noisy wind in need of filtering—is addressed using a bin counting technique. Throughout the simulation, an agent takes sample measurements of the wind speed once every second. For each sample, the agent keeps track of which discrete wind speed bin the measurement fell into. Once the total number of samples reaches a predetermined number, the agent determines the bin with the highest count and records its internal state observation in the Q-LUT associated with that bin. This is a form of low-pass filtering that is used frequently in the industry.^{25}
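A minimal sketch of this bin-counting filter; the bin edges and the specific binning rule are illustrative choices, not values from the paper:

```python
from collections import Counter

def most_common_bin(samples, bin_edges):
    """Low-pass-filter a noisy wind-speed signal by bin counting: assign each
    1-s sample to a discrete bin, then report the most frequently hit bin.

    bin_edges : ascending bin boundaries; a sample falls in the first bin
                whose upper edge exceeds it (a simple illustrative scheme)."""
    def to_bin(u):
        for k, edge in enumerate(bin_edges):
            if u < edge:
                return k
        return len(bin_edges)
    counts = Counter(to_bin(u) for u in samples)
    return counts.most_common(1)[0][0]
```

A brief gust that pushes a few samples into a neighboring bin is outvoted by the majority, so the agent's internal state observation stays put.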

### C. Power averaging for noisy measurements

The third problem addressed in Table II is noisy power measurement, which can confuse the RL algorithm. Figure 7 illustrates this problem by showing that a yaw angle change that should result in the farm power decreasing can be offset by a modest increase in wind speed, leading the agent to believe that the action it just performed was a good one, when in fact the opposite is true. While bin counting allows the agent to filter its state observation, binning the power reading would artificially constrain its resolution, much like the clipped reward signal described in Eq. (4). Thus, we solve the noisy power problem using delay and averaging.

To counteract this inability to discern between power changes due to yaw angle changes and those due to wind speed changes, agents are given one more waiting period, determined by the parameter `power_delay`. During this period, the agent reads and records its own power output. Once the period set by `power_delay` is complete, the agent averages the readings collected during the interval. This must be done for every turbine that either changes its yaw angle or has just received a change in wake conditions caused by the yawing of an upstream neighbor. More details about this process can be found in Appendix V.
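The averaging step itself is simple; here `read_power` stands in for whatever power-sampling interface a turbine exposes (a placeholder, not an interface from the paper):

```python
def averaged_power(read_power, power_delay):
    """After a yaw change (or an incoming wake change), wait power_delay
    samples, recording the turbine's power each step, then average the
    readings to beat down measurement noise.

    read_power : zero-argument callable returning one power sample."""
    readings = [read_power() for _ in range(power_delay)]
    return sum(readings) / power_delay
```

Lengthening `power_delay` trades slower learning for a less noisy reward signal, which is the tension Sec. III C is managing.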

### D. Incremental comparison and the Gaussian blur function

In Sec. II D, we discussed a most-visited method to allow agents to respond rapidly to changes in wind conditions. This approach has a substantial limitation in that it is very easy to overfit to the model.

Overfitting is a phenomenon that often arises in applications in which a rule must be learned from training data, particularly in machine learning (of which RL is a subset). The term refers to a learned rule failing to generalize and instead following the training data too closely. Because the training data are not always representative of every condition that the system will experience, an overfit system is ill-prepared to react to data outside of its narrow training experience. In the context of this paper, an agent overfits when the control law that it learns from the standard, steady-state FLORIS model (gradient approximation) during the training phase cannot adapt to the subsequent field environment and instead relies almost exclusively on its steady-state training data from the simulation phase of Fig. 6. Since the model may not match the real farm conditions, an agent that cannot adapt to model mismatch will not be able to increase power capture. If the most-visited algorithm from Sec. II D were used, then for an agent to learn a different set point from the training phase to the field phase, the turbine would have to spend more time at the new set point than it did at the old one, which might take too long. We therefore consider two solutions: incremental comparison and Gaussian blur.

#### 1. Incremental comparison

To mitigate overfitting, the Q-LUT acquisition process is modified. The Q-table, as described in Sec. II C, is a tabular mapping of state-action pairs to an approximated value. This means that, for a given state, $s_{i,t}$, the agent's opinion about which action is best to take is determined by the following equation:
$$\alpha^*_{s_{i,t}} = \underset{\alpha}{\arg\max}\; Q(s_{i,t}, \alpha), \tag{9}$$
with $\alpha^*_{s_{i,t}}$ representing the optimal action to take in state $s_{i,t}$. As the agent takes actions and receives rewards, $Q(s,\alpha)$ is gradually filled in with information about the wind farm. When a turbine needs to determine a set point, it can therefore iteratively step through the Q-table, moving up or down through the yaw angle state space according to which action has the highest value. This process will be referred to as incremental comparison. In a perfectly noiseless environment (such as the steady-state environment in which the gradient is trained), the ideal set point could be selected as the state $s_i$ with the lowest sum $Q(s_i, \alpha_0) + Q(s_i, \alpha_1)$. This is a good choice for the set point because, when this sum is most negative, the agent has determined that either increasing or decreasing from this state will result in negative expected reward, meaning that the agent has found an optimum. In Fig. 4, this state is $s_{27}$.
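In that noiseless setting, the lowest-sum readout can be sketched as:

```python
import numpy as np

def incremental_optimum(q_slice):
    """In a noiseless Q-table slice (yaw x 2 actions), pick the yaw bin where
    Q(s, decrease) + Q(s, increase) is most negative: both directions look
    bad there, so the agent has bracketed an optimum."""
    sums = q_slice.sum(axis=1)
    return int(np.argmin(sums))
```

With field noise, this raw argmin can land on a spurious bin, which is what motivates the Gaussian blur of Sec. III D 2.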

#### 2. Gaussian blur

When the agents are moved into the field to begin learning, there is one additional consideration regarding the Q-LUT reading procedure. As will be discussed in Sec. IV D, not all reward signals returned from the environment will be correct. This added noise poses a problem for a tabular approach such as our proposed Q-table method, as the simple incremental comparison could stop prematurely at an incorrect bin, ceasing learning since it no longer moves. Theoretically, the Q-Learning update equation given in Eq. (1) should converge if given a suitable period of time.^{18} However, because this convergence time might be infeasible due to the wake delay and power filtering accommodations, we must be able to work with Q-table values that have not completely converged. We therefore use a Gaussian blur to approximate the properties of a continuous function using the Q-table.

A Gaussian blur is a common image processing technique that is used to reduce the effect of sharp edges by convolving a Gaussian kernel with the pixels that make up an image, effectively making it a visual low-pass filter.^{26} In our application, we use the Gaussian filter to smooth out the noisy value approximation, treating each state-action pair as a pixel with which to convolve the kernel. Because the Q-table for a given wind field state is two-dimensional, the filter could be implemented in two dimensions as well. However, we only convolve down the Q-table so as to maintain the distinction between the two actions for the incremental comparison algorithm. Here, we use the Gaussian filter package provided by scikit-learn.^{27} An example of an unfiltered vs filtered Q-table is shown in Fig. 8, where the blurred case would be expected to result in smoother progress of the agent toward the optimal set point.
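A sketch of this column-wise blur, using SciPy's `gaussian_filter1d` as a stand-in implementation (the exact filter routine used in the paper may differ):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def blur_q_slice(q_slice, sigma=1.0):
    """Smooth a (yaw x action) Q-table slice down the yaw axis only, keeping
    the two action columns separate so the incremental comparison can still
    distinguish increase from decrease. sigma is an illustrative choice."""
    return gaussian_filter1d(q_slice, sigma=sigma, axis=0)
```

Convolving only along `axis=0` is the one-dimensional variant described in the text: each action column is smoothed independently, and no value leaks between the two actions.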

### E. Combined framework for RL in time-varying wind

The procedures described up to this point in Secs. II and III can now be combined into one overarching framework: the Gradient Approximation with Reinforcement Learning and Incremental Comparison (GARLIC) algorithm. The GARLIC framework, summarized in Fig. 6, encapsulates the wind farm power maximization process in a series of simple steps.

First, wind turbines are trained offline using a simple steady-state model such as FLORIS, which supplies the stepwise-constant wind conditions needed for the turbines to coordinate.

Next, having thus assembled a good-enough approximation of each turbine's Q-table, the agents are moved to the field, where they use the averaging and locking mechanisms discussed in Sec. III C to determine accurate reward signals from the environment and to modify the Q-table approximation returned by the gradient.

Finally, the incremental comparison procedure with a Gaussian blur from Sec. III D is used to choose wind turbine set points once the learning period has concluded.

The important advantage that GARLIC offers is a basic level of reliability. As will be shown, even when the agent does not learn in the field at all, the gradient approximation is able to match the performance of the standard FLORIS optimization, which has already been shown to be an effective control algorithm.^{4}

## IV. RESULTS

### A. Simulating time-varying wind in FLORIS

In the context of this work, time-varying wind speed effects will be characterized by turbulence intensity, which is given by^{28}

$$I = \frac{\sigma_i}{\bar{U}_\infty}, \qquad (10)$$

with $I$ representing turbulence intensity, $\sigma_i$ representing the standard deviation of the wind speed, and $\bar{U}_\infty$ representing the mean freestream wind speed. Turbulence intensity is typically expressed as a percentage, so a turbulence intensity of 5% corresponds to $I = 0.05$. Turbulence intensity is normally defined over a 10-min interval, but because the mean remains relatively steady for the ensuing simulations, we will define $I$ over the range in which the mean wind speed is constant, typically several thousand seconds.

To generate the time-varying wind conditions that will be used to enhance the realism of the model, two parameters will be specified: $I$ and $\bar{U}_\infty$. Using Eq. (10), $\sigma_i$ will be determined to achieve the desired turbulence intensity with the desired mean. Then, for each simulated time step, a wind speed is chosen from a normal distribution with the proper mean and standard deviation. An example wind speed signal with a 5% turbulence intensity is shown in Fig. 9.

To test GARLIC, each of our simulations is broken into intervals, each with a distinct mean wind speed and turbulence intensity. The modified dynamic FLORIS then takes $\bar{U}_\infty$ and $I$ as inputs for each simulation interval. During these intervals, the actual wind speed is chosen by sampling a normal distribution of the form $N(\bar{U}_\infty, \sigma_i^2)$, with $\sigma_i$ again being calculated from Eq. (10).
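The sampling procedure just described can be sketched as follows; the function name and seeding convention are illustrative, not part of the paper's code.

```python
import numpy as np

def wind_speed_series(u_mean, ti, n_steps, seed=None):
    """Sample a wind speed signal of n_steps values from N(u_mean, sigma_i^2),
    with sigma_i chosen via Eq. (10) so that sigma_i / u_mean equals the
    requested turbulence intensity ti (e.g., ti=0.05 for 5%)."""
    sigma_i = ti * u_mean
    rng = np.random.default_rng(seed)
    return rng.normal(u_mean, sigma_i, size=n_steps)
```

For an 8 m/s mean with 5% turbulence intensity (the conditions of Fig. 9), the realized sample mean and realized turbulence intensity of a long series should closely match the requested values.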

#### 1. Metric and Test Bench

Before presenting the results of GARLIC relative to the static FLORIS LUT presented as the baseline controller in Fig. 1 and Ref. 4, we first describe the test environment and the metric that is used to compare the two algorithms. GARLIC will operate in two modes:

1. *Simple LUT mode*, in which GARLIC observes the state and yaws to the set point it determines by running the incremental comparison procedure with its Q-table. There is no learning involved in this mode, and as such the behavior is the same as that of the static FLORIS LUT, although the two controllers might select different yaw set points. This mode effectively skips the second block of Fig. 6.

2. *Active learning mode*, in which agents stochastically select actions and evaluate rewards in the hopes of discovering additional power gains over the static algorithm. In this mode, GARLIC attempts to augment the Q-table learned by the gradient approximation to choose more optimal set points. This mode includes the entire progression of tasks in Fig. 6.

To compare the algorithms, we first allow the GARLIC agents to learn for a set period of time. Then, the Q-tables learned during this phase (mode 2) are used to compare performance head-to-head with the static LUT (mode 1). During this comparison, turbines yaw directly to their chosen set points and do not engage in any learning. To evaluate the algorithms, we use total farm energy capture as our metric. This test bench therefore allows us to compare what the performance of the two algorithms would be in the long term, once the transients of the learning phase settle down.
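The energy capture metric amounts to integrating the farm power signal over the comparison interval. A minimal sketch, assuming power is sampled in watts at a fixed time step (the function name and unit conventions are illustrative):

```python
def farm_energy_mwh(power_w, dt_s=1.0):
    """Total energy capture in MW h from farm power samples (watts)
    taken every dt_s seconds: sum(P * dt) joules, converted to MW h."""
    joules = sum(power_w) * dt_s
    return joules / 3.6e9  # 1 MW h = 3.6e9 J
```

For example, a constant 1 MW farm output held for 3600 one-second samples integrates to exactly 1 MW h.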

### B. GARLIC performance with no yaw measurement error

The GARLIC algorithm is first examined in the case of no yaw angle error. This test did not involve any field training; instead, it simply used the steady-state FLORIS model and the gradient approximation algorithm to determine a Q-table (the first block of Fig. 6). Once the agents had all been trained using the gradient, they were placed into the test bench environment described in Sec. IV A 1 for 10 000 s with a wind inflow with a constant 8 m/s mean wind speed and 5% turbulence intensity, the same conditions that generated Fig. 9. Because there was no need for field training, and therefore no need to perform Q-learning in noisy wind conditions, the bin counting and power averaging windows are unnecessary for this simulation case. However, the Gaussian blur was used, as it will also be in Sec. IV C. The resultant energy capture is shown in Table III.

| Simulation | Energy (MW h) |
|---|---|
| Steady state—no fault | 10.00 |
| Dynamic (1°)—no fault | 9.97 |
| Steady state—yaw error | 9.84 |
| Dynamic (1°)—yaw error | 10.01 |
| Dynamic (2°)—yaw error | 10.01 |

Figure 10 shows the yaw angle set points determined by GARLIC for each method. The optimal yaw angle set point for the two upstream turbines is approximately 25° (obtained from a static FLORIS optimization), implying that the GARLIC algorithm overshot the correct set point for the furthest upwind turbine. This behavior can be modified by adapting the *σ* parameter of the Gaussian blur as discussed in Sec. IV D, but the same value as will be used in Sec. IV C is used here for consistency. Before being implemented in the field, this parameter would need to be tuned, which could yield higher energy capture.

### C. Addressing wind limitations: GARLIC performance in case of yaw measurement error

To examine its capabilities in the presence of faults, we test the GARLIC algorithm with a yaw sensor error. In this case, the wind inflow has an 8 m/s mean wind speed with 5% turbulence intensity; however, a 5° yaw offset error was also introduced to the simulation. This means that the state observation that an agent records is actually 5° higher than the true yaw angle. Two sets of agents were trained and then allowed to learn: one used a yaw step size of 1°, while the other used a yaw step size of 2°. Each set of agents was given a period of 100 000 s to learn in the yaw-errored environment (mode 2). This is again an unrealistic amount of time for wind conditions to maintain a constant mean, but it should be noted that the training conditions were time-varying, demonstrating a much greater degree of realism than in Ref. 2. Then, the newly trained agents were placed into the test bench (mode 1), and the energy capture during a 10 000-s simulation at the same wind conditions (8 m/s mean, 5% TI) was compared with the steady-state LUT. The window size for both bin counting and power averaging was 100 s. The results are shown in Table III, with the dynamic GARLIC capturing more energy than the steady-state FLORIS in the presence of the fault.

Figure 11 shows the yaw angle set points for the three different methods whose results are shown in Table III. As shown in the plots, the 5° offset error causes the steady-state LUT to select yaw angles of approximately −5°, 20°, and 20°, while the true optimal values are 0°, 25°, and 25°. The dynamic algorithm with either a 1° or 2° yaw step size was able to successfully adjust the yaw angles upward toward the optimal error-free values, which is why it captures more energy.

### D. Parameter tuning: Filter window length, yaw step size, and standard deviation

Because of the addition of the filtering windows (Secs. III B and III C), GARLIC will necessarily take longer to converge than the stepwise-constant algorithm described in Sec. II A. However, because the wind inflow is now time-varying, a longer filtering period for the power averaging results in less noise in the reward signal, meaning the agent receives more accurate information. This trade-off creates a spectrum of design choices. At one end of the spectrum, the agent has a very short filtering window but updates its Q-table very quickly, flooding the Q-table with data that may or may not be accurate. At the other end, the agent has a very long filtering window, allowing a more accurate calculation of its reward signal but converging far more slowly. In this paper, we design the algorithm to use a shorter filtering window, but recommend that future research explore this parameter more fully.

An advantage of a shorter filtering window is the ability to quickly step up and down through the action space while gradually assembling a model of the environment, in contrast to other data-driven approaches such as those of Refs. 29 and 30 that rely on searching out optimal set points, yawing to them, and then evaluating at that point. As a result, no single measurement is intended to provide a full model of the value function at that point, and so it is acceptable if some measurements are noisy. Additionally, the Q-learning algorithm already accounts for stochasticity in its update equation, as one of the typical characteristics of a Markov decision process is that the reward function is stochastic.^{18} Imposing the restriction that all measurements be completely noiseless and deterministic would therefore be unnecessary. An empirical demonstration of the window length/accuracy trade-off is shown in Fig. 12. Filter windows of length 1–100 resulted in approximately 50%–60% accuracy, whereas those of length 500 or 1000 achieved 100% accuracy in calculating the reward. It is difficult to determine what accuracy is good enough for the purposes of power maximization; however, as shown in Sec. IV C, a window of length 100 is able to produce power gains despite the relatively low accuracy. It should be noted that this experiment provides a good demonstration of the stochasticity of the reward signal, as it shows the percentage of reward signals that are returned correctly.
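The window length/accuracy trade-off can be reproduced with a simple Monte Carlo sketch. The fractional power gain, noise level, and trial count below are illustrative stand-ins, not the values behind Fig. 12: the reward sign is deemed correct when the window-averaged "after" power exceeds the window-averaged "before" power for a true positive gain.

```python
import numpy as np

def reward_accuracy(window, true_gain=0.01, noise_std=0.05,
                    trials=2000, seed=0):
    """Estimate the probability that a window-averaged power comparison
    recovers the correct reward sign, for a small true power gain buried
    in Gaussian measurement noise (all parameters illustrative)."""
    rng = np.random.default_rng(seed)
    before = rng.normal(1.0, noise_std, (trials, window)).mean(axis=1)
    after = rng.normal(1.0 + true_gain, noise_std, (trials, window)).mean(axis=1)
    return float(np.mean(after > before))
```

Under these assumed parameters, short windows land in the roughly 55%–70% accuracy range while a window of 1000 samples is nearly always correct, mirroring the qualitative shape of Fig. 12.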

Another performance consideration is the yaw angle step size, the $\Delta \gamma$ parameter in Eq. (3). A larger change in yaw angle results in a larger change in power at the downwind turbines, which mitigates the impact of noise. However, a larger yaw step also decreases the resolution of the controller, which increases the probability that the true optimal yaw set point lies somewhere between two yaw steps. Also, since turbines yaw at fairly slow rates of less than 1° per second, large step sizes are unrealistic. Figure 13 compares the probability of receiving the correct reward for two different yaw step sizes, drawing from the simple empirical experiment used to generate Fig. 12 and a similar experiment with a yaw step size of 2°.

Finally, the standard deviation parameter, *σ*, of the Gaussian blur described in Sec. III D 2 can also be tuned. Although the mathematical details of the Gaussian blur are not derived in this paper, a larger standard deviation results in a blurrier Q-table. While this effect is desirable for smoothing out noise, too much blur is harmful in that legitimate gradations in the Q-table are erased. This effect is illustrated in Fig. 14, and the consequence of choosing a standard deviation that is too large is also discussed in Sec. IV B.

### E. GARLIC ramping performance for noncontinuous learning

Even with the inclusion of turbulence, the wind profiles considered in Sec. IV C are still unrealistic in that they presume wind conditions that maintain a constant mean for extended periods of time. It would be more accurate to model changes in both the instantaneous wind speed (as is the case of the time-varying wind conditions) and changes in the mean wind speed. However, a changing mean wind speed poses an additional problem. Since the optimal yaw set point is a function of wind conditions, a change in mean wind speed causes the resulting yaw position to be suboptimal, especially since yaw rate is limited on modern turbines. Thus, the stochastic action selection method that turbines use is inefficient when the objective is to get to a yaw angle set point as quickly as possible, and the addition of the filtering and locking windows makes this process even slower. By the time a turbine had completed its tracking of a wind speed change, the wind would have most likely changed again.

To mitigate this inefficiency, the ramping procedure described in Sec. II D can be used, meaning that a turbine would detect a wind change, ramp to the new set point determined by incremental comparison, and then continue learning from there. However, this mitigation technique adds a degree of overhead, as the turbines are not able to engage in any learning while yawing to a new set point, and so more frequent wind speed variations result in a smaller percentage of time spent learning. Taken together, this means that in varying wind speeds, turbines learn very slowly and are often interrupted in their learning process. Because of this, the incremental comparison algorithm can at times select incorrect set points if the Q-table convergence is incomplete at the time of selection, which is often the case when training is interrupted frequently by changing mean wind speeds. Table IV describes the trade-offs for two simulation parameters as well as each parameter's impact on the algorithm.

| Parameter | Comments |
|---|---|
| Length of power-averaging window | Longer power-averaging windows result in more accurate reward signals. As a result, longer power-averaging windows mean the Q-table is more accurate and the agent is less likely to choose an incorrect set point based on a corrupted Q-table value. However, longer windows also increase the time required. |
| Gaussian kernel standard deviation | A larger standard deviation results in a smoother Q-table. However, this can result in relative differences between actions $\alpha_0$ and $\alpha_1$ being de-emphasized, which in turn could lead the agent to select an incorrect set point. |

## V. CONCLUSIONS AND FUTURE WORK

In this paper, we described updates to the delayed RL algorithm of Ref. 2 that enabled it to successfully increase energy capture in a time-varying wind environment compared to a static LUT. We also described how the refined GARLIC algorithm overcame a set of challenges (Table II) caused by this time-varying wind condition. Although the wind speed variations are not fully turbulent and the wind direction in this paper is constant, the paper serves as a proof of concept that the RL algorithm shows promise for this application.

There are still multiple avenues of future work building upon the concepts developed in this paper. Most importantly, for the algorithm to be useful to the wind energy community, more realistic wind conditions must be examined. Unfortunately, the simulation capabilities to test GARLIC in time-varying wind directions are not currently available in FLORIS, and more computationally expensive software is difficult to use for controller tuning across a large number of parameters. The nearest-term next steps will therefore be to test the algorithm in more turbulent wind conditions and with a larger number of turbines, more closely approximating a real wind farm. When the simulation capabilities become available, the algorithm will be further augmented for time-varying wind direction, which will add the additional complexity of time-varying neighborhoods.

In the area of more realistic wind conditions, changing mean wind speeds will first be investigated further to more efficiently avoid the reward being assigned to the wrong state. In this paper, the approach is conservative in that turbines are locked until it is certain that they can move without interrupting a different turbine's optimization. These conditions could most likely be relaxed, and the degree of relaxation would constitute an additional tuning parameter. Additionally, other machine learning techniques like neural networks or clustering could be used to predict wind conditions and forecast wind farm power production, allowing the control algorithm to behave proactively.^{31–36} As suggested in Sec. IV D, there are many parameters to tune, such as the yaw step size, $\Delta \gamma$, the filter window, the learning rate, $l_t$, the discount factor in Eq. (1), and the standard deviation of the Gaussian blur. These parameters can have a substantial impact on the results and would most likely need to be tuned by sweeping through a sequence of parameter values and observing the wind farm response. In addition to examining more than just two yaw step sizes, tracking of a dynamic yaw reference could also be incorporated.

Due to the inherent complexity of a wind farm, it is clear that additional research is needed to move beyond this proof-of-concept. However, the GARLIC algorithm proposed in this paper shows promise for RL to be effective at increasing wind farm power output even in time-varying wind conditions and in the presence of a sensor offset error, making it a promising approach for wind farm control.

## ACKNOWLEDGMENTS

This material is based in part upon work supported by Envision Energy and the National Renewable Energy Laboratory.

This material is based in part upon work supported by Envision Energy, Award No. A16–0094-001. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of Envision Energy. This work was authored in part by the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No. DE-AC36–08GO28308. Funding provided by the U.S. Department of Energy Office of Energy Efficiency and Renewable Energy Wind Energy Technologies Office. The views expressed in the article do not necessarily represent the views of the DOE or the U.S. Government. The U.S. Government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this work, or allow others to do so, for U.S. Government purposes.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

### APPENDIX: LOCKING/FILTERING ALGORITHM

Section III C introduced the concept of power averaging in order to achieve an accurate reward signal. While these additions complicate the algorithm, they can be understood as an extension of the locking algorithm introduced in Ref. 2. The new process is detailed in Algorithm 1. It should be noted that, in contrast to Ref. 2, Algorithm 1 introduces a new flag variable, `completed_propagation`, to signify the completion of a wake propagation interval; the flag `completed_action`, which was used to signify the completion of the wake propagation interval in Ref. 2, is now redefined to signify the completion of the power averaging interval. Also, the definition of the term locked has been expanded to encompass both of the windows mentioned in Sec. III C: `wake_delay` and `power_delay`. Using this new locking definition, a turbine is locked if any of the three parameters are nonzero. Finally, we allow a flag variable to have three states in this algorithm: unset, True, and False. If `completed_action` or `completed_propagation` is unset, the agent is not waiting for a wake delay to propagate or a power filtering window to complete. If either is False, the agent is indeed waiting for either a wake to propagate or the power averaging window to run to completion (or both), and should be locked.

The wind speed bin counting routine mentioned in Sec. III B is not displayed in Algorithm 1 as this process takes place constantly in the background of the simulation, as opposed to being precipitated by a specific event.

```
 1: if i and j, ∀ j ∈ N(i), are unlocked then
 2:     select action
 3:     calculate wake delay for turbines in neighborhood and set wake_delay for i
 4:     set wake_delay ∀ j ∈ N(i)
 5:     set completed_action and completed_propagation flags to False for i
 6:     set power_delay for i
 7: end if
 8:
 9: if i's wake_delay reaches 0 and i's completed_propagation is False then
10:     set completed_propagation to True for i
11: end if
12: if i's wake_delay reaches 0 and i's completed_propagation is unset then
13:     set completed_action to False for i
14:     set power_delay for i
15: end if
16: if i's power_delay reaches 0 and i's completed_action is False then
17:     set completed_action to True for i
18: end if
19:
20: if turbine i's completed_propagation is set and i and j, ∀ j ∈ N(i), are unlocked then
21:     calculate reward function and reward signal
22:     update Q-table
23:     unset completed_action flag
24: end if
```

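The per-turbine flag and delay bookkeeping of Algorithm 1 can be sketched as a small state machine. This is a simplified illustration only: it covers the countdowns and flag transitions of lines 9-18 and the per-turbine part of lines 1-7, omits the neighborhood locking of lines 1, 4, and 20, and uses Python's `None` for the "unset" flag state; the class name is hypothetical.

```python
class TurbineLock:
    """Single-turbine sketch of the locking/filtering flags in Algorithm 1."""

    def __init__(self):
        self.wake_delay = 0           # steps until the wake has propagated
        self.power_delay = 0          # steps left in the power-averaging window
        self.completed_action = None       # None = unset, else True/False
        self.completed_propagation = None  # None = unset, else True/False

    @property
    def locked(self):
        # Locked while any delay counter is nonzero or a flag is pending (False).
        return (self.wake_delay > 0 or self.power_delay > 0
                or self.completed_action is False
                or self.completed_propagation is False)

    def take_action(self, wake_delay, power_delay):
        # Lines 1-7 (per-turbine part): start both windows, mark flags pending.
        self.wake_delay = wake_delay
        self.power_delay = power_delay
        self.completed_action = False
        self.completed_propagation = False

    def tick(self):
        # One simulation step: count down delays and resolve flags (lines 9-18).
        if self.wake_delay > 0:
            self.wake_delay -= 1
            if self.wake_delay == 0 and self.completed_propagation is False:
                self.completed_propagation = True
        if self.power_delay > 0:
            self.power_delay -= 1
            if self.power_delay == 0 and self.completed_action is False:
                self.completed_action = True
```

With a wake delay shorter than the power-averaging window, the turbine stays locked after the wake has propagated until the averaging window also runs to completion, matching the expanded locking definition in the text.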

## References

*ICML '02: Proceedings of the Nineteenth International Conference on Machine Learning*(