We give an overview of a complex systems approach to large blackouts of electric power transmission systems caused by cascading failure. Instead of looking at the details of particular blackouts, we study the statistics and dynamics of series of blackouts with approximate global models. Blackout data from several countries suggest that the frequency of large blackouts is governed by a power law. The power law makes the risk of large blackouts consequential and is consistent with the power system being a complex system designed and operated near a critical point. Power system overall loading or stress relative to operating limits is a key factor affecting the risk of cascading failure. Power system blackout models and abstract models of cascading failure show critical points with power law behavior as load is increased. To explain why the power system is operated near these critical points and inspired by concepts from self-organized criticality, we suggest that power system operating margins evolve slowly to near a critical point and confirm this idea using a power system model. The slow evolution of the power system is driven by a steady increase in electric loading, economic pressures to maximize the use of the grid, and the engineering responses to blackouts that upgrade the system. Mitigation of blackout risk should account for dynamical effects in complex self-organized critical systems. For example, some methods of suppressing small blackouts could ultimately increase the risk of large blackouts.

Cascading failure is the usual mechanism by which failures propagate to cause large blackouts of electric power transmission systems. For example, long, intricate cascades of events caused the August 1996 blackout in Northwestern America that disconnected 7.5 million customers and $30\,\mathrm{GW}$ of electric power.^{1–3} The August 2003 blackout in Northeastern America disconnected 50 million people and $62\,\mathrm{GW}$ of electric power in an area spanning eight states and two provinces.^{4} The vital importance of the electrical infrastructure to society motivates the understanding and analysis of blackouts. Although large blackouts are rare, observed blackout statistics suggest that their risk is not negligible because as blackout size increases, the probability of a blackout decreases in a power law manner that is roughly comparable to the manner of increase of blackout cost. We attribute the power law decrease in blackout probability as size increases to cascading: as failures occur, the power system is successively weakened so that the chance of further failures is increased. Indeed, probabilistic models that capture the essence of this cascading show power law behavior at a critical point. Moreover, similar behavior can be observed in several power systems models of cascading failure when the power system is loaded near a critical point. But why should power systems be designed and operated near a critical point? We see the power system as slowly evolving in response to increasing load, economics, engineering, and recent blackouts so as to move to a complex system equilibrium near a critical point. A power system well below the critical point experiences fewer blackouts and it is economic for its loading to increase, whereas a power system well above the critical point experiences blackouts that drive system upgrades to effectively reduce the loading.
We incorporate these slow dynamics of power system upgrade in a simple model to verify that these processes can drive the system to a critical point with a power law distribution of blackout size. The complex dynamics of power system evolution can have a significant effect on the long-term effect of system upgrades. Indeed, we show that an upgrade that initially reduces blackout frequency could eventually lead to an increased frequency of large blackouts.

## I. INTRODUCTION

There is evidence of a power law decrease in blackout probability as blackout size increases from both data and simulations. These discoveries, together with the societal importance of managing blackout risk, motivate the study of cascading failure mechanisms and critical points in blackout models. One explanation of the observed power laws is that when the larger complex system dynamics of a power system evolving in response to economic and engineering forces are considered, the system self-organizes to near a critical point. We give an explanatory overview of this theory, drawing from a range of previous work.^{5–13}

After Sec. II summarizes blackout mechanisms, Sec. III discusses the evidence for a power law in blackout probability and the consequences for blackout risk. Section IV summarizes abstract and power system models of cascading failure and reviews related approaches in the literature. Section V discusses the observed critical points in these models. Section VI discusses quantifying blackout risk. Section VII expands the discussion to describe and model a power system slowly evolving with respect to engineering and economic forces and discusses some initial consequences for blackout mitigation.

## II. SUMMARY OF BLACKOUT CASCADING FAILURE MECHANISMS

We consider blackouts of the bulk electrical power transmission system; that is, the high voltage (greater than, say, $30\,\mathrm{kV}$) portion of the electrical grid.^{14} Power transmission systems are heterogeneous networks of large numbers of components that interact in diverse ways. When component operating limits are exceeded, protection acts to disconnect the component and the component “fails” in the sense of not being available to transmit power. Components can also fail in the sense of misoperation or damage due to aging, fire, weather, poor maintenance, or incorrect design or operating settings. In any case, the failure causes a transient and causes the power flow in the component to be redistributed to other components according to circuit laws, and subsequently redistributed according to automatic and manual control actions. The effects of the component failure can be local or can involve components far away, so that the loading of many other components throughout the network is increased. In particular, the propagation of failures is not limited to adjacent network components. For example, a transmission line that trips transfers its steady state power flow to transmission lines that form a cutset with the tripped line. Moreover, the flows all over the network change. Hidden failures of protection systems can occur when an adjacent transmission line is tripped, but oscillations and other instabilities can occur across the extent of the power system. The interactions involved are diverse and include deviations in power flows, frequency, and voltage magnitude and phase as well as operation or misoperation of protection devices, controls, operator procedures, and monitoring and alarm systems. However, all the interactions between component failures tend to be stronger when components are highly loaded.
For example, if a highly loaded transmission line fails, it produces a large transient, there is more power that redistributes to other components, and failures in nearby protection devices are more likely. Moreover, if the overall system is more highly loaded, components have smaller margins so they can tolerate smaller increases in load before failure, the system nonlinearities and dynamical couplings increase, and the system operators have fewer options and more stress.

A typical large blackout has an initial disturbance or trigger event followed by a sequence of cascading events. Each event further weakens and stresses the system and makes subsequent events more likely. Examples of an initial disturbance are short circuits of transmission lines through untrimmed trees, protection device misoperation, and bad weather. The blackout events and interactions are often rare, unusual, or unanticipated because the likely and anticipated failures are already accounted for in power system design and operation. The complexity is such that it can take months after a large blackout to sift through the records, establish the events occurring, and reproduce with computer simulations and hindsight a causal sequence of events.

The historically high reliability of power transmission systems in developed countries is largely due to estimating the transmission system capability and designing and operating the system with margins with respect to a chosen subset of likely and serious contingencies. The analysis is usually either deterministic analysis of estimated worst cases or Monte Carlo simulation of moderately detailed probabilistic models that capture steady state interactions.^{15} Combinations of likely contingencies and some dependencies between events such as common mode or common cause are sometimes considered. The analyses address the first few likely and anticipated failures rather than the propagation of many rare or unanticipated failures in a cascade.

## III. BLACKOUT DATA AND RISK

We consider the statistics of a series of blackouts from several countries. Figure 1 plots the empirical probability distribution of energy unserved in North American blackouts from 1984 to 1998 as documented by the North American Electric Reliability Council (NERC).^{16} The fall-off with blackout size is close to a power law dependence.^{6,17–20} Moreover, similar results are obtained by separating the data into blackouts in the eastern and western interconnections of North America.^{19} Power law dependences of blackout probability on blackout size are also observed in Sweden,^{21} Norway,^{22} New Zealand,^{23} and China.^{24} The approximate power law exponents of the probability distribution function (noncumulative) are shown in Table I. The similarity of the power law form of the probability distribution function (pdf) in different power transmission systems suggests that there may be some universality. The power law region is of course limited in extent in a practical power system by a finite cutoff corresponding to the largest possible blackout.

TABLE I. Approximate power law exponents of the noncumulative probability distribution function of blackout size.

Source | Exponent | Quantity |
---|---|---|
North America data (Ref. 6) | $-1.3$ to $-2.0$ | Various |
North America data (Refs. 19 and 20) | $-2.0$ | Power |
Sweden data (Ref. 21) | $-1.6$ | Energy |
Norway data (Ref. 22) | $-1.7$ | Power |
New Zealand data (Ref. 23) | $-1.6$ | Energy |
China data (Ref. 24) | $-1.8$ | Energy |
China data (Ref. 24) | $-1.9$ | Power |
OPA model on tree-like 382-node (Ref. 8) | $-1.6$ | Power |
Hidden failure model on WSCC 179-node (Ref. 9) | $-1.6$ | Power |
Manchester model on 1000-node (Ref. 10) | $-1.5$ | Energy |
CASCADE model (Ref. 11) | $-1.4$ | No. of failures |
Branching process model (Ref. 12) | $-1.5$ | No. of failures |

There are several useful measures of blackout size. Energy unserved and power or customers disconnected are measures that impact society. An example of a measure of disturbance size internal to the power system is number of transmission lines tripped. (Transmission lines can often trip with no load shed and hence no blackout.) Chen *et al.*^{25} fit the empirical probability distribution of 20 years of North American multiple line failures with a cluster distribution model. Other heavy tailed distributions such as generalized Poisson and a negative binomial model also give reasonable fits to the data.

Blackout risk is the product of blackout probability and blackout cost. Here we assume that blackout cost is roughly proportional to blackout size, although larger blackouts may well have costs (especially indirect costs) that increase faster than linearly.^{15} However, in the case of a power law exponent of blackout probability comparable to $-1$, the larger blackouts become rarer at a similar rate as costs increase, and then the risk of large blackouts is comparable to, or even exceeds, the risk of small blackouts.^{13} Thus, power laws in blackout size distributions significantly affect the risk of large blackouts and make the study of large blackouts of practical relevance. (Standard risk analyses that assume independence between events imply exponential dependence of blackout probability on blackout size and hence negligible risk of large blackouts.)
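A small numerical sketch can make this comparison concrete. It assumes, purely for illustration, a pdf proportional to $x^{\alpha}$ for blackout size $x$ and a cost proportional to $x$, so that the risk density scales as $x^{1+\alpha}$. The sizes 10 and 1000, the exponential scale of 100, and the function names are hypothetical choices made here, not values taken from the blackout data.

```python
import math

def power_law_risk(size, exponent, cost_power=1.0):
    """Unnormalized risk density: pdf ~ size**exponent times cost ~ size**cost_power."""
    return size ** exponent * size ** cost_power

def exponential_risk(size, scale, cost_power=1.0):
    """Unnormalized risk density when the pdf has an exponential tail exp(-size/scale)."""
    return math.exp(-size / scale) * size ** cost_power

small, large = 10.0, 1000.0  # hypothetical blackout sizes in arbitrary units

for exponent in (-1.0, -1.5, -2.0):
    ratio = power_law_risk(large, exponent) / power_law_risk(small, exponent)
    print(f"power law exponent {exponent}: risk(large)/risk(small) = {ratio:.3g}")

ratio = exponential_risk(large, 100.0) / exponential_risk(small, 100.0)
print(f"exponential tail, scale 100: risk(large)/risk(small) = {ratio:.3g}")
```

With an exponent of $-1$ the ratio is one, so large and small blackouts contribute comparably to risk, whereas with the exponential tail the ratio is far below one.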

Consideration of the probability distribution of blackout sizes leads naturally to a more detailed framing of the problem of avoiding blackouts. Instead of seeking only to limit blackouts in general, we seek to manipulate the probability distribution of blackouts to jointly limit the frequency of small, medium, and large blackouts. This elaboration is important because measures taken to limit the frequency of small blackouts may inadvertently increase the frequency of large blackouts when the complex dynamics governing transmission expansion are considered, as discussed in Sec. VII.

Important aspects of the complex dynamics of blackouts that we do not focus on here are the long-range time correlations in the blackout sizes and the distribution of time between blackouts.^{6,19–21,23}

The available blackout data are limited and the statistics have a limited resolution. To further understand the mechanisms governing the complex dynamics of power system blackouts, modeling of the power system is indicated.

## IV. MODELS OF CASCADING FAILURE

This section summarizes abstract and power system models of cascading failure that are used to understand the propagation of failures in a blackout assuming a fixed system. Since blackout cascades are over in less than one day and the evolution of power system operation, upgrade, maintenance, and design is much slower, it is reasonable to assume a fixed power system during the progression of any particular cascade. We also review some other approaches and models.

### A. CASCADE model

The CASCADE model is an analytically tractable probabilistic model of cascading failure that captures the weakening of the system as the cascade proceeds.^{11} The features that the CASCADE model abstracts from the formidable complexities of large blackouts are the large but finite number of components, components that fail when their load exceeds a threshold, an initial disturbance loading the system, and the additional loading of components by the failure of other components. The initial overall system stress is represented by upper and lower bounds on a range of initial component loadings. The model neglects the timing of events and the diversity of power system components and interactions.

The CASCADE model^{11} has $n$ identical components with random initial loads. For each component the minimum initial load is $L^{\min}$ and the maximum initial load is $L^{\max}$. For $j=1,2,\ldots,n$, component $j$ has initial load $L_j$, which is a random variable uniformly distributed in $[L^{\min},L^{\max}]$. The loads $L_1,L_2,\ldots,L_n$ are independent. Components fail when their load exceeds $L^{\mathrm{fail}}$. When a component fails, a fixed amount of load $P$ is transferred to each of the components. To start the cascade, we assume an initial disturbance that loads each component by an additional amount $D$. Other components may then fail depending on their initial loads $L_j$ and the failure of any of these components will distribute an additional load $P\ge 0$ that can cause further failures in a cascade.
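The cascade described above can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: failures are processed in stages, each newly failed component adds $P$ to every component that has not yet failed, and the function name and example parameter values (those quoted for the critical case in Sec. V B) are choices made here.

```python
import random

def simulate_cascade(n, l_min, l_max, l_fail, d, p, rng):
    """Return the total number of failed components in one simulated cascade.

    Assumed staging (illustrative): all components over l_fail fail together,
    then every surviving component receives p per newly failed component.
    """
    # Initial loads are uniform on [l_min, l_max]; the disturbance d is added to each.
    loads = [rng.uniform(l_min, l_max) + d for _ in range(n)]
    failed = [False] * n
    total_failed = 0
    while True:
        newly_failed = [j for j in range(n) if not failed[j] and loads[j] > l_fail]
        if not newly_failed:
            return total_failed
        for j in newly_failed:
            failed[j] = True
        total_failed += len(newly_failed)
        increment = len(newly_failed) * p
        for j in range(n):
            if not failed[j]:
                loads[j] += increment

rng = random.Random(1)  # fixed seed so the run is repeatable
sizes = [simulate_cascade(1000, 0.6, 1.0, 1.0, 0.0004, 0.0004, rng) for _ in range(20)]
print(sizes)
```

Repeating the run many times and tabulating the returned sizes gives an empirical estimate of the distribution of the number of failures.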

Now we define the normalized CASCADE model that has the same failure statistics. The normalized initial load $\ell_j$ is

$$\ell_j=\frac{L_j-L^{\min}}{L^{\max}-L^{\min}}. \tag{1}$$

Then, $\ell_j$ is a random variable uniformly distributed on [0, 1]. Let

$$p=\frac{P}{L^{\max}-L^{\min}},\qquad d=\frac{D+L^{\max}-L^{\mathrm{fail}}}{L^{\max}-L^{\min}}. \tag{2}$$

The normalized load increment $p$ is, then, the amount of load increase on any component when one other component fails expressed as a fraction of the load range $L^{\max}-L^{\min}$. The normalized initial disturbance $d$ is a shifted initial disturbance expressed as a fraction of the load range. In the case in which $L^{\mathrm{fail}}=L^{\max}$, then the shift $L^{\max}-L^{\mathrm{fail}}$ in the numerator of (2) is zero and $d$ is simply the initial disturbance expressed as a fraction of the load range. The shift $L^{\max}-L^{\mathrm{fail}}$ trades off the initial disturbance and the failure load so that the normalized failure load is $\ell_j=1$.

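The distribution of the total number of component failures $S$ is

$$P[S=r]=\begin{cases}\dbinom{n}{r}\,\varphi(d)\,(d+rp)^{r-1}\,[\varphi(1-d-rp)]^{n-r}, & r=0,1,\ldots,n-1,\\[1ex] 1-\displaystyle\sum_{s=0}^{n-1}P[S=s], & r=n,\end{cases} \tag{3}$$

where $p\ge 0$, the saturation function is

$$\varphi(x)=\begin{cases}0, & x<0,\\ x, & 0\le x\le 1,\\ 1, & x>1,\end{cases} \tag{4}$$

and $0^0\equiv 1$ and $0/0\equiv 1$ are assumed. If $d\ge 0$ and $d+np\le 1$, then there is no saturation $[\varphi(x)=x]$ and (3) reduces to the quasibinomial distribution.^{26,27}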
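The distribution can be evaluated numerically as sketched below. The sketch assumes that (3) has the saturating quasibinomial form $P[S=r]=\binom{n}{r}\varphi(d)(d+rp)^{r-1}[\varphi(1-d-rp)]^{n-r}$ for $r<n$, with the remaining probability assigned to $r=n$, and it covers only $d\ge 0$. The function names are illustrative, and logarithms are used only to avoid overflow in the binomial coefficient for large $n$.

```python
import math

def phi(x):
    """Saturation function: 0 below 0, x on [0, 1], 1 above 1."""
    return min(1.0, max(0.0, x))

def cascade_pmf(n, d, p):
    """Return [P[S=0], ..., P[S=n]] for the normalized model, assuming d >= 0 and p >= 0."""
    if d < 0 or p < 0:
        raise ValueError("this sketch only covers d >= 0 and p >= 0")
    probs = []
    for r in range(n):
        # Probability that one component's load stays at or below the failure
        # level after the disturbance d and r load transfers of size p.
        survive = phi(1.0 - d - r * p)
        if r == 0:
            # phi(d) * d**(-1) is 1 for 0 <= d <= 1 (with 0/0 = 1) and 1/d for d > 1.
            lead = 1.0 if d <= 1.0 else 1.0 / d
            probs.append(lead * survive ** n)
        elif d == 0.0 or survive == 0.0:
            probs.append(0.0)
        else:
            log_term = (math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)
                        + math.log(phi(d)) + (r - 1) * math.log(d + r * p)
                        + (n - r) * math.log(survive))
            probs.append(math.exp(log_term))
    probs.append(max(0.0, 1.0 - sum(probs)))  # r = n takes the remaining probability
    return probs

pmf = cascade_pmf(1000, 0.001, 0.001)  # a case with n*p = 1
for r in (1, 10, 100):
    print(r, pmf[r])
```

For small cases the result can be checked by hand; for example, with $n=2$, $d=0.1$, and $p=0.2$ there is no saturation and $P[S=0]=(1-d)^2=0.81$.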

A branching process approximation to the CASCADE model gives a way to quantify the propagation of cascading failures with a parameter $\lambda $ and further simplifies the mathematical modeling.^{12} In a Galton-Watson branching process,^{28,29} the failures are regarded as produced in stages. The failures in each stage independently produce further failures in the next stage according to a probability distribution with mean $\lambda $. The behavior is governed by the parameter $\lambda $. In the subcritical case of $\lambda <1$, the failures will die out (i.e., reach and remain at zero failures at some stage) and the mean number of failures in each stage decreases geometrically. In the supercritical case of $\lambda >1$, although it is possible for the process to die out, often the failures increase without bound. Of course, there are a large but finite number of components that can fail in a blackout and in the CASCADE model, so it is also necessary to account for the branching process saturating with all components failed.
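A saturating branching process of this kind can be sketched in a few lines. The sketch makes two assumptions that are not fixed by the description above: each failure produces a Poisson-distributed number of further failures with mean $\lambda$, and the cascade starts from a given number of initial failures. The function names and the choices of 1000 components and 500 runs are illustrative.

```python
import math
import random

def poisson_sample(mean, rng):
    """Draw a Poisson variate by Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def total_failures(initial, lam, n, rng):
    """Total failures of a Galton-Watson process with Poisson(lam) offspring, saturating at n."""
    current = min(initial, n)
    total = current
    while current > 0 and total < n:
        offspring = sum(poisson_sample(lam, rng) for _ in range(current))
        current = min(offspring, n - total)  # cannot fail more components than remain
        total += current
    return total

rng = random.Random(0)  # fixed seed so the run is repeatable
for lam in (0.5, 1.0, 1.5):
    sizes = [total_failures(1, lam, 1000, rng) for _ in range(500)]
    saturated = sum(size == 1000 for size in sizes) / len(sizes)
    print(f"lambda = {lam}: mean total = {sum(sizes) / len(sizes):.1f}, "
          f"fraction saturated = {saturated:.3f}")
```

In the subcritical case the stage means form a geometric series, so starting from one failure the mean total number of failures is $1/(1-\lambda)$ when saturation is negligible; the simulated mean for $\lambda=0.5$ can be compared with this value of 2.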

The stages of the CASCADE model can be approximated by the stages of a saturating branching process by letting the number of components $n$ become large, while $p$ and $d$ become small in such a way that $\lambda =np$ and $\theta =nd$ remain constant. The number $S$ of components failed in the saturating branching process is, then, a saturating form of the generalized Poisson distribution.^{12} Further approximation of the generalized Poisson distribution yields^{30}

For a very general class of branching processes, at the critical point the probability distribution of the total number of failures has a power law form with exponent $-1.5$. The universality of the $-1.5$ power law at criticality in the probability distribution of the total number of failures in a branching process suggests that this is a signature for this type of cascading failure.
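For the generalized Poisson case this exponent can be made explicit. The following sketch is an illustration under stated assumptions, not necessarily the derivation used in Ref. 30: it starts from the unsaturated generalized Poisson form with parameters $\theta$ and $\lambda$ and applies Stirling's formula $r!\approx\sqrt{2\pi r}\,r^{r}e^{-r}$ together with $(1+\theta/(r\lambda))^{r-1}\approx e^{\theta/\lambda}$ for large $r$:

$$P[S=r]=\frac{\theta\,(r\lambda+\theta)^{r-1}e^{-r\lambda-\theta}}{r!}\approx\frac{\theta}{\lambda\sqrt{2\pi}}\,e^{(1-\lambda)\theta/\lambda}\,r^{-1.5}\,e^{-r/r_0},\qquad r_0=\frac{1}{\lambda-1-\ln\lambda}.$$

At $\lambda=1$, $r_0$ is infinite and only the $r^{-1.5}$ power law factor remains, whereas for $\lambda\neq 1$ the exponential factor cuts off the power law at sizes of order $r_0$ (below saturation).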

The branching process approximation does capture some salient features of load-dependent cascading failure and suggests an approach to reducing the risk of large cascading failures by monitoring and limiting the average propagation of failures $\lambda $.^{30–33} However, work remains to confirm the correspondence between these simplified global models and the complexities of cascading failure in real systems. While our main motivation is large blackouts, the CASCADE and branching process models are sufficiently simple and general that they could be applied to cascading failure of other large, interconnected infrastructures.^{34}

### B. Power system blackout models

We summarize some power system models for cascading failure blackouts. All these models include representation of power flows on the grid using circuit laws.

The Oak Ridge-PSERC-Alaska (OPA) model for a fixed network represents transmission lines, loads and generators with the usual dc load flow approximation (linearized real power flows with no losses and uniform voltage magnitudes). Starting from a solved base case, blackouts are initiated by random line outages. Whenever a line is outaged, the generation and load are redispatched using standard linear programming methods (since there is more generation power than the load requires, one must choose how to select and optimize the generation that is used to exactly balance the load). The cost function is weighted to ensure that load shedding is avoided where possible. If any lines were overloaded during the optimization, then these lines are outaged with probability $p_1$. The process of redispatch and testing for outages is iterated until there are no more outages. The total load shed is, then, the power lost in the blackout. The OPA model neglects many of the cascading processes in blackouts and the timing of events. However, the OPA model does represent in a simplified way a dynamical process of cascading overloads and outages that is consistent with some basic network and operational constraints. OPA can also represent complex dynamics as the network evolves; this is discussed in Sec. VII.

Chen *et al.*^{9} model power system blackouts using the dc load flow approximation and standard linear programming optimization of the generation dispatch and represent in detail hidden failures of the protection system. The expected blackout size is obtained using importance sampling. The distribution of power system blackout size is obtained by rare event sampling and blackout risk assessment and mitigation methods are studied. There is some indication of a critical point at which there is a power law in the distribution of blackout size in the Western Systems Coordinating Council (WSCC) 179-node system. Carnegie Mellon University has developed a cascading overload dc load flow model on a 3357-node network that shows sharp phase transitions in cascading failure probability as load is increased.^{35}

Anghel *et al.*^{36} go beyond a dc load flow and linear programming generation redispatch representation of cascading overloads to represent the time evolution of the random disturbances and restoration processes and also analyze the effect of operator actions with different risk optimizations of load shedding versus cascading.

The University of Manchester has developed an ac power blackout model that represents a range of cascading failure interactions, including cascade and sympathetic tripping of transmission lines, heuristic representation of generator instability, under-frequency load shedding, post-contingency redispatch of active and reactive resources, and emergency load shedding to prevent a complete system blackout caused by a voltage collapse.^{10,37,38} The Manchester model is used by Rios *et al.*^{37} to evaluate expected blackout cost using Monte Carlo simulation and by Kirschen *et al.*^{38} to apply correlated sampling to develop a calibrated reference scale of system stress that relates system loading to blackout size.

Ni *et al.*^{39} evaluate expected contingency severities based on real-time predictions of the power system state to quantify the risk of operational conditions. The computations account for current and voltage limits, cascading line overloads, and voltage instability. Zima and Andersson^{40} study the transition into subsequent failures after an initial failure and suggest mitigating this transition with a wide-area measurement system.

Hardiman *et al.*^{41} simulate and analyze cascading failure using the TRELSS software. In its “simulation approach” mode, TRELSS represents cascading outages of lines, transformers, and generators due to overloads and voltage violations in large ac networks (up to 13 000 nodes). Protection control groups and islanding are modeled in detail. The cascading outages are ranked in severity and the results have been applied in industry to evaluate transmission expansion plans. Other modes of operation are available in TRELSS that can rank the worst contingencies and take into account remedial actions and compute reliability indices.

### C. Review of other approaches

We briefly review some other approaches to cascading failure and complex systems in power system blackouts.

Roy *et al.*^{42} construct randomly generated tree networks that abstractly represent influences between idealized components. Components can be failed or operational according to a Markov model that represents both internal component failure and repair processes and influences between components that cause failure propagation. The effects of the network degree and the intercomponent influences on the failure size and duration are studied. Pepyne *et al.*^{43} also use a Markov model for discrete state power system nodal components, but propagate failures along the transmission lines of a power systems network with a fixed probability. They study the effect of the propagation probability and maintenance policies that reduce the probability of hidden failures.

The challenging problem of determining cascading failure due to dynamic transients in hybrid nonlinear differential equation models is addressed by DeMarco^{44} using Lyapunov methods applied to a smoothed model and by Parrilo *et al.*^{45} using Karhunen-Loeve and Galerkin model reduction. Lindley and Singpurwalla^{46} describe some foundations for causal and cascading failure in infrastructures and model cascading failure as an increase in a component failure rate within a time interval after another component fails.

Stubna and Fowler^{47} give an alternative view based on highly optimized tolerance of the origin of the power law in the NERC data. Highly optimized tolerance was introduced by Carlson and Doyle to describe power law behavior in a number of engineered or otherwise optimized applications.^{48} To apply highly optimized tolerance to the power system, Stubna and Fowler assume that blackouts propagate one dimensionally and that this propagation is limited by finite resources that are engineered to be optimally distributed to act as barriers to the propagation. The one-dimensional assumption implies that the blackout size in a local region is inversely proportional to the local resources. Minimizing a blackout cost proportional to blackout size subject to a fixed sum of resources leads to a probability distribution of blackout sizes with an asymptotic power tail and two free parameters. The asymptotic power tail exponent is exactly $-1$, and this value follows from the one-dimensional assumption. The free parameters can be varied to fit the NERC data for both megawatts lost and customers disconnected. However, a better fit to both these data sets can be achieved by modifying highly optimized tolerance to allow some misallocation of resources.

There is an extensive literature on cascading in graphs^{49,50} that is partially motivated by idealized models of propagation of failures in infrastructure networks such as the internet. The dynamics of cascading is related to statistical topological properties of the graphs. Work on cascading phase transitions and network vulnerability that accounts for forms of network loading includes work by Watts,^{51} Motter *et al.*,^{52} and Crucitti *et al.*^{53} Lesieutre^{54} applies topological graph concepts in a way that is more consistent with power system generation and load patterns.

## V. CRITICAL POINTS

As load increases, it is clear that cascading failure becomes more likely, but exactly how does it become more likely? Our results show that the cascading failure does not gradually and uniformly become more likely; instead there is a critical point or phase transition at which the cascading failure becomes more likely. In complex systems and statistical physics, a critical point is associated with power laws in probability distributions and changes in gradient (for a type-2 phase transition) or a discontinuity (for a type-1 phase transition) in some measured quantity as the system passes through the critical point.

The critical point defines a reference point of system stress or loading for increasing risk of cascading failure. Designing and operating the power system appropriately with respect to this critical point would manage the distribution of blackout risk among small, medium, and large blackouts. However, while the power law region at the critical point indicates a substantial risk of large blackouts, it is premature at this stage of risk analysis to presume that operation near a critical point is bad because it entails some substantial risks. There is also economic gain from an increased loading of the power transmission system. Indeed, one of the objectives in pursuing the risk analysis of cascading blackouts is to determine and quantify the tradeoffs involved so that sensible decisions about optimal design and operation and blackout mitigation can be made.

Implementing the management of blackout risk would require limiting the system throughput and this is costly. Managing the tradeoff between the certain cost of limiting throughput and the rare but very costly widespread catastrophic cascading failure may be difficult. Indeed, we maintain in Sec. VII that for large blackouts, economic, engineering, and societal forces may self-organize the system to near a critical point and that efforts to mitigate the risk should take account of these broader dynamics.^{13}

### A. Qualitative effect of load increase on distribution of blackout size

Consider cascading failure in a power transmission system in the impractically extreme cases of very low and very high loading. At very low loading, any failures that occur have minimal impact on other components and these other components have large operating margins. Multiple failures are possible, but they are approximately independent so that the probability of multiple failures is approximately the product of the probabilities of each of the failures. Since the blackout size is roughly proportional to the number of failures, the probability distribution of blackout size will have an exponential tail. The distribution of blackout size is different if the power system is operated recklessly at a very high loading in which every component is close to its loading limit. Any initial disturbance then causes a cascade of failures leading to total or near total blackout. It is clear that the probability distribution of blackout size must somehow change continuously from the exponential form to the certain total blackout form as loading increases from a very low to a very high loading. We are interested in the nature of the transition between these two extremes. Our results presented below suggest that the transition occurs via a critical point at which there is a power law region in the probability distribution of blackout size. Note that since we always assume a finite size power grid, the power law region cannot extend further than the total blackout size.

### B. Critical points as load increases in CASCADE

This subsection describes one way to represent a load increase in the CASCADE model and how this leads to a parameterization of the normalized model. The effect of the load increase on the distribution of the number of components failed is then described.^{11}

We assume for convenience that the system has $n=1000$ components. Suppose that the system is operated so that the initial component loadings vary from $L^{\min}$ to $L^{\max}=L^{\mathrm{fail}}=1$. The average initial component loading $L=(L^{\min}+1)/2$ may then be increased by increasing $L^{\min}$. The initial disturbance $D=0.0004$ is assumed to be the same as the load transfer amount $P=0.0004$. These modeling choices for component load lead via the normalization (2) to the parametrization

$$p=d=\frac{0.0004}{2-2L},\qquad 0.5\le L<1.$$

The increase in the normalized power transfer $p$ with increased $L$ may be thought of as strengthening the component interactions that cause cascading failure.
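The dependence of the normalized parameters on the average loading can be tabulated with a short sketch. It assumes that the normalization divides by the load range, $p=P/(L^{\max}-L^{\min})$ and $d=(D+L^{\max}-L^{\mathrm{fail}})/(L^{\max}-L^{\min})$, as described in Sec. IV A, and uses the values of $n$, $D$, and $P$ stated above; the function and argument names are illustrative.

```python
def normalized_parameters(avg_load, n=1000, transfer=0.0004, disturbance=0.0004,
                          l_max=1.0, l_fail=1.0):
    """Return (p, d, n*p) for average initial loading avg_load = (l_min + l_max) / 2."""
    if not avg_load < l_max:
        raise ValueError("average loading must be below l_max")
    l_min = 2.0 * avg_load - l_max       # invert avg_load = (l_min + l_max) / 2
    load_range = l_max - l_min
    p = transfer / load_range            # normalized load increment
    d = (disturbance + l_max - l_fail) / load_range  # normalized initial disturbance
    return p, d, n * p

for avg_load in (0.6, 0.8, 0.9):
    p, d, np_product = normalized_parameters(avg_load)
    print(f"L = {avg_load}: p = d = {p:.6f}, np = {np_product:.3f}")
```

Under these assumptions $np$ is 0.5, 1, and 2 at $L=0.6$, $0.8$, and $0.9$, so $np=1$ is reached at $L=0.8$.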

The distribution for the subcritical and nonsaturating case $L=0.6$ has an approximately exponential tail as shown in Fig. 2. The tail becomes heavier as $L$ increases and the distribution for the critical case $L=0.8$, $np=1$ has an approximate power law region over a range of $S$. The power law region has an exponent of approximately $-1.4$. The distribution for the supercritical and saturated case $L=0.9$ has an approximately exponential tail for small $r$, zero probability of intermediate $r$, and a probability of 0.80 of all 1000 components failing. If an intermediate number of components fail in a saturated case, then the cascade always proceeds to all 1000 components failing.
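The three cases can be explored with a small Monte Carlo sketch. It assumes the normalized CASCADE rules as we understand them: initial loads uniform on $[0,1]$, the disturbance $d$ added to every component, a component failing when its load exceeds 1, and each failure adding $p$ to every component. The function names, the number of trials, and the seed are illustrative choices and not part of the published model.

```python
import bisect
import random


def cascade_params(avg_load, transfer=0.0004, disturbance=0.0004):
    """Normalized (p, d) when L_max = L_fail = 1 and the mean initial load is avg_load."""
    span = 2.0 - 2.0 * avg_load  # L_max - L_min, because L_min = 2L - 1
    return transfer / span, disturbance / span


def cascade_failures(n, p, d, rng):
    """Total number of failed components in one simulated cascade."""
    loads = sorted(rng.random() for _ in range(n))  # normalized initial loads
    failed = 0
    while True:
        # A component has failed once load + d + failed * p exceeds 1.
        threshold = 1.0 - d - failed * p
        now_failed = n - bisect.bisect_right(loads, threshold)
        if now_failed == failed:  # no new failures: the cascade has stopped
            return failed
        failed = now_failed


def summarize(avg_load, n=1000, trials=300, seed=0):
    """Mean number of failures and fraction of cascades in which all n components fail."""
    rng = random.Random(seed)
    p, d = cascade_params(avg_load)
    sizes = [cascade_failures(n, p, d, rng) for _ in range(trials)]
    return sum(sizes) / trials, sum(s == n for s in sizes) / trials


if __name__ == "__main__":
    for avg_load in (0.6, 0.8, 0.9):
        mean_size, frac_total = summarize(avg_load)
        print(f"L={avg_load}: mean failures {mean_size:.1f}, "
              f"fraction with all components failed {frac_total:.2f}")
```

With only a few hundred trials the estimates are noisy. Resolving the tail of the distribution at the critical loading would need many more samples.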

The increase in the mean number of failures as the average initial component loading $L$ is increased is shown in Fig. 3. The sharp change in gradient at the critical loading $L=0.8$ corresponds to the saturation of (3) and the consequent increasing probability of all components failing. Indeed, at $L=0.8$, the change in gradient in Fig. 3 together with the power law region in Fig. 2 suggest a type-2 phase transition in the system. In this regime of the CASCADE model, the mean number of failures detects the critical point in the same way as percolation measures such as the frequency of a sufficiently large number of components failing.^{10}

The model results show how system loading can influence the risk of cascading failure. At low loading there is an approximately exponential tail in the distribution of number of components failed and a low risk of large cascading failure. There is a critical loading at which there is a power law region in the distribution of number of components failed and a sharp increase in the gradient of the mean number of components failed. As loading is increased past the critical loading, the distribution of number of components failed saturates, there is an increasingly significant probability of all components failing, and there is a significant risk of large cascading failure.

### C. Critical transitions as load increases in power system models

Criticality can be observed in the fast dynamics OPA model as load power demand is slowly increased, as shown in Fig. 4. (Random fluctuations in the pattern of load are superimposed on the load increase in order to provide statistical data.) At a critical loading, the gradient of the expected blackout size sharply increases. Moreover, the pdf of blackout size shows a power law region at the critical loading, as shown in Fig. 5. OPA can also display complicated critical point behavior corresponding to both generation and transmission line limits.^{7}

As noted in Sec. IV, the cascading hidden failure model of Chen *et al.*^{9} on a 179-node system and the Carnegie Mellon University cascading overload model^{35} on a 3357-node system also show some indications of a critical point as load is increased. The results of Chen *et al.* show a gradual increase in expected blackout size near the critical point, whereas the Carnegie Mellon model shows sharp increases in the probability of larger blackouts at the critical point.

The most realistic power system model critical point results obtained so far are with the Manchester blackout simulation on a 1000-bus realistic model of a European power system.^{10} Figure 6 shows the mean blackout size as system loading is increased and Fig. 7 shows a power law region in blackout size distribution at the critical point.

## VI. QUANTIFYING BLACKOUT RISK

At a critical point there is a power law region, a sharp increase in mean blackout size, and an increased risk of cascading failure. Thus, the critical point gives a reference or a power system operational limit with respect to cascading failure. That is, we are suggesting adding an “increased risk of cascading failure” limit to the established power system operating limits such as thermal, voltage, and transient stability. How does one practically monitor or measure margin to the critical point?

One approach is to increase loading in a blackout simulation incorporating cascading failure mechanisms until a critical point is detected by a sharp increase in mean blackout size. The mean blackout size is calculated at each loading level by running the simulation repeatedly with some random variation in the system initial conditions so that a variety of cascading outages are simulated. This approach is straightforward and likely to be useful, but it is not fast and it seems that it would be difficult or impossible to apply to real system data. It could also be challenging to describe and model a good sample of the diverse interactions involved in cascading failure in a fast enough simulation. This approach, together with checks on the power law behavior of the distribution of blackout size, was used to find criticality in several power system and abstract models of cascading failure.^{7,9,11,12} Confirming critical points in this way in a range of power system models incorporating more detailed or different cascading failure mechanisms would help to establish further the key role that critical points play in cascading failure.
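The loading sweep described above can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_simulator` is a hypothetical stand-in for a real blackout simulation, and locating the critical point at the largest jump in slope of the estimated mean is one simple detection rule among many, not the procedure used in the cited studies.

```python
import random


def mean_blackout_size(simulate, loading, runs, rng):
    """Monte Carlo estimate of the mean blackout size at one loading level."""
    return sum(simulate(loading, rng) for _ in range(runs)) / runs


def estimate_critical_loading(simulate, loadings, runs, rng):
    """Return the loading at which the slope of the mean blackout size rises most."""
    means = [mean_blackout_size(simulate, x, runs, rng) for x in loadings]
    best_index, best_jump = None, float("-inf")
    for i in range(1, len(loadings) - 1):
        slope_before = (means[i] - means[i - 1]) / (loadings[i] - loadings[i - 1])
        slope_after = (means[i + 1] - means[i]) / (loadings[i + 1] - loadings[i])
        if slope_after - slope_before > best_jump:
            best_index, best_jump = i, slope_after - slope_before
    return loadings[best_index]


def toy_simulator(loading, rng, kink=0.8):
    """Hypothetical simulator whose mean blackout size has a kink at `kink`."""
    size = rng.expovariate(1.0)  # small blackouts occur at any loading
    if loading > kink:
        size += 200.0 * (loading - kink) * rng.random()  # extra size past the kink
    return size


if __name__ == "__main__":
    grid = [x / 100 for x in range(50, 101, 5)]  # loadings 0.50, 0.55, ..., 1.00
    print(estimate_critical_loading(toy_simulator, grid, runs=2000, rng=random.Random(1)))
```

The number of runs per loading level controls the noise in the estimated slopes, which is why the approach is not fast when each run is a full cascading failure simulation.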

Another approach that is currently being developed^{30–33} is to monitor or measure from real or simulated data how much failures propagate after they are initiated. Branching process models such as the Galton-Watson process described in Sec. IV have a parameter $\lambda$ that measures both the mean failure propagation and proximity to criticality. In branching process models, the mean number of failures is multiplied by $\lambda$ at each stage of the branching process. Although there is statistical variation about the mean behavior, it is known^{29} that for subcritical systems with $\lambda<1$, the failures will die out and that for supercritical systems with $\lambda>1$, the number of failures can exponentially increase. (The exponential increase will in practice be limited by the system size and any blackout inhibition mechanisms; current research seeks to understand the blackout inhibition mechanisms.) The idea is to statistically estimate $\lambda$ from simulated or real failure data. Essentially this approach seeks to approximate and fit the data with a branching process model. The ability to estimate $\lambda$ and any other parameters of the branching process model would allow the efficient computation of the corresponding probability distribution of blackout size and hence estimates of the blackout risk. Our emphasis on limiting the propagation of system failures after they are initiated is complementary to more standard methods of mitigating the risk of cascading failure by reducing the risk of the first few likely failures caused by an initial disturbance as for example in using the $n-1$ criterion or in Ni *et al.*^{39}
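To make the estimation idea concrete, the sketch below simulates Galton-Watson cascades and recovers $\lambda$ from the recorded failures per stage. The Poisson offspring distribution, the single initial failure, and the simple ratio estimator (child failures divided by parent failures) are assumptions made for illustration; they are not necessarily the estimators developed in the cited work.

```python
import math
import random


def poisson(mean, rng):
    """Sample a Poisson variate by Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_cascade(lam, rng, initial=1, max_stages=1000):
    """Failures per stage of a Galton-Watson cascade with Poisson(lam) offspring."""
    stages = [initial]
    while stages[-1] > 0 and len(stages) < max_stages:
        # The sum of k independent Poisson(lam) variates is Poisson(k * lam).
        stages.append(poisson(lam * stages[-1], rng))
    return stages


def estimate_lambda(cascades):
    """Ratio of child failures (stages 1, 2, ...) to parent failures (all but the last stage)."""
    children = sum(sum(stages[1:]) for stages in cascades)
    parents = sum(sum(stages[:-1]) for stages in cascades)
    return children / parents


if __name__ == "__main__":
    rng = random.Random(2)
    cascades = [simulate_cascade(0.5, rng) for _ in range(5000)]
    print("estimated lambda:", estimate_lambda(cascades))
    print("mean total failures:", sum(sum(c) for c in cascades) / len(cascades))
```

For a subcritical process started from one failure, the mean total number of failures is $1/(1-\lambda)$, so an estimate of $\lambda$ close to 1 signals both strong propagation and proximity to criticality.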

## VII. SELF-ORGANIZATION AND SLOW DYNAMICS OF NETWORK EVOLUTION

### A. Qualitative description of self-organization

We qualitatively describe how the forces shaping the evolution of the power network could give rise to self-organizing dynamics.^{6} The power system contains many components such as generators, transmission lines, transformers and substations. Each component experiences a certain loading each day and when all the components are considered together they experience some pattern or vector of loadings. The pattern of component loadings is determined by the power system operating policy and is driven by the aggregated customer loads at substations. The power system operating policy includes short-term actions such as generator dispatch as well as longer-term actions such as improvements in procedures and planned outages for maintenance. The operating policy seeks to satisfy the customer loads at least cost. The aggregated customer load has daily and seasonal cycles and a slow secular increase of about 2% per year.

The probability of component failure generally increases with component loading. Each failure is a limiting or zeroing of load in a component and causes a redistribution of power flow in the network and hence a discrete increase in the loading of other system components. Thus, failures can cascade. If a cascade of events includes limiting or zeroing the load at substations, it is a blackout. A stressed power system experiencing an event must either redistribute power flows satisfactorily or shed some load at substations in a blackout.

Utility engineers make prodigious efforts to avoid blackouts and especially to avoid repeated blackouts with similar causes. These engineering responses to a blackout occur on a range of time scales longer than one day. Responses include repair of damaged equipment, more frequent maintenance, changes in operating policy away from the specific conditions causing the blackout, installing new equipment to increase system capacity, and adjusting or adding system alarms or controls. The responses reduce the probability of events in components related to the blackout, either by lowering their probabilities directly or by reducing component loading by increasing component capacity or by transferring some of the loading to other components. The responses are directed towards the components involved in causing the blackout. Thus, the probability of a similar blackout occurring is reduced, at least until load growth degrades the improvements made. There are similar, but less intense responses to unrealized threats to system security such as near misses and simulated blackouts.

The pattern or vector of component loadings may be thought of as a system state. Maximum component loadings are driven up by the slow increase in customer loads via the operating policy. High loadings increase the chances of cascading events and blackouts. The loadings of components involved in the blackout are reduced or relaxed by the engineering responses to security threats and blackouts. However, the loadings of some components not involved in the blackout may increase. These opposing forces driving the component loadings up and relaxing the component loadings are a reflection of the standard tradeoff between satisfying customer loads economically and security. The opposing forces apply over a range of time scales. We suggest that the opposing forces, together with the underlying growth in customer load and diversity give rise to a dynamic complex system equilibrium. Moreover, we suggest that in this dynamic equilibrium cascading blackouts occur with a frequency governed approximately by a power law relationship between blackout probability and blackout size. That is, these forces drive the system to a dynamic equilibrium near a critical point.

The load increase is a force weakening the power system (reducing operating margin) and the system upgrades are a force strengthening the system (increasing operating margin). If the power system is weak, then there will be more blackouts and hence more upgrades of the lines involved in the blackout and this will strengthen the power system. If the power system is strong, then there will be fewer blackouts and fewer line upgrades, and the load increase will weaken the system. Thus, the opposing forces drive the system to a dynamic equilibrium that keeps the system near a certain pattern of operating margins relative to the load. Note that engineering improvements and load growth are driven by strong, underlying economic and societal forces that are not easily modified.

These ideas of complex dynamics by which the network evolves are inspired by the corresponding concepts of self-organized criticality in statistical physics.^{55–57}

### B. OPA blackout model for a slowly evolving network

The OPA blackout model^{8,58–60} represents the essentials of slow load growth, cascading line outages, and the increases in system capacity caused by the engineering responses to blackouts. Cascading line outages leading to blackout are regarded as fast dynamics and are modeled as described in Sec. IV and the lines involved in a blackout are computed. The slow dynamics model the growth of the load demand and the engineering response to the blackout by upgrades to the grid transmission capability. The slow dynamics represent an idealized form of the complex dynamics outlined in subsection A. The slow dynamics are carried out by the following small changes applied each time a potential cascading failure is simulated: All loads are multiplied by a fixed parameter that represents the rate of increase in electricity demand. If a blackout occurs, then the lines involved in the blackout have their line flow limits increased slightly. The grid topology remains fixed in the upgrade of the lines for model simplicity. In upgrading a grid it is important to maintain coordination between the upgrade of generation and transmission. The generation is increased at randomly selected generators subject to coordination with the limits of nearby lines when the generator capacity margin falls below a threshold. The OPA model is “top-down” and represents the processes in greatly simplified forms, although the interactions between these processes still yield complex (and complicated!) behaviors. The simple representation of the processes is desirable both to study only the main interactions governing the complex dynamics and for pragmatic reasons of model tractability and simulation run time.
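The feedback between load growth and upgrades can be illustrated with a deliberately crude scalar caricature. This is not the OPA model: the sketch collapses the network to a single load and a single capacity, and the blackout probability `stress ** steepness`, the daily growth factor, and the upgrade factor are all hypothetical choices made only to show the two opposing forces settling to a steady fractional loading.

```python
import random


def evolve_loading(days, capacity, rng, load=1.0, growth=1.00005,
                   upgrade=1.05, steepness=20):
    """Return the daily fractional loading (load / capacity) of a one-component toy grid."""
    history = []
    for _ in range(days):
        load *= growth  # slow secular load growth (roughly 2% per year)
        stress = load / capacity
        # Hypothetical blackout probability that rises steeply with loading.
        if rng.random() < min(1.0, stress) ** steepness:
            capacity *= upgrade  # engineering response: upgrade after a blackout
        history.append(load / capacity)
    return history


def long_run_loading(initial_capacity, days=40000, tail=5000, seed=0):
    """Average fractional loading over the last `tail` days of one run."""
    history = evolve_loading(days, initial_capacity, random.Random(seed))
    return sum(history[-tail:]) / tail


if __name__ == "__main__":
    for capacity in (1.1, 3.0):  # start heavily loaded or lightly loaded
        print(capacity, round(long_run_loading(capacity), 2))
```

Whether the toy grid starts heavily or lightly loaded, the long-run fractional loading settles in the same band, because a lightly loaded system sees few blackouts and its margin is eroded by load growth, while a heavily loaded system is upgraded often.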

An example of the evolving average line margins and power served is shown in Fig. 8. The power served follows the exponential increase in load except when it is reduced by blackouts. The line loading is averaged over the lines and the figure shows the average line loading converging to a steady state of approximately 70%.

Moreover, when the generator upgrade process is suitably coordinated with the line upgrades and load increase, OPA results show power law regions in the pdf of blackout sizes. For example, OPA results for the IEEE 118-bus network and an artificial 382-bus tree-like network are shown in Fig. 9. Both the power law region of the pdf and the consistency with the NERC blackout data are evident. This result was achieved by the internal dynamics modeled in the system and is in this sense self-organizing to a critical point.

### C. Blackout mitigation

The success of mitigation efforts in self-organized critical systems is strongly influenced by the dynamics of the system. Unless the mitigation efforts alter the self-organization forces driving the system, the system will be pushed to a critical point. To alter those forces with mitigation efforts may be quite difficult because the engineering and economic forces are an intrinsic part of our society. The mitigation efforts can then move the system to a new dynamic equilibrium while remaining near a critical point and preserving the power law dependence. Thus, while the absolute frequency of blackouts of all sizes may be reduced, the underlying forces can still cause the relative frequency of large blackouts to small blackouts to remain the same.

Indeed, apparently sensible efforts to reduce the risk of smaller blackouts can sometimes increase the risk of large blackouts.^{13} This occurs because the large and small blackouts are not independent but are strongly coupled by the dynamics. For example, the longer-term response to small blackouts can influence the frequency of large blackouts in such a way that measures to reduce the frequency of small blackouts can eventually reposition the system to have an increased risk of large blackouts. The possibility of an overall adverse effect on risk from apparently sensible mitigation efforts shows the importance of accounting for complex system dynamics when devising mitigation schemes.^{13} For example, Fig. 10 shows the results of inhibiting small numbers of line outages using the OPA model with self-organization on the IEEE 118-bus system.^{13} One of the causes of line outages in OPA is the outage of lines with a probability $p_1$ when the line is overloaded. The results show the effect of inhibiting these outages when the number of overloaded lines is less than 10. The inhibition corresponds to more effective system operation to resolve these overloads. Blackout size is measured by number of line overloads. The inhibition is, as expected, successful in reducing the smaller numbers of line outages, but eventually, after the system has repositioned to its dynamic equilibrium, the number of larger blackouts has increased. The results shown in Fig. 10 are distributions of blackouts in the self-organized dynamic equilibrium and reflect the long-term effects of the inhibition of line outages. It is an interesting open question to what extent power transmission systems are near their dynamic equilibrium, but operation near dynamic equilibrium is the simplest assumption at the present stage of knowledge of these complex dynamics.

Similar effects are familiar and intuitive in other complex systems. For example, more effectively fighting small forest fires allows the forest system to readjust with increased brush levels and closer tree spacing so that when a forest fire does happen by some chance to progress to a larger fire, a huge forest fire is more likely.^{13}

## VIII. CONCLUSIONS

We have summarized and explained an approach to series of cascading failure blackouts at a global systems level. This way of studying blackouts is complementary to existing detailed analyses of particular blackouts and offers some new insights into blackout risk, the nature of cascading failure, the occurrence and significance of critical points, and the complex system dynamics of blackouts.

The power law region in the distribution of blackout sizes in observed blackout data^{6,19–23} has been reproduced by power system blackout models^{7–10} and some abstract models of cascading failure^{11,12} and engineering design.^{47} The power law profoundly affects the risk of large blackouts, making this risk comparable to, or even exceeding the risk of small blackouts. The power law also precludes many conventional statistical models with exponential-tailed distributions and new approaches to the risk analysis of blackouts need to be developed.^{11,12,25,32,33}

We think that the power law region in the distribution of blackout sizes arises from cascading failure when the power system is loaded near a critical point. Several power system blackout models^{7,9,10} and abstract models of cascading failure^{11,12} show evidence of a critical loading at which the probability of cascading failure sharply increases. Determining the proximity to critical loading and the overall blackout risk from power system simulations or data is an important problem. The current approaches include Monte Carlo simulation methods to compute the proximity to critical loading^{7,9,38} and ways of quantifying propagation of failure $\lambda $ using branching process models of cascading failure. We are pursuing practical methods of estimating $\lambda $ from real or simulated failure data.^{12,30,32,33} It is also of interest to find quantities that influence (at least in the short term) the distribution of blackout sizes such as loading level, spinning reserve, hidden failure probability, and control actions.^{7,9,36}

A novel and much larger view of the power system dynamics considers the opposing forces of growing load and the upgrade of the transmission network in response to real or simulated blackouts. Our simulation results show that these complex dynamics can self-organize the system to be near a critical point.^{8} These complex dynamics are driven by strong societal and economic forces and the difficulties or tradeoffs in achieving long-term displacement of the power system away from the complex systems equilibrium caused by these forces should not be underestimated. Indeed we have simulated a simple example of a blackout mitigation method that successfully limits the frequency of small blackouts, but in the long term increases the frequency of large blackouts as the transmission system readjusts to its complex systems equilibrium.^{13} In the light of this example, we suggest that the blackout prevention problem be reframed as jointly mitigating the probabilities of small, medium, and large blackouts.

If the power system is self-organizing to near a critical point in response to strong economic, societal, and engineering forces, this may limit the ways in which the distribution of blackout sizes may be readily changed. For example, the self-organization may make it hard in the long term to change the approximate power law form of the distribution of blackout size. It could be easier to preserve the power law form but reduce the frequency of blackouts of all sizes by the same fraction. It is of interest to find quantities that influence the long-term, steady-state distribution of blackout sizes such as component reliability, redundancy, and generator margin.^{13,61} While it is conceivable that structural change to our society such as draconian energy saving measures or widespread local electricity generation could change or at least temporarily suspend the annual load increase on the transmission system that is a driver of the self-organization, a conservative assumption for North America is that the historically robust annual growth rate in electricity usage will continue and the transmission system will continue to provide most of the electrical energy to customers.

A quantitative overall risk analysis of blackouts is only now emerging and it is not yet clear for optimizing the power system whether a power law region in the distribution of blackout size is desirable, undesirable, or, as suggested by self-organization, inevitable. Once practical methods for quantifying blackout risk become established, they can be used to assess the change in risk when specific improvements are made to the power system. This ability to quantify the reliability benefits of improvements would be particularly helpful in evaluating and trading off the costs and benefits of proposed reliability policies or standards. In any case, in optimizing the blackout risk by considering the costs and benefits of reliability improvements, instead of only considering the short-term reliability of a fixed power grid with an improvement, we suggest also examining the long-term reliability of an evolving power grid that is governed by the complex dynamics of its upgrade process.

There are good prospects for extracting engineering and scientific value from the further development of models, simulations and computations and we hope that the overview in this paper encourages further developments and practical applications in this emerging and exciting area of research. There is an opportunity for systems research to make a substantial contribution to understanding and managing the risk of cascading failure blackouts.

## ACKNOWLEDGMENTS

We gratefully acknowledge support in part from National Science Foundation Grant Nos. ECCS-0606003, ECCS-0605848, SES-0623985, and SES-0624361. I.D. gratefully acknowledges that this paper is an account of work sponsored in part by the Power Systems Engineering Research Center (PSERC). Part of this research has been carried out at Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

## REFERENCES

The lower voltage distribution systems are usually studied separately because of their different characteristics. For example, transmission networks tend to be meshed networks with multiple parallel paths for power flow, whereas distribution networks are usually operated in a radial fashion.