In many fields, accurate prediction of cascade outbreaks during their early stages of propagation is of paramount importance. Based on percolation theory, we propose a global propagation probability algorithm that effectively estimates the probability of information spreading from source nodes to the giant component. Building on this, we further introduce an early prediction method for cascade outbreaks, which provides quantitative predictions of both the probability and scope of cascade outbreaks by fully considering the network structure data and propagation dynamics. Through our research, we observe that cascade outbreaks resemble a phase transition. When approaching the critical point of an outbreak, a few specific activating nodes typically facilitate the transmission of information throughout the entire network, thus enabling early inference of a cascading outbreak. To validate our findings, we conducted experiments on diverse network structures using a classical propagation model and applied our proposed method to analyze a real microblog cascade dataset. The experimental results robustly demonstrate the superiority of our approach over baseline methods in terms of effectively predicting cascade outbreaks with high precision and early detection capability.

In today’s interconnected world, the rapid spread of information through social networks or other underlying networks has the potential to trigger cascades of events with far-reaching consequences. In recent years, numerous scientists have dedicated their efforts to studying the patterns of cascading propagation and modeling the dynamics of transmission to predict future evolutionary trends. However, there is still a lack of effective research methods for assessing the global propagation potential of infectious nodes based solely on the limited early-stage data. To address this challenge, we investigate the association between early infected nodes and the giant component based on network percolation theory and propose a quantitative method to effectively estimate the probability of information spreading from the source to the entire network. This research not only introduces a new method for early prediction of cascade outbreaks but also sheds light on the underlying dynamics of these phenomena, unveiling the global spread potential via early infected nodes. Therefore, our approach offers meaningful early warning and guidance in the risk assessment and proactive management of significant events.

## I. INTRODUCTION

With the rapid development of radio communications and the Internet, as well as the miniaturization of mobile devices, people will re-share the received information with their neighbors, so various types of information can be quickly spread among the crowd. This process of spreading information through re-sharing is called cascade.^{1} Most information can often be localized, dissipating in a short time or within a small area. However, a small amount of information can be transmitted from generation to generation like a virus, eventually spreading to a large population of people. This chain reaction has been observed in various contexts, including advertising and marketing,^{2} social networks,^{3,4} email communication,^{5} and epidemic spreading.^{6} Therefore, it is of great significance to accurately predict cascade outbreaks and their ranges in many scenarios.

The dissemination of information often depends on the underlying social network, because only connected users can disseminate information. Although various forecasting methods have been proposed for cascade outbreaks,^{6–10} they mainly focus on the topological importance of activated nodes in the network or the short-term propagation potential, which limits the exploitation of the global structural information of the network. However, as cascade explosions are global phenomena, utilizing the network’s global characteristics for predicting presents a new research direction. In addition, traditional machine learning methods such as Refs. 11–13 utilize early time series data or topological importance as features to predict cascade outbreaks. However, these methods face challenges in training models that effectively incorporate both global structure and temporal features. Recently, graph neural networks (GNNs) have shown promise for predicting spatiotemporal features in cascade propagation.^{14–16} These methods model propagation dynamics on complex networks through neighborhood aggregation and predict cascade propagation. However, GNN-based propagation modeling is a regression task, and early propagation data may be insufficient for training parameters. To supplement these methods, we aim to employ early spread data combined with a network structure and prior information from classic propagation models^{17–19} to predict cascade outbreaks, which can be treated as a classification task.

Network percolation theory is an important tool for studying explosive dissemination^{20,21} due to its equivalence with the classical propagation model and its ability to utilize global network structure information for analyzing cascade bursts. In many complex networks, the overall interconnection frameworks rely on a specific set of structural nodes. Activation of these nodes can spread information throughout the entire network.^{22} At the early stage of probabilistic propagation, a small number of activated nodes may activate the entire network through some key nodes with a certain probability. This probability can help predict and explain possible future cascade bursts. Building on prior research, we incorporate percolation theory to forecast cascade outbreaks and mainly make the following three contributions:

- Based on message passing methods^{23} and percolation theory, we propose a global propagation probability (GPP) algorithm that accurately reflects the likelihood of the activated nodes to activate the entire network. This index establishes a connection between local and global network structures in the sense of percolation.
- Based on the GPP algorithm, we present a novel method for Cascade Outbreak Prediction (COP), which we denote as the GPP-COP method. This method can monitor the probability of cascade outbreaks online and estimate the final outbreak range. GPP-COP utilizes prior information on network structure and propagation dynamics, providing a new scientific tool for cascade outbreak research.
- We conduct experiments on various simulated and real network structures. Experimental results demonstrate that the GPP-COP method can provide quantitative and deterministic inference in the early stages of propagation, exhibiting strong potential for practical applications.

## II. PRELIMINARY

### A. Related works

In recent years, considerable effort has been directed toward developing solutions for the problem of early prediction of cascade bursts. Several cascade prediction methods, which do not rely on the underlying network, have achieved promising results. These methods have attempted to predict viral cascades through the use of information content,^{24} temporal dynamics,^{25,26} and influential users.^{27,28} Specifically, Leskovec *et al.*^{29} selected some optimal nodes as sensors to predict cascade outbreaks. Subsequently, Cui *et al.*^{11} proposed Orthogonal Sparse LOgistic Regression (OSLOR) method to evaluate node weights for outbreak prediction. Agarwal *et al.*^{30} and Zhao *et al.*^{31} proposed methods that learn a parameterized rate function for predicting cascade outbreaks. Pinto *et al.*^{12} performed multiple linear regression on time-series data of information cascades to predict future outbreaks. More recently, Gou *et al.*^{13} employed a Long Short-Term Memory (LSTM) neural network to learn the sequential features.

However, with the continuous improvement of computing power and the emphasis on constructing datasets, an increasing number of network structure data are published. The relationship between the cascade bursts and the specific network structure remains an urgent problem to be solved. Ugander *et al.*^{9} utilized local neighbor structures to predict the global information outburst on Facebook. Weng *et al.*^{10} believed that the user’s community structure is an important factor affecting global cascades. Zhang *et al.*^{32} focused on the influence of friends on self-reposting behavior and incorporated the locality of social influence into the factor graph model to better predict the spread of reposting behavior on large microblog networks. Moreover, Cao *et al.*^{14} proposed the CoupledGNN method to capture the change in node state and influence, and Chen *et al.*^{15} proposed the Multiscale Graph Capsule Network method to learn the latent features of cascade graphs from a multi-scale perspective. These methods require a sufficient amount of historical data of cascade propagation to model the propagation dynamics on the network.

Network percolation theory is a fundamental tool for the study of networks.^{33,34} In particular, Hu *et al.*^{35} found that the final propagation range of information obeys a bimodal distribution, either confined to a very small portion or spanning the entire network. They were the first to propose the concept of global propagation probability and designed a multi-simulation method to estimate it. Our earlier work^{36} applied the global propagation probability to the edge percolation model and proposed an approximate calculation method.

### B. Equivalence between the classical propagation models and the edge percolation model

Propagation dynamics models play a crucial role in the study of the contagion of diseases, rumors, and other forms of information. These models are not only capable of modeling the mechanisms of transmission dynamics at an individual level, but also reveal macroscopic phenomena and patterns within complex systems through large-scale simulations. Consequently, such models are instrumental in enhancing our understanding, prediction, and management of cascade outbreaks. Without loss of generality, our study is grounded in the Susceptible-Infected-Recovered (SIR)^{18} model to examine the potential of early-stage infected nodes for disseminating information throughout the whole network. Nodes in the SIR model are categorized into three states: susceptible ($S$), infected ($I$), and recovered ($R$). Specifically, node $u$ in state $I$ will infect a neighboring node $v$ in state $S$ with probability $\varphi_{uv}$ during a unit time interval $\delta t$. We assume that the infection cycle $\tau_0$ contains $m_0$ unit times such that $\tau_0 = m_0\,\delta t$. Once the infection cycle completes, the node in state $I$ transitions to state $R$, losing the ability to infect others and becoming immune to further infection.

The theory of network percolation is a powerful tool for dynamic problems such as cascade failure, disease dissemination, and traffic flow. For a given network $G$, we consider each edge to be retained with probability $\beta$ and deleted with probability $1-\beta$. There exists a critical probability threshold $\beta_c$ such that when $\beta > \beta_c$, the largest connected component of the remaining network $G_r$ is of the same order as the entire network and is referred to as the giant component.

Many classic epidemic models are closely related to the edge percolation theory. In the SIR model, the probability that a node $u$ in state $I$ will infect its neighbor node $v$ in state $S$ during the entire infection cycle is $\beta_{uv} = 1-(1-\varphi_{uv})^{m_0}$, that is, the edge $e_{uv}$ is retained with the probability $\beta_{uv}$. Therefore, if the source node belongs to the giant component, the information will spread throughout the network; otherwise, the information can only propagate locally. Thus, the SIR model is equivalent to the network edge percolation model, and the edge percolation model can be used to study global outbreaks in some classical propagation models.
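As a minimal sketch of this mapping, the retention probability of an edge can be computed from the per-step infection probability and the number of unit times in the infection cycle. The function name is hypothetical, and the sketch assumes that the $m_0$ unit-time trials are independent with a constant probability $\varphi_{uv}$, as in the SIR description above.

```python
def edge_retention_probability(phi_uv: float, m0: int) -> float:
    """Probability that u infects v at least once within m0 unit time steps.

    Assumes each of the m0 steps is an independent trial with success
    probability phi_uv (illustrative helper, not code from the paper).
    """
    if not 0.0 <= phi_uv <= 1.0:
        raise ValueError("phi_uv must lie in [0, 1]")
    if m0 < 0:
        raise ValueError("m0 must be non-negative")
    return 1.0 - (1.0 - phi_uv) ** m0


# Illustrative values only (not taken from the experiments).
beta_uv = edge_retention_probability(phi_uv=0.1, m0=3)
print(round(beta_uv, 3))  # 1 - 0.9**3 = 0.271
```

Because the mapping is applied edge by edge, heterogeneous values of $\varphi_{uv}$ simply yield heterogeneous retention probabilities $\beta_{uv}$.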

## III. THEORY AND MODEL

Consider a network $G=(V,E)$ consisting of $N$ nodes connected by $M$ edges, where the node set $V=\{v_1, v_2, \ldots, v_N\}$, and the edge set $E=\{e_{ij} \mid 1 \le i,j \le N\}$. To remain mathematically or computationally tractable, we assume that the transmission probability between nodes $u$ and $v$ is denoted as $\beta_{uv}$, representing the probability that an active node $u$ activates its neighbor $v$ within the infection period $\tau_0$. This hypothesis is widely applicable across many propagation models and consistent with a broad array of real-world scenarios. Accordingly, we focus on the following two primary tasks: (1) predicting the occurrence of a cascade based on early propagation data, and (2) estimating the ultimate extent of the information impact via the early infected nodes. To accomplish these tasks, we first propose the GPP algorithm to calculate the global propagation probability (equivalent to the probability of cascade bursts). Building on this, we introduce the early prediction method for the online monitoring of the cascade outbreaks.

### A. Approximate algorithm of global propagation probability

In early literature, it can be observed that the ultimate extent from a source node $u$ approximates a bimodal distribution,^{35} indicating that the information propagated by $u$ either diffuses throughout the entire network or remains confined to a small local subset. In fact, the source node $u$ disseminates information to the giant component $G_g$ of the network with a probability $p(u, s_\infty)$, where $s_\infty$ denotes the size of $G_g$ and $p(u, s_\infty)$ represents the global propagation probability. In our previous work,^{36} we proposed the original first-order and second-order iterative methods for estimating $p(u, s_\infty)$. However, these methods exhibit limitations in terms of predictive accuracy (see Appendix A for details). To achieve more accurate and efficient computation, we introduce a novel iterative algorithm below.

We consider the local tree-like structure [Fig. 1(a)] to derive our method; however, deviations from this assumption may introduce computational errors. Our objective is to minimize this error through some refinements. As shown in Fig. 1(b), there are five situations that disrupt the local tree-like structure (here, we consider a local structure encompassing first- and second-order neighbors), where edges denoted by the black dashed lines destroy the local tree-like characteristics of node $u$. We call these edges inner edges. When inner edges are present, the probabilities of reaching the giant component via different paths may not be independent, thereby introducing errors into our iterative algorithm. Moreover, in calculating the probability $p_{v_0 \to u}$, inner edges closer to the topological position of node $u$ exert a greater influence on our algorithm. Consequently, our primary attention is devoted to the inner edges among first-order neighbors, as depicted in case (1) of Fig. 1(b) (for additional cases, refer to Appendix C).
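For orientation, the sketch below implements the standard message passing recursion for edge percolation under the pure tree-like assumption. It is not the GPP algorithm itself: the discount coefficient $\lambda$ for inner edges is not reproduced here, and all names (`tree_like_gpp`, `adj`, `beta`, `H`) are hypothetical. The quantity `H[(i, j)]` is the probability that node `i` fails to reach the giant component through neighbor `j`, which satisfies $H_{i \leftarrow j} = 1 - \beta_{ij} + \beta_{ij} \prod_{k \in N(j) \setminus i} H_{j \leftarrow k}$.

```python
def tree_like_gpp(adj, beta, n_iter=500, tol=1e-12):
    """Plain message passing for edge percolation on a locally tree-like graph.

    adj  : dict mapping each node to a list of its neighbors (undirected graph)
    beta : dict mapping each directed pair (u, v) to the retention probability
    Returns a dict: node -> estimated probability of reaching the giant component.
    """
    # H[(i, j)]: probability that i does NOT reach the giant component via j.
    H = {(i, j): 0.5 for i in adj for j in adj[i]}
    for _ in range(n_iter):
        max_change = 0.0
        for (i, j) in H:
            prod = 1.0
            for k in adj[j]:
                if k != i:  # exclude the edge we arrived from
                    prod *= H[(j, k)]
            new_value = 1.0 - beta[(i, j)] + beta[(i, j)] * prod
            max_change = max(max_change, abs(new_value - H[(i, j)]))
            H[(i, j)] = new_value
        if max_change < tol:
            break
    p = {}
    for i in adj:
        miss = 1.0
        for j in adj[i]:
            miss *= H[(i, j)]
        p[i] = 1.0 - miss  # at least one neighbor leads to the giant component
    return p


# Toy input only: a complete graph is far from tree-like and merely exercises the code.
nodes = range(5)
adj = {u: [v for v in nodes if v != u] for u in nodes}
beta = {(u, v): 0.9 for u in adj for v in adj[u]}
print(tree_like_gpp(adj, beta)[0])
```

Each sweep touches every directed edge once, so the recursion involves $2M$ coupled equations. When inner edges are present, the product over neighbors treats dependent paths as independent, which is the source of error discussed above.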

### B. Early prediction method for cascade outbreaks

Cascade Outbreak Prediction (COP) refers to the forecasting of whether information can propagate throughout a network. Given a network structure $G$ and propagation probability $\beta_{uv}$, we propose the GPP-COP method to quantitatively assess the probability and range of cascade outbreaks.

According to Eqs. (5)–(8), we can predict the probability and range of the cascade outbreak at an early stage. This prediction method is referred to as the GPP-COP method, and the pseudocode is provided in Algorithm 2 in Appendix D.

### C. Algorithmic complexity analysis

To elucidate the procedural implementation of our method and to analyze its complexity, we provide the pseudocode for the GPP algorithm and GPP-COP method in Appendix D.

For the GPP algorithm, let $\bar{d}$ denote the average degree of the network, $m$ the number of iterations required for convergence, and $H$ the number of operations needed to calculate the discount coefficient $\lambda$. The average time complexity of the GPP algorithm is then $2mH(\bar{d}-1)^{2}M$. Here we restrict our consideration to the inner edges among first-order neighbors, whereby Eq. (4) is employed directly as the discount coefficient $\lambda$ [refer to Eq. (5)]. In this case, $H$ is a small finite value, so the time complexity of the GPP algorithm is linear in the number of edges, $O(M)$.

For the GPP-COP method, we need to execute the GPP algorithm at each time point, resulting in a computational complexity that is linear, proportional to the number of time points $t$. Consequently, our approach demonstrates a high level of computational efficiency.

## IV. EXPERIMENTS AND RESULTS

### A. Dataset

To validate the effectiveness of the proposed algorithm, we conducted experiments on two synthetic and three real-world networks. (1) ER (Erdős–Rényi).^{37} We generate an ER random network with 10 000 nodes in which any two nodes are connected with probability 0.0003. (2) BA (Barabási–Albert).^{37} We add three edges in each iteration until we obtain a network with 10 000 nodes. (3) CA-HepPh.^{38} This is a collaboration network of the arXiv High Energy Physics category with 12 008 nodes and 237 010 edges. (4) Deezer_Europe.^{39} This is a social network of Deezer users, collected from the public API in March 2020. It contains 28 281 nodes and 92 752 edges. (5) Facebook_Large.^{40} This webgraph is a page-page graph of verified Facebook sites with 22 470 nodes and 171 002 edges.
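As a rough consistency check on the two synthetic networks, the expected densities follow directly from the stated parameters. The BA line assumes that "three edges in each iteration" means each newly added node attaches with three edges, so the edge count is only approximate (it ignores the small seed graph).

```latex
\begin{align*}
\langle k \rangle_{\mathrm{ER}} &= p\,(N-1) = 0.0003 \times 9999 \approx 3.0, \\
\mathbb{E}[M_{\mathrm{ER}}] &= p\,\frac{N(N-1)}{2} = 0.0003 \times 49\,995\,000 \approx 1.5 \times 10^{4}, \\
M_{\mathrm{BA}} &\approx 3N = 3 \times 10^{4}, \qquad
\langle k \rangle_{\mathrm{BA}} = \frac{2 M_{\mathrm{BA}}}{N} \approx 6 .
\end{align*}
```

Under these assumptions, both synthetic networks are considerably sparser than CA-HepPh, whose average degree is $2 \times 237\,010 / 12\,008 \approx 39.5$.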

Furthermore, we also test our method on a real cascade propagation dataset (Weibo^{41}). This dataset is from Sina Weibo, the most popular microblogging platform in China. The dataset collects all messages generated on June 1, 2016 and tracks retweets over the next 24 h. After removing messages with fewer than 10 retweets, there are a total of 119 313 messages. Our method infers the future popularity of tweets based on the number of early retweets.

### B. The effect of GPP algorithm

The GPP algorithm calculates the global propagation probability of each node in the network by iteratively solving $2M$ equations. Given knowledge of the network structure and the propagation probability of each edge, this algorithm can estimate the probability of any set of propagation sources successfully transmitting information throughout the network. In this section, we focus on verifying the effectiveness of the GPP algorithm for estimating the global propagation probability.

To verify the effectiveness of the GPP algorithm, we conducted multiple simulations from each propagation source and took the frequency of global propagation as the true value of the global propagation probability. We compared the GPP algorithm with the original first-order and second-order iterative algorithms of prior work^{36} in three networks: CA-HepPh, Deezer_Europe, and Facebook_Large. Specifically, we randomly select 100 nodes in the network and apply the three methods to calculate the global propagation probability of these nodes. The errors between the estimated values and the true values are presented in Fig. 3. The results indicate that the GPP algorithm is significantly more accurate than the original first- and second-order iterative algorithms and reduces the average error. In these three real networks, the average error of our GPP algorithm is within 0.01. Part of this error stems from the random error of the finite number of simulations, which further supports the high computational accuracy of our algorithm.
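The simulation-based reference value can be sketched as follows. This is a minimal illustration under stated assumptions, not the experimental code: it uses a uniform retention probability `beta`, and it declares a run "global" when the reached set exceeds a hypothetical fraction `global_fraction` of the network, which is reasonable only because the final size is approximately bimodal. All function and parameter names are assumptions.

```python
import random
from collections import deque


def simulated_outbreak_size(adj, sources, beta, rng):
    """One SIR-like run in its edge-percolation form; returns the number of reached nodes."""
    reached = set(sources)
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            # Each edge receives a single transmission trial with probability beta.
            if v not in reached and rng.random() < beta:
                reached.add(v)
                queue.append(v)
    return len(reached)


def monte_carlo_gpp(adj, sources, beta, n_runs=1000, global_fraction=0.1, seed=0):
    """Frequency of runs whose final size reaches global_fraction of the network."""
    rng = random.Random(seed)
    cutoff = global_fraction * len(adj)
    hits = sum(
        simulated_outbreak_size(adj, sources, beta, rng) >= cutoff
        for _ in range(n_runs)
    )
    return hits / n_runs
```

The standard error of such a frequency estimate is $\sqrt{p(1-p)/n}$ for $n$ runs, so a few thousand runs already limit the achievable agreement to roughly the 0.01 level.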

In addition, to investigate the effect of the GPP algorithm in mitigating the influence of inner edges, we arbitrarily selected two nodes in the Deezer_Europe network as propagation sources and depict their second-order local structures (without inner edges) in Figs. 4(a) and 4(b). Then, we add inner edges to destroy the local tree-like characteristics and compare the accuracy of different methods under different propagation probabilities, as shown in Figs. 4(c)–4(f) and 4(g)–4(j). In addition to the first-order and second-order methods, we also conduct ablation experiments on the discount factor $\lambda$ of the proposed method, denoted as “benchmark” [that is, $\lambda =1$ in Eqs. (5) and (6)]. The experimental results in Fig. 4 show that the discount factor $\lambda$ effectively reduces errors caused by inner edges, particularly in networks with a higher number of inner edges and lower average degrees. When the network structure closely approximates a local tree-like topology, both the “benchmark” method and the proposed method with the discount factor exhibit enhanced detection performance. Therefore, our GPP algorithm can accurately estimate the global propagation probability of all nodes in the network.

### C. Early prediction of cascade outbreaks based on GPP-COP method

The percolation theory is an effective method for estimating cascade outbreaks. When the underlying network is known, information will spread to the giant component of the network through a cascade. In this section, we experimentally validate the efficacy of our method in the early prediction task of cascade outbreaks.

#### 1. Baselines and evaluation metrics

To better illustrate the effectiveness of our method, we experimentally compare it with the following five most popular methods.

- **Rand**:^{42} We randomly select several nodes as sensors and use ridge regression to predict cascade outbreaks.
- **Degree**:^{42} We select several nodes with the highest degree as sensors and use ridge regression to predict cascade outbreaks.
- **OSLOR**^{11} (Orthogonal sparse logistic regression): This method is a logistic regression that jointly optimizes node selection and outbreak prediction. It can be defined as

  $$h(\tilde{X}_{i\cdot}^{t}) = \frac{1}{1+\exp\left(-\theta_0 - \tilde{X}_{i\cdot}^{t}\theta\right)},$$

  where $\tilde{X}^{t}$ is the cascade status matrix, $\tilde{X}_{i\cdot}^{t}$ is the $i$th cascade status at time $t$, and $\theta$ is the weight parameter, which is obtained by minimizing the following loss:

  $$F(\theta) = -\log L(\theta) + \frac{\beta}{4}\sum_{i,j}\left(\theta_i \tilde{X}_{\cdot i}^{T}\tilde{X}_{\cdot j}\theta_j\right)^{2} + \gamma \Vert\theta\Vert,$$

  where $L(\theta)$ is the likelihood function of $h(\tilde{X}_{i\cdot}^{t})$.
- **MLR**^{12} (Multivariate linear regression): The model can be defined as $x_f = \Theta \cdot X_{t_e}$, where $\Theta$ is the learning parameter and $X_{t_e}$ is the time series data of the number of infections in the early stage ($t \le t_e$). When the indicator $x_f$ exceeds a predefined threshold, a cascade outbreak is predicted.
- **LSTMIC**^{13} (Long short-term memory for information cascade): This method uses a recurrent neural network with long short-term memory units to directly learn sequential patterns.

#### 2. Early prediction of cascade outbreak probability and range

Based on the classic SIR propagation model, we apply the GPP-COP method to the networks of BA, ER, and Deezer_Europe and make predictions about the probability and range of cascade outbreaks in the early stages of transmission. Subsequently, we evaluate the performance of our online monitoring method through experiments.

We randomly select a single propagation source in each network for simulation propagation based on the SIR model. As illustrated in Fig. 5, two types of behavior are observed. For the case of Fig. 5(a), the global propagation probability $p(\mathrm{seed}, s_\infty)$ of the propagation source fluctuates up and down in the early stage of propagation. When it reaches the burst threshold $1-\alpha$ (with $\alpha =0.01$), we make an inference of a cascade burst. The green dotted line in the figure is the outbreak warning line. In addition, we can also infer the explosion range $s_\infty$, as shown by the black dashed line in Fig. 5(a), which shows that our predicted explosion range is in close agreement with the simulated value. For the situation in Fig. 5(c), the global propagation probability remains below the outbreak warning line and eventually decays to zero. In this situation, the information ceases to propagate after reaching a small fraction of the network.
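The warning rule described above amounts to a first-passage test on the monitored probability. A minimal sketch, assuming the global propagation probability has already been evaluated at successive time points (the function name and the example series are hypothetical):

```python
from typing import Optional, Sequence


def first_warning_step(gpp_series: Sequence[float], alpha: float = 0.01) -> Optional[int]:
    """Index of the first time point at which the probability reaches 1 - alpha.

    Returns None when the series never reaches the warning line, which
    corresponds to a cascade that stays local.
    """
    threshold = 1.0 - alpha
    for step, probability in enumerate(gpp_series):
        if probability >= threshold:
            return step
    return None


# Made-up series for illustration: one that crosses the line, one that decays.
print(first_warning_step([0.20, 0.60, 0.40, 0.995, 1.0]))  # 3
print(first_warning_step([0.30, 0.10, 0.0]))               # None
```

A smaller $\alpha$ delays the warning but lowers the chance of flagging a cascade that later dies out.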

In the same way, we consider selecting multiple nodes as propagation sources for experiments and can obtain similar conclusions. As demonstrated in Figs. 5(b) and 5(d), it can be observed that the multi-source cases exhibit a faster and easier global breakout than the single-source cases. Figure 5(e) illustrates the early spread in the Deezer_Europe network, and it marks six stages of propagation. Figure 5(f) displays the network diagram of these six stages, in which the nodes that have received information are indicated in red. When the propagation reaches the fifth stage, the global propagation probability exceeds the warning threshold, so we can make inferences of cascade outbreaks.

#### 3. Predicted effects in simulation models

To validate the predictive efficacy of the proposed method for cascade outbreaks, we conduct experiments based on SIR propagation model and compare the proposed method with the baseline methods. Furthermore, to facilitate its applicability in real-world scenarios, we thoroughly discussed the experimental conditions and verified the robustness of our method.

We simulate propagation based on the SIR model in ER, BA, and Deezer_Europe networks, generating 1000 training data and 100 test data for each network. Subsequently, we apply the above five benchmark methods and the proposed method to the cascade dataset. To ensure a fair comparison, we used appropriate parameters for the baseline methods. Specifically, we selected 500 sensors for the Rand and Degree methods. For the OSLOR method, we choose appropriate $\beta $ and $\gamma $ values under different experimental conditions. For the Rand, Degree, and MLR methods, we used ridge regression with a regularization factor of 0.1 for the final predictions. For the LSTMIC method, we set the batch size to 16 and the number of epochs to 30. The test results under different methods are shown in Fig. 6, which shows that our proposed method outperforms the other methods in predicting cascade outbreaks across all three experiments. In the binary classification task of cascade outbreaks, the proposed method achieves the highest accuracy as well as good precision and recall metrics, resulting in the highest f1 score compared to the baseline methods. It is noteworthy that our baseline methods are all supervised methods. However, in the aforementioned experiments, our method, which does not rely on historical data, still achieves the best performance. This is mainly due to its effective utilization of the structural data and cascade propagation probability of the network.
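For reference, the classification metrics used in this comparison can be computed as in the sketch below, where label 1 denotes a cascade outbreak and label 0 a local spread. The function name is hypothetical, and returning 0 when a denominator vanishes is a convention chosen here, not one stated in the text.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and f1 score for binary labels (1 = outbreak, 0 = local)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Since f1 is the harmonic mean of precision and recall, a method with high precision but very low recall still receives a low f1 score.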

In addition, the GPP-COP method can estimate the outbreak range. To verify the accuracy of the predictions, we conducted multiple SIR simulation experiments on the Facebook_Large network with different propagation probabilities. As shown in Fig. 7(a), we use the outbreak range of 100 experiments as simulation values at each probability. We can find that the calculated values of our method are in close agreement with the mean of simulated values. Furthermore, Fig. 7(a) indicates a small amount of fluctuation, as evidenced by the triple standard deviation of the simulated value. As such, our method can effectively predict the outbreak range.

However, in many practical scenarios, the cascade propagation probability and network structure cannot be obtained directly, but need to be inferred from the system operating mechanism or historical data. This kind of inference has been widely studied in prior work. Here, we adopt a rough estimation method and assume that the cascade propagation probabilities in the network are all the same. Then, according to the outbreak range $\tilde{s}_\infty$ in the historical data, we can estimate the cascade propagation probability $\hat{\beta} = g^{-1}(\tilde{s}_\infty)$ [see Fig. 7(a)]. Of course, this estimation introduces a certain error $\Delta\beta = \hat{\beta} - \beta$. Then, we test our method under different $\Delta\beta$. Figures 7(b)–7(d) demonstrate that, as $\Delta\beta$ increases, the precision score decreases while the recall score increases, with the best f1 score achieved when $\Delta\beta = 0$. To minimize the impact of $\Delta\beta$, we obtain the optimal burst threshold $p^{*}$ (equal to $1-\alpha$) through training. Figure 7(b) illustrates the prediction results when $\Delta\beta = -0.02$ and $t_e = 10$, and we obtain $p^{*} = 0.26$ through training. As shown in Fig. 7(c), by choosing the burst threshold after training, our prediction effect is greatly improved, and the effect is close to the case of $\Delta\beta = 0$. In fact, when the estimation of the network structure also has errors, we can improve the predictive ability of the method by selecting the appropriate cascade propagation probability and burst threshold through the training set. Our results demonstrate that the proposed method exhibits strong robustness and can be applied to more scenarios.
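The inversion $\hat{\beta} = g^{-1}(\tilde{s}_\infty)$ can be illustrated with a stand-in for $g$. The sketch below does not use the GPP-based range estimate; instead it assumes an ER network with mean degree $\langle k \rangle$ and uniform $\beta$, for which the giant-component fraction $S$ in the large-network limit satisfies $S = 1 - \exp(-\langle k \rangle \beta S)$. Because this map is monotone in $\beta$, bisection suffices. All names are hypothetical.

```python
import math


def er_outbreak_fraction(beta, mean_degree, n_iter=500):
    """Stand-in for g: solves S = 1 - exp(-mean_degree * beta * S) by fixed-point iteration."""
    s = 1.0  # start from the full network so the iteration finds the non-trivial root
    for _ in range(n_iter):
        s = 1.0 - math.exp(-mean_degree * beta * s)
    return s


def estimate_beta(observed_fraction, mean_degree, n_bisect=60):
    """Invert the monotone map beta -> S by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if er_outbreak_fraction(mid, mean_degree) < observed_fraction:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Round trip with illustrative values: generate a range, then recover beta from it.
observed = er_outbreak_fraction(beta=0.6, mean_degree=3.0)
print(round(estimate_beta(observed, mean_degree=3.0), 6))  # 0.6
```

For a real network, `er_outbreak_fraction` would be replaced by whatever range estimate is available, and the same bisection applies as long as the range grows monotonically with $\beta$.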

#### 4. Experimental effect in Sina Weibo dataset

This dataset collects all messages on June 1, 2016 and records retweets over the next 24 h. For each message, the data record the publisher’s id and publishing time, and the forwarder’s id and forwarding time. We use $s(24\,\mathrm{h})$ to approximate the final popularity $s_\infty$. Our task is to predict the eventual occurrence of a cascade outbreak using early data.

Before conducting experiments, we pre-process the cascaded propagation data. First, since this dataset has no network structure, we take the message forwarding network in the historical data as the underlying network, which contains a total of 6 738 040 nodes and 15 293 817 edges. Then, we take the data whose forwarding volume is in the top 5% as the case of cascade outbreak, and the rest as the case of local spread. Finally, to evaluate the proposed method, we randomly select 2500 cascade outbreak data and 2500 local spread data for experiments, with 4000 used for the training set and 1000 for the test set.
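The labeling step can be sketched as follows, assuming the forwarding volume of each message is available as a plain list. The function name is hypothetical, and ties at the cut-off are broken by input order, which is a choice made here and is not specified in the text.

```python
def label_outbreaks(forward_counts, top_fraction=0.05):
    """Label the messages with the largest forwarding volumes as outbreaks (1), the rest as local (0)."""
    n_top = max(1, int(len(forward_counts) * top_fraction))
    # Indices sorted by forwarding volume, largest first (stable for ties).
    ranked = sorted(
        range(len(forward_counts)),
        key=lambda i: forward_counts[i],
        reverse=True,
    )
    top = set(ranked[:n_top])
    return [1 if i in top else 0 for i in range(len(forward_counts))]


# Illustrative volumes only: 100 messages with 1..100 retweets.
labels = label_outbreaks(list(range(1, 101)))
print(sum(labels))  # 5
```

Because only 5% of messages are labeled as outbreaks, sampling equal numbers from both classes, as described above, yields a balanced evaluation set.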

In addition, in previous simulation experiments, the propagation dynamics of different cascades were identical, and the cascade outbreak was a probabilistic event dependent on the topological location. However, in the real Weibo data, different information contents lead to great differences in dissemination probability, which can be roughly reflected by the number of retweets in the early stage. Therefore, we combine the GPP-COP method with the early infected number to predict the cascade outbreak. Specifically, we take the early infected number $n(t_e)$ and the global propagation probability $p_s(t_e)$ as features and use a simple linear regression model to predict cascade outbreaks. Then, we can learn the optimal cascade propagation probability $\beta$ and the weights of the linear regression model through the training set. Finally, we get the new spatiotemporal prediction model, which is still denoted by GPP-COP.

We conducted experiments on the GPP-COP method, as well as baseline methods, and report prediction results in Table I. Due to its $O(n^{2})$ algorithmic complexity, the OSLOR method is not suitable for large-scale data and is therefore not considered. For the Rand and Degree methods, we selected 10 000 sensors. However, these two methods do not perform well because of the large number of nodes in the network and the limited size of the training set. For the MLR and LSTMIC methods, we count the number of retweets every 5 min to get early time series data. The two methods fully consider the early time series data and exhibit advantages in the prediction of cascade outbreaks in social networks. On this basis, our method also considers the topological positions of early infected users within the network, so this method utilizing spatiotemporal information demonstrates superior performance.

| Method | Precision (%), $e_t=10$ min | Recall (%), $e_t=10$ min | F1 (%), $e_t=10$ min | Precision (%), $e_t=20$ min | Recall (%), $e_t=20$ min | F1 (%), $e_t=20$ min | Precision (%), $e_t=60$ min | Recall (%), $e_t=60$ min | F1 (%), $e_t=60$ min |
|---|---|---|---|---|---|---|---|---|---|
| Rand | 78.6 | 2.2 | 4.3 | 85.7 | 6.0 | 11.2 | 91.8 | 13.4 | 23.4 |
| Degree | 82.6 | 29.6 | 43.6 | 85.4 | 29.2 | 43.5 | 87.9 | 40.8 | 55.7 |
| MLR | 87.2 | 68.3 | 76.6 | 88.6 | 73.2 | 80.2 | 92.6 | 79.8 | 85.7 |
| LSTMIC | 83.8 | 72.6 | 77.8 | 86.5 | 75.8 | 80.8 | 90.9 | 80.6 | 85.4 |
| GPP-COP | 84.1 | 75.2 | 79.9 | 87.9 | 76.3 | 81.7 | 92.2 | 80.4 | 85.9 |
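For reference, the three reported metrics can be computed from binary predictions as sketched below, assuming the standard definitions in which F1 is the harmonic mean of precision and recall. The function name `precision_recall_f1` and the toy labels are illustrative only.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = cascade outbreak)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy labels: two true positives, one false positive, one false negative.
p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])

# Harmonic mean applied to the MLR entries at 10 min (87.2 and 68.3).
mlr_f1 = round(2 * 87.2 * 68.3 / (87.2 + 68.3), 1)
print(mlr_f1)  # 76.6
```

Because F1 penalizes low recall, the Rand and Degree methods score poorly despite their high precision.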


As Table I shows, our method attains the highest F1 score at all three early times, although its precision is somewhat below that of MLR. Even though the network structure and propagation probability in the Weibo dataset are only roughly estimated from historical data, the proposed method still outperforms the baseline methods in F1 score. Therefore, our method holds the potential to extend to more real-world scenarios.

## V. DISCUSSIONS

Although our GPP algorithm requires the local structure of the network to be tree-like, the introduction of the discount coefficient $\lambda$ substantially extends the application scope of the algorithm. We conduct experiments in a variety of network scenarios, including ER random networks, BA random networks, and three complex networks from real-world scenarios. The experimental results demonstrate the efficacy of the GPP algorithm in accurately estimating the global propagation probability of any activated node set. Furthermore, our method does not require each edge to have the same propagation probability, which makes it applicable to heterogeneous networks.

Our GPP-COP method employs the GPP algorithm as a component to monitor the probability of cascade outbreaks online and to estimate the final burst range. Experimental results demonstrate that our method significantly outperforms the baseline methods. This is primarily because the proposed method makes full use of the network structure and the propagation dynamics. In addition, we find that a few specific nodes can propagate information to the network globally, and this is closely related to the combination of topological positions where these nodes are located. Our method quantifies this combination of topological positions as probability values, allowing predictions about cascade outbreaks from a network-global perspective.

Our method has certain limitations that warrant further research. First, it relies on pre-existing knowledge of cascade propagation probabilities and network structures. Thus, we need to focus on integrating advanced strategies for estimating the network structure and the propagation probability. Second, the community structure and multi-layer attributes of networks are also key factors that influence the effectiveness of the proposed method. This issue was partially addressed through a strategy in an earlier study.^{36} Future work could integrate these structural attributes into the proposed framework to enhance the accuracy of cascade outbreak predictions. Third, in recent years, higher-order interactions within complex networks have been increasingly identified across a multitude of domains, including physics, biology, and sociology. Extending our methodology to accommodate the analysis of higher-order networks represents a significant and promising avenue of research. Fourth, our experimental findings indicate that optimal performance in cascade outbreak prediction can be achieved by fully leveraging temporal and spatial information. Consequently, future efforts should focus on the enhanced integration of these two critical dimensions. Finally, our study employs simplified propagation models such as the SIR model, and thus there is a need to incorporate more sophisticated propagation patterns and prior knowledge into domain-specific applications. Looking ahead, efforts will be directed toward integrating this approach into specific fields with the aim of facilitating early prediction of large-scale information dissemination within real-world systems.

## VI. CONCLUSIONS

In summary, this paper focuses on the problem of predicting cascade outbreaks. We first propose the GPP algorithm, which can calculate the global propagation probability of nodes in the network. On this basis, we propose the GPP-COP method. In simulation experiments with the SIR model, our method can quantitatively predict the probability and range of cascade outbreaks online at an early stage of propagation, and it performs significantly better than the baseline methods. Furthermore, we briefly analyze the robustness of the proposed method and apply it to a real Weibo dataset. Experimental results show that our method fully utilizes the structural and dynamic features of the network to achieve better predictions.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Xin Li:** Writing – original draft (equal). **Huichun Li:** Data curation (equal); Writing – review & editing (equal). **Xue Zhang:** Writing – review & editing (equal). **Chengli Zhao:** Supervision (equal). **Xiaojun Duan:** Validation (equal).

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

### APPENDIX A: ERROR ANALYSIS ABOUT THE PREVIOUS METHOD

We review the previous method^{36} and perform error analysis. We denote the propagation probability from node $u$ to $v$ as $\beta_{uv}$ and the global propagation probability of node $u$ as $p(u, s_\infty)$, that is, the probability that node $u$ belongs to the giant component of the network. As shown in Fig. 1(b) in the main text, considering the first-order case, the probability that node $u$ belongs to the giant component is equal to the probability that node $u$ is connected to the giant component through at least one of its neighbors $v$.
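As a worked form of the statement above, and assuming that the connections through different neighbors are mutually independent, the first-order relation can be sketched as follows. The neighbor set $\mathcal{N}(u)$ is notation introduced here for illustration, and the expression is a reconstruction from the verbal description under that independence assumption.

$$
p(u, s_\infty) = 1 - \prod_{v \in \mathcal{N}(u)} \left[ 1 - \beta_{uv}\, p(v, s_\infty) \right]
$$

Each factor $1 - \beta_{uv}\, p(v, s_\infty)$ is the probability that $u$ fails to reach the giant component through neighbor $v$, so the product is the probability that all neighbors fail.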

### APPENDIX B: CONVERGENCE ANALYSIS OF THE ITERATIVE ALGORITHM

For convenience in the following analysis, we assume that the propagation probability within the network is uniformly $\beta \in (0,1)$, that is, $b = (\beta, \beta, \ldots, \beta)^{\top}$. Subsequently, we analyze $J(w^{*})$. When $\beta$ falls below the percolation threshold, the network does not have a giant component, and all elements of $w$ are equal to 1. Consequently, the Jacobian matrix can be expressed as $J(1) = \beta B$. It follows that Eq. (B2) converges stably to $1$ when $\beta$ is less than the reciprocal of the principal eigenvalue of $B$. Otherwise, $w^{*} = 1$ is an unstable fixed point, and minor perturbations will cause the system (B4) to converge toward a different fixed point.
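The stability condition can be checked numerically on a small example. In the sketch below, the matrix `B` is a stand-in (the adjacency matrix of a triangle) chosen only for illustration and is not the matrix defined in the main text; the helper names are likewise assumptions. The code estimates the principal eigenvalue by power iteration and then iterates the linearized dynamics $\delta \leftarrow \beta B \delta$ for a small perturbation $\delta$ around the fixed point.

```python
def matvec(B, x):
    """Multiply matrix B (list of rows) by vector x."""
    return [sum(b * xi for b, xi in zip(row, x)) for row in B]

def norm(x):
    """Euclidean norm of a vector."""
    return sum(v * v for v in x) ** 0.5

def principal_eigenvalue(B, iters=200):
    """Power iteration; assumes a non-negative B with a unique dominant eigenvalue."""
    x = [1.0 + 0.1 * i for i in range(len(B))]
    lam = 0.0
    for _ in range(iters):
        y = matvec(B, x)
        ny = norm(y)
        lam = ny / norm(x)
        x = [v / ny for v in y]
    return lam

def perturbation_after(B, beta, steps=50):
    """Norm of a small perturbation after iterating delta <- beta * B * delta."""
    delta = [1e-3 * (i + 1) for i in range(len(B))]
    for _ in range(steps):
        delta = [beta * v for v in matvec(B, delta)]
    return norm(delta)

# Stand-in matrix: adjacency matrix of a triangle (principal eigenvalue 2).
B = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
beta_c = 1.0 / principal_eigenvalue(B)   # critical value, 0.5 for this B
decayed = perturbation_after(B, 0.4)     # beta below beta_c: perturbation shrinks
grown = perturbation_after(B, 0.6)       # beta above beta_c: perturbation grows
print(beta_c, decayed, grown)
```

Below the critical value every eigenvalue of $\beta B$ has modulus less than one, so the perturbation contracts; above it, the component along the principal eigenvector grows geometrically.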

### APPENDIX C: ANALYSIS OF DISCOUNT COEFFICIENT $\lambda $

In order to reduce the errors caused by inner edges, we introduce the discount coefficient $\lambda $. The specific calculation process is as follows.

We use $T_{u \to v}$ to denote the event that node $u$ connects to the giant component through the first-order neighbor $v$, and $T_{u \to v \to w}$ to denote the event that node $u$ connects to the giant component through the first-order neighbor $v$ and the second-order neighbor $w$. Conversely, we use $F_{*}$ to denote that the event $*$ does not occur. In the networks of Fig. 1(c), the events $F_{u \to v \to w}$ are not approximately independent of each other, and direct calculation with Eq. (2) will introduce non-negligible errors. Below we discuss the possible errors in these situations and the corresponding corrections.

**Case 1:** This situation is the main source of error, which we analyze in Eq. (3) of the main text, and the discount coefficient is recorded as $\lambda_1(v_1, v_2) = (1 - \beta_{uv_1} p_{u, v_2 \to v_1}) / [(1 - \beta_{uv_1} p_{u \to v_1})\, p_{v_2 \to v_1}]$, where $p_{u, v_2 \to v_1}$.

**Case 2:** As shown in subgraph (1) of Fig. 1(c), when the edge $e_{uv_2}$ exists, under the condition of $F_{u \to v_1 \to v_2}$, the probability of $T_{v_2 \to w_{21}}$ is

**Case 3:** As shown in subgraph (2) of Fig. 1(c), the inner edge connects the first-order and second-order neighbors of the propagation source. When the edge $e_{uv_2}$ exists, in the case of $F_{u \to v_1 \to w_{12}}$, the conditional probability of $T_{v_2 \to w_{12}}$ is

**Case 4:** As shown in subgraph (3) of Fig. 1(c), when the edge $e_{uv_2}$ exists, in the case of $F_{u \to v_1 \to v_2}$ and $F_{u \to v_1 \to w_{12}}$, the probability of $T_{v_2 \to w_{12}}$ is

**Case 5:** As shown in subgraph (4) of Fig. 1(c), when the inner edges connect two second-order neighbors of the propagation source, the impact is smaller than in Cases 2 and 3 and much smaller than in Case 1, so we ignore it.

### APPENDIX D: PSEUDO CODE OF ALGORITHM

In this section, we give the pseudocode of the GPP algorithm and the GPP-COP method. The meanings of the symbols in the algorithms below are consistent with the main text.

Algorithm 1 describes in detail how the GPP algorithm calculates the global propagation probability of all nodes in the network. To improve operational efficiency, the discount coefficient $\lambda$ in the sixth line can also be calculated by Eq. (3).

Algorithm 2 describes in detail how the GPP-COP method predicts cascade outbreaks. For any early time $t$, we can calculate the global propagation probability $p_S$ and the outbreak range $s_\infty$ by the proposed method. Then, combining the three inference ideas in Sec. III B of the main text, we can make inferences about cascade outbreaks in the early stages of propagation for different scenarios.

## REFERENCES

*Proceedings of the International AAAI Conference on Web and Social Media* (AAAI Press, 2014), Vol. 8, pp. 101–110.

*Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining* (ACM, 2003), pp. 137–146.

*Proceedings of the 2017 ACM on Conference on Information and Knowledge Management* (ACM, 2017), pp. 467–476.

*Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining* (ACM, 2013), pp. 901–909.

*Proceedings of the Sixth ACM International Conference on Web Search and Data Mining* (ACM, 2013), pp. 365–374.

*Proceedings of the 13th International Conference on Web Search and Data Mining* (ACM, 2019).

*Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining* (ACM, 2002), pp. 61–70.

*Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining* (ACM, 2009), pp. 497–506.

*Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining* (ACM, 2007), pp. 420–429.

*Proceedings of the 18th International Conference on World Wide Web* (ACM, 2009), pp. 21–30.

*Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining* (ACM, 2015), pp. 1513–1522.

*Network Science: an Introduction*

*Proceedings of the 29th ACM International Conference on Information & Knowledge Management* (ACM, 2020), pp. 1325–1334.

*Proceedings of the 2017 ACM on Conference on Information and Knowledge Management* (ACM, 2017), pp. 1149–1158.