In many fields, accurate prediction of cascade outbreaks during their early stages of propagation is of paramount importance. Based on percolation theory, we propose a global propagation probability algorithm that effectively estimates the probability of information spreading from source nodes to the giant component. Building on this, we further introduce an early prediction method for cascade outbreaks, which provides quantitative predictions of both the probability and scope of cascade outbreaks by fully considering the network structure data and propagation dynamics. Through our research, we observe that cascade outbreaks resemble a phase transition. When approaching the critical point of an outbreak, a few specific activating nodes typically facilitate the transmission of information throughout the entire network, thus enabling early inference of a cascading outbreak. To validate our findings, we conducted experiments on diverse network structures using a classical propagation model and applied our proposed method to analyze a real microblog cascade dataset. The experimental results robustly demonstrate the superiority of our approach over baseline methods in terms of effectively predicting cascade outbreaks with high precision and early detection capability.

In today’s interconnected world, the rapid spread of information through social networks or other underlying networks has the potential to trigger cascades of events with far-reaching consequences. In recent years, numerous scientists have dedicated their efforts to studying the patterns of cascading propagation and modeling the dynamics of transmission to predict future evolutionary trends. However, there is still a lack of effective research methods for assessing the global propagation potential of infectious nodes based solely on the limited early-stage data. To address this challenge, we investigate the association between early infected nodes and the giant component based on network percolation theory and propose a quantitative method to effectively estimate the probability of information spreading from the source to the entire network. This research not only introduces a new method for early prediction of cascade outbreaks but also sheds light on the underlying dynamics of these phenomena, unveiling the global spread potential via early infected nodes. Therefore, our approach offers meaningful early warning and guidance in the risk assessment and proactive management of significant events.

With the rapid development of radio communications and the Internet, as well as the miniaturization of mobile devices, people will re-share the received information with their neighbors, so various types of information can be quickly spread among the crowd. This process of spreading information through re-sharing is called cascade.1 Most information can often be localized, dissipating in a short time or within a small area. However, a small amount of information can be transmitted from generation to generation like a virus, eventually spreading to a large population of people. This chain reaction has been observed in various contexts, including advertising and marketing,2 social networks,3,4 email communication,5 and epidemic spreading.6 Therefore, it is of great significance to accurately predict cascade outbreaks and their ranges in many scenarios.

The dissemination of information often depends on the underlying social network, because only connected users can disseminate information. Although various forecasting methods have been proposed for cascade outbreaks,6–10 they mainly focus on the topological importance of activated nodes in the network or the short-term propagation potential, which limits the exploitation of the global structural information of the network. However, as cascade outbreaks are global phenomena, utilizing the network’s global characteristics for prediction presents a new research direction. In addition, traditional machine learning methods such as Refs. 11–13 utilize early time series data or topological importance as features to predict cascade outbreaks. However, these methods face challenges in training models that effectively incorporate both global structure and temporal features. Recently, graph neural networks (GNNs) have shown promise for predicting spatiotemporal features in cascade propagation.14–16 These methods model propagation dynamics on complex networks through neighborhood aggregation and predict cascade propagation. However, GNN-based propagation modeling is a regression task, and early propagation data may be insufficient for training parameters. To supplement these methods, we aim to employ early spread data combined with the network structure and prior information from classic propagation models17–19 to predict cascade outbreaks, which can be treated as a classification task.

Network percolation theory is an important tool for studying explosive dissemination20,21 due to its equivalence with the classical propagation model and its ability to utilize global network structure information for analyzing cascade bursts. In many complex networks, the overall interconnection frameworks rely on a specific set of structural nodes. Activation of these nodes can spread information throughout the entire network.22 At the early stage of probabilistic propagation, a small number of activated nodes may activate the entire network through some key nodes with a certain probability. This probability can help predict and explain possible future cascade bursts, improving both their predictability and their comprehensibility. Building on prior research, we incorporate percolation theory to forecast cascade outbreaks and mainly make the following three contributions:

  1. Based on message passing methods23 and percolation theory, we propose a global propagation probability (GPP) algorithm that accurately reflects the likelihood of the activated nodes to activate the entire network. This index establishes a connection between local and global network structures in the sense of percolation.

  2. Based on the GPP algorithm, we present a novel method for Cascade Outbreak Prediction (COP), which we denote as the GPP-COP method. This method can monitor the probability of cascade outbreaks online and estimate the final outbreak range. GPP-COP utilizes prior information on network structure and propagation dynamics, providing a new scientific tool for cascade outbreak research.

  3. We conduct experiments on various simulated and real network structures. Experimental results demonstrate that the GPP-COP method can provide quantitative and deterministic inference in the early stages of propagation, exhibiting strong potential for practical applications.

In recent years, considerable effort has been directed toward developing solutions for the problem of early prediction of cascade bursts. Several cascade prediction methods, which do not rely on the underlying network, have achieved promising results. These methods have attempted to predict viral cascades through the use of information content,24 temporal dynamics,25,26 and influential users.27,28 Specifically, Leskovec et al.29 selected some optimal nodes as sensors to predict cascade outbreaks. Subsequently, Cui et al.11 proposed Orthogonal Sparse LOgistic Regression (OSLOR) method to evaluate node weights for outbreak prediction. Agarwal et al.30 and Zhao et al.31 proposed methods that learn a parameterized rate function for predicting cascade outbreaks. Pinto et al.12 performed multiple linear regression on time-series data of information cascades to predict future outbreaks. More recently, Gou et al.13 employed a Long Short-Term Memory (LSTM) neural network to learn the sequential features.

However, with the continuous improvement of computing power and the emphasis on constructing datasets, an increasing number of network structure data are published. The relationship between the cascade bursts and the specific network structure remains an urgent problem to be solved. Ugander et al.9 utilized local neighbor structures to predict the global information outburst on Facebook. Weng et al.10 believed that the user’s community structure is an important factor affecting global cascades. Zhang et al.32 focused on the influence of friends on self-reposting behavior and incorporated the locality of social influence into the factor graph model to better predict the spread of reposting behavior on large microblog networks. Moreover, Cao et al.14 proposed the CoupledGNN method to capture the change in node state and influence, and Chen et al.15 proposed the Multiscale Graph Capsule Network method to learn the latent features of cascade graphs from a multi-scale perspective. These methods require a sufficient amount of historical data of cascade propagation to model the propagation dynamics on the network.

Network percolation theory is a fundamental tool for the study of networks.33,34 In particular, Hu et al.35 found that the final propagation range of information obeys a bimodal distribution, either confined to a very small portion or spanning the entire network. They were the first to propose the concept of global propagation probability and designed a multi-simulation method to estimate it. Our earlier work36 applied the global propagation probability to the edge percolation model and proposed an approximate calculation method.

Propagation dynamics models play a crucial role in the study of the contagion of diseases, rumors, and other forms of information. These models are not only capable of modeling the mechanisms of transmission dynamics at an individual level, but also reveal macroscopic phenomena and patterns within complex systems through large-scale simulations. Consequently, such models are instrumental in enhancing our understanding, prediction, and management of cascade outbreaks. Without loss of generality, our study is grounded in the Susceptible-Infected-Recovered (SIR)18 model to examine the potential of early-stage infected nodes for disseminating information throughout the whole network. Nodes in the SIR model are categorized into three states: susceptible ($S$), infected ($I$), and recovered ($R$). Specifically, node $u$ in state $I$ will infect a neighboring node $v$ in state $S$ with probability $\phi_{uv}$ during a unit time interval $\delta t$. We assume that the infection cycle $\tau_0$ contains $m_0$ unit times such that $\tau_0 = m_0\,\delta t$. Once the infection cycle completes, the node in state $I$ transitions to state $R$, losing the ability to infect others and becoming immune to further infection.
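As a minimal sketch of the discrete-time SIR dynamics described above, the following Python function simulates one cascade. It assumes a single infection probability `phi` shared by all edges (the text allows an edge-specific probability), an adjacency dictionary `adj`, and synchronous updates in each unit time; the function name and data layout are illustrative choices rather than part of the original method.

```python
import random


def simulate_sir(adj, phi, m0, sources, seed=0):
    """Run one discrete-time SIR cascade and return the set of recovered nodes.

    adj     : dict mapping each node to a list of its neighbors (undirected)
    phi     : per-unit-time infection probability (assumed uniform over edges)
    m0      : number of unit times in one infection cycle
    sources : iterable of initially infected nodes
    """
    rng = random.Random(seed)
    state = {u: "S" for u in adj}
    age = {}  # infected node -> number of unit times already spent in state I
    for s in sources:
        state[s] = "I"
        age[s] = 0

    while age:  # stop when no infected node remains
        newly_infected = set()
        # Every infected node tries to infect each susceptible neighbor once.
        for u in age:
            for v in adj[u]:
                if state[v] == "S" and rng.random() < phi:
                    newly_infected.add(v)
        # Advance the infection clocks; nodes recover after m0 unit times.
        for u in list(age):
            age[u] += 1
            if age[u] >= m0:
                state[u] = "R"
                del age[u]
        # Newly infected nodes start spreading from the next unit time.
        for v in newly_infected:
            state[v] = "I"
            age[v] = 0

    return {u for u, s in state.items() if s == "R"}


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(sorted(simulate_sir(path, phi=0.3, m0=10, sources=[0])))
```

Because every node can be infected at most once and recovers after `m0` unit times, the loop always terminates; the returned set is the final extent of the cascade.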

The theory of network percolation is a powerful tool for dynamic problems such as cascade failure, disease dissemination, and traffic flow. For a given network $G$, we consider each edge to be retained with probability $\beta$ and deleted with probability $1-\beta$. There exists a critical probability threshold $\beta_c$ such that when $\beta > \beta_c$, the largest connected component of the remaining network $G_r$ is of the same order as the entire network and is referred to as the giant component.
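The edge percolation process above can be illustrated with a short Python sketch that retains each edge with a uniform probability `beta` and measures the largest connected component by breadth-first search. The function name, the edge-list input, and the uniform `beta` are assumptions made for illustration only.

```python
import random
from collections import deque


def largest_component_fraction(n, edges, beta, seed=0):
    """Retain each edge with probability beta and return the fraction of
    nodes that lie in the largest connected component of what remains."""
    rng = random.Random(seed)
    adj = [[] for _ in range(n)]
    for u, v in edges:
        if rng.random() < beta:  # edge survives percolation
            adj[u].append(v)
            adj[v].append(u)

    seen = [False] * n
    best = 0
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first search over one connected component.
        seen[start] = True
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    queue.append(v)
        best = max(best, size)
    return best / n


if __name__ == "__main__":
    ring = [(i, (i + 1) % 10) for i in range(10)]
    for beta in (0.2, 0.6, 1.0):
        print(beta, largest_component_fraction(10, ring, beta))
```

Sweeping `beta` on a large network and plotting the returned fraction is one way to locate the threshold empirically: below it the fraction stays close to zero, and above it the fraction becomes a finite share of the network.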

Many classic epidemic models are closely related to the edge percolation theory. In the SIR model, the probability that a node $u$ in state $I$ will infect its neighbor node $v$ in state $S$ during the entire infection cycle is $\beta_{uv} = 1 - (1-\phi_{uv})^{m_0}$, that is, the edge $e_{uv}$ is retained with the probability $\beta_{uv}$. Therefore, if the source node belongs to the giant component, the information will spread throughout the network; otherwise, the information can only propagate locally. Thus, the SIR model is equivalent to the network edge percolation model, and the edge percolation model can be used to study global outbreaks in some classical propagation models.
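The mapping from the per-unit-time infection probability to the edge retention probability is a one-line computation. The sketch below evaluates it for the recovery period and infection rates reported with the simulation experiments (a recovery period of 10 unit times and rates of 0.03, 0.015, and 0.015); the function name is an illustrative choice.

```python
def edge_retention_probability(phi, m0):
    """Probability that an infected node transmits to a given susceptible
    neighbor at least once during an infection cycle of m0 unit times."""
    return 1.0 - (1.0 - phi) ** m0


if __name__ == "__main__":
    for name, phi in [("ER", 0.03), ("BA", 0.015), ("Deezer_Europe", 0.015)]:
        print(name, round(edge_retention_probability(phi, 10), 4))
```

Since the expression is the complement of `m0` independent failures, it increases with both `phi` and `m0` and never exceeds 1.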

Consider a network $G = (V, E)$ consisting of $N$ nodes connected by $M$ edges, where the node set $V = \{v_1, v_2, \dots, v_N\}$, and the edge set $E = \{e_{ij} \mid 1 \le i, j \le N\}$. To remain mathematically or computationally tractable, we assume that the transmission probability between nodes $u$ and $v$ is denoted as $\beta_{uv}$, representing the probability that an active node $u$ activates its neighbor $v$ within the infection period $\tau_0$. This hypothesis is widely applicable across many propagation models and consistent with a broad array of real-world scenarios. Accordingly, we focus on the following two primary tasks: (1) predicting the occurrence of a cascade based on early propagation data, and (2) estimating the ultimate extent of the information impact via the early infected nodes. To accomplish these tasks, we first propose the GPP algorithm to calculate the global propagation probability (equivalent to the probability of cascade bursts). Building on this, we introduce the early prediction method for the online monitoring of the cascade outbreaks.

In early literature, it can be observed that the ultimate extent from a source node $u$ approximates a bimodal distribution,35 indicating that the information propagated by $u$ either diffuses throughout the entire network or remains confined to a small local subset. In fact, the source node $u$ disseminates information to the giant component $G_g$ of the network with a probability $p(u, s)$, where $s$ denotes the size of $G_g$ and $p(u, s)$ represents the global propagation probability. In our previous work,36 we proposed the original first-order and second-order iterative methods for estimating $p(u, s)$. However, these methods exhibit limitations in terms of predictive accuracy (see Appendix A for details). To achieve more accurate and efficient computation, we introduce a novel iterative algorithm below.

Let $\beta_{uv}$ denote the propagation probability from node $u$ to $v$, and let $p_u$ represent the global propagation probability of node $u$. Based on the idea of message passing, we let $p_{v_0 \to u}$ denote the probability that node $u$ is connected to the giant component after removing node $v_0$. Unlike the original first-order and second-order methods, we give the iterative equations based on the edge probability $p_{v_0 \to u}$ rather than node probability $p(u, s)$. Specifically, for a network with local tree structure, $p_{v_0 \to u}$ is equal to the probability that node $u$ is connected to the giant component through at least one of its neighbors except $v_0$, so we can get the first-order form equation
$$p_{v_0 \to u} = 1 - \prod_{v \in \partial u \setminus v_0} \left( 1 - \beta_{uv}\, p_{u \to v} \right), \tag{1}$$
where $\partial u \setminus v_0$ represents the set of neighbors of $u$ excluding $v_0$. For an undirected edge $e_{v_0 u}$, information propagation can be deemed bidirectional, so we can derive two distinct equations for $p_{v_0 \to u}$ and $p_{u \to v_0}$. Consequently, we can obtain $2M$ equations for an undirected network, where $M$ is the number of edges in the network. These equations can be solved iteratively to obtain the probability $p_{v_0 \to u}$. To improve the algorithm, we first extend Eq. (1) to a second-order form by further expanding $p_{u \to v} = 1 - \prod_{w \in \partial v \setminus u} \left( 1 - \beta_{vw}\, p_{v \to w} \right)$ according to Eq. (1), i.e.,
$$p_{v_0 \to u} = 1 - \prod_{v \in \partial u \setminus v_0} \left[ 1 - \beta_{uv} \left( 1 - \prod_{w \in \partial v \setminus u} \left( 1 - \beta_{vw}\, p_{v \to w} \right) \right) \right], \tag{2}$$
which can also be solved by iteration. We also conduct a convergence analysis of the iterative algorithm; refer to Appendix B for details.
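The first-order iteration of Eq. (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: messages are stored per directed edge, all messages start from a common value `init` (zero is always a fixed point of the equations, so a positive start is used), and updates are synchronous. The function name, the stopping tolerance, and the data layout are hypothetical and are not taken from the pseudocode in Appendix D.

```python
def first_order_gpp(adj, beta, init=0.5, tol=1e-10, max_iter=1000):
    """Iterate the first-order message-passing equations.

    adj  : dict node -> list of neighbors (undirected network)
    beta : dict (u, v) -> propagation probability, given for both directions
    Returns (msg, node_prob) where msg[(v0, u)] is the probability that u is
    connected to the giant component once v0 is removed, and node_prob[u]
    is the resulting node-level global propagation probability.
    """
    msg = {(v0, u): init for u in adj for v0 in adj[u]}  # 2M unknowns

    for _ in range(max_iter):
        new = {}
        for (v0, u) in msg:
            prod = 1.0
            for v in adj[u]:
                if v != v0:  # neighbors of u excluding v0
                    prod *= 1.0 - beta[(u, v)] * msg[(u, v)]
            new[(v0, u)] = 1.0 - prod
        delta = max((abs(new[k] - msg[k]) for k in msg), default=0.0)
        msg = new
        if delta < tol:
            break

    node_prob = {}
    for u in adj:
        prod = 1.0
        for v in adj[u]:  # all neighbors of u
            prod *= 1.0 - beta[(u, v)] * msg[(u, v)]
        node_prob[u] = 1.0 - prod
    return msg, node_prob


if __name__ == "__main__":
    nodes = range(5)
    adj = {u: [v for v in nodes if v != u] for u in nodes}
    beta = {(u, v): 0.5 for u in adj for v in adj[u]}
    _, p = first_order_gpp(adj, beta)
    print({u: round(x, 4) for u, x in p.items()})
```

The complete graph in the usage example is far from tree-like, so the printed values only demonstrate the mechanics of the iteration, not the accuracy of the approximation.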

We consider the local tree-like structure [Fig. 1(a)] to derive our method; however, deviations from this assumption may introduce computational errors. Our objective is to minimize this error through some refinements. As shown in Fig. 1(b), there are five situations that disrupt the local tree-like structure (here, we consider a local structure encompassing first- and second-order neighbors), where edges denoted by the black dashed lines destroy the local tree-like characteristics of node u. We call these edges inner edges. When inner edges are present, the probabilities of reaching the giant component via different paths may not be independent, thereby introducing errors into our iterative algorithm. Moreover, in calculating the probability p v 0 u, inner edges closer to the topological position of node u exert a greater influence on our algorithm. Consequently, our primary attention is devoted to the inner edges among first-order neighbors, as depicted in case (1) of Fig. 1(b) (for additional cases, refer to  Appendix C).

FIG. 1.

Network structure diagram of information dissemination. (a) The local network structure of node u. (b) The local tree structure is destroyed by inner edges (dashed lines) in several situations.

We denote the event in which node $u$ is connected to the giant component via its first-order neighbor $v$ by $T_{uv}$, and we represent the negation of this event with $F_{uv}$. In the case (1) of Fig. 1(b), the inner edge $e_{v_1 v_2}$ (black dashed lines) connects two first-order neighbors of node $u$. Assuming the occurrence of event $T_{u v_2}$, then the events $T_{v_2 v_1}$ and $T_{u v_1}$ are not independent. Therefore, when formulating Eq. (1) or (2), one should consider the following conditional probabilities to calculate the probability of $T_{v_2 v_1}$, given by
$$P(T_{v_2 v_1} \mid F_{u v_1}) = \frac{(1 - \beta_{u v_1})\, \beta_{v_2 v_1}\, p_{u, v_2 \to v_1}}{1 - \beta_{u v_1}\, p_{u \to v_1}} = \frac{(1 - \beta_{u v_1})\, p_{u, v_2 \to v_1}}{(1 - \beta_{u v_1}\, p_{u \to v_1})\, p_{v_2 \to v_1}}\, \beta_{v_2 v_1}\, p_{v_2 \to v_1}, \tag{3}$$
where $p_{u, v_2 \to v_1}$ represents the probability that $v_1$ remains connected to the giant component after removing nodes $u$ and $v_2$, which is given by $p_{u, v_2 \to v_1} = 1 - \prod_{w \in \partial v_1 \setminus \{u, v_2\}} \left( 1 - \beta_{v_1 w}\, p_{v_1 \to w} \right)$. Compared with Eq. (2), we introduce an additional discount factor, denoted by $\lambda$. This factor is defined as
$$\lambda = \frac{(1 - \beta_{u v_1})\, p_{u, v_2 \to v_1}}{(1 - \beta_{u v_1}\, p_{u \to v_1})\, p_{v_2 \to v_1}}. \tag{4}$$
Introducing the discount coefficient $\lambda$, Eq. (2) becomes
$$p_{v_0 \to u} = 1 - \prod_{v \in \partial u \setminus v_0} \left[ 1 - \beta_{uv} \left( 1 - \prod_{w \in \partial v \setminus u} \left( 1 - \lambda\, \beta_{vw}\, p_{v \to w} \right) \right) \right], \tag{5}$$
where $\lambda$ is related to the specific local structure of the network. Then, we can get the global propagation probability
$$p_u = 1 - \prod_{v \in \partial u} \left[ 1 - \beta_{uv} \left( 1 - \prod_{w \in \partial v \setminus u} \left( 1 - \lambda\, \beta_{vw}\, p_{v \to w} \right) \right) \right]. \tag{6}$$
By utilizing Eqs. (5) and (6), we are able to compute the global propagation probability of any node with greater precision. We refer to this algorithm as the GPP algorithm (for the pseudocode of this algorithm, refer to Algorithm 1 in Appendix D).
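To make the role of the discount factor concrete, the sketch below evaluates it for one inner edge between two first-order neighbors. It reads the numerator of Eq. (4) as the product of the probability that the edge from $u$ to $v_1$ is absent and the probability that $v_1$ reaches the giant component without $u$ and $v_2$, which is the form implied by Eq. (3). The function and argument names are hypothetical, and the message toward $v_1$ is assumed to be strictly positive so that the division is defined.

```python
def discount_factor(beta_u_v1, p_u_to_v1, p_v2_to_v1, p_uv2_to_v1):
    """Discount factor for the inner edge between neighbors v1 and v2 of u.

    beta_u_v1   : propagation probability on edge (u, v1)
    p_u_to_v1   : probability that v1 reaches the giant component without u
    p_v2_to_v1  : probability that v1 reaches it without v2 (assumed > 0)
    p_uv2_to_v1 : probability that v1 reaches it without both u and v2
    """
    numerator = (1.0 - beta_u_v1) * p_uv2_to_v1
    denominator = (1.0 - beta_u_v1 * p_u_to_v1) * p_v2_to_v1
    return numerator / denominator


def conditional_reach_probability(beta_u_v1, beta_v2_v1, p_u_to_v1, p_uv2_to_v1):
    """Probability that v2 reaches the giant component via v1, given that u
    does not reach it via v1 (first form of the conditional probability)."""
    return ((1.0 - beta_u_v1) * beta_v2_v1 * p_uv2_to_v1
            / (1.0 - beta_u_v1 * p_u_to_v1))


if __name__ == "__main__":
    lam = discount_factor(0.3, 0.6, 0.7, 0.5)
    direct = conditional_reach_probability(0.3, 0.4, 0.6, 0.5)
    # Both forms of the conditional probability should agree.
    print(lam, lam * 0.4 * 0.7, direct)
```

Multiplying the undiscounted term by the factor reproduces the conditional probability, which is why the factor can be inserted directly into the second-order update.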

Cascade Outbreak Prediction (COP) refers to the forecasting of whether information can propagate throughout a network. Given a network structure G and propagation probability β u v, we propose the GPP-COP method to quantitatively assess the probability and range of cascade outbreaks.

In the case of a single propagation source, we use the notation $I(t)$ to denote the set of the infected nodes at time $t$, and $R(t)$ to represent the set of the recovered nodes by time $t$. To calculate the probability of cascade outbreaks, we combine the nodes from $I(t)$ and $R(t)$ into a new, merged node denoted by $s(t)$. In this process, for any node $u \in I(t)$, we consider any node $v$ belonging to the set $\partial u \setminus (I(t) \cup R(t))$ to be the neighbor of $s(t)$, and we set the propagation probability from $s(t)$ to $v$ as $\beta_{s(t) v}$, which is equal to $\beta_{uv}$. Figure 2 shows a simple example of the initial three steps of a propagation process, where the red color indicates that the node has received the information. At $t = 1$, we have $I(1) = \{1\}$ and $R(1) = \emptyset$, which implies that the merged node $s(1) = I(1) \cup R(1) = \{1\}$. The neighbor set is denoted as $\partial s(1) = \{2, 3, 10, 11\}$. For any $v \in \partial s(1)$, the propagation probability $\beta_{s(1) v}$ is equal to $\beta_{1v}$. Similarly, at time $t = 2$, we have $s(2) = I(2) \cup R(2) = \{1, 2, 3\}$, $\partial s(2) = \bigcup_{n \in I(2)} \partial n = \{4, 8, 9\}$. In addition, for any $n \in I(2)$ and $v \in \partial n$, we have $\beta_{s(2) v} = \beta_{nv}$. Following the aforementioned procedure, we can obtain the network encompassing the merged node $s(t)$ at any time. Thus, we can calculate the cascade outbreak probability $p_{s(t)}$ using the GPP algorithm. Furthermore, the outbreak range can be estimated, which yields
$$s \approx \sum_{n \in V,\; n \notin (I(t) \cup R(t))} p_n + \mathrm{card}(I(t) \cup R(t)), \tag{7}$$
where $\mathrm{card}(I \cup R)$ denotes the cardinality of the set $I \cup R$ and $p_n$ denotes the global propagation probability of the node $n$. In this way, we can quantitatively predict the situation of future cascade outbreaks in real time based on the early dissemination data. In practical applications, we introduce a confidence probability $\alpha$. If $p_{s(t)} > 1 - \alpha$, we give a deterministic prediction of the cascade explosion at time $t$. If $p_{s(t)} < \alpha$, we predict that the propagation will be localized. Otherwise, we need further observational data. Additionally, in the presence of observational uncertainties in network structure or cascade propagation probabilities, the optimal threshold $\alpha$ can be selected based on historical data.
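The decision rule and the range estimate of Eq. (7) can be sketched as two small Python functions. The names and return labels are illustrative, the node-level probabilities are assumed to have been computed beforehand by the GPP algorithm, and the default confidence probability of 0.01 is the value used in the simulation experiments.

```python
def classify_outbreak(p_merged, alpha=0.01):
    """Three-way decision from the outbreak probability of the merged node."""
    if p_merged > 1.0 - alpha:
        return "outbreak"       # deterministic prediction of a cascade outbreak
    if p_merged < alpha:
        return "local"          # propagation is predicted to stay localized
    return "undetermined"       # wait for further observations


def estimate_outbreak_range(node_prob, infected, recovered):
    """Expected final range: nodes already reached count fully, every other
    node contributes its global propagation probability."""
    reached = set(infected) | set(recovered)
    expected_new = sum(p for n, p in node_prob.items() if n not in reached)
    return expected_new + len(reached)


if __name__ == "__main__":
    node_prob = {1: 0.9, 2: 0.5, 3: 0.25, 4: 0.25}
    print(classify_outbreak(0.995), classify_outbreak(0.4))
    print(estimate_outbreak_range(node_prob, infected={1}, recovered=set()))
```

Calling both functions at every observation time gives the online monitoring loop: the label stays undetermined until the merged-node probability leaves the interval between the two thresholds.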
FIG. 2.

Schematic diagram of early cascade propagation. (a) Local structure of the early propagation. (b) The merged mapping of the propagation sources in the first three steps.

For the case of multiple propagation sources, the fundamental idea is unchanged from that of a single source. Consider a situation with $r$ propagation sources. If the infection regions of two sources overlap at a certain time, they merge into a single effective source, ensuring that the infection regions of all effective sources remain separate from one another. We denote the number of effective sources at time $t$ by $r(t)$. Within the infection region of the $i$th effective source, we denote the set of nodes in state $I$ by $I_i(t)$ and the set of nodes in state $R$ by $R_i(t)$. Similarly, we merge the nodes in $I_i(t)$ and $R_i(t)$ into a new node $s_i(t)$. For any node $u_i \in I_i(t)$, we consider any node $v_i$ belonging to the set $\partial u_i \setminus (I_i(t) \cup R_i(t))$ as the neighbor of $s_i(t)$, and we set the probability as $\beta_{s_i(t) v_i}$, which is equal to $\beta_{u_i v_i}$. Furthermore, we merge all $\{ s_i(t) \mid i = 1, 2, \dots, r(t) \}$ into a single node $S(t)$, and consider all nodes $\{ v \in \partial s_i(t) \mid i = 1, 2, \dots, r(t) \}$ to be the neighbors of $S(t)$. When inheriting probabilities, for any $v \in \partial S(t)$, we let $Q = \{ i \mid v \in \partial s_i(t) \}$, where $Q$ has at least one element. When the set $Q$ contains multiple elements, $v$ serves as a common neighbor of multiple effective sources. In this case, the inheritance probability is
$$\beta_{S(t) v} = 1 - \prod_{j \in Q} \left( 1 - \beta_{s_j(t) v} \right). \tag{8}$$
Then we can employ the GPP algorithm, akin to the single-source scenario, to estimate the probability and extent of cascade outbreaks.
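The merging step can be sketched as follows. The function below collapses all effective sources into one merged node and returns its neighbors together with the inherited probabilities of Eq. (8). Two points are assumptions of this sketch rather than statements of the text: a susceptible node adjacent to several infected nodes of the same source is treated with the same complement-product rule, and nodes already reached by any source are excluded from the neighbor set. All names are hypothetical.

```python
def merge_sources(adj, beta, regions):
    """Merge effective sources into a single node.

    adj     : dict node -> list of neighbors
    beta    : dict (u, v) -> propagation probability
    regions : list of (infected_set, recovered_set), one per effective source
    Returns a dict mapping each neighbor v of the merged node to the
    inherited propagation probability from the merged node to v.
    """
    reached = set()
    for infected, recovered in regions:
        reached |= set(infected) | set(recovered)

    # miss[v] accumulates the probability that no infected node activates v.
    miss = {}
    for infected, _recovered in regions:
        for u in infected:  # only nodes in state I can still transmit
            for v in adj[u]:
                if v not in reached:
                    miss[v] = miss.get(v, 1.0) * (1.0 - beta[(u, v)])
    return {v: 1.0 - q for v, q in miss.items()}


if __name__ == "__main__":
    adj = {0: [2], 1: [2], 2: [0, 1, 3], 3: [2]}
    beta = {(u, v): 0.5 for u in adj for v in adj[u]}
    # Two effective sources (nodes 0 and 1) share the common neighbor 2.
    print(merge_sources(adj, beta, [({0}, set()), ({1}, set())]))
```

The returned dictionary can be attached to the network as the edges of a single new node, after which the same probability computation as in the single-source case applies.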

According to Eqs. (5)–(8), we can predict the probability and range of the cascade outbreak at an early stage. This prediction method is referred to as the GPP-COP method, and the pseudocode is provided in Algorithm 2 in  Appendix D.

To elucidate the procedural implementation of our method and to analyze its complexity, we provide the pseudocode for the GPP algorithm and GPP-COP method in  Appendix D.

For the GPP algorithm, assuming the average degree of the network is $\bar{d}$, the number of iterations required for convergence is $m$, and the number of calculations needed to calculate the discount coefficient $\lambda$ is $H$, the average time complexity of the GPP algorithm is $2 m H (\bar{d} - 1)^2 M$. Here we restrict our consideration to the inner edges among first-order neighbors, whereby Eq. (4) is employed directly as the discount coefficient $\lambda$ [refer to Eq. (5)]. In this case, $H$ is a small finite value, so the time complexity of the GPP algorithm is linear, represented as $O(M)$.

For the GPP-COP method, we need to execute the GPP algorithm at each time point, resulting in a computational complexity that is linear, proportional to the number of time points t. Consequently, our approach demonstrates a high level of computational efficiency.

To validate the effectiveness of the proposed algorithm, we conducted experiments on two synthetic and three real-world networks. (1) ER (Erdős-Rényi).37 We generate an ER random network with 10 000 nodes in which any two nodes are connected with probability 0.0003. (2) BA (Barabási-Albert).37 We add three edges in each iteration until we obtain a network with 10 000 nodes. (3) CA-HepPh.38 This is a collaboration network of the Arxiv High Energy Physics category with 12 008 nodes and 237 010 edges. (4) Deezer_Europe.39 A social network of Deezer users that was collected from the public API in March 2020. It contains 28 281 nodes and 92 752 edges. (5) Facebook_Large.40 This webgraph is a page-page graph of verified Facebook sites with 22 470 nodes and 171 002 edges.
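For readers who want to reproduce the synthetic ER setting, the following pure-Python sketch samples an edge list with a given connection probability and reports the expected mean degree. The sampler is a plain pairwise loop written for clarity; its name and the seed handling are illustrative, and a graph library would normally be used for networks of this size.

```python
import random
from itertools import combinations


def er_edges(n, p, seed=0):
    """Sample the edge list of an ER random network: each of the
    n * (n - 1) / 2 node pairs is connected independently with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def expected_mean_degree(n, p):
    """Expected degree of a node when each pair is linked with probability p."""
    return p * (n - 1)


if __name__ == "__main__":
    # Parameters of the ER network used in the experiments.
    print(expected_mean_degree(10_000, 0.0003))
    # A smaller instance keeps the pairwise loop fast.
    edges = er_edges(1_000, 0.003, seed=1)
    print(2 * len(edges) / 1_000)
```

With 10 000 nodes and a connection probability of 0.0003, the expected mean degree is close to 3, so the unpercolated network itself is sparse.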

Furthermore, we also test our method on a real cascade propagation dataset (Weibo41). This dataset is from Sina Weibo, the most popular microblogging platform in China. The dataset collects all messages generated on June 1, 2016 and tracks retweets over the next 24 h. After removing messages with fewer than 10 retweets, a total of 119 313 messages remain. Our method infers the future popularity of tweets based on the number of early retweets.

The GPP algorithm calculates the global propagation probability of each node in the network by iteratively solving 2 M equations. Given knowledge of the network structure and the propagation probability of each edge, this algorithm can estimate the probability of any set of propagation sources successfully transmitting information throughout the network. In this section, we focus on verifying the effectiveness of the GPP algorithm for estimating the global propagation probability.

To verify the effectiveness of the GPP algorithm, we conducted multiple simulations from each propagation source and took the frequency of global propagation as the true value of the global propagation probability. We compared the GPP algorithm with the original first-order and second-order iterative algorithms of prior work36 in three networks: CA-HepPh, Deezer_Europe, and Facebook_Large. Specifically, we randomly selected 100 nodes in each network and applied the three methods to calculate the global propagation probability of these nodes. The errors between the estimated values and the true values are presented in Fig. 3. The results indicate that the GPP algorithm is significantly more accurate than the original first- and second-order iterative algorithms. In these three real networks, the average error of our GPP algorithm is within 0.01. Part of this error is the random error introduced by the finite number of simulations, which indicates that the algorithm itself is highly accurate.
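The simulation-based reference value can be sketched as a Monte Carlo estimate: run many independent percolation-style cascades from the same source and record how often the cascade becomes global. The text does not state how a run is classified as global, so the cutoff `min_fraction` below is an assumption of this sketch, as are the function name and the number of runs.

```python
import random
from collections import deque


def simulated_gpp(adj, beta, source, runs=1000, min_fraction=0.1, seed=0):
    """Estimate the global propagation probability of `source` as the
    fraction of simulated cascades that reach at least min_fraction of
    the network (the cutoff is an assumed definition of 'global')."""
    rng = random.Random(seed)
    n = len(adj)
    hits = 0
    for _ in range(runs):
        reached = {source}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                # Each reached node gets one transmission attempt per
                # unreached neighbor, succeeding with probability beta[(u, v)].
                if v not in reached and rng.random() < beta[(u, v)]:
                    reached.add(v)
                    queue.append(v)
        if len(reached) >= min_fraction * n:
            hits += 1
    return hits / runs


if __name__ == "__main__":
    nodes = range(6)
    adj = {u: [v for v in nodes if v != u] for u in nodes}
    beta = {(u, v): 0.3 for u in adj for v in adj[u]}
    print(simulated_gpp(adj, beta, source=0, runs=2000, min_fraction=0.5))
```

The standard error of such an estimate shrinks with the square root of the number of runs, which is the random error that any comparison against simulated values inherits.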

FIG. 3.

The calculation error of the global propagation probability under different algorithms. [(a)–(c)] The calculation errors of different algorithms in CA-HepPh, Deezer_Europe, and Facebook_Large networks, respectively. (d) The average error of the three algorithms under different networks.


In addition, to investigate the effect of the GPP algorithm in mitigating the influence of inner edges, we arbitrarily selected two nodes in the Deezer_Europe network as propagation sources and depicted their second-order local structure diagrams (without inner edges) in Figs. 4(a) and 4(b). Then, we add inner edges to destroy the local tree-like characteristics and compare the calculation effects of different methods under different propagation probabilities, as shown in Figs. 4(c)–4(f) and 4(g)–4(j). In addition to the first-order and second-order methods, we also conduct ablation experiments on the discount factor $\lambda$ of the proposed method, denoted as “benchmark” [that is, $\lambda = 1$ in Eqs. (5) and (6)]. The experimental results in Fig. 4 show that the discount factor $\lambda$ effectively reduces errors caused by inner edges, particularly in networks with a higher number of inner edges and lower average degrees. However, when the network structure closely approximates a local tree-like topology, both the “benchmark” method and the proposed method with the discount factor exhibit enhanced detection performance. Therefore, our GPP algorithm can accurately estimate the global propagation probability of all nodes in the network.

FIG. 4.

The calculation errors in the case of different inner edges in Deezer_Europe network. [(a) and (b)] The second-order neighbor structure diagrams of nodes $s_1$ and $s_2$. [(c)–(j)] The calculation errors of four methods under different inner edges.


The percolation theory is an effective method for estimating cascade outbreaks. When the underlying network is known, information will spread to the giant component of the network through a cascade. In this section, we experimentally validate the efficacy of our method in the early prediction task of cascade outbreaks.

1. Baselines and evaluation metrics

To better illustrate the effectiveness of our method, we experimentally compare it with the following five most popular methods.

  1. Rand:42 We randomly select several nodes as sensors and use ridge regression to predict cascade outbreaks.

  2. Degree:42 We select several nodes with the highest degree as sensors and use ridge regression to predict cascade outbreaks.

  3. OSLOR11 (Orthogonal sparse logistic regression): a logistic regression that jointly optimizes node selection and outbreak prediction. This method can be defined as
    $$h(\tilde{X}_i^t) = \frac{1}{1 + \exp\left( -\theta_0 - \tilde{X}_i^t \theta \right)},$$
    where $\tilde{X}^t$ is the cascade status matrix, $\tilde{X}_i^t$ is the $i$th cascade status at time $t$, and $\theta$ is the weight parameter, which is obtained by minimizing the following loss:
    $$F(\theta) = -\log L(\theta) + \frac{\beta}{4} \sum_{i,j} \left( \theta_i \tilde{X}_i^{T} \tilde{X}_j \theta_j \right)^2 + \gamma \lVert \theta \rVert,$$
    where $L(\theta)$ is the likelihood function of $h(\tilde{X}_i^t)$.

  4. MLR12 (Multivariate linear regression): The model can be defined as $x_f = \Theta X_{t_e}$, where $\Theta$ is the learning parameter and $X_{t_e}$ is the time series data of the number of infections in the early stage ($t \le t_e$). When the indicator $x_f$ exceeds a predefined threshold, a prediction of a cascade outbreak is made.

  5. LSTMIC13 (Long short-term memory for information cascade): This method uses a recurrent neural network with long short-term memory units to directly learn sequential patterns.

Because cascade burst prediction is a binary classification problem, we use the accuracy, precision, recall, and $f_1$ metrics to evaluate the performances of the proposed method and baselines. Let $T$ denote the set of testing samples. Then,
$$\mathrm{accuracy} = 1 - \frac{\sum_{i \in T} |p_i - y_i|}{\sum_{i \in T} 1}, \quad \mathrm{precision} = \frac{\sum_{i \in T} p_i y_i}{\sum_{i \in T} p_i}, \quad \mathrm{recall} = \frac{\sum_{i \in T} p_i y_i}{\sum_{i \in T} y_i}, \quad f_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$
where $p_i$ and $y_i$ represent the predicted value and true value of the $i$th test data, respectively. When the value is equal to 1, it indicates the occurrence of a cascade outbreak, while a value of 0 denotes a local spread.
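The four metrics translate directly into code for 0/1 labels. The sketch below follows the definitions above; returning zero when a denominator vanishes is a convention chosen here for convenience, not something specified in the text, and the function name is illustrative.

```python
def classification_metrics(pred, true):
    """Accuracy, precision, recall, and f1 for binary (0/1) predictions.

    pred, true : equal-length sequences of 0/1 values, where 1 marks a
                 cascade outbreak and 0 marks a local spread.
    """
    n = len(true)
    true_positive = sum(p * y for p, y in zip(pred, true))
    accuracy = 1.0 - sum(abs(p - y) for p, y in zip(pred, true)) / n
    # Zero denominators are mapped to 0.0 (an assumed convention).
    precision = true_positive / sum(pred) if sum(pred) else 0.0
    recall = true_positive / sum(true) if sum(true) else 0.0
    if precision + recall:
        f1 = 2.0 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


if __name__ == "__main__":
    predicted = [1, 0, 1, 1]
    observed = [1, 0, 0, 1]
    print(classification_metrics(predicted, observed))
```

In the usage example, one of the three predicted outbreaks is a false alarm and no outbreak is missed, so recall is perfect while precision and accuracy are reduced.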

2. Early prediction of cascade outbreak probability and range

Based on the classic SIR propagation model, we apply the GPP-COP method to the networks of BA, ER, and Deezer_Europe and make predictions about the probability and range of cascade outbreaks in the early stages of transmission. Subsequently, we evaluate the performance of our online monitoring method through experiments.

We randomly select a single propagation source in each network for simulation propagation based on the SIR model. As illustrated in Fig. 5, two types of behavior are observed. For the case of Fig. 5(a), the global propagation probability $p(\mathrm{seed}, s)$ of the propagation source fluctuates up and down in the early stage of propagation. When it reaches the burst threshold $1 - \alpha$ (with $\alpha = 0.01$), we make an inference of a cascade burst. The green dotted line in the figure is the outbreak warning line. In addition, we can also infer the explosion range $s$, as shown by the black dashed line in Fig. 5(a), which shows that our predicted explosion range is in close agreement with the simulated value. For the situation in Fig. 5(c), the global propagation probability remains below the outbreak warning line and eventually decays to zero. In this situation, the information ceases to propagate after reaching a small fraction of the network.

FIG. 5.

Early prediction of the cascade outbreak. Panels (a) and (b) show the situations of global outbreak with single-source and multi-source, respectively. Panels (c) and (d) show the situations of local spread with single-source and multi-source, respectively. Panel (f) shows the six propagation stages corresponding to panel (e), and stage five signifies a global explosion of information. Here, we consider a recovery period $m_0 = 10$, along with infection rates per unit time of 0.03, 0.015, and 0.015 for the ER, BA, and Deezer_Europe networks, respectively.

In the same way, we select multiple nodes as propagation sources and obtain similar conclusions. As demonstrated in Figs. 5(b) and 5(d), the multi-source cases reach a global outbreak faster and more easily than the single-source cases. Figure 5(e) illustrates the early spread in the Deezer_Europe network and marks six stages of propagation. Figure 5(f) displays the network diagram of these six stages, in which the nodes that have received information are indicated in red. When the propagation reaches the fifth stage, the global propagation probability exceeds the warning threshold, so we can infer a cascade outbreak.

3. Predicted effects in simulation models

To validate the predictive efficacy of the proposed method for cascade outbreaks, we conduct experiments based on the SIR propagation model and compare the proposed method with the baseline methods. Furthermore, to facilitate its applicability in real-world scenarios, we discuss the experimental conditions and verify the robustness of our method.

We simulate propagation based on the SIR model in the ER, BA, and Deezer_Europe networks, generating 1000 training samples and 100 test samples for each network. Subsequently, we apply the five benchmark methods above and the proposed method to the cascade dataset. To ensure a fair comparison, we use appropriate parameters for the baseline methods. Specifically, we select 500 sensors for the Rand and Degree methods. For the OSLOR method, we choose appropriate $\beta$ and $\gamma$ values under different experimental conditions. For the Rand, Degree, and MLR methods, we use ridge regression with a regularization factor of 0.1 for the final predictions. For the LSTMIC method, we set the batch size to 16 and the number of epochs to 30. The test results are shown in Fig. 6: the proposed method outperforms the other methods in predicting cascade outbreaks across all three experiments. In the binary classification task of cascade outbreaks, the proposed method achieves the highest accuracy as well as good precision and recall, resulting in the highest f1 score among the compared methods. It is noteworthy that the baseline methods are all supervised, whereas our method, which does not rely on historical data, still achieves the best performance in these experiments. This is mainly due to its effective use of the structural data and the cascade propagation probability of the network.

FIG. 6.

The predictive performance of different methods with different early stage time.

In addition, the GPP-COP method can estimate the outbreak range. To verify the accuracy of the predictions, we conduct multiple SIR simulation experiments on the Facebook_Large network with different propagation probabilities. As shown in Fig. 7(a), we use the outbreak ranges of 100 experiments as simulation values at each probability. The calculated values of our method are in close agreement with the mean of the simulated values. Furthermore, Fig. 7(a) indicates only a small amount of fluctuation, as evidenced by the band of three standard deviations of the simulated values. As such, our method can effectively predict the outbreak range.

FIG. 7.

Experiments and method improvements in the presence of error $\Delta\beta$. (a) The relationship between the cascade propagation probability and the outbreak range. (b) Prediction results in the case of $\Delta\beta = 0.02$, $et = 10$. (c) Comparison of prediction effects before and after training $p$. (d)–(f) The prediction effect under different $\Delta\beta$.

However, in many practical scenarios, the cascade propagation probability and the network structure cannot be obtained directly and must be inferred from the system operating mechanism or historical data. This kind of inference has been widely studied in previous work. Here, we adopt a rough estimation method and assume that the cascade propagation probabilities in the network are all the same. Then, from the $\tilde{s}$ in the historical data, we can estimate the cascade propagation probability $\hat{\beta} = g^{-1}(\tilde{s})$ [see Fig. 7(a)]. This estimation introduces a certain error $\Delta\beta = \hat{\beta} - \beta$, so we test our method under different $\Delta\beta$. Figures 7(b)–7(d) demonstrate that, as $\Delta\beta$ increases, the precision score decreases while the recall score increases, with the best f1 score achieved when $\Delta\beta = 0$. To minimize the impact of $\Delta\beta$, we obtain the optimal burst threshold $p$ (equal to $1 - \alpha$) through training. Figure 7(b) illustrates the prediction results when $\Delta\beta = 0.02$, $et = 10$, and we obtain $p = 0.26$ through training. As shown in Fig. 7(c), choosing the burst threshold after training greatly improves the prediction, and the result is close to the case of $\Delta\beta = 0$. In fact, when the estimation of the network structure also has errors, we can improve the predictive ability of the method by selecting the appropriate cascade propagation probability and burst threshold through the training set. Our results demonstrate that the proposed method exhibits strong robustness and can be applied to more scenarios.
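One plausible way to train the burst threshold is a simple grid search that keeps the candidate with the highest f1 score on the training set. The paper does not state its training procedure, so this is an assumption; the function name `best_threshold`, the candidate grid, and the toy scores are all hypothetical.

```python
def best_threshold(scores, labels, candidates):
    """Pick the threshold whose 0/1 predictions maximize f1 on the given data."""
    best_p, best_f1 = None, -1.0
    for p in candidates:
        pred = [1 if s >= p else 0 for s in scores]
        tp = sum(a * b for a, b in zip(pred, labels))
        if tp == 0:
            f1 = 0.0
        else:
            precision = tp / sum(pred)
            recall = tp / sum(labels)
            f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:  # strict '>' keeps the smallest best threshold
            best_p, best_f1 = p, f1
    return best_p, best_f1


# Hypothetical global propagation probabilities and outbreak labels.
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.90]
labels = [0, 0, 0, 1, 1, 1, 1]
grid = [k / 20 for k in range(1, 20)]  # 0.05, 0.10, ..., 0.95
p_star, f1_star = best_threshold(scores, labels, grid)
print(p_star, f1_star)
```

A trained threshold well below $1 - \alpha$, such as the reported 0.26, would be consistent with a misestimated propagation probability that systematically lowers the computed global propagation probabilities.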

4. Experimental effect in Sina Weibo dataset

This dataset collects all messages on June 1, 2016 and records retweets over the next 24 h. For each message, the data record the publisher’s id and publishing time, and the forwarder’s id and forwarding time. We use $s(24\,\mathrm{h})$ to approximate the final popularity $s$. Our task is to predict the eventual occurrence of a cascade outbreak using early data.

Before conducting experiments, we pre-process the cascade propagation data. First, since this dataset has no network structure, we take the message forwarding network in the historical data as the underlying network, which contains a total of 6 738 040 nodes and 15 293 817 edges. Then, we take the cascades whose forwarding volume is in the top 5% as cases of cascade outbreak and the rest as cases of local spread. Finally, to evaluate the proposed method, we randomly select 2500 cascade outbreak samples and 2500 local spread samples for experiments, with 4000 used for the training set and 1000 for the test set.
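The labeling and sampling steps can be sketched as follows. The helper names `label_top_fraction` and `balanced_split`, the tie handling, and the small synthetic volumes are assumptions made for illustration; the paper's own pre-processing code is not given.

```python
import random


def label_top_fraction(volumes, top_frac=0.05):
    """Label cascades in the top `top_frac` by forwarding volume as outbreaks (1)."""
    k = max(1, round(len(volumes) * top_frac))
    cutoff = sorted(volumes, reverse=True)[k - 1]
    # Ties at the cutoff are all labeled 1, so the share can slightly exceed top_frac.
    return [1 if v >= cutoff else 0 for v in volumes]


def balanced_split(labels, per_class, n_train, seed=0):
    """Sample `per_class` indices from each class, then split into train and test."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    chosen = rng.sample(pos, per_class) + rng.sample(neg, per_class)
    rng.shuffle(chosen)
    return chosen[:n_train], chosen[n_train:]


# Synthetic stand-in: 200 cascades with distinct forwarding volumes.
volumes = list(range(1, 201))
labels = label_top_fraction(volumes)
train_idx, test_idx = balanced_split(labels, per_class=10, n_train=16)
print(sum(labels), len(train_idx), len(test_idx))
```

With the sizes stated in the text, the corresponding call would use `per_class=2500` and `n_train=4000`.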

In addition, in the previous simulation experiments, the propagation dynamics of different cascades were identical, and the cascade outbreak was a probabilistic event dependent on the topological location. However, in the real Weibo data, different information contents lead to great differences in dissemination probability, which can be roughly reflected by the number of retweets in the early stage. Therefore, we combine the GPP-COP method with the early infected number to predict the cascade outbreak. Specifically, we take the early infected number $n(et)$ and the global propagation probability $p_s(et)$ as features and use a simple linear regression model to predict cascade outbreaks. Then, we learn the optimal cascade propagation probability $\beta$ and the weights of the linear regression model from the training set. Finally, we obtain a new spatiotemporal prediction model, which is still denoted by GPP-COP.
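A two-feature linear regression of this kind can be fit with ordinary least squares. The sketch below solves the normal equations in plain Python; the function name `fit_linear`, the feature values, and the generating weights are invented for illustration and say nothing about the weights learned on the Weibo data.

```python
def fit_linear(features, targets):
    """Ordinary least squares with intercept, solved via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    d = len(rows[0])
    # Augmented system [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(d)]
         + [sum(r[i] * y for r, y in zip(rows, targets))] for i in range(d)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(d):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][d] / a[i][i] for i in range(d)]


# Hypothetical features: (early infected number, global propagation probability).
features = [(1, 0.1), (2, 0.9), (3, 0.2), (4, 0.8), (5, 0.5)]
targets = [0.1 + 0.05 * n + 0.6 * p for n, p in features]
weights = fit_linear(features, targets)  # [intercept, w_n, w_p]
print(weights)
```

A cascade would then be classified as an outbreak when the fitted score exceeds a chosen cutoff, for example 0.5; that cutoff is also an assumption of this sketch.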

We conduct experiments on the GPP-COP method as well as the baseline methods and report the prediction results in Table I. Owing to its $O(n^2)$ complexity, the OSLOR method is not suitable for large-scale data and is therefore not considered. For the Rand and Degree methods, we select 10 000 sensors. However, these two methods do not perform well because of the large number of nodes in the network and the limited size of the training set. For the MLR and LSTMIC methods, we count the number of retweets every 5 min to obtain early time series data. These two methods fully consider the early time series data and exhibit advantages in the prediction of cascade outbreaks in social networks. On this basis, our method also considers the topological positions of early infected users within the network, so it uses spatiotemporal information and demonstrates superior performance.

TABLE I.

The prediction effect of different methods. Here, boldface values indicate the best f1-score.

| Method | Precision, $et$ = 10 min (%) | Recall, $et$ = 10 min (%) | f1, $et$ = 10 min (%) | Precision, $et$ = 20 min (%) | Recall, $et$ = 20 min (%) | f1, $et$ = 20 min (%) | Precision, $et$ = 60 min (%) | Recall, $et$ = 60 min (%) | f1, $et$ = 60 min (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Rand | 78.6 | 2.2 | 4.3 | 85.7 | 11.2 |  | 91.8 | 13.4 | 23.4 |
| Degree | 82.6 | 29.6 | 43.6 | 85.4 | 29.2 | 43.5 | 87.9 | 40.8 | 55.7 |
| MLR | 87.2 | 68.3 | 76.6 | 88.6 | 73.2 | 80.2 | 92.6 | 79.8 | 85.7 |
| LSTMIC | 83.8 | 72.6 | 77.8 | 86.5 | 75.8 | 80.8 | 90.9 | 80.6 | 85.4 |
| GPP-COP | 84.1 | 75.2 | **79.9** | 87.9 | 76.3 | **81.7** | 92.2 | 80.4 | **85.9** |

The experiments show that our method achieves the best f1 score at all three early-stage times. Although the network structure and propagation probability in the Weibo dataset are only roughly estimated from historical data, the proposed method still outperforms the baseline methods. Therefore, our method holds the potential to extend to more real-world scenarios.

Although our GPP algorithm requires the local structure of the network to be tree-like, the introduction of the discount coefficient $\lambda$ broadens the application scope of the algorithm. We conduct experiments in a variety of network scenarios, including the ER random network, the BA random network, and three complex networks from real-world scenarios. The experimental results demonstrate the efficacy of the GPP algorithm for accurately estimating the global propagation probability of any activated node set. Furthermore, our method does not require each edge to have the same propagation probability, which makes it applicable to heterogeneous networks.

Our GPP-COP method employs the GPP algorithm as a component to monitor the probability of cascade outbreaks online and estimate the final burst range. Experimental results demonstrate that our method outperforms the baseline methods significantly. This is primarily because the proposed method makes full use of the network structure and propagation dynamics. In addition, we find that a few specific nodes can propagate information to the network globally, and this is closely related to the combination of topological positions where these nodes are located. Our method can quantify this combination of topological positions into probability values, allowing predictions about cascade outbreaks from a network-global perspective.

Our method has certain limitations that warrant further research. First, it relies on pre-existing knowledge of cascade propagation probabilities and network structures. Thus, we need to focus on integrating advanced strategies for estimating the network structure and propagation probability. Second, the community structure and multi-layer attributes of networks are also key factors that influence the effectiveness of the proposed method. This issue was partially addressed through a strategy in an earlier study (Ref. 36). Future work could integrate these structured attributes into the proposed framework to enhance the accuracy of cascade outbreak predictions. Third, in recent years, higher-order interactions within complex networks have been increasingly identified across a multitude of domains, including physics, biology, and sociology. Extending our methodology to accommodate the analysis of higher-order networks represents a significant and promising avenue of research. Fourth, our experimental findings indicate that optimal performance in cascade outbreak prediction can be achieved by fully leveraging temporal and spatial information. Consequently, future efforts should focus on the enhanced integration of these two critical dimensions. Finally, our study employs simplified propagation models such as the SIR model, and thus there is a need to incorporate more sophisticated propagation patterns and prior knowledge into domain-specific applications. Looking ahead, efforts will be directed toward integrating this approach into specific fields with the aim of facilitating early prediction of large-scale information dissemination within real-world systems.

In summary, this paper focuses on the prediction problem of cascade outbreaks. We first propose the GPP algorithm, which can calculate the global propagation probability of nodes in the network. On this basis, we propose the GPP-COP method. In simulation experiments of the SIR model, our method can quantitatively predict the probability and range of cascade outbreaks online at an early stage of propagation, and its effect is significantly better than that of the baseline methods. Furthermore, we briefly analyze the robustness of the proposed method and apply it to a real Weibo dataset. Experimental results show that our method fully utilizes the structural and dynamic features of the network to achieve better predictions.

The authors have no conflicts to disclose.

Xin Li: Writing – original draft (equal). Huichun Li: Data curation (equal); Writing – review & editing (equal). Xue Zhang: Writing – review & editing (equal). Chengli Zhao: Supervision (equal). Xiaojun Duan: Validation (equal).

The data that support the findings of this study are available from the corresponding author upon reasonable request.

In this section, we introduce the original first-order and second-order iterative algorithms (Ref. 36) and perform error analysis. We denote the propagation probability from node $u$ to $v$ as $\beta_{uv}$ and the global propagation probability of node $u$ as $p(u, s)$, that is, the probability that node $u$ belongs to the giant component of the network. As shown in Fig. 1(b) in the main text, considering the first-order case, the probability that node $u$ belongs to the giant component is equal to the probability that node $u$ is connected to the giant component through at least one of its neighbors $v$, so we obtain
$$
p(u, s) = 1 - \prod_{v \in \partial u} \left( 1 - \beta_{uv}\, p(v, s) \right), \tag{A1}
$$
where $\partial u$ represents the set of neighbors of node $u$. Therefore, $N$ equations are obtained, which can be solved efficiently by iterative calculation. Considering the second-order case, if the node $u$ is connected to the giant component through $v_1$, then the node $v_1$ is connected to the giant component through at least one of its neighbors except $u$, so we have
$$
p(u, s) = 1 - \prod_{v \in \partial u} \left[ 1 - \beta_{uv} \left( 1 - \prod_{w \in \partial v \setminus u} \left( 1 - \beta_{vw}\, p(w, s) \right) \right) \right]. \tag{A2}
$$
Similarly, it can be solved by iteration. However, under certain circumstances, this algorithm still has a non-negligible error. As shown in Fig. 1(b), we record the probability that node $v$ is still connected to the giant component after removing node $u$ as $p_{u \to v}$. Equation (A1) makes $p_{u \to v} \approx p(v, s)$, so the part where the node $v$ connects to the giant component through $u$ is ignored. In order to reduce the error, Eq. (A2) makes
$$
p_{u \to v} = 1 - \prod_{w \in \partial v \setminus u} \left( 1 - \beta_{vw}\, p(w, s) \right). \tag{A3}
$$
Through simple derivation, the errors of Eqs. (A1) and (A2) are
$$
e_1 = \beta \sum_{v \in \partial u} \left| p(v, s) - p_{u \to v} \right| + \beta^2 \sum_{s, t \in \partial u,\, s \neq t} \left| p(s, s)\, p(t, s) - p_{u \to s}\, p_{u \to t} \right| + o\!\left( \beta^2 \left| p(s, s)\, p(t, s) - p_{u \to s}\, p_{u \to t} \right| \right), \tag{A4}
$$
$$
e_2 = \beta^2 \sum_{v \in \partial u} \sum_{w \in \partial v} \left| p(w, s) - p_{v \to w} \right| + o\!\left( \beta^2 \left| p(w, s) - p_{v \to w} \right| \right). \tag{A5}
$$
Equations (A4) and (A5) show that the second-order method will eliminate the first-order error. Therefore, when $\beta$ is small and the average degree is high, that is, when $p(w, s) \approx p_{v \to w}$, Eq. (A2) has high calculation accuracy. However, when the network does not satisfy the local tree structure or the average degree is small, the calculation of Eq. (A2) still has a non-negligible error.
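To make the first-order scheme concrete, the following sketch iterates Eq. (A1) on a small graph, assuming a uniform propagation probability on every edge. The function name `first_order_gpp`, the starting point of all ones, and the choice of the complete graph on four nodes are assumptions for illustration; that graph is not tree-like, so the result only demonstrates the fixed-point iteration, not its accuracy.

```python
def first_order_gpp(adj, beta, iters=500, tol=1e-12):
    """Iterate p(u) = 1 - prod_{v in N(u)} (1 - beta * p(v)) to a fixed point."""
    p = {u: 1.0 for u in adj}  # start at 1 to avoid the trivial solution p = 0
    for _ in range(iters):
        new = {}
        for u in adj:
            prod = 1.0
            for v in adj[u]:
                prod *= 1 - beta * p[v]
            new[u] = 1 - prod
        change = max(abs(new[u] - p[u]) for u in adj)
        p = new
        if change < tol:
            break
    return p


# Complete graph on 4 nodes (every node has 3 neighbors), uniform beta.
k4 = {u: [v for v in range(4) if v != u] for u in range(4)}
p_above = first_order_gpp(k4, beta=0.5)
p_below = first_order_gpp(k4, beta=0.2)
print(p_above[0], p_below[0])
```

On this 3-regular example the symmetric fixed point satisfies $p = 1 - (1 - \beta p)^3$, which for $\beta = 0.5$ gives $p = 3 - \sqrt{5} \approx 0.764$, while for $\beta = 0.2$ the iteration decays toward zero.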
In the context of a network whose scale tends toward infinity and conforms to a tree-like structure, Eqs. (1) and (2) in the main text exhibit equivalence. Without loss of generality, we consider the convergence properties of Eq. (1) in the main text. By performing a straightforward manipulation and applying the natural logarithm to Eq. (1), the operation is transformed from multiplication to addition, as expressed by
$$
\ln \left( 1 - p_{v_0 \to u} \right) = \sum_{v \in \partial u \setminus v_0} \ln \left( 1 - \beta_{uv}\, p_{u \to v} \right). \tag{B1}
$$
Let $w_{v_0 \to u} = 1 - p_{v_0 \to u}$, so that $p_{v_0 \to u} = 1 - w_{v_0 \to u}$. Subsequently, in conjunction with Eq. (B1), we have
$$
w_{v_0 \to u} = \exp \left[ \sum_{v \in \partial u \setminus v_0} \ln \left( 1 - \beta_{uv} + \beta_{uv}\, w_{u \to v} \right) \right], \tag{B2}
$$
which is equivalent to Eq. (1) in the main text, and $w_{v_0 \to u} \in (0, 1]$. In a network, the $2M$ directed edges are sequentially numbered as $e_1, e_2, \ldots, e_{2M}$, and the corresponding $2M$ values $\{ w_{u \to v} \mid e_{u \to v} \in \{ e_1, e_2, \ldots, e_{2M} \} \}$ form the vector $\mathbf{w} = (w_1, w_2, \ldots, w_{2M})^{\top}$. Similarly, the corresponding $2M$ transmission probabilities are denoted as a vector $\mathbf{b} = (\beta_1, \beta_2, \ldots, \beta_{2M})^{\top}$. Then, we can transform the set of the above $2M$ equations into matrix form, given by
$$
\mathbf{w} = \exp \left[ B \ln \left( \mathbf{b} \odot \mathbf{w} + \mathbf{1} - \mathbf{b} \right) \right], \tag{B3}
$$
where $\odot$ denotes the element-wise multiplication, $\exp(\cdot)$ represents the element-wise exponential operation, $\ln(\cdot)$ denotes the element-wise natural logarithm operation, $\mathbf{1}$ is a $2M$-dimensional column vector with all elements equal to 1, and $B$ is known as the Hashimoto or non-backtracking matrix of dimension $2M \times 2M$ with element $B_{k \to l,\, i \to j} = \delta_{li} (1 - \delta_{kj})$ ($\delta$ is the Kronecker delta function). In fact, during the iterative solving process, this procedure can be regarded as the following discrete nonlinear dynamics:
$$
\mathbf{w}_{t+1} = \exp \left[ B \ln \left( \mathbf{b} \odot \mathbf{w}_t + \mathbf{1} - \mathbf{b} \right) \right]. \tag{B4}
$$
Subsequently, we can compute the Jacobian matrix of the dynamics, given by
$$
J(\mathbf{w}) = \mathrm{diag} \left\{ \exp \left[ B \ln \left( \mathbf{b} \odot \mathbf{w} + \mathbf{1} - \mathbf{b} \right) \right] \right\} B \, \mathrm{diag} \left( \mathbf{b} \oslash \left( \mathbf{b} \odot \mathbf{w} + \mathbf{1} - \mathbf{b} \right) \right), \tag{B5}
$$
where $\oslash$ denotes element-wise division and $\mathrm{diag}(\mathbf{x})$ denotes a diagonal matrix with the entries of vector $\mathbf{x}$ on its diagonal. To ensure convergence of Eq. (B2), it is sufficient that the dominant eigenvalue $\lambda_{\max}$ of the Jacobian matrix $J(\mathbf{w}^{*})$ is less than 1, where $\mathbf{w}^{*}$ is the fixed point.

For convenience in the further analysis, we assume that the propagation probability within the network is uniformly $\beta \in (0, 1)$, that is, $\mathbf{b} = (\beta, \beta, \ldots, \beta)^{\top}$. Subsequently, we conduct some necessary analysis on $J(\mathbf{w})$. When $\beta$ falls below the percolation threshold, the network does not have a giant component, and all elements of $\mathbf{w}$ are equal to 1. Consequently, the Jacobian matrix can be expressed as $J(\mathbf{1}) = \beta B$. It follows that Eq. (B2) exhibits stable convergence to $\mathbf{1}$ when $\beta$ is less than the reciprocal of the principal eigenvalue of $B$. Otherwise, $\mathbf{w} = \mathbf{1}$ represents an unstable fixed point, where minor perturbations will cause the system (B4) to converge toward an alternate new fixed point.

In the presence of a giant component, the rigorous proof of the convergence of Eq. (B2) becomes considerably complex. Herein, we offer a brief analysis. Considering that the maximum out-degree of nodes in the network is a finite value $D_o$, based on Eq. (B2), we can deduce that there exists a lower bound $w_{\inf} > 0$ for all probabilities $w_{v_0 \to u}$, i.e.,
$$
w_{v_0 \to u} = \exp \left[ \sum_{v \in \partial u \setminus v_0} \ln \left( 1 - \beta + \beta\, w_{u \to v} \right) \right] \geq \exp \left[ D_o \ln \left( 1 - \beta + \beta \times 0 \right) \right] = w_{\inf}. \tag{B6}
$$
Let the initial value be denoted as $\mathbf{w}_0 = (w_{0,1}, w_{0,2}, \ldots, w_{0,2M}) = (w_{\inf}, w_{\inf}, \ldots, w_{\inf})$. In the following, we first demonstrate that each element within $\mathbf{w}$ exhibits a monotonically increasing behavior. According to Eq. (B6), it can be deduced that for any $i \in \{1, 2, \ldots, 2M\}$, we have $w_{0,i} \leq w_{1,i}$. According to Eqs. (B4) and (B6), it can be observed that for any $i, j \in \{1, 2, \ldots, 2M\}$, $w_{t+1,i}$ is monotonically increasing with respect to $w_{t,j}$. Therefore, for any $i \in \{1, 2, \ldots, 2M\}$, it follows that $w_{t+1,i} > w_{t,i}$. This indicates that each variable within $\mathbf{w}$ exhibits monotonic growth. Given that each variable within $\mathbf{w}$ is bounded above by 1, it follows from the Monotone Convergence Theorem that $\mathbf{w}$ is convergent. We denote this converging value as $\mathbf{w}^{*}$. Furthermore, the fixed point $\mathbf{1}$ is unstable and contradicts the assumption of a giant component within the network; therefore, $\mathbf{w}^{*}$ is a fixed point distinct from $\mathbf{1}$. The aforementioned convergence property remains valid across different edge propagation probabilities.
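The monotone convergence argument can be checked numerically on a small example. The sketch below iterates the message form of Eq. (B2) edge by edge, which corresponds to the dynamics in Eq. (B4) without building the matrix $B$ explicitly, assuming a uniform $\beta$ and starting every message at $w_{\inf} = (1 - \beta)^{D_o}$. The function name `iterate_messages` and the complete graph on four nodes are illustrative assumptions.

```python
def iterate_messages(adj, beta, steps=100):
    """Iterate w_{v0->u} = prod_{v in N(u) minus v0} (1 - beta + beta * w_{u->v})."""
    d_max = max(len(nbrs) for nbrs in adj.values())
    w_inf = (1 - beta) ** d_max  # lower bound used as the starting value
    # One message per directed edge, keyed as (v0, u) for w_{v0->u}.
    w = {(v0, u): w_inf for u in adj for v0 in adj[u]}
    history = [dict(w)]
    for _ in range(steps):
        new = {}
        for (v0, u) in w:
            prod = 1.0
            for v in adj[u]:
                if v != v0:  # non-backtracking: skip the edge we came from
                    prod *= 1 - beta + beta * w[(u, v)]
            new[(v0, u)] = prod
        w = new
        history.append(dict(w))
    return w, history


# Complete graph on 4 nodes, beta above the threshold 1 / (degree - 1) = 0.5.
k4 = {u: [v for v in range(4) if v != u] for u in range(4)}
w_star, history = iterate_messages(k4, beta=0.8)
print(w_star[(0, 1)])
```

For this 3-regular example each message obeys $w = (0.2 + 0.8 w)^2$, whose fixed points are $w = 1$ and $w = 1/16$; starting from $w_{\inf} = 0.2^3 = 0.008$, the iteration increases toward $1/16$, the fixed point distinct from $\mathbf{1}$.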

In order to reduce the errors caused by inner edges, we introduce the discount coefficient λ. The specific calculation process is as follows.

We use $T_{uv}$ to denote the event that node $u$ connects to the giant component through the first-order neighbor $v$, and $T_{uvw}$ to denote the event that node $u$ connects to the giant component through the first-order neighbor $v$ and the second-order neighbor $w$. Conversely, $F$ indicates that the event does not occur. In this setting, the events $F_{uvw}$ in the networks of Fig. 1(c) are not approximately independent of each other, and direct calculation with Eq. (2) will introduce non-negligible errors. Below we discuss the possible errors in these situations and their improvement methods.

Case 1: This situation is the main source of error, which we analyze in Eq. (3) of the main text, and the discount coefficient is recorded as $\lambda_1(v_1, v_2) = (1 - \beta_{u v_1}\, p_{u, v_2 \to v_1}) / [(1 - \beta_{u v_1}\, p_{u \to v_1})\, p_{v_2 \to v_1}]$, where $p_{u, v_2 \to v_1}$.

Case 2: As shown in the subgraph (1) of Fig. 1(c), when the edge $e_{u v_2}$ exists, under the condition of $F_{u v_1 v_2}$, the probability of $T_{v_2 w_{21}}$ is
$$
P(T_{v_2 w_{21}} \mid F_{u v_1 v_2}) = \frac{1 - \beta_{u v_1} \beta_{v_1 v_2}}{1 - \beta_{u v_1} \beta_{v_1 v_2}\, p_{u, v_1 \to v_2}}\, \beta_{v_2 w_{21}}\, p_{v_2 \to w_{21}}. \tag{C1}
$$
The discount factor $\lambda_2(v_1, v_2) = (1 - \beta_{u v_1} \beta_{v_1 v_2}) / [1 - \beta_{u v_1} \beta_{v_1 v_2}\, p_{u, v_1 \to v_2}]$ is introduced here. Its influence is smaller than that of $\lambda_1$.
Case 3: As shown in the subgraph (2) of Fig. 1(c), the inner edge connects the first-order and second-order neighbors of the propagation source. When the edge $e_{u v_2}$ exists, in the case of $F_{u v_1 w_{12}}$, the conditional probability of $T_{v_2 w_{12}}$ is
$$
P(T_{v_2 w_{12}} \mid F_{u v_1 w_{12}}) = \frac{1 - \beta_{u v_1} + \beta_{u v_1} (1 - \beta_{v_1 w_{12}})\, p_{v_1, v_2 \to w_{12}}\, p_{v_2 \to w_{12}}}{1 - \beta_{u v_1} \beta_{v_1 w_{12}}\, p_{v_1 \to w_{12}}} \times \beta_{v_2 w_{12}}\, p_{v_2 \to w_{12}}. \tag{C2}
$$
Therefore, the discount factor $\lambda_3(v_1, v_2, w_{12}) = [1 - \beta_{u v_1} + \beta_{u v_1} (1 - \beta_{v_1 w_{12}})\, p_{v_1, v_2 \to w_{12}}\, p_{v_2 \to w_{12}}] / (1 - \beta_{u v_1} \beta_{v_1 w_{12}}\, p_{v_1 \to w_{12}})$ is introduced here; its influence is between that of $\lambda_1$ and $\lambda_2$.
Case 4: As shown in the subgraph (3) of Fig. 1(c), when the edge $e_{u v_2}$ exists, in the case of $F_{u v_1 v_2}$ and $F_{u v_1 w_{12}}$, the probability of $T_{v_2 w_{12}}$ is
$$
P(T_{v_2 w_{12}} \mid F_{u v_1 w_{12}}) \approx \lambda_2(v_1, v_2) \times \lambda_3(v_1, w_{12})\, \beta_{v_2, w_{12}}\, p_{v_2 \to w_{12}}. \tag{C3}
$$
In order to simplify the calculation, we consider case 4 as a combination of case 2 and case 3 and use $\lambda_2(v_1, v_2) \times \lambda_3(v_1, w_{12})$ to approximate the discount factor.

Case 5: As shown in the subgraph (4) of Fig. 1(c), when the inner edges connect two second-order neighbors of the propagation source, the impact at this time is smaller than the second and third cases, and much smaller than the impact of the first case, so we ignore its impact.

In summary, there are mainly three situations where the algorithm needs to be changed, as follows. As shown in the subgraph (5) of Fig. 1(c), factor 1: when $e_{uv}$ exists and $v$ is connected to $u$’s first-order neighbor $w$, then $P(T_{vw} \mid F_{uwv}) = \lambda_1(w, v)\, P(T_{vw})$. Factor 2: when $e_{uv}$ exists and $v$ is connected to $u$’s first-order neighbor $a_i$, then $P(T_{vw} \mid F_{u a_i v}) = \lambda_2(a_i, v)\, P(T_{vw})$. Factor 3: when $e_{uv}$ exists and $w$ is connected to $u$’s first-order neighbor $b_i$, then $P(T_{vw} \mid F_{u b_i w}) = \lambda_3(b_i, v, w)\, P(T_{vw})$. For more complex situations, the above three factors of introducing discount coefficients may occur at the same time, and the second and third factors may occur multiple times. In that case, the local structure of the network is far from meeting the requirements of the local tree. In order to facilitate calculations and minimize errors, we consider these effects to be independent of each other, so the final discount coefficient is
$$
\lambda = \lambda_1(w, v) \prod_{i=1}^{k_1} \lambda_2(a_i, v) \prod_{i=1}^{k_2} \lambda_3(b_i, v, w), \tag{C4}
$$
where $k_1$ and $k_2$ are the number of occurrences of factor 2 and factor 3, respectively. In fact, among the above three factors, the first factor has the greatest impact on the calculation accuracy of the original algorithm, and the discount coefficient $\lambda_1$ needs to be added to the iterative algorithm. The latter two factors have a certain impact on the calculation accuracy of the original algorithm, but less than the first. In order to further reduce the calculation error, the total discount coefficient $\lambda$ can also be added to the algorithm to obtain a new iterative equation.

In this section, we give the pseudo code form of the GPP algorithm and the GPP-COP method. The meanings of the symbols in the algorithms below are consistent with the main text.

Algorithm 1 introduces in detail the process of GPP algorithm to calculate the global propagation probability of all nodes in the network. In order to improve operational efficiency, the discount coefficient λ in the sixth line can also be calculated by Eq. (3).

Algorithm 2 introduces in detail the process of the GPP-COP method to predict cascade outbreaks. For any early time $t$, we can calculate its global propagation probability $p_S$ and outbreak range $s$ by the proposed method. Then, combining the three inference ideas in Sec. III B of the main text, we can make inferences about cascade outbreaks in the early stages of propagation for different scenarios.

Algorithm 1. The execution process of the GPP algorithm.

Algorithm 2. The execution process of the GPP-COP algorithm.


1.
A.
Friggeri
,
L.
Adamic
,
D.
Eckles
, and
J.
Cheng
, “Rumor cascades,” in Proceedings of the International AAAI Conference on Web and Social Media (AAAI Press, 2014), Vol. 8, pp. 101–110.
2.
D.
Kempe
,
J.
Kleinberg
, and
É.
Tardos
, “Maximizing the spread of influence through a social network,” in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2003), pp. 137–146.
3.
S.
Goel
,
A.
Anderson
,
J.
Hofman
, and
D. J.
Watts
, “
The structural virality of online diffusion
,”
Manage. Sci.
62
,
180
196
(
2016
).
4.
F.
Zhou
,
X.
Xu
,
G.
Trajcevski
, and
K.
Zhang
, “
A survey of information cascade analysis: Models, predictions, and recent advances
,”
ACM Comput. Surveys (CSUR)
54
,
1
36
(
2021
).
5.
B.
Golub
and
M. O.
Jackson
, “
Using selection bias to explain the observed structure of internet diffusions
,”
Proc. Natl. Acad. Sci.
107
,
10833
10836
(
2010
).
6.
L.
Zhao
,
J.
Chen
,
F.
Chen
,
F.
Jin
,
W.
Wang
,
C.-T.
Lu
, and
N.
Ramakrishnan
, “
Online flu epidemiological deep modeling on disease contact network
,”
GeoInformatica
24
,
443
475
(
2020
).
7.
S.
Gupta
,
R.
Kambli
,
S.
Wagh
, and
F.
Kazi
, “
Support-vector-machine-based proactive cascade prediction in smart grid using probabilistic framework
,”
IEEE Trans. Indus. Electron.
62
,
2478
2486
(
2014
).
8.
C.
Ma
,
Z.
Yan
, and
C. W.
Chen
, “Larm: A lifetime aware regression model for predicting YouTube video popularity,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (ACM, 2017), pp. 467–476.
9.
J.
Ugander
,
L.
Backstrom
,
C.
Marlow
, and
J.
Kleinberg
, “
Structural diversity in social contagion
,”
Proc. Natl. Acad. Sci.
109
,
5962
5966
(
2012
).
10.
L.
Weng
,
F.
Menczer
, and
Y.-Y.
Ahn
, “
Virality prediction and community structure in social networks
,”
Sci. Rep.
3
,
2522
(
2013
).
11.
P.
Cui
,
S.
Jin
,
L.
Yu
,
F.
Wang
,
W.
Zhu
, and
S.
Yang
, “Cascading outbreak prediction in networks: a data-driven approach,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2013), pp. 901–909.
12.
H.
Pinto
,
J. M.
Almeida
, and
M. A.
Gonçalves
, “Using early view patterns to predict the popularity of youtube videos,” in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (ACM, 2013), pp. 365–374.
13.
C.
Gou
,
H.
Shen
,
P.
Du
,
D.
Wu
,
Y.
Liu
, and
X.
Cheng
, “
Learning sequential features for cascade outbreak prediction
,”
Knowled. Inform. Syst.
57
,
721
739
(
2018
).
14.
Q.
Cao
,
H.
Shen
,
J.
Gao
,
B.
Wei
, and
X.
Cheng
, “Popularity prediction on social platforms with coupled graph neural networks,” in Proceedings of the 13th International Conference on Web Search and Data Mining (ACM, 2019).
15.
X.
Chen
,
F.
Zhang
,
F.
Zhou
, and
M. M.
Bonsangue
, “
Multi-scale graph capsule with influence attention for information cascades prediction
,”
Int. J. Intell. Syst.
37
,
2584
2611
(
2021
).
16.
Y.
Wang
,
X.
Wang
,
R.
Michalski
,
Y.
Ran
, and
T.
Jia
, “
Casseqgcn: Combining network structure and temporal sequence to predict information cascades
,”
Expert Syst. Appl.
206
,
117693
(
2021
).
17.
S. V.
Buldyrev
,
R.
Parshani
,
G.
Paul
,
H. E.
Stanley
, and
S.
Havlin
, “
Catastrophic cascade of failures in interdependent networks
,”
Nature
464
,
1025
1028
(
2010
).
18.
W. O.
Kermack
and
A. G.
McKendrick
, “
A contribution to the mathematical theory of epidemics
,”
Proc. R. Soc. London, Ser. A
115
,
700
721
(
1927
).
19.
S.
He
,
Y.
Peng
, and
K.
Sun
, “
Seir modeling of the covid-19 and its dynamics
,”
Nonlinear Dyn.
101
,
1667
1680
(
2020
).
20.
R.
Pastor-Satorras
and
A.
Vespignani
, “
Epidemic spreading in scale-free networks
,”
Phys. Rev. Lett.
86
,
3200
(
2001
).
21.
M. E. J.
Newman
, “
Spread of epidemic disease on networks
,”
Phys. Rev. E
66
,
016128
(
2002
).
22.
M.
Richardson
and
P.
Domingos
, “Mining knowledge-sharing sites for viral marketing,” in Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2002), pp. 61–70.
23.
B.
Karrer
,
M. E.
Newman
, and
L.
Zdeborová
, “
Percolation on sparse networks
,”
Phys. Rev. Lett.
113
,
208702
(
2014
).
24.
J.
Berger
and
K. L.
Milkman
, “
What makes online content viral?
,”
J. Market. Res.
49
,
192
205
(
2012
).
25.
J.
Leskovec
,
L.
Backstrom
, and
J.
Kleinberg
, “Meme-tracking and the dynamics of the news cycle,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2009), pp. 497–506.
26. G. Szabo and B. A. Huberman, “Predicting the popularity of online content,” Commun. ACM 53, 80–88 (2010).
27. S. Aral and D. Walker, “Creating social contagion through viral product design: A randomized trial of peer influence in networks,” Manage. Sci. 57, 1623–1639 (2011).
28. D. M. Boyd and N. B. Ellison, “Social network sites: Definition, history, and scholarship,” J. Comput.-Med. Commun. 13, 210–230 (2007).
29. J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance, “Cost-effective outbreak detection in networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2007), pp. 420–429.
30. D. Agarwal, B.-C. Chen, and P. Elango, “Spatio-temporal models for estimating click-through rate,” in Proceedings of the 18th International Conference on World Wide Web (ACM, 2009), pp. 21–30.
31. Q. Zhao, M. A. Erdogdu, H. Y. He, A. Rajaraman, and J. Leskovec, “SEISMIC: A self-exciting point process model for predicting tweet popularity,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2015), pp. 1513–1522.
32. J. Zhang, J. Tang, J. Li, Y. Liu, and C. Xing, “Who influenced you? Predicting retweet via social influence locality,” ACM Trans. Knowledge Discovery Data (TKDD) 9, 1–26 (2015).
33. F. Morone and H. A. Makse, “Influence maximization in complex networks through optimal percolation,” Nature 524, 65–68 (2015).
34. R. Pastor-Satorras, C. Castellano, P. Van Mieghem, and A. Vespignani, “Epidemic processes in complex networks,” Rev. Mod. Phys. 87, 925 (2015).
35. Y. Hu, S. Ji, Y. Jin, L. Feng, H. E. Stanley, and S. Havlin, “Local structure can identify and quantify influential global spreaders in large scale social networks,” Proc. Natl. Acad. Sci. 115, 7468–7472 (2018).
36. X. Li, X. Zhang, C. Zhao, D. Yi, and G. Li, “Identifying highly influential nodes in multilayer networks based on global propagation,” Chaos 30, 061107 (2020).
37. X.-F. Wang, X. Li, and G.-R. Chen, Network Science: an Introduction, Vol. 4 (Higher Education Press, Beijing, 2012), pp. 95–142.
38. J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graph evolution: Densification and shrinking diameters,” ACM Trans. Knowledge Discovery Data (TKDD) 1, 2–es (2007).
39. B. Rozemberczki and R. Sarkar, “Characteristic functions on graphs: Birds of a feather, from statistical descriptors to parametric models,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management (ACM, 2020), pp. 1325–1334.
40. B. Rozemberczki, C. Allen, and R. Sarkar, “Multi-scale attributed node embedding,” J. Complex Netw. 9, cnab014 (2021).
41. Q. Cao, H. Shen, K. Cen, W. Ouyang, and X. Cheng, “DeepHawkes: Bridging the gap between prediction and understanding of information cascades,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (ACM, 2017), pp. 1149–1158.
42. R. Pastor-Satorras and A. Vespignani, “Immunization of complex networks,” Phys. Rev. E 65, 036104 (2002).