We analyze the dataset of confirmed cases of severe acute respiratory syndrome coronavirus 2 (COVID-19) in the Republic of Korea, which contains transmission information on who infected whom as well as temporal information regarding when the infection possibly occurred. We derive time series of mesoscopic transmission networks using the location and age of each individual in the dataset to see how the structure of these networks changes over time in terms of clustering and link prediction. We find that the networks are clustered to a large extent, while those without weak links could be seen as having a tree structure. It is also found that triad-based link predictability using the network structure could be improved when combined with additional information on mobility and age-stratified contact patterns. Abundant triangles in the networks can help us better understand mixing patterns of people with different locations and age groups, hence the spreading dynamics of infectious disease.

Spreading dynamics of infectious disease, such as COVID-19, can reveal the mixing pattern of people from different locations and age groups. In particular, triangles in the mesoscopic transmission networks, where nodes are locations and/or age groups, show a mixing pattern of people beyond pairwise interaction between nodes. It turns out that the networks are clustered to a large extent, while those without weak links could be seen as having a tree structure. Abundant triangles in a network at a given time make it possible to predict the creation of links in the future using triad-based similarity indexes for pairs of nodes. These findings can help us better understand the spreading dynamics and, hopefully, mitigate it.

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2 or COVID-19) was first reported in Wuhan, China in December 2019.1 On 30 January 2020, the World Health Organization (WHO) declared a public health emergency of international concern regarding the outbreak of COVID-19.2 Over the almost three years of the pandemic, many people have lost their lives, and we are facing various social and economic threats to life. Governments around the world have made efforts to contain COVID-19 by various means, such as social distancing, bans on gatherings, contact tracing, and active testing. The Republic of Korea (Korea hereafter) has implemented successful quarantine measures by quickly finding and isolating infected people as well as by tracing their contacts.3 

Since the first COVID-19 case was confirmed in Korea on 19 January 2020, the Korea Disease Control and Prevention Agency (KDCA) has been collecting the dataset for confirmed cases with demographic and temporal information, more importantly, with information on who infected confirmed individuals through active contact tracing. This kind of dataset enables us to study a detailed transmission network between individuals,4 although we focus on a coarse-grained picture of COVID-19 spreading in Korea. In this paper, we derive mesoscopic transmission networks; nodes of the network are defined using locations and age groups associated with confirmed individuals, while links are formed between nodes when infections occur between individuals associated with those nodes. By analyzing such mesoscopic networks in terms of clustering behavior and triad-based link prediction, we gain insights into the role of triangles of the transmission networks in the spreading dynamics of infectious disease. We find that those networks have a number of triangles but mostly with weak links. Link prediction performs well when using triad-based similarity indexes between nodes, and it performs even better when the network structure is combined with mobility and age-stratified contact patterns, which were obtained from a Korean telecommunication company and from the literature,5,6 respectively. This implies that mobility and contact patterns are also informative for link prediction and complementary for the network structure.

Since the transmission network is essentially a subgraph of the substrate network in which the spreading takes place, the triangles observed in the transmission network may have different implications than those in the substrate network. While triangles of substrate networks are well-known to influence spreading patterns in the literature,7–10 triangles of transmission networks are far from being fully understood except that such triangles could reveal a mixing pattern of people with different locations and age groups. Thus, our findings regarding the role of triangles in the transmission networks might be important to better understand the spreading dynamics of infectious disease.

The Korea Disease Control and Prevention Agency has been collecting the dataset for confirmed cases of COVID-19 in Korea.11–13 Each confirmed case is assigned a unique ID and is associated with demographic information on age, sex, residence location at the level of district, etc., along with the date of symptom onset, the date of confirmation, and the date of report. Here, the date of report is the date when the confirmation was reported to the Korean government. In most cases, the dates of confirmation and of report are identical. Most importantly, each confirmed individual is related to a precedent confirmed individual who is thought to have infected the confirmed individual.

The dataset we have accessed contains 670 483 confirmed cases that were reported from 19 January 2020 to 11 January 2022. Among them, the residence location is missing for 72 143 confirmed cases, while the date of report is missing for 3878 confirmed cases. The number of confirmed cases having information on the precedent confirmed cases is $N = 166098$; among them, 385 confirmed cases have two precedent confirmed cases, while 92 and 2 confirmed cases have three and four precedent confirmed cases, respectively. This is probably because it is not clear who among multiple precedent confirmed cases really is the infector. We assume that all precedent confirmed cases are infectors. Finally, after removing infectors with an unknown location from the data, we end up with $W = 164095$ infections in total with information on who infected whom.

We introduce terms used in the paper. A set of locations at the level of district in Korea is denoted by $L$ with $|L| = 250$. To take the age information of confirmed individuals into account, we consider five life stages or age groups: each confirmed individual is either 0–19 years old (indexed by 1), 20–29 years old (2), 30–49 years old (3), 50–64 years old (4), or 65 years old or older (5), which essentially follows the scheme of life stages used by Statistics Korea, the central government organization for statistics in Korea.14 Then, $A$ denotes the set of age groups, implying $|A| = 5$. Each confirmed individual $u$ for $u = 1, \ldots, N$ is associated with her/his location $l_u \in L$ and age group $a_u \in A$, equivalently, with a tuple of location and age group $(l_u, a_u) \in L \times A$. Here, $\times$ indicates a Cartesian product of two sets; thus, $|L \times A| = 1250$. We denote by $t$ the number of days elapsed since 19 January 2020, implying that the first and last dates in the dataset are $t = 0$ and $t = T = 722$, respectively.

Each confirmed individual, denoted by $v$, with information on the precedent confirmed individual, denoted by $u$, enables us to define an infection event $e_{uv} = (u, v, t_{uv})$, in which $t_{uv}$ is the number of days between 19 January 2020 and the date of report for $v$. Here, $t_{uv}$ is not necessarily the date of infection but can be used as a proxy for it.

Information on locations and age groups of confirmed individuals enables us to derive two kinds of mesoscopic networks. The first kind is called a location network $G_L$, in which each node $i$ is a location in $L$; i.e., $i \in L$. The second kind is called a loc×age network $G_{LA}$, in which each node $i$ is defined as a tuple of location and age group, i.e., $i = (l, a) \in L \times A$. In the case of the location network, a link between a pair of locations $i$ and $j$ is considered to exist when at least one infection event occurs between an individual in location $i$ and another individual in location $j$. The link weight between two locations $i$ and $j$ is defined as the number of infection events that occurred over the entire period as follows:

$$ w_{ij} = \left| \{ e_{uv} \mid (l_u, l_v) = (i, j) \ \text{or} \ (j, i) \} \right|. \tag{1} $$

Similarly, the link weight for the loc×age network is defined as

$$ w_{ij} = \left| \{ e_{uv} \mid ((l_u, a_u), (l_v, a_v)) = (i, j) \ \text{or} \ (j, i) \} \right|. \tag{2} $$

See Fig. 1 for the schematic diagram. Note that the infection from an infector to an infectee is directed and binary, hence resulting in a microscopic, individual transmission network that is also directed and binary. On the other hand, the derived mesoscopic network is undirected and weighted by our definition, although it is also possible to derive a directed mesoscopic network. We make this choice because clustering and link prediction have been studied mostly for undirected networks, and the analysis of directed networks requires a more complicated methodology. Hence, we leave the study of directed networks for future work.
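As an illustration of Eqs. (1) and (2), the following minimal Python sketch counts infection events between mesoscopic nodes. The function name `mesoscopic_weights`, the dictionary layout, and the toy individuals and events are hypothetical and are not taken from the dataset; they only show how one counting rule yields both the location and the loc×age weights.

```python
from collections import Counter

def mesoscopic_weights(events, node_of):
    """Count infection events between mesoscopic nodes, ignoring direction."""
    weights = Counter()
    for u, v, _t_uv in events:
        i, j = node_of[u], node_of[v]
        # One canonical key per undirected pair, so (i, j) and (j, i)
        # add to the same weight; i == j gives a within-node event.
        key = (i, j) if i <= j else (j, i)
        weights[key] += 1
    return weights

# Hypothetical toy data: individual -> (location, age group).
person = {1: ("A", 3), 2: ("B", 3), 3: ("A", 1), 4: ("B", 4)}
# Infection events (infector u, infectee v, report day t_uv).
events = [(1, 2, 10), (2, 3, 12), (1, 3, 12), (2, 4, 15)]

location_of = {u: loc for u, (loc, _age) in person.items()}
loc_w = mesoscopic_weights(events, location_of)  # weights as in Eq. (1)
locage_w = mesoscopic_weights(events, person)    # weights as in Eq. (2)

print(dict(loc_w))
print(len(locage_w), "links in the toy loc x age network")
```

Keys with identical entries, such as `("A", "A")`, hold the within-node events.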

FIG. 1.

Schematic diagram for deriving a mesoscopic network, being either a location network or a loc×age network (see the main text for their definition) from infection events between individuals. In panel (a), each infection between individuals (filled circles) is denoted by an arrow from the infector to the infectee. Colors of filled circles denote locations or the tuples of location and age group of individuals. Individuals associated with the same location or the same tuple of location and age group are grouped into nodes of the mesoscopic network in panel (b). Link weights between those nodes are determined by the number of infection events as defined in Eqs. (1) and (2).

We also remark that a considerable fraction of infection events occurred within nodes in mesoscopic networks: $\sum_i w_{ii}$ is 70% of $W$ for the location network, while it is 31% of $W$ for the loc×age network.

The location network is visualized in Fig. 2(a) in which the nodes are located in the geographical space of the Korean peninsula. Links are colored according to their weights. Inspired by the gravity model,16 we plot weights of the location network against geographical distances between nodes in Fig. 2(b) to find an overall negative correlation between weights and distances. In Figs. 2(c) and 2(d), we also show degree distributions and weight distributions, respectively, for location and loc×age networks. In both cases, degree distributions show exponential or thin tails, while weight distributions are heavy-tailed but with finite-size effects.

FIG. 2.

(a) Visualization of the location network embedded in the geographical space of the Korean peninsula that is derived for the entire period of the dataset. Links are colored according to their weight. The ocean coastline shape file was downloaded from Ref. 15. (b) Scatterplot showing an overall negative correlation between a link weight in the location network and the geographical distance between nodes connected by the link. (c) and (d) Complementary cumulative distribution functions (CCDFs) of degree (c) and weight (d) for the location and loc×age networks derived for the entire period of the dataset.

To exploit the temporal information of infection events, we define location and loc×age networks on a date $t$ from the set of infection events that occurred during a period of $\tau$ days up to $t$, i.e., $[t - \tau + 1, t]$. Precisely, we derive a location network on a date $t \in [\tau - 1, T]$, denoted by $G_{L,t}$, by defining the link weight between two locations, say $i$ and $j$, as the number of infection events in the period $[t - \tau + 1, t]$ as follows:

$$ w_{ij,t} = \left| \{ e_{uv} \mid ((l_u, l_v) = (i, j) \ \text{or} \ (j, i)) \ \text{and} \ t - \tau + 1 \le t_{uv} \le t \} \right|. \tag{3} $$

The pair of locations $i$ and $j$ is considered unconnected if $w_{ij,t} = 0$. We also define a loc×age network $G_{LA,t}$ on a date $t \in [\tau - 1, T]$ in a similar manner. In our work, we set $\tau = 14$ days, i.e., two weeks.
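The sliding-window definition in Eq. (3) can be sketched as follows. This is a minimal illustration, not the authors' code: the function `window_weights`, the node labels, and the toy events are hypothetical, and only the window length of 14 days comes from the text.

```python
from collections import Counter

TAU = 14  # window length in days, as stated in the text

def window_weights(events, node_of, t, tau=TAU):
    """Link weights on day t from events with t - tau + 1 <= t_uv <= t."""
    weights = Counter()
    for u, v, t_uv in events:
        if t - tau + 1 <= t_uv <= t:
            # Sorted pair = one key per undirected link.
            i, j = sorted((node_of[u], node_of[v]))
            weights[(i, j)] += 1
    return weights

# Hypothetical toy data: individual -> location, and dated infection events.
node_of = {1: "A", 2: "B", 3: "C"}
events = [(1, 2, 5), (2, 3, 18), (1, 3, 19), (1, 2, 30)]

w18 = window_weights(events, node_of, t=18)  # window covers days 5..18
w19 = window_weights(events, node_of, t=19)  # window covers days 6..19
print(dict(w18))
print(dict(w19))
```

Moving the window by one day drops the event reported on day 5, so the link between A and B present on day 18 is absent on day 19.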

We remark that since both $G_{L,t}$ and $G_{LA,t}$ vary with time, they can be studied in the framework of temporal networks.17–19 A temporal network could be understood either as a time series of (static) networks or as a network of time series associated with each node or link. In our work, we take the former approach, i.e., a time series of networks, implying that we analyze $G_{L,t}$ and $G_{LA,t}$ for each $t$ to see how they change over time.

We analyze the networks $G_{\alpha,t}$ for $\alpha \in \{L, LA\}$ for the entire range of $t$ to see how the structure of these networks varies with time. The structural properties can be characterized in terms of various quantities and measures developed in network science,20–22 such as degree distribution, degree assortativity, clustering coefficient, and average path length, to name a few. Among them, we focus on the average clustering coefficient, which characterizes the abundance of triangles in the network. This is because triangles in the mesoscopic network can tell us about a mixing pattern of people with different locations and/or age groups beyond pairwise interaction between nodes.

Precisely, for each node $i$ in $G_{\alpha,t}$ for $\alpha \in \{L, LA\}$, we calculate the local clustering coefficient (CC) by the following formula:23 

$$ c_i = \frac{2 E_i}{k_i (k_i - 1)}, \tag{4} $$

where $k_i$ is the degree of node $i$ and $E_i$ is the number of links between node $i$'s neighbors. Taking the average of $c_i$ over all nodes, one gets the average CC of the network $\alpha$ as

$$ c_\alpha = \frac{1}{N_\alpha} \sum_i c_i, \tag{5} $$

where $N_\alpha$ is the size of the network $\alpha$; i.e., $N_L = |L| = 250$ and $N_{LA} = |L \times A| = 1250$. Note that the definition in Eq. (5) ignores the information on the link weights. Several generalizations of CC have been introduced to incorporate the link weights into the clustering measure.24 In particular, we calculate the weighted CC from $G_{\alpha,t}$ for $\alpha \in \{L, LA\}$ defined as25 

$$ \tilde{c}_i = \frac{1}{k_i (k_i - 1)} \sum_{j, j' \in \Gamma_i} \frac{(w_{ij} w_{ij'} w_{jj'})^{1/3}}{\max\{w\}}, \tag{6} $$

where $\Gamma_i$ denotes the set of node $i$'s neighbors and $k_i = |\Gamma_i|$. Then, we take the average of $\tilde{c}_i$ over all nodes in the network to get

$$ \tilde{c}_\alpha = \frac{1}{N_\alpha} \sum_i \tilde{c}_i. \tag{7} $$

If $w_{ij} = \max\{w\}$ for all links $ij$, one gets $\tilde{c}_i = c_i$; hence, $\tilde{c}_\alpha = c_\alpha$; otherwise, $\tilde{c}_\alpha < c_\alpha$. On the other hand, one would get $\tilde{c}_i \ll c_i$ if at least one of the link weights in Eq. (6) is much smaller than $\max\{w\}$, implying a weak link, for every triangle around the node $i$.
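To make the difference between the binary and weighted clustering coefficients in Eqs. (4)–(7) concrete, here is a minimal Python sketch. The function `average_clustering`, the assumption that self-loops are dropped and that nodes with degree below 2 contribute zero, and the toy network are all illustrative choices rather than details given in the text.

```python
from itertools import permutations

def average_clustering(nodes, weights):
    """Return (binary, weighted) average clustering coefficients.

    `weights` maps an undirected pair (i, j) to its link weight.
    """
    nbr = {i: {} for i in nodes}
    for (i, j), w in weights.items():
        if i != j:  # assumption: self-loops take no part in triangles
            nbr[i][j] = w
            nbr[j][i] = w
    w_max = max((w for (i, j), w in weights.items() if i != j), default=1.0)
    c_sum = cw_sum = 0.0
    for i in nodes:
        k = len(nbr[i])
        if k < 2:
            continue  # assumption: c_i = 0 when the degree is below 2
        closed, geo = 0, 0.0
        # Ordered pairs (j, j') of neighbours, as in the sum of Eq. (6).
        for j, jp in permutations(nbr[i], 2):
            if jp in nbr[j]:
                closed += 1  # ends up equal to 2 * E_i
                geo += (nbr[i][j] * nbr[i][jp] * nbr[j][jp]) ** (1 / 3) / w_max
        c_sum += closed / (k * (k - 1))
        cw_sum += geo / (k * (k - 1))
    return c_sum / len(nodes), cw_sum / len(nodes)

# Hypothetical toy network: one triangle A-B-C with a weak link B-C.
nodes = ["A", "B", "C", "D"]
weights = {("A", "B"): 8, ("A", "C"): 8, ("B", "C"): 1, ("C", "D"): 8}
c_bin, c_wgt = average_clustering(nodes, weights)
print(round(c_bin, 4), round(c_wgt, 4))
```

In this toy case, the single weak link halves the contribution of the triangle, so the weighted average is half of the binary one.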

The results of the average CCs in Eqs. (5) and (7) are shown in Figs. 3(a) and 3(b), denoted by “binary” and “weight,” respectively. For both location and loc×age networks, the values of $c_\alpha$ are overall much larger than those of $\tilde{c}_\alpha$ for $t \gtrsim 300$, indicating that there are a number of triangles containing at least one weak link. Together with the negligible values of $\tilde{c}_\alpha$, it may also imply that the networks without such weak links can be approximately seen as trees. Temporal patterns of the average CCs are compared to the time series of the daily number of confirmed cases in Fig. 3(c). We find that the peaks of the average CC for the binary versions of location and loc×age networks are overall synchronous with the peaks of the daily number of confirmed cases. This may imply that more triads are closed during the peak times of waves.

FIG. 3.

Temporal patterns of average clustering coefficients $c_\alpha$ in Eq. (5) (solid line) and $\tilde{c}_\alpha$ in Eq. (7) (dashed line) for location networks ($\alpha = L$) (a) and for loc×age networks ($\alpha = LA$) (b). (c) Time series of the daily number of all confirmed cases (solid line) and that of cases with information on the infector (dashed line). In all panels, $t$ denotes the number of days elapsed since 19 January 2020.

We discuss implications of the observed clustering behaviors of the mesoscopic transmission networks. We first note that at the individual level, a triangular infection could be observed, e.g., if an individual $u$ infects $v$ and $v'$ before $v$ infects $v'$ again. For the related discussion, refer to Refs. 26 and 27. However, such a triangular infection may not dominate the spreading dynamics, and our dataset has no information to test the validity of such cases either. In contrast, triangles in the mesoscopic networks could appear, e.g., when an individual $u$ of a location $i$ infects $v$ of $j$ and $v'$ of $j'$ and then $v$ infects $v''$ of $j'$, leading to the triangle of locations $i$, $j$, and $j'$. As we have shown above, such triangles, when ignoring the direction of infection events between individuals, are quite abundant in the mesoscopic networks. Also, it should be noted that the infection between individuals of different locations $i$ and $j$ can occur either in location $i$, in location $j$, or even in a third location that is neither $i$ nor $j$. Therefore, triangles in the mesoscopic networks could characterize the mixing pattern of people from different locations and/or age groups.

To forecast spreading and hopefully mitigate it, it is important to predict which nodes will be connected with each other in the future based on the past network structure. For this, we apply link prediction methods28 to the location and loc×age networks on a date $t$, i.e., $G_{\alpha,t}$ for $\alpha \in \{L, LA\}$. The link prediction is based on the assumption that the more similar two nodes are, the more likely they are connected to each other. For quantifying the similarity between two nodes, we employ four similarity indexes among others,28,29 namely, the preferential attachment (PA) index,30 the resource allocation (RA) index,31 the common neighbors (CN) index,32 and the weighted common neighbors (WCN) index,29 which are, respectively, defined as

$$ s^{\rm PA}_{ij} = k_i k_j, \tag{8} $$
$$ s^{\rm RA}_{ij} = \sum_{j' \in \Gamma_i \cap \Gamma_j} \frac{1}{k_{j'}}, \tag{9} $$
$$ s^{\rm CN}_{ij} = |\Gamma_i \cap \Gamma_j|, \tag{10} $$
$$ s^{\rm WCN}_{ij} = \sum_{j' \in \Gamma_i \cap \Gamma_j} w_{ij'} w_{j'j}. \tag{11} $$

The PA index considers only the degrees of nodes $i$ and $j$; hence, it can be called degree-based, while the other three indexes, i.e., the RA, CN, and WCN indexes, are triad-based as they utilize information on the common neighbors of nodes $i$ and $j$, denoted by $\Gamma_i \cap \Gamma_j$. Also, the first three indexes consider only the topological structure of networks, while the last WCN index takes link weights into account. If all link weights are equal to 1, $s^{\rm WCN}_{ij}$ reduces to $s^{\rm CN}_{ij}$.

Let us briefly consider what the triad-based similarity between nodes could mean in the mesoscopic transmission network. A common neighbor $i$ of two unconnected nodes $j$ and $j'$ indicates that people from $i$ and $j$ mix together, and so do people from $i$ and $j'$, which could increase the possibility of infections between people from $j$ and $j'$. Therefore, one can expect that the more common neighbors nodes $j$ and $j'$ have, i.e., the more structurally similar they are, the more likely a new link is to be formed between those nodes.
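The four similarity indexes in Eqs. (8)–(11) can be computed for a single pair of nodes as in the following minimal sketch. The function `similarity_indexes`, the nested-dictionary layout `nbr[i][j] = w_ij` (with self-loops assumed to be excluded), and the toy network are hypothetical.

```python
def similarity_indexes(nbr, i, j):
    """PA, RA, CN and WCN scores of the pair (i, j), given nbr[i][j] = w_ij."""
    common = set(nbr[i]) & set(nbr[j])  # common neighbours of i and j
    return {
        "PA": len(nbr[i]) * len(nbr[j]),
        "RA": sum(1 / len(nbr[z]) for z in common),
        "CN": len(common),
        "WCN": sum(nbr[i][z] * nbr[z][j] for z in common),
    }

# Hypothetical toy network in which A and B are unconnected
# but share the neighbours C and D.
nbr = {
    "A": {"C": 2, "D": 1},
    "B": {"C": 3, "D": 1},
    "C": {"A": 2, "B": 3},
    "D": {"A": 1, "B": 1, "E": 5},
    "E": {"D": 5},
}
scores = similarity_indexes(nbr, "A", "B")
print(scores)
```

Here the RA index discounts the contribution of the common neighbor D relative to C because D has the larger degree, while the WCN index rewards C because its links to A and B are heavier.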

Link prediction methods have been used to predict which links between unconnected nodes are missing in the given dataset or will be created in the future.28 Our interest is to predict the creation of links between unconnected nodes in the future. We first denote the set of all pairs of unconnected nodes in $G_{\alpha,t}$ by $U_{\alpha,t}$ for $\alpha \in \{L, LA\}$. Then, the similarity indexes are calculated for each pair of unconnected nodes in $U_{\alpha,t}$. We check whether each pair of nodes in $U_{\alpha,t}$ becomes connected during the subsequent week, i.e., $[t + 1, t + 7]$. The set of pairs that turn out to be connected in the period $[t + 1, t + 7]$ is denoted by $E_{\alpha,+}$, while the set of pairs that remain unconnected in the period $[t + 1, t + 7]$ is denoted by $U_{\alpha,+} = U_{\alpha,t} \setminus E_{\alpha,+}$.
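The bookkeeping of candidate pairs described above can be sketched with plain Python sets. The function `split_candidates`, the representation of links as sorted tuples, and the toy link sets are assumptions made for illustration only.

```python
from itertools import combinations

def split_candidates(nodes, links_now, links_next_week):
    """Split currently unconnected pairs into newly linked and still unlinked.

    Links are assumed to be stored as sorted tuples (i, j) with i < j.
    """
    unconnected = {p for p in combinations(sorted(nodes), 2) if p not in links_now}
    e_plus = unconnected & links_next_week  # pairs linked within [t+1, t+7]
    u_plus = unconnected - e_plus           # pairs that stay unlinked
    return e_plus, u_plus

# Hypothetical toy example with four nodes.
nodes = ["A", "B", "C", "D"]
links_now = {("A", "B"), ("B", "C")}
links_next_week = {("A", "B"), ("A", "C")}
e_plus, u_plus = split_candidates(nodes, links_now, links_next_week)
print(sorted(e_plus), sorted(u_plus))
```

Note that the pair A–B appears in the following week as well but is not counted as a new link, since it was already connected on the date of prediction.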

To evaluate the quality of the link prediction method, standard metrics, such as the area under the receiver operating characteristic curve (AUC) and precision, have been used.28 These metrics have focused mainly on whether unconnected pairs are connected or not, rather than absolute values of similarity indexes. Here, we take an alternative approach to introduce a novel predictability measure concerning absolute values of similarity indexes. We calculate average values of a similarity index for pairs of nodes in Eα,+ and for pairs of nodes in Uα,+, which are, respectively, denoted by

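$$ s_E = \langle s^\beta_{ij} \rangle_{ij \in E_{\alpha,+}}, \tag{12} $$
$$ s_U = \langle s^\beta_{ij} \rangle_{ij \in U_{\alpha,+}}, \tag{13} $$

where $\beta \in \{{\rm PA}, {\rm RA}, {\rm CN}, {\rm WCN}\}$. Then, the novel predictability measure is defined as

$$ p_{\alpha\beta,t} = \frac{s_E - s_U}{s_E + s_U}, \tag{14} $$

where $\alpha \in \{L, LA\}$ and $\beta \in \{{\rm PA}, {\rm RA}, {\rm CN}, {\rm WCN}\}$. If $p_{\alpha\beta,t}$ is significantly larger than 0, it means that the similarity index $\beta$ can indeed be used to predict the link creation for the network $\alpha$. On the other hand, $p_{\alpha\beta,t}$ significantly smaller than 0 points toward the opposite tendency, while $p_{\alpha\beta,t} \approx 0$ implies that the similarity index is irrelevant to the link prediction.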
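A minimal sketch of the predictability measure in Eq. (14) is given below. The function `predictability`, the choice to return `None` when the measure is undefined, and the toy scores are illustrative assumptions; the text does not specify how empty sets are handled.

```python
def predictability(scores, future_links):
    """p = (s_E - s_U) / (s_E + s_U) for one similarity index.

    `scores` maps each currently unconnected pair to its similarity value;
    `future_links` is the set of those pairs that become connected later.
    """
    s_new = [s for pair, s in scores.items() if pair in future_links]
    s_rest = [s for pair, s in scores.items() if pair not in future_links]
    if not s_new or not s_rest:
        return None  # assumption: undefined when either group is empty
    s_e = sum(s_new) / len(s_new)    # average over newly linked pairs
    s_u = sum(s_rest) / len(s_rest)  # average over still-unlinked pairs
    if s_e + s_u == 0:
        return None  # assumption: undefined when both averages vanish
    return (s_e - s_u) / (s_e + s_u)

# Hypothetical similarity scores for four unconnected pairs.
scores = {("A", "B"): 3, ("A", "D"): 1, ("B", "D"): 0, ("C", "D"): 2}
future_links = {("A", "B"), ("C", "D")}
p = predictability(scores, future_links)
print(p)
```

For non-negative similarity values, the measure is bounded between −1 and 1, which makes indexes with very different scales, such as PA and RA, directly comparable.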

Temporal patterns of the predictability measure $p_{\alpha\beta,t}$ for each combination of $\alpha$ and $\beta$ for $t \in [30, 715]$ are shown in Figs. 4(a) and 4(b). The period of $t < 30$ was ignored as the networks are too sparse to get any meaningful results. We find that for both location and loc×age networks, the RA, CN, and WCN indexes outperform the PA index for most of the time, in particular, since $t \approx 300$, i.e., around the middle of December 2020. Note that $t \approx 300$ coincides with the time when the average clustering coefficients for the binary version of networks considerably increased; see Figs. 3(a) and 3(b). As there are more triangles in the network, the triad-based similarity indexes tend to perform better than the degree-based PA index, which appears to be consistent with the previous results.28 It also turns out that the WCN index shows systematically higher values of the predictability measure than the CN index in both kinds of networks, implying the relevance of link weights to the link prediction. Finally, the values of the predictability measure are overall higher in the loc×age networks than in the location networks, indicating that the interaction between different age groups plays an important role in the spreading dynamics, considering a number of infections between parents and children within households.

FIG. 4.

Temporal patterns of the predictability measure $p_{\alpha\beta,t}$ in Eq. (14) for location networks with $\alpha = L$ [(a) and (c)] and for loc×age networks with $\alpha = LA$ [(b) and (d)] using similarity indexes $\beta \in \{{\rm PA}, {\rm RA}, {\rm CN}, {\rm WCN}\}$ [(a) and (b)], $\beta \in \{{\rm WCN}, {\rm OD}, {\rm CN+OD}\}$ (c), and $\beta \in \{{\rm WCN}, {\rm ODCM}, {\rm CN+ODCM}\}$ (d). For definitions of similarity indexes, see the main text. In all panels, $t$ denotes the number of days elapsed since 19 January 2020.

Disease spreading has been known to strongly depend on mobility and age-stratified contact patterns.33,34 Therefore, it is natural to expect a better performance of link prediction methods with information on mobility and contact patterns in addition to the topological structure of location and loc×age networks. For a mobility pattern, we have access to the origin-destination (OD) dataset obtained from the KT Corporation, one of the major telecommunication operators in Korea. This dataset has been derived using the mobility data of subscribers for the month of June 2020. The OD dataset is given as an $|L| \times |L|$ matrix $M$. Each element $M_{ij}$ indicates the number of trips from a location $i$ to a location $j$ for all pairs of $i, j \in L$; hence, $M$ is asymmetric. Since our mesoscopic networks are undirected by definition, in order to use $M$ for the link prediction of undirected networks, we symmetrize $M$ as $M^{\rm sym} = (M + M^{T})/2$.

To perform a link prediction only using the mobility pattern, we take the element of Msym as the “similarity” index between nodes i and j; that is,

$$ s^{\rm OD}_{ij} = M^{\rm sym}_{ij}. \tag{15} $$

This index is called the OD index, which is used for the link prediction for the location network $G_{L,t}$. Note that the OD index is constant in time, while it is used to predict the link creation in time-varying networks. Thus, the value of the predictability measure in Eq. (14) still varies with time. We observe in Fig. 4(c) that the OD index overall outperforms the WCN index, which is the best performing index among those studied in Sec. III D. This might imply that the mobility pattern carries more relevant information to the link prediction than the structure of location networks does.

We now incorporate information on the mobility pattern into the purely topological similarity index, say, the common neighbors. We extend the definition of the common neighbors in Eq. (10) as follows:

$$ s^{\rm CN+OD}_{ij} = \sum_{j' \in \Gamma_i \cap \Gamma_j} s^{\rm OD}_{ij'} s^{\rm OD}_{j'j}, \tag{16} $$

where $\Gamma_i$ denotes the set of neighbors of node $i$ in $G_{L,t}$ as usual. If $s^{\rm OD}_{ij} = 1$ for all pairs of nodes $ij$, $s^{\rm CN+OD}_{ij}$ reduces to $s^{\rm CN}_{ij}$ in Eq. (10). As shown in Fig. 4(c), it turns out that the extended CN index using the OD index performs slightly better than the OD and WCN indexes for most of the time.
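The symmetrization of the OD matrix and the extended CN index in Eq. (16) can be sketched as follows. The functions `symmetrize` and `cn_od`, the integer location indexes, and the toy trip counts are hypothetical; the actual OD matrix is not reproduced here.

```python
def symmetrize(m):
    """Return (M + M^T) / 2 for a square matrix given as a list of lists."""
    n = len(m)
    return [[(m[a][b] + m[b][a]) / 2 for b in range(n)] for a in range(n)]

def cn_od(nbr, s_od, i, j):
    """Extended CN index: sum of s_od[i][z] * s_od[z][j] over common neighbours z."""
    common = nbr[i] & nbr[j]
    return sum(s_od[i][z] * s_od[z][j] for z in common)

# Hypothetical asymmetric trip counts between three locations 0, 1, 2.
trips = [
    [0, 4, 2],
    [6, 0, 1],
    [0, 3, 0],
]
m_sym = symmetrize(trips)

# Hypothetical location network: 0 and 1 are unconnected, both linked to 2.
nbr = {0: {2}, 1: {2}, 2: {0, 1}}
print(m_sym[0][1], cn_od(nbr, m_sym, 0, 1))
```

The OD index of the pair (0, 1) is simply `m_sym[0][1]`, whereas the extended CN index only uses the mobility between each of the two nodes and their common neighbor.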

Similarly, the age-stratified contact pattern along with the mobility pattern can be used for the link prediction of the loc×age networks $G_{LA,t}$. For this, we adopt a $16 \times 16$ contact matrix $C'$ derived for Korea.5,6 Here, all ages have been categorized into five-year age intervals, i.e., 0–4 years old (indexed by 1), 5–9 years old (2), 10–14 years old (3), and so on. Each element $C'_{ij}$ indicates the estimated contact frequency between age groups $i$ and $j$ for all pairs of $i, j \in [1, 16]$. As we consider five life stages as mentioned in Sec. III A, we coarse-grain $C'$ to get a $5 \times 5$ contact matrix $C$ whose element is given as

$$ C_{ij} = \sum_{i' \in g_i} \sum_{j' \in g_j} C'_{i'j'}, \tag{17} $$

where $g_i$ is the set of five-year-interval age groups in the $i$th life stage for $i = 1, \ldots, 5$. The coarse-grained contact matrix is symmetrized as $C^{\rm sym} = (C + C^{T})/2$, which is again for the application to undirected networks. Then, we combine the elements of $C^{\rm sym}$ as well as of $M^{\rm sym}$ to define the “similarity” index between nodes $i = (l, a)$ and $j = (l', a')$ as

$$ s^{\rm ODCM}_{ij} = M^{\rm sym}_{ll'} C^{\rm sym}_{aa'}, \tag{18} $$

which we call the ODCM index. Here, we have simply multiplied $M^{\rm sym}_{ll'}$ by $C^{\rm sym}_{aa'}$ as we have no information on the correlation between the location and age group. By performing the link prediction using the ODCM index, we find in Fig. 4(d) that the ODCM index overall outperforms the WCN index, implying that mobility and contact patterns, when combined, carry more relevant information to the link prediction than the structure of loc×age networks does. We also define the extended CN index using the ODCM index,

$$ s^{\rm CN+ODCM}_{ij} = \sum_{j' \in \Gamma_i \cap \Gamma_j} s^{\rm ODCM}_{ij'} s^{\rm ODCM}_{j'j}, \tag{19} $$

which turns out to perform systematically better than the ODCM and WCN indexes [Fig. 4(d)].
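The coarse-graining of the contact matrix in Eq. (17) and the ODCM index in Eq. (18) can be illustrated with the sketch below. The mapping `STAGES` from sixteen five-year bins to five life stages assumes that the last bin covers ages 75 and older; the functions `coarse_grain`, `symmetrize`, and `odcm`, as well as the all-ones contact matrix and the two-location mobility matrix, are hypothetical placeholders.

```python
# Assumed 0-based indexes of the sixteen five-year bins in each life stage:
# 0-19, 20-29, 30-49, 50-64, and 65 or older (last bin taken as 75+).
STAGES = [range(0, 4), range(4, 6), range(6, 10), range(10, 13), range(13, 16)]

def coarse_grain(c16):
    """Sum the 16 x 16 contact matrix over the bins of each pair of life stages."""
    return [[sum(c16[a][b] for a in gi for b in gj) for gj in STAGES]
            for gi in STAGES]

def symmetrize(m):
    """Return (M + M^T) / 2 for a square matrix given as a list of lists."""
    n = len(m)
    return [[(m[a][b] + m[b][a]) / 2 for b in range(n)] for a in range(n)]

def odcm(m_sym, c_sym, i, j):
    """ODCM index between nodes i = (l, a) and j = (l', a')."""
    (l, a), (lp, ap) = i, j
    return m_sym[l][lp] * c_sym[a][ap]

# Hypothetical inputs: uniform contacts and a two-location mobility matrix.
c16 = [[1] * 16 for _ in range(16)]
c_sym = symmetrize(coarse_grain(c16))
m_sym = [[0.0, 5.0], [5.0, 0.0]]

# Location 0, life stage 1 (index 0) versus location 1, life stage 2 (index 1).
print(odcm(m_sym, c_sym, (0, 0), (1, 1)))
```

With uniform contacts, each coarse-grained element simply counts the number of bin pairs it aggregates, which makes the effect of unequal life-stage widths visible.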

In order to get some insight into why the mobility and contact patterns are informative for the link prediction of transmission networks, we investigate the structural similarity between mobility and contact patterns and transmission networks. For comparison, we define a cumulative location network up to the date $t$ by aggregating all infection events that occurred no later than $t$. The link weight of the cumulative network up to $t$ is defined as

$$ w^{c}_{ij,t} = \left| \{ e_{uv} \mid ((l_u, l_v) = (i, j) \ \text{or} \ (j, i)) \ \text{and} \ t_{uv} \le t \} \right|, \tag{20} $$

which enables us to define the weighted adjacency matrix $W_{L,t}$. Then, we calculate the Pearson correlation coefficient (PCC) between the elements of $M^{\rm sym}$, i.e., $[s^{\rm OD}_{ij}]$, and those of the weighted adjacency matrix $W_{L,t}$. We also calculate the PCC between the matrix $[s^{\rm ODCM}_{ij}]$ and the weighted adjacency matrix $W_{LA,t}$ of the cumulative loc×age network up to the date $t$.
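The comparison between a cumulative weight matrix and the OD index can be sketched as a Pearson correlation over matrix elements. The text does not state which elements enter the correlation, so the restriction to the upper triangle without the diagonal is an assumption of this sketch, as are the function names and the toy matrices.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def upper_triangle(m):
    """Off-diagonal upper-triangle elements of a square matrix, row by row."""
    n = len(m)
    return [m[a][b] for a in range(n) for b in range(a + 1, n)]

# Hypothetical symmetrized OD matrix and cumulative weight matrix.
m_sym = [[0, 5, 1], [5, 0, 2], [1, 2, 0]]
w_cum = [[3, 10, 2], [10, 1, 4], [2, 4, 0]]

pcc = pearson(upper_triangle(m_sym), upper_triangle(w_cum))
print(round(pcc, 6))
```

Repeating this calculation for the cumulative matrix at each date would give a curve of the PCC as a function of time.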

The value of PCC as a function of $t$ shows overall increasing behavior for both the location network and the loc×age network, as shown in Fig. 5(a). These results indicate that the cumulative location network tends to converge to the mobility pattern as more infection events occur, while the cumulative loc×age network converges to the mobility and contact patterns combined. We also find that the values of PCC between the ODCM index and the loc×age network are smaller than those between the OD index and the location network. This might be due to the missing information on the correlation between locations and age groups; see Eq. (18). The curve of PCC between the OD index and the cumulative location network starts to slightly decrease for $t > 550$. Such decreasing behavior might be partly due to the fact that the number of cases with information on infectors starts to deviate from the number of all cases for $t > 550$, as shown in Fig. 3(c). Finally, we find the symmetrized OD matrix (contact matrix) comparable to the location (age) dependence of the number of infection events over the entire period of the dataset in Figs. 5(b)–5(e).

FIG. 5.

(a) Temporal patterns of the Pearson correlation coefficient (PCC) between the link weight in the cumulative location network and the OD index (solid line) and between the link weight in the cumulative loc×age network and the ODCM index (dashed line). Here, t denotes the number of days elapsed since 19 January 2020. (b) Heatmap of the number of infection events as a function of infector’s location and infectee’s location for the entire period of the dataset. (c) Symmetrized OD matrix obtained from the KT Corporation in Korea. (d) Heatmap of the number of infection events as a function of infector’s age and infectee’s age for the entire period of the dataset. (e) Symmetrized contact matrix for Korea between age groups, where all ages are categorized into five-year old intervals.5 

We have analyzed the detailed dataset of confirmed cases of COVID-19 in the Republic of Korea that was collected over the two years since the first confirmation on 19 January 2020. From the dataset, we derive mesoscopic transmission networks; nodes in the network are based on locations and age groups associated with confirmed individuals, while links between nodes are formed when infection events occur between individuals associated with those nodes. We analyze such mesoscopic networks in terms of clustering behavior and triad-based link prediction to gain insights into the role of triangles of the transmission networks in the spreading dynamics of infectious disease. First, by measuring clustering coefficients in both binary and weighted versions of networks, we find that the networks contain a number of triangles but mostly with weak links. Second, using several similarity indexes between nodes, we measure to what extent unconnected pairs of nodes in a network at a given time become connected in the near future. We find that triad-based similarity indexes outperform a degree-based index, implying that the triad/triangle structure of transmission networks is relevant to predicting link creation in the future. It turns out that the predictability is improved when the mobility and age-stratified contact patterns are combined with the network structure.

We remark that the mobility and contact patterns also reveal mixing patterns of people to some extent; hence, link prediction using only mobility and contact patterns performs systematically better than link prediction using only the network structure. However, the mobility and contact patterns used in our work do not capture the dynamics of the time-varying network structure, in particular, the dynamics of triadic closure in the transmission networks. The roles of triangles in the mesoscopic transmission networks should therefore be studied in more detail.

In addition, we have derived undirected mesoscopic networks from the directed individual transmission network. Directed mesoscopic networks can also be derived, but analyzing them in terms of clustering and link prediction may require a more complicated methodology than for the undirected networks; this is left for future work. Finally, one can adopt alternative definitions of the similarity between nodes, such as effective distance,35 for link prediction.
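As a rough illustration of the aggregation step mentioned above, the sketch below collapses directed individual-level infection events into an undirected, weighted group-level network. The event list, the group labels, the use of the number of infection events as the link weight, and the exclusion of within-group events are all assumptions made for this example, not a description of the exact procedure used in this work.

```python
from collections import Counter

def mesoscopic_links(events, group_of):
    """Aggregate directed (infector, infectee) events into undirected
    group-level link weights, counting events between each pair of groups.

    Assumption: within-group events are skipped (no self-loops).
    """
    weights = Counter()
    for infector, infectee in events:
        g1, g2 = group_of[infector], group_of[infectee]
        if g1 == g2:
            continue
        # Sorting the pair discards direction: (g1, g2) and (g2, g1) coincide.
        weights[tuple(sorted((g1, g2)))] += 1
    return weights

# Hypothetical individuals 1-4 with hypothetical location labels.
group_of = {1: "Seoul", 2: "Busan", 3: "Seoul", 4: "Busan"}
events = [(1, 2), (2, 3), (3, 1), (4, 1)]
weights = mesoscopic_links(events, group_of)
```

Keeping the ordered pair `(g1, g2)` as the key instead of the sorted pair would give a directed mesoscopic network of the kind left for future work.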

The authors thank Kimmo Kaski and János Kertész for fruitful discussions. O.K. was supported by the National Institute for Mathematical Sciences (NIMS) grant funded by the Korea government (MSIT) (No. NIMS-B22730000). H.-H.J. acknowledges financial support by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1A2C1007358).

The authors have no conflicts to disclose.

Okyu Kwon: Conceptualization (equal); Data curation (lead); Funding acquisition (equal); Methodology (lead); Software (equal); Visualization (lead); Writing – original draft (supporting); Writing – review & editing (equal). Hang-Hyun Jo: Conceptualization (equal); Data curation (supporting); Funding acquisition (equal); Methodology (supporting); Software (equal); Visualization (supporting); Writing – original draft (lead); Writing – review & editing (equal).

The data used in this study are non-public and have been shared with the COVID-19 mathematical modeling task force team of the Korean Mathematical Society for use in analyzing the basis for establishing the public health policy of the Central Disease Control Headquarters (Korea Disease Control and Prevention Agency).

1. Q. Li, X. Guan, P. Wu, X. Wang, L. Zhou, Y. Tong, R. Ren, K. S. Leung, E. H. Lau, J. Y. Wong, X. Xing, N. Xiang, Y. Wu, C. Li, Q. Chen, D. Li, T. Liu, J. Zhao, M. Liu, W. Tu, C. Chen, L. Jin, R. Yang, Q. Wang, S. Zhou, R. Wang, H. Liu, Y. Luo, Y. Liu, G. Shao, H. Li, Z. Tao, Y. Yang, Z. Deng, B. Liu, Z. Ma, Y. Zhang, G. Shi, T. T. Lam, J. T. Wu, G. F. Gao, B. J. Cowling, B. Yang, G. M. Leung, and Z. Feng, “Early transmission dynamics in Wuhan, China, of novel coronavirus–infected pneumonia,” N. Engl. J. Med. 382, 1199–1207 (2020).
2. WHO, “Statement on the second meeting of the International Health Regulations (2005) Emergency Committee regarding the outbreak of novel coronavirus (2019-nCoV)” (2020); see https://www.who.int/news/item/30-01-2020-statement-on-the-second-meeting-of-the-international-health-regulations-(2005)-emergency-committee-regarding-the-outbreak-of-novel-coronavirus-(2019-ncov).
3. W. Lee, S.-S. Hwang, I. Song, C. Park, H. Kim, I.-K. Song, H. M. Choi, K. Prifti, Y. Kwon, J. Kim, S. Oh, J. Yang, M. Cha, Y. Kim, M. L. Bell, and H. Kim, “COVID-19 in South Korea: Epidemiological and spatiotemporal patterns of the spread and the role of aggressive diagnostic tests in the early phase,” Int. J. Epidemiol. 49, 1106–1116 (2020).
4. J. C. Taube, P. B. Miller, and J. M. Drake, “An open-access database of infectious disease transmission trees to explore superspreader epidemiology,” PLoS Biol. 20, e3001685 (2022).
5. K. Prem, A. R. Cook, and M. Jit, “Projecting social contact matrices in 152 countries using contact surveys and demographic data,” PLoS Comput. Biol. 13, e1005697 (2017).
6. Y. Choi, J. S. Kim, H. Choi, H. Lee, and C. H. Lee, “Assessment of social distancing for controlling COVID-19 in Korea: An age-structured modeling approach,” Int. J. Environ. Res. Public Health 17, 7474 (2020).
7. M. E. J. Newman, “Random graphs with clustering,” Phys. Rev. Lett. 103, 058701 (2009).
8. J. C. Miller, “Percolation and epidemics in random clustered networks,” Phys. Rev. E 80, 020901 (2009).
9. J. C. Miller, “Spread of infectious disease through clustered populations,” J. R. Soc. Interface 6, 1121–1134 (2009).
10. D. J. P. O’Sullivan, G. J. O’Keeffe, P. G. Fennell, and J. P. Gleeson, “Mathematical modeling of complex contagion on clustered networks,” Front. Phys. 3, 71 (2015).
11. Y. Ko, J. Lee, Y. Kim, D. Kwon, and E. Jung, “COVID-19 vaccine priority strategy using a heterogenous transmission model based on maximum likelihood estimation in the Republic of Korea,” Int. J. Environ. Res. Public Health 18, 6469 (2021).
12. J. Jeon, C. Han, T. Kim, and S. Lee, “Evolution of responses to COVID-19 and epidemiological characteristics in South Korea,” Int. J. Environ. Res. Public Health 19, 4056 (2022).
13. E. Shim, W. Choi, D. Kwon, T. Kim, and Y. Song, “Transmission potential of the Omicron variant of severe acute respiratory syndrome coronavirus 2 in South Korea, 25 November 2021–8 January 2022,” Open Forum Infect. Dis. 9, ofac248 (2022).
14. See http://kostat.go.kr/ for “Statistics Korea.”
15. See http://data.nsdi.go.kr/dataset/20180927ds0050 for Korea National Spatial Data Infrastructure Portal.
16. G. K. Zipf, “The P1 P2/D hypothesis: On the intercity movement of persons,” Am. Sociol. Rev. 11, 677–686 (1946).
17. P. Holme and J. Saramäki, “Temporal networks,” Phys. Rep. 519, 97–125 (2012).
18. N. Masuda and R. Lambiotte, A Guide to Temporal Networks, Series on Complexity Science (World Scientific, Hackensack, NJ, 2016).
19. Temporal Network Theory, Computational Social Sciences, edited by P. Holme and J. Saramäki (Springer International Publishing, Cham, 2019).
20. R. Albert and A.-L. Barabási, “Statistical mechanics of complex networks,” Rev. Mod. Phys. 74, 47–97 (2002).
21. M. E. J. Newman, Networks, 2nd ed. (Oxford University Press, Oxford, UK, 2018).
22. F. Menczer, S. Fortunato, and C. A. Davis, A First Course in Network Science (Cambridge University Press, Cambridge, 2020).
23. D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’ networks,” Nature 393, 440–442 (1998).
24. J. Saramäki, M. Kivelä, J.-P. Onnela, K. Kaski, and J. Kertész, “Generalizations of the clustering coefficient to weighted complex networks,” Phys. Rev. E 75, 027105 (2007).
25. J.-P. Onnela, J. Saramäki, J. Kertész, and K. Kaski, “Intensity and coherence of motifs in weighted complex networks,” Phys. Rev. E 71, 065103 (2005).
26. G. Manzo, “Complex social networks are missing in the dominant COVID-19 epidemic models,” Sociologica 14, 31–49 (2020).
27. B. Szendrói and G. Csányi, “Polynomial epidemics and clustering in contact networks,” Proc. R. Soc. London, Ser. B 271, S364–S366 (2004).
28. L. Lü and T. Zhou, “Link prediction in complex networks: A survey,” Physica A 390, 1150–1170 (2011).
29. M. Liu, Y. Wang, J. Chen, and Y. Zhang, “Link prediction model for weighted networks based on evidence theory and the influence of common neighbours,” Complexity 2022, 1–16.
30. A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science 286, 509–512 (1999).
31. T. Zhou, L. Lü, and Y.-C. Zhang, “Predicting missing links via local information,” Eur. Phys. J. B 71, 623–630 (2009).
32. M. E. J. Newman, “Clustering and preferential attachment in growing networks,” Phys. Rev. E 64, 025102 (2001).
33. S. Chang, E. Pierson, P. W. Koh, J. Gerardin, B. Redbird, D. Grusky, and J. Leskovec, “Mobility network models of COVID-19 explain inequities and inform reopening,” Nature 589, 82–87 (2021).
34. D. Mistry, M. Litvinova, A. Pastore y Piontti, M. Chinazzi, L. Fumanelli, M. F. C. Gomes, S. A. Haque, Q.-H. Liu, K. Mu, X. Xiong, M. E. Halloran, I. M. Longini, S. Merler, M. Ajelli, and A. Vespignani, “Inferring high-resolution human mixing patterns for disease modeling,” Nat. Commun. 12, 323 (2021).
35. D. Brockmann and D. Helbing, “The hidden geometry of complex, network-driven contagion phenomena,” Science 342, 1337–1342 (2013).