We introduce a clustering coefficient for nondirected and directed hypergraphs, which we call the *quad clustering coefficient*. We determine the average quad clustering coefficient and its distribution in real-world hypergraphs and compare its value with those of random hypergraphs drawn from the configuration model. We find that real-world hypergraphs exhibit a nonnegligible fraction of nodes with a maximal value of the quad clustering coefficient, while we do not find such nodes in random hypergraphs. Interestingly, these highly clustered nodes can have large degrees and can be incident to hyperedges of large cardinality. Moreover, highly clustered nodes are not observed in an analysis based on the pairwise clustering coefficient of the associated projected graph that has binary interactions, and hence higher order interactions are required to identify nodes with a large quad clustering coefficient.

Real-world networks exhibit so-called higher order interactions, which are relations that involve more than two parties. Such higher order interactions can be represented by hyperedges, and a collection of nodes and hyperedges is called a hypergraph. This raises the question of which topological properties characterize real-world systems with higher order interactions, such as social collaboration networks or product composition networks. This problem is challenging, as real-world networks can consist of a large number of nodes and hyperedges, and hyperedges in real-world networks can connect up to hundreds of nodes. To address the topological properties of hypergraphs, we introduce in this paper a clustering coefficient that determines the density of quads incident to a node, which we call the quad clustering coefficient. Comparing the quad clustering coefficients of nodes in real-world networks with those in random networks, we find that real-world systems have topological properties that are significantly different from those of random systems. Notably, real-world hypergraphs have a large fraction of nodes with a maximal value of the quad clustering coefficient. This feature is only observed when accounting for the higher order interactions and is not seen in a classical network analysis based on binary interactions. We believe that these results are useful for developing more accurate null models for real-world networks with higher order interactions.

## I. INTRODUCTION

Networks consist of nodes, representing components of a system, and relations between those nodes. When the relations are binary, they can be represented as links in a graph.^{1–3} However, in real-world systems, relations often include three or more vertices, and these are called higher order interactions.^{4} For example, a protein–protein interaction network can be seen as a network of binary relations, where two proteins are connected when they bind to each other, or it can be seen as a network with higher order interactions where a protein complex of $\chi $ proteins corresponds to a higher order interaction of cardinality $\chi $.

Although in a first approximation real-world networks appear to be random, random networks have a smaller number of cliques than what is observed in real-world networks.^{1–3} Indeed, the average clustering coefficient of a random graph, measuring the density of triangles^{5} (the smallest possible clique), decays as the inverse of the number $N$ of nodes in the graph. On the other hand, the average clustering coefficient of real-world networks is larger and approximately independent of $N$.^{6} Because of this observation, more realistic models for real-world networks have been developed that are based on a hierarchical network^{7} or a small-world network structure.^{5,8}

For systems with higher order interactions, Refs. 9–12 define a clustering coefficient that measures the degree of local transitivity and corresponds to quantifying the clustering of nodes in the projected graph associated with a higher order network. However, contrary to the case of simple graphs, the clustering coefficients of Refs. 9–13 do not capture the density of the shortest cycles in hypergraphs.

In this paper, we propose an alternative observable for clustering in hypergraphs that quantifies the density of the shortest possible simple cycle. The shortest simple cycle of a hypergraph is a quad. In a bipartite representation of a hypergraph, where nodes and hyperedges represent the two parties of the bipartite graph, a quad is a closed path of length four consisting of an alternating sequence of two nodes and two hyperedges. The quad clustering coefficient that we introduce in this paper quantifies the density of quads and it is reminiscent of clustering coefficients that quantify densities of squares in bipartite graphs, see Refs. 14–16, but there are also some notable distinctions. For example, as we show here, the quad clustering coefficient is more effective in quantifying the density of quads in a hypergraph than coefficients defined previously in the literature. After a comparison with these previous works, we study clustering of quads in random graphs and real-world networks.

The paper is structured as follows. In Sec. II, we define hypergraphs and introduce the notation used in this paper. In Sec. III, we define the quad clustering coefficient and compare this coefficient with similar coefficients studied in the context of bipartite graphs. In Sec. IV, we derive exact expressions of the ensemble average of the quad clustering coefficient in a random hypergraph model. In Sec. V, we compare the results of Sec. IV with real-world hypergraphs and discuss notable distinctions between real-world networks and random graphs. In Sec. VI, we extend the quad clustering coefficient to directed hypergraphs and make a corresponding study for real-world networks. Conclusions are given in Sec. VII, and the paper ends with several appendixes containing technical details on the calculations in this paper.

## II. PRELIMINARIES ON HYPERGRAPHS

A nondirected hypergraph is a triplet $\mathcal{H}=(\mathcal{V},\mathcal{W},\mathcal{E})$ consisting of a set $\mathcal{V}$ of $N=|\mathcal{V}|$ nodes, a set $\mathcal{W}$ of $M=|\mathcal{W}|$ hyperedges, and a set $\mathcal{E}$ of links. We denote nodes by Roman indices, $i,j\in\mathcal{V}$, and hyperedges by Greek indices, $\alpha,\beta\in\mathcal{W}$. The set of links $\mathcal{E}$ consists of pairs $(i,\alpha)$ with $i\in\mathcal{V}$ and $\alpha\in\mathcal{W}$. We say that the hypergraph is *simple* when each pair $(i,\alpha)$ occurs at most once in the set $\mathcal{E}$.

*incidence matrix* of dimensions $N\times M$ that is defined by

*degree* of node $i\in\mathcal{V}$ is defined by

*cardinality* of a hyperedge $\alpha$ by

*modified degree*

*projected graph* by the adjacency matrix $\mathbf{A}^{\rm proj}$ with entries
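As a rough illustration of the objects introduced in this section, the sketch below builds an incidence matrix from a list of links and derives degrees, cardinalities, and a projected adjacency matrix. The defining equations are not reproduced in this text, so the code assumes the standard conventions: the degree is the number of hyperedges incident to a node, the cardinality is the number of nodes in a hyperedge, and two distinct nodes are adjacent in the projected graph when they share at least one hyperedge. All function names are hypothetical.

```python
from itertools import combinations

def incidence_matrix(n_nodes, n_hyperedges, links):
    """N x M matrix with entry 1 when the link (i, alpha) is present."""
    I = [[0] * n_hyperedges for _ in range(n_nodes)]
    for i, alpha in links:
        I[i][alpha] = 1  # simple hypergraph: a repeated pair counts once
    return I

def degrees(I):
    """Assumed definition: number of hyperedges incident to each node."""
    return [sum(row) for row in I]

def cardinalities(I):
    """Assumed definition: number of nodes contained in each hyperedge."""
    return [sum(col) for col in zip(*I)]

def projected_adjacency(I):
    """Assumed definition: i and j (i != j) are adjacent if they share a hyperedge."""
    n = len(I)
    A = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if any(a and b for a, b in zip(I[i], I[j])):
            A[i][j] = A[j][i] = 1
    return A

# Two hyperedges of cardinality 3 that overlap in nodes 1 and 2.
links = [(0, 0), (1, 0), (2, 0), (1, 1), (2, 1), (3, 1)]
I = incidence_matrix(4, 2, links)
print(degrees(I))        # [1, 2, 2, 1]
print(cardinalities(I))  # [3, 3]
```

Note that the projected matrix forgets how many hyperedges two nodes share, which is exactly the information that quads are meant to capture.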

## III. QUAD CLUSTERING COEFFICIENT: DEFINITION AND MOTIVATION


Since a triangle is the shortest cycle in a simple graph, the clustering coefficient $C_i^{\rm pi}$ is the density of shortest cycles incident to a node $i$, and we use this property of the clustering coefficient for graphs with pairwise interactions to derive a clustering coefficient valid for hypergraphs. To this aim, we represent a hypergraph as a bipartite graph, see Fig. 1. In this bipartite representation, there exist no triangles, and instead the cycle of shortest length is a *quad* consisting of two nodes and two hyperedges, see the motif highlighted in magenta in Fig. 1. Specifically, the quad is a simple cycle of four links forming an alternating sequence of nodes and hyperedges.

### A. Definition of the quad clustering coefficient

The quad clustering coefficient $C_i^{\rm q}$ has two useful properties. First, for fixed degrees $k_i(\mathbf{I};\chi)$, the quad clustering coefficient is a *linear function* of $Q_i$. Second, the proportionality factor is such that $C_i^{\rm q}\in[0,1]$, and $C_i^{\rm q}=1$ is attained when the number of quads around the node $i$ is maximal. As will become evident, these properties do not hold for clustering coefficients of bipartite graphs considered previously in the literature.

Note that quads quantify the multitude of ways in which neighboring nodes interact with each other, and in simple hypergraphs we need higher order interactions to have multiple interaction paths. In the case of simple graphs (i.e., all hyperedges have cardinality $2$ and for each pair of nodes there is at most one hyperedge connecting them), the quad clustering coefficient is zero, as the only way to create multiple interactions between two nodes is through multiple edges, which are absent when the graph is simple.
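The quad count around a node can be sketched directly from the definition of a quad as two nodes sharing two hyperedges. The normalization used below is an assumption, since the defining equation is not reproduced in this text: for each pair of hyperedges incident to $i$, at most $\min(\chi_\alpha,\chi_\beta)-1$ other nodes can be shared, and summing this bound over pairs gives a denominator that keeps the coefficient linear in the quad count and equal to one when the count is maximal. The names `quad_count`, `quad_bound`, and `quad_clustering` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def quad_count(hyperedges, i):
    """Number of quads incident to node i; hyperedges is a list of sets of nodes."""
    incident = [e for e in hyperedges if i in e]
    # Each node j != i shared by two incident hyperedges closes one quad.
    return sum(len((a & b) - {i}) for a, b in combinations(incident, 2))

def quad_bound(hyperedges, i):
    """ASSUMED maximal quad count: sum over incident pairs of min(cardinality) - 1."""
    incident = [e for e in hyperedges if i in e]
    return sum(min(len(a), len(b)) - 1 for a, b in combinations(incident, 2))

def quad_clustering(hyperedges, i):
    """Quad count divided by the assumed bound (zero when the bound vanishes)."""
    bound = quad_bound(hyperedges, i)
    return Fraction(quad_count(hyperedges, i), bound) if bound else Fraction(0)

nested = [{0, 1, 2}, {0, 1, 2, 3}]   # node 0 shares nodes 1 and 2 twice
triangle = [{0, 1}, {1, 2}, {0, 2}]  # simple graph, all cardinalities equal 2
print(quad_clustering(nested, 0))    # 1
print(quad_clustering(triangle, 0))  # 0
```

The second example reproduces the statement above that simple graphs have a vanishing quad clustering coefficient.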

In Subsections III B and III C, we compare the quad clustering coefficient with two other clustering coefficients for bipartite graphs, namely, Lind’s clustering coefficient^{14} in Sec. III B and Zhang’s clustering coefficient^{15} in Sec. III C. As we will see, Lind’s and Zhang’s clustering coefficients are not functions of $Q_i$, except when $k_i=2$, and in the latter case Lind’s and Zhang’s clustering coefficients are nonlinear functions of $Q_i$. In addition to Lind’s and Zhang’s clustering coefficients, other clustering coefficients have been defined in the literature, see Refs. 17–21, but since these are significantly different from the quad clustering coefficient we do not discuss them here. Specifically, the clustering coefficients in Refs. 17 and 18 apply to nodes in standard networks without higher order interactions, the clustering coefficient in Ref. 19 has a denominator that does not depend on the cardinalities of the hyperedges incident to the considered node, and the coefficients in Refs. 20 and 21 do not count the number of quads.

### B. Lind’s clustering coefficient

The difference between the formulas for $C_i^{\rm q}(\mathbf{I})$ and $C_i^{\rm Lind}(\mathbf{I})$, given by Eqs. (13) and (18), respectively, is in the definition of the maximal possible number of quads. For Lind’s clustering coefficient, $q_{i,\max}^{\rm Lind}$ is the sum of the existing quads $q_i$ and the number of ways $(\chi_\alpha(\mathbf{I})-\eta_{i\alpha\beta}(\mathbf{I}))(\chi_\beta(\mathbf{I})-\eta_{i\alpha\beta}(\mathbf{I}))$ that the remaining edges can be combined to form quads. In general, the number $q_{i,\max}^{\rm Lind}$ significantly overcounts the number of possible quads. For example, in Fig. 2, $q_{i,\max}^{\rm Lind}=3$, even though $q_{\max}=2$.

For nodes with a degree $k_i>2$, Lind’s clustering coefficient is not a function of $Q_i$, contrary to the quad clustering coefficient, as $q_{i,\max}^{\rm Lind}$ depends on all $q_{i\alpha\beta}$, with $\alpha,\beta\in\mathcal{W}$. For the simplest case of $k_i=3$, we illustrate this feature in the lower panel of Fig. 3. The circles and squares denote $C_i^{\rm Lind}$ for two different assignments for $q_{i\alpha\beta}$, $q_{i\alpha\gamma}$, and $q_{i\beta\gamma}$, as detailed in Appendix C. As Fig. 3(b) shows, the two curves for $C_i^{\rm Lind}$ are different for different prescriptions on the $q$’s, indicating that $C_i^{\rm Lind}$ is not a function of $Q_i$.
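The dependence of Lind’s coefficient on the individual pair counts can be checked numerically. The sketch below assumes the form of Lind’s coefficient from Ref. 14 specialized to a bipartite graph, where $\eta_{i\alpha\beta}=1+q_{i\alpha\beta}$ and the node-level coefficient is a ratio of sums over pairs of incident hyperedges; the equation itself is not reproduced in this text. The two assignments of quads are illustrative choices with the same total $Q_i=6$ and are not the uniform and biased cases of Appendix C. The function name `lind_clustering` is hypothetical.

```python
from fractions import Fraction

def lind_clustering(pairs):
    """Assumed Lind coefficient; pairs holds (chi_alpha, chi_beta, q_alpha_beta) per hyperedge pair."""
    numerator = sum(q for _, _, q in pairs)
    denominator = 0
    for chi_a, chi_b, q in pairs:
        eta = 1 + q  # node i itself plus the q nodes already shared
        denominator += (chi_a - eta) * (chi_b - eta) + q
    return Fraction(numerator, denominator) if denominator else Fraction(0)

# Two hyperedges of cardinalities 3 and 4 sharing one node besides i:
# the denominator is (3 - 2) * (4 - 2) + 1 = 3.
print(lind_clustering([(3, 4, 1)]))  # 1/3

# Three hyperedges with cardinalities 15, 20, 25 and the same total Q_i = 6.
spread = lind_clustering([(15, 20, 2), (15, 25, 2), (20, 25, 2)])
lumped = lind_clustering([(15, 20, 6), (15, 25, 0), (20, 25, 0)])
print(spread, lumped)  # 3/424 3/451
```

Both assignments contain six quads, yet the coefficient differs, which is the sense in which $C_i^{\rm Lind}$ is not a function of $Q_i$ alone.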

### C. Zhang’s clustering coefficient

Zhang *et al.* introduce the clustering coefficient


Comparing $C_i^{\rm Zhang}(\mathbf{I})$ with $C_i^{\rm Lind}(\mathbf{I})$ and $C_i^{\rm q}(\mathbf{I})$, we see that Zhang *et al.* considered yet another way of counting the maximal possible number of quads. In the example of Fig. 2, we get $C_i^{\rm Zhang}(\mathbf{I})=0$ for (a), $C_i^{\rm Zhang}(\mathbf{I})=1/4$ for (b), and $C_i^{\rm Zhang}(\mathbf{I})=2/3$ for (c).

For nodes with degrees $k_i>2$, $C_i^{\rm Zhang}$ is not a function of $Q_i$, as $q_{i,\max}^{\rm Zhang}$ depends on $q_{i\alpha\beta}$ for all $\alpha,\beta\in\mathcal{W}$.
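The values quoted for Fig. 2 can be reproduced with a short sketch. It assumes the form of Zhang’s coefficient from Ref. 15, in which the product appearing in Lind’s denominator is replaced by a sum, and it assumes that the two hyperedges in Fig. 2 have cardinalities 3 and 4. That choice is an inference from the quoted numbers, since the figure is not reproduced in this text. The function name `zhang_clustering` is hypothetical.

```python
from fractions import Fraction

def zhang_clustering(pairs):
    """Assumed Zhang coefficient; pairs holds (chi_alpha, chi_beta, q_alpha_beta) per hyperedge pair."""
    numerator = sum(q for _, _, q in pairs)
    denominator = 0
    for chi_a, chi_b, q in pairs:
        eta = 1 + q  # node i itself plus the q nodes already shared
        denominator += (chi_a - eta) + (chi_b - eta) + q
    return Fraction(numerator, denominator) if denominator else Fraction(0)

# Panels (a)-(c): zero, one, and two shared nodes besides node i.
for q in (0, 1, 2):
    print(q, zhang_clustering([(3, 4, q)]))
# 0 0
# 1 1/4
# 2 2/3
```

With these cardinalities the coefficient stays below one even when both possible quads are present, which illustrates the different normalization.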

## IV. AVERAGE QUAD CLUSTERING COEFFICIENT FOR RANDOM HYPERGRAPHS

In this section, we determine the average quad clustering coefficients for random hypergraphs. First, in Sec. IV A, we derive the ensemble averaged clustering coefficient in random hypergraph models with regular cardinalities, i.e., $\chi_\alpha(\mathbf{I})=\chi$ for all $\alpha\in\mathcal{W}$. For these models, we obtain compact expressions for the ensemble averaged quad clustering coefficient in terms of the model parameters. Subsequently, in Sec. IV B, we deal with models that are biregular in the cardinalities, i.e., $\chi_\alpha(\mathbf{I})\in\{\chi_1,\chi_2\}$, and, as will become evident, the calculations in biregular models are significantly more difficult than those in models with regular cardinalities.

### A. Regular cardinalities

We consider three random hypergraph models with regular cardinalities, i.e., for which $\chi_\alpha(\mathbf{I})=\chi$ for all $\alpha\in\mathcal{W}$. The three models are distinguished by the fluctuations in their degrees $k_i(\mathbf{I})$. In the $\chi$-regular ensemble, considered in Sec. IV A 1, the degrees are unconstrained; in the $(k,\chi)$-regular ensemble, considered in Sec. IV A 2, the degrees are regular, i.e., $k_i(\mathbf{I})=k$ for all $i\in\mathcal{V}$; lastly, in the $(\vec{k},\chi)$-regular ensemble, considered in Sec. IV A 3, the degrees are prescribed by the sequence $\vec{k}$, i.e., $k_i(\mathbf{I})=k_i$ for all $i\in\mathcal{V}$.

#### 1. *χ*-regular ensemble

#### 2. (*k*, *χ*)-regular ensemble


#### 3. $(\vec{k},\chi)$-regular ensemble

Notice that the first term in Eq. (36) diverges when the degree distribution $p_{\rm deg}(k)$ has a diverging second moment, indicating that the average clustering coefficient of random hypergraphs with diverging second moments decreases slower than $1/N$ as a function of $N$. This result is compatible with what is known for random graphs, as the average number of cycles of finite length diverges with the second moment of the degree distribution [see Eq. (9) in Ref. 23].

### B. Biregular cardinalities

We have not been able to simplify the expressions (41)–(43) further, not even in the sparse limit. Hence, although models with degree fluctuations are analytically tractable, as shown in Sec. IV A 3, it is significantly more difficult to deal with models with heterogeneous cardinalities.

Setting $\chi_1=\chi_2=\chi$ in Eq. (41), we find Eq. (29). Hence, formula (41) generalizes Eq. (29).

We understand each term in Eq. (41) as follows: the first and last terms consider quads consisting of two hyperedges with the same cardinality, and the middle term considers the case where the two hyperedges have different cardinalities.

## V. QUAD CLUSTERING COEFFICIENT IN REAL-WORLD HYPERGRAPHS

Having established a theoretical understanding of quad clustering coefficients in random hypergraphs, we now turn our attention to the quad clustering coefficient in real-world hypergraphs. To this aim, we build hypergraphs out of six datasets, which are related to Github, Youtube, NDC-substances, food recipes, Wallmart, and crime involvement. As detailed in Table I, the real-world hypergraphs have diverse topologies: their order ranges from $N\approx 10^{3}$ to $N\approx 10^{5}$, their mean degree ranges from $\bar{k}\approx 3$ to $\bar{k}\approx 60$, and their mean cardinality ranges from $\bar{\chi}\approx 3$ to $\bar{\chi}\approx 10$ (see Appendix F for more detailed information about these datasets).

Dataset | $N$ | $M$ | $\bar{k}$ | $\bar{\chi}$ | $\bar{C}^{\rm q}(\mathbf{I}_{\rm real})$ | $\bar{C}^{\rm Lind}(\mathbf{I}_{\rm real})$ | $\bar{C}^{\rm Zhang}(\mathbf{I}_{\rm real})$ | $\langle\bar{C}^{\rm q}(\mathbf{I})\rangle$ | $\langle\bar{C}^{\rm Lind}(\mathbf{I})\rangle$ | $\langle\bar{C}^{\rm Zhang}(\mathbf{I})\rangle$ |
---|---|---|---|---|---|---|---|---|---|---|
NDC-substances | 5 556 | 112 919 | 12.2 | 2.0 | 0.2760 | 0.1418 | 0.1792 | 0.0252 | 0.0012 | 0.0093 |
Youtube | 94 238 | 30 087 | 3.1 | 9.8 | 0.0920 | 0.0094 | 0.0225 | 0.0142 | 0.0001 | 0.0043 |
Food recipe | 6 714 | 39 774 | 63.8 | 10.8 | 0.1118 | 0.0178 | 0.0501 | 0.0658 | 0.0054 | 0.0271 |
Github | 56 519 | 120 867 | 7.8 | 3.6 | 0.1129 | 0.0408 | 0.0329 | 0.0084 | 0.0001 | 0.0029 |
Crime involvement | 829 | 551 | 1.8 | 2.7 | 0.0369 | 0.0243 | 0.0169 | 0.0037 | 0.0010 | 0.0013 |
Wallmart | 88 860 | 69 906 | 5.2 | 6.6 | 0.0120 | 0.0046 | 0.0046 | 0.0010 | 0.0001 | 0.0003 |

### A. Mean quad clustering coefficient

^{24} with a prescribed degree sequence $\vec{k}(\mathbf{I}_{\rm real})$ and cardinality sequence $\vec{\chi}(\mathbf{I}_{\rm real})$ (see Appendix G for a description of the algorithm used to generate hypergraphs from the configuration model). The results in Fig. 4 reveal that the quad clustering coefficients of real-world networks are significantly larger than the average clustering coefficient $\langle\bar{C}^{\rm q}(\mathbf{I})\rangle$ of the corresponding configuration models [$\langle\bar{C}^{\rm q}(\mathbf{I})\rangle\approx 0.10\,\bar{C}^{\rm q}(\mathbf{I}_{\rm real})$, see Table I]. Hence, the density of quads in real-world networks is higher than what is expected in the configuration model, similarly to previous findings for clustering coefficients in networks with pairwise interactions, see, e.g., Ref. 2. Similar conclusions can be drawn from comparing Lind’s and Zhang’s clustering coefficients between real-world and random networks (see Table I). However, the corresponding values of Lind’s and Zhang’s clustering coefficients are one order of magnitude smaller than the quad clustering coefficient, consistent with the behavior of the clustering coefficients as a function of the number of quads as shown in Fig. 3 and discussed in Sec. III.
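A configuration model with prescribed degree and cardinality sequences can be sampled by joining stubs in the bipartite representation. The sketch below is one simple variant and is not claimed to be the algorithm of Appendix G: it pairs node stubs with shuffled hyperedge stubs and rejects the whole sample whenever a link $(i,\alpha)$ is repeated, so that the result is a simple hypergraph. The function name `configuration_hypergraph` and the parameter `max_tries` are hypothetical.

```python
import random

def configuration_hypergraph(degrees, cardinalities, rng, max_tries=10_000):
    """Return a set of links (i, alpha) realizing the two prescribed sequences."""
    if sum(degrees) != sum(cardinalities):
        raise ValueError("degree and cardinality sums must match")
    node_stubs = [i for i, k in enumerate(degrees) for _ in range(k)]
    edge_stubs = [a for a, c in enumerate(cardinalities) for _ in range(c)]
    for _ in range(max_tries):
        rng.shuffle(edge_stubs)  # uniform random matching of the stubs
        links = set(zip(node_stubs, edge_stubs))
        if len(links) == len(node_stubs):  # no repeated (i, alpha) pair
            return links
    raise RuntimeError("no simple hypergraph found within max_tries")

rng = random.Random(0)
links = configuration_hypergraph([2, 2, 1, 1], [3, 3], rng)
print(len(links))  # 6
```

Rejecting the full sample rather than repairing individual links avoids biasing the ensemble, at the cost of many retries for dense or heavy-tailed sequences.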

### B. Distribution of quad clustering coefficients

Figure 5 shows the distribution $P(C^{\rm q};\mathbf{I}_{\rm real})$ for the six real-world hypergraphs under study. We highlight a few noteworthy features of these plots. First, a significant proportion of nodes possess a near-zero quad clustering coefficient, viz., between 50% and 70% in the hypergraphs (a)–(d) and over 90% in the hypergraphs (e) and (f). Second, for the remaining nodes, the distribution of $C_i^{\rm q}$ is broad. This latter feature stands in contrast with the average distribution $\langle P(C^{\rm q};\mathbf{I})\rangle$ in the corresponding configuration model with prescribed degree sequence $\vec{k}(\mathbf{I}_{\rm real})$ and cardinality sequence $\vec{\chi}(\mathbf{I}_{\rm real})$, generated by a standard stub-joining algorithm,^{25} also plotted in Fig. 5. Third, the hypergraphs in Fig. 5 exhibit a peak at $C^{\rm q}\approx 1$, which is most clearly visible in the NDC-substances hypergraph (a) and the Github hypergraph (d).

As discussed in Sec. III, quad clustering can also be quantified with the Lind and Zhang clustering coefficients. As shown in Fig. 6, the peak at $C^{\rm q}\approx 1$ also appears when quantifying quad clustering with the Lind clustering coefficient or the Zhang clustering coefficient. However, the distributions $P(C^{\rm Lind};\mathbf{I}_{\rm real})$ and $P(C^{\rm Zhang};\mathbf{I}_{\rm real})$ have a larger peak at the origin, while the number of nodes with an intermediate value (not zero or one) is smaller. This result is consistent with the nonlinearity observed in Fig. 3. Indeed, since the $C^{\rm Lind}$ and $C^{\rm Zhang}$ clustering coefficients are nonlinear, nodes accumulate at values $C^{\rm Lind}\approx 0,1$ and $C^{\rm Zhang}\approx 0,1$, and hence these clustering coefficients are less effective at discriminating nodes based on their density of quads.

As shown in Fig. 5, hypergraph (f) exhibits clustering properties that are different from those of the other networks. Specifically, hypergraph (f) exhibits a peak at $1$ in the distribution of pairwise clustering coefficients of the projected graph but no peak at $1$ in the distribution of quad clustering coefficients. To understand this peculiar property of hypergraph (f), we examine the network motifs formed by the nodes $i$ for which it holds that both $C_i^{\rm q}<0.5$ and $C_i^{\rm pi}>0.8$ (a total of $38\,520$ nodes out of the $88\,860$ satisfy this condition). We have found two types of structures among such nodes. In particular, $75\%$ of the nodes have $\sum_{\chi=3}^{\infty}k_i(\mathbf{I};\chi)=1$, and hence their quad clustering coefficient equals zero and their pairwise clustering coefficient equals one; see Fig. 7(a) for an illustration of such a motif. The remaining $25\%$ of the nodes have a structure similar to those in Fig. 7(b): the neighborhoods of the hyperedges incident to the central node are disjoint when we exclude the central node. However, each pair of nodes $j_1,j_2$ that are incident to hyperedges incident to the central node are themselves directly connected by a hyperedge. Consequently, also in this case, $C_i^{\rm q}=0$ and $C_i^{\rm pi}=1$. Note that in the real-world examples, the latter motifs are slightly different from those shown in Fig. 7(b), and hence values of $C_i^{\rm q}\in[0,0.5]$ and $C_i^{\rm pi}\in[0.8,1]$ are observed.
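The two motifs described above can be checked on toy examples. The sketch below assumes the standard local clustering coefficient on the projected graph (fraction of linked pairs among the neighbors of $i$) and counts quads as pairs of incident hyperedges sharing a second node. The two toy hypergraphs are illustrative stand-ins for the motifs of Fig. 7, which is not reproduced in this text, and the function names are hypothetical.

```python
from itertools import combinations

def pairwise_clustering(hyperedges, i):
    """Local clustering of node i in the projected graph (assumed standard definition)."""
    def linked(u, v):
        return any(u in e and v in e for e in hyperedges)
    nbrs = sorted({j for e in hyperedges if i in e for j in e} - {i})
    if len(nbrs) < 2:
        return 0.0
    closed = sum(linked(u, v) for u, v in combinations(nbrs, 2))
    return closed / (len(nbrs) * (len(nbrs) - 1) / 2)

def quad_count(hyperedges, i):
    """Number of quads incident to node i."""
    incident = [e for e in hyperedges if i in e]
    return sum(len((a & b) - {i}) for a, b in combinations(incident, 2))

# First motif: node 0 belongs to a single hyperedge of cardinality 4.
single = [{0, 1, 2, 3}]
# Second motif: two disjoint hyperedges around node 0, with every pair of
# neighbors from different hyperedges joined by its own hyperedge.
bridged = [{0, 1, 2}, {0, 3, 4}, {1, 3}, {1, 4}, {2, 3}, {2, 4}]

for motif in (single, bridged):
    print(pairwise_clustering(motif, 0), quad_count(motif, 0))
# 1.0 0
# 1.0 0
```

In both cases the projected graph around node 0 is a clique, while no two hyperedges incident to node 0 share a second node.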

### C. Quad clustering coefficients as a function of degree and cardinality

In this subsection, we study the topological properties of nodes that have a large quad clustering coefficient $C_i^{\rm q}\approx 1$.

First, we address the correlations between $C_i^{\rm q}(\mathbf{I}_{\rm real})$ and the modified degree $k_i^{\ast}(\mathbf{I}_{\rm real})$, as defined in Eq. (7). We consider the modified degree $k_i^{\ast}$ instead of the degree $k_i$, as by default hyperedges with unit cardinality do not contribute to the quad clustering coefficient. In Fig. 8, we present scatterplots containing all the pairs $[k_i^{\ast}(\mathbf{I}_{\rm real}),C_i^{\rm q}(\mathbf{I}_{\rm real})]$ for the six canonical real-world hypergraphs that we consider in this paper, one marker for each node in the hypergraph. The red dashed line is a fit to the scaling relation $C^{\rm q}\sim(k^{\ast})^{-\beta}$, and it shows the decreasing trend of the quad clustering with the modified degrees. This demonstrates that highly clustered nodes have on average lower degrees than nodes with small quad clustering coefficients. Nevertheless, up to modified degrees $k_i^{\ast}\approx 100$ there exist nodes with $C_i^{\rm q}(\mathbf{I})\approx 1$, and hence real-world hypergraphs contain highly clustered nodes that have large degrees. This result is surprising, as the denominator of the quad clustering coefficient increases fast as a function of $k_i$, see Eqs. (13) and (17); hence, one may have expected that the highly clustered nodes with $C_i^{\rm q}(\mathbf{I})\approx 1$ consist exclusively of nodes with small modified degrees.

In the *NDC-substances* network, Fig. 9(a), the maximum value of $k_i^{\ast}$ among nodes with $C^{\rm q}=1$ is $k_i^{\ast}=192$. This is unexpectedly large, as it implies that the 192 hyperedges connected to node $i$ form a fully clustered configuration.
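A scaling relation of the form $C^{\rm q}\sim(k^{\ast})^{-\beta}$ is commonly fitted by ordinary least squares in log-log coordinates. The sketch below shows that approach under the assumption that nodes with vanishing clustering or vanishing modified degree are excluded; the fitting procedure behind Fig. 8 is not specified in this text and may differ. The function name `fit_power_law` is hypothetical.

```python
import math

def fit_power_law(k_star, c_quad):
    """Fit log C = log a - beta * log k by least squares; returns (a, beta)."""
    pts = [(math.log(k), math.log(c))
           for k, c in zip(k_star, c_quad) if k > 0 and c > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_x), -slope

# Synthetic data that follow C = 0.8 * k^(-1/2) exactly.
k = [1, 2, 4, 8, 16]
c = [0.8 * x ** -0.5 for x in k]
a, beta = fit_power_law(k, c)
print(round(a, 6), round(beta, 6))  # 0.8 0.5
```

Dropping the nodes with $C_i^{\rm q}=0$ is a strong choice given how many such nodes the real hypergraphs contain, so the fitted exponent should be read as describing only the clustered nodes.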

## VI. QUAD CLUSTERING COEFFICIENT FOR DIRECTED HYPERGRAPHS

In this section, we define a quad clustering coefficient for directed hypergraphs and we analyze its properties in real-world directed hypergraphs.

### A. Preliminaries on directed hypergraphs

A directed hypergraph is a quadruplet $\mathcal{H}^{\leftrightarrow}=(\mathcal{V},\mathcal{W},\mathcal{E}^{\rm in},\mathcal{E}^{\rm out})$ consisting of the set $\mathcal{V}$ of $N=|\mathcal{V}|$ nodes, the set $\mathcal{W}$ of $M=|\mathcal{W}|$ hyperedges, and the sets $\mathcal{E}^{\rm in}\subset\mathcal{V}\times\mathcal{W}$ and $\mathcal{E}^{\rm out}\subset\mathcal{V}\times\mathcal{W}$ of directed inlinks and outlinks, respectively. Both inlinks and outlinks consist of pairs $(i,\alpha)$ with $i\in\mathcal{V}$ and $\alpha\in\mathcal{W}$, although the former represent links directed from a hyperedge to a vertex, while the latter represent links directed from a vertex to a hyperedge.

Figure 11 illustrates different ways of representing hypergraphs with an example.

*out-degree* and *in-degree* of node $i\in\mathcal{V}$ are defined by

*out-cardinality* and *in-cardinality* of hyperedge $\alpha\in\mathcal{W}$ by

*modified* *out-* and *in-cardinalities*

Note that there exists a one-to-one correspondence between simple, directed hypergraphs $\mathcal{H}^{\leftrightarrow}$ and pairs $\mathbf{I}^{\leftrightarrow}$ of incidence matrices, while the mapping between $\mathcal{H}$ and $\mathbf{A}^{\rm proj}$ is not one-to-one, and hence the projected graph is a coarse-grained representation of the hypergraph.
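The in- and out-quantities of a directed hypergraph can be tallied directly from the two link sets. The defining equations are not reproduced in this text, so the sketch below assumes the following conventions, which match the direction of the links described above: an inlink $(i,\alpha)$ points from hyperedge $\alpha$ to node $i$ and contributes to the in-degree of $i$ and the out-cardinality of $\alpha$, while an outlink $(i,\alpha)$ points from node $i$ to hyperedge $\alpha$ and contributes to the out-degree of $i$ and the in-cardinality of $\alpha$. The function name `directed_summary` is hypothetical.

```python
def directed_summary(n_nodes, n_hyperedges, inlinks, outlinks):
    """Return (k_in, k_out, chi_in, chi_out) under the assumed conventions."""
    k_in, k_out = [0] * n_nodes, [0] * n_nodes
    chi_in, chi_out = [0] * n_hyperedges, [0] * n_hyperedges
    for i, a in set(inlinks):    # hyperedge a -> node i
        k_in[i] += 1
        chi_out[a] += 1
    for i, a in set(outlinks):   # node i -> hyperedge a
        k_out[i] += 1
        chi_in[a] += 1
    return k_in, k_out, chi_in, chi_out

# Hyperedge 0 receives from node 2 and sends to nodes 0 and 1;
# hyperedge 1 receives from node 0 and sends to nobody.
k_in, k_out, chi_in, chi_out = directed_summary(
    3, 2, inlinks=[(0, 0), (1, 0)], outlinks=[(2, 0), (0, 1)])
print(k_in, k_out)      # [1, 1, 0] [1, 0, 1]
print(chi_in, chi_out)  # [1, 1] [2, 0]
```

Converting the link lists to sets enforces the simple-hypergraph condition that each directed pair occurs at most once.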

### B. Clustering coefficient for directed graphs with pairwise interactions

We review the definition of the pairwise clustering coefficient for directed graphs, as introduced in Ref. 26.


Following the example of pairwise clustering coefficients, we define in Sec. VI C a quad clustering coefficient for directed hypergraphs, which is an extension of the corresponding clustering coefficient for nondirected hypergraphs.

### C. Quad clustering coefficient for directed hypergraphs

We define a quad clustering coefficient for directed hypergraphs. Similarly to the pairwise clustering coefficient for directed graphs $C_i^{{\rm pi},\leftrightarrow}$, we require that the quad clustering coefficient counts the number of directed quads incident to the node $i$ of a hypergraph, and we require that for nondirected hypergraphs the directed quad clustering coefficient equals the quad clustering coefficient defined in Eq. (13).

*in-* and *out-cardinalities* of the hyperedges $\alpha\in\partial_i$, and the corresponding values of $I_{i\alpha}^{\leftrightarrow}$. We omit the explicit mathematical expression for $q_{\max}^{\leftrightarrow}$ here, as it is elaborate, but it can be found in Appendix H. If $\sum_{\alpha\in\partial_i(\mathbf{I}^{\leftrightarrow})}(\chi_{\alpha,i}^{\rm in}+\chi_{\alpha,i}^{\rm out})<2$, then $C^{{\rm q},\leftrightarrow}(\mathbf{I})=0$. To illustrate how quads are counted by $Q_i^{\leftrightarrow}(\mathbf{I}^{\leftrightarrow})$, consider the example in panel (b) of Fig. 12. In this case, $Q_i^{\leftrightarrow}(\mathbf{I}^{\leftrightarrow})=4$, as the motif contains the four quads in the left column of panel (a) of Fig. 12.

Next we turn to the denominator of the right-hand side of Eq. (67). Similarly to the pairwise, directed clustering coefficient $C_i^{{\rm pi},\leftrightarrow}(\mathbf{A})$, the denominator $q_{\max}^{\leftrightarrow}(\{X_{i\alpha}(\mathbf{I}^{\leftrightarrow}),I_{i\alpha}^{\leftrightarrow}\}_{\alpha\in\partial_i})$ normalizes the directed quad clustering coefficient $C_i^{{\rm q},\leftrightarrow}(\mathbf{I}^{\leftrightarrow})$ such that its value is independent of both the directionality and symmetry (i.e., unidirectional or bidirectional) of the links that connect node $i$ to its neighboring hyperedges. This means that if two nodes $i$ and $j$ have the same motif of inlinks, as shown in panel (c) of Fig. 12, then the quad clustering coefficients of the two nodes, $C_i^{{\rm q},\leftrightarrow}$ and $C_j^{{\rm q},\leftrightarrow}$, must be the same, even if the motifs of outlinks are different.

Note that for nondirected hypergraphs the directed quad clustering coefficient, defined by Eq. (67), equals the quad clustering coefficient for nondirected hypergraphs, defined by Eq. (13) (see Appendix I).

### D. Clustering in directed, real-world hypergraphs

In Sec. V, we found that the density of quads in nondirected real-world hypergraphs is large compared to the density of quads in the configuration model. In this section, we investigate whether an analogous phenomenon can be observed in directed hypergraphs. Specifically, we build directed hypergraphs from three datasets related to the DNC-email network, the English thesaurus, and the Human metabolic pathway (see Appendix F for more detailed information about these datasets).

In Table II, we present the mean quad clustering coefficient $\bar{C}^{{\rm q},\leftrightarrow}(\mathbf{I}^{\leftrightarrow}_{\rm real})\equiv\frac{1}{N}\sum_{i=1}^{N}C_i^{{\rm q},\leftrightarrow}(\mathbf{I}^{\leftrightarrow}_{\rm real})$ for the three real-world hypergraphs under study and compare their values with the corresponding directed configuration models, which have the prescribed degree sequences $\vec{k}^{\rm in}(\mathbf{I}^{\leftarrow}_{\rm real})$ and $\vec{k}^{\rm out}(\mathbf{I}^{\rightarrow}_{\rm real})$ and the prescribed cardinality sequences $\vec{\chi}^{\rm in}(\mathbf{I}^{\rightarrow}_{\rm real})$ and $\vec{\chi}^{\rm out}(\mathbf{I}^{\leftarrow}_{\rm real})$. We observe that the real-world networks have significantly larger directed quad clustering coefficients, up to $500$ times larger than those of the corresponding random models. Hence, the density of directed quads in real-world directed hypergraphs is significantly higher than their density in the corresponding configuration models, consistent with earlier findings for nondirected hypergraphs.

Dataset | $N$ | $M$ | $\bar{C}^{{\rm q},\leftrightarrow}(\mathbf{I}^{\leftrightarrow}_{\rm real})$ | $\langle\bar{C}^{{\rm q},\leftrightarrow}(\mathbf{I}^{\leftrightarrow})\rangle$ |
---|---|---|---|---|
DNC-email | 2029 | 5598 | 0.3419 | 0.0715 |
English thesaurus | 40 963 | 35 104 | 0.2371 | 0.0004 |
Metabolic pathways | 1508 | 1451 | 0.0684 | 0.0179 |

Furthermore, we determine the distribution of directed quad clustering coefficients in real-world hypergraphs, defined by $P(C^{{\rm q},\leftrightarrow};\mathbf{I}^{\leftrightarrow}_{\rm real})\equiv\frac{1}{N}\sum_{i=1}^{N}\delta\big(C^{{\rm q},\leftrightarrow}-C_i^{{\rm q},\leftrightarrow}(\mathbf{I}^{\leftrightarrow}_{\rm real})\big)$, and present the results in Fig. 13. In directed real-world hypergraphs, we also observe a peak at $C^{{\rm q},\leftrightarrow}\approx 1$ in the quad clustering distribution. In the specific examples considered, the peak is most pronounced in the DNC-email hypergraph.
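In practice, an empirical distribution of this kind is estimated by binning the per-node coefficients. The sketch below shows one such estimate with equal-width bins on $[0,1]$; the number of bins and the convention that the value $1$ falls into the last bin are assumptions, and the binning used for the figures in this paper is not specified in this text. The function name `clustering_histogram` is hypothetical.

```python
def clustering_histogram(values, n_bins=10):
    """Fraction of nodes per equal-width bin on [0, 1]; the value 1 goes to the last bin."""
    counts = [0] * n_bins
    for c in values:
        index = min(int(c * n_bins), n_bins - 1)
        counts[index] += 1
    return [count / len(values) for count in counts]

# Four nodes: two unclustered, one intermediate, one maximally clustered.
print(clustering_histogram([0.0, 0.0, 0.5, 1.0], n_bins=4))
# [0.5, 0.0, 0.25, 0.25]
```

A peak at the maximal value then shows up as excess weight in the last bin.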

## VII. DISCUSSION

We have introduced a clustering coefficient, called the quad clustering coefficient, that captures the multiplicity of interactions between neighboring nodes in (non)directed hypergraphs with higher order interactions. We have shown that for random hypergraphs the mean quad clustering coefficient has a value near zero, while for real-world networks it is one order of magnitude larger, taking values ranging from $0.01$ to $0.34$, which is a smaller range than the one observed for pairwise clustering coefficients in real-world networks;^{27} we note, however, that the distribution of quad clustering coefficients is supported on the whole $[0,1]$ range of values. Hence, the quad clustering coefficient describes a feature of real-world networks that is not captured by the current random hypergraph models.

We have determined the average quad clustering coefficient in several random hypergraph models. We have obtained exact expressions for models with fluctuating degrees and fixed cardinalities. Our analysis shows that it is significantly more difficult to deal with fluctuating cardinalities.

Analyzing the distribution of quad clustering coefficients in real-world networks, we have found that there exists a significant fraction of nodes for which the quad clustering coefficient takes its maximal value. Analyzing the topological properties of the neighborhoods of these highly clustered nodes, we have found that they can have large degrees and can be incident to hyperedges of large cardinality.

The results of this paper show that the configuration model is not a good null model for real-world networks with higher order interactions. This in itself is not a surprising result, as the configuration model is also not a good model for networks without higher order interactions, see, e.g., discussions in Ref. 6. However, what is surprising is that the distribution of quad clustering coefficients exhibits a peak at its maximal value. This result has, to the best of our knowledge, no counterpart in systems without higher order interactions.

This raises the question of what type of random hypergraph model can generate statistical properties similar to those observed in real-world networks with higher order interactions, see, e.g., Refs. 5, 7, and 8 for related questions in networks without higher order interactions. Another pertinent question concerns the implications of nodes with high quad clustering coefficients for dynamical processes such as percolation. Since highly clustered nodes do not appear in random hypergraphs, they may play an important role in dynamical processes taking place on real-world networks.

## ACKNOWLEDGMENTS

G. -G. Ha thanks D. -S. Lee, J. W. Lee, S. H. Lee, S. -W. Son, H. J. Park, M. Ha, and N. W. Landry. This work was supported by the Engineering and Physical Sciences Research Council, part of the EPSRC DTP, Grant Ref. No. EP/V520019/1.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Gyeong-Gyun Ha (하경균):** Conceptualization (equal); Data curation (equal); Formal analysis (equal); Writing – original draft (equal). **Izaak Neri:** Formal analysis (equal); Supervision (equal); Writing – review & editing (equal). **Alessia Annibale:** Formal analysis (equal); Supervision (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

We used the databases *NDC-substances*,^{28} *Youtube*,^{29,30} *Food recipe*,^{31} *Github*,^{29,32} *Crime involvement*,^{29} and *Wallmart*^{33} as real-world nondirected hypergraphs. As directed hypergraphs, we used the *DNC-email*,^{29} *English thesaurus*,^{34} and *Human metabolic pathways*^{35} databases. We implemented algorithms in Fortran to compute the nondirected and directed quad clustering coefficients of a hypergraph; the code is available from https://github.com/Gyeong-GyunHa/qch.

### APPENDIX A: ALTERNATE EXPRESSION FOR THE DENOMINATOR OF THE QUAD CLUSTERING COEFFICIENT

### APPENDIX B: ASYMPTOTIC EXPRESSION OF LIND’S AND ZHANG’S CLUSTERING COEFFICIENTS FOR LARGE CARDINALITIES

#### 1. Lind’s clustering coefficient

#### 2. Zhang’s clustering coefficient

### APPENDIX C: EXPLANATION OF THE TWO CONFIGURATIONS FOR $C_i^{\mathrm{Lind}}$ CONSIDERED IN THE LOWER PANEL OF FIG. 3

In the lower panel of Fig. 3, we consider motifs consisting of a central node $i$, three hyperedges $\alpha$, $\beta$, and $\gamma$, and a given number $Q_i(\mathbf{I})$ of quads. There are different ways of assigning quads to a given node $i$ and three hyperedges, and this leads to different values of the Lind clustering coefficient $C_i^{\mathrm{Lind}}$, as shown in Fig. 3. In this appendix, we specify the two ways of assigning quads to $i$ that have been considered in Fig. 3, which we call the uniform and the biased case. Since there are three hyperedges, the different ways of assigning quads to these three hyperedges are fully determined by the numbers $q_{i\alpha\beta}(\mathbf{I})$, $q_{i\beta\gamma}(\mathbf{I})$, and $q_{i\alpha\gamma}(\mathbf{I})$ that denote the number of quads incident to node $i$ and two given hyperedges [see Eq. (15) for the definition]. The example considered in Fig. 3 has cardinalities $\chi_\alpha=15$, $\chi_\beta=20$, and $\chi_\gamma=25$, and therefore we focus on this case.
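To make the pairwise quad counts concrete, the following minimal Python sketch may help. It assumes that the number of quads incident to node $i$ and to two hyperedges $\alpha$ and $\beta$ equals the number of nodes $j\neq i$ that belong to both hyperedges; this is an assumption about Eq. (15), which is not reproduced in this appendix. The function name and the toy hyperedges are hypothetical and much smaller than the configurations of Fig. 3.

```python
from itertools import combinations


def quads_per_hyperedge_pair(node, hyperedges):
    """Count quads incident to `node` for every pair of its hyperedges.

    `hyperedges` maps a hyperedge label to the set of nodes it contains.
    Assumption: a quad on (node, alpha, beta) is a second node j != node
    that belongs to both alpha and beta.
    """
    incident = {a: members for a, members in hyperedges.items() if node in members}
    return {
        (a, b): len((incident[a] & incident[b]) - {node})
        for a, b in combinations(sorted(incident), 2)
    }


# Hypothetical toy motif: central node 0 and three hyperedges.
toy = {
    "alpha": {0, 1, 2, 3},
    "beta": {0, 2, 3, 4},
    "gamma": {0, 3, 5},
}
pair_counts = quads_per_hyperedge_pair(0, toy)
total_quads = sum(pair_counts.values())  # total number of quads incident to node 0
```

Under this assumption, distributing a fixed total number of quads evenly or unevenly over the three hyperedge pairs amounts to choosing different values for the three entries of `pair_counts`.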

#### 1. Uniform case

#### 2. Biased case

### APPENDIX D: AVERAGE QUAD CLUSTERING COEFFICIENT FOR RANDOM HYPERGRAPH MODELS WITH REGULAR CARDINALITIES

Building on random graph methods as developed in Refs. 1, 2, 3, and 25, we derive in this appendix the expressions (31), (34), and (36) for the average quad clustering coefficients of random hypergraph models with regular cardinalities. In Appendix D 1, we derive Eq. (31), and in Appendix D 2, we derive Eq. (36). Since (34) is a special limiting case of (36), we do not discuss it separately.

#### 1. *χ*-regular ensemble

##### a. Normalization constant of $P_{\chi}$

##### b. Average clustering coefficient

#### 2. *χ*-regular with prescribed degree sequence

We derive formula (36) for the average quad clustering coefficient of the $\chi$-regular hypergraph ensemble with a prescribed degree sequence $\vec{k}$, as defined in Eq. (35), in the limit $N\to\infty$ with fixed ratio

In Appendix D 2 a, we determine the normalization constant $M_{\vec{k},\chi}$, and in Appendix D 2 b we calculate the average clustering coefficient.

##### a. Normalization constant of $P_{\vec{k},\chi}$

##### b. Average clustering coefficient

which is identical to Eq. (36) in the main text. A comparison between Eq. (D30) and the average quad clustering coefficient of large, numerically generated random hypergraphs shows excellent agreement (results not shown).

If all entries of the degree sequence are equal [i.e., the hypergraph is $(c,\chi)$-regular], then Eq. (D30) becomes $\langle C_i^{q}(\mathbf{I})\rangle=(c-1)(\chi-1)/(cN)+O(1/N^{2})$.
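As a quick numerical illustration of the regular case, the sketch below evaluates only the leading-order term $(c-1)(\chi-1)/(cN)$ and drops the $O(1/N^{2})$ correction. The function name and the parameter values are hypothetical and chosen for illustration; they are not taken from the datasets studied in this paper.

```python
def regular_quad_clustering_leading_order(c, chi, n_nodes):
    """Leading-order average quad clustering coefficient, (c-1)(chi-1)/(c N).

    c       : degree shared by all nodes
    chi     : cardinality shared by all hyperedges
    n_nodes : number of nodes N (assumed large; O(1/N^2) terms are dropped)
    """
    if c < 1 or chi < 1 or n_nodes < 1:
        raise ValueError("c, chi, and n_nodes must be positive")
    return (c - 1) * (chi - 1) / (c * n_nodes)


# Hypothetical parameters: c = 3, chi = 4, N = 1000 gives 6/3000 = 0.002.
example_value = regular_quad_clustering_leading_order(3, 4, 1000)
```

The expression vanishes for $c=1$ or $\chi=1$, where no quads can form, and decays as $1/N$ at fixed $c$ and $\chi$, consistent with the absence of highly clustered nodes in large random hypergraphs.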

### APPENDIX E: AVERAGE QUAD CLUSTERING COEFFICIENT FOR BIREGULAR CARDINALITIES

#### 1. Normalization constant of $P_{\chi_1,\chi_2}$

#### 2. Average clustering coefficient

### APPENDIX F: DATASETS FOR REAL-WORLD HYPERGRAPHS

In Secs. V and VI of this paper, we have considered six nondirected hypergraphs. These are:

- *NDC-substances*:^{28} The nodes are substances, and the hyperedges are commercial drugs registered by the U.S. Food and Drug Administration in the National Drug Code (NDC). A node is linked to a hyperedge whenever the corresponding substance is used to synthesize the drug.
- *Youtube*:^{29,30} Nodes represent YouTube users, and hyperedges represent YouTube channels with paid subscription. A user is linked to a hyperedge when the user pays for the membership service.
- *Food recipe*:^{31} Nodes are ingredients, and hyperedges are recipes for food dishes.
- *Github*:^{29,32} Nodes are GitHub users, and hyperedges are GitHub projects. A node is linked to a hyperedge whenever the corresponding user contributes to the GitHub project.
- *Crime involvement*:^{29} The nodes are suspects, and the hyperedges are crime cases. Nodes are linked to hyperedges whenever the corresponding suspects are involved in the crime investigation.
- *Wallmart*:^{33} Nodes are products sold by Walmart, and the hyperedges represent purchase orders. Nodes are linked to hyperedges whenever the corresponding products are part of the purchase order.

In Sec. VI D, we have considered three directed hypergraphs:

- *DNC-email*:^{29} Nodes are users sending and receiving emails, and hyperedges are emails that are part of the 2016 Democratic National Committee (DNC) email leak. Hyperedges are directed from the sender to its recipients. Since an email always has a single sender, all hyperedges have an in-cardinality equal to one.
- *Human metabolic pathways*:^{35} Nodes represent metabolic compounds in the human metabolism, and hyperedges are metabolic reactions. A hyperedge is directed from the reactants toward the products of the metabolic reaction, and metabolic reactions with very small rates are omitted, yielding a directed hypergraph.
- *English thesaurus*:^{34} Nodes are English words, and hyperedges represent synonym relations between words. Hyperedges are directed from a root word to target words. Since not all words occur as root words, the hypergraph is directed. The in-cardinality of each hyperedge equals one.

### APPENDIX G: CONFIGURATION MODEL FOR HYPERGRAPHS

We describe the algorithm used to generate a single instance from the configuration model for hypergraphs. There are two types of configuration models: the microcanonical ensemble, which specifies the degree sequence $\vec{k}(\mathbf{I})$ and the cardinality sequence $\vec{\chi}(\mathbf{I})$, and the canonical ensemble, which specifies the distributions $P(k)$ and $P(\chi)$ of the degrees of nodes and the cardinalities of hyperedges, respectively. In the microcanonical ensemble, links are generated randomly between nodes and hyperedges given the specified sequences, while in the canonical ensemble we first generate these sequences and then generate the links.

In Secs. V and VI, we use a microcanonical ensemble with the number of nodes $N$, the number of hyperedges $M$, the degree sequence $\vec{k}$, and the cardinality sequence $\vec{\chi}$ as given by the real-world hypergraph under study. The links between the nodes and hyperedges are generated as follows. We assign to each node a number of stubs equal to its degree and to each hyperedge a number of stubs equal to its cardinality. Subsequently, we randomly connect the stubs of nodes with those of hyperedges, with the additional constraint that no two links connect the same node-hyperedge pair. The upper panel of Fig. 15 shows an example of this process for the case of $\vec{k}=(1,1,1,2,2,1,1,1)$ and $\vec{\chi}=(5,5)$. An analogous process applies to directed hypergraphs and is illustrated in the lower panel of Fig. 15 for $\vec{k}^{\mathrm{in}}=(0,0,0,1,1,1,1,1)$, $\vec{k}^{\mathrm{out}}=(1,1,1,1,1,1,1,0)$, $\vec{\chi}^{\mathrm{in}}=\{3,4\}$, and $\vec{\chi}^{\mathrm{out}}=\{2,3\}$.
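The stub-matching procedure for the nondirected case can be sketched in Python as follows. This is a minimal illustration and not the Fortran implementation used in this work. In particular, the way the no-multiple-links constraint is enforced here (rejecting the whole matching and reshuffling) is an assumption, and the function name and the `max_tries` parameter are hypothetical.

```python
import random


def configuration_hypergraph(degrees, cardinalities, seed=None, max_tries=10_000):
    """Sample (node, hyperedge) links with prescribed degrees and cardinalities.

    Node i receives degrees[i] stubs and hyperedge a receives cardinalities[a]
    stubs. Node stubs are shuffled and paired with hyperedge stubs; a matching
    that repeats a (node, hyperedge) pair is rejected and redrawn (assumption).
    """
    if sum(degrees) != sum(cardinalities):
        raise ValueError("degree and cardinality sequences must have equal sums")
    rng = random.Random(seed)
    node_stubs = [i for i, k in enumerate(degrees) for _ in range(k)]
    edge_stubs = [a for a, chi in enumerate(cardinalities) for _ in range(chi)]
    for _ in range(max_tries):
        rng.shuffle(node_stubs)
        links = set(zip(node_stubs, edge_stubs))
        if len(links) == len(node_stubs):  # no multiple links in this matching
            return sorted(links)
    raise RuntimeError("no hypergraph without multiple links found within max_tries")


# Sequences from the upper panel of Fig. 15.
links = configuration_hypergraph((1, 1, 1, 2, 2, 1, 1, 1), (5, 5), seed=0)
```

Rejecting the complete matching, rather than redrawing only the offending stubs, keeps all valid matchings equally likely, at the price of more retries when degrees or cardinalities are large.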

### APPENDIX H: DENOMINATOR OF THE QUAD CLUSTERING COEFFICIENT FOR DIRECTED HYPERGRAPHS

#### 1. $\chi_{\alpha,i}^{\mathrm{in}}=\chi_{\alpha,i}^{\mathrm{out}}$ and $\chi_{\beta,i}^{\mathrm{in}}=\chi_{\beta,i}^{\mathrm{out}}$

Figure 16 shows two examples, one for which $C_i^{q}=0$ and another one for which $C_i^{q}=1$.

#### 2. $\chi_{\alpha,i}^{\mathrm{in}}=\chi_{\alpha,i}^{\mathrm{out}}$ and $\chi_{\beta,i}^{\mathrm{in}}\neq\chi_{\beta,i}^{\mathrm{out}}$

Figure 17 shows examples with $C_i^{q}=0$ and $C_i^{q}=1$ for each of the three above cases.

For the case with $\chi_{\alpha,i}^{\mathrm{in}}\neq\chi_{\alpha,i}^{\mathrm{out}}$ and $\chi_{\beta,i}^{\mathrm{in}}=\chi_{\beta,i}^{\mathrm{out}}$, an analogous expression applies with the two indices $\alpha$ and $\beta$ swapped.

#### 3. $\chi_{\alpha,i}^{\mathrm{in}}\neq\chi_{\alpha,i}^{\mathrm{out}}$, $\chi_{\beta,i}^{\mathrm{in}}\neq\chi_{\beta,i}^{\mathrm{out}}$, and $|\mathcal{X}_{i\alpha}\cup\mathcal{X}_{i\beta}|\neq 4$

#### 4. $\chi_{\alpha,i}^{\mathrm{in}}\neq\chi_{\alpha,i}^{\mathrm{out}}$, $\chi_{\beta,i}^{\mathrm{in}}\neq\chi_{\beta,i}^{\mathrm{out}}$, and $|\mathcal{X}_{i\alpha}\cup\mathcal{X}_{i\beta}|=4$

### APPENDIX I: FOR NONDIRECTED HYPERGRAPHS $C^{q,\leftrightarrow}(\mathbf{I}^{\leftrightarrow})=C^{q}(\mathbf{I})$

## REFERENCES

*The Structure and Dynamics of Networks*

*The Nature of Complex Networks*

*Proceedings of the Web Conference 2021* (ACM, 2021), pp. 3396–3407.

*Generating Random Networks and Graphs*

*Proceedings of the 22nd International Conference on World Wide Web* (ACM, 2013), pp. 1343–1350.

*Online Social Networks: Measurement, Analysis, and Applications to Distributed Information Systems*

*Proceedings of The Web Conference 2020* (ACM, 2020), pp. 706–717.

*Project Gutenberg Literary Archive Foundation* (2002), available at https://www.gutenberg.org/ebooks/3202.