We use topological data analysis and machine learning to study a seminal model of collective motion in biology [M. R. D’Orsogna *et al.*, Phys. Rev. Lett. **96**, 104302 (2006)]. This model describes agents interacting nonlinearly via attractive-repulsive social forces and gives rise to collective behaviors such as flocking and milling. To classify the emergent collective motion in a large library of numerical simulations and to recover model parameters from the simulation data, we apply machine learning techniques to two different types of input. First, we input time series of order parameters traditionally used in studies of collective motion. Second, we input measures based on topology that summarize the time-varying persistent homology of simulation data over multiple scales. This topological approach does not require prior knowledge of the expected patterns. For both unsupervised and supervised machine learning methods, the topological approach outperforms the one that is based on traditional order parameters.

A fundamental goal in the study of complex, nonlinear systems is to understand the link between local rules and collective behaviors. For instance, what influence does coupling between oscillators have on the ability of a network to synchronize? How does policing affect hotspots of crime in urban areas? Why do the movement decisions that fish make lead to schooling structures? We examine such links by bringing together tools from applied topology and machine learning to study a seminal model of collective motion that replicates behavior observed in biological swarming, flocking, and milling. Studies of collective motion often focus on the so-called forward problem: given a particular mathematical model, what dynamics are observed for different parameters input to the model? In contrast, we study an inverse problem: given observed data, what model parameters could have produced it? Also, given observed data, what paradigmatic dynamics are being exhibited? To answer these questions, we use machine learning techniques and find that they achieve higher accuracy when applied to topological summaries of the data—called crockers—as compared to more traditional summaries of the data that are commonly used in biology and physics to characterize collective motion. Ours is the first study to use crockers for nonlinear dynamics classification and parameter recovery.

## I. INTRODUCTION

Fundamental to many nonlinear systems is the link between local rules and global behaviors. For instance, what influence does coupling between oscillators have on the ability of a network to synchronize? How does policing affect hotspots of crime in urban areas? Why do the movement decisions that fish make lead to schooling structures? In this paper, we consider the relationship between local rules and global behavior in service of two tasks of interest in nonlinear science, namely, classification of dynamics and parameter recovery. We conduct this study in the context of one of the aforementioned applications: collective motion in biology.

Animal groups such as flocks, herds, schools, and swarms are ubiquitous in nature,^{1,2} inspire numerous mathematical models^{3–6} and motivate biomimetic approaches to engineering and computer science.^{7,8} In animal groups, two levels of behavior come into play. First, at the individual level, organisms make decisions about how to move through space. It is well-established in the biological literature that social interactions between organisms play a key role. Second, at the group level, collective motion may occur, where animals coordinate and produce emergent patterns. Mathematical models can help link these two levels of behavior. This type of linkage is fundamental to innumerable nonlinear systems displaying collective behavior.

While there is a vast literature on mathematical models of collective motion, we focus on the influential model of D’Orsogna *et al.*^{9} This model describes agents whose movement is determined by self-propulsion, drag, and social attraction-repulsion, forces frequently used in collective motion studies.^{10,11} The model produces paradigmatic behaviors such as rotating rings, vortices, disorganized swarms, and traveling groups; these mimic fish schools, swarms of midges, bird flocks, and more.^{12–14}

Studies of collective motion models often focus on a *forward problem*: given a particular model, what dynamics are observed for different parameters? In our present work, however, we study an *inverse problem*: given observed data, what parameters could have produced it? Also, given observed data, what paradigmatic dynamics are being exhibited?^{15,16} There are numerous strategies to infer parameters from data with simple differential equation models, including Bayesian inference^{17} and frequentist approaches.^{18} Our primary contribution here is to solve an inverse problem with an agent-based model using tools from topological data analysis and machine learning. Though we focus on collective motion in the D’Orsogna model, our protocol is applicable in many other settings.

Machine learning and mathematical modeling have traditionally been viewed as separate ways of understanding data. On one hand, machine learning can extract predictions of complex relations within large data sets. On the other hand, modeling can be used to hypothesize how mechanisms lead to an observed behavior. There is a growing understanding, however, that modeling and machine learning can be used in synergy.^{19} For example, Lu *et al*. use nonparametric estimators to learn the rules governing the observed output of agent-based models.^{20} The method is applied to fundamental interacting particle systems,^{21} models of social influence,^{22} predator-swarm systems,^{23} and phototactic bacteria.^{24}

Our study hinges on the use of topological data analysis (TDA), a set of tools that “help the data analyst summarize and visualize complex datasets.”^{25} TDA has played a pivotal role in studies of breast cancer, spinal cord injury, contagion, and other biological systems.^{26–28,45} A primary tool in the TDA toolbox is *persistent homology*. *Homology* has to do with calculating certain topological characteristics, while *persistence* refers to examining which of these are maintained across multiple scales in the data. In its fundamental form, persistent homology provides a framework for describing the topology of a static data set. However, ideas from persistent homology have been extended to time-varying data. One approach is the *Contour Realization Of Computed $k$-dimensional hole Evolution in the Rips complex*, known more simply as a *crocker*.^{29} A crocker shows contours of quantities called *Betti numbers* as a function of time and of persistence scale, providing a topological summary of time-varying point clouds of data. Recent work has shown how crockers can be used for exploratory data analysis of collective motion and for judging the fitness of potential mathematical models of experimental data.^{29,30}

Our present work brings together mathematical modeling, machine learning, and TDA to study collective motion in the D’Orsogna model. More specifically, we construct crockers from numerical simulations of the model and use them as inputs to machine learning clustering and classification algorithms in order to identify different paradigmatic patterns and the model parameters that produce them. We compare this approach to a more traditional one, which uses order parameters commonly used in studies of collective motion. While the traditional order parameters are typically chosen using prior knowledge of the system, the TDA tools can be used with no prior knowledge and are problem-independent. In our methodological study, we use data simulated from models so that we can compare inferred parameters to those actually used to generate the data. Once trained, the machine learning algorithms can be used to infer model parameters from experimental data.

The rest of this paper is organized as follows. Section II describes our methods, namely, numerical simulation of the D’Orsogna model, computation of traditional order parameters and crockers, and machine learning techniques. Section III presents our results, including our primary finding: the topological approach outperforms the traditional one. We conclude in Sec. IV.

## II. METHODS

### A. D’Orsogna model and numerical simulations

The D’Orsogna model^{9,31} describes the motion of $N$ interacting agents of mass $m$. Each agent is characterized by its position and velocity $xi,vi\u2208Rd$, $i=1,\u2026,N$. We focus on the two-dimensional case, $d=2$, throughout this study, though the case for $d=3$ has been considered in Ref. 32. Position and velocity obey the coupled, nonlinear ordinary differential equations,

where $\u2207\u2192i$ is the gradient with respect to $xi$. Equation (1a) states that the time derivative of position is velocity. Equation (1b) is Newton’s law, with the right hand side describing three forces acting on each agent: self-propulsion with strength $\alpha $, nonlinear drag with strength $\beta $, and social interactions. These social interactions are attractive-repulsive, as specified by the Morse-type potential in (1c). The first term inside the brackets describes social repulsion of overall strength $Cr>0$. Repulsion decays exponentially in space, with characteristic length scale of decay $\u2113r>0$. The exponential decay reflects that organisms’ ability to sense each other through sight, sound, or smell decays over distance. Restated, an organism will be more heavily influenced by near individuals than far ones. The second term is signed oppositely, and hence describes social attraction. The summation in (1c) means that a given organism interacts with all other organisms, albeit with influence decaying exponentially in space.

Biological modeling studies typically assume that repulsion is strong but operates over a short length scale, while attraction is weaker but operates over a longer scale. For (1), this would mean $Cr>Ca$ and $\u2113r<\u2113a$. In this case, the Morse potential $U$ has a well-defined minimum, representing the distance at which attraction and repulsion balance for two organisms in isolation. However, this distance is not the separation observed between individuals in a group because of the nonlinear all-to-all coupling. Regardless, as in Ref. 9, we do not enforce the restriction $Cr>Ca$ and $\u2113r<\u2113a$.

For our study, we set $N=200$. After nondimensionalizing (1), we are left with four dimensionless parameters: $\alpha $, $\beta $, $C=Cr/Ca$, and $l=lr/la$. We set $\alpha =1.5$, and $\beta =0.5$, corresponding to a base case of parameters from Ref. 9 and explore the remaining parameter space by varying the ratios $C=Cr/Ca$ and $\u2113=\u2113r/\u2113a$. Both $C$ and $\u2113$ will take on values in ${0.1,0.5,0.9,2.0,3.0}$, resulting in 25 different possible parameter combinations. For each combination, we perform 100 simulations using MATLAB’s ode45 function. Because the D’Orsogna model is typically independent of initial conditions, we draw $xi(0)$ and $vi(0)$ each from a uniform distribution on $[\u22121,1]d$. We simulate until $t=100$, allowing the swarm to attain a dynamic equilibrium state. From the computed trajectories, we sample the positions and velocities every 0.05 time units. Thus, the final output is ${(xi(tj),vi(tj))}i=1,..,Nj=1,\u2026,M$ for $tj=(j\u22121)\Delta t,M=2001,\Delta t=0.05$ for each of our 2500 simulations.

These simulations produce paradigmatic collective motion including single mills, double mills, double rings (referred to simply as rings in Ref. 9), collective swarms, and group escape, which we split into three distinct classes.^{9} For the remainder of this paper, we refer to paradigmatic collective behaviors as *phenotypes*. The first column of Fig. 1 shows a representative snapshot of each major phenotype, and Fig. 2 shows the three distinct escape types. Table I lists the values of $(C,\u2113)$ that produce each phenotype in our library of simulations. We describe these phenotypes in more detail at the end of Sec. II C.

### B. Order parameters

Investigators often use *order parameters* to summarize the output of collective motion experiments or simulations.^{9,11,29} Summaries are necessary because it is impractical or impossible to manually inspect large amounts of raw output data. The order parameters are intended to suggest if and when certain types of group behavior emerge in a population, for instance, when many individuals move in the same direction or rotate with the same orientation. Typical order parameters used for (1) include polarization, angular momentum, absolute angular momentum, and the mean distance to the nearest neighbor.^{29} We use these four in our present study, plotted for each phenotype in the second column of Fig. 1.

Group polarization $P(t)$ measures the degree of alignment between agents and is given by

with $P=1$ signifying that all agents have the same direction of motion. All phenotypes in Fig. 1 exhibit low $P(t)$, suggesting no translational flocking. Angular momentum $Mang(t)$ can detect rotational motion and is given by

where $ri(t)=xi(t)\u2212xcm(t)$, and $xcm(t)$ refers to the center of mass of the agents. A group with $Mang=1$ would have individuals sharing perfectly rotational motion. Following Ref. 31, we also consider the absolute angular momentum $Mabs(t)$, given by

Discrepancies between angular momentum and absolute angular momentum can distinguish a single mill from a double mill, in which counter-rotating agents cancel out each other's angular momentum. For example, we observe in Fig. 1 that $Mabs(t)\u2248Mang(t)$ for the single mill, whereas $Mabs(t)>Mang(t)$ for the double mill. Finally, we consider the mean distance to nearest neighbor, $DNN(t)$, which may distinguish group escape behavior from other phenotypes.^{33} $DNN(t)$ is given by

$DNN(t)$ becomes very large for the escape phenotype over time in Fig. 1 as the particles act repulsively.

In calculating time series of these order parameters, we downsample by a factor of 23 (chosen since it is a divisor of 2001, the original number of simulation frames), resulting in $M=87$ time points. While some information is lost with downsampling, it makes subsequent computations faster, while maintaining a high level of classification accuracy with the downsampling rate we have chosen.

### C. Persistent homology and crockers

While order parameters can be useful summaries of collective motion data, they are typically designed in a problem-specific manner and with some knowledge of the expected dynamics. We compare an order parameter approach to one that measures the topology of the data. This approach is arguably more agnostic and less application-specific. We now review relevant ideas from topology. To make this review broadly accessible, we keep it conceptual. For some technical details, see, e.g., Refs. 29,34, and 35.

We begin our explanation of persistent homology and crockers by focusing on agents’ positions during one time step of a simulation (or experiment). These data constitute a *point cloud*, made up of $N$ points in $Rd$. To study the topology of a point cloud, we transform it into an object called a *simplicial complex*. While there are many ways to construct a simplicial complex, we use the *Vietoris-Rips* (VR) complex, a common choice in TDA because it is efficient to compute.

To build a VR complex, we select a distance $\u03f5$ and draw a ball of diameter $\u03f5$ around each point. If two balls intersect, we connect them with an edge. If three balls all pairwise intersect, we connect all three edges and fill in the resulting triangle. If four balls all pairwise intersect, we connect all six edges, fill in each of the four triangles bounded by those edges, and fill in the solid tetrahedron bounded by the four triangles. Points, edges, filled triangles, and solid tetrahedra are called 0-, 1-, 2-, and 3-*simplices*, and more generally, $k$-simplices for any $k+1$ points with $\u03f5$-balls that pairwise intersect.

With a simplicial complex built from our point cloud, we now measure its topology by calculating its *Betti numbers*. Betti numbers $bk$ are topological invariants, meaning that they are unchanged under continuous deformations of the object such as stretching, compressing, warping, and bending. Thus, they measure something fundamental about the shape of the object. More specifically, Betti numbers enumerate the number of distinct holes in the complex that have a $k$-dimensional boundary, that is, a hole surrounded by $k$-simplices. For instance, $b0$ counts the number of connected components in the simplicial complex. Similarly, $b1$ counts the number of topological loops that bound a 2D void. Betti number $b2$ counts the number of trapped 3D volumes and so on as dimension increases. Algebraic topology tells us how to encode the calculation of Betti numbers as a linear algebra problem; see standard topology texts or Ref. 36 for a tutorial. The calculation is a *homology* computation because $bk$ is the rank of an algebraic object called a *homology group*.

In the discussion above, no value of $\u03f5$ was specified. *Persistent homology* constructs simplicial complexes and calculates Betti numbers for a range of $\u03f5$ values. There exist powerful software packages that automate this process.^{35} We use the Ripser package.^{37} The outputs of these computations are *birth* and *death* values of $\u03f5$, that is, the values of $\u03f5$ for which the various features enumerated by $bk$ appear and disappear. The word *persistence* refers to the ranges of $\u03f5$ over which features persist. For example, features that persist over large ranges of $\u03f5$ might be interpreted as signals rather than topological noise. There exist many ways to organize the birth and death information, with the most common being objects called barcodes and persistence diagrams.^{35} Additionally, for a given value of $k$, one could construct a vector in which each entry gives the value of $bk$ for a specific value of $\u03f5$ (say, on a grid). This information, $bk(\u03f5)$, is a *Betti curve*.

Small perturbations in data produce small perturbations in persistence diagrams; that is to say, persistence diagrams are stable to noise near the data.^{38} However, persistence computations are not stable with respect to outliers. In practice, a codensity measure can be used to filter outliers. Further work in this area develops notions of distance that are robust to large quantities of empirical noise and outliers.^{39–41} A clustering approach could also be used to limit the effects of this type of noise.^{42}

Thus far, we have discussed topological analysis of static point clouds. If we allow our agents’ positions to evolve dynamically in time, $t$, then we can construct a Betti curve for each frame of the simulation or experiment, and concatenate these into a matrix. We let time $t$ vary along columns and $\u03f5$ vary along rows. Each entry specifies the Betti number $bk$ for a specific pair $(t,\u03f5)$. The matrix is a topological signature of the time-varying data of a simulation and once vectorized can serve as input to machine learning algorithms. Equivalently, for visualization, one could take this matrix and construct a contour plot of $bk(t,\u03f5)$. Such a plot is the *Contour Realization Of Computed $k$-dimensional hole Evolution in the Rips complex*, or *crocker*, defined in Ref. 29.

As mentioned in Sec. II A, our data consist of numerical solutions of (1). While the D’Orsogna model tracks agents’ positions and velocities, we restrict ourselves to using position data in our topological analysis. This approach has two advantages. First, it renders our techniques applicable to experimental data, where position is the most easily observed quantity. Second, it circumvents a potential scaling disparity between numerical values of position and velocity when performing TDA on the data, as exhibited in Fig. 3. In contrast, three out of four order parameters described above require knowledge of the velocity.

To regain some of the information lost by excluding velocity, we incorporate time-delayed position information into some of our analyses. We demonstrate this time-delay approach for a double ring phenotype in Fig. 3, finding $b1=2$ for a range of $\u03f5$ values, as we would expect based on Ref. 29. Our time-delayed point cloud consists of points in $R4$ of the form $[xi(tj),xi(tj\u22125\Delta t)]$, where $j$ ranges from 23 to 2001 with a spacing of 23 to result in 86 time samples. Figure 5 and accompanying text in Appendix A describe the 4D data for single mills, double mills, and double rings. We will also consider position-only crockers {computed on a point cloud consisting of $[xi(tj)]$, where the sampling of $j$ is the same as that discussed for the order parameters in Sec. II B}.

Still, challenges remain with the normalization of our topological data. With escape phenotypes, interagent distances can approach infinity, whereas they remain bounded for other phenotypes. To circumvent this challenge, when performing our topological analyses, we take any agent whose distance from the origin crosses the threshold $\Vert xi\Vert \u221e=10$ and edit the simulation data to hold it fixed at this position.

Even after enforcing this bound, phenotypes occur on a range of scales. While agent coordinate positions are capped at 10 for group escape, they are as small as $10\u22123$ for collective swarms. We normalize position data across all simulations with a global normalization constant to ensure $\u22121\u2264\Vert xi(t)\Vert \u221e\u22641$. With this scaling, the smallest normalized phenotypes have typical distances of $10\u22124$. Thus, we compute persistent homology with $\u03f5$ varying logarithmically between $10\u22124$ and 1 with 200 grid points such that $\u03f5q=10\u22124+q\Delta \u03f5,\Delta \u03f5=4/200,q=1,\u2026,200$.

The third and fourth columns of Fig. 1 show crockers ($b0$ and $b1$) for five example simulations. The crockers for single and double mills differ markedly. In (b), the $b1$ crocker for the double mill contains small islands corresponding to two loops ($b1=2$) within a large area of topological noise ($b1>2$). In (a), the $b1$ crocker for the single mill lacks this signature. In (c), a very strong signature of two loops ($b1=2$) appears for the double ring simulation. Appendix A provides an explanation for the presence of two loops for double mills and rings. In (d), multiple agents form tight clumps in the collective swarm simulation, with each clump sufficiently dense that it appears as a single dot in the figure. The time scale at which clumps form manifests as the disappearance of high-valued regions in the $b0$ contour. On a macroscopic scale, we notice that each clump eventually travels with rotational motion, consistent with $b1=1$ over a range of scales at later times. In (e), agents escape to infinity in a radially expanding circular arrangement. The strong signature of $b1=1$ occurs at larger scales as time increases, consistent with an expanding circle.

### D. Unsupervised learning

We use the $k$-medoids algorithm to cluster numerical simulations. Each simulation is characterized by a *feature vector* constructed either from traditional order parameters or from topology. The order parameter feature vectors consist of time series $P(t)$, $Mang(t)$, $Mabs(t)$, $DNN(t)$, or the concatenation of all four. The topological feature vectors consist of vectorized crocker matrices for $b0$, $b1$, or the concatenation of the two, calculated from agent position (in some cases, augmented with time-delayed position as described in Sec. II C).

The $k$-medoids algorithm divides the ensemble of simulations into $k$ clusters, each of which is defined by one member of the ensemble that serves as the *medoid*. The algorithm chooses medoids to minimize the sum of pairwise distances within each cluster, and each simulation is assigned to the cluster containing its closest medoid. We use the R software function pam to cluster our simulations into $k=25$ groups, since there are 25 distinct parameter choices ($C$,$\u2113$). This is an *unsupervised* approach, as it does not require labeled training data.

### E. Supervised learning

As an alternative approach, we use a multiclass linear support vector machine (SVM) to infer parameters.^{43} Our use of SVMs is *supervised* because we train them on a subset of our simulations, each labeled with its true $(C,\u2113)$ values. A linear SVM takes this training data and finds hyperplanes that maximally divide the simulations according to parameter values. To classify a simulation not included in the training set, one identifies the intrahyperplane region in which it falls and reads off the appropriate label, i.e., the parameter values.

We use Matlab’s fitcecoc function to build and train our SVMs using a one-vs-one approach and 5-fold cross-validation. That is, for each round of cross-validation, we withhold 20% of the data (20 simulations from each parameter combination) for the testing data set. We then train the linear SVM on the remaining data and compute the out-of-sample accuracy for simulations in the testing set.

Order parameter and topological feature vectors are not of the same dimension. The time series of each order parameter is 87-dimensional, and the time series of all four concatenated is $4\xd787=348$-dimensional. On the other hand, each position crocker is $200\xd787=17400$-dimensional, and the concatenation of $b0$ and $b1$ is double that. To make a fair comparison between the order parameter and topological approaches, we use principal component analysis^{44} (PCA) to reduce the dimensionality of our input feature vectors. In one case, we reduce crockers and the concatenated order paramers to 87 dimensions in order to compare them directly to the individual order parameter time series. In a second case, we reduce all feature vectors to three dimensions to investigate performance at low dimensionality.

## III. RESULTS

Figure 4 recapitulates the entire analysis pipeline for our study. We summarize the dynamics of (1) using two approaches. The order parameter-based approach uses problem-specific quantities designed to distinguish between observed dynamics. The topology-based approach does not require this *a priori* knowledge and is instead based purely on the shape of the data. We then construct feature vectors from each type of summary and input them into machine learning algorithms to identify parameters and a phenotypic pattern for each model simulation. In Secs. III A and III B, we compare how accurately these different approaches can recover simulation parameters and identify dynamic phenotypes.

### A. Unsupervised learning results

Table II summarizes results for $k$-medoids clustering with $k=25$. Columns 1 and 2 specify the feature vector used, while columns 3 and 4 give the accuracies obtained for parameter recovery and phenotype identification. For the parameter recovery task, using the concatenation of all four order parameters yields 49.9% accuracy. For the concatenation of $b0$ and $b1$ based only on position data, we obtain 76.6% accuracy. Finally, for $b0$ and $b1$ based on time-delayed position, we have 71.3% accuracy. In Appendix B, Fig. 7 displays confusion matrices. These reveal that regardless of feature vector type, the collective swarming and single mills are the most frequently misclassified phenotypes.

For all feature vector types, phenotype identification is significantly more accurate than parameter recovery. The confusion matrices reveal that most cases of parameter recovery failure nonetheless place a simulation in its appropriate phenotypic regime. Using topological feature vectors based on position, we classify nearly every simulation correctly by phenotype, and this approach slightly outperforms concatenation of all order parameters. The slight improvement in classification accuracy when using TDA instead of order parameters may not be sufficient to justify the increased computational time for phenotype classification tasks. However, for parameter recovery tasks, TDA significantly improves classification accuracy when using k-medoids, and, as discussed shortly, when using a supervised classification method.

### B. Supervised learning results

Table III summarizes supervised classification results for linear SVMs. For topological feature vectors based on position, $b0$ does best with 97.0% accuracy. Similarly, for delayed positions, $b0$ also does best, with 99.6% accuracy. Finally, for order parameter feature vectors, $DNN(t)$ does best with 91.1% accuracy, while concatenating all four order parameters yields 89.2% accuracy.

For any feature vectors with a dimension greater than 87, we also include results obtained after reducing the dimensionality to 87 via a principal component analysis, allowing for a more fair comparison. After dimension reduction, the time-delayed topological information for $b1$ achieves the highest classification accuracy at 99.9%, followed by the position-only topological information for $b0$ at 96.2%. The classification accuracy for all four concatenated order parameters drops to 69.6%. Thus, a fourfold reduction in the dimension of the concatenated order parameters results in a 19.6% loss of accuracy, whereas a 200-fold reduction for time-delayed topological information leads merely to a 0.8% drop in accuracy for position-only information and a 0.1% increase for time-delayed topological information. These results suggest that even with dimensionality reduction, the topological feature vectors still carry more discriminative information.

To examine the limit of low-dimensional data, we calculate accuracies obtained after reducing all feature vectors to three dimensions. In this case, for topological feature vectors based on position data, the concatenation of $b0$ and $b1$ does best, with an accuracy of 93.1%. For delayed position data, $b0$ does best, achieving $87.7%$ accuracy, and for order parameters, $DNN(t)$ does best, yielding 81.5% accuracy. Figure 6 in Appendix B shows the three-dimensional representations of the $b0$ and $b1$ crockers. We observe a strong separation of the different phenotypes, which explains the high out-of-sample classification accuracy.

Also in Appendix B, Fig. 8 visualizes the classification results. These results suggest that, similar to the unsupervised case, collective swarms are the most difficult phenotype to classify. Still, overall, using topological data rather than order parameter data can significantly improve parameter recovery.

## IV. CONCLUSIONS AND DISCUSSION

We have combined mathematical modeling, topological data analysis, and machine learning to study nonlinear dynamics and parameter inference in the D’Orsogna model of collective motion. More specifically, we simulated (1), summarized the data using traditional and topological descriptors, and input these summaries into unsupervised and supervised machine learning algorithms in order to recover model parameters and classify pattern phenotypes.

Our machine learning classifiers achieved higher accuracy when using topological feature vectors (namely, crockers) than when using feature vectors based on traditional order parameters. Since the crocker feature vectors have higher dimensionality than the order parameter ones, we sought a fair comparison by reducing them via PCA. In this case, the crocker approach still achieved better classification accuracy using a supervised approach. In fact, $b0$ crockers generated from time-delayed position data produced a nearly perfect classification. The time-delayed crockers encode information on particle velocity, which appears explicitly in Eq. (1) and may thus aid the algorithm. The addition of $b1$ information serves primarily to increase the dimensionality of the feature space and results in reduced, though still quite high, classification accuracy.

One limitation of the topological approach is the computational cost required to produce crockers. While an order parameter is scalar-valued at each time, a crocker is vector-valued. However, recent software improvements, including the development of the Ripser package, have led to a significant reduction in cost.

A major advantage of using topological data summaries is that they do not require prior knowledge about the patterns resulting from model simulation. Order parameters, on the other hand, are typically developed to capture specific features of previously observed model behavior. We found that for the D’Orsogna model, topological approaches to phenotype classification and parameter recovery achieved higher accuracy than order parameters even though they do not incorporate knowledge of the model or its dynamics.

In future work, we would like to apply this approach to data from biological experiments or field observations. There is a scarcity of publicly available data describing real biological aggregation dynamics, so for the present, we have demonstrated our method on simulation data. Furthermore, we would like to extend our work to more complex settings, e.g., to the D’Orsogna model posed in three dimensions, in which dynamical transitions occur between distinct phenotypic regimes.^{32} Finally, it would be useful to augment the model with noise and assess its effect by using our topological methods.

## ACKNOWLEDGMENTS

This material is based upon work supported by the National Science Foundation (NSF) under Grant No. DMS 1641020. We wish to acknowledge the American Mathematical Society’s Mathematical Research Community, which brought our collaboration together and supported our work. L.Z. is partially supported by the NSF grant (No. CDS&E-MSS-1854703). D.B. is funded by the National Cancer Institute IMAT Program (No. R21CA212932). J.T.N. is partially supported by the NSF grant (No. DMS-1638521) to SAMSI. C.M.T. is supported by the NSF grant (No. DMS-1813752).

### APPENDIX A: TIME-DELAYED CROCKERS

In Fig. 5, we examine data used to compute crockers in order to understand differences in $b1$ for the single mill, double mill, and double ring phenotypes. Since the data are 4D, as described in Sec. II C, we show various 2D projections. Panel (a) shows a single mill. We see a single looplike structure that arises from the arrangement of particles in an annulus all traveling with the same orientation. This structure produces $b1=1$ in Fig. 1(a). Panel (b) shows a double mill. Especially in the last two columns, we see a signal of two loops. The two loops correspond to the two counter-rotating swarms of the double mill. However, the signal is quite noisy. This noisy signal manifests as the transient islands of $b1=2$ in Fig. 1(b). Panel (c) shows a double ring. Agents occupy a well-defined circle, with some rotating clockwise and others counterclockwise. This phenotype gives rise to two loops in the 4D space of our data (see last two columns) and produces a clear signal of $b1=2$ in Fig. 1(c).

### APPENDIX B: VISUALIZATION OF MACHINE LEARNING RESULTS

Figure 6 displays the representation of all 2500 time-delayed $b0$ and $b1$ crockers reduced to three-dimensional space using PCA. Here, we see a strong separation of the single mill, double mill, collective swarm, escape, and double ring phenotypes. This separation explains why a linear SVM trained on this information has a fairly high out-of-sample classification accuracy, as shown in Table III.

Figure 7 displays confusion matrices arising from the unsupervised classification (k-medoids clustering) performed in Sec. III A. In panel (a), we cluster on the concatenation of all four order parameters. Most of the simulations with misidentified model parameters are those exhibiting single mill behavior or collective swarm behavior. However, for these incorrect cases, most were clustered with cases sharing the same phenotype. In panel (b), we use the concatenation of $b0$ and $b1$ crockers derived from 2D position data. In this case, parameter recovery is more accurate for single mill simulations than in panel (a) but collective swarms remain challenging. Panel (c) is similar to (b), but it is based on 4D data incorporating position and time-delayed position. This approach yields results similar to those in (b), with a slight increase in misclassification among the three escape phenotypes.

Figure 8 depicts the out-of-sample parameter classification results from linear SVMs, as described in Sec. III B. Note that in this figure, we are depicting the classification of each individual simulation and not binning these classifications together as we do in the confusion matrices of Fig. 7. Because of the high supervised classification accuracies, depicting the individual classifications is more informative than the summary confusion matrix. In panel (a), the linear SVM is trained on feature vectors comprised of $DNN(t)$ without any dimensionality reduction. We observe a high misclassification from parameters $(C,\u2113)=(0.1,0.5)$ and $(0.5,3.0)$ as each other, as well as simulations from $(C,\u2113)=(0.1,0.9),(0.1,2.0),(0.1,3.0)$. All of these parameter choices produce the collective swarm phenotype (see Table I), suggesting that this is the most difficult phenotype for parameter recovery. In panel (b), the linear SVM is trained on feature vectors comprised of the concatenation of $b0$ and $b1$ crockers derived from 2D position data with dimensionality reduction down to 87 [to match the $DNN(t)$ dimensionality]. Here, we observe a marked reduction in the misclassification of the collective swarm parameter values as compared to Panel (a). In panel (c), the linear SVM is trained on feature vectors comprised of $b0$ and $b1$ crockers derived from 4D time-delayed data with dimensionality reduction down to 87. Here, we observe very accurate classifications, with only 7 simulations being misclassified out of 2500.

## REFERENCES

*Self-Organization in Biological Systems*, Princeton Studies in Complexity (Princeton University Press, Princeton, NJ, 2001.

*Animal Groups in Three Dimensions*, edited by J. K. Parrish and W. M. Hamner (Cambridge University Press, Cambridge, UK, 1997), pp. 68–89.

*15th International Conference on Machine Learning and Data Mining*(Ibai Publishing, 2019), Vol. II, pp. 860–874.