Pesticides benefit agriculture by increasing crop yield, quality, and security. However, pesticides may inadvertently harm bees, which are valuable as pollinators. Thus, candidate pesticides in development pipelines must be assessed for toxicity to bees. Leveraging a dataset of 382 molecules with toxicity labels from honey bee exposure experiments, we train a support vector machine (SVM) to predict the toxicity of pesticides to honey bees. We compare two representations of the pesticide molecules: (i) a random walk feature vector listing counts of length-*L* walks on the molecular graph with each vertex- and edge-label sequence and (ii) the Molecular ACCess System (MACCS) structural key fingerprint (FP), a bit vector indicating the presence/absence of a list of pre-defined subgraph patterns in the molecular graph. We explicitly construct the MACCS FPs but rely on the fixed-length-*L* random walk graph kernel (RWGK) in place of the dot product for the random walk representation. The *L*-RWGK-SVM achieves an accuracy, precision, recall, and F1 score (mean over 2000 runs) of 0.81, 0.68, 0.71, and 0.69, respectively, on the test data set—with *L* = 4 being the mode optimal walk length. The MACCS-FP-SVM performs on par with or marginally better than the *L*-RWGK-SVM and lends more interpretability, but varies more in performance. We interpret the MACCS-FP-SVM by illuminating which subgraph patterns in the molecules tend to strongly push them toward the toxic/non-toxic side of the separating hyperplane.

## I. INTRODUCTION

### A. Pesticide toxicity to bees

Pesticides (including insecticides, fungicides, and herbicides) are used in agriculture as an economical means to control weeds, pests, and pathogens. Thereby, pesticides increase expected crop yield and quality and contribute to food security.^{1–4} However, widespread pesticide use has negative externalities on both aquatic and terrestrial ecosystems and human health.^{5–9} For example, pesticides can harm agriculturally beneficial species not deliberately targeted, such as earthworms and bees.^{10,11}

Although under debate, extensive pesticide use in agriculture may play a role^{11–16} in the widespread decline^{17–20} of bee populations (see Ref. 21 for a synopsis) via both lethal and sublethal toxicity.^{11,21} Harms to bee populations are especially concerning because bees are valuable for agricultural production:^{22} (1; primary value) Bees serve as pollen vectors for many crops, including fruits, vegetables, nuts, oilseed, spices, and coffee.^{23,24} Specifically, bees visit the flowers of plants (angiosperms) to collect pollen or nectar as a food source. In the process, bees (inadvertently) transfer pollen from the anther of one flower to the stigma of another flower, a necessary step in the production of seeds and fruits for many plants.^{25} (2; secondary value^{26}) Honey bees produce honey and beeswax. In addition, bees are ecologically valuable as pollen vectors for plants in natural habitats.^{27}

Because insect,^{28} weed,^{29} and fungi^{30} populations can develop resistance to an insecticide, herbicide, and fungicide, respectively, new pesticides must be continually discovered and deployed.^{1} New pesticide development is also driven by the aim to reduce negative environmental impacts of incumbent pesticides.^{31}

Virtual screenings can accelerate the discovery of new pesticides operating under a known mechanism. For example, suppose an insect protein is a known target for insecticides. Then, computational protein-ligand docking^{32} can score candidate compounds for insecticide activity, informing experimental campaigns.^{33–39} However, newly proposed pesticides must also be assessed for toxicity to honey bees^{40} (see the US EPA website^{41}).

A computational model that accurately predicts the toxicity of pesticides to bees would be useful^{42} (i) as a toxicity filter in virtual and experimental screenings of compounds for pesticide activity, (ii) in emergency situations where an immediate assessment of toxicity risk is needed, and (iii) to focus scrutiny and motivate more thorough toxicity assessments on existing and new pesticides predicted to be toxic. Generally, training machine learning models to predict the toxicity of compounds to biological organisms is an active area of research.^{43,44} Indeed, open data from bee toxicity experiments^{45–56} have been leveraged to train machine learning models to computationally predict the toxicity of pesticides to bees.^{58–63}

### B. Representing molecules for supervised machine learning tasks

A flurry of research activity is devoted to the data-driven prediction of the properties of molecules via supervised machine learning.^{64} A starting point is to design a machine-readable representation of the molecule for input to the machine learning model.^{66–67}

A vertex- and edge-labeled graph (vertices = atoms, edges = bonds, vertex label = element, and edge label = bond order) is a fundamental representation of the concept of a small molecule. For many classes of molecules, the mapping of the concept of a molecule to a molecular graph is one-to-one. If we wished to communicate a small molecule to an intelligent, extraterrestrial life form that has just arrived on Earth and does not know our language, we would likely sketch a vertex- and edge-labeled, undirected graph. However, molecular graph representations break down for certain classes of molecules^{65} and are invariant to the 3D structure and stereoisomerism.^{68}

However, because classical machine learning algorithms operate in a Euclidean vector space, much research is devoted to the design of fixed-size, information-rich *vector* representations of molecules that encode their salient features.^{65,70–71} Many molecular fingerprinting methods^{71} extract topological features from the molecular graph^{71} to produce a “bag of fragments” bit vector representation of the molecule.^{72} For example, Molecular ACCess System (MACCS) structural key fingerprints^{73} of a molecular graph are bit vectors indicating the presence or absence of a pre-defined list of subgraph patterns. Other hand-crafted molecular feature vectors include chemical, electronic, and structural/shape (3D) properties of the molecule as well.^{65,72}

Two advanced supervised machine learning approaches circumvent explicit hand-crafting of vector representations of molecular graphs:

1. *graph representation learning*,^{74} such as message passing neural networks (MPNNs),^{75,76} that *learn* task-specific vector representations of molecular graphs for prediction tasks in an end-to-end manner and
2. *graph kernels*,^{78–82} which (loosely speaking) measure the similarity between any two input graphs, allowing for the use of kernel methods,^{83} such as support vector machines,^{84} kernel regression/classification,^{83} and Gaussian processes,^{85} for prediction tasks.

That is, MPNNs and kernel methods operate directly on the molecular graph representation, bypassing engineering and explicit construction, respectively, of molecular feature vectors for machine learning tasks.

MPNNs are powerful models for molecular machine learning tasks^{75} but require large training datasets. In contrast, kernel methods with graph kernels are likely more appropriate when training data are limited, as they are easier to train, possess fewer hyper-parameters, and are less susceptible to overfitting.^{86} Empirically, graph kernels give performance on par with MPNNs on a variety of molecular prediction tasks.^{87}

### C. Our contribution: Building a bee toxicity classifier of pesticides via the random walk graph kernel

We train and evaluate a support vector machine (SVM) classifier for predicting the toxicity of pesticide molecules to bees. Enabling a machine learning approach, the BeeToxAI project^{57} compiled labeled data from bee toxicity experiments, composed of 382 (pesticide molecule, bee toxicity outcome) pairs. We compare two constructions of a molecular vector space for the SVM: (1) a random walk feature space, describing pesticides by the set of vertex- and edge-label sequences along length-*L* walks on their molecular graphs, and (2) the MACCS fingerprint (FP) space, describing pesticides by the presence/absence of a list of pre-defined subgraph patterns in their molecular graphs. We explicitly construct the MACCS FPs but instead rely on the kernel trick and fixed-length-*L* random walk graph kernel (RWGK) for dot products in the random walk feature space. The *L*-RWGK-SVM achieves an F1 score (mean over 2000 runs) of 0.69 on the test dataset, and *L* = 4 is the mode optimal walk length to describe the molecular graphs. The MACCS-FP-SVM performs on par/marginally better but exhibits more variance in its performance. Finally, we illuminate subgraphs in the pesticide molecules that tend to most strongly push molecules in the MACCS FP space toward the toxic/non-toxic side of the separating hyperplane of the MACCS-FP-SVM.

## II. PROBLEM SETUP: CLASSIFYING THE TOXICITY OF A PESTICIDE TO HONEY BEES

**The pesticide toxicity classification task.** We wish to construct a classifier $f: \mathcal{G} \to \{-1, 1\}$ that maps any molecular graph *G* (see Sec. III A) representing a pesticide molecule to a predicted binary label $\hat{y} = f(G)$, where $\hat{y} = 1$ means toxic to honey bees (*Apis mellifera*) and $\hat{y} = -1$ means nontoxic. The classifier *f* is valuable as a cheap-to-evaluate “surrogate model” of an expensive bee toxicity experiment [see Fig. 1(a)].

**The labeled bee toxicity dataset.**^{57} From the BeeToxAI project,^{57} we took labeled data $\{(G_n, y_n)\}_{n=1}^N$ composed of *N* = 382 examples of (i) a molecular graph $G_n \in \mathcal{G}$ representing a pesticide or pesticide-like molecule and (ii) its experimentally determined acute contact bee toxicity label $y_n \in \{-1, 1\}$ (1: toxic, −1: nontoxic).

Figure 1(b) shows the class (im)balance; 113 of the molecules are labeled toxic, and 269 are labeled nontoxic.

The outcome of a bee exposure experiment was mapped to a toxicity label on the pesticide following US EPA guidelines:^{88} the pesticide was labeled as toxic if the median lethal dose (LD_{50}) for an adult honey bee after 48 h was less than 11 *μ*g/bee—and nontoxic otherwise.

The dataset includes neonicotinoid, pyrethroid, organophosphate, carbamate, pyridine azomethine, phenylpyrazole, and organochlorine insecticides,^{45–47,49–54,55,56} herbicides,^{56} miticides,^{50–52} and fungicides.^{47,48,52,53,56} Molecules bearing tetrahedral chiral centers are not labeled with their stereochemical configuration.

**The machine learning approach: data-driven prediction of bee toxicity.** Our objective is to leverage the labeled bee toxicity dataset to train an SVM as the toxicity classifier *f*(*G*). An SVM is a versatile supervised machine learning model that aims to find the maximum-margin separator (a hyperplane) between the positive and negative training examples in a mapped feature (vector) space. The mapped feature space does not need to be explicitly constructed. Instead, kernel functions can be used to (implicitly) perform the needed operation (dot product) in the mapped feature space. We compare two constructions of a molecular vector space by representing pesticide molecules with (1) the fixed-length random walk feature vector and (2) the MACCS fingerprint. We explicitly construct the latter, while for the former we rely on the fixed-length random walk graph kernel for the dot product.

## III. METHODS

### A. The vertex- and edge-labeled graph representation of a molecule

A fundamental representation of a molecule is as a vertex- and edge-labeled, undirected graph $G = (\mathcal{V}, \mathcal{E}, \ell_v, \ell_e)$:

- $\mathcal{V} = \{v_1, \ldots, v_N\}$ is the set of vertices representing its *N* atoms, excluding hydrogen atoms. We exclude H atoms in the molecular graph to avoid redundancy. For example, the hydrogen-excluding molecular graphs of ethane, ethylene, and acetylene can be distinguished by the order (an edge label) of the C–C bond (edge).
- $\mathcal{E}$ is the set of edges representing chemical bonds; $\{v_i, v_j\} \in \mathcal{E}$ iff the atoms represented by vertices $v_i \in \mathcal{V}$ and $v_j \in \mathcal{V}$ are bonded.
- $\ell_v: \mathcal{V} \to \{\mathrm{C}, \mathrm{N}, \mathrm{O}, \mathrm{S}, \mathrm{P}, \mathrm{F}, \mathrm{Cl}, \mathrm{I}, \mathrm{Br}, \mathrm{Si}, \mathrm{As}\}$ is the vertex-labeling function that provides the chemical element of each vertex (atom).
- $\ell_e: \mathcal{E} \to \{1, 2, 3, \mathrm{a}\}$ (“a” for aromatic) is the edge-labeling function that provides the bond order of each edge (bond).

For example, see Fig. 2. This *molecular graph* representation of a molecule describes its topology and is invariant to translations and rotations of the molecule and to bond stretching, bending, and rotation.

Let $\mathcal{G}$ be the set of possible molecular graphs, so $G \in \mathcal{G}$.

### B. Two molecular vector spaces

We explore two *feature maps* $\varphi: \mathcal{G} \to \mathbb{R}^F$ that map a molecular graph $G \in \mathcal{G}$ to a feature vector $\varphi(G) \in \mathbb{R}^F$, with $\mathbb{R}^F$ being the molecular vector space in which the SVM operates: (1) the MACCS structural key fingerprint and (2) the fixed-length random walk feature vector.

#### 1. MACCS structural key fingerprint

The Molecular ACCess System (MACCS) structural key fingerprint (FP) of a molecular graph is a bit vector whose entries indicate the presence (1) or absence (0) of a list of *F* = 166 pre-defined subgraph patterns (molecular substructures/fragments).^{71,73} The number of “on” (1) bits in the MACCS FP is equal to the number of these subgraphs present in the molecule. The list of molecular patterns defining the MACCS feature map $\varphi^{\mathrm{MACCS}}: \mathcal{G} \to \mathbb{R}^{166}$ was curated by a company, Molecular Design Limited, Inc. (MDL), for drug discovery tasks.^{73} Thus, we hypothesize that the MACCS fingerprint encodes biologically relevant information about pesticide molecules for predicting their toxicity to bees. Indeed, MACCS fingerprints have been found to be predictive of toxicity in other studies.^{63,90–92} We use the MACCS fingerprint implementation in RDKit,^{93} whose source code lists the subgraph patterns (described by SMARTS strings) corresponding to each keybit.^{94} For example, keybits 29, 134, 165, and 154 indicate the presence of phosphorus, a halogen, a ring, and a carbonyl group, respectively. Figure 3(a) illustrates further.

#### 2. The fixed-length random walk feature vector

##### a. Walks on a molecular graph and label sequences along them.

The random walk feature map describes a molecular graph by the set of label sequences along walks on it.

**A walk.** A *walk* *w* of length *L* on a molecular graph *G* is a sequence of vertices such that consecutive vertices are joined by an edge,

$$w = (v_1, v_2, \ldots, v_{L+1}), \quad \{v_i, v_{i+1}\} \in \mathcal{E} \ \text{for} \ i \in \{1, \ldots, L\}.$$

The length *L* refers to the number of edges (not necessarily unique) traversed along the walk (for example, see Fig. 2).

Let $\mathcal{W}_L(G)$ be the set of all possible walks of length *L* on a graph *G*.

**The label sequence of a walk.** The *label sequence* *s* = *ℓ*_{w}(*w*) of a walk *w* = (*v*_{1}, …, *v*_{L+1}) gives the progression of vertex and edge labels along the walk,

$$s = \ell_w(w) = \left(\ell_v(v_1), \ell_e(\{v_1, v_2\}), \ell_v(v_2), \ldots, \ell_e(\{v_L, v_{L+1}\}), \ell_v(v_{L+1})\right)$$

(for example, see Fig. 2).

Let $\mathcal{S}_L = \{s_1, \ldots, s_{S_L}\}$ be the set of all possible label sequences among length-*L* walks on all molecular graphs $G \in \mathcal{G}$—so $|\mathcal{S}_L| = S_L$.

##### b. The fixed-length random walk feature map.

The fixed-length random walk feature vector of a molecular graph lists the number of fixed-length walks on the graph with each possible vertex- and edge-label sequence. Thereby, the molecular graph is described by the distribution of label sequences along fixed-length (equipoise^{95}) random walks on it.^{97–98}

Precisely, the fixed-length-*L* feature map $\varphi^{(L)}: \mathcal{G} \to \mathbb{R}^{S_L}$ constructs a vector representation of a graph $G \in \mathcal{G}$ whose element *i* is a count of length-*L* walks on *G* with label sequence *s*_{i},

$$[\varphi^{(L)}(G)]_i = |\{w \in \mathcal{W}_L(G) : \ell_w(w) = s_i\}|. \tag{4}$$

As a length *L* = 0 walk constitutes an atom, *ϕ*^{(0)}(*G*) lists counts of atom types in the molecule. As a length *L* = 1 walk constitutes two (ordered) atoms joined by a bond, *ϕ*^{(1)}(*G*) lists counts of each particular (ordered) pairing of atoms joined by a particular bond type in the molecule.

Figure 3(b) illustrates by listing (in an arbitrary order) the nonzero elements of *ϕ*^{(L=2)}(*G*) for an example molecular graph *G*. In both Eq. (4) and Fig. 3(b), we dodge the task of explicitly imposing an ordering of the set $\mathcal{S}_L$, since we never explicitly construct *ϕ*^{(L)}(*G*).
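As a concrete illustration of this feature map, the label-sequence counts along short walks can be enumerated directly. Below is a minimal pure-Python sketch on a hypothetical three-atom C–C=O fragment (the data structures and function names are ours, not from any library):

```python
from collections import Counter

# Hypothetical three-atom fragment C-C=O (hydrogens excluded):
# vertex labels (elements), edge labels (bond orders), and adjacency lists.
vertex_label = {0: "C", 1: "C", 2: "O"}
edge_label = {(0, 1): 1, (1, 2): 2}
adjacency = {0: [1], 1: [0, 2], 2: [1]}

def bond(u, v):
    """Bond order of the edge {u, v} (order-insensitive lookup)."""
    return edge_label.get((u, v), edge_label.get((v, u)))

def walk_label_sequence_counts(L):
    """Nonzero entries of phi^{(L)}: counts of label sequences along length-L walks."""
    counts = Counter()
    def extend(v, seq, steps):
        if steps == L:
            counts[tuple(seq)] += 1
            return
        for u in adjacency[v]:  # walks may revisit vertices and edges
            extend(u, seq + [bond(v, u), vertex_label[u]], steps + 1)
    for v in vertex_label:
        extend(v, [vertex_label[v]], 0)
    return counts

# L = 0: walks are single atoms, so phi^{(0)} lists atom-type counts.
print(walk_label_sequence_counts(0))  # ('C',) appears twice, ('O',) once
# L = 1: ordered bonded atom pairs; e.g. ('C', 1, 'C') is counted in both directions.
print(walk_label_sequence_counts(1))
```

Note that walks are directed and may retrace edges, so a single C–C bond contributes two length-1 walks, one per direction.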

#### 3. Comparing and contrasting the MACCS FP and random walk feature vector

Both the MACCS FP and random walk feature vector characterize a molecular graph by looking for a list of “patterns” in it—subgraph patterns for the MACCS FP and label sequences along walks for the random walk feature vector. Distinctions are as follows: (1) the MACCS FP looks for *variable-size* subgraphs, whereas the random walk feature vector looks at *fixed-size* (length) label sequences along walks; (2) the random walk feature vector *counts* patterns, whereas the MACCS FP only *indicates* the presence of patterns; (3) the random walk feature vector exhaustively counts *all possible* walk patterns, while the MACCS FP non-exhaustively looks for a pre-defined, curated *subset* of the possible subgraph patterns; (4) owing to the variably-sized list of subgraph patterns, including wildcard atoms/bonds, in the MACCS FP, multiple subgraphs can activate the same bit, and a single subgraph can activate multiple bits, in contrast with the random walk feature vector, whose entries have a one-to-one correspondence with the set of possible patterns; (5) the MACCS FPs $\varphi^{\mathrm{MACCS}}(G) \in \mathbb{R}^{166}$ are feasible to explicitly construct and store in memory, while the fixed-length random walk feature vectors $\varphi^{(L)}(G) \in \mathbb{R}^{S_L}$ are not for large *L*, owing to the large number of possible label sequences *S*_{L} present in length-*L* walks on molecular graphs;^{97} given *V* possible vertex labels and *E* possible edge labels, theoretically $|\mathcal{S}_L| = V^{L+1} E^L$ label sequences are possible in length-*L* walks, although many of these will not be observed in any plausible molecule.

### C. The fixed-length random walk graph kernel

The fixed-length random walk kernel^{97} *k*^{(L)}(*G*, *G*′) = *ϕ*^{(L)}(*G*)· *ϕ*^{(L)}(*G*′) allows us to circumvent explicit construction of *ϕ*^{(L)}(*G*) when employing a kernel method of machine learning, which can be cast to rely only on dot products *ϕ*^{(L)}(*G*) · *ϕ*^{(L)}(*G*′) of pairs of vector representations of molecular graphs *G*, *G*′.

#### 1. Definition and explanation of the *L*-RWGK

The fixed length-*L* random walk graph kernel^{96,99} (*L*-RWGK) $k^{(L)}: \mathcal{G} \times \mathcal{G} \to \mathbb{R}$ is a (symmetric, positive semidefinite) function such that evaluating $k^{(L)}(G, G')$ is implicitly equivalent to (i) mapping the two input graphs *G* and *G*′ into the random walk vector space $\mathbb{R}^{S_L}$ via the feature map *ϕ*^{(L)} and then (ii) taking the inner product of these two vectors,

$$k^{(L)}(G, G') = \varphi^{(L)}(G) \cdot \varphi^{(L)}(G'). \tag{5}$$

As seen from Eq. (4), term *i* of *k*^{(L)}(*G*, *G*′) in Eq. (5) is the number of pairs of length-*L* walks—one in graph *G* and the other in graph *G*′—with the label sequence $s_i \in \mathcal{S}_L$. Hence, *k*^{(L)}(*G*, *G*′) sums counts of pairs of length-*L* walks on the two graphs *G*, *G*′ sharing a label sequence,

$$k^{(L)}(G, G') = \sum_{s \in \mathcal{S}_L} |\{w \in \mathcal{W}_L(G) : \ell_w(w) = s\}| \, |\{w' \in \mathcal{W}_L(G') : \ell_{w'}(w') = s\}|. \tag{6}$$

As the term associated with a label sequence *s* is nonzero iff *both* graphs *G* and *G*′ possess a length-*L* walk with label sequence *s*, this sum may be restricted to be over the subset of label sequences in common between length-*L* walks on the two graphs, $\ell_w(\mathcal{W}_L(G)) \cap \ell_{w'}(\mathcal{W}_L(G'))$.

Intuitively, the 0-RWGK *k*^{(0)}(*G*, *G*′) sums counts of pairs of atoms of a particular type between the two graphs *G*, *G*′. The 1-RWGK *k*^{(1)}(*G*, *G*′) sums counts of pairs of two particular (ordered) atoms joined by a particular bond.

To evaluate the *L*-RWGK *k*^{(L)}(*G*, *G*′) without explicitly constructing the random walk feature vectors *ϕ*^{(L)}(*G*), *ϕ*^{(L)}(*G*′), we leverage the direct product graph to count pairs of label sequences in common between walks on two graphs *G*, *G*′.

#### 2. The direct product graph to compute RWGKs

Given two input graphs $G, G' \in \mathcal{G}$, we construct a new graph, the direct product graph $G_\times = G \times G' = (\mathcal{V}_\times, \mathcal{E}_\times, \ell_{v,\times}, \ell_{e,\times})$, to evaluate the *L*-RWGK *k*^{(L)}(*G*, *G*′) between *G* and *G*′. The direct product graph *G*_{×} is constructed to give a one-to-one mapping between (i) walks in *G*_{×} and (ii) pairs of walks—one on *G* and one on *G*′—with the same label sequence.

**Definition of the direct product graph.** Each vertex of the direct product graph *G*_{×} = *G* × *G*′ is an ordered pair of vertices—the first in *G* and the second in *G*′. The vertices of the direct product graph are constituted by the subset of pairs of vertices between *G* and *G*′ with the same vertex label,

$$\mathcal{V}_\times = \{(v, v') \in \mathcal{V} \times \mathcal{V}' : \ell_v(v) = \ell_v'(v')\}.$$

An undirected edge joins two vertices of the direct product graph *G*_{×} = *G* × *G*′ iff (i) the two involved vertices of *G* are joined by an edge in $\mathcal{E}$, (ii) the two involved vertices of *G*′ are joined by an edge in $\mathcal{E}'$, and (iii) these two edges in $\mathcal{E}$ and $\mathcal{E}'$ have the same label,

$$\mathcal{E}_\times = \left\{\{(u, u'), (v, v')\} \subseteq \mathcal{V}_\times : \{u, v\} \in \mathcal{E}, \ \{u', v'\} \in \mathcal{E}', \ \ell_e(\{u, v\}) = \ell_e'(\{u', v'\})\right\}.$$

We equip the direct product graph *G*_{×} = *G* × *G*′ with vertex- and edge-labeling functions that give the (same) label of the involved vertices and edges in *G* and *G*′,

$$\ell_{v,\times}((v, v')) = \ell_v(v) = \ell_v'(v'), \qquad \ell_{e,\times}(\{(u, u'), (v, v')\}) = \ell_e(\{u, v\}) = \ell_e'(\{u', v'\}).$$

Figure 4 shows the direct product graph of two molecular graphs as an example.
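The construction above can be assembled in a few lines of code. Below is a minimal pure-Python sketch using our own toy data structures (vertex labels as dicts, edge labels keyed by frozensets of vertex pairs); it is an illustration, not the paper's implementation:

```python
def direct_product(Vl1, El1, Vl2, El2):
    """Direct product of two labeled graphs.

    Vl*: dict vertex -> element label; El*: dict frozenset({u, v}) -> bond label.
    """
    # Vertices: ordered pairs of vertices sharing the same vertex label.
    Vx = [(v, w) for v in Vl1 for w in Vl2 if Vl1[v] == Vl2[w]]
    # Edges: both underlying edges exist and carry the same edge label.
    Ex = set()
    for i, (u, up) in enumerate(Vx):
        for (v, vp) in Vx[i + 1:]:
            e1, e2 = frozenset((u, v)), frozenset((up, vp))
            if e1 in El1 and e2 in El2 and El1[e1] == El2[e2]:
                Ex.add(frozenset(((u, up), (v, vp))))
    return Vx, Ex

# Toy inputs: G is an ethane-like C-C; G' is a C-C=O fragment (hydrogens excluded).
V1, E1 = {0: "C", 1: "C"}, {frozenset((0, 1)): 1}
V2, E2 = {0: "C", 1: "C", 2: "O"}, {frozenset((0, 1)): 1, frozenset((1, 2)): 2}
Vx, Ex = direct_product(V1, E1, V2, E2)
print(len(Vx), len(Ex))  # 4 vertices (the C-C pairings), 2 edges
```

The C=O edge of *G*′ induces no edge in *G*_{×}: no O vertex in *G* matches, and no single bond in *G* matches its label.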

**Utility of the direct product graph for evaluating the** *L***-RWGK.** By construction, any given length-*L* walk *w*_{×} on the direct product graph *G*_{×} = *G* × *G*′ with label sequence *ℓ*_{w,×}(*w*_{×}) corresponds to a unique pair of walks {*w*, *w*′}, with $w \in \mathcal{W}_L(G), w' \in \mathcal{W}_L(G')$, possessing the label sequence $\ell_w(w) = \ell_{w'}(w') = \ell_{w,\times}(w_\times)$, and vice versa (giving a bijection). This is illustrated in Fig. 4. Therefore, all three of the following quantities are equivalent:

1. the number of length-*L* walks on the direct product graph *G*_{×} = *G* × *G*′,
2. the number of pairs of length-*L* walks on *G* and *G*′ with the same label sequence, and
3. through Eq. (6), the value of the *L*-RWGK *k*^{(L)}(*G*, *G*′).

The key to counting length-*L* walks on *G*_{×} = *G* × *G*′—and thus to evaluating *k*^{(L)}(*G*, *G*′)—lies in its $|\mathcal{V}_\times| \times |\mathcal{V}_\times|$ adjacency matrix *A*_{×}, whose entry (*i*, *j*) is one if vertices $v_{\times,i}, v_{\times,j} \in \mathcal{V}_\times$ are joined by an edge and zero otherwise. The number of walks of length *L* from vertex *v*_{×,i} to vertex *v*_{×,j} is given by element (*i*, *j*) of $A_\times^L$. Summing over all possible starting and ending vertices of walks,

$$k^{(L)}(G, G') = \sum_{i=1}^{|\mathcal{V}_\times|} \sum_{j=1}^{|\mathcal{V}_\times|} [A_\times^L]_{ij}. \tag{11}$$

**Summary of evaluating the** *L***-RWGK**. Computing the *L*-RWGK *k*^{(L)}(*G*, *G*′), therefore, involves (i) constructing the direct product graph *G*_{×} = *G* × *G*′, (ii) building the adjacency matrix *A*_{×} of *G*_{×}, (iii) computing the *L*th power of *A*_{×}, $A_\times^L$, and then (iv) summing its entries.
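Once the adjacency matrix is in hand, steps (iii) and (iv) reduce to one line. A minimal NumPy sketch (the toy adjacency matrix below is ours, standing in for the adjacency matrix of a direct product graph):

```python
import numpy as np

def rwgk_from_adjacency(A_x, L):
    """Sum the entries of A_x^L to count length-L walks on the direct product graph."""
    return int(np.linalg.matrix_power(A_x, L).sum())

# Toy direct-product adjacency matrix: a path on three vertices.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
print(rwgk_from_adjacency(A, 0))  # 3: one length-0 walk per vertex
print(rwgk_from_adjacency(A, 1))  # 4: one length-1 walk per direction of each edge
print(rwgk_from_adjacency(A, 2))  # 6 length-2 walks
```

For a sweep over *L* ∈ {0, …, 12}, computing the powers incrementally (multiplying by *A*_{×} once per step) avoids redundant work.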

### D. The linear kernel between two MACCS structural key fingerprints

For comparison to the *L*-RWGK, note that the (linear) kernel applied to the MACCS FPs of a pair of molecular graphs,

$$k^{\mathrm{MACCS}}(G, G') = \varphi^{\mathrm{MACCS}}(G) \cdot \varphi^{\mathrm{MACCS}}(G'),$$

gives the number of subgraph patterns in the MACCS library that are exhibited by *both* graphs *G* and *G*′.
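In code, this kernel is simply a dot product of bit vectors, counting the shared "on" bits. A minimal sketch with hypothetical short fingerprints (real MACCS FPs have 166 keybits):

```python
import numpy as np

def maccs_linear_kernel(fp, fp_prime):
    """Number of subgraph patterns present in both molecules (dot product of bit vectors)."""
    return int(np.dot(fp, fp_prime))

fp_a = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # hypothetical 8-bit "fingerprints"
fp_b = np.array([1, 1, 0, 1, 0, 0, 1, 1])
print(maccs_linear_kernel(fp_a, fp_b))  # 3 patterns in common (bits 0, 3, and 6)
```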

### E. Support vector machines (SVMs) as classifiers

A support vector machine (SVM)^{83,84,100,101} is a supervised machine learning model for binary classification. To train an SVM using a labeled training dataset $\{(G_n, y_n)\}_{n=1}^N$, with $G_n \in \mathcal{G}$ and $y_n \in \{-1, 1\}$, we rely on a feature map $\varphi: \mathcal{G} \to \mathbb{R}^F$ to represent graphs in a vector space. Such feature maps can be constructed explicitly (the case with the MACCS FP) or implicitly via the use of a kernel function *k*(*G*, *G*′) = *ϕ*(*G*) · *ϕ*(*G*′) between pairs of data (the case with the random walk feature vector). We briefly explain the SVM here. For more details, see Refs. 83 and 100.

**The decision boundary.** Ultimately, an SVM classifier *f*(*G*) employs a hyperplane *w* · *ϕ*(*G*) + *b* = 0 in the feature space $\mathbb{R}^F$ as the decision boundary,

$$f(G) = \mathrm{sign}(w \cdot \varphi(G) + b), \tag{13}$$

with $w \in \mathbb{R}^F$ normal to the hyperplane, pointing in the direction of (most of) the positive examples, and $b \in \mathbb{R}$ specifying the offset of the hyperplane from the origin. Training an SVM constitutes using the training data to find the “optimal” hyperplane described by parameters *w*, *b*.

**The primal optimization problem.** The (soft margin) SVM seeks a hyperplane that separates most of the training data with a large margin defined by the thickness of the region |*w* · *ϕ*(*G*) + *b*| ≤ 1. The primal optimization problem associated with training an SVM is

$$\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{n=1}^N \xi_n \quad \text{subject to} \quad y_n(w \cdot \varphi(G_n) + b) \geq 1 - \xi_n, \ \ \xi_n \geq 0. \tag{14}$$

The slack variable *ξ*_{n} associated with data vector *ϕ*(*G*_{n}) allows, if it is nonzero, violation of the constraint *y*_{n}(*w* · *ϕ*(*G*_{n}) + *b*) ≥ 1 that it lies (i) on the correct side of the decision boundary and (ii) outside of or on the boundary of the margin. The first term in the objective function describes the size of the margin; the second term penalizes constraint violations. The hyperparameter *C* ≥ 0 trades a large margin for constraint violations.

**The dual optimization problem.** The Lagrangian dual of the primal optimization problem is an optimization over *N* Lagrange multipliers {*α*_{1}, …, *α*_{N}},

$$\max_{\alpha} \ \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \, \varphi(G_i) \cdot \varphi(G_j) \quad \text{subject to} \quad 0 \leq \alpha_n \leq C, \ \ \sum_{n=1}^N \alpha_n y_n = 0, \tag{17}$$

where the solution *α* to the dual problem and the solution *w* to the primal problem satisfy

$$w = \sum_{n=1}^N \alpha_n y_n \varphi(G_n). \tag{20}$$

**The kernel trick.** The objective of the dual problem in Eq. (17) depends only on the dot products *ϕ*(*G*_{i}) · *ϕ*(*G*_{j}) of the training data. The kernel trick is to replace *ϕ*(*G*_{i}) · *ϕ*(*G*_{j}) with a kernel function *k*(*G*_{i}, *G*_{j}) = *ϕ*(*G*_{i}) · *ϕ*(*G*_{j}), bypassing the explicit mapping of the graphs *G*_{i} and *G*_{j} into the vector space $\mathbb{R}^F$ to compute the dot product. Indeed, we use the *L*-RWGK *k*^{(L)}(*G*_{i}, *G*_{j}) in Eq. (5) in place of constructing *ϕ*^{(L)}(*G*_{i}) and *ϕ*^{(L)}(*G*_{j}) and taking their dot product.

Using Eq. (20), we can also rewrite the decision rule in Eq. (13) for a new graph *G* in terms of the kernel between it and the graphs in the training dataset,

$$f(G) = \mathrm{sign}\left(\sum_{n=1}^N \alpha_n y_n k(G_n, G) + b\right), \tag{21}$$

with *α*_{n} being the solution to the dual problem. Equation (21) allows us to also bypass mapping new molecular graphs *G* into the feature space $\mathbb{R}^F$ via *ϕ* when classifying them with the trained SVM.

**The support vectors.** An SVM is a *sparse* kernel machine;^{100} the decision rule in Eq. (21) will depend on a *subset* of the training data, the *support vectors* *ϕ*(*G*_{n}) with *α*_{n} > 0, which lie inside or on the boundary of the margin, or outside the margin but on the wrong side of the decision boundary.

**The Gram matrix.** When, in practice, invoking the kernel trick, we store the inner products between all pairs of molecular graphs in an *N* × *N* Gram matrix *K*, whose element (*i*, *j*) gives the kernel *k*(*G*_{i}, *G*_{j}) between molecular graphs *G*_{i} and *G*_{j}.

**Centering.** SVMs tend to perform better if the feature vectors {*ϕ*(*G*_{1}), …, *ϕ*(*G*_{N})} are first centered.^{102} Again, to avoid explicitly constructing them, the double-centering trick^{83} allows us to obtain the inner products of the centered feature vectors from the inner products of the uncentered feature vectors in the Gram matrix *K*. Particularly, the centered Gram matrix $\tilde{K} := CKC$ with centering matrix $C = I - \frac{1}{N} o o^\intercal$ (*I* the identity matrix, *o* a vector of ones).^{103}
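The double-centering trick can be verified numerically in a few lines: centering the feature vectors explicitly and centering the Gram matrix via $CKC$ give the same inner products. A NumPy sketch with random stand-in feature vectors (our own toy data, not the paper's):

```python
import numpy as np

def center_gram(K):
    """Double-centering: K_tilde = C K C with C = I - (1/N) o o^T."""
    N = K.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N
    return C @ K @ C

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))       # five stand-in feature vectors
K = X @ X.T                       # uncentered Gram matrix of inner products
X_centered = X - X.mean(axis=0)   # explicit centering, for comparison
assert np.allclose(center_gram(K), X_centered @ X_centered.T)
```

The identity holds because subtracting the column means of *X* is exactly left-multiplication by the (symmetric) matrix *C*.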

### F. Classification performance metrics

The performance metrics of a classifier $\hat{y} = f(G)$ include the following (measured over a labeled test dataset):

- Accuracy: the fraction of examples classified correctly.
- Precision: among the examples classified as toxic ($\hat{y}_n = 1$), the fraction that are truly toxic (*y*_{n} = 1).
- Recall: among the examples that are truly toxic (*y*_{n} = 1), the fraction that are correctly predicted as toxic ($\hat{y}_n = 1$).
- F1 score: the harmonic mean of precision and recall.
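These metrics, together with the F1 score (the harmonic mean of precision and recall) reported in the results, follow directly from the confusion counts. A minimal pure-Python sketch with made-up labels:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from labels in {-1, +1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))   # true positives
    fp = sum(t == -1 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == -1 for t, p in zip(y_true, y_pred))  # false negatives
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Made-up labels for illustration.
y_true = [1, 1, -1, -1, 1, -1]
y_pred = [1, -1, -1, 1, 1, -1]
print(classification_metrics(y_true, y_pred))
```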

## IV. RESULTS

We now train and evaluate the performance of a support vector machine (SVM) to classify the toxicity of pesticide molecules to honey bees using two different molecular representations:

1. the fixed-length-*L* random walk feature vector and
2. the MACCS structural key fingerprint (FP).

We explicitly construct the MACCS FPs but invoke the kernel trick and rely on the fixed-length-*L* random walk graph kernel (*L*-RWGK) in place of a dot product in the random walk feature space.

### A. Machine learning procedures

**Data preparation**. The data are prepared from the SMILES strings representing the pesticide molecules in the BeeToxAI dataset.^{57}

*MACCS fingerprints.* We used RDKit^{93} to explicitly construct the MACCS FPs of the pesticide molecules, {*ϕ*^{MACCS}(*G*_{1}), …, *ϕ*^{MACCS}(*G*_{N})}. Then, we computed the dot product *ϕ*^{MACCS}(*G*_{i}) · *ϕ*^{MACCS}(*G*_{j}) between each pair of MACCS FPs and stored them in the Gram matrix *K*^{MACCS}.

*Fixed-length random walk graph kernel.* We used MolecularGraph.jl to obtain the molecular graphs {*G*_{1}, …, *G*_{N}} representing the pesticide molecules. For each pair of graphs (*G*_{i}, *G*_{j}), we constructed their direct product graph *G*_{i} × *G*_{j}, evaluated the *L*-RWGK *k*^{(L)}(*G*_{i}, *G*_{j}) via Eq. (11) (using our own code), and then stored it in a Gram matrix *K*^{(L)} for *L* ∈ {0, …, 12}.

**A train-test run.** For both the MACCS FP and fixed-length random walk representations, a “train-test run” of an SVM comprises the following procedure. First, we randomly shuffle and then split the examples into an 80%/20% train/test split. We stratify the split to preserve the distribution of class labels in the two splits. Second, using only the training split, we use stratified *K* = 3-fold cross-validation to determine the optimal hyperparameter(s). For the MACCS fingerprint, the hyperparameter is the *C* parameter of the SVM. For the fixed-length random walk representation, the hyperparameters are both *C* and *L*, the length of the random walks. Through grid search, we choose the optimal hyperparameter(s) as the one(s) providing the *K* SVMs (each trained on *K* − 1 folds) with the maximal mean F1 score on the validation sets (one fold each). The hyperparameter grid comprises (i) log_{10}*C* ∈ {−6, …, 1} and (ii) *L* ∈ {0, …, 12}. Finally, we train a deployment SVM with the optimal hyperparameter(s) on all training data and evaluate its performance (precision, recall, accuracy, and F1 score) on the hold-out test set.

Note that, for each SVM trained, we center the Gram matrix *K* pertaining only to the training graphs via the double-centering trick.^{83} We adopt a similar centering trick^{103} for the Gram matrix giving the similarity of the test graphs with the training graphs when we feed it as input to the SVM for predictions on the test split.

We used the SVM implementation and Gram matrix centerer in scikit-learn.^{105} We scaled the *C* parameter in Eq. (14) seen by the slack variables pertaining to each class to balance penalization of constraint violations for each class.
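One train-test run with a precomputed Gram matrix can be sketched as follows. This is a simplified illustration on synthetic stand-in data, omitting the cross-validated hyperparameter grid search; `kernel="precomputed"` and `class_weight="balanced"` are the scikit-learn options corresponding to the kernel trick and the per-class *C* scaling described above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6))        # stand-ins for feature vectors phi(G_n)
y = np.where(X[:, 0] > 0, 1, -1)    # synthetic binary labels
K = X @ X.T                         # Gram matrix, as from a graph kernel

# Stratified 80%/20% train/test split on the example indices.
idx_train, idx_test = train_test_split(
    np.arange(len(y)), test_size=0.2, stratify=y, random_state=0)

clf = SVC(kernel="precomputed", C=0.1, class_weight="balanced")
clf.fit(K[np.ix_(idx_train, idx_train)], y[idx_train])  # train-vs-train Gram block
y_hat = clf.predict(K[np.ix_(idx_test, idx_train)])     # test-vs-train Gram block
print("test accuracy:", (y_hat == y[idx_test]).mean())
```

A full run would wrap this in stratified 3-fold cross-validation over the hyperparameter grid and repeat the whole procedure over many random splits.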

**Overall procedure.** For both the MACCS fingerprint and fixed-length random walk representations, we conducted 2000 (stochastic, owing to the random train/test and *K*-fold splits) train-test runs; for each run, we evaluated the performance of a hyperparameter-optimized, trained SVM classifier on the hold-out test set. Conducting multiple train-test runs allows us to report both expected performance and variance in the performance.

### B. Cross-validation results

Figure 5(a) shows the empirical distribution of optimal hyperparameters during the *K* = 3-fold cross-validation routine. The mode of the distribution of the optimal *C* parameter for the MACCS-FP-SVM is 0.1. The mode of the joint distribution of the optimal walk length *L* and SVM *C* parameter for the *L*-RWGK-SVM is *L* = 4 and *C* = 0.001. In conclusion, the pesticide molecules were best described by random walks of length *L* = 4 for bee toxicity prediction. The optimal *C* parameter tended to decrease with the walk length, consistent with the view of the inverse of *C* as a regularization parameter expected to increase when the representation of the inputs is more complex.

### C. Classification performance on the test set

Figure 5(b) shows the mean and standard deviation of the accuracy, precision, recall, and F1 score of the *L*-RWGK-SVM and MACCS-FP-SVM on the hold-out test set of pesticide molecules. The performance of the *L*-RWGK-SVM is on par with/slightly lower than that of the MACCS-FP-SVM but has the advantage of a lower variance (see error bars). The *L*-RWGK-SVM achieves, on average, an F1 score, precision, recall, and accuracy of 0.69, 0.68, 0.71, and 0.81, respectively.

### D. Interpreting the MACCS-FP-SVM

Explaining the predictions of and interpreting a molecular machine learning model can give chemical insights and foster trust—or distrust, by uncovering “Clever Hans” predictions—in the model.^{106,107,125}

In contrast to the *L*-RWGK-SVM, which leverages the kernel trick, the MACCS-FP-SVM lends itself to interpretation because we may explicitly construct *w* via Eq. (20).

We interpret a MACCS-FP-SVM toxicity classifier by inspecting the vector *w* ∈ ℝ^{166} normal to its separating hyperplane. Weight *w*_{i} ∈ ℝ in *w* is associated with MACCS keybit *i*, which looks for a particular subgraph pattern in the molecular graph. For ease of interpretability, here we do not standardize the input MACCS FPs and instead retain them as bit vectors; consequently, a positive (negative) coefficient *w*_{i} implies the presence of the subgraph pattern described by MACCS keybit *i* tends to produce a prediction of toxicity (non-toxicity).
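For a linear kernel, *w* is exposed directly by a fitted scikit-learn SVM. A minimal sketch on random stand-in bit vectors (the recovered weights and keybit indices are meaningless here, but the mechanics carry over):

```python
# Sketch: recover w for a linear-kernel SVM on 166-bit vectors and rank keybits.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 166)).astype(float)  # stand-in MACCS FPs
y = rng.integers(0, 2, size=200)                       # stand-in labels

clf = SVC(kernel="linear", C=0.1, class_weight="balanced").fit(X, y)
w = clf.coef_.ravel()  # w = sum_i alpha_i y_i x_i over the support vectors

top_toxic = np.argsort(w)[-4:][::-1]  # four most positive weights -> "toxic" keybits
top_nontoxic = np.argsort(w)[:4]      # four most negative weights -> "non-toxic" keybits
```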

Figure 6 visualizes the *w* vector of our interpretable MACCS-FP-SVM, trained on all of the data with the optimal *C* hyperparameter found in Fig. 5(a) (*C* = 0.1). We inspect the molecular patterns corresponding to the four most positive (top) and four most negative (bottom) coefficients (bars decorated with *), associated with predictions of toxicity and non-toxicity, respectively. We show their MACCS keybits and the SMARTS strings specifying the molecular patterns they look for, along with an example molecule in the dataset exhibiting each pattern (see highlight); the molecules on the top (bottom) were correctly predicted to be toxic (non-toxic).

We caution against mistaking *association* for *causality* in our interpretation of the SVM as (i) the two random variables indicating the presence of two subgraphs (described by two MACCS keybits) in a molecule are generally not independent and (ii) anthropogenic biases^{109,110} could be involved in the generation and curation of the training dataset.

### E. Run times

The majority of the computational run time for generating our results was in computing the 382 × 382 Gram matrices *K*^{(L)} involving the *L*-RWGK. Using four cores, the run time ranged from less than five minutes (*L* = 0, 1) to ∼20–25 minutes for *L* ≥ 7 (see the supplementary material).

## V. DISCUSSION

We trained and evaluated a support vector machine classifier that predicts the toxicity of pesticides to honey bees. We compared two molecular vector representations: (1) MACCS fingerprints listing the presence/absence of a set of pre-defined subgraph patterns in the molecular graph and (2) a random walk feature vector listing counts of label sequences along all fixed-length walks on the molecular graph. While we explicitly constructed the fingerprints, we relied on the fixed-length random walk graph kernel for dot products in the random walk vector space. Remarkably, the *L*-RWGK gives classifier performance similar to that of the MACCS FP, even though it builds in fewer prior assumptions about which patterns are predictive of molecular properties.

Graph kernels have been previously used with SVMs, Gaussian processes, and kernel regression for molecular machine learning tasks,^{78,111} such as to classify proteins,^{112} score protein–protein interactions,^{113} predict methane uptake in nanoporous materials,^{114} predict atomization energy of molecules,^{115,116} and predict thermodynamic properties of pure substances.^{117}

A Gaussian process model^{85} using the *L*-RWGK would enable uncertainty quantification in the prediction.

As the BeeToxAI dataset^{57} was collected from many sources, including the scientific literature, anthropogenic biases could be present, e.g., in the choices of which molecules to be tested for bee toxicity. As a result, the source and target data distributions could differ, and the performance measure on the hold-out set (sampled from the training distribution) may not reflect the generalization error when the model is deployed. Reference 110 articulates various types of this “dataset shift.” Anthropogenic biases in datasets in chemistry have been uncovered and shown to cause machine learning algorithms trained on them to exhibit poorer generalization performance.^{108,109}

Disadvantages of the *L*-RWGK include (i) the compute and memory cost of evaluating it, hence its poor scalability to large molecules and large datasets,^{78} and (ii) tottering. Expanding on (ii): by definition, the vertices in a walk [see Eq. (1)] need not be distinct (if they were, the walk would be a *path*). Thus, long walks that totter back and forth between the same few vertices—e.g., at the extreme, *w* = (*u*, *v*, *u*, *v*, …, *u*, *v*)—are accounted for in the *L*-RWGK. These walks contribute no extra information about the similarity of two graphs—e.g., for our extreme example, nothing beyond the length-2 walk (*u*, *v*). Tottering could thus “dilute” the similarity metric expressed by the random walk kernel.^{99} A modification of the random walk kernel can prevent tottering walks^{118} from contributing to the similarity metric.
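A toy enumeration makes the prevalence of tottering concrete. This sketch ignores labels for simplicity and simply counts, on a three-vertex path graph, how many length-4 walks immediately revisit a vertex (*w*_{k+2} = *w*_{k}), which a tottering-free kernel would exclude:

```python
# Count tottering length-L walks on a toy path graph u-v-x.
import numpy as np
from itertools import product as iproduct

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])  # adjacency matrix of the path graph

def walks(A, L):
    """All length-L walks (vertex sequences of L+1 vertices joined by edges)."""
    n = A.shape[0]
    return [w for w in iproduct(range(n), repeat=L + 1)
            if all(A[w[k], w[k + 1]] for k in range(L))]

def totters(w):
    """True if the walk ever immediately backtracks: w[k+2] == w[k]."""
    return any(w[k + 2] == w[k] for k in range(len(w) - 2))

ws = walks(A, 4)
n_tottering = sum(totters(w) for w in ws)
print(len(ws), n_tottering)  # prints "12 12"
```

On this graph every length-4 walk totters, since the longest non-backtracking walk on a three-vertex path has length 2, illustrating how much of the walk count the *L*-RWGK can spend on redundant walks.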

The *L*-RWGK can be generalized further by defining a kernel between two walks *w* and *w*′ as a product of the kernel between the edges and vertices along the walk.^{78,97,112} A (non-Dirac) kernel between vertices could account for similarity of chemical elements.

In addition to the fixed-length-*L* random walk kernel, the (i) max-length-*L* random walk kernel and (ii) geometric random walk kernel count pairs of length-*ℓ* walks with a shared label sequence, (i) for *ℓ* ∈ {0, …, *L*} and (ii) for all *ℓ* ≥ 0, with the contribution of length-*ℓ* walks geometrically down-weighted so that the sum converges.^{97,99}
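As a concrete illustration of the product-graph view underlying these kernels, the fixed-length-*L* kernel equals 1^T A_×^L 1, where A_× is the adjacency matrix of the label-compatible direct product of the two graphs. A minimal sketch with vertex labels only (the kernel in the paper also matches edge labels; the helper names are illustrative):

```python
# Fixed-length-L random walk kernel via the direct product graph (toy version).
import numpy as np

def product_graph(A1, labels1, A2, labels2):
    """Adjacency of the direct product graph, keeping only vertex pairs
    (u, v) whose labels match; also returns the label-match mask."""
    n1, n2 = len(labels1), len(labels2)
    Ax = np.kron(A1, A2)  # tensor-product adjacency
    mask = np.array([labels1[u] == labels2[v]
                     for u in range(n1) for v in range(n2)])
    Ax[~mask, :] = 0  # label-mismatched pairs cannot appear in a walk
    Ax[:, ~mask] = 0
    return Ax, mask

def fixed_length_rwgk(A1, labels1, A2, labels2, L):
    """Number of pairs of length-L walks with matching label sequences."""
    Ax, mask = product_graph(A1, labels1, A2, labels2)
    if L == 0:
        return int(mask.sum())  # length-0 walks = label-matched vertex pairs
    return int(np.linalg.matrix_power(Ax, L).sum())  # 1^T Ax^L 1

# toy molecule: the labeled path C-O-C, compared with itself
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
labels = ["C", "O", "C"]
print(fixed_length_rwgk(A, labels, A, labels, L=2))  # prints 20
```

For this toy pair, the length-2 label sequences are C-O-C (4 walks, 16 pairs) and O-C-O (2 walks, 4 pairs), giving 20, which matches the product-graph computation.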

In addition to random walk kernels, other graph kernels can be used to express the similarity of molecular graphs:^{78} shortest-path,^{119} graphlet,^{120} tree- and cyclic-pattern,^{121,122} and optimal assignment kernels.^{123}

## SUPPLEMENTARY MATERIAL

The supplementary material includes (i) a comparison of the (a) linear kernel between MACCS FPs and (b) the *L* = 4 random walk graph kernel and (ii) the run times for computing the random walk graph kernel.

## ACKNOWLEDGMENTS

We thank Jana Doppa and Aryan Deshwal for stimulating Cory’s interest in graph kernels. We acknowledge support from the National Science Foundation (Award No. 1920945).

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

## DATA AVAILABILITY

The data and code to fully reproduce this study are available on GitHub (Ref. 124).