Large classes of materials systems in physics and engineering are governed by magnetic and electrostatic interactions. Continuum or mesoscale descriptions of such systems can be cast in terms of integral equations, whose direct computational evaluation requires *O*(*N*^{2}) operations, where *N* is the number of unknowns. Such a scaling, which arises from the many-body nature of the relevant Green’s function, has precluded widespread adoption of integral methods for the solution of large-scale scientific and engineering problems. In this work, a parallel computational approach is presented that relies on using scalable open source libraries and utilizes a kernel-independent Fast Multipole Method (FMM) to evaluate the integrals in *O*(*N*) operations, with *O*(*N*) memory cost, thereby substantially improving the scalability and efficiency of computational integral methods. We demonstrate the accuracy, efficiency, and scalability of our approach in the context of two examples. In the first, we solve a boundary value problem for a ferroelectric/ferromagnetic volume in free space. In the second, we solve an electrostatic problem involving polarizable dielectric bodies in an unbounded dielectric medium. The results from these test cases show that our proposed parallel approach, which is built on a kernel-independent FMM, can enable highly efficient and accurate simulations and allow for considerable flexibility in a broad range of applications.

## I. INTRODUCTION

Massively parallel computer hardware and the corresponding software have enabled the solution of increasingly complex problems in science and engineering. Such problems are often cast in terms of partial differential equations (PDEs). In many cases, it is convenient to formulate the PDEs in terms of integral equations, which can then be solved by relying on a variety of numerical methods. For the particular case of boundary integral equations, the Boundary Element Method (BEM)^{1} has found numerous applications in physical problems ranging from solids and fluids^{2} to multiphase materials,^{3} heat transfer,^{4} electrostatics,^{5} and magnetostatics. For example, BEMs were used in fluid problems to study the motion and deformation of bubbles^{6} and to determine the phase diagram of complex fluids;^{7} BEMs have also been used to address three-dimensional (3D) linear elasticity problems involving particles embedded in a binder^{8} and in crack analysis in 3D time harmonic elastodynamics.^{9} BEMs are particularly well-suited for problems that involve bodies embedded in an infinite medium, with the boundary condition that the relevant potential decays to zero at infinity. If the embedding space is homogeneous and the governing equations for the underlying physics are linear, BEMs are frequently used in order to avoid meshing the embedding space exterior to the bodies. In this work, we consider two versions of this type of problem as explicit test cases: interacting ferromagnetic bodies embedded in space and dielectric bodies embedded in an infinite dielectric medium. However, our proposed method is widely applicable to problems in which the mathematical formulation can be cast in terms of a boundary integral equation.

For the problem of interacting ferromagnetic bodies embedded in space, the interactions between the bodies are calculated using an open boundary condition on their boundaries. Specifically, the non-zero, finite magnetization of the bodies leads to a discontinuity in the magnetostatic potential at their boundaries. One approach to this problem is to use a hybrid finite element (FEM)–BEM method to calculate the non-local magnetostatic interactions between the magnetic bodies.^{10} The method has the advantage that it allows for arbitrarily shaped bodies and requires no mesh between them. A disadvantage is that the integral of the Green’s function over the boundary of all magnetic bodies leads to a scaling of *O*(*N*^{2}) in direct implementations, where *N* is the number of degrees of freedom (DoFs) on the boundaries of all magnetic bodies. Several techniques have been proposed to increase the efficiency of the boundary integral calculation. Examples include the tree-code algorithm and the hierarchical matrices method,^{11,12} which exhibit computational complexities of *O*(*N*log*N*) and *O*(*N*), respectively.

In electrostatics, a similar class of problems is concerned with the electrostatic polarization of finite-size dielectric bodies embedded in an infinite dielectric medium. Free charges carried by these bodies, or free point charges in the medium, are sources of electrostatic fields. Because the relative permittivity of the medium is usually different from those of the bodies, the electrostatic field is discontinuous across the interface between a body and the medium. As a consequence, bound surface charges are induced on the interface to compensate for the discontinuity in the electrostatic field. In order to calculate the induced bound charges on the interface, a BEM was proposed using a variational approach^{5} to solve the underlying Poisson equation. In numerical simulations of the polarizable bodies, BEMs have been used for stationary or mobile dielectrics.^{13–17} A central aspect of this class of problems is the calculation of the distribution of bound surface charge density through solutions of a linear system of equations *Ax* = *b*, where *A* is a dense matrix whose entries depend on the geometry of the dielectric bodies, *x* represents the induced bound surface charge density, and *b* represents contributions from free charges on the dielectric bodies or free point charges in the exterior medium. The iterative Generalized Minimal Residual (GMRES)^{18} method provides an efficient numerical method to solve the resulting dense linear system of equations.^{19} However, the matrix-vector multiplication in each GMRES iteration scales as *O*(*N*^{2}) in any direct implementation. In Refs. 17 and 19, the matrix-vector multiplication is accelerated using a fast Ewald solver, i.e., the particle-particle particle-mesh method (PPPM or P^{3}M),^{20} which offers a scaling of *O*(*N*log*N*).
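The matrix-free GMRES idea described above can be sketched in a few lines. The illustration below (in Python with NumPy/SciPy, a hypothetical choice for this paper's C++ setting) replaces the BEM operator with a generic well-conditioned stand-in kernel matrix and hands GMRES only its action on a vector; in a real implementation, the matvec would evaluate the boundary integral instead of the *O*(*N*^{2}) product shown here.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(0)
N = 200

# Stand-in geometry: random points playing the role of surface elements.
pts = rng.random((N, 3))

def matvec(x):
    """Action of A on x without storing A persistently.

    A here is a generic diagonally dominant kernel matrix standing in for
    the BEM operator; a real implementation would evaluate the boundary
    integral (e.g., with an FMM) instead of this dense O(N^2) product.
    """
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    K = 1.0 / (1.0 + d)          # smooth stand-in kernel
    K[np.diag_indices(N)] += N   # make the system well conditioned
    return K @ x

A = LinearOperator((N, N), matvec=matvec)
b = rng.random(N)
x, info = gmres(A, b)
assert info == 0  # converged without ever assembling A as a global matrix
```

The same pattern (define the operator's action, never the matrix) is what PETSc's matrix-free shell matrices provide in the paper's actual implementation.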

In this work, we have developed an efficient and scalable parallel computational approach for BEMs that is built on the scalable parallel libraries libMesh (http://libmesh.github.io)^{21} and ScalFMM (http://scalfmm-public.gforge.inria.fr).^{22} The boundary integrals are accelerated using a kernel-independent fast multipole method (FMM)^{23} with *O*(*N*) scaling. This kernel-independent FMM is an interpolation-based FMM that utilizes Chebyshev interpolation and low-rank approximation and can also be used for kernels that are known only numerically. It was first developed for non-oscillatory kernels and later extended for oscillatory kernels.^{24} Our approach to boundary integrals also adopts a matrix-free method that requires no explicit storage of a global matrix, thus enabling simulations with a memory cost of only *O*(*N*). The corresponding software is made freely available to potential users. In the remainder of this paper, we first introduce the general boundary integral problem. We then discuss methods for the magnetostatic and electrostatic problems and our implementations of BEM; we apply our computational approach to several model problems and test its accuracy, efficiency, and scalability. We conclude with a summary and suggestions for future directions for improvements of the proposed approaches and elaborate on several research areas and physical phenomena to which our approach could be applied.

## II. METHODS

### A. General integral problems

Many integral problems share configurations similar to that shown schematically in Fig. 1. Space is divided into an exterior region Ω and an interior region Γ consisting of the union of different bodies with arbitrary shapes, for example, dielectric or ferroelectric/ferromagnetic bodies. It is assumed here that there are no intersections or contacts between any two bodies. In most problems, the solution is expressed in terms of functions *f*(**r**) obtained from the integral, over the boundary of all bodies (∂Γ), of a kernel *K*(**r**, **r**′) and a source function *g*(**r**). The kernel is derived from a Green’s function that represents the response of the underlying physics to a delta-function source; thus,

In the numerical integration of the above integral using a Gaussian quadrature rule, *f*(**r**_{i}) can in general be expressed as

Equation (2) is exactly the type of summation for which the FMM is used, with *O*(*N*) scaling rather than the *O*(*N*^{2}) scaling that results from using the direct method over all points **r**_{i} and **r**_{j}. The kernel-independent FMM enables fast computation of the integral with arbitrary kernels for a variety of problems; two specific test cases are considered in Secs. II B and II C.
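As a concrete illustration of the summation in Eq. (2), a direct *O*(*N*^{2}) evaluation might look as follows. The points, weights, and the 1/*r* example kernel are all synthetic placeholders; it is exactly this nested double loop over targets and sources that the FMM replaces with an *O*(*N*) algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

# Quadrature points r_j on the boundary, with weights w_j and source values g_j.
src = rng.random((300, 3))
w = rng.random(300)
g = rng.random(300)

# Target points r_i where f is evaluated.
tgt = rng.random((50, 3)) + 2.0  # shifted so no target coincides with a source

def kernel(r, rp):
    """Example kernel K(r, r') = 1/|r - r'| (Laplace Green's function up to 4*pi)."""
    return 1.0 / np.linalg.norm(r - rp)

# Direct evaluation of f(r_i) = sum_j K(r_i, r_j) g(r_j) w_j -- O(N^2).
f = np.array([sum(kernel(ri, rj) * gj * wj for rj, gj, wj in zip(src, g, w))
              for ri in tgt])
assert f.shape == (50,) and np.all(f > 0)
```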

### B. Magnetostatic problem

The magnetostatic problem is ubiquitous for any system that involves patterned micron- or sub-micron-size magnetic bodies, such as read or write heads in magnetic hard disk drives, magnetic nanoparticles,^{25} coupled magnetic disks,^{26,27} or artificial spin ices.^{28} In patterned ferromagnetic systems, there typically arises a competition between the short-range exchange interactions and long-range magnetostatic interactions. The latter originate from the volume pole density ∇ ⋅ **M**(**r**) and the surface pole density $\hat{\mathbf{n}} \cdot \mathbf{M}(\mathbf{r})$. Computing the magnetostatic field from the surface pole density then involves solving a Poisson equation with sources on the boundary of a finite body and with the boundary condition that the magnetic scalar potential goes to zero far away from the body. In the framework of a hybrid FEM–BEM, the magnetic scalar potential *ϕ* is split into *ϕ* = *ϕ*_{1} + *ϕ*_{2}, where *ϕ*_{1} solves the Poisson equation in the magnetic region Γ (see Fig. 1),

with the boundary condition

Here, **M**(**r**) is the magnetization density of the magnetized bodies *p _{i}*, and **n** is the outward-pointing (from the interior of the magnetic region) unit normal vector on the boundary of the magnetic region (∂Γ). After *ϕ*_{1} is obtained by solving the Poisson equation, Eq. (3), *ϕ*_{2} is calculated first on ∂Γ by the standard double layer integral

where **r** is the coordinate vector on ∂Γ, and Ω(**r**) is the solid angle subtended by ∂Γ at **r**. On the smooth surface of a sphere and on the faces of a cuboid, Ω(**r**) = 2*π*; on the edges of a cuboid, Ω(**r**) = *π*; and at the corners of a cuboid, Ω(**r**) = *π*/2. With *ϕ*_{2} known on ∂Γ, these values are then applied as a Dirichlet boundary condition for the Laplace equation below for *ϕ*_{2} in Γ,

where the test function *ψ* ∈ *H*^{1}(Γ), and *H*^{1}(Γ) is the usual Sobolev space. The weak form of the Laplace equation, Eq. (6), is simply

### C. Electrostatic problem

A large class of systems in materials science involves polarizable metal oxide colloidal particles in a dielectric continuum. The electrostatic field of a particle polarizes nearby particles, and this polarization then propagates throughout the entirety of the system in a cooperative manner. Under some circumstances, polarization interactions can in fact lead to attractive forces between charged particles having the same charge, a feature that has been revealed in striking detail for metal oxide micro-particles.^{29} Here we consider polarizable dielectric bodies *p _{i}* embedded in an infinite dielectric medium Ω. Dielectric polarization arises on the surface of the bodies when the relative permittivities of the bodies are different from that of the medium. Bound surface charges are then induced because of the discontinuity of the electrostatic field across the interfaces between bodies and medium. In the following, we first review key equations that facilitate implementation of the BEM. For details on the derivation of the method, we refer to Refs. 5 and 17. The relative permittivity of the medium is *ϵ _{m}*, the relative permittivity of the *i*th body is *ϵ _{i}*, and the point charge carried by the *i*th particle is *Q _{i}*. In this method, *Q _{i}* is represented by an equivalent free surface charge density *σ _{f}* on the surface of each body (see Section IV.I in Ref. 17). In the particular case of spherical bodies, $\sigma_f = Q_i/(4\pi r_i^2)$ on the *i*th body with radius *r _{i}*. The induced bound surface charge density *σ _{b}* can then be calculated by solving

where

with

In the above equations, $\bar{\epsilon}_i = (\epsilon_i + \epsilon_m)/2$, Δ*ϵ _{i}* = *ϵ _{m}* − *ϵ _{i}*, *ϵ*_{0} is the permittivity of vacuum, **n** is the outward-pointing normal unit vector at the interface between a body and the medium, and **E**_{b} is the electrostatic field that arises from the bound surface charge density *σ _{b}*. The integral is performed only over the surfaces of all dielectric bodies (∂Γ). The right-hand side *b* in Eq. (9) is obtained from

with

where **E**_{f} is the electrostatic field that arises only from the free charges, *σ _{f}* is the free surface charge density on the surface of the dielectric bodies, and *ρ _{f}* is the volume free charge density in the medium Ω. In our examples, only free charges carried by the dielectric bodies are considered, and there are no free volume charges in the medium, so the second term in Eq. (13) is absent. Once we obtain the total surface charge density *σ* = *σ _{f}* + *σ _{b}* and the total electrostatic field **E** = **E**_{f} + **E**_{b} on the surface of the dielectric bodies (∂Γ), the force vector acting on the *i*th body, **F**_{i}, can be calculated by a boundary integral over the surface of the *i*th body,

where **f**(**r**) is the surface force density on the surface of the body,

## III. IMPLEMENTATIONS

In the magnetostatic problem, a standard FEM is used to solve the Poisson and Laplace equations. The magnetic bodies are first discretized and meshed, in our case using the CUBIT Geometry and Mesh Generator;^{30} the mesh is then read by libMesh. Shape functions and weights for Gaussian quadrature points are provided by libMesh as well. For the Laplace equation, a Dirichlet boundary condition is enforced using a penalty method.^{31} The system of equations is then solved using GMRES, as provided by libMesh and its underlying PETSc^{32} library. For the boundary integral, the strategy is to create a separate boundary mesh on the fly during the computations. The boundary mesh is extracted from the boundary of the volume mesh of all bodies using libMesh’s functionalities, and we then transfer nodal values of the boundary source term from the boundary of the volume mesh to the corresponding DoFs on the boundary mesh before executing the boundary integration.

In the electrostatic problem, because dielectric polarization occurs at the interface, only the surface of the dielectric bodies, rather than their volume, is discretized with a triangular mesh using CUBIT. Constant monomial shape functions are used to approximate variables such as surface charges and electrostatic fields in each element. The boundary integration on each element is done using a one-point quadrature rule: the quadrature point in each element is set to be the centroid of that element, and the weight is the surface area corresponding to that element. If the dielectric bodies are spheres, the surface area is calculated analytically as the spherical area; if the dielectric bodies are not spheres, the surface area is approximated by the flat area of each triangular element. The point where variables are approximated in each element is set to be the same as the quadrature point. The linear system of equations Eq. (9) is solved using GMRES in a matrix-free form provided by libMesh and PETSc, i.e., we do not explicitly form the matrix *A*; rather, we define the action of the matrix-vector multiplication *Aσ _{b}* that is used in each GMRES iteration. The most computationally expensive part of the matrix-vector multiplication is calculating the electrostatic fields, which is a boundary integral with the kernel multiplied by the surface charge densities [see Eqs. (11) and (13)]. In a direct method, the scaling of the matrix-vector multiplication is *O*(*N*^{2}). In Refs. 17 and 19, a fast Ewald solver is used to reduce the scaling to *O*(*N*log*N*); in our implementation, the FMM is used to further reduce the scaling to *O*(*N*). In Subsections III A and III B, boundary integration and parallelization are presented.
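The one-point quadrature rule described above (centroid as quadrature point, element area as weight) can be sketched for a small hypothetical triangular mesh; the vertex and connectivity arrays below are stand-ins, not CUBIT or libMesh data structures.

```python
import numpy as np

# Hypothetical triangular surface mesh: vertex coordinates and connectivity.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=float)
tris = np.array([[0, 1, 2], [1, 3, 2]])

# One-point quadrature rule per element: the quadrature point is the
# centroid, and the weight is the (flat) triangle area.
p0, p1, p2 = (verts[tris[:, k]] for k in range(3))
centroids = (p0 + p1 + p2) / 3.0
areas = 0.5 * np.linalg.norm(np.cross(p1 - p0, p2 - p0), axis=1)

# The two flat triangles tile the unit square, so the weights sum to 1.
assert np.isclose(areas.sum(), 1.0)
```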

### A. Integrals accelerated by FMM

#### 1. Magnetostatic problem

Here, the relevant integral is given in Eq. (5). The boundary mesh is constructed for and applied only to the boundary integration, and it also enables us to couple the three solvers for Eqs. (3)–(6) in a single simulation. We transfer nodal values of *ϕ*_{1} from the boundary of the volume mesh to the corresponding DoFs on the boundary mesh before executing the boundary integration. After *ϕ*_{2} is evaluated on the boundary mesh, we transfer its nodal values from the boundary mesh to the boundary of the volume mesh before imposing the Dirichlet boundary condition. Note that the boundary mesh itself is 3D, but it uses elements with 2D shape functions. We evaluate *ϕ*_{2} on nodal points of the boundary mesh using a five-point Gaussian quadrature rule and the resulting discretized form of Eq. (5) is then

where *i* is the DoF index on the boundary mesh, *j* is an index for the quadrature points, and *N* is the total number of quadrature points in all boundary elements. Note that *N* depends on the number of boundary elements and the order of the quadrature rule. Finally, *w*(**r**_{j}) is the quadrature weight at each quadrature point. We can see that the integral at the *i*th DoF on the boundary mesh is discretized as a summation over all quadrature points, which is the first term in Eq. (16), with values of *ϕ*_{1} and quadrature weights *w*(**r**_{j}) at quadrature points multiplied by the kernel

We also denote the artificial physical value at the *j*th quadrature point as

In the implementation of the summation in Eq. (16) using a kernel-independent FMM in ScalFMM, the normal unit vector at the quadrature points cannot be directly applied in the interpolation of the kernel. There are two methods to include the normal unit vector. In the first, the kernel in Eq. (17) is split into three components,

and we rewrite the summation as

In this method, the summation is then carried out using three kernels, and the components of the normal unit vector are incorporated in three artificial physical values associated with each kernel, as shown in Eq. (20). The second method incorporates the normal unit vector during the particle-to-multipole (P2M) stage in ScalFMM. According to the method of kernel-independent FMM and Eq. (6) in Ref. 23, the summation in Eq. (16) can be expressed as

where *n* is the order of the Chebyshev interpolation, $S_n(\bar{x}_l, x_i)$ is the interpolating function for node $\bar{x}_l$, and *K _{r}* is the 1/*r* kernel, such that the kernel in Eq. (17) can be written as

The idea of the second method is to use the usual 1/*r* kernel and to apply the normal unit vector as a normal derivative to the interpolating function (the last term in Eq. (21)), which can be implemented in the P2M stage of FMM. One advantage with this approach is that ScalFMM is highly optimized for the symmetric 1/*r* kernel, which further reduces the computational cost of the boundary integration. However, this second method requires modifications to the current ScalFMM source code, which are beyond the scope of our work. Therefore, we currently incorporate the normal unit vector using only the first method.
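The first method, splitting the kernel into three components and folding the normal unit vector into three artificial physical values as in Eqs. (19) and (20), can be illustrated with a small direct-sum consistency check; all arrays here are synthetic stand-ins for the boundary-mesh data.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 200
src = rng.random((M, 3))                           # quadrature (source) points
nrm = rng.standard_normal((M, 3))
nrm /= np.linalg.norm(nrm, axis=1, keepdims=True)  # unit normals at sources
sigma = rng.random(M)                              # phi_1 times quadrature weight
tgt = rng.random((20, 3)) + 2.0                    # targets kept away from sources

d = tgt[:, None, :] - src[None, :, :]
r3 = np.linalg.norm(d, axis=-1) ** 3

# Direct sum with the full double-layer kernel n . (r - r') / |r - r'|^3.
direct = (((d * nrm).sum(-1)) / r3) @ sigma

# Same sum done as three scalar-kernel passes: kernel (r - r')_c / |r - r'|^3
# with artificial source values sigma_c = n_c * sigma, as in Eq. (20).
split = sum((d[:, :, c] / r3) @ (nrm[:, c] * sigma) for c in range(3))

assert np.allclose(direct, split)
```

Each of the three scalar passes is exactly the kind of summation a kernel-independent FMM can evaluate, which is why the normal vector is absorbed into the source values rather than into the interpolated kernel.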

In our implementation of the boundary integral using ScalFMM, we used the target source model (TSM): the nodal points are treated as target points, and quadrature points in every element are treated as source points, so that target points are distinct from source points. These target and source points are then inserted into the octree^{33} with information about their coordinates, particle type (target/source), and the artificial physical values [*σ*_{1,j}, *σ*_{2,j}, *σ*_{3,j} in Eq. (20)] associated with source points. We avoid the singularity of the integral by setting a tolerance radius around each target point: if the distance between a target point and a source point is less than a small number, e.g., 10^{−5}, then the contribution of that source point to the integral is set to zero. During the FMM computations, the height of the octree, which represents the balance between near-field and far-field calculations, is tuned to achieve optimal performance of the FMM calculation. In general, the more particles in the octree, the larger its height should be.
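The transfer of nodal values between the boundary of the volume mesh and the separate boundary mesh, used throughout this subsection, can be sketched by matching nodes on (rounded) coordinates; this is a simplified illustration of the idea, not libMesh's actual transfer mechanism, and the coordinate arrays are hypothetical.

```python
import numpy as np

# Hypothetical nodal coordinates: the boundary nodes of the volume mesh and
# the nodes of the separately built boundary mesh list the same points, but
# generally in a different order.
vol_bnd_coords = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
phi1_vol = np.array([10.0, 20.0, 30.0])  # phi_1 at those volume-mesh nodes

bnd_coords = np.array([[0., 1., 0.], [0., 0., 0.], [1., 0., 0.]])

# Match nodes by rounded coordinates and copy the nodal values across.
key = lambda p: tuple(np.round(p, 10))
lookup = {key(p): v for p, v in zip(vol_bnd_coords, phi1_vol)}
phi1_bnd = np.array([lookup[key(p)] for p in bnd_coords])

assert np.allclose(phi1_bnd, [30.0, 10.0, 20.0])
```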

#### 2. Electrostatic problem

For the numerical implementation of the boundary integrals that arise when calculating electrostatic fields, we use Eq. (11) as an example for the discretization. The electrostatic field vector at the *j*th element that arises from the bound surface charge density is

where *N* is the total number of boundary elements (which also equals the number of DoFs because of the constant monomial shape functions used here), *a _{j}* and *a _{k}* are the surface areas of the *j*th and *k*th elements, *H _{j}* is the mean curvature at the *j*th element, and **n**_{j} is the unit outward-pointing normal vector at the centroid of the *j*th element. The first term in Eq. (23) is the contribution from all source elements except for the target element itself, and the second term is a correction term that replaces the original singular term when *j* = *k*. This approximate correction works reasonably well for arbitrary geometries (see Section IV.B in Ref. 17). For spherical surfaces, *H _{j}* = 1/*r*, where *r* is the radius of the sphere to which the *j*th element belongs; for cylindrical lateral surfaces, *H _{j}* = 1/(2*r*), where *r* is the radius of the cylinder; for flat surfaces, *H _{j}* = 0, which means the correction term is zero.

The kernel used in the summation in Eq. (23) is

Note that this kernel is different from that in Eq. (17) for the magnetostatic problem, because it contains no component of the normal unit vector. It is also evident from Eqs. (10) and (12) that the normal unit vector is associated with target points rather than source points and lies outside of the boundary integral. We also note that the above kernel in Eq. (24) is the spatial derivative of the 1/*r* kernel. Because the 1/*r* kernel and its spatial derivatives, i.e., the forces from the 1/*r* potential, are already implemented in ScalFMM, we can take advantage of its highly optimized algorithm specifically designed for the symmetric 1/*r* kernel to perform the boundary integrals for the electrostatic problem, which is more efficient than our current implementation of the boundary integrals in the magnetostatic problem.
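The structure of the discretized field evaluation in Eq. (23), an off-diagonal sum with the gradient-of-1/*r* kernel plus a curvature correction replacing the singular self term, can be sketched as follows. The mesh data are synthetic (a unit sphere with equal-area stand-in weights), and the prefactors in the correction term are illustrative assumptions rather than the paper's exact expressions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical surface-mesh data on a unit sphere: element centroids, areas,
# outward normals, mean curvatures, and bound surface charge densities.
M = 400
x = rng.standard_normal((M, 3))
c = x / np.linalg.norm(x, axis=1, keepdims=True)  # centroids on the unit sphere
n = c.copy()                                      # outward normals
a = np.full(M, 4 * np.pi / M)                     # equal-area stand-in weights
sig = rng.random(M)
H = np.ones(M)                                    # mean curvature 1/r with r = 1

# Off-diagonal part: gradient-of-1/r kernel (r_j - r_k)/|r_j - r_k|^3
# summed over all source elements k != j.
d = c[:, None, :] - c[None, :, :]
r = np.linalg.norm(d, axis=-1)
np.fill_diagonal(r, np.inf)                       # drop the singular j == k term
E = np.einsum('jkc,k->jc', d / r[..., None] ** 3, sig * a) / (4 * np.pi)

# Curvature correction replacing the singular self term (schematic form only;
# the 1/(8*pi) prefactor here is an illustrative assumption).
E += (H * a * sig / (8 * np.pi))[:, None] * n

assert E.shape == (M, 3) and np.all(np.isfinite(E))
```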

### B. Parallelization

In our parallel implementation, we use a serial mesh, i.e., every processor has a copy of the boundary mesh (and of the volume mesh in the magnetostatic problem), but the data structures, such as matrices and vectors, used to solve the linear system of equations are partitioned and distributed among all processors automatically by libMesh during parallel solving using PETSc. A parallel mesh is needed only if storing the mesh data on each processor consumes too much memory, which may be the case for problems with billions of DoFs. We use METIS^{34} as the partitioner for data structures associated with the mesh.

The magnetostatic problem has a volume mesh for solving the Poisson and Laplace equations [Eqs. (3) and (6)] and a boundary mesh for the boundary integral [Eq. (5)]. If we then were to use the same partitioner for the volume mesh and boundary mesh, poor load balancing would result. That is because even if the volume mesh is almost uniformly partitioned, the boundary mesh is generally not. Therefore, we assign one partitioner to the volume mesh and a separate partitioner to the boundary mesh, which significantly improves the parallel performance of the boundary integration, especially when the number of processors is large. For the electrostatic problem, only one partitioner is necessary and the main parallelization is implemented in building the right-hand side *b*, performing matrix-vector multiplication in each GMRES iteration, and calculating the force vectors on the dielectric bodies, i.e., every processor calculates its own contribution to the right-hand side, to the matrix-vector product, and to the force vectors.

For the parallelization of the FMM in the magnetostatic problem, one strategy is to partition only the source points, i.e., every processor has a copy of the coordinates of all target points on the boundary mesh, as well as the coordinates, physical values of *ϕ*_{1}, and normal unit vectors of only its local source points. We then perform serial FMM calculations on every processor: each processor first calculates the contributions from its local source points to all target points; all processors then communicate, sum up the contributions from every processor, and assign the integrated values to all target points. An alternative strategy follows a similar concept, but partitions only the target points and keeps on every processor a local copy of the information from all source points. Which of the two strategies is better suited to the problem at hand depends on whether the number of target points or the number of source points is larger. If the number of source (target) points is larger than that of target (source) points, we choose to partition the source (target) points, *i.e.,* the larger set of points, because this reduces the computational time and the memory cost of storing particles on every processor. In most cases we found that the number of source points is larger than the number of target points, because the number of quadrature points typically exceeds the number of nodal points, so we partition the source points in our parallel implementation.
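The source-partitioning strategy described above can be mimicked without MPI by splitting the source points into batches, computing each batch's contribution to all targets, and summing the partial results (the role an MPI reduction plays in the real implementation). All data here are synthetic, and a plain 1/*r* direct sum stands in for the per-processor FMM evaluation.

```python
import numpy as np

rng = np.random.default_rng(4)
src = rng.random((240, 3))
val = rng.random(240)             # phi_1 times quadrature weight at each source
tgt = rng.random((30, 3)) + 2.0   # every "processor" holds all target points

def contribution(s, v):
    """Direct-sum contribution of a batch of sources to all targets."""
    r = np.linalg.norm(tgt[:, None, :] - s[None, :, :], axis=-1)
    return (1.0 / r) @ v

# Partition only the source points among 4 "processors"; each computes its
# local contribution to every target, and the results are then summed.
parts = np.array_split(np.arange(240), 4)
total = sum(contribution(src[p], val[p]) for p in parts)

# The partitioned sum reproduces the single-pass direct sum.
assert np.allclose(total, contribution(src, val))
```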

The strategy for parallelization of FMM in the electrostatic problem is similar to that in the magnetostatic problem. The only difference comes from the fact that we use different shape functions in this problem than those employed for the magnetostatic problem. Since a constant monomial shape function is used here, the quadrature point (source point) in each element is also the point (target point) where variables, such as charge densities and electrostatic fields, are approximated, i.e., target points are the same as source points. In order to use the TSM model in ScalFMM and the strategy of partitioning only target/source points, all quadrature points are inserted into the octree first as target points, and then they are inserted into the same octree again as source points, along with information of associated physical values and quadrature weights. We then partition only the target points; each processor keeps a copy of all source points along with the information of physical values and quadrature weights, and we perform FMM in serial on every processor.

We also note that ScalFMM has its own strategy for parallelization of FMM, in which both target and source points are partitioned based on their Morton ordering.^{35} The idea is to map 3D points in the octree to those in a 1D array based on the z-value of each point and then to partition all points based on the index of their 1D representation. After partitioning, local points are inserted into the octree on each processor, and each processor then performs FMM and communicates with other processors.^{36} This strategy consumes less memory than the previous two methods, but its implementation is more involved. Therefore, we have not considered it here and will examine it in future developments of our computational approach, where we will also compare its parallel efficiency with methods that partition only target/source points. In Sec. IV, we test our computational approaches for the magnetostatic and electrostatic problems by applying them to several model problems, and we compare the results with analytical solutions. These comparisons serve to demonstrate the accuracy, efficiency, and scalability of our parallel approach.
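The Morton (Z-order) mapping mentioned above can be sketched by interleaving the bits of integer grid coordinates; the helper below is a generic textbook version, not ScalFMM's implementation.

```python
def morton3d(ix, iy, iz, bits=10):
    """Interleave the bits of three integer grid coordinates into a Morton code."""
    code = 0
    for b in range(bits):
        code |= (((ix >> b) & 1) << (3 * b)) \
              | (((iy >> b) & 1) << (3 * b + 1)) \
              | (((iz >> b) & 1) << (3 * b + 2))
    return code

# Points close on the Z-order curve tend to be close in space, so sorting by
# Morton code and cutting the sorted list into equal chunks gives each
# processor a spatially compact set of points.
pts = [(0, 0, 0), (7, 7, 7), (1, 0, 0), (0, 1, 1)]
order = sorted(pts, key=lambda p: morton3d(*p))
assert order[0] == (0, 0, 0) and order[-1] == (7, 7, 7)
```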

## IV. RESULTS AND DISCUSSION

### A. Magnetostatic problem

Two model magnetostatic problems are considered here to verify our proposed methods and their implementation: a uniformly magnetized spherical body of unit radius and a uniformly magnetized cubic body of unit side length (see Fig. 2). For demonstration purposes, the magnetization density within each body is set to **M** = (1, 0, 0). After solving the magnetostatic problem using our computational approach, the distribution of the magnetic potential within each body is obtained, as shown in Fig. 2. For the unit sphere, the magnetic potential varies linearly along the direction of the magnetization density, and the stray field (demagnetizing field) is calculated by

which is the (negative) gradient of the magnetic potential, and it is uniform everywhere inside the sphere in this particular case. The analytical solution of the stray field along the direction of magnetization density is 4*π*/3 in Gaussian units, see Ref. 10. For the unit cube, the magnetic potential and stray field are nonuniform within the body. Instead, we compute the stray field energy and compare our result with analytical solutions. The stray field energy is calculated as

The analytical value of the stray field energy in a unit cube with unit homogeneous magnetization density is 2*π*/3 [$\mu_0 M_s^2$] (see Refs. 37 and 38), with *M _{s}* the saturation magnetization density. We calculate the relative error for the stray field energy, for example, as

Figure 3(a) shows the relative errors as a function of surface DoFs; they decay linearly in the log-log plot. For a given number of surface DoFs, the relative error for the unit cube is smaller than that for the unit sphere; *a posteriori* estimates also show that the convergence rate for the unit cube is higher than that for the unit sphere. The larger errors associated with the sphere calculations are due to the larger discretization error that arises when using flat elements and linear shape functions to describe a curved geometry. They do not arise from the selected settings in ScalFMM; indeed, Fig. 4 shows that the difference between the relative errors obtained by the FMM and by the direct method is small, and that using quadratic shape functions reduces the discretization errors. One way to decrease the relative errors in modeling curved surfaces is to adaptively refine the surface mesh, at the price of simultaneously increasing the surface DoFs and the computational cost; another is to use curved elements or higher-order methods^{39} to approximate variables on curved surfaces.
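The *a posteriori* convergence rate discussed above is simply the slope of the log-log error curve, which can be extracted by a least-squares fit; the data below are synthetic, chosen to follow an exact power law rather than taken from Fig. 3(a).

```python
import numpy as np

# A-posteriori convergence order: fit the slope of log(error) vs log(DoFs),
# as one would read it off a log-log plot. Synthetic data with slope -0.5.
dofs = np.array([1e3, 4e3, 16e3, 64e3])
errs = 0.9 * dofs ** -0.5

slope = np.polyfit(np.log(dofs), np.log(errs), 1)[0]
assert np.isclose(slope, -0.5)
```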

Figure 3(b) shows the time spent on computing the boundary integral by the FMM as a function of surface DoFs. The order of the Chebyshev polynomials is 5 in all cases. The time increases linearly as the surface DoFs increase, and *O*(*N*) scaling is confirmed. Note that the total number of points inserted in the octree is the sum of the surface DoFs (target points) and the number of quadrature points (source points). We use a fifth-order quadrature rule for the boundary integration, so the number of source points in the octree is roughly 8 times that of target points. In using the FMM to accelerate computation of boundary integrals, it is critical to choose the appropriate octree height to achieve optimal FMM performance, because it controls the balance between the near-field and far-field computations. The optimal octree height depends on the total number of points in the octree and the size of the cubic box, and it is determined by trial runs. As the DoFs increase, the optimal octree height generally increases. In Fig. 3(b), two small humps are observed when the surface DoFs are 53 K and 192 K; they indicate changes in the octree height. The trade-off between the FMM settings (the octree height and the order of the Chebyshev polynomials) and performance is shown in Fig. 5. The relative errors of the stray field energy change little when the octree height changes, and the accuracy of the FMM is mainly determined by the Chebyshev order. Increasing the Chebyshev order improves accuracy, but the computational time increases. We refer to Ref. 23 for more details on the trade-offs between FMM settings and performance.

We also tested the parallel efficiency of a simulation of magnetized spherical bodies with 7.4 M volume DoFs and 70 K surface DoFs. Figure 3(c) shows the strong scaling of the time spent on the Poisson solver, the boundary integral using FMM, the Laplace solver, and the total time for solving the full problem. For the Poisson and Laplace solvers, the time includes that spent on constructing the block-Jacobi preconditioner as well as on solving the linear system of equations using GMRES. We tested up to 512 processors, and the time spent on each part decreases as the number of processors increases. When the number of processors is larger than 64, the time for the Poisson solver decays more slowly than that for the Laplace solver and the boundary integral, while the decay rates for the Laplace solver and the boundary integral are very similar. Algebraic multigrid preconditioners,^{32} such as gamg and hypre, were also tested for the Poisson solver; the total computational time and the strong scaling are similar to those obtained with the block-Jacobi preconditioner. Constructing a preconditioner with gamg or hypre takes longer than with block-Jacobi, but the number of GMRES iterations is much smaller. For static problems, the total times are similar for the different preconditioners; for time-dependent problems, however, if the global matrix remains the same between time steps, then preconditioning with gamg or hypre is recommended.
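Strong-scaling curves such as those in Fig. 3(c) are typically summarized by the parallel efficiency relative to a baseline processor count. A minimal sketch, with hypothetical wall times rather than the measured ones:

```python
def strong_scaling_efficiency(times):
    """Parallel efficiency E(p) = T(p0)*p0 / (T(p)*p), relative to the
    smallest processor count p0 in `times` (count -> wall time).
    E(p) = 1 is ideal strong scaling."""
    p0 = min(times)
    t0 = times[p0]
    return {p: (t0 * p0) / (t * p) for p, t in times.items()}

# Hypothetical wall times in seconds (invented, not the paper's data)
timings = {1: 1024.0, 64: 16.5, 512: 2.9}
eff = strong_scaling_efficiency(timings)  # eff[1] == 1.0 by construction
```

Efficiencies below 1 that shrink with processor count correspond to the growing gap from ideal scaling discussed in the text.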

### B. Electrostatic problem

We verified our method and implementation with a few model calculations for which analytical or numerical literature solutions exist: specifically, two dielectric spheres with free surface charges, and a sphere inserted in a cylindrical cavity in a dielectric medium. In addition, we tested our method on a model problem for which no closed-form analytical solution exists, namely, two dielectric cubes with uniform free surface charges. The results are shown in Fig. 6; Fig. 6(a) shows the solution for the problem of two spherical dielectric bodies. We also varied the center-to-center distance between the two spheres and calculated the electrostatic forces at each distance using our numerical approach and an analytical expression^{40,41} (see Fig. 7(a)). Figure 6(b) shows the solution for the dielectric sphere inside a cylindrical cavity in a dielectric medium. The polarization effect on the sphere is turned off by setting its relative permittivity equal to that of the medium; note that this problem is equivalent to that of a point charge inside a dielectric cylinder. Our results are in good agreement with analytical solutions as well as with previous numerical results from Ref. 5. Figure 6(c) shows results for the two dielectric cubes. In particular, the bound surface charge densities accumulate at the corners and edges of the cubes (with maximum/minimum values of ±0.49) rather than on the faces. To better show the polarization effect on the cubic faces, the color range is set to ±0.04 in Fig. 6(c).

Figure 7(a) compares our numerical method with an analytical method for calculating the body forces on two dielectric spherical bodies as a function of their center-to-center distance (particle separation). For the numerical method, the surface DoFs on each body are varied to test the accuracy, while the relative tolerance of the iterative solver GMRES is set to 10^{−5} in all cases (further decreasing the relative tolerance improves the numerical results little). The relative errors of the body force based on Eq. (27) are shown in Fig. 8. As the number of degrees of freedom increases, the relative errors decrease significantly when *d*/*R* ≥ 2.8, but only marginally when *d*/*R* ≤ 2.5; it is known to be difficult to calculate forces accurately when two spheres are very close to each other because of large discretization errors, and adaptive mesh refinement is an efficient technique for improving the accuracy in this situation.^{17} The relative errors do not always decrease when 2.5 < *d*/*R* < 2.8; for the same number of degrees of freedom (e.g., 5004 DoFs/sphere), they are even larger when 2.7 < *d*/*R* < 3.0 than when *d*/*R* ≥ 3.0. This is because, as the analytical solutions show, the body forces are nearly zero at these separations; they are orders of magnitude smaller than those at other separations and thus more sensitive to discretization errors, which makes the relative errors large even though the absolute errors are very small.
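The blow-up of the relative error near the zero crossing of the force follows directly from its definition: for a fixed absolute (discretization) error, the relative error scales as the inverse of the exact value. A small illustration (all values invented for demonstration):

```python
def relative_error(approx, exact):
    """Relative error |approx - exact| / |exact|; diverges as the
    exact value approaches zero even for a fixed absolute error."""
    return abs(approx - exact) / abs(exact)

abs_err = 1e-4                       # same discretization error in both cases
far = relative_error(1.0 + abs_err, 1.0)        # force O(1): tiny rel. error
near = relative_error(0.001 + abs_err, 0.001)   # force near its zero crossing
```

With the same absolute error, the relative error is three orders of magnitude larger when the exact force is three orders of magnitude smaller, which is exactly the behavior seen for 2.7 < *d*/*R* < 3.0.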

Including polarization effects yields electrostatic interactions that are very different from bare Coulomb interactions, especially when the particle separation is small. These effects have been reported in past theoretical and numerical studies. In the particular case considered in Fig. 7, even though the free surface charge densities on both bodies are all positive, the net electrostatic force between the bodies becomes attractive when their center-to-center distance *d*/*R* is less than 2.74. The bare Coulomb interaction, by contrast, is purely repulsive. It is therefore crucial that the effects of dielectric polarization be included in calculations of electrostatic interactions for simulations of colloidal bodies; otherwise, the underlying physics is not captured correctly.

Figure 7(b) shows the total computational time spent on assembling the right-hand side of Eq. (9), solving the linear system of equations iteratively using GMRES, and calculating the body forces on one particle. As before, the order of the Chebyshev polynomials is 5 in all cases. As the surface DoFs increase, the computational time increases linearly, similar to the boundary integral using FMM in the magnetostatic simulation, which confirms the *O*(*N*) scaling of our method. Because the distance between the two particles differs from case to case, the box size of the octree also differs; the octree height in each case is tuned to achieve optimal FMM performance, and the height generally increases as the surface DoFs increase. Compared to PPPM and the particle-mesh Ewald (PME) method^{43} with *O*(*N* log *N*) scaling, FMM-based methods exhibit slower growth of the computational time as *N* increases; Fig. 7(b) shows that they have a clear advantage in high-performance computing for large-scale simulations with millions of unknowns. In addition, FMM-based methods are mesh-free and more efficient for heterogeneous systems (e.g., systems with electrostatic polarization), where mesh-based approaches such as PPPM and PME, which require an evenly spaced FFT mesh, become slow and memory-intensive.^{44}
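The scaling advantage over *O*(*N* log *N*) methods can be made concrete with a simple cost-model ratio; the unit constants below are placeholders, and the true crossover depends on implementation, accuracy settings, and hardware:

```python
import math

def cost_ratio(n):
    """Ratio of an O(N log N) model cost (PME/PPPM-like) to an O(N)
    model cost (FMM-like), with unit prefactors as placeholders."""
    return (n * math.log2(n)) / n

r_small = cost_ratio(2 ** 10)   # gap of log2(N) = 10 at N ~ 1 K
r_large = cost_ratio(2 ** 20)   # gap of log2(N) = 20 at N ~ 1 M
```

The model ratio grows like log2(*N*), so the advantage of the *O*(*N*) method widens steadily, which is why it matters most for simulations with millions of unknowns.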

Figure 7(c) shows strong scaling for up to 512 processors for solving the electrostatic problem when the surface DoFs on each particle are 0.36 M and 1.95 M. Overall, the computational times decrease as the number of processors increases, and they are slightly larger than ideal scaling when the number of processors is less than four. The difference between the computational time and ideal scaling increases as the number of processors increases. For the case with 0.36 M DoFs using 256 and 512 processors, the computational times were found to reach a plateau, while the strong scaling shows better performance for 1.95 M DoFs. To investigate what prevented good scaling (*i.e.,* closer to ideal) for larger numbers of processors, we performed a detailed profiling of our code. We found that the time spent on inserting all source points and on FMM calculations over all source points dominates the total time, while the communication time spent on gathering the information of all source points to every processor is much smaller. This observation shows that the plateau in strong scaling results from our parallel strategy of partitioning only target points and keeping a copy of all source points on every processor, so that all source points must be inserted into the octree on each processor; this dominates the insertion time. In addition, the FMM calculations have an *O*(*N*) scaling with *N* = *N _{t}*/*n _{p}* + *N _{s}*, since we only partition target points; here *N _{t}*/*n _{p}* is the number of target points per processor and *N _{s}* is the number of all source points. In the electrostatic case, *N _{t}* = *N _{s}* because of the one-point quadrature rule used in the boundary integration. The FMM calculation time is therefore dominated by calculations over source points, even though there are far fewer target points per processor when the number of processors (*n _{p}*) is reasonably large. One strategy to recover better scaling for larger numbers of processors would be to create a separate partitioner for FMM and to distribute both target and source points to every processor according to their Morton index. There would then be two partitioners in the simulation: one for FMM and one for boundary elements. The challenge is then to create a mapping between the two partitioners in order to couple FMM with the boundary element calculation. Because our current parallel implementation shows very good scaling up to 128 processors, and quite good scaling for 1.95 M DoFs up to 512 processors, we believe it is more than adequate for most applications. Further improvements to the parallel implementation will be made in future work.
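The Morton ordering mentioned above maps 3D octree boxes to a one-dimensional index by interleaving the bits of the grid coordinates, so that points in the same box get contiguous indices that are easy to distribute across processors. A minimal sketch (our own illustrative implementation, not ScalFMM's):

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of integer grid coordinates
    (x, y, z) into a 3-D Morton (Z-order) index.  Bit i of x, y, z
    lands at positions 3i, 3i+1, 3i+2 of the code, respectively."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

# The eight children of one octree box receive contiguous indices 0..7:
cell_codes = [morton3d(x, y, z) for z in (0, 1) for y in (0, 1) for x in (0, 1)]
```

Sorting points by this index and cutting the sorted list into equal chunks is the standard way to partition an octree, which is the strategy the text proposes for distributing both target and source points.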

## V. CONCLUSIONS

We have developed accurate, efficient, and scalable parallel computational approaches for integral problems based on the open source libraries libMesh and ScalFMM. The problems considered here all contain integrals of Green’s functions over the boundary of the relevant domains, which exhibit *O*(*N*^{2}) scaling in direct implementations. To facilitate wider usage of these methods, we accelerated the computation of the boundary integrals using a kernel-independent FMM to achieve *O*(*N*) scaling with an *O*(*N*) memory cost, resulting in significant gains in computational speed as well as in efficient memory utilization.

We note that FMMs have been used previously for boundary integrals and BEM,^{45–47} as well as in combination with GMRES.^{8,48} However, few publicly available parallel codes use these techniques, and in particular few highly efficient and scalable ones. Our work aims to provide such a code framework, which we developed with the following features in mind: first, the code must run in parallel for large-scale problems with good parallel performance; second, it must be available for general distribution, at http://ime-code.uchicago.edu as part of the Continuum-Particle Simulation Suite (COPSS) from the Midwest Integrated Center for Computational Materials (MICCoM); third, it should provide convenient interfaces so that computational researchers who are unfamiliar with acceleration techniques or parallelization can focus on the underlying physical problems rather than on software engineering. Our computational approaches open up numerous possibilities and will enable efficient modeling of a variety of large-scale many-body problems, such as the dynamics of colloids and the interactions of magnetic bodies, to explore new physics and functions of materials. Further improvements can be made to our computational approaches, such as adopting the Morton-ordering-based parallelization strategy in ScalFMM and comparing its performance with that of partitioning only target/source points, as in our current implementation. Furthermore, for the boundary integral in the magnetostatic problem, the normal derivative of the Green’s function could be implemented in the P2M stage in ScalFMM, which would take advantage of the highly optimized symmetric 1/*r* kernel in ScalFMM to further accelerate the boundary integrations.

## Acknowledgments

X.J. and J.Q. acknowledge support by U.S. DOE, Office of Science, under Contract No. DE-AC02-06CH11357. The work by J.L., J.H.-O., J.J.d.P., and O.H. was supported by MICCoM, as part of the Computational Materials Sciences Program funded by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division. We gratefully acknowledge the computing resources provided on Blues and Fusion, high-performance computing clusters operated by the Laboratory Computing Resource Center at Argonne National Laboratory.

## REFERENCES

3D space is partitioned into a tree structure in which each node has eight children.