We present graphics processing unit (GPU) accelerated implementations of Markov chain algorithms to sample random tilings, dimers, and the six-vertex model.

Dedicated to the memory of Ludwig Faddeev

## I. INTRODUCTION

The study of tilings of planar regions by dominos, lozenges, and other shapes has a long history in mathematics, physics, and computer science. In recent decades, the discovery of arctic circles, limit shapes, and other phenomena in random tilings of large regions has sparked renewed interest in their investigation. For a survey of these developments, see Ref. 11.

A remarkable tool in the study of tilings has been the use of various computer programs to generate and visualize random tilings. A variety of powerful algorithms and techniques for sampling tilings have been developed and studied (see, for example, Refs. 15, 24, 1, 21, and 18), and their standard implementations are widely available and well-utilized by researchers in this field.

On the other hand, recent years have seen the advent of many new tools and techniques in high performance computing. Chief among them is the resurgence of parallel computing, driven by the fact that as single-processor speeds reach their physical limits, computers increasingly rely on multicore processors for performance and efficiency. While once restricted to supercomputing applications, parallel computing is nowadays essential to fully exploit modern computational power.

A powerful element of multicore hardware is the graphics processing unit (GPU). Designed specifically for certain computations in 3D computer graphics, GPUs employ a massively parallel architecture with up to thousands of limited-functionality processing cores working synchronously. Thanks to the popularity of video games, powerful GPUs are now commonplace on nearly all personal computers, as well as PlayStations, Xboxes, and other devices. Utilizing these computational resources for tasks beyond computer graphics is a tantalizing prospect. Indeed, GPUs have proven to be well-suited for many other types of problems, and general purpose computing with the GPU (GPGPU) has become increasingly popular and successful in many fields. For an overview of applications of GPUs to other Monte Carlo simulations, see Ref. 23.

The main purpose of this note is to demonstrate the use of GPUs to generate random tilings. Our approach is based on Glauber dynamics Markov chains, in which the Markov moves are local “flips” of the tiling. For parallelization, we consider non-local updates consisting of clusters of flips, which, when chosen according to a domain decomposition, can be generated and executed independently and in parallel on GPU cores. We implemented the program with C++ and OpenCL and have made our source code available.^{9}

The structure of the remainder of the paper is as follows: in Sec. II, we recall the basics of domino tilings and Markov chain algorithms for generating random tilings. We briefly discuss generalizations to some other models. In Sec. III, after reviewing some important aspects of graphics hardware architecture and programming, we explain our implementation of the Markov chain algorithm on the GPU. In the Conclusion, we present the results of some experiments we conducted to test the program.

## II. RANDOM TILINGS

### A. Domino tilings

By the *square lattice*, we mean the planar graph embedded in the Euclidean plane with vertices at coordinates $(i,j) \in \mathbb{Z}^2$ and edges joining all pairs of vertices unit distance apart. A *domain* $D$ is the union of a finite set of faces of the square lattice. We assume that $D$ is simply connected.

A *domino* is the union of two square lattice faces that share an edge. A *tiling* $T$ of $D$ is a set of dominos whose interiors are disjoint and whose union is $D$. We denote by $\Omega_D$ the set of tilings of $D$. A domino tiling $T$ can equivalently be viewed as a *perfect matching* or *dimer cover* $T^*$ of the dual graph, or as a lattice routing, as shown in Fig. 1.

By identifying tilings with perfect matchings of a bipartite graph, each tiling $T$ can be associated with an integer valued *height function* $h_T$ on vertices $v \in D$, according to the rules shown in Fig. 2. Height functions induce a partial ordering on tilings, where $T < T'$ if $h_T(v) < h_{T'}(v)$ for all $v \in D$. Moreover, the point-wise maximum (or minimum) of two height functions is again the height function of a tiling, which endows $\Omega_D$ with the structure of a distributive lattice. We denote by $T_{\max}$ and $T_{\min}$ the unique maximal and minimal elements of $\Omega_D$, respectively; see Fig. 2.

We say a domain $D$ is *tileable* if $\Omega_D$ is non-empty. An efficient algorithm due to Ref. 22 determines the tileability of a domain and returns a maximum (or minimum) tiling in time proportional to the size of the domain.

#### 1. Weights

It is often important to consider various probability measures on $\Omega D$. Gibbs measures originate from models of statistical physics, in which dimer covers of the lattice correspond to bond configurations in crystals.

These measures are defined by *edge weights* that assign to each dual edge $e$ a positive number $w_e$, and to each tiling $T$ a weight $W(T)=\prod_{e\in T^*} w_e$. A tiling $T$ then has the probability

$$P(T) = \frac{W(T)}{Z}, \qquad Z = \sum_{T'\in \Omega_D} W(T').$$

Equivalently, the Gibbs measure can be defined by *volume weights* that assign to each vertex $v$ a positive real number $q_v$, and to each tiling $T$ the weight $W(T)=\prod_{v\in D} q_v^{h_T(v)}$. Choosing all weights to be 1 gives the uniform distribution on $\Omega_D$. See Figs. 11–14 for examples of domino tilings.

#### 2. Moves on tilings

We say a tiling $T$ is *rotatable* at a vertex $v$ if the faces adjacent to $v$ are covered by two parallel dominos. An *elementary rotation* at $v$ rotates the dominos in place, as shown in Fig. 3. More precisely, an up-rotation replaces two horizontal dominos by two vertical dominos, and a down-rotation does the opposite. We denote by $R_{\pm v}:\Omega_D \to \Omega_D$ the function that maps a tiling $T$ to the tiling obtained from $T$ by rotating up/down at $v$ if possible. Here we adopt the convention of formally signing vertices to denote up and down rotations.

A theorem of Ref. 22 states that any two tilings $T$ and $T\u2009\u2009\u2032$ of a domain $D$ are connected by a sequence of elementary rotations.

It is natural to consider simultaneous rotations on clusters of vertices. Note that if two signed vertices $v$ and $v'$ are not adjacent, then $R_v \circ R_{v'} = R_{v'} \circ R_v$. We call a subset of signed vertices $S$ an *admissible cluster* if no two of its vertices are adjacent. Then the cluster rotation $R_S = \prod_{v\in S} R_v$ is well defined. In practice, a convenient class of admissible clusters is found by fixing a vertex coloring and choosing subsets of vertices of the same color.

#### 3. Markov chain on tilings

A random walk on $\Omega_D$ is defined by an initial tiling $T^{(0)}$ and a sequence of randomly chosen admissible clusters $\{S_i\}_{i=1}^{\infty}$, where the $n$th step is

$$T^{(n)} = R_{S_n}\big(T^{(n-1)}\big).$$

In other words, at each step, the Markov chain chooses a random cluster $S$ and moves to $R_S(T)$. It is straightforward to check that this Markov chain is irreducible and aperiodic, from which it follows that in the limit $n \to \infty$ the random tiling $T^{(n)}$ is uniformly distributed.

In reality, the Markov chain is run for a finite but large time determined by the rate of convergence or *mixing time*. Although mixing times of Markov chains on tilings have been studied by many, see, for example, Refs. 15, 24, and 13, and upper bounds rigorously established in many particular settings, very little is known in general. In practice, the mixing times can often be estimated empirically by heuristic techniques such as a self-consistent analysis of autocorrelation times.

#### 4. Perfect sampling with coupling from the past

When statistical soundness is paramount, exact sampling can be accomplished using the *coupling-from-the-past* algorithm,^{18} which effectively simulates running the Markov chain for an infinite time. It works as follows: given a sequence of random clusters $\{S_i\}_{i=1}^{\infty}$, define the backwards walk

$$R^{(-n)} = R_{S_n}\circ R_{S_{n-1}}\circ \cdots \circ R_{S_1},$$

so that $S_1$ acts first, deepest in the past. Almost surely there exists an $n$ for which $|R^{(-n)}(\Omega_D)|=1$ (and in fact this holds for all earlier times $m \ge n$). The Markov chain is then said to have *collapsed*, and the unique element in the range of $R^{(-n)}$ is distributed according to the stationary distribution.

For Markov chains with large state spaces, checking for collapse can be impractical. However, in the case that the state space is partially ordered and the Markov moves are monotone, as is the case for domino tilings, the state space collapses if and only if $R^{(-n)}(T_{\max})=R^{(-n)}(T_{\min})$. Consequently, it is sufficient to check only the maximal and minimal states.

### B. Other models

The machinery described above for domino tilings can be generalized to a variety of other models. Let us briefly describe a few examples:

#### 1. Lozenge tilings

Lozenge tilings are the triangular-lattice analog of domino tilings, see Fig. 4. Lozenge tilings correspond to dimers on the hexagonal lattice, whose bipartite structure allows the introduction of a height function at vertices (which is easy to visualize by imagining a tiling as a stack of cubes), which in turn induces a partial order and lattice structure on the set of tilings. The set is connected by elementary rotations at vertices of the triangular lattice as shown below. Cluster rotations can be generated by three-coloring the triangular lattice. See Figs. 15 and 17(a) for examples of lozenge tilings.

#### 2. Bibone tilings

Bibone tilings are the hexagonal-lattice analog of domino tilings, see Fig. 5. While they correspond to dimers on the triangular lattice, techniques for bipartite dimers do not apply; for example, bibone tilings do not admit height functions. Nonetheless, the author of Ref. 10 showed the connectedness of the set of tilings under three types of elementary moves as shown in Fig. 5. For parallelization, we consider clusters of the same type of move, with a different domain decomposition for each type. See Fig. 16 for an example of a bibone tiling.

#### 3. Rectangle-triangle tilings

Rectangle-triangle tilings were studied by Nienhuis.^{16} The tiles are isosceles triangles and rectangles with side lengths 1 and $\sqrt{3}$. We focus on tilings of domains of the triangular lattice. Like lozenge tilings, rectangle-triangle tilings can be visualized in 3D as stacks of half-cubes, which gives a partial ordering to the set of tilings. It is easy to check that the set of tilings is connected by an elementary move at vertices as shown in Fig. 6. In practice, we allow many other local moves to improve the mixing rate.

Local weights can be introduced by assigning to each face of the triangular lattice a weight depending on the tiles that cover the face.^{16} The three possibilities refer to types of tiles that can cover a face, labeled as (A), (B), and (C). These cases are assigned weights *t*, *c*, and *r*, respectively; the weight of the tiling is given by the product of all weights of faces. See Figs. 17(b) and 18 for examples of rectangle-triangle tilings.

#### 4. The six-vertex model

The six-vertex model is defined on a domain $D$ of the square lattice. A configuration *S* of the model is an assignment of “occupied” or “unoccupied” to each edge in $D$ that satisfies the condition that at every vertex $v$ in the interior of $D$, the edges adjacent to $v$ must be one of the six local configurations shown in Fig. 7. A boundary condition for the six-vertex model fixes the state of edges intersecting the boundary of $D$.

Six-vertex configurations correspond bijectively to height functions, as shown in Fig. 8. This endows the set of configurations with a partial ordering and the structure of a distributive lattice.

Gibbs measures are defined by assigning positive real weights to each of the six vertex types. The weight of a configuration $S$ is $W(S)=\prod_{v\in D} w(v,S)$, where $w(v, S)$ is the weight of the vertex $v$ in the configuration $S$, depending on the vertex type. The probability of $S$ is then

$$P(S) = \frac{W(S)}{Z}, \qquad Z = \sum_{S'} W(S').$$

The weights $a = 1$, $b = 1$, $c = 1$, where $a$, $b$, and $c$ denote the weights of the three pairs of vertex types in the standard parametrization, give the uniform distribution on configurations.

With fixed boundary conditions, the set of configurations is connected by elementary *c*-flips at faces, as shown in Fig. 7. In general, variations of *c*-flips for boundary faces are also required. For parallelization, cluster moves can be generated by coloring faces according to the parities of their *x*- and *y*-coordinates.

It can be checked that the local moves preserve the partial order only if the weights satisfy *a* ≤ *c* and *b* ≤ *c*; this means that coupling-from-the-past can only be used for these weights, although the standard Markov chain sampling can be used for all weights. For the weights (*a* = 1, *b* = 1, *c* = 1), a much faster program is possible due to the combinatorial nature of the model, although we do not pursue it here. See Figs. 19–22 for examples of six-vertex configurations.

## III. IMPLEMENTATION

### A. Graphics processing units

Certain properties of graphics hardware have great significance in the design and implementation of efficient algorithms for the GPU. Let us briefly mention a few important aspects and refer the reader to Ref. 7 for details.

#### 1. GPU design and architecture

Many tasks in 3D computer graphics involve carrying out an identical sequence of operations independently on a large set of input primitives. Rendering a surface, for instance, could require, for each pixel drawn to the screen: computing the local surface normal, computing directions to the viewer and light sources, and finally computing the color according to some lighting and shading model. These types of tasks are often called *data parallel*, and the routine executed on each input is called a shader or *kernel*.

GPUs are designed specifically for data parallel tasks. At the hardware level, a typical GPU consists of a number of *compute units*, each containing an instruction block, shared memory caches, and a number of *processing elements* that carry out arithmetic computations. A parallel computation is organized by issuing a *thread* for each kernel execution instance; the threads are divided into blocks and distributed to compute units for execution. Each compute unit executes the kernel in *single instruction multiple data* (SIMD) fashion, by issuing at each time step the same kernel instruction to every processing element to carry out in a different thread.

For tasks well-suited to this architecture, GPUs can perform up to hundreds of times faster than conventional CPUs.

Performance can suffer drastically, however, when programs are not tuned for the graphics hardware. Due to the SIMD execution model, GPU performance can be severely affected by *branch divergences*, which occur when conditional expressions in the kernel cause different threads to follow different execution paths. Furthermore, compared to conventional processors, GPUs have little hardware dedicated to speeding up sequential code. In particular, GPU cores have relatively smaller memory caches and limited bandwidth for memory access. While GPUs hide memory-access latency by multithreading, their performance is nonetheless sensitive to the pattern of memory access and the organization of data in memory.

#### 2. The OpenCL framework

In the early days, general purpose computing with GPUs required reformulating a task in terms of graphics primitives and operations. Since then, GPUs have evolved into highly flexible and easily programmable processors. Frameworks such as NVidia’s Compute Unified Device Architecture (CUDA) and the Open Computing Language (OpenCL) define C-type kernel programming languages and provide C/C++ libraries to access GPU devices.

For this work, we chose to use OpenCL.

In the OpenCL framework, a computing platform consists of a *host* (typically the CPU) and any number of *devices* (typically GPUs). The kernel source code is loaded and compiled for devices by OpenCL at run-time. The host issues commands to the device, such as kernel executions and data transfers, via a command queue. Before launch, the set of kernel execution threads is arranged by the user into a grid of *work-items*. The grid coordinates of a work-item can be accessed from within the kernel program and used to identify the thread. The grid is further divided into *work-groups*, each of which is executed on one compute unit of the GPU, and whose work-items can communicate via shared local memory and can synchronize with barriers.

A typical OpenCL host program first initializes the platform, then loads and compiles the kernel source, transfers relevant data to the device, sets up and launches the kernel on the device, and finally reads back the output data from the device.

### B. Random domino tilings with the GPU

Having laid the necessary foundations, let us now explain our implementation.

To each vertex $v$, we associate an integer *state* representing the tiling of the faces adjacent to $v$ as follows: enumerating the edges adjacent to $v$ in the order North, South, East, and West, the tilestate $s_v$ is defined as

$$s_v = \sum_{i=1}^{4} 2^{\,i-1} e_i,$$

where $e_i$ is 1 if a domino of $T$ crosses edge $i$ and 0 otherwise. A vertex $v$ is rotatable if its state is $s_v = 12$ or $s_v = 3$.

A tiling is represented by the state of every vertex of the domain $D$, as in Fig. 9. Assuming the domain $D$ is contained in an *N* × *N* square domain in the first quadrant of the Euclidean plane, we store a tiling $T$ as an *N* × *N* array *T*, where the (*i*, *j*)th entry of *T* is the state of the vertex with coordinate (*i*, *j*).

To optimize kernel occupancy and memory transfer speed, we divide the tiling into the sub-arrays *T*_{b} and *T*_{w} of black and white vertices, respectively, and store each contiguously in the GPU device memory. For simplicity, we assume *N* is even, so that, for example, the (*i*, *j*)th component of the array *T*_{b} corresponds to the state of the vertex (*i*, 2*j* + *i* mod 2) in $D$. Moreover, to avoid special cases at the boundary, we assume all arrays are sufficiently zero-padded.

In our implementation, the primitives of GPU operations are the vertices of the domain $D$. We define a kernel function that rotates a vertex $v$ with some fixed probability and kernel functions that check the neighbors of $v$ and recompute its state. More precisely,

- The kernel *Rotate* takes as a parameter a tiling sub-array *T*. The work-item (*i*, *j*) first generates a pseudo-random number and then with fixed probability attempts to either rotate up or down the state *T*[*i*, *j*]. The state can be rotated up if *T*[*i*, *j*] = 3, with the rotation accomplished by setting *T*[*i*, *j*] = 12, and similarly for down-rotation.
- The kernels *UpdateBlack* and *UpdateWhite* take as parameters both sub-arrays *T*_{b} and *T*_{w}. The work-item (*i*, *j*) of *UpdateBlack* recomputes the state of the vertex with index (*i*, *j*) in *T*_{b} in terms of the neighboring vertices in *T*_{w}. This can be done efficiently with bitwise operations.

The *UpdateWhite* kernel is defined similarly.

Other kernels used to generate pseudo-random numbers are discussed in Sec. III B 1.

A random cluster rotation is accomplished by launching *Rotate* on all black vertices and *Update* on all white vertices, or *Rotate* on all white vertices and *Update* on all black vertices. We define the following host functions:

- The function *RandomWalk* takes as parameters a tiling $T$, a random number seed *s*, and a natural number *nSteps*. The function first divides $T$ into the two sub-arrays *T*_{b} and *T*_{w} and loads them into the GPU memory. It then seeds all pseudo-random number generators with *s*. Then, looping *nSteps* times, with equal probability it either runs *Rotate* on all black vertices and *Update* on all white vertices, or vice versa. Finally, after the loop is complete, it reads back *T*_{b} and *T*_{w} and recombines the two sub-arrays into the tiling *T*.
- The function *DominoTilerCFTP* generates a random tiling using the coupling-from-the-past algorithm. The pseudo-code is

**DominoTilerCFTP:**

```
Compute the extremal tilings T_max and T_min.
Initialize a list seeds with a random number seed.
Repeat:
    Initialize T_top    = T_max
    Initialize T_bottom = T_min
    For i = 1 to length of seeds:
        Set T_top    = RandomWalk(T_top,    seeds[i], 2^i)
        Set T_bottom = RandomWalk(T_bottom, seeds[i], 2^i)
    If T_top = T_bottom, return T_bottom.
    Else, push a new random number seed to the beginning of seeds.
```

#### 1. Pseudo-random number generators

Random number generators for massively parallel computing are an active area of research. Most applications require a stream of pseudo-random numbers for each thread, with good statistical properties both within each stream and across different streams. The statistical independence is crucial in our application to ensure that the set of tilings is connected by the cluster moves generated by the *Rotate* kernel.

For our implementation, we adopted TinyMT,^{20} a variant of the well-known *Mersenne Twister* pseudo-random number generator. Although the period of TinyMT ($2^{127} - 1$) is relatively small compared to that of the Mersenne Twister, TinyMT admits a large number (up to 2^{48}) of internal parameter values that ensure statistical independence (see Ref. 20) of generators with different parameter values. For tilings with fewer than 2^{48} vertices, it is convenient to simply instantiate a TinyMT generator with a unique parameter value for each vertex of the domain.

## IV. CONCLUSION

We tested our implementation using a few different graphics cards: an Intel Integrated HD 510, an NVidia Tesla P100, and an AMD Radeon Pro 555. Although a rigorous performance comparison of GPU and CPU algorithms can be a delicate matter (see Ref. 14), we nevertheless compared our implementation with standard CPU algorithms; see Fig. 10. We found that for smaller domains, the processing time is dominated by memory transfers and other overheads, and the CPU is often faster. For larger domains, the parallelism pays off, and we found significant speed-ups of an order of magnitude or more, depending on the quality of the graphics hardware.

There are several avenues for improvement and generalization that we leave for future work.

First, our implementation leaves some opportunities for optimization, which we have forgone in favor of simplicity and flexibility. On the other hand, the fundamental limiting factor of GPU performance, particularly for large domains, is the number of available processing cores. Therefore, a natural next step is the use of multiple GPUs in parallel with message passing, which has proven successful in other applications.

More broadly, various other algorithms have been developed for generating tilings and could be adapted effectively for the GPU. Among the earliest were shuffling algorithms for exactly sampling tilings of the Aztec diamond. Shuffling-type algorithms have since been generalized to many other settings.^{2}

Finally, there are many models closely related to tilings whose simulations are of great interest. In particular, considerable recent work has focused on certain stochastic processes, including the stochastic six-vertex model, exclusion processes, Schur processes, and others. While these models in a sense arise as special limits of dimer and vertex models, it is clear that their simulation could benefit from a different approach (Fig. 23).

## V. EXAMPLES

## ACKNOWLEDGMENTS

We are very grateful to Nicolai Reshetikhin for many helpful discussions during the course of this work. D.K. was supported by the NSF Grant No. DMS-1601947. A.S. was supported by the NSF Grant No. DMS-1601947 and by the DFG Collaborative Research Center TRR 109 “Discretization in Geometry and Dynamics.”