In our previous work, we studied the performance of a parallel program, based on a direction splitting approach, that solves the time-dependent Stokes equation. That program used a uniform rectangular mesh combined with a central difference scheme for the second derivatives. We targeted massively parallel computers, as well as clusters of multi-core nodes. Therefore, the implementation used hybrid parallelization based on the MPI and OpenMP standards. Specifically, (i) between-node parallelism was supported by MPI-based communication, while (ii) inside-node parallelism was supported by OpenMP. In this way, by matching the “structure of parallelization” with the architecture of modern large-scale computers, we attempted to maximize the parallel efficiency of the program.
This paper presents an experimental performance study of the developed parallel implementation on a supercomputer using Intel Xeon processors, as well as Intel Xeon Phi co-processors. The experimental results show a substantial performance improvement across a variety of problem sizes and numbers of cores and threads.
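For reference, the central difference scheme for second derivatives mentioned above can be written, in its standard second-order form on a uniform mesh, as shown below. The notation is an assumption introduced here for illustration: $u_i$ denotes the value of the discretized unknown at mesh node $x_i$, and $h$ denotes the uniform mesh spacing in the given direction.

\[
\left.\frac{\partial^2 u}{\partial x^2}\right|_{x = x_i} \approx \frac{u_{i-1} - 2u_i + u_{i+1}}{h^2}
\]

Applied along a single coordinate direction, this stencil couples each node only to its two neighbors in that direction.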