Project Proposal

Summary

We will implement the Particle-Particle Particle-Mesh method for calculating the Coulombic interactions in a system of charged particles using CUDA.jl. We will verify that our implementation is correct by comparing to a direct sum on the CPU, and we will benchmark performance against the MPI implementations in open-source molecular dynamics codes (e.g. LAMMPS/HOOMD). Time permitting, we will employ CUDA-aware MPI.

Background

We are implementing a parallel algorithm for calculating long-range interactions in molecular dynamics (MD) simulations. In MD, many interatomic potentials decay rapidly with separation distance, which allows the programmer to ignore interactions between far-apart particles when calculating energies and forces. This approximation cannot be made for Coulombic or gravitational interactions, where the energy goes as 1/r. Summing interactions of this form results in a conditionally convergent sum that requires interactions over all distances to be taken into account. To reduce the computational cost, researchers developed several methods such as Ewald sums and the fast multipole method; however, we will focus on the most performant and commonly used method, the particle-particle particle-mesh (P3M) method. In the P3M algorithm, the energies and forces are split into short-range and long-range components:

\[E_{\text{tot}} = E_{\text{sr}} + E_{\text{lr}}\]

where the short-range component is calculated via the particle-particle approach and the long-range component is calculated via the particle-mesh approach. In the particle-particle approach, interactions are summed through a traditional force loop that uses neighbor lists and scales linearly with the system size. The particle-mesh approach assigns the discrete charges onto a mesh via B-spline interpolation. The resulting charge field is transformed into Fourier space, where the sum converges rapidly.
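The splitting P3M builds on is the standard Ewald decomposition. Writing it out in one common convention (Gaussian units, splitting parameter $\alpha$, box volume $V$, reciprocal vectors $\mathbf{k}$) makes clear which piece each half of the algorithm computes:

\[E_{\text{sr}} = \frac{1}{2}\sum_{i \neq j} q_i q_j \frac{\operatorname{erfc}(\alpha r_{ij})}{r_{ij}}, \qquad E_{\text{lr}} = \frac{1}{2V}\sum_{\mathbf{k} \neq 0} \frac{4\pi}{k^2}\, e^{-k^2/(4\alpha^2)} \Big|\sum_j q_j e^{i\mathbf{k}\cdot\mathbf{r}_j}\Big|^2 - \frac{\alpha}{\sqrt{\pi}}\sum_i q_i^2\]

The $\operatorname{erfc}$ factor decays quickly with distance, so $E_{\text{sr}}$ can be truncated at a cutoff and evaluated with neighbor lists, while the Gaussian factor $e^{-k^2/(4\alpha^2)}$ decays quickly in $k$, so $E_{\text{lr}}$ needs only a modest mesh; the final term removes the self-interaction introduced by the charge smearing.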

This code can naturally be parallelized because the force/energy summation for an individual particle is (nearly) independent of all other particles. Furthermore, the FFT used in the particle-mesh approach is readily parallelized over grid points. The problem can also be decomposed spatially, as in HW3 and HW4, in which case we would use CUDA-aware MPI to break the problem down further. We will follow the steps below to implement P3M in parallel on one or more GPUs:

  1. Decompose the domain into equal cubes, assuming the number of GPUs is a perfect cube.
  2. Move particle positions, charges and neighbor lists to the GPU.
  3. Calculate which particles must be communicated to other processes and send them. This assumes the GPUs have a direct connection and the data does not have to go through the CPU; if the GPUs are not directly connected, it would be better to do this on the CPU.
  4. Launch the particle-mesh kernels on GPU stream 1:
    1. Calculate the B-spline interpolation of the discrete charges onto the continuous mesh
    2. FFT the mesh
    3. Calculate the long-range energies
  5. Launch the particle-particle component on GPU stream 2 (a kernel sketch is given after this list). This component could be calculated on the CPU while the mesh component is computed; however, that would be unfaithful to how MD is actually implemented, where there are other short-range forces (van der Waals, bonds, etc.) to be calculated on the CPU. To keep our code comparable to LAMMPS we will run this component on the GPU.
    1. Loop over neighbors to calculate the short-range component of the energy, similar to HW3 and HW4.
  6. Reduce the short-range and long-range components on the GPU, then send the results back to the CPU for the position/velocity update.
  7. Repeat steps 2-6 every timestep.
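As a rough sketch of how step 5 could map onto the GPU with CUDA.jl, the kernel below assigns one thread per particle and accumulates that particle's truncated Coulomb energy. It is deliberately simplified: the inner loop scans all particles instead of a neighbor list, periodic images are ignored, and the Ewald screening factor $\operatorname{erfc}(\alpha r)$ is only indicated in a comment. The function name shortrange_energy! and all parameters (cutoff, particle count, random test data) are hypothetical, not taken from any existing code.

    using CUDA

    # One thread per particle; each thread accumulates that particle's share of the
    # short-range energy over every other particle inside the cutoff. A neighbor
    # list would replace the inner loop in the real implementation.
    function shortrange_energy!(E, x, y, z, q, rcut2, N)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= N
            Ei = 0.0f0
            for j in 1:N
                if j != i
                    dx = x[i] - x[j]; dy = y[i] - y[j]; dz = z[i] - z[j]
                    r2 = dx*dx + dy*dy + dz*dz
                    if r2 < rcut2
                        # Full P3M multiplies by erfc(alpha*r) here (the Ewald
                        # screening factor); omitted to keep the sketch self-contained.
                        Ei += q[i] * q[j] / sqrt(r2)
                    end
                end
            end
            E[i] = 0.5f0 * Ei    # half to avoid double-counting pairs
        end
        return nothing
    end

    N = 4096
    x = CUDA.rand(Float32, N); y = CUDA.rand(Float32, N); z = CUDA.rand(Float32, N)
    q = CUDA.rand(Float32, N) .- 0.5f0
    E = CUDA.zeros(Float32, N)

    threads = 256
    @cuda threads=threads blocks=cld(N, threads) shortrange_energy!(E, x, y, z, q, 0.25f0^2, N)
    E_sr = sum(E)    # reduction performed on the GPU

Writing one partial energy per particle and reducing afterwards avoids atomic updates to a shared accumulator; step 6 would then combine this result with the long-range piece before copying back to the CPU.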

The Challenge

We have identified five major challenges when implementing P3M on GPU(s):

  1. Particles will not necessarily interact with the particles adjacent to them in memory, leading to uncoalesced memory accesses and poor GPU performance.
  2. Communication will be negligible in the single-GPU case, as most of the calculation can be done on the GPU and only the results sent back to the CPU. With MPI we will implement a scheme similar to HW4, but in 3D and with periodic boundary conditions. This will introduce some communication, but we can hide it behind computation.
  3. The particle-particle method will likely be much more efficient than the particle-mesh method and could result in load-balancing issues.
  4. Re-implementing neighbor lists to match the memory-access requirements of GPU code (e.g. coalesced access). The tree format from HW3 and HW4 is not amenable to this, at least not without modifications.
  5. Incorporating CUDA libraries to enhance performance, specifically cuFFT, though Thrust could also be useful (a minimal cuFFT sketch follows this list).
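For item 5, reaching cuFFT from Julia should be straightforward, since CUDA.jl ships cuFFT bindings behind the standard AbstractFFTs interface. The sketch below is illustrative only: the $64^3$ mesh size is arbitrary, the charge data is random, and the exact module path may vary slightly between CUDA.jl versions.

    using CUDA
    using CUDA.CUFFT    # cuFFT wrappers; provide fft/ifft for CuArrays

    # Placeholder 64^3 charge mesh; in P3M this would hold the B-spline-assigned charges.
    rho = CUDA.rand(Float32, 64, 64, 64)

    rho_k = fft(ComplexF32.(rho))    # forward 3D transform executed by cuFFT on the GPU
    # ... multiply rho_k by the P3M influence (Green's) function in k-space here ...
    phi = real.(ifft(rho_k))         # back-transform to get the long-range potential on the mesh

In the real code a real-to-complex transform (rfft/irfft) and a precomputed plan (plan_fft) would be used to avoid the explicit complex conversion and repeated planning.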

By overcoming these challenges we hope to learn CUDA-aware MPI, the advantages and disadvantages of using CUDA/MPI in Julia, and how to use the CUDA libraries (specifically cuFFT).

Resources

We will be starting from scratch and programming in Julia, making use of the CUDA.jl wrapper. Our code will follow the implementation of P3M as outlined by its original authors, Hockney and Eastwood [1]. That reference only discusses the original P3M method, not how to parallelize it; for the parallel aspects we will follow a Master's thesis which implements the similar PME summation on a GPU, as well as the literature describing P3M implementations in various MPI-based MD codes [2,3].

To run our code we will make use of our research group's resources (8x NVIDIA 2080 Super GPUs) and potentially the resources of the Perlmutter supercomputer should our code scale well.

Goals and Deliverables

Plan to Achieve

  1. Reference CPU code to measure naive performance and to verify that our P3M implementation attains a small error (a direct-sum sketch is given after this list). The P3M error is bounded analytically, and we will check that Float32 GPU calculations do not push our implementation past this bound.
  2. 3D domain decomposition with periodic boundary conditions
  3. Single GPU implementation of P3M algorithm
  4. A single GPU is 5x faster than an equivalently priced CPU using all of its cores.
  5. GPU utilization is $>$80\% and all parts of the P3M algorithm are done on the GPU (the data stays on the GPU).
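A minimal sketch of the reference calculation in goal 1, in plain Julia, is below. The all-pairs sum is only meaningful on small test systems (or against a well-converged Ewald result for periodic ones); the name direct_energy and its arguments are hypothetical.

    # O(N^2) direct Coulomb sum in Float64 (unit Coulomb constant, open boundaries).
    # Serves as the accuracy reference for the Float32 GPU P3M energies.
    function direct_energy(r::Matrix{Float64}, q::Vector{Float64})
        N = length(q)
        E = 0.0
        for i in 1:N-1, j in i+1:N
            d = sqrt((r[1, i] - r[1, j])^2 + (r[2, i] - r[2, j])^2 + (r[3, i] - r[3, j])^2)
            E += q[i] * q[j] / d
        end
        return E
    end

    # Example: relative error of a GPU result E_gpu against the reference.
    r = rand(3, 100); q = rand(100) .- 0.5
    E_ref = direct_energy(r, q)
    # rel_err = abs(E_gpu - E_ref) / abs(E_ref)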

Hope to Achieve

  1. CUDA-aware MPI implementation for single- and multi-node systems (a minimal sketch is given after this list)
  2. Linear weak scaling with the number of GPUs (at least up to the number of GPUs in one node) with $>$80\% parallel efficiency, with communication hidden by computation.
  3. Recursive bisection for the domain decomposition to automatically handle load balancing across MPI ranks
  4. Investigate the potential to use a mixed-precision algorithm when calculating energies. The P3M energies are expected to be more accurate than the forces (and are less important in MD anyway), so the overall calculation could be sped up if the energies were calculated in Float16.
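For goal 1, the appeal of CUDA-aware MPI is that MPI.jl can be handed CuArray buffers directly, so device data never has to be staged through host memory. The sketch below is illustrative only: it reduces per-rank contributions to a small replicated mesh (a simple but not optimal strategy), the array size is arbitrary, and it assumes an MPI build with CUDA support.

    using MPI, CUDA

    MPI.Init()
    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm)

    # Each rank's partial contribution to a (small, replicated) charge mesh, on the GPU.
    local_mesh = CUDA.rand(Float32, 32, 32, 32)

    # With CUDA-aware MPI the CuArray is passed directly; otherwise Array(...) copies
    # to host memory would be needed on both sides of the call.
    total_mesh = MPI.Allreduce(local_mesh, +, comm)

    MPI.Finalize()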

Deliverables

We will create speed-up plots that compare our code to a simple CPU code and report energy/force MSEs to demonstrate correctness and accuracy. To compare performance we will run the exact same simulation in LAMMPS, a production code maintained by Sandia National Laboratories, with MPI and compare the run times at various system sizes. We will also investigate the weak and strong scaling of our code in comparison to LAMMPS. Through this analysis we hope to understand the benefits and drawbacks of using GPUs in molecular dynamics codes. Very few production molecular dynamics codes make full use of GPUs, and this report will help us understand and demonstrate the difficulty of mapping atomistic simulation onto a GPU. We also hope to understand how this problem scales as more GPUs are added, in comparison to the scaling of the CPU-only code in LAMMPS.

Platform Choice

We have chosen to write our code in Julia to target GPUs. Julia has a rich ecosystem of GPU libraries which wrap their C++ counterparts. For example, the CUDA.jl library provides almost all of the functionality of the CUDA C library, but in a higher-level language (Julia). Furthermore, Julia offers the KernelAbstractions.jl library, which allows you to target any GPU backend (e.g. Metal, ROCm, CUDA and oneAPI). We will not use KernelAbstractions.jl in this project, but by writing in Julia our project could be easily ported to any GPU backend. We chose to target GPUs because many popular molecular dynamics codes have poor GPU performance and scaling (relative to CPU) despite papers and open-source codes demonstrating the superiority of GPUs [4].

Schedule

This is our rough schedule. We will only attempt to adapt the code to multiple GPUs once our original code is ported to a single GPU. The multi-GPU port may not be completed, as it is a lofty goal, but it is something that would be useful if we can figure it out. Should the multi-GPU approach fail, we might instead use the KernelAbstractions.jl framework to target arbitrary GPU backends. We will update the schedule if we make this decision.

References

  1. R.W. Hockney and J.W. Eastwood. Computer simulation using particles. 1988.
  2. Aleksei Iupinov. Implementation of the Particle Mesh Ewald method on a GPU. Tech. rep. KTH Royal Institute of Technology, 2016.
  3. Philippe P. Brieu and August E. Evrard. P4M: a parallel version of P3M. Tech. rep. 2000, pp. 163–180.
  4. Joshua A. Anderson, Chris D. Lorenz, and A. Travesset. “General purpose molecular dynamics simulations fully implemented on graphics processing units”. In: Journal of Computational Physics 227.10 (May 2008), pp. 5342–5359. issn: 10902716. doi: 10.1016/j.jcp.2008.01.047.