We will implement the Particle-Particle Particle-Mesh (P3M) method for calculating Coulombic interactions in a system of charged particles using CUDA.jl. We will verify that our implementation is correct by comparing against a direct sum on the CPU, and we will benchmark performance against the MPI implementations in open-source molecular dynamics codes (e.g. LAMMPS and HOOMD). Time permitting, we will employ CUDA-aware MPI.
We are implementing a parallel algorithm for calculating long-range interactions in molecular dynamics (MD) simulations. In MD, many interatomic potentials decay rapidly with separation distance, allowing the programmer to ignore interactions between far-apart particles when calculating energies and forces. This approximation cannot be made for Coulombic or gravitational interactions, where the energy goes as 1/r. Summing interactions of this form results in a conditionally convergent sum that requires interactions at all distances to be taken into account. To reduce the computational cost, researchers have developed several methods, such as Ewald summation and the fast multipole method; however, we will focus on the most performant and commonly used method, the particle-particle particle-mesh (P3M) method. In the P3M algorithm, the energies and forces are split into short- and long-range components:
the short-range component is calculated via the particle-particle approach and the long-range component via the particle-mesh approach. In the particle-particle approach, interactions are summed in a traditional force loop that makes use of neighbor lists and scales linearly with system size. The particle-mesh approach assigns the discrete charges to a mesh using B-spline interpolation; the resulting charge density is transformed into the Fourier domain, where the sum converges rapidly.
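To make the splitting concrete: P3M typically decomposes the Coulomb kernel with error functions, 1/r = erfc(αr)/r + erf(αr)/r, so the erfc term decays fast enough for a cutoff pair sum while the smooth erf term is handled on the mesh. Below is a minimal sketch of this split and of the lowest-order (cloud-in-cell) charge assignment in one dimension. It is written in Python purely for illustration (our project code will be Julia), and the function names and 1D setup are our own, not taken from the P3M literature:

```python
import math

def split_coulomb(r, alpha):
    """Ewald/P3M splitting of the 1/r kernel. The erfc part decays
    rapidly (short range: pair sum with a cutoff); the erf part is
    smooth (long range: handled on the mesh in Fourier space)."""
    short = math.erfc(alpha * r) / r
    long_range = math.erf(alpha * r) / r
    return short, long_range

def assign_charges(positions, charges, n_mesh, box_length):
    """1D cloud-in-cell charge assignment, the lowest-order B-spline.
    Each charge is spread over its two nearest mesh points with linear
    weights that sum to one, so total charge is conserved."""
    h = box_length / n_mesh           # mesh spacing
    rho = [0.0] * n_mesh              # charge density on the mesh
    for x, q in zip(positions, charges):
        s = (x % box_length) / h      # position in mesh units
        i = int(s)                    # left-hand mesh point
        f = s - i                     # fractional offset to the right
        rho[i % n_mesh] += q * (1.0 - f)
        rho[(i + 1) % n_mesh] += q * f
    return rho

# The two kernel pieces recover the full Coulomb kernel exactly ...
s, l = split_coulomb(1.7, 0.8)
assert abs(s + l - 1.0 / 1.7) < 1e-12
# ... and the mesh carries the same net charge as the particles.
rho = assign_charges([0.3, 2.7, 5.1], [1.0, -1.0, 1.0], 8, 6.0)
assert abs(sum(rho) - 1.0) < 1e-12
```

A production P3M code uses higher-order B-splines and a tuned splitting parameter α; this sketch only fixes the structure of the decomposition.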
This code can naturally be parallelized because the force/energy summation for an individual particle is (nearly) independent of all other particles. Furthermore, the FFT used in the particle-mesh approach is readily parallelized over grid points. The problem can also be decomposed spatially, as in HW3 and HW4, in which case we would use CUDA-aware MPI to break the problem down further. We will follow the steps below to implement P3M in parallel on one or more GPUs:
We have identified five major challenges in implementing P3M on GPUs:
By overcoming these challenges we hope to learn CUDA-aware MPI, the advantages and disadvantages of using CUDA and MPI in Julia, and how to use the CUDA libraries (specifically cuFFT).
We will be starting from scratch and programming in Julia, making use of the CUDA.jl wrapper. Our code will follow the implementation of P3M as outlined by its original authors, Eastwood and Hockney [1]. Their work discusses only the original P3M method, not how to parallelize it; for that, we will follow a Master's thesis that implements the similar PME summation on a GPU, as well as the literature describing P3M implementations in various MPI MD codes [2,3].
To run our code we will make use of our research group's resources (8x NVIDIA RTX 2080 Super GPUs) and potentially the resources of the Perlmutter supercomputer, should our code scale well.
We will create speed-up plots comparing our code to a simple CPU code and report mean squared errors (MSEs) in the energies and forces to demonstrate correctness and accuracy. To compare performance we will run the exact same simulation with MPI in LAMMPS, a production code maintained by Sandia National Laboratories, and compare run times at various system sizes. We will also investigate the weak and strong scaling of our code relative to LAMMPS. Through this analysis we hope to understand the benefits and drawbacks of using GPUs in molecular dynamics codes. Very few production molecular dynamics codes make full use of GPUs, and this report will help us understand and demonstrate the difficulty of mapping atomistic simulation onto a GPU. We also hope to understand the scaling properties of this problem as more and more GPUs are added, in comparison to the scaling of the CPU-only code in LAMMPS.
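The bookkeeping behind these comparisons is simple. As a sketch (in Python for illustration only; the actual comparison will be between our Julia/CUDA.jl code, a CPU direct sum, and LAMMPS, and these helper names are hypothetical), the two quantities we will report are:

```python
def force_mse(ref_forces, test_forces):
    """Mean squared error between a reference force set (e.g. a CPU
    direct sum) and a test force set (e.g. the GPU P3M code), averaged
    over all Cartesian components of all particles."""
    n = 0
    total = 0.0
    for fr, ft in zip(ref_forces, test_forces):
        for a, b in zip(fr, ft):     # loop over x, y, z components
            total += (a - b) ** 2
            n += 1
    return total / n

def speedup(t_reference, t_gpu):
    """Speed-up of the GPU code relative to a reference wall-clock time."""
    return t_reference / t_gpu

# Toy numbers to show the conventions; not measured data.
ref = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
test = [(1.0, 0.0, 0.1), (0.0, 2.0, 0.0)]
assert abs(force_mse(ref, test) - 0.01 / 6) < 1e-15
assert speedup(10.0, 2.5) == 4.0
```

For the scaling studies, weak scaling holds the number of particles per GPU fixed while adding GPUs, and strong scaling holds the total system size fixed; both reduce to the same speed-up ratio computed above.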
We have chosen to write our code in Julia to target GPUs. Julia has a rich ecosystem of GPU libraries that wrap their C++ counterparts. For example, the CUDA.jl library provides almost all of the functionality of the CUDA C library, but in a higher-level language (Julia). Furthermore, Julia offers the KernelAbstractions.jl library, which allows you to target any GPU backend (e.g. Metal, ROCm, CUDA, and oneAPI). We will not use KernelAbstractions.jl in this project, but by writing in Julia our project could be easily ported to any GPU backend. We chose to target GPUs because many popular molecular dynamics codes have poor GPU performance and scaling (relative to CPU) despite papers and open-source codes demonstrating the superiority of GPUs [4].
This is our rough schedule. We will attempt to adapt the code to multiple GPUs only once it has been ported to a single GPU. This is a lofty goal that may not be completed, but it would be useful if we can figure it out. Should the multi-GPU approach fail, we might instead use the KernelAbstractions.jl framework to target arbitrary GPU backends. We will update the schedule if we make this decision.