An Assessment of the Performance Tradeoffs of LAMMPS Molecular Dynamics Simulations on GPUs and CPUs for Representative Interatomic Potentials

Molecular dynamics (MD) simulations have a range of applications across the DoD. For example, one-billion-atom MD simulations have been used in support of multiscale modeling of energetic materials. Using the MD code LAMMPS, such simulations were recently run on HPCMP resources, consuming ~200 million CPU-hours. Due to memory requirements, these simulations each required more than 400 nodes and used the hybrid MPI/OpenMP parallelization scheme within LAMMPS. At scale, this parallelization scheme is necessary to reduce the number of MPI ranks, because collective communications become the computational bottleneck for potentials that include long-range electrostatic interactions. GPU implementations of MD in LAMMPS use a single MPI rank per GPU, which can greatly reduce the volume of collective communications without sacrificing computational efficiency. However, GPU solvers tend to be memory constrained, and moving data between the CPU and GPU often diminishes performance, limiting the achievable speedup. While GPU vendors have demonstrated speedups over CPUs with LAMMPS, they typically use weak CPU baselines, do not show how performance scales with system size for either the CPU or GPU implementations, and do not report the maximum possible system size for a given GPU or CPU node. Such details are necessary for users to assess whether to transition their workloads from CPU nodes to GPU nodes. This presentation discusses efforts to evaluate the GPU implementation of LAMMPS for four representative test systems to identify the resource requirements for running large-scale simulations and the performance benefits, if any, of transitioning MD workloads from CPU nodes to GPU nodes. It will describe both single- and multi-node tests run on both AMD and NVIDIA GPUs. It will also show performance comparisons with similarly sized tests run on CPU nodes of the same HPC systems (where CPU simulations used one MPI rank per CPU core).
The presentation will show that, with an equivalent number of compute nodes, CPU simulations generally performed better than the GPU simulations. Lastly, it will show results from similarly sized hybrid MPI/OpenMP simulations that were far less performant than either the GPU or CPU simulations, suggesting that large-scale MD simulations may still benefit from GPUs but, due to memory constraints, would require more GPUs than are readily available on most HPCMP systems.
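For context, the three parallelization modes compared above can be sketched as illustrative LAMMPS launch lines. The `-sf`/`-pk` suffix and package flags and the KOKKOS `-k on g` option are standard LAMMPS command-line syntax, but the input file name and the rank, thread, and GPU counts below are placeholders, not the actual configurations used in these tests.

```shell
# in.benchmark is a placeholder input script; counts are illustrative only.

# 1. Pure MPI on a CPU node: one MPI rank per physical core (e.g., 128 cores).
srun -n 128 lmp -in in.benchmark

# 2. Hybrid MPI/OpenMP: fewer ranks with 4 OpenMP threads each, reducing the
#    number of ranks participating in collective communications.
OMP_NUM_THREADS=4 srun -n 32 lmp -sf omp -pk omp 4 -in in.benchmark

# 3. GPU run via the KOKKOS package: one MPI rank per GPU (e.g., 4 GPUs/node).
srun -n 4 lmp -k on g 4 -sf kk -in in.benchmark
```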

IMPACT

Accomplishment: Ran four LAMMPS test cases representative of common MD applications on both the GPU and CPU nodes of Nautilus and Ruth. Result: Demonstrated that CPU performance was generally better when CPU simulations could utilize one MPI rank per physical CPU core, suggesting that LAMMPS users are likely better served by CPU nodes. This finding could inform future TI acquisitions.

PRESENTER

Lasinski, Michael
michael.e.lasinski.ctr@army.mil
765-490-7747

HPCMP PET

CO-AUTHOR(S)

Boyer, Mathew
mathew.j.boyer.ctr@army.mil

CATEGORY

Comp Chemistry & Materials

SECONDARY CATEGORY

GPU usage for HPC

SYSTEM(S) USED

Nautilus and Ruth