User experience creating dynamic parallel workflows on HPCMP systems with the FLUX job scheduler for propellant chemical kinetics rate optimization

Dynamic computational workflows are often needed when compute-intensive models and simulations are used within optimization algorithms. Several task-management frameworks can dynamically launch and monitor large numbers of serial jobs concurrently across the parallel resources allocated to an HPC batch job. However, support for dynamic workflows composed of multi-core or multi-node parallel jobs is limited. Users can create such workflows with standard Python concurrency libraries that call the PBS or SLURM scheduler to launch the parallel simulations. This approach can be efficient on HPCMP systems if (i) the simulations are plentiful and run long enough to overcome the latency of the system-level queue and (ii) each parallel simulation uses most of the computational resources on a compute node. However, when smaller (e.g., tens of cores), shorter-running (e.g., minutes) jobs are needed, the queue latency becomes a bottleneck and resource utilization is poor. Furthermore, the load on the system-level queue can be significant, impacting other users. The FLUX scheduler is a user-level queue that can be started within a user’s PBS or SLURM batch job. This enables queue-based, dynamic computational workflows that can potentially reduce both job-launch latency and the load on the system-level job schedulers. Furthermore, since multiple parallel and serial jobs can be allocated to the same compute node (i.e., non-exclusive node scheduling), the overall HPC resource utilization can be increased. We have used the FLUX scheduler to implement several dynamic computational workflows for determining the thermodynamic, diffusion, and chemical kinetics parameters needed for propellant combustion models. For example, we used FLUX to manage a Monte-Carlo-based search for intermolecular distances to estimate diffusion properties of gases. This required over 150k multicore QC simulations with run times ranging from O(10) seconds to O(10) hours. The FLUX-based workflow achieved 97% occupancy and reduced the HPC resource usage by a factor of 31 compared to naively using the system-level batch scheduler. We shall present this and other examples of dynamic computational workflows implemented with FLUX to accelerate our optimization workflows. Quantitative improvements in job throughput and HPC resource utilization will be presented along with an assessment of the difficulty of designing and implementing the workflows using FLUX’s command-line and Python interfaces.
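
As a rough illustration of the kind of dynamic workflow described above, the following minimal Python sketch keeps a fixed number of evaluations in flight and proposes a new sample each time one finishes. It is not the authors' implementation: `toy_simulation`, `dynamic_search`, `run_demo`, the sampling rule, and all numeric settings are hypothetical, and a standard-library thread pool stands in for the job scheduler so the sketch runs anywhere.

```python
import concurrent.futures as cf
import random


def toy_simulation(x):
    """Hypothetical stand-in for one simulation: objective with minimum at x = 2."""
    return (x - 2.0) ** 2


def dynamic_search(submit, n_total, n_concurrent, seed=0):
    """Keep up to n_concurrent evaluations in flight. Each time one finishes,
    propose a new random sample near the best point found so far.

    submit(x) must return a concurrent.futures-style future whose result
    is the objective value for x.
    """
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    pending = {}  # maps future -> sampled x
    launched = 0
    while launched < n_total or pending:
        # Top up the in-flight set until the concurrency limit is reached.
        while launched < n_total and len(pending) < n_concurrent:
            center = 0.0 if best_x is None else best_x
            x = center + rng.uniform(-1.0, 1.0)
            pending[submit(x)] = x
            launched += 1
        # Block until at least one evaluation finishes, then harvest it.
        done, _ = cf.wait(list(pending), return_when=cf.FIRST_COMPLETED)
        for fut in done:
            x = pending.pop(fut)
            f = fut.result()
            if f < best_f:
                best_x, best_f = x, f
    return best_x, best_f, launched


def run_demo():
    """Run the search with a local thread pool standing in for the scheduler."""
    with cf.ThreadPoolExecutor(max_workers=4) as pool:
        return dynamic_search(lambda x: pool.submit(toy_simulation, x), 40, 4)


if __name__ == "__main__":
    print(run_demo())
```

In a scheduler-backed version, the `submit` callable would presumably launch each evaluation as a multicore job (for example through the `FluxExecutor` class in FLUX's Python bindings) and the objective value would be read from the job's output. That substitution is an assumption and is not shown or tested here.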

IMPACT

Accomplishment: Successfully implemented multiple simulation-based parameter optimization algorithms using a new dynamic computational workflow framework on HPCMP systems. In one instance, this resulted in a 31x reduction in the HPC resources required to complete the simulations.

PRESENTER

Stone, Christopher
christopher.p.stone5.civ@army.mil
520-691-5244

DEVCOM Army Research Lab

CO-AUTHOR(S)

Chen, Chiung-Chu
chiung-chu.chen.civ@army.mil

CATEGORY

Comp Chemistry & Materials

SECONDARY CATEGORY

Computational workflows

SYSTEM(S) USED

Nautilus, Narwhal, Warhawk