Scaling Neural Network Training and Optimization on HPCMP

While HPCMP resources have typically catered to physics modeling and simulation, there is growing interest in HPC resources within the DoD machine learning community. With more machine learning work leveraging HPCMP, we explore how to streamline configuring and executing large-scale experiments on high-performance computing clusters. We demonstrate scalable distributed hyperparameter optimization, real-time metric tracking, and comprehensive cross-experiment analysis using HPCMP resources. Although our research focus is surrogate modeling using Deep Operator Networks, our method and tools can facilitate efficient and reproducible machine learning workflows for any architecture. We have so far focused on PyTorch-based models, but the approach can be adapted to other frameworks. Our method builds on open-source tools such as Optuna and Aim to manage hyperparameter optimization and to enable real-time visualization and monitoring. Our code and tools, developed with HPCMP machines in mind, are open-sourced and actively maintained.

IMPACT

This work supports efficient and cost-effective optimization of neural networks on HPCMP. The focus mission is surrogate modeling of CFD for aerospace applications, but the approach is easily adapted to other domains. Accomplishment: we have brought together common tools for machine learning workflows and tailored them for HPCMP. Result: improved the efficiency of ML workflows for targeted researchers on HPCMP.

PRESENTER

Ratchford, Jasmine
jratchford@sei.cmu.edu
253-632-7902

CMU Software Engineering Institute

CO-AUTHOR(S)

Christiani, Marco
mchristiani@sei.cmu.edu

CATEGORY

AI/ML for HPC

SECONDARY CATEGORY

Surrogate Modeling for HPC

SYSTEM(S) USED

Narwhal, Nautilus