Scaling Neural Network Training and Optimization on HPCMP
While HPCMP resources have traditionally catered to physics modeling and simulation, there is growing interest in using HPC resources within the DoD machine learning community. With more machine learning work leveraging the HPCMP, we explore how to streamline configuring and executing large-scale experiments on high-performance computing clusters. We demonstrate scalable distributed hyperparameter optimization, real-time metric tracking, and comprehensive cross-experiment analysis using HPCMP resources. Although our research focus is surrogate modeling with Deep Operator Networks, our method and tools can support efficient and reproducible machine learning workflows for any architecture. So far we have focused on PyTorch-based models, but the approach can be adapted to other frameworks. Our method builds on open-source tools such as Optuna and AIM to manage hyperparameter optimization and enable real-time visualization and monitoring. Our code and tools, developed with HPCMP machines in mind, are open source and actively maintained.
IMPACT
This work supports efficient and cost-effective optimization of neural networks on HPCMP. Our mission focus is surrogate modeling of computational fluid dynamics (CFD) for aerospace applications, but the approach is easily adapted to other domains. Accomplishment: we have brought together common machine learning workflow tools and tailored them for HPCMP. Result: improved efficiency of ML workflows for targeted researchers on HPCMP.
PRESENTER
Ratchford, Jasmine
jratchford@sei.cmu.edu
253-632-7902
CMU Software Engineering Institute
CO-AUTHOR(S)
Christiani, Marco
mchristiani@sei.cmu.edu
CATEGORY
AI/ML for HPC
SECONDARY CATEGORY
Surrogate Modeling for HPC
SYSTEM(S) USED
Narwhal, Nautilus