An Initial Study on Correlating System Information and Job Data in the HPCMP Workload Characterization System
The High Performance Computing Modernization Program (HPCMP) Workload Characterization (WC) System is a platform for gathering comprehensive data related to HPC resource utilization. This includes detailed HPC system information, which is constantly sampled from various hardware components and covers metrics like CPU utilization, GPU activity, memory bandwidth, and I/O. In addition to system-level data, the WC system collects job-specific data from scheduler job logs, such as resources used (cores, memory, GPUs) and execution times. While both system sampling and job log data are routinely collected, a significant opportunity exists to integrate these datasets for a more holistic understanding of workload behavior. Currently, the two datasets are often analyzed in silos, without correlating job performance to underlying system conditions. Part of this study aims to bridge that gap by establishing a clear connection between job characteristics and system-level metrics for a carefully selected set of jobs. The study will then attempt to identify patterns, relationships, and dependencies among the data. The focus will be on jobs that demonstrate unexpected performance bottlenecks, represent critical scientific applications, or consume a disproportionate amount of resources. By analyzing this integrated data, we anticipate gaining valuable insights into resource usage and job performance that can ultimately improve the overall efficiency and performance of HPC resources for the broader user community.
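As a rough illustration of the kind of correlation described above, the sketch below attaches system samples to a job by matching on node and time window, then summarizes one metric per job. It is a minimal sketch, not the WC system's actual method or schema: the `Job` and `Sample` record layouts, the field names, and all values are hypothetical and synthetic, chosen only to show the join.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Job:
    """Hypothetical job record, as might be parsed from a scheduler log."""
    job_id: str
    nodes: frozenset  # names of the nodes the job ran on
    start: int        # start time, seconds since an arbitrary epoch
    end: int          # end time (exclusive), same units


@dataclass(frozen=True)
class Sample:
    """Hypothetical system sample for one node at one instant."""
    node: str
    time: int
    cpu_util: float   # percent


def samples_for_job(job, samples):
    """Return samples taken on the job's nodes while the job was running."""
    return [
        s for s in samples
        if s.node in job.nodes and job.start <= s.time < job.end
    ]


def mean_cpu_util(job, samples):
    """Mean CPU utilization over the job's matched samples, or None if none."""
    matched = samples_for_job(job, samples)
    return mean(s.cpu_util for s in matched) if matched else None


# Synthetic data for illustration only.
jobs = [
    Job("job-a", frozenset({"n1", "n2"}), 0, 120),
    Job("job-b", frozenset({"n3"}), 60, 180),
]
samples = [
    Sample("n1", 0, 80.0),
    Sample("n1", 60, 90.0),
    Sample("n2", 60, 70.0),
    Sample("n1", 120, 5.0),   # taken after job-a ended, so not matched
    Sample("n3", 60, 20.0),
    Sample("n3", 120, 40.0),
    Sample("n3", 180, 99.0),  # taken after job-b ended, so not matched
]

summary = {job.job_id: mean_cpu_util(job, samples) for job in jobs}
print(summary)
```

The same node-and-interval match would apply to any sampled metric (GPU activity, memory bandwidth, I/O). At production scale the linear scan would likely be replaced by an indexed or database-side join, and nodes shared between jobs would need separate handling.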
IMPACT
This study will significantly enhance our understanding of HPC workload behavior by integrating system-level and job-specific data. The resulting insights may lead to practices that improve system performance, reduce wait times for users, and make the use of HPC resources more efficient. Ultimately, this work will contribute toward the HPCMP’s purpose of providing computational resources that allow DoD scientists to solve technical challenges.
PRESENTER
Leach, Carrie
Carrie.L.Leach2@usace.army.mil
601-953-5335
ERDC DSRC
CATEGORY
Test & Eval for HPC
SYSTEM(S) USED
Blueback@NAVY, Carpenter@ERDC, Nautilus@NAVY