Kubernetes on High Performance Computing Modernization Program (HPCMP)

As HPC systems modernize, a wider range of workloads must run on the compute nodes. We have seen the need to run persistent services on HPC systems, such as GitLab, continuous integration, software security scanning, artifact archival, knowledge and documentation management, and leaderboards to monitor algorithm performance. Some of these services require large and dynamic resource allocations. HPC systems are often composed of a large number of compute nodes and a few smaller login nodes, leaving no good place to run these persistent services. Furthermore, the types of workloads run on compute nodes vary widely, especially with ML training tools. Some workloads require varying amounts of resources as they progress, which does not fit the typical HPC scheduling paradigm. We propose using Kubernetes on the HPC compute nodes to help solve both of these problems and to provide increased flexibility for future workloads. We describe how Ansible, Air & Space Force Cognitive Engine (ASCE), Rancher Kubernetes Engine (RKE2), and ASCE Data Tool were used to build a Kubernetes-based HPC system in an isolated environment on the Vulcanite HPCMP system. Vulcanite is a repurposed 39-node HPCMP system in which all nodes have GPUs and NVMe storage. We show a novel approach to Kubernetes API authentication that builds on the existing HPCMP approach to SSH+Kerberos. Kubernetes RBAC mechanisms provide authorization for a multi-tenant cluster. We present vignettes of how researchers interact with such a Kubernetes-based HPC system using Ray, ASCE Hub, and a leaderboard. Storage is managed in a cloud-native way with Rook-managed Ceph storage and a dedicated Lustre storage appliance. We show how the system is secured, in part, by running users' workloads with the correct user and group IDs. Finally, we show how resources, including GPUs, are scheduled and how GPUs are shared.

PRESENTER

Tarplee, Kyle
kyle.tarplee@afrl.af.mil
970-231-0663

University of Dayton Research Institute, AFRL/RYZA (ACT3)

CO-AUTHORS

Kevin Pitstick
kapitstick@sei.cmu.edu

Glover George
Glover.E.George@erdc.dren.mil

CATEGORY

Artificial Intelligence / Machine Learning usage for HPC Applications

SYSTEMS USED

Vulcanite

SECRET

No