The MIT Supercloud Dataset and Workload Classification Challenge

Weiss, Matthew (MIT Lincoln Laboratory)

Co-Authors:
Bestor, David
Yee, Charles
McDonald, Joseph
Frey, Nathan
Jones, Michael
Gadepally, Vijay
Samsi, Siddharth

Category:
Other: AI/ML for Datacenter optimization, Challenge

Artificial intelligence (AI) and machine learning (ML) workloads account for an increasingly large share of the compute workloads in traditional High-Performance Computing (HPC) centers and commercial cloud systems. This has led to changes in how HPC clusters and the commercial cloud are deployed, as well as a new focus on optimizing resource usage and allocations, deploying new AI frameworks, and providing capabilities such as Jupyter notebooks that enable rapid prototyping and deployment. With these changes, there is a need to better understand cluster and datacenter operations, with the goal of developing improved scheduling policies, identifying inefficiencies in resource utilization and energy/power consumption, predicting failures, and identifying policy violations. In this paper, we introduce the MIT Supercloud Dataset and the Workload Classification challenge, which aims to foster innovative AI/ML approaches to the analysis of large-scale HPC and datacenter/cloud operations. We provide detailed monitoring logs from the MIT Supercloud system, which include per-job CPU and GPU usage, memory usage, file system logs, and physical monitoring data. This paper describes the dataset, its collection methodology, data availability, and baseline implementations for the Workload Classification challenge. Datasets, code, and future challenge announcements are available at https://dcc.mit.edu.