"CQSim+: Symbiotic Simulation for Multi-Resource Scheduling in High-Performance Computing"

Kurkure, Y., Sharma, S., Wang, X., Papka, M., Lan, Z.

Figure: A time diagram of CQSim+ performing meta-scheduling for multiple systems.

Efficient job scheduling is crucial in high-performance computing (HPC), balancing user demands for quick job turnaround with facility goals for high resource utilization. Traditional scheduling requires users to specify a system at job submission, which can lead to inefficiencies. A unified scheduling approach, viewing the resources within a computing facility as an integrated pool, promises improved resource use and reduced job wait times. This paper presents CQSim+, an open-source, discrete event-driven simulator tailored for symbiotic multi-resource scheduling. CQSim+ supports dynamic simulation by continuously integrating real-time data from job schedulers, enabling adaptive scheduling based on the system’s current state. Through extensive experimentation, we demonstrate CQSim+’s ability to enhance resource utilization and decrease job wait times in both homogeneous and heterogeneous HPC environments.

Additionally, we present a case study that coordinates job scheduling between two production systems, illustrating how CQSim+ can effectively optimize job scheduling across distinct systems.
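To make the contrast between per-system submission and a unified resource pool concrete, the following is a minimal discrete-event sketch. It is not CQSim+ code and does not use its API: the `Job` record, the `simulate` function, the two-system capacities, and the three-job workload are all hypothetical, chosen only for illustration. It assumes first-come, first-served scheduling without backfilling, assumes a job cannot span systems, and does not model the real-time data ingestion that makes the simulator symbiotic.

```python
import heapq
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: int
    submit: int    # submission time (seconds)
    nodes: int     # nodes requested
    runtime: int   # actual runtime (seconds)
    system: str    # system the user chose at submission


def simulate(jobs, capacity, unified):
    """Return {job_id: wait time} under FCFS without backfilling.

    unified=False: each job waits for the system named at submission.
    unified=True:  one shared queue; a job may start on any system
                   that has enough free nodes (illustrative policy only).
    """
    free = dict(capacity)   # free nodes per system
    queues = {}             # queue key -> deque of waiting jobs
    waits = {}
    events = []             # (time, kind, seq, job, system); kind 0 = finish, 1 = submit
    seq = 0                 # unique tie-breaker so Job objects are never compared
    for job in jobs:
        heapq.heappush(events, (job.submit, 1, seq, job, None))
        seq += 1

    while events:
        now, kind, _, job, system = heapq.heappop(events)
        if kind == 0:
            free[system] += job.nodes            # finished job releases its nodes
        else:
            key = "*" if unified else job.system
            queues.setdefault(key, deque()).append(job)

        # Scheduling pass: start queue heads for as long as they fit.
        for q in queues.values():
            while q:
                head = q[0]
                candidates = list(capacity) if unified else [head.system]
                target = next((s for s in candidates if free[s] >= head.nodes), None)
                if target is None:
                    break                        # FCFS: a blocked head blocks its queue
                q.popleft()
                free[target] -= head.nodes
                waits[head.job_id] = now - head.submit
                heapq.heappush(events, (now + head.runtime, 0, seq, head, target))
                seq += 1
    return waits


# Hypothetical workload: every user picks system "A" while "B" sits idle.
capacity = {"A": 4, "B": 4}
jobs = [
    Job(0, submit=0, nodes=4, runtime=100, system="A"),
    Job(1, submit=10, nodes=2, runtime=100, system="A"),
    Job(2, submit=20, nodes=2, runtime=100, system="A"),
]

pinned_waits = simulate(jobs, capacity, unified=False)   # {0: 0, 1: 90, 2: 80}
unified_waits = simulate(jobs, capacity, unified=True)   # {0: 0, 1: 0, 2: 0}
```

In this toy workload, pinning every job to system A leaves B idle while two jobs wait behind the first, whereas the shared queue places them on B immediately. Real workloads, backfilling policies, and heterogeneous hardware would change the numbers; the sketch only illustrates the mechanism.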

CCS Concepts: Computing methodologies → Modeling methodologies; Discrete-event simulation; Real-time simulation; Simulation tools; Social and professional topics → System management

Keywords: Multi-resource scheduling, Simulation tool, Resource management systems, High-performance computing

https://doi.org/10.1145/3726301.3728404