"Overcoming RL Limitations in HPC Scheduling: A Model-Based MCTS Approach for Practical Deployment (poster)"

Kurkure, Y., Zhang, Y., Papka, M. E., Lan, Z.

High-performance computing (HPC) job scheduling has seen promising advances with Deep Reinforcement Learning (DRL). However, challenges such as low interpretability, instability, and high computational cost hinder DRL’s practical adoption. We explore a model-based alternative using Monte Carlo Tree Search (MCTS) to overcome these limitations. By leveraging existing HPC simulators as models for MCTS and focusing on transparent decision-making, we aim to develop a scalable and interpretable scheduling solution fit for real-world deployment.
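To make the idea of using a simulator as the model for MCTS concrete, here is a minimal sketch of generic UCT-style search. It is not the authors' implementation: the poster does not specify the algorithm's details. `ToySimulator` is a hypothetical stand-in for a real HPC simulator (one machine, all jobs queued at time zero, reward equal to the negative normalized total wait time). The names `Node`, `mcts`, and `best_action` and the exploration constant `c` are likewise assumptions made for illustration.

```python
import math
import random


class ToySimulator:
    """Hypothetical stand-in for an HPC simulator: one machine, one job at a time."""

    def __init__(self, runtimes):
        self.runtimes = runtimes

    def initial(self):
        # State = (jobs already started, current clock, accumulated wait time).
        return (frozenset(), 0, 0)

    def actions(self, state):
        return [j for j in range(len(self.runtimes)) if j not in state[0]]

    def step(self, state, job):
        done, clock, wait = state
        # The chosen job has waited until `clock`, then occupies the machine.
        return (done | {job}, clock + self.runtimes[job], wait + clock)

    def is_terminal(self, state):
        return len(state[0]) == len(self.runtimes)

    def reward(self, state):
        # Negative total wait, scaled by a loose upper bound so it lies in [-1, 0].
        return -state[2] / (len(self.runtimes) * sum(self.runtimes))


class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}  # action -> Node
        self.visits, self.total = 0, 0.0


def mcts(sim, root_state, iterations=1000, c=0.5, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes using UCB1.
        while (not sim.is_terminal(node.state)
               and len(node.children) == len(sim.actions(node.state))):
            log_n = math.log(node.visits)
            node = max(node.children.values(),
                       key=lambda ch: ch.total / ch.visits
                       + c * math.sqrt(log_n / ch.visits))
        # 2. Expansion: add one untried action as a new child.
        if not sim.is_terminal(node.state):
            untried = [a for a in sim.actions(node.state) if a not in node.children]
            action = rng.choice(untried)
            node.children[action] = Node(sim.step(node.state, action), parent=node)
            node = node.children[action]
        # 3. Rollout: let the simulator play random decisions to the end.
        state = node.state
        while not sim.is_terminal(state):
            state = sim.step(state, rng.choice(sim.actions(state)))
        value = sim.reward(state)
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.total += value
            node = node.parent
    return root


def best_action(root):
    # Recommend the most-visited root action.
    return max(root.children, key=lambda a: root.children[a].visits)


sim = ToySimulator([5, 1, 3])  # runtimes of three queued jobs
root = mcts(sim, sim.initial(), iterations=2000)
for job, child in sorted(root.children.items()):
    # Per-action statistics are the kind of evidence that makes a decision inspectable.
    print(job, child.visits, round(child.total / child.visits, 3))
print("start job", best_action(root))
```

The visit counts and mean values printed for each candidate job show why one action was preferred, which is one way a tree search can expose its reasoning. A real deployment would replace `ToySimulator` with an existing HPC simulator and a site-specific objective.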