SLURM: A Highly Scalable Resource Manager
SLURM is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
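That third function, queue management, is visible through SLURM's C API. The sketch below is a minimal example, assuming a working SLURM installation with libslurm and its headers available; the calls (slurm_load_jobs, slurm_print_job_info_msg, slurm_free_job_info_msg) come from <slurm/slurm.h>, though exact signatures can vary between SLURM releases. Compile with something like gcc list_jobs.c -lslurm.

    /* Minimal sketch: ask slurmctld for the current job list and
     * print one line per pending or running job.  Assumes a working
     * SLURM installation; signatures may vary across versions. */
    #include <stdio.h>
    #include <time.h>
    #include <slurm/slurm.h>
    #include <slurm/slurm_errno.h>

    int main(void)
    {
        job_info_msg_t *jobs = NULL;

        /* Passing 0 as the update time requests all jobs. */
        if (slurm_load_jobs((time_t) 0, &jobs, SHOW_ALL) != SLURM_SUCCESS) {
            slurm_perror("slurm_load_jobs");
            return 1;
        }

        /* A non-zero third argument requests one line per job. */
        slurm_print_job_info_msg(stdout, jobs, 1);
        slurm_free_job_info_msg(jobs);
        return 0;
    }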
SLURM's design is highly modular, with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton). More complex configurations rely upon a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms. SLURM also provides an application programming interface (API) for integration with external schedulers such as The Maui Scheduler or Moab Cluster Suite.
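To give a sense of how small that simplest configuration is, a minimal slurm.conf can amount to a handful of lines like the sketch below; the hostname and node names are invented for illustration, and auth/munge is just one of the pluggable authentication options.

    # Minimal slurm.conf sketch (hostnames are placeholders)
    ControlMachine=head
    AuthType=auth/munge
    NodeName=node[01-04] CPUs=2 State=UNKNOWN
    PartitionName=debug Nodes=node[01-04] Default=YES MaxTime=INFINITE State=UP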
While other resource managers do exist, SLURM is unique in several respects:
- Its source code is freely available under the GNU General Public License.
- It is designed to operate in a heterogeneous cluster with up to 65,536 nodes and hundreds of thousands of processors.
- It is portable, written in C with a GNU autoconf configuration engine. While SLURM was initially written for Linux, other UNIX-like operating systems should be easy porting targets.
- SLURM is highly tolerant of system failures, including failure of the node executing its control functions.
- A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality; a minimal plugin skeleton appears after this list.
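To illustrate that last point, each plugin is a C shared object exporting a few symbols that SLURM's plugin loader inspects before use. The skeleton below shows the common shape, assuming the standard plugin symbols (plugin_name, plugin_type, plugin_version, init, fini); the "sched/example" type string and version number are placeholders, and each plugin class defines additional required entry points of its own.

    /* Skeleton of a SLURM plugin, built as a shared object.
     * The type string and version here are illustrative only. */
    #include <stdint.h>

    const char     plugin_name[]  = "Example scheduler plugin";
    const char     plugin_type[]  = "sched/example";
    const uint32_t plugin_version = 100;

    /* Called once when the plugin is loaded. */
    int init(void)
    {
        return 0;   /* SLURM_SUCCESS */
    }

    /* Called once before the plugin is unloaded. */
    void fini(void)
    {
    }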
SLURM provides resource management on about 1,000 computers worldwide, including many of the most powerful systems in the world:
- BlueGene/L at LLNL with 106,496 dual-core processors
- EKA at Computational Research Laboratories, India, with 14,240 Xeon processors and an InfiniBand interconnect
- ASC Purple, an IBM SP/AIX cluster at LLNL, with 12,208 POWER5 processors and a Federation switch
- MareNostrum, a Linux cluster at the Barcelona Supercomputing Center, with 10,240 PowerPC processors and a Myrinet switch
- Anton, a massively parallel supercomputer designed and built by D. E. Shaw Research for molecular dynamics simulation, using 512 custom-designed ASICs and a three-dimensional torus interconnect
SLURM is actively developed, distributed, and supported by Lawrence Livermore National Laboratory, Hewlett-Packard, and Bull. It is also distributed and supported by Cluster Resources, SiCortex, Infiscale, IBM, and Sun Microsystems.
Last modified 25 March 2009