Preemption

SLURM version 1.2 and earlier supported dedication of resources to jobs based on a simple "first come, first served" policy with backfill. Beginning in SLURM version 1.3, priority partitions and priority-based preemption are supported. Preemption is the act of suspending one or more "low-priority" jobs to let a "high-priority" job run uninterrupted until it completes. Preemption provides the ability to prioritize the workload on a cluster.

The SLURM version 1.3.1 sched/gang plugin supports preemption. When configured, the plugin monitors each of the partitions in SLURM. If a new job in a high-priority partition has been allocated resources that are already allocated to one or more existing jobs in lower priority partitions, the plugin respects the partition priority and suspends the low-priority job(s). The low-priority job(s) remain suspended until the high-priority job completes, at which point they are resumed.

Configuration

There are several important configuration parameters relating to preemption:

SelectType: preemption works with nodes allocated by the select/linear plugin as well as with the socket/core/CPU resources allocated by the select/cons_res plugin (see "Future Ideas" below for differences in preemptive job placement between the two).
SchedulerType: configure SchedulerType=sched/gang in slurm.conf to activate the plugin that performs the preemption.
Priority: each partition's Priority setting establishes the preemption order; jobs in a partition with a higher Priority can preempt jobs in partitions with a lower Priority.
Shared: each partition's Shared setting controls whether its resources may be oversubscribed by multiple jobs.

To enable preemption after making the configuration changes described above, restart SLURM if it is already running. Any change to the plugin settings in SLURM requires a full restart of the daemons. If you only change a partition's Priority or Shared setting, the change can be applied with scontrol reconfig.
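
For reference, a minimal slurm.conf sketch consistent with the simple example below might look like the following (the SelectType choice, partition names, and node list are illustrative and should be adapted to your site):

SchedulerType=sched/gang
SelectType=select/linear
PartitionName=active Priority=1 Default=YES Shared=NO Nodes=n[12-16]
PartitionName=hipri  Priority=2             Shared=NO Nodes=n[12-16]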

Preemption Design and Operation

When enabled, the sched/gang plugin keeps track of the resources allocated to all jobs. For each partition an "active bitmap" is maintained that tracks all concurrently running jobs in the SLURM cluster. Each partition also maintains a job list for that partition, and a list of "shadow" jobs. The "shadow" jobs are job allocations from higher priority partitions that "cast shadows" on the active bitmaps of the lower priority partitions. Jobs in lower priority partitions that are caught in these "shadows" will be suspended.

Each time a new job is allocated resources in a partition and begins running, the sched/gang plugin adds a "shadow" of this job to all lower priority partitions. The active bitmaps of these lower priority partitions are then rebuilt, with the shadow jobs added first. Any existing jobs that were displaced by one or more "shadow" jobs are suspended (preempted). Conversely, when a high-priority running job completes, its "shadow" goes away and the active bitmaps of the lower priority partitions are rebuilt to see if any suspended jobs can be resumed.
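
To make the rebuild step concrete, here is a toy C sketch (this is not the actual sched/gang source; the struct, function name, and node count are invented for illustration and follow the five-node scenario from the example below, with node indices 0-4 standing in for n12-n16):

/* Toy illustration of the "shadow" idea: allocations from higher
 * priority partitions are laid down in the active bitmap first, and
 * any lower priority job whose nodes are now covered is suspended. */
#include <stdbool.h>
#include <stdio.h>

#define NODES 5

struct job {
    int  id;
    bool nodes[NODES];   /* which nodes the job is allocated */
    bool suspended;
};

/* Rebuild one partition's active bitmap: "shadows" are allocations from
 * higher priority partitions, "jobs" are this partition's own jobs.   */
static void rebuild_active_bitmap(const struct job *shadows, int nshadow,
                                  struct job *jobs, int njobs)
{
    bool active[NODES] = { false };

    /* 1. Shadow jobs are added to the bitmap first. */
    for (int s = 0; s < nshadow; s++)
        for (int n = 0; n < NODES; n++)
            if (shadows[s].nodes[n])
                active[n] = true;

    /* 2. Local jobs that collide with a shadow are suspended; the rest
     *    (re)occupy their nodes and keep running or are resumed.       */
    for (int j = 0; j < njobs; j++) {
        bool overlap = false;
        for (int n = 0; n < NODES; n++)
            if (jobs[j].nodes[n] && active[n])
                overlap = true;
        jobs[j].suspended = overlap;
        if (!overlap)
            for (int n = 0; n < NODES; n++)
                if (jobs[j].nodes[n])
                    active[n] = true;
        printf("job %d %s\n", jobs[j].id,
               jobs[j].suspended ? "suspended" : "running");
    }
}

int main(void)
{
    /* hipri job 490 on nodes 0-2 casts a shadow over the five
     * single-node jobs 485-489 in the "active" partition.      */
    struct job shadow = { 490, { true, true, true, false, false }, false };
    struct job jobs[] = {
        { 485, { true,  false, false, false, false }, false },
        { 486, { false, true,  false, false, false }, false },
        { 487, { false, false, true,  false, false }, false },
        { 488, { false, false, false, true,  false }, false },
        { 489, { false, false, false, false, true  }, false },
    };

    rebuild_active_bitmap(&shadow, 1, jobs, 5);
    return 0;
}

Run as written, the sketch reports jobs 485-487 as suspended and jobs 488-489 as running, which matches the squeue output in the example below.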

The gang scheduler plugin is designed to be reactive to the resource allocation decisions made by the "select" plugins. The "select" plugins have been enhanced to recognize when "sched/gang" has been configured, and to factor in the priority of each partition when selecting resources for a job. When choosing resources for each job, the selector avoids resources that are in use by other jobs (unless sharing has been configured, in which case it does some load-balancing). However, when "sched/gang" is enabled, the select plugins may choose resources that are already in use by jobs from partitions with a lower priority setting, even when sharing is disabled in those partitions.

This leaves the gang scheduler in charge of controlling which jobs should run on the overallocated resources. The sched/gang plugin suspends jobs via the same internal functions that support scontrol suspend and scontrol resume. A good way to observe the act of preemption is by running watch squeue in a terminal window.

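The same suspend and resume machinery can also be driven by hand, which is a convenient way to test a configuration (job 485 here is just one of the jobs from the example below, and scontrol suspend/resume typically require administrator privileges):

scontrol suspend 485
squeue              # job 485 now shows state "S" (suspended)
scontrol resume 485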

A Simple Example

The following example is configured with select/linear and sched/gang. This example takes place on a cluster of 5 nodes:

[user@n16 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
active*      up   infinite     5   idle n[12-16]
hipri        up   infinite     5   idle n[12-16]

Here are the Partition settings:

[user@n16 ~]$ grep PartitionName /shared/slurm/slurm.conf
PartitionName=active Priority=1 Default=YES Shared=NO Nodes=n[12-16]
PartitionName=hipri  Priority=2             Shared=NO Nodes=n[12-16]

The runit.pl script launches a simple load-generating app that runs for the given number of seconds. Submit 5 single-node runit.pl jobs to run on all nodes:

[user@n16 ~]$ sbatch -N1 ./runit.pl 300
sbatch: Submitted batch job 485
[user@n16 ~]$ sbatch -N1 ./runit.pl 300
sbatch: Submitted batch job 486
[user@n16 ~]$ sbatch -N1 ./runit.pl 300
sbatch: Submitted batch job 487
[user@n16 ~]$ sbatch -N1 ./runit.pl 300
sbatch: Submitted batch job 488
[user@n16 ~]$ sbatch -N1 ./runit.pl 300
sbatch: Submitted batch job 489
[user@n16 ~]$ squeue -Si
JOBID PARTITION     NAME   USER  ST   TIME  NODES NODELIST
  485    active runit.pl   user   R   0:06      1 n12
  486    active runit.pl   user   R   0:06      1 n13
  487    active runit.pl   user   R   0:05      1 n14
  488    active runit.pl   user   R   0:05      1 n15
  489    active runit.pl   user   R   0:04      1 n16

Now submit a short-running 3-node job to the hipri partition:

[user@n16 ~]$ sbatch -N3 -p hipri ./runit.pl 30
sbatch: Submitted batch job 490
[user@n16 ~]$ squeue -Si
JOBID PARTITION     NAME   USER  ST   TIME  NODES NODELIST
  485    active runit.pl   user   S   0:27      1 n12
  486    active runit.pl   user   S   0:27      1 n13
  487    active runit.pl   user   S   0:26      1 n14
  488    active runit.pl   user   R   0:29      1 n15
  489    active runit.pl   user   R   0:28      1 n16
  490     hipri runit.pl   user   R   0:03      3 n[12-14]

Job 490 in the hipri partition preempted jobs 485, 486, and 487 from the active partition. Jobs 488 and 489 in the active partition remained running.
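
While it is preempted, a job such as 485 stays resident in memory on its nodes, and scontrol show job 485 reports its state as SUSPENDED until the job is resumed.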

This state persisted until job 490 completed, at which point the preempted jobs were resumed:

[user@n16 ~]$ squeue
JOBID PARTITION     NAME   USER  ST   TIME  NODES NODELIST
  485    active runit.pl   user   R   0:30      1 n12
  486    active runit.pl   user   R   0:30      1 n13
  487    active runit.pl   user   R   0:29      1 n14
  488    active runit.pl   user   R   0:59      1 n15
  489    active runit.pl   user   R   0:58      1 n16

Future Ideas

More intelligence in the select plugins: This implementation of preemption relies on intelligent job placement by the select plugins. In SLURM 1.3.1 the select/linear plugin had a decent preemptive placement algorithm, but the consumable resource select/cons_res plugin had no preemptive placement support. Preemptive placement support was added to the select/cons_res plugin in SLURM 1.4, but there is still room for improvement.

Take the following example:

[user@n8 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
active*      up   infinite     5   idle n[1-5]
hipri        up   infinite     5   idle n[1-5]
[user@n8 ~]$ sbatch -N1 -n2 ./sleepme 60
sbatch: Submitted batch job 17
[user@n8 ~]$ sbatch -N1 -n2 ./sleepme 60
sbatch: Submitted batch job 18
[user@n8 ~]$ sbatch -N1 -n2 ./sleepme 60
sbatch: Submitted batch job 19
[user@n8 ~]$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
     17    active  sleepme  cholmes   R       0:03      1 n1
     18    active  sleepme  cholmes   R       0:03      1 n2
     19    active  sleepme  cholmes   R       0:02      1 n3
[user@n8 ~]$ sbatch -N3 -n6 -p hipri ./sleepme 20
sbatch: Submitted batch job 20
[user@n8 ~]$ squeue -Si
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
     17    active  sleepme  cholmes   S       0:16      1 n1
     18    active  sleepme  cholmes   S       0:16      1 n2
     19    active  sleepme  cholmes   S       0:15      1 n3
     20     hipri  sleepme  cholmes   R       0:03      3 n[1-3]
[user@n8 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
active*      up   infinite     3  alloc n[1-3]
active*      up   infinite     2   idle n[4-5]
hipri        up   infinite     3  alloc n[1-3]
hipri        up   infinite     2   idle n[4-5]

Ideally the "hipri" job would have been placed on nodes n[3-5], which would have allowed jobs 17 and 18 to continue running. However, a more intelligent placement algorithm would have to weigh factors such as job size and required nodes in order to support placements like this, which can quickly complicate the design. Any and all help is welcome here!

Preemptive backfill: the current backfill scheduler plugin ("sched/backfill") is a nice way to make efficient use of otherwise idle resources, but SLURM supports only one scheduler plugin at a time. Fortunately, given the design of the new "sched/gang" plugin, there is no direct overlap between the backfill functionality and the gang-scheduling functionality, so the two could probably be merged into a single scheduler plugin that supports both preemption and backfill. NOTE: this is only an idea based on a code review, so additional development and plenty of testing would likely be needed!

Requeue a preempted job: In some situations it may be desirable to requeue a low-priority job rather than suspend it. Suspending a job leaves it in memory; requeuing a job terminates it and resubmits it to the queue. The "sched/gang" plugin would need to be modified to recognize when a job can be requeued, to requeue jobs only when preempting (not when timeslicing!), and to issue the requeue request.

Last modified 5 December 2008
