Study Notes: Slurm

Slurm Workload Manager - Overview

  • https://slurm.schedmd.com/overview.html
  • Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. Optional plugins can be used for accounting, advanced reservation, gang scheduling (time sharing for parallel jobs), backfill scheduling, topology optimized resource selection, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.

Slurm Workload Manager - Quick Start User Guide

Slurm Workload Manager - Wikipedia


Slurm Workload Manager - sacct
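
As a small illustration of sacct, the sketch below wraps a query for the accounting record of one job. The function name job_summary and the chosen columns are assumptions made for this note, and sacct only returns data when job accounting is enabled on the cluster.

```shell
#!/bin/sh
# job_summary is a hypothetical helper name, not part of Slurm.
# It prints the accounting record of one job with a few common columns.
job_summary() {
    # $1 = job ID as reported at submission time
    sacct -j "$1" --format=JobID,JobName,State,Elapsed,ExitCode
}
```

Usage: job_summary 12345, where 12345 stands for a real job ID.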

sbatch - Submit a batch script to Slurm

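#!/bin/sh
# =============================================================================
# mytestscript.sh
# =============================================================================
# Print the current date and time (started as a background process).
date &

#!/bin/sh
# =============================================================================
# mytestsbatch.sh (sbatch requires the first line to be the "#!" line)
# =============================================================================
# Request 2 nodes and 10 tasks in total.
#SBATCH -N 2
#SBATCH -n 10

# First job step: 10 tasks, output collected in testscript1.log.
srun -n10 -o testscript1.log mytestscript.sh
# Second job step, started 10 seconds after the first one finishes.
sleep 10; srun -n10 -o testscript2.log mytestscript.sh
# Wait for any background processes before the batch script exits.
wait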
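
To submit the batch script above and keep its job ID for later commands, something like the following sketch may help. The helper name submit_job is an assumption for illustration; it relies on the --parsable option of sbatch, which prints only the job ID, optionally followed by ";" and the cluster name.

```shell
#!/bin/sh
# submit_job is a hypothetical helper name, not part of Slurm.
# It submits a batch script and prints only the numeric job ID.
submit_job() {
    # $1 = path to the batch script, e.g. mytestsbatch.sh
    out=$(sbatch --parsable "$1") || return 1
    # --parsable prints "jobid" or "jobid;cluster"; drop the cluster part
    echo "${out%%;*}"
}
```

Usage: jobid=$(submit_job mytestsbatch.sh), after which "$jobid" can be passed to squeue, scontrol, or scancel.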

scancel - Used to signal jobs or job steps that are under the control of Slurm.

scontrol - view or modify Slurm configuration and state.

squeue - view information about jobs located in the Slurm scheduling queue.
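
The three commands above are often used together on a single job. The sketch below shows one way to wrap them; the function names are assumptions chosen for this note, and the job ID is whatever sbatch reported at submission.

```shell
#!/bin/sh
# All function names below are hypothetical helpers, not part of Slurm.

# Print the current state of a job (e.g. PENDING or RUNNING).
# -h suppresses the header line, -o %T selects the state column.
job_state() {
    squeue -h -j "$1" -o %T
}

# Print the full description of a job as known to the controller.
job_details() {
    scontrol show job "$1"
}

# Cancel a job that is pending or running.
cancel_job() {
    scancel "$1"
}
```

Usage: job_state 12345 to poll a job, job_details 12345 to inspect its settings, cancel_job 12345 to remove it from the queue.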

srun - Run parallel jobs

  • https://slurm.schedmd.com/srun.html
  • $ cat testscript.sh
  • #!/bin/sh
  • python mytest.py --arg test
  • $ chmod +x testscript.sh
  • $ srun -N5 -n100 testscript.sh
    • Run it on 5 nodes with 100 tasks
  • $ srun -n5 --nodelist=host1,host2 -o testscript.log testscript.sh
  • $ srun -n10 -o testscript.log --begin=now+2hour testscript.sh
  • $ srun --begin=now+10 date &

Convenient SLURM Commands | FAS Research Computing


srun: error: --begin is ignored because nodes are already allocated.
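
This message appears to be printed when srun --begin is used inside an existing allocation, for example within an sbatch script or an salloc session, where srun only creates a job step on nodes that are already allocated. A possible workaround, sketched below under that assumption, is to request the delayed start on a new batch job instead. The helper names in_allocation and run_later are hypothetical, and the check relies on the SLURM_JOB_ID environment variable that Slurm sets inside a job.

```shell
#!/bin/sh
# in_allocation and run_later are hypothetical helper names.

# True when the shell is running inside a Slurm job allocation.
in_allocation() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

# Run a script at a later time.
# $1 = begin specification such as now+2hour, remaining args = script
run_later() {
    begin=$1
    shift
    if in_allocation; then
        # srun would ignore --begin here, so queue a separate batch job
        # (the script must be a valid batch script starting with "#!").
        sbatch --begin="$begin" "$@"
    else
        # No allocation yet: srun requests one that starts at the given time.
        srun --begin="$begin" "$@"
    fi
}
```

Usage: run_later now+2hour testscript.sh.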

srun: error: Unable to create job step: More processors requested than permitted
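
One common cause of this error is a job step that asks for more tasks than the allocation holds, for example srun -n20 inside a job submitted with -n 10. The sketch below caps the requested task count at SLURM_NTASKS, which Slurm sets when a task count was given for the job. The helper name step_tasks is an assumption for this note, and other causes (such as CPUs per task) are not covered.

```shell
#!/bin/sh
# step_tasks is a hypothetical helper name, not part of Slurm.
# It prints the requested task count, capped at the size of the allocation.
step_tasks() {
    want=$1
    # Outside a job SLURM_NTASKS is unset, so no cap is applied.
    have=${SLURM_NTASKS:-$want}
    if [ "$want" -gt "$have" ]; then
        echo "$have"
    else
        echo "$want"
    fi
}
```

Usage: srun -n "$(step_tasks 100)" testscript.sh.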

Original source: https://www.cnblogs.com/pegasus923/p/11511332.html