Slurm is a system on the HPC that manages the cluster and schedules jobs. It allocates resources on the cluster and starts, executes, and monitors your code. You can find documentation on Slurm here and a quick start guide here.

The compute nodes, like chela01 and chela-g01, are split into partitions. We have two partitions on the HPC: guru contains chela01 through chela05, and mahaguru contains the GPU node, chela-g01.
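For example, you can list the nodes in each partition, along with their state, using sinfo (covered again at the end of this page):

sinfo --partition=guru,mahaguru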
To submit a job to the queue, use sbatch. We recommend the following format for your batch scripts:
#!/bin/bash
#SBATCH --partition=guru            #The partition we are using.
#SBATCH --nodes=2                   #The number of nodes to use.
#SBATCH --ntasks=4                  #The number of tasks to run.
#SBATCH --ntasks-per-node=2         #The number of tasks on a single node.
#SBATCH --mem-per-cpu=1gb           #The amount of memory to use per CPU.
#SBATCH --cpus-per-task=1           #The number of CPU cores per task.
#SBATCH --time=00:05:00             #The time limit of the job.
#SBATCH --job-name=Template         #The job name.
#SBATCH --output=slurm-%j.out       #The file for standard output (console logs).
#SBATCH --error=slurm_error-%j.out  #The file for the error log.
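The #SBATCH directives only describe the resources you are requesting; below them go the commands you actually want to run. As a minimal sketch (the script name is just a placeholder), the template above could end with:

srun ./myScript.sh   #launches the script once per allocated task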
If you want to run a job on the GPU node, use the following options instead:
#SBATCH --partition=mahaguru        #The partition containing the GPU node.
#SBATCH --gres=gpu:4                #The number of GPUs to use.
#SBATCH --nodes=1                   #The number of nodes to use.
#SBATCH --ntasks=4                  #The number of tasks to run.
#SBATCH --ntasks-per-node=4         #The number of tasks on a single node.
#SBATCH --mem-per-cpu=1gb           #The amount of memory to use per CPU.
#SBATCH --cpus-per-task=1           #The number of CPU cores per task.
#SBATCH --time=00:05:00             #The time limit of the job.
#SBATCH --job-name=Template         #The job name.
#SBATCH --output=slurm-%j.out       #The file for standard output (console logs).
#SBATCH --error=slurm_error-%j.out  #The file for the error log.
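As with the CPU template, the commands to run follow the directives. Assuming the GPU node uses NVIDIA GPUs with nvidia-smi installed, a quick way to confirm which GPUs were allocated is:

srun --ntasks=1 nvidia-smi   #run once; prints the GPUs visible to the job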
You can also pass additional options to sbatch on the command line, like --cpus-per-gpu=2 and --job-name=test:

sbatch --cpus-per-gpu=2 --job-name=test myScript.sh
Submitted batch job 48250

A plain submission without extra options looks like this:

sbatch blastScript.sh

You can find more information on sbatch here.
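After a job is submitted, you can check its state using the job ID that sbatch prints, for example:

squeue -j 48250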
Finally, you can run an interactive Slurm session with salloc (see the example after the list below). Some other useful commands:

sinfo: lists the partitions on the system, along with their state and the nodes they contain.
squeue: shows information on all the queued jobs along with their state.
scancel <jobId>: stops your job. Example: scancel 48250
srun <script>: runs a parallel job in Slurm. It uses the resource allocation of the environment it is run in, so it is recommended that you use it inside a Slurm script to run tasks. Examples: srun myScript.sh, srun ls
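For example, a short interactive session on the guru partition could look like the following sketch; the resource values and the hostname command are only placeholders:

salloc --partition=guru --nodes=1 --ntasks=1 --time=00:30:00   #request an allocation and start an interactive shell
srun hostname                                                  #commands launched with srun run inside the allocation
exit                                                           #leave the shell and release the allocation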