Slurm is a system on the HPC that manages the cluster and schedules jobs. It allocates resources on the cluster and provides a framework for starting, executing, and monitoring your code. You can find documentation on Slurm here
and a quick start guide here. The compute nodes, like chela01 and chela-g01,
are split into partitions. We have two partitions on the HPC: guru contains chela01 through chela05, and mahaguru contains the GPU node, chela-g01.
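You can see this layout for yourself by running sinfo:
sinfo -s    # One summary line per partition: its availability, node counts, and node list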
To submit a job to the job queue, use sbatch. We recommend you use the following format for your batch scripts.
#!/bin/bash
#SBATCH --partition=guru              # The partition to submit to.
#SBATCH --nodes=2                     # The number of nodes to use.
#SBATCH --ntasks=4                    # The total number of tasks to run.
#SBATCH --ntasks-per-node=2           # The number of tasks on a single node.
#SBATCH --mem-per-cpu=1gb             # The amount of memory per CPU.
#SBATCH --cpus-per-task=1             # The number of CPU cores per task.
#SBATCH --time=00:05:00               # The time limit of the job (HH:MM:SS).
#SBATCH --job-name=Template           # The job name.
#SBATCH --output=slurm-%j.out         # File for standard output (%j expands to the job ID).
#SBATCH --error=slurm_error-%j.out    # File for the error log.
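The #SBATCH lines only request resources; the commands you actually want to run go below them in the same file. A minimal sketch of a script body (myProgram is a hypothetical executable in your submission directory):
echo "Running on nodes: $SLURM_JOB_NODELIST"   # The nodes Slurm allocated to this job
srun ./myProgram                               # srun starts one copy per task (4 with the settings above)
Save the directives and commands together as, say, template.sh and submit the whole file with sbatch template.sh.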
Next, if you want to run a job on the GPU node, make sure you use the following options.
#SBATCH --partition=mahaguru          # The partition containing the GPU node.
#SBATCH --gres=gpu:4                  # The number of GPUs to use.
#SBATCH --nodes=1                     # The number of nodes to use.
#SBATCH --ntasks=4                    # The total number of tasks to run.
#SBATCH --ntasks-per-node=4           # The number of tasks on a single node.
#SBATCH --mem-per-cpu=1gb             # The amount of memory per CPU.
#SBATCH --cpus-per-task=1             # The number of CPU cores per task.
#SBATCH --time=00:05:00               # The time limit of the job (HH:MM:SS).
#SBATCH --job-name=Template           # The job name.
#SBATCH --output=slurm-%j.out         # File for standard output (%j expands to the job ID).
#SBATCH --error=slurm_error-%j.out    # File for the error log.
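As with the CPU script, the job's commands go below the directives. A short sketch that just confirms the allocated GPUs are visible (myGpuProgram is a hypothetical placeholder, and nvidia-smi assumes the NVIDIA driver tools are installed on chela-g01):
nvidia-smi                     # Lists the GPUs visible to the job
srun ./myGpuProgram            # Hypothetical placeholder for your GPU workload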
You can also pass additional options to sbatch on the command line, like --cpus-per-gpu=2 and --job-name=test:
sbatch --cpus-per-gpu=2 --job-name=test myScript.sh
Submitted batch job 48250
Another example: sbatch blastScript.sh. You can read more about sbatch here.
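After submitting, you can check on the job using the ID that sbatch printed (48250 in the example output above):
squeue -j 48250    # Shows the state of that specific job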
Finally, you can run an interactive Slurm session with salloc. Some other useful commands:
sinfo gives a list of the partitions on the system, along with their state and the nodes they contain.
squeue gives info on all the queued jobs along with their state.
scancel <jobId> stops your job. Example: scancel 48250
srun <script> runs a parallel job in Slurm. It uses the allocation of the environment it is run in, so it is recommended that you use it inside a Slurm batch script to run tasks. Examples: srun myScript.sh, srun ls
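For example, a short interactive session on the guru partition might look like the following sketch (the resource values are just placeholders):
salloc --partition=guru --ntasks=1 --time=00:30:00    # Request an interactive allocation
srun hostname                                         # Run a command inside the allocation
exit                                                  # Release the allocation when done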