What is Slurm, and how do I write and submit a Slurm job?

Slurm is an open-source resource manager (batch queue) and job scheduler designed for Linux clusters of all sizes. HPC@UCD uses Slurm to schedule and run jobs on its clusters.

The general idea with a batch queue is that you don't have to babysit your jobs. You submit a job, and it runs until it finishes or fails. You can configure Slurm to email you when that happens. Because the scheduler starts queued jobs as soon as resources become free, the cluster stays busy without anyone waiting at a terminal.

When HPC@UCD users log in to a cluster, they land on a head (login) node. All HPC users are encouraged to submit jobs to the compute nodes through the Slurm scheduler and not to run code directly on the head node.

You can write a job script using any text editor and then submit it with the sbatch command or run it with the srun command.

Please visit the HPC@UCD documentation site for details on writing sbatch scripts:
https://docs.hpc.ucdavis.edu/scheduler/commands/
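
As a rough illustration of what such a job script can look like, here is a minimal sketch. The job name, resource values, output file name, and email address are placeholders rather than HPC@UCD defaults, and your cluster may also require --partition or --account directives.

```shell
#!/bin/bash
# Minimal job script sketch; every value below is a placeholder to adjust.
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --mem=1G
# %j in the file name expands to the job ID
#SBATCH --output=hello_%j.out
# Email when the job ends or fails (hypothetical address)
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=you@example.edu

# SLURM_JOB_ID is set by Slurm inside a job; fall back when run by hand
job_id="${SLURM_JOB_ID:-none}"
echo "Job ${job_id} running on $(hostname)"
```

Saved under a name such as hello.sh (hypothetical), the script would be submitted with "sbatch hello.sh". The #SBATCH lines are ordinary comments to bash, so the script can also be run directly for a quick check.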

How do I submit a job to a GPU node via Slurm?

If your default Slurm association (account) does not have access to a GPU partition, you can request a GPU through another account that does by adding these directives to your job script:

#SBATCH --account=Alternate_Account_Name
#SBATCH --nodes=1
#SBATCH --gres=gpu:1

Or when you submit your job using srun, add these flags:

srun -A Alternate_Account_Name --gres=gpu:1 -t 01:00:00 --mem=20GB batch.sh

Read more about efficient use of the Slurm scheduler and resources in our documentation:
https://docs.hpc.ucdavis.edu/scheduler/resources/
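
Putting the GPU directives into a complete job script might look like the sketch below. Alternate_Account_Name is the placeholder used above; the job name, time, and memory values are assumptions, and a GPU partition name (not shown, since it is site-specific) may also be needed via --partition.

```shell
#!/bin/bash
# GPU job sketch; the account name and resource values are placeholders.
#SBATCH --job-name=gpu-test
#SBATCH --account=Alternate_Account_Name
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
#SBATCH --mem=20G

# Slurm typically exports CUDA_VISIBLE_DEVICES for GPUs granted via --gres;
# fall back to "none" when the variable is not set.
gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "Visible GPUs: ${gpus}"

# Show GPU details only if the NVIDIA tool is installed on this node
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
fi
```

If the output reports "none" inside a job, the GPU request was probably not granted as expected, which is worth checking before starting a long run.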

How do I track and see the status of my Slurm jobs?

Here are some useful Slurm commands with their purpose:

sinfo reports the state of partitions and nodes managed by Slurm. It has a wide variety of filtering, sorting, and formatting options.

sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.

You can check on your own jobs, both running and pending, using either of these commands:
squeue --me
squeue -u <userid>

To list the jobs submitted under a specific account (group):
squeue -A <groupName>

To list the jobs in a specific partition or on a specific node:
squeue -p <PartitionID>
squeue -w <nodeID>
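
The filtering and formatting options mentioned above can be combined. The following is a small sketch, not a recommended standard layout: the column widths in the format string are arbitrary choices, and the check for squeue is only there so the snippet fails gracefully off-cluster.

```shell
# Columns: job ID, partition, job name, state, elapsed time, node count,
# and the node list (or the reason a job is still pending)
fmt="%.18i %.9P %.30j %.8T %.10M %.6D %R"

if command -v squeue >/dev/null 2>&1; then
    # Your jobs in a custom column layout
    squeue --me --format="$fmt"
    # Only your pending jobs, with the scheduler's estimated start times
    squeue --me --states=PENDING --start
else
    echo "squeue not found; run this on a cluster login node" >&2
fi
```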

sacct retrieves and displays accounting data for jobs, such as job status, history, and resource consumption, in a customizable output format. For example, the following command extracts information about your jobs over a given time period:

sacct --starttime=2022-01-01 --endtime=2022-02-01 --format="user,account%15,jobid%15,nodelist%15,state%20,jobname%20,partition,start,elapsed,TotalCPU" | less

The command above lists the jobs that you ran between January 1 and February 1, 2022.
To add more fields to the --format option, use the following command to list every field that sacct supports:

sacct -e
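
To inspect a single job rather than a date range, a sketch along these lines may help. The job ID 12345 is hypothetical, and the field selection is just one reasonable choice from the list that "sacct -e" prints.

```shell
# Job ID to inspect: first script argument, or a hypothetical default
jobid="${1:-12345}"
# Fields useful for a rough efficiency check
fields="JobID,JobName%20,State,Elapsed,TotalCPU,MaxRSS,ReqMem,ExitCode"

if command -v sacct >/dev/null 2>&1; then
    sacct -j "$jobid" --format="$fields"
else
    echo "sacct not found; run this on a cluster login node" >&2
fi
```

Comparing TotalCPU with Elapsed, and MaxRSS with ReqMem, gives a rough sense of whether the job used what it requested. MaxRSS is usually reported on the job step lines rather than on the top-level job line.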

srun is used to submit a job for execution or to initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including minimum and maximum node count, processor count, specific nodes to use or avoid, and specific node characteristics (such as the amount of memory, disk space, or required features).

A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job's node allocation.
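
The idea of job steps can be sketched as a batch script that launches two steps side by side and then a third one afterwards. This is an illustration under assumptions: the job name and resource values are placeholders, the step helper is hypothetical, and whether the first two steps truly overlap depends on the Slurm version and site configuration (recent releases may need srun's --exact option for that).

```shell
#!/bin/bash
#SBATCH --job-name=steps
#SBATCH --ntasks=2
#SBATCH --time=00:10:00

# Inside a Slurm job, launch each command as a job step with srun;
# elsewhere (for example a quick local test), run it directly.
if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
    step() { srun --ntasks=1 "$@"; }
else
    step() { "$@"; }
fi

# Two steps started in the background so they can run in parallel
step echo "step A" &
step echo "step B" &
# Wait for both background steps before continuing
wait
# A third step that runs only after the first two have finished
step echo "step C"
```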

scancel is used to stop a job early, for example when you queued the wrong script or realize it is going to fail because you forgot something.
See more in "Monitoring Jobs" in the Slurm Example Scripts article in the Help Documents.
More in-depth information at http://slurm.schedmd.com/documentation.html
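
A few common scancel invocations are sketched below. Because cancelling is destructive, the sketch prints each command instead of running it unless DRY_RUN=0 is set; the run helper, the DRY_RUN variable, the job ID 12345, and the job name my_job are all hypothetical.

```shell
# Hypothetical helper: print the command unless DRY_RUN=0 is set
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Current user name, with a fallback if USER is not set
me="${USER:-$(id -un)}"

# Cancel one job by its ID
run scancel 12345
# Cancel all of your jobs that are still waiting in the queue
run scancel --user="$me" --state=PENDING
# Cancel your jobs that share a given job name
run scancel --user="$me" --name=my_job
```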

How do I check my Slurm resources?

Users can see their Slurm associations using commands like:
sacctmgr show associations where user=$USER format=account,partition,qos

Users can see the resource limits (such as CPUs and memory) assigned to their group's QOS. Replace GROUPID with your group name:
sacctmgr show qos format=name%-40,priority,usagefactor,grptres%40 | grep -E "GROUPID|UsageFactor"
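
When the association list is needed in a script, sacctmgr's parsable output can be easier to handle than the padded table. The sketch below assumes the -n (no header) and -P (pipe-delimited) options and uses a hypothetical helper named print_assoc; an empty partition column is shown as "-".

```shell
# Hypothetical helper: turn "account|partition|qos" lines into labeled output
print_assoc() {
    while IFS='|' read -r account partition qos; do
        printf 'account=%s partition=%s qos=%s\n' \
            "$account" "${partition:--}" "$qos"
    done
}

if command -v sacctmgr >/dev/null 2>&1; then
    sacctmgr -nP show associations where user="$USER" \
        format=account,partition,qos | print_assoc
else
    echo "sacctmgr not found; run this on a cluster login node" >&2
fi
```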

How do I see Slurm partitions and nodes?

Users can see the available nodes on available partitions using the following command:

sinfo

This command lists all the available partitions, their state, and relevant nodes. If users want to see details of each partition or each node, they can use these commands:

scontrol show partition <partition-name>

scontrol show node <node-name>
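
For a quicker overview than scontrol gives, sinfo's own formatting options can help. The following is a sketch with an arbitrary choice of columns; the check for sinfo is only there so the snippet fails gracefully off-cluster.

```shell
# Per-node columns: hostname, CPU count, memory (MB), generic resources
# such as GPUs, and node state
node_fmt="%n %c %m %G %t"

if command -v sinfo >/dev/null 2>&1; then
    # One summary line per partition
    sinfo --summarize
    # One line per node with the columns chosen above
    sinfo --Node --format="$node_fmt"
    # Only nodes that are currently idle
    sinfo --states=idle
else
    echo "sinfo not found; run this on a cluster login node" >&2
fi
```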