Basics of Running Jobs¶
REPACSS utilizes the Slurm Workload Manager for resource allocation and job scheduling across its high-performance computing infrastructure. Slurm manages the assignment of compute resources, oversees execution and monitoring, and enables the scheduling of tasks for future execution.
Job Definition¶
A job is defined as an allocation of compute resources granted to a user for a specified duration. Jobs may be executed interactively or as batch processes (via job scripts) and can be scheduled to run at a future time.
Note
REPACSS provides sample job submission scripts and templates for common workloads.
Upon accessing REPACSS, users arrive at a login node, which is intended for job preparation activities such as file editing or code compilation. All computational jobs must be submitted to compute nodes using Slurm commands.
The platform supports a wide range of workflows including interactive sessions, serial executions, GPU-based applications, and large-scale parallel computations.
Submitting Jobs¶
sbatch¶
Submit a batch script:
$ sbatch my_job.sh
Submitted batch job 864933
Job scripts should include #SBATCH directives and one or more srun commands.
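A minimal sketch of this structure is shown below; the zen4 partition is taken from the examples later on this page, so adjust the partition and resources to your allocation:
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --partition=zen4

srun hostname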
interactive¶
Request an interactive session using the recommended wrapper script:
$ interactive -c 8 -p zen4
Tip
The interactive command wraps around Slurm's allocation mechanisms and handles additional environment setup required for compute node access. Avoid using salloc directly, especially in environments like Visual Studio Code, as it may fail to configure session parameters properly.
An interactive shell will be initiated on a compute node once resources are allocated.
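Once the shell starts, a few quick checks confirm the session is on a compute node; these are standard Linux and Slurm utilities, shown only as a sketch:
$ hostname              # should report a compute node, not a login node
$ echo $SLURM_JOB_ID    # job ID of the interactive allocation
$ nproc                 # CPU cores visible to the session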
srun¶
Execute a job step in real-time:
$ srun -n 4 ./program
May be used within job scripts or interactive sessions.
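For example, several job steps can run one after another inside a single batch allocation; the program names below are placeholders:
srun -n 4 ./preprocess    # first job step
srun -n 4 ./solve         # second job step in the same allocation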
Commonly Used Options¶
| Option (long) | Short | Description | sbatch | srun |
|---|---|---|---|---|
| --time | -t | Maximum wall clock time | ✅ | ❌ |
| --nodes | -N | Number of nodes | ✅ | ✅ |
| --ntasks | -n | Number of parallel tasks (e.g., MPI) | ✅ | ✅ |
| --cpus-per-task | -c | CPU cores allocated per task | ✅ | ✅ |
| --gpus | -G | Number of GPUs requested | ✅ | ✅ |
| --constraint | -C | Specific hardware or node type constraint | ✅ | ❌ |
| --qos | -q | Quality of Service tier | ✅ | ❌ |
| --account | -A | Project account for usage tracking | ✅ | ❌ |
| --job-name | -J | Name assigned to the job | ✅ | ❌ |
Tip
It is advisable to use long-form flags (e.g., --nodes=2) in scripts for clarity and maintainability.
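For instance, the two directives below request the same thing, but the long form is self-documenting:
#SBATCH -N 2
#SBATCH --nodes=2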
Writing a Job Script¶
Sample job script:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --time=01:00:00
#SBATCH --partition=h100
#SBATCH --account=mXXXX
module load gcc
srun -n 4 ./a.out
Slurm options may also be specified at the command line:
sbatch -N 2 -p h100 ./job.sh
Option Inheritance and Overriding¶
Slurm options may be declared within the job script via #SBATCH or specified directly on the command line. If both are present, command-line options take precedence.
Environment variables such as SLURM_JOB_NUM_NODES are automatically populated and propagated to srun. Avoid redundant overrides within the script unless necessary. Behavior may vary when using non-Slurm launch mechanisms.
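As a sketch of this precedence rule (job.sh and the node counts are illustrative):
# job.sh contains "#SBATCH --nodes=2", but the command line wins:
$ sbatch --nodes=4 job.sh

# inside the job, srun inherits the allocation through variables
# such as SLURM_JOB_NUM_NODES, so the node count need not be repeated:
srun ./a.out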
Sample Python Job Script¶
#!/bin/bash
#SBATCH --job-name=python_job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=01:00:00
#SBATCH --partition=zen4
module load gcc
source ~/miniforge3/etc/profile.d/conda.sh
conda activate myenv
python script.py
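Submission then follows the usual pattern; the script above is assumed to be saved as python_job.sh:
$ sbatch python_job.sh
$ squeue -u $USER    # confirm the job is pending or running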
Submitting GPU Jobs¶
Warning
We are currently in the process of installing the CUDA module on our systems. In the meantime, users are advised to install CUDA via their conda environment to ensure compatibility with GPU workflows.
For detailed usage examples, please refer to the Job Examples section.
To request GPU resources, use the --gres flag:
#SBATCH --gres=gpu:nvidia_h100_nvl:1
Once the CUDA module is available, load it before running GPU code:
module load cuda
Failure to request GPUs may result in errors such as:
No CUDA-capable device is detected
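Putting these pieces together, a GPU job script might look like the following sketch. The environment name mygpuenv and the script train.py are placeholders, and CUDA is assumed to come from the conda environment as described in the warning above:
#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:nvidia_h100_nvl:1
#SBATCH --partition=h100
#SBATCH --time=01:00:00

# CUDA toolkit provided via the conda environment (see warning above)
source ~/miniforge3/etc/profile.d/conda.sh
conda activate mygpuenv

srun python train.py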
Job Monitoring¶
Check job queue status:
squeue -u $USER
Estimate job start time:
squeue --start -j <job_id>
Review completed job statistics:
sacct -j <job_id>
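For more detail, sacct accepts a format string selecting standard accounting fields, for example:
sacct -j <job_id> --format=JobID,JobName,Partition,Elapsed,State,ExitCode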
Troubleshooting¶
- Verify group disk quotas: df -h /mnt/$(id -gn)
- Ensure the job script includes the required Slurm options
- Confirm partition and hardware constraints match available resources
- Load relevant modules before execution
- Inspect job output files (.out, .err) for detailed error messages; see the example below for naming these files explicitly
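By default, batch output is written to slurm-<jobid>.out in the submission directory. Naming the files explicitly makes them easier to find; %x and %j are standard Slurm filename placeholders for the job name and job ID:
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err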