Monitoring Jobs
Note
Avoid running multiple instances of watch squeue or watch sqs. This can overload the scheduler, which is a shared system resource. If you must use watch, use watch -n 60 and stop the process when finished.
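As an alternative to leaving watch running, a small polling loop can enforce the 60-second floor and exit on its own once the queue is empty. This is a minimal sketch, not a site-provided tool: safe_interval and wait_for_jobs are hypothetical helper names, and the loop assumes squeue accepts -h (no header) alongside --me.

```shell
# Hypothetical helper: clamp a requested polling interval (seconds) to a
# minimum of 60 so the scheduler is not queried too often.
safe_interval() {
    requested="${1:-60}"
    if [ "$requested" -lt 60 ]; then
        echo 60
    else
        echo "$requested"
    fi
}

# Hypothetical helper: print the queue at the clamped interval and stop
# automatically once squeue reports no jobs for the current user.
wait_for_jobs() {
    interval="$(safe_interval "$1")"
    while [ -n "$(squeue --me -h)" ]; do
        squeue --me
        sleep "$interval"
    done
}

# On a Slurm system (not run here):
#   wait_for_jobs 120
```

Because the loop ends by itself, there is no forgotten watch process left polling the scheduler.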
Using squeue
The squeue command provides real-time job queue information directly from the Slurm scheduler. It is helpful for checking the current state of jobs, such as PENDING, RUNNING, or COMPLETED.
squeue --me # Shows your jobs
squeue -u $USER # Equivalent to --me
squeue --me -t R # Only running jobs
squeue --me -t PD # Only pending jobs
squeue -j 1234,1235 # Filter by job IDs
To show jobs by account:
squeue -A your_project_name
To view job steps:
squeue --steps 1001.0
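For a compact summary rather than a full listing, the state column can be tallied. The sketch below assumes squeue's -h (no header) and -o "%T" (state name only) options; count_states is a hypothetical helper name and not part of Slurm.

```shell
# Hypothetical helper: tally state names read from stdin, one per line,
# and print "STATE COUNT" pairs sorted by state name.
count_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# On a Slurm system (not run here):
#   squeue --me -h -o "%T" | count_states
```

This issues a single squeue call, so it is gentler on the scheduler than repeatedly running one filtered query per state.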
Using sacct
The sacct command retrieves accounting information about active and completed jobs.
Basic usage:
sacct
Customize output by specifying fields:
sacct --format=JobID,JobName,State,Start,Elapsed
Filter jobs by date:
sacct -S 2024-06-01 -E 2024-06-13
Display only failed jobs:
sacct -X --format=User,JobName,State -s F --start=2024-06-01 --end=now
Filter by specific job IDs:
sacct -j 123456,123457
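When sacct output feeds a script, the pipe-delimited form is easier to parse than the default columns. The sketch below assumes sacct's -n (no header) and -P (parsable, "|"-separated) options. The helper name not_completed_jobs is hypothetical, and the three-field layout is simply what the --format list in the usage comment requests.

```shell
# Hypothetical helper: read "JobID|State|Elapsed" lines on stdin and print
# the JobID of every job whose state is anything other than COMPLETED
# (this includes FAILED and TIMEOUT, but also RUNNING and PENDING).
not_completed_jobs() {
    awk -F'|' '$2 != "COMPLETED" { print $1 }'
}

# On a Slurm system (not run here):
#   sacct -X -n -P --format=JobID,State,Elapsed -S 2024-06-01 | not_completed_jobs
```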
Using sstat
Use sstat to report resource usage for jobs that are currently running:
sstat -j 123456 -o JobID,MaxRSS
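MaxRSS values are often easier to compare once converted to a single unit. The following is a rough sketch under stated assumptions: that the value carries a K, M, or G suffix using 1024-based units, and that a bare number is a byte count. The helper name to_mib is hypothetical, and the exact output format may differ on your system.

```shell
# Hypothetical helper: convert a memory value such as 2048K, 512M, or 3G
# to MiB. Assumes 1024-based units; a value with no suffix is treated as bytes.
to_mib() {
    awk -v v="$1" 'BEGIN {
        n = v + 0                      # numeric prefix of the value
        u = substr(v, length(v), 1)    # last character (unit suffix, if any)
        if (u == "K")      n = n / 1024
        else if (u == "M") n = n
        else if (u == "G") n = n * 1024
        else               n = n / 1048576
        printf "%.1f\n", n
    }'
}

# On a Slurm system (not run here):
#   to_mib "$(sstat -j 123456 -n -P -o MaxRSS | head -n 1)"
```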
Using jobstats
jobstats is a Python-based reporting tool that summarizes job activity using data from sacct, squeue, and sreport.
module load python
jobstats
Example usage:
jobstats --user bsencer --start 2025-06-01 --end 2025-06-13
To display all options:
jobstats --help
Email Notifications
To receive notifications when your job begins, ends, or fails, add the following directives to your Slurm job script:
#SBATCH --mail-type=begin,end,fail
#SBATCH --mail-user=your@email.com
Modifying or Canceling Jobs
To cancel a job:
scancel 123456
To cancel multiple jobs:
scancel 123456 123457
To cancel all jobs submitted by your user:
scancel -u $USER
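Canceling every job is a blunt instrument, so it can help to preview a narrower cancellation first. The sketch below builds, but does not run, a scancel command for jobs whose name starts with a given prefix. It assumes squeue's -o "%i %j" format (job ID, then job name). The helper name cancel_cmd_for and the test_ prefix are hypothetical.

```shell
# Hypothetical helper: read "JOBID NAME" lines on stdin and print a single
# scancel command covering the jobs whose name begins with the prefix in $1.
# Nothing is canceled; the command is only printed for review.
cancel_cmd_for() {
    awk -v p="$1" '
        index($2, p) == 1 { ids = ids " " $1 }
        END { if (ids != "") print "scancel" ids }
    '
}

# On a Slurm system (not run here):
#   squeue --me -h -o "%i %j" | cancel_cmd_for test_
```

Once the printed command looks right, it can be copied and run as is.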
To update a job’s time limit (regular users can usually only lower it; raising it typically requires an administrator):
scontrol update jobid=123456 timelimit=02:00:00
Holding, Releasing, and Requeuing Jobs
Place a job on hold (prevent scheduling):
scontrol hold 123456
Release a held job:
scontrol release 123456
Requeue a job (e.g., after failure or timeout):
scontrol requeue 123456