
Monitoring Jobs

Note

Avoid running multiple instances of watch squeue or watch sqs. By default, watch reruns its command every 2 seconds, and each run queries the scheduler, which is a shared system resource. If you must use watch, set an interval of at least 60 seconds (watch -n 60) and stop the process when you are finished.
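
If you only need to know when a job has left the queue, a polling loop that stops by itself is an alternative to watch. The following is a minimal sketch: the function name wait_for_job and its 60-second default interval are illustrative assumptions, not a tool provided on the system.

```shell
#!/bin/bash
# wait_for_job JOBID [INTERVAL_SECONDS]
# Hypothetical helper: polls squeue until the job no longer appears in the queue.
# Caveat: an empty reply is treated as "job gone", so a transient squeue
# failure would also end the loop. Confirm the final state with sacct.
wait_for_job() {
    local jobid="$1"
    local interval="${2:-60}"   # keep this at 60 s or more on a shared scheduler
    local state
    while true; do
        # -h drops the header; -o %T prints the job state in long form (e.g. RUNNING)
        state="$(squeue -h -j "$jobid" -o %T 2>/dev/null | head -n 1)"
        if [ -z "$state" ]; then
            echo "Job $jobid is no longer in the queue"
            return 0
        fi
        echo "Job $jobid: $state"
        sleep "$interval"
    done
}
```

After sourcing the function, wait_for_job 123456 checks once per minute and returns when the job has finished or been cancelled.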


Using squeue

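The squeue command provides real-time job queue information directly from the Slurm scheduler. It is helpful for checking the current state of jobs, such as PENDING or RUNNING. Finished jobs (for example, COMPLETED) drop out of squeue output shortly after they end; use sacct to look them up afterwards.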

squeue --me          # Shows your jobs
squeue -u $USER      # Equivalent to --me
squeue --me -t R     # Only running jobs
squeue --me -t PD    # Only pending jobs
squeue -j 1234,1235  # Filter by job IDs

To show jobs by account:

squeue -A your_project_name

To view job steps:

squeue --steps 1001.0
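
The filters above can be combined with -h (no header) and -o (output format) to produce output that is easy to use in scripts. The sketch below counts your queued jobs per state with a single squeue call. The function name count_my_jobs_by_state is a hypothetical helper chosen for illustration.

```shell
#!/bin/bash
# Hypothetical helper: count your queued jobs per state using one squeue call.
count_my_jobs_by_state() {
    # -h: no header; -o %T: job state in long form (PENDING, RUNNING, ...)
    squeue --me -h -o %T | sort | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

# For interactive use, a custom column layout can be requested the same way:
#   squeue --me -o "%.10i %.9P %.20j %.8T %.10M %R"
# (%i job ID, %P partition, %j name, %T state, %M time used, %R reason or node list)
```

A single call like this is cheaper for the scheduler than running one squeue command per state.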

Using sacct

The sacct command retrieves accounting information about active and completed jobs.

Basic usage:

sacct

Customize output by specifying fields:

sacct --format=JobID,JobName,State,Start,Elapsed

Filter jobs by date:

sacct -S 2024-06-01 -E 2024-06-13

Display only failed jobs:

sacct -X --format=User,JobName,State -s F --start=2024-06-01 --end=now

Filter by specific job IDs:

sacct -j 123456,123457
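
For use in scripts, sacct can print pipe-delimited output without a header (-n -P), which is easier to parse than the default table. The sketch below reports the outcome of a job and checks whether it completed. The function names job_outcome and job_succeeded are hypothetical, and the check assumes that COMPLETED is the only state you count as success.

```shell
#!/bin/bash
# Hypothetical helper: print "JobID State ExitCode" for each allocation of a job.
job_outcome() {
    local jobid="$1"
    # -X: allocation only (no steps); -n: no header; -P: pipe-delimited output
    sacct -j "$jobid" -X -n -P --format=JobID,State,ExitCode |
        awk -F'|' '{ printf "%s %s %s\n", $1, $2, $3 }'
}

# Hypothetical helper: return 0 only if sacct lists the job and every
# allocation line has the state COMPLETED.
job_succeeded() {
    local jobid="$1"
    sacct -j "$jobid" -X -n -P --format=State |
        awk '{ if ($0 == "COMPLETED") ok = 1; else bad = 1 }
             END { exit ((ok && !bad) ? 0 : 1) }'
}
```

For example, job_succeeded 123456 && echo "done" prints "done" only when the accounting record shows the job as COMPLETED.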

Using sstat

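Use sstat to report resource usage for the steps of a job that is currently running. If you give only a job ID, sstat reports on the lowest-numbered regular step; to query the batch script itself, append .batch to the job ID (for example, 123456.batch):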

sstat -j 123456 -o JobID,MaxRSS
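
MaxRSS is typically printed with a unit suffix such as K, M, or G, which makes values hard to compare at a glance. The sketch below converts such a value to MiB. The function names to_mib and job_max_rss_mib are hypothetical, and treating a value without a suffix as bytes is an assumption you should check against your site's output.

```shell
#!/bin/bash
# Hypothetical helper: convert a memory value such as 2048K, 512M, or 1.5G to MiB.
to_mib() {
    awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v), 1)
        num = v + 0                  # numeric prefix; awk ignores the trailing unit letter
        if (unit == "K") num /= 1024
        else if (unit == "G") num *= 1024
        else if (unit == "T") num *= 1024 * 1024
        else if (unit != "M") num /= 1024 * 1024   # assumption: no suffix means bytes
        printf "%.1f\n", num
    }'
}

# Hypothetical helper: MaxRSS of the first reported step of a running job, in MiB.
job_max_rss_mib() {
    local value
    # -n: no header; -P: pipe-delimited output
    value="$(sstat -j "$1" -n -P -o MaxRSS | head -n 1)"
    [ -n "$value" ] && to_mib "$value"
}
```

For example, to_mib 2048K prints 2.0 and to_mib 1.5G prints 1536.0.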

Using jobstats

jobstats is a Python-based reporting tool that summarizes job activity using data from sacct, squeue, and sreport.

module load python
jobstats

Example usage:

jobstats --user bsencer --start 2025-06-01 --end 2025-06-13

To display all options:

jobstats --help

Email Notifications

To receive notifications when your job begins, ends, or fails, add the following directives to your Slurm job script:

#SBATCH --mail-type=begin,end,fail
#SBATCH --mail-user=your@email.com

Modifying or Canceling Jobs

To cancel a job:

scancel 123456

To cancel multiple jobs:

scancel 123456 123457

To cancel all jobs submitted by your user:

scancel -u $USER
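
Cancelling every job is often more than you want. scancel can also filter by state, so running jobs are left alone. The sketch below lists your pending jobs first and then cancels only those. The function name cancel_my_pending is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: show, then cancel, only your pending jobs.
cancel_my_pending() {
    local ids
    # -t PD restricts the listing to pending jobs; %i prints the job ID
    ids="$(squeue --me -h -t PD -o %i)"
    if [ -z "$ids" ]; then
        echo "No pending jobs to cancel"
        return 0
    fi
    echo "Cancelling pending jobs: $(echo "$ids" | paste -sd' ' -)"
    # --state limits scancel to jobs in the given state, so running jobs are untouched
    scancel -u "${USER:-$(id -un)}" --state=PENDING
}
```

Because scancel applies the state filter itself, a job that starts running between the listing and the cancel request is not cancelled.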

To update a job’s time limit (regular users can typically only lower the limit; raising it requires an administrator):

scontrol update jobid=123456 timelimit=02:00:00

Holding, Releasing, and Requeuing Jobs

Place a pending job on hold (prevents it from being scheduled):

scontrol hold 123456

Release a held job:

scontrol release 123456

Requeue a job (e.g., after failure or timeout):

scontrol requeue 123456
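
A held job stays in the PENDING state, and squeue reports the hold as the pending reason. The sketch below uses that to find and release the jobs you have held yourself. The function names list_held_jobs and release_my_held_jobs are hypothetical, and the example assumes plain (non-array) job IDs.

```shell
#!/bin/bash
# Hypothetical helper: list your pending jobs that are held by a user-level hold.
list_held_jobs() {
    # %i: job ID; %r: reason the job is pending (JobHeldUser for a user hold)
    squeue --me -h -t PD -o "%i %r" | awk '$2 == "JobHeldUser" { print $1 }'
}

# Hypothetical helper: release all of those jobs with one scontrol call.
release_my_held_jobs() {
    local ids
    # scontrol accepts a comma-separated job list, e.g. 101,102
    ids="$(list_held_jobs | paste -sd, -)"
    if [ -z "$ids" ]; then
        echo "No held jobs found"
        return 0
    fi
    scontrol release "$ids"
}
```

Jobs held by an administrator show a different reason and are deliberately skipped here.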