Job Queues on REPACSS

This page explains how job queuing works on REPACSS and how users can manage their job submissions effectively.


Queue Overview

When a user submits a job to the REPACSS system, it is placed into a queue controlled by the Slurm workload manager. This queue is governed by a priority-based scheduling system, where jobs are executed based on criteria such as resource availability, job size, requested walltime, and policy-defined priorities.

Although all jobs are executed on the same underlying compute nodes, the order in which they begin is determined by these scheduling policies. Some jobs, especially those with smaller resource demands or shorter walltimes, may experience shorter wait times.


Queue Wait Times

Jobs often wait in the queue longer than they take to run. Factors influencing wait times include:

  • Current system workload
  • Availability of requested resources (e.g., nodes, CPUs, GPUs)
  • Size and duration of the job
  • Assigned job priority

Smaller or shorter jobs may start earlier through the backfill mechanism, described under Scheduling Algorithms.


Viewing the Queue

Users may inspect the status of their queued jobs with the following command:

squeue -u $USER

To estimate when a job is expected to start, use:

squeue --start -j <job_id>

Pending jobs will include a status reason under the NODELIST(REASON) column, indicating why execution has not yet begun (e.g., "Resources", "Priority", or "Dependency").
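To see at a glance why jobs are waiting, the reasons can be tallied. The following is a minimal sketch, not an official REPACSS tool: the function name pending_reasons is hypothetical, and it relies only on standard squeue options (-h to suppress the header, -t PENDING to filter by state, and -o "%r" to print the reason field).

```shell
# Hypothetical helper: count pending jobs per reason for one user.
# Usage: pending_reasons [username]   (defaults to $USER)
pending_reasons() {
    squeue -u "${1:-$USER}" -h -t PENDING -o "%r" |
        awk '{ count[$0]++ } END { for (r in count) print count[r], r }' |
        sort -rn
}
```

Running pending_reasons with no argument lists one line per reason, most frequent first. A large "Priority" count suggests the jobs are simply behind others in line, while "Resources" suggests the requested nodes are not yet free.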


Scheduling Mechanics

REPACSS employs Slurm to manage job execution and optimize resource utilization.

Priority and Job Aging

Each job is assigned a priority value that increases over time through a mechanism known as aging. This process ensures that jobs do not remain indefinitely in the queue. However, to maintain fairness, only a limited number of jobs per user can accumulate priority concurrently.

This approach prevents individual users from monopolizing the scheduling queue and promotes equitable access for all users.
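As an illustration of aging, the sketch below assumes the site uses Slurm's multifactor priority plugin, in which the age contribution grows linearly with eligible wait time and stops growing once a configured maximum age is reached. The weight and maximum age used here are made-up values for illustration, not the REPACSS configuration, and the function name age_priority is hypothetical.

```shell
# Illustrative only: age contribution = weight * min(age, max_age) / max_age.
# Arguments: weight, max_age_minutes, age_minutes (all integers).
age_priority() {
    weight=$1
    max_age=$2
    age=$3
    # Once a job has waited longer than max_age, its age factor is capped.
    if [ "$age" -gt "$max_age" ]; then
        age=$max_age
    fi
    echo $(( weight * age / max_age ))
}

# Assumed values: weight 1000, maximum age 7 days (10080 minutes).
age_priority 1000 10080 5040     # waited 3.5 days: prints 500
age_priority 1000 10080 20160    # waited 14 days: capped, prints 1000
```

Where the multifactor plugin is in use, sprio -j <job_id> reports the actual per-factor priority values for a pending job.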

Scheduling Algorithms

Slurm uses two complementary scheduling strategies:

  • Immediate Scheduling: Rapidly assembles a tentative schedule using the highest-priority jobs.
  • Backfill Scheduling: Identifies and executes smaller jobs that can be run in time gaps without delaying higher-priority jobs.

This hybrid model enables efficient system utilization while allowing for the timely execution of short-duration tasks. Because the schedule is recalculated frequently, job start times shown by the scheduler may fluctuate.
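The core idea of backfill can be sketched as a simple fit check: a lower-priority job may start early only if it fits within the idle nodes and will finish before the next higher-priority job is due to start. This is a deliberately simplified model with a hypothetical function name; the real scheduler also accounts for memory, GPUs, partitions, reservations, and other limits.

```shell
# Simplified backfill check (illustrative, not Slurm's actual algorithm).
# Arguments: idle_nodes, gap_minutes, job_nodes, job_walltime_minutes.
# Returns success (0) if the job fits in the gap, failure (1) otherwise.
fits_in_gap() {
    idle_nodes=$1
    gap_minutes=$2
    job_nodes=$3
    job_minutes=$4
    [ "$job_nodes" -le "$idle_nodes" ] && [ "$job_minutes" -le "$gap_minutes" ]
}

# 4 nodes are idle for the next 120 minutes.
if fits_in_gap 4 120 2 60; then
    echo "2-node, 60-minute job can be backfilled"
fi
if ! fits_in_gap 4 120 2 180; then
    echo "2-node, 180-minute job must wait"
fi
```

The check uses the requested walltime rather than the actual runtime, which is why a job that asks for far more time than it needs is less likely to be backfilled.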


Recommendations for Efficient Scheduling

To minimize queue wait times and maximize job throughput, users are encouraged to:

  • Provide accurate resource and time estimates
  • Avoid requesting excessive compute resources
  • Utilize interactive or shared job modes for testing
  • Decompose large workflows into smaller, manageable tasks
  • Employ job arrays for repetitive or parameterized workloads
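
Several of these recommendations can be combined in one batch script. The sketch below is an assumed example, not a REPACSS template: the job name, array range, resource requests, and the commented-out program are placeholders, and any partition or account options the site requires are omitted. It requests a small, short allocation per task and uses a job array to run ten parameterized tasks.

```shell
#!/bin/bash
#SBATCH --job-name=sweep              # placeholder name
#SBATCH --array=0-9                   # ten independent array tasks
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1             # request only what each task needs
#SBATCH --time=00:10:00               # a realistic, short walltime
#SBATCH --output=sweep_%A_%a.out      # %A = array job ID, %a = task index

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 0 so the
# script can also be tried outside Slurm.
TASK_ID=${SLURM_ARRAY_TASK_ID:-0}

# Derive a per-task parameter from the task index (illustrative choice).
SEED=$((1000 + TASK_ID))
echo "array task ${TASK_ID} running with seed ${SEED}"

# Hypothetical workload; replace with the real program:
# ./my_simulation --seed "$SEED"
```

Submitted with sbatch, each array task is scheduled independently, so short single-core tasks like these are good candidates for backfill.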

Interpreting Job States

Below is a summary of common Slurm job states:

State            Description
PENDING (PD)     Waiting in the queue (for example, for resources, priority, or a dependency)
RUNNING (R)      Currently executing on allocated resources
COMPLETED (CD)   Successfully finished execution
FAILED (F)       Terminated due to an error
CANCELLED (CA)   Cancelled by the user or system administrator
TIMEOUT (TO)     Exceeded requested walltime

To view historical job information:

sacct -u $USER --starttime today
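
To pick out today's jobs that ended abnormally, the sacct output can be filtered by state. This is a minimal sketch with a hypothetical function name; it uses standard sacct options (-X for one line per job allocation, -n for no header, -P for pipe-delimited output, and --format to choose fields).

```shell
# Hypothetical helper: list today's jobs that are neither completed,
# running, nor pending (for example FAILED, TIMEOUT, or CANCELLED).
# Usage: problem_jobs_today [username]   (defaults to $USER)
problem_jobs_today() {
    sacct -u "${1:-$USER}" -X -n -P --starttime today \
          --format=JobID,JobName,State,Elapsed,Timelimit |
        awk -F'|' '$3 != "COMPLETED" && $3 != "RUNNING" && $3 != "PENDING"'
}
```

Comparing the Elapsed and Timelimit columns for TIMEOUT entries helps when choosing a more accurate walltime for the next submission.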

Additional Support

If a job remains in the queue unexpectedly or assistance is needed for job optimization, users may contact the system administrators or file a support request.

Help documentation for Slurm commands is also available:

squeue --help
sacct --help

Related documentation includes: