
Determining Resource Requirements

Before submitting production workloads to REPACSS, it is essential to accurately estimate the resources your jobs will require. Overestimating or underestimating resources can lead to inefficient scheduling, failed jobs, or unnecessary queuing delays. This exercise guides you through strategies to test and fine-tune resource requests in your SLURM job scripts.


Overview of SLURM Resource Requests

SLURM provides several directives to declare resource needs in your job script:

  • --ntasks: Number of tasks to launch (often 1 for serial jobs).
  • --cpus-per-task: Number of CPU cores required per task.
  • --mem: Amount of memory needed per node.
  • --time: Estimated maximum runtime for the job.
  • --gres: Specification for special resources, such as GPUs.

Explicitly specifying these parameters is recommended, especially when submitting multiple or resource-intensive jobs.
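
As a reference point, the sketch below shows how these directives might appear together at the top of a job script. The values are placeholders only and should be replaced with estimates suited to your workload; omit --gres if your job does not need a GPU.

#!/bin/bash
#SBATCH --ntasks=1            # one task (a serial job)
#SBATCH --cpus-per-task=4     # four CPU cores for that task
#SBATCH --mem=8G              # 8 GB of memory on the node
#SBATCH --time=02:00:00       # wall-clock limit of two hours
#SBATCH --gres=gpu:1          # one GPU (only if required)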


Consequences of Incorrect Resource Requests

Underestimating Resources

If your job exceeds the allocated resources (for example, memory), SLURM will terminate the job. Partial results may be lost. Therefore, it is critical to request sufficient resources to avoid interruptions.
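
If a job was killed for exceeding its memory request, you can usually confirm this afterwards with sacct; the job ID below is a placeholder, and the exact State value (for example OUT_OF_MEMORY or FAILED) depends on the SLURM version:

sacct -j <jobid> --format=JobID,State,Elapsed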

Overestimating Resources

Excessive resource requests can increase wait times, as fewer nodes have sufficient free resources to start your job. Over-requesting also limits availability for other users. For fairness and efficiency, request only what you expect to use.


Estimating Resource Requirements

If you are unsure of your job’s memory or disk usage, you have two main options:

  • Estimate requirements in advance using local tests and monitoring.
  • Run a test job with conservative resource allocations and measure actual usage.

Observing Memory Usage Locally

Warning

Do not execute computationally intensive jobs on shared login nodes. Use your personal workstation or a designated test environment to avoid interfering with other users.

On macOS or Windows, you can monitor processes using Activity Monitor or Task Manager. On Linux, the ps and top utilities can help track memory usage.

Example: Using ps

ps ux

The RSS column shows approximate memory usage in kilobytes.
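
If you only want to see a specific program, ps can also select processes by command name; python3 here is just an example:

ps -C python3 -o pid,rss,comm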

Example: Using top

top -u <username>

The RES column shows memory usage in real time. Press q to exit.


Estimating Disk Usage

Disk usage includes:

  • Executable binaries
  • Input files transferred to the job
  • Output files generated during execution
  • Temporary files created by your application

You can check file sizes on a local system with:

ls -lh
du -sh
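
For example, to check the combined size of your input and output directories after a local run (the directory names are placeholders):

du -sh input_data/ results/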

Running a Test Job

The most reliable method to measure your job’s resource needs is to submit a small-scale test job and examine the SLURM job statistics after completion.

Example Test Script

#!/usr/bin/env python3
# Small workload used only to measure resource usage:
# build a list of numbers, write it to disk, then idle so the
# job runs long enough for SLURM to record usage statistics.
import time

size = 1000000
numbers = [str(i) for i in range(size)]
with open('numbers.txt', 'w') as f:
    f.write(' '.join(numbers))

time.sleep(60)

Example SLURM Job Script (test_job.slurm)

#!/bin/bash
#SBATCH --job-name=resource_test
#SBATCH --output=resource_test.out
#SBATCH --error=resource_test.err
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G

module load python

srun python test_script.py

Submit the job:

sbatch test_job.slurm
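
While the job is queued or running, you can check its status with squeue:

squeue -u $USER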

When the job completes, check memory and CPU usage:

seff <jobid>

Example seff Output:

Job ID: 12345
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:01:00
Memory Utilized: 98.50 MB
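
If seff is not available, or you want per-step detail, sacct reports similar information; the MaxRSS column shows the peak resident memory of each job step:

sacct -j <jobid> --format=JobID,MaxRSS,Elapsed,State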

Specifying Resource Requests in Job Scripts

Based on test job results, update your SLURM script with appropriate resource estimates, rounding up modestly to allow for variability:

#SBATCH --mem=120M     # Rounded up from ~99 MB

Important Notes:

  • --mem units default to megabytes unless you specify G for gigabytes (see the example after this list).
  • --time should reflect the maximum expected runtime.
  • --cpus-per-task should match the number of threads used by your program.
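
For instance, the two directives below request the same amount of memory; the first relies on the default megabyte units and the second uses an explicit suffix (the values are illustrative):

#SBATCH --mem=2048
#SBATCH --mem=2G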

Verification

After updating your job script:

  1. Re-submit the job.
  2. Confirm successful completion.
  3. Review seff or sacct output to ensure your estimates were adequate.
  4. Adjust further if needed.

If the job failed due to memory or runtime limits, increase your requests incrementally and re-test.


Example of Updated Job Script

#!/bin/bash
#SBATCH --job-name=final_run
#SBATCH --output=final_run.out
#SBATCH --error=final_run.err
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=200M

module load python

srun python production_script.py

If you have any questions about estimating resource requirements or interpreting SLURM job reports, please contact REPACSS support.