Troubleshooting Guide¶
Warning
This page is still under development. The content below may be incomplete or subject to revision.
Use this page to identify and resolve common issues encountered on REPACSS.
Job Not Starting¶
- Check if your job is pending due to resource availability:
squeue --me
- Look at the NODELIST(REASON) column for hints such as:
  - Resources: Not enough nodes available.
  - Priority: Waiting for higher-priority jobs to clear.
  - Dependency: Job depends on another that hasn’t finished.
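To see only the reason codes for your pending jobs, squeue's format option can be used. The following is a minimal sketch: explain_reason is a hypothetical helper (not part of Slurm) that restates the three reasons listed above, and the squeue call is skipped on machines where Slurm is not installed.

```shell
# explain_reason is a hypothetical helper, not part of Slurm.
# It maps a pending-job reason code to the short explanation from this guide.
explain_reason() {
    case "$1" in
        Resources)  echo "Not enough nodes available." ;;
        Priority)   echo "Waiting for higher-priority jobs to clear." ;;
        Dependency) echo "Job depends on another that hasn't finished." ;;
        *)          echo "See JOB REASON CODES in the squeue man page for: $1" ;;
    esac
}

# List only your pending jobs: job ID, partition, reason (%r), and name.
# A "|" separator is used so that job names containing spaces stay intact.
if command -v squeue >/dev/null 2>&1; then
    squeue --me --states=PENDING --noheader --format="%i|%P|%r|%j" |
    while IFS='|' read -r jobid partition reason name; do
        echo "$jobid ($name on $partition): $(explain_reason "$reason")"
    done
fi
```

Reason codes other than these three fall through to the last branch, which points to the full list in the squeue man page.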
Invalid Account or Partition¶
If you see:
sbatch: error: Job request does not match any supported policy.
- Ensure your account/project is valid and available for the selected partition.
- Set your account in the job script (replace m1234 with your own account):
#SBATCH --account=m1234
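Before resubmitting, it can help to confirm which account and partition combinations Slurm has on record for you, and that the job script actually sets them. This is a sketch under assumptions: has_sbatch_option is a hypothetical helper, job.sh is a placeholder file name, and the sacctmgr output may be restricted by site policy.

```shell
# has_sbatch_option is a hypothetical helper: it succeeds if the job script
# contains an "#SBATCH --<option>=..." or "#SBATCH --<option> ..." directive.
has_sbatch_option() {
    script=$1
    option=$2
    grep -Eq "^#SBATCH[[:space:]]+--${option}([[:space:]=]|\$)" "$script"
}

# Show the account/partition/QOS associations recorded for your user.
if command -v sacctmgr >/dev/null 2>&1; then
    sacctmgr show associations user="${USER:-$(id -un)}" format=Account,Partition,QOS
fi

# Example (job.sh is a placeholder name):
#   has_sbatch_option job.sh account   || echo "job.sh sets no --account"
#   has_sbatch_option job.sh partition || echo "job.sh sets no --partition"
```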
Python Module Not Found¶
If you get:
ModuleNotFoundError: No module named 'numpy'
- Check that you’ve activated the correct conda environment:
conda activate myenv
- If the package is missing from that environment, install it:
pip install numpy
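A frequent cause of this error is that pip and the Python interpreter belong to different environments. The sketch below checks which interpreter is active and whether it can import the module. It assumes python3 is the interpreter name on PATH; module_available is a hypothetical helper.

```shell
# module_available is a hypothetical helper: it succeeds if the Python
# interpreter currently on PATH can import the named module.
module_available() {
    python3 -c "import $1" >/dev/null 2>&1
}

# Show which interpreter is active. Inside an activated conda environment,
# this path should point into that environment's directory.
command -v python3 || echo "python3 is not on PATH"

if module_available numpy; then
    echo "numpy is importable"
else
    # Running pip through the interpreter installs into that same environment.
    echo "numpy is missing; try: python3 -m pip install numpy"
fi
```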
SSH Connection Issues¶
If login is slow or stuck:
- Try verbose output to debug:
ssh -vvv your_username@repacss.ttu.edu
- Make sure the VPN is connected properly.
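Verbose ssh output is long, so it may be easier to save it to a file and pull out the lines that usually show where the connection stalls. This is a sketch: summarize_ssh_log is a hypothetical helper, ssh_debug.log is a placeholder file name, and the patterns are common OpenSSH client messages rather than an exhaustive list.

```shell
# summarize_ssh_log is a hypothetical helper: it prints the lines of a saved
# "ssh -vvv" log that most often show where a connection stalls or fails.
summarize_ssh_log() {
    grep -E 'Connecting to|Connection established|Authentications that can continue|Connection timed out|Connection refused|Permission denied' "$1"
}

# Capture the debug output (ssh writes it to stderr), then summarize it:
#   ssh -vvv -o ConnectTimeout=15 your_username@repacss.ttu.edu 2> ssh_debug.log
#   summarize_ssh_log ssh_debug.log
```

If the log stops right after "Connecting to", the problem is likely on the network side (for example the VPN); if it reaches the authentication lines, the connection itself is working.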
"command not found"¶
If a common command like gcc, python, or nvcc fails:
- Check whether the required module is loaded, and load it if it is not:
module list
module load gcc/12.2.0
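To check several commands at once, the shell's built-in command -v can report whether each one is on PATH. A minimal sketch follows; require_cmd is a hypothetical helper, and the module that provides a given command may have a different name than the command itself (nvcc, for instance, normally ships with a CUDA toolkit module).

```shell
# require_cmd is a hypothetical helper: it reports whether a command is on
# PATH and, if not, suggests searching the available modules.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found; search the available modules with: module avail" >&2
        return 1
    fi
}

# Check the commands mentioned above; keep going even if one is missing.
for cmd in gcc python nvcc; do
    require_cmd "$cmd" || true
done
```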
Poor Performance¶
- Check that you’re using appropriate compiler optimization flags (for example -O2 or -O3) and parallel settings (thread and task counts that match your allocation).
- Use profiling tools like:
time ./a.out
perf stat ./a.out
- Review CPU or GPU utilization via sstat or jobstats.
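One rough indicator of poor performance is CPU efficiency: the CPU time a job actually used divided by the CPU time its allocation could have provided. The sketch below computes that ratio. cpu_efficiency is a hypothetical helper that expects values in seconds, while sacct reports Elapsed and TotalCPU as time strings, so they need converting first.

```shell
# cpu_efficiency is a hypothetical helper. Arguments:
#   $1 = total CPU time used (seconds)
#   $2 = elapsed wall-clock time (seconds)
#   $3 = number of allocated CPUs
# Prints used CPU time as a percentage of what the allocation could provide.
cpu_efficiency() {
    awk -v used="$1" -v wall="$2" -v cpus="$3" \
        'BEGIN { printf "%.1f\n", 100 * used / (wall * cpus) }'
}

# Example: 3600 s of CPU time over 1800 s of wall time on 4 CPUs.
cpu_efficiency 3600 1800 4   # prints 50.0

# On the cluster, the raw numbers can be read from Slurm (replace <jobid>):
#   sstat -j <jobid> --format=JobID,AveCPU,MaxRSS               # running job
#   sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,AllocCPUS  # finished job
```

A value far below 100 suggests that some allocated cores are idle, which often points to a mismatch between the requested CPUs and the program's thread or task settings.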
Disk Quota or Storage Issues¶
If you see:
Disk quota exceeded
- Use du -sh * to identify large files and directories.
- Remove or move unnecessary data.
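In a directory with many entries, sorting the du output makes the largest items easier to spot. A minimal sketch, assuming a hypothetical helper named largest_entries; it reports sizes in KiB and, like du -sh *, skips hidden files.

```shell
# largest_entries is a hypothetical helper: it lists the N largest files and
# directories directly under a directory, largest first, with sizes in KiB.
largest_entries() {
    dir=$1
    count=${2:-10}
    du -sk -- "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Example: the ten largest entries in the current directory.
largest_entries . 10
```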
Data Transfer Fails¶
If scp or rsync fails:
- Double-check the paths and permissions.
- For large transfers, consider:
rsync -avzP source/ user@repacss.ttu.edu:/destination/
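Because -P keeps partially transferred files, rerunning the same rsync command resumes where it stopped. For unreliable connections, the rerun can be automated with a small retry loop. This is a sketch: retry is a hypothetical helper, and the attempt count and delay are arbitrary example values.

```shell
# retry is a hypothetical helper: it runs a command until it succeeds,
# up to a maximum number of attempts, sleeping between attempts.
#   $1 = maximum attempts, $2 = delay in seconds, remaining args = command
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Example: up to 5 attempts, 30 seconds apart.
#   retry 5 30 rsync -avzP source/ user@repacss.ttu.edu:/destination/
```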
Infinite Loop or Hanging Job¶
- Make sure loops have valid exit conditions.
- Add logging or print statements to verify execution flow.
- You can cancel the job with:
scancel <jobid>
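The first two points can be illustrated with a polling loop that has an explicit upper bound and logs each iteration. This is a sketch only: wait_for_file is a hypothetical helper and results.dat is a placeholder file name.

```shell
# wait_for_file is a hypothetical helper: it polls for a file, logs each
# check with a timestamp, and gives up after a fixed number of tries.
#   $1 = path, $2 = maximum tries, $3 = delay in seconds (default 5)
wait_for_file() {
    path=$1
    max_tries=$2
    delay=${3:-5}
    try=1
    while [ "$try" -le "$max_tries" ]; do
        echo "$(date '+%H:%M:%S') check $try/$max_tries for $path"
        if [ -e "$path" ]; then
            return 0
        fi
        try=$((try + 1))
        sleep "$delay"
    done
    echo "gave up waiting for $path" >&2
    return 1
}

# Example: wait_for_file results.dat 60 10
# A hard limit can also be put on a single command with coreutils timeout:
#   timeout 30m ./a.out
```

The timestamps in the job output show whether the loop is still progressing, and the try limit guarantees the job ends even if the file never appears.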
Module Conflicts¶
- If loading a module breaks something:
module purge
module load only_what_you_need
Need Help?¶
If all else fails:
- Reach out to your advisor or system admin.
- Provide job ID, error message, and the job script when asking for help.
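Gathering these items into a single file can save a round trip when asking for help. The sketch below shows one possible way to do that: collect_report is a hypothetical helper, all file names in the example are placeholders, and the sacct section is added only where Slurm is installed.

```shell
# collect_report is a hypothetical helper: it bundles the job ID, the job
# script, and the end of the error log into one text file.
#   $1 = job ID, $2 = job script, $3 = error/output log, $4 = report file
collect_report() {
    jobid=$1; script=$2; errlog=$3; out=$4
    {
        echo "Job ID: $jobid"
        echo "--- job script: $script ---"
        cat "$script"
        echo "--- last 50 lines of: $errlog ---"
        tail -n 50 "$errlog"
        if command -v sacct >/dev/null 2>&1; then
            echo "--- sacct ---"
            sacct -j "$jobid" --format=JobID,JobName,Partition,State,ExitCode,Elapsed
        fi
    } > "$out"
}

# Example (all names are placeholders):
#   collect_report 123456 job.sh slurm-123456.out report.txt
```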