Running Ollama on REPACSS
A comprehensive guide for running Ollama (local LLM inference) on REPACSS GPU clusters using containers.
Overview
Ollama is a popular AI application for running large language models locally. This guide shows you how to deploy Ollama using containers on REPACSS GPU nodes.
This is an example of running a user-installed container for an AI application. For general container usage, see Using Containers.
The REPACSS team also provides a repository with pre-configured scripts (https://github.com/nsfcac/ollama_repacss.git) that includes:
- Helper scripts for easy setup
- Example Python scripts for testing
- Jupyter notebook tutorials
- Pre-configured environment settings
One-Time Setup
Clone the Repository (Only Once)
Choose a directory and clone the Ollama configuration repository. This only needs to be done once:
cd $HOME
mkdir <your_ollama_working_dir>
cd <your_ollama_working_dir>
git clone https://github.com/nsfcac/repacss_ollama_configuration.git
This repository contains:
- setup_ollama.sh: A setup script that installs and launches the Ollama server
- ollama.sh: A script that sets the configuration of the Ollama server
- test.py: An example Python script to verify your Ollama server is working
- tutorial.ipynb: A Jupyter notebook that connects to your Ollama server
- requirements.txt: Python libraries needed to run the notebook
Running Ollama (Every Time)
Step 1: Request GPU Resources
Use the following command to start an interactive job with 1 GPU for 2 hours:
interactive -p h100 -t 02:00:00 -g 1
Step 2: Set Up Environment and Start Ollama
This guide uses Ollama 0.6.8. The first run takes several minutes to download the Ollama .sif container image:
# Navigate to your repository
cd $HOME/<your_ollama_working_dir>/repacss_ollama_configuration
# Set up and launch the Ollama server
source setup_ollama.sh
Using Ollama
Step 1: Check Available Models
REPACSS provides a shared LLM directory, so some models may already be available. Check the model list first:
ollama list
Step 2: Run Inference with Existing Models
Use existing models (example with falcon3:1b):
ollama run falcon3:1b
Step 3: Pull and Use New Models (Optional)
If you want to use a new model, choose a model supported by Ollama and pull it:
# Pull a new model
ollama pull llama3.1:8b
# Test the new model
ollama run llama3.1:8b
>>> what is the capital of Texas?
Step 4: Verify Server Status from Login Node
From a login node (not the GPU node), check if the Ollama server is running:
curl http://<hostname>:<port>
You can find host.txt and port.txt files in your <your_ollama_working_dir>/ollama folder. Replace <hostname> with your GPU node's hostname and <port> with your Ollama server's port number.
You should see:
Ollama is running
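If you prefer to script this check, the same test can be done from Python by reading the saved host and port files. The following is a minimal sketch, assuming the host.txt/port.txt layout under $HOME/<your_ollama_working_dir>/ollama described above:
# Minimal health check (sketch); assumes the host.txt/port.txt layout described above
import urllib.request
from pathlib import Path

# Replace <your_ollama_working_dir> with your actual directory name
ollama_dir = Path.home() / "<your_ollama_working_dir>" / "ollama"
host = (ollama_dir / "host.txt").read_text().strip()
port = (ollama_dir / "port.txt").read_text().strip()

# A healthy server answers its root URL with "Ollama is running"
with urllib.request.urlopen(f"http://{host}:{port}", timeout=10) as resp:
    print(resp.read().decode())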
Running Ollama in Batch Mode
For production workloads or long-running inference, use batch jobs:
#!/bin/bash
#SBATCH --job-name=ollama-inference
#SBATCH --partition=h100
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=20
#SBATCH --mem=64G
#SBATCH --time=02:00:00
#SBATCH --output=ollama_%j.out
#SBATCH --error=ollama_%j.err
# Start from a clean module environment
module purge
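# Note: the ollama command must be available in this job's environment
# (see the container setup in the sections above, e.g. setup_ollama.sh)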
# Start Ollama server
ollama serve >ollama.log 2>ollama.err &
# Wait for server to start
sleep 10
# Run inference
ollama run llama3.1:8b << EOF
What is the capital of France?
EOF
# Clean up
pkill ollama
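For scripted inference over many prompts, you can also call the Ollama REST API from a Python script launched by the batch job instead of using the interactive prompt. The following is a minimal sketch, not part of the repository: it assumes the server started above listens at Ollama's default address (127.0.0.1:11434) unless OLLAMA_HOST points elsewhere, and that a model such as llama3.1:8b has already been pulled:
# batch_prompts.py - send several prompts to the Ollama REST API (sketch, not part of the repository)
import json
import os
import urllib.request

# Use OLLAMA_HOST if it is set (e.g. a dynamic port), otherwise Ollama's default address
host = os.environ.get("OLLAMA_HOST", "127.0.0.1:11434")
base_url = host if host.startswith("http") else f"http://{host}"

prompts = [
    "What is the capital of France?",
    "Summarize what Ollama does in one sentence.",
]

for prompt in prompts:
    payload = json.dumps({
        "model": "llama3.1:8b",   # any model shown by `ollama list`
        "prompt": prompt,
        "stream": False,          # return a single JSON object instead of a token stream
    }).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as resp:
        print(json.load(resp)["response"])
Each answer is printed to standard output, so it ends up in the job's output file alongside any other logging.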
Remote Access
Using Jupyter Notebook
You can also access Ollama through Jupyter Notebook for interactive development:
# From login node, start Jupyter
jupyter notebook --no-browser --ip=127.0.0.1 --port=8081
# From your local machine, create SSH tunnel
ssh -L 8081:127.0.0.1:8081 -l <your_account> -fN repacss.ttu.edu
Then open http://127.0.0.1:8081 in your browser and explore the tutorial.ipynb notebook.
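Inside a notebook cell (or any Python session), a chat request against the server's REST API looks roughly like the sketch below. This only illustrates the idea behind tutorial.ipynb, not its actual contents; it assumes the requests package is available in your notebook environment and reuses the host.txt/port.txt files described earlier:
# Example notebook cell (sketch): chat with the Ollama server over its REST API
from pathlib import Path
import requests  # assumed to be installed in the notebook environment

ollama_dir = Path.home() / "<your_ollama_working_dir>" / "ollama"  # replace the placeholder
host = (ollama_dir / "host.txt").read_text().strip()
port = (ollama_dir / "port.txt").read_text().strip()

reply = requests.post(
    f"http://{host}:{port}/api/chat",
    json={
        "model": "falcon3:1b",  # any model shown by `ollama list`
        "messages": [{"role": "user", "content": "What is the capital of Texas?"}],
        "stream": False,
    },
    timeout=300,
)
print(reply.json()["message"]["content"])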
Best Practices
Resource Management
- Use appropriate GPU resources - Start with 1 GPU, scale as needed
- Monitor memory usage - Large models require significant RAM
- Exit sessions promptly - Don't leave GPU sessions idle
- Use batch jobs for long-running inference
Model Selection
- Start with smaller models (1B-3B parameters) for testing
- Use shared models when available to save download time
- Consider model requirements - Some models need specific CUDA versions
Performance Optimization
- Use appropriate batch sizes for your GPU memory
- Monitor GPU utilization with nvidia-smi
- Consider model quantization for faster inference
Troubleshooting
Common Issues
- Port conflicts
# Use dynamic port allocation
export OLPORT=$(python -c "import socket; s = socket.socket(); s.bind(('', 0)); print(s.getsockname()[1]); s.close()")
- Memory issues
# Reduce CPU count or use smaller models
# Check available memory
free -h
- Model download problems
# Check network connectivity
# Use shared models when available
ollama list
Debug Mode
Enable debug output for troubleshooting:
export OLLAMA_DEBUG=1
ollama serve
Log Files
Check log files for detailed information:
# Check server logs
tail -f ollama.log
tail -f ollama.err
Related Documentation
- Software Management Overview - General software management
- Using Containers - General container usage
- Module System - System software via modules
- Running Jupyter Notebook - Another commonly used interactive application
- Running Jobs - Job submission and management
Support
For Ollama-specific issues or assistance, contact:
repacss.support@ttu.edu