# Ollama

Ollama is a local model runner that lets you serve open-source large language models on your own machine, keeping your data private and allowing fully offline usage.
## Supported Models

### Meta LLaMA Series
- llama3 - LLaMA 3 series models
- llama3:8b - LLaMA 3 8B parameter version
- llama3:70b - LLaMA 3 70B parameter version
- llama2 - LLaMA 2 series models
### Other Popular Models
- codellama - Code-specialized model
- mistral - Mistral AI's open-source model
- qwen - Alibaba Qwen (Tongyi Qianwen)
- gemma - Google's open-source model
- phi3 - Microsoft's Phi-3 model
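A tag after the model name selects a specific parameter size or variant. For example, to pull some of the variants listed above:

```bash
# Default tag
ollama pull llama3

# Specific parameter sizes / lightweight variants
ollama pull llama3:8b
ollama pull qwen:7b
ollama pull phi3:mini
```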
## Configuration

### Basic Configuration

Configure Ollama in `config.yaml` or `~/.bytebuddy/config.yaml`:

```yaml
models:
  - name: "local-llama3"
    provider: "ollama"
    model: "llama3"
    apiBase: "http://localhost:11434"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 2000
```

### Multi-Model Configuration
```yaml
models:
  - name: "llama3-chat"
    provider: "ollama"
    model: "llama3"
    apiBase: "http://localhost:11434"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 2000
  - name: "codellama-edit"
    provider: "ollama"
    model: "codellama"
    apiBase: "http://localhost:11434"
    roles: ["edit"]
    defaultCompletionOptions:
      temperature: 0.3
      maxTokens: 1500
  - name: "qwen-autocomplete"
    provider: "ollama"
    model: "qwen:7b"
    apiBase: "http://localhost:11434"
    roles: ["autocomplete"]
    defaultCompletionOptions:
      temperature: 0.1
      maxTokens: 500
```

### Custom Endpoint Configuration
```yaml
models:
  - name: "remote-ollama"
    provider: "ollama"
    model: "llama3"
    apiBase: "http://192.168.1.100:11434"
    roles: ["chat"]
    requestOptions:
      timeout: 120000
      verifySsl: false
```

## Advanced Configuration

### With Authentication
```yaml
models:
  - name: "secure-ollama"
    provider: "ollama"
    model: "llama3"
    apiBase: "http://localhost:11434"
    roles: ["chat"]
    requestOptions:
      timeout: 60000
      headers:
        "Authorization": "Bearer ${OLLAMA_TOKEN}"
```

### Complete Configuration Example
```yaml
models:
  - name: "ollama-complete"
    provider: "ollama"
    model: "llama3:8b"
    apiBase: "http://localhost:11434"
    roles: ["chat", "edit", "apply"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 4000
      topP: 0.9
      stream: true
    requestOptions:
      timeout: 120000
      verifySsl: true
```

## Use Case Configurations

### Local Development
```yaml
models:
  - name: "local-dev"
    provider: "ollama"
    model: "codellama:7b"
    apiBase: "http://localhost:11434"
    roles: ["chat", "edit"]
    defaultCompletionOptions:
      temperature: 0.3
      maxTokens: 2000
```

### Privacy Protection
```yaml
models:
  - name: "private-chat"
    provider: "ollama"
    model: "llama3"
    apiBase: "http://localhost:11434"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 1000
    requestOptions:
      timeout: 60000
```

### Fast Response
```yaml
models:
  - name: "fast-local"
    provider: "ollama"
    model: "phi3:mini"
    apiBase: "http://localhost:11434"
    roles: ["autocomplete"]
    defaultCompletionOptions:
      temperature: 0.1
      maxTokens: 200
    requestOptions:
      timeout: 10000
```

## Installation and Setup

### 1. Install Ollama
```bash
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download the installer from https://ollama.com/download
```
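To confirm the installation, check that the `ollama` CLI is available on your PATH:

```bash
# Print the installed Ollama version
ollama --version
```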
### 2. Start Ollama Service

```bash
# Start the service
ollama serve

# Or run it in the background
nohup ollama serve > ollama.log 2>&1 &
```
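Once running, the service listens on port 11434 by default. A quick way to confirm it is up is to query the server's version endpoint:

```bash
# Should return a small JSON payload with the server version
curl http://localhost:11434/api/version
```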
### 3. Download Models

```bash
# Download LLaMA 3
ollama pull llama3

# Download the code-specialized model
ollama pull codellama

# Download a lightweight model
ollama pull phi3:mini

# List downloaded models
ollama list
```

### 4. Test Models
```bash
# Test a single prompt
ollama run llama3 "Hello, how are you?"

# Interactive chat
ollama run llama3
```
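Because ByteBuddy talks to Ollama through the HTTP API configured in `apiBase`, it can also be useful to test the API directly. A minimal sketch using Ollama's `/api/generate` endpoint:

```bash
# Single non-streaming completion over the HTTP API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'
```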
## Advanced Usage

### Custom Models
```bash
# Create a custom model from a Modelfile
ollama create mymodel -f Modelfile
```

Example `Modelfile`:

```
FROM llama3
SYSTEM """You are a helpful AI assistant."""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
```
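The custom model can then be referenced from ByteBuddy like any other Ollama model; a minimal sketch reusing the `mymodel` name created above (the `name` value is arbitrary):

```yaml
models:
  - name: "my-custom-assistant"
    provider: "ollama"
    model: "mymodel"
    apiBase: "http://localhost:11434"
    roles: ["chat"]
```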
### GPU Acceleration

Make sure your system has the proper GPU drivers installed:
- NVIDIA GPU: install the CUDA drivers; Ollama automatically detects and uses the GPU.
- Apple Silicon (M1/M2/M3): Ollama automatically uses Metal Performance Shaders.
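To check whether a loaded model is actually using the GPU, you can inspect the driver and Ollama's process list (the `ollama ps` command is available in recent Ollama releases):

```bash
# Confirm the NVIDIA driver sees the GPU (NVIDIA systems only)
nvidia-smi

# Show loaded models and whether they run on GPU or CPU
ollama ps
```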
### Remote Access

```bash
# Allow remote access (listen on all interfaces)
OLLAMA_HOST=0.0.0.0 ollama serve

# Configure firewall rules
sudo ufw allow 11434
```
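From a client machine, you can verify the exposed endpoint is reachable before pointing `apiBase` at it (substitute your server's address for the example IP):

```bash
# Lists the models available on the remote Ollama server
curl http://192.168.1.100:11434/api/tags
```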
## Troubleshooting

Q: Ollama connection failed?
A: Check the following:
- The Ollama service is running: `ps aux | grep ollama`
- Port 11434 is listening: `netstat -an | grep 11434`
- Firewall settings are correct
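If those checks pass but ByteBuddy still cannot connect, issue the same kind of request the editor would make against the configured `apiBase`:

```bash
# A 200 response with a JSON model list means the API is reachable
curl -i http://localhost:11434/api/tags
```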
Q: Model downloads are slow?
A:
- Use mirror sources or proxy
- Choose smaller model versions
- Consider using pre-downloaded model files
Q: Running out of memory?
A:
- Choose smaller models (7B instead of 70B)
- Close other memory-intensive programs
- Consider increasing system memory
Q: How do I update models?
A:

```bash
# Pull the latest version of a model
ollama pull llama3:latest

# Show details of the locally installed model
ollama show llama3
```

Q: Responses take too long?
A:
- Use smaller models
- Enable GPU acceleration
- Reduce the `maxTokens` setting
- Use quantized versions (e.g., q4 variants)
## Best Practices

### 1. Model Selection
- Development/Testing: Use smaller 7B models
- Production: Choose size based on hardware
- Privacy Sensitive: Always use local models
- Speed Priority: Use quantized versions
### 2. Hardware Optimization
- GPU: Use CUDA or Metal acceleration
- Memory: At least 16GB RAM for 7B models
- Storage: Use SSD for faster loading
### 3. Security Considerations
- Local deployment ensures data privacy
- Regularly update Ollama and models
- Limit network access (localhost only if needed)
### 4. Performance Tuning

- Preload commonly used models (see the sketch below)
- Use appropriate temperature parameters
- Enable streaming responses
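One way to preload a model (and keep it resident in memory) is to send an empty generate request with a `keep_alive` value, which recent Ollama versions support; a sketch assuming the default local endpoint:

```bash
# Load llama3 into memory and keep it loaded indefinitely
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "keep_alive": -1
}'
```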
## Model Recommendations
| Use Case | Recommended Model | VRAM Required | Features |
|---|---|---|---|
| General Chat | llama3:8b | ~8GB | Balanced performance |
| Code Generation | codellama:7b | ~8GB | Code-specialized |
| Lightweight Tasks | phi3:mini | ~4GB | Fast response |
| High-Quality Chat | llama3:70b | ~40GB | High quality |
| Chinese Optimized | qwen:7b | ~8GB | Good Chinese support |