How to Self-Host a Model

Self-hosting AI models gives you complete control over your data and enables offline AI assistance. This guide covers several approaches to self-hosting models for use with ByteBuddy.

Why Self-Host Models?

Privacy and Security

  • Data Control: Keep sensitive code and data on-premises
  • No External API Calls: Eliminate external data transmission
  • Compliance: Meet regulatory requirements for data handling

Cost Management

  • Eliminate API Costs: No per-request charges
  • Predictable Expenses: Fixed infrastructure costs
  • Scalability: Scale according to your needs

Performance Benefits

  • Low Latency: Direct access to models
  • Custom Hardware: Optimize for your specific hardware
  • Priority Access: No queueing for shared resources

Self-Hosting Options

1. Ollama

Ollama is the easiest way to get started with local models:

bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull and run a model
ollama pull llama3:8b
ollama run llama3:8b
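
Before wiring this into ByteBuddy, it is worth confirming that the Ollama server answers over HTTP. A quick check against the standard generate endpoint (the same one used in the troubleshooting section below) looks like this:

bash
# Ask the local Ollama server for a short completion
curl http://localhost:11434/api/generate -d '{"model":"llama3:8b","prompt":"Hello","stream":false}'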

Configure in ByteBuddy:

yaml
models:
  - name: "local-ollama"
    provider: "ollama"
    model: "llama3:8b"
    baseURL: "http://localhost:11434"

2. Hugging Face Transformers

Run models using the transformers library:

bash
# Install dependencies
pip install transformers torch accelerate

# Run a model server (subcommands and flags vary across transformers releases;
# check `transformers-cli serve --help` for the version you installed)
transformers-cli serve --model-id meta-llama/Llama-3-8b

3. Text Generation WebUI

A web-based interface for running models:

bash
# Clone the repository
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui

# Install dependencies
pip install -r requirements.txt

# Run the web UI
python server.py --model llama-3-8b
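
To let ByteBuddy or any other HTTP client reach the model rather than only the web interface, start the server with its API enabled. The flags below exist in current releases of text-generation-webui, but confirm them against `python server.py --help` for the version you checked out:

bash
# Serve the model and expose the HTTP API on the local network
python server.py --model llama-3-8b --api --listen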

Hardware Requirements

CPU-Only Setup

Minimum requirements:

  • RAM: 16GB (32GB recommended)
  • Storage: 50GB free space
  • CPU: Modern multi-core processor

Example configuration:

yaml
models:
  - name: "cpu-model"
    provider: "ollama"
    model: "llama3:8b"
    baseURL: "http://localhost:11434"
    options:
      num_thread: 8 # Limit CPU threads

GPU-Accelerated Setup

Recommended for better performance:

  • NVIDIA GPU: 12GB+ VRAM (RTX 3080 or better)
  • AMD GPU: 12GB+ VRAM (RX 6800 or better)
  • Apple Silicon: M1/M2 with 16GB+ unified memory

NVIDIA setup:

bash
# Install NVIDIA drivers and CUDA toolkit
# Then install Ollama (automatically uses CUDA)

# Or use text-generation-webui, which uses the GPU automatically when CUDA is available
python server.py --model llama-3-8b
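
Once a model is loaded, you can confirm that inference is actually running on the GPU. `nvidia-smi` shows VRAM usage, and recent Ollama releases include `ollama ps`, which reports whether each loaded model is on the CPU or GPU:

bash
# Watch VRAM usage while a prompt is running
nvidia-smi

# Recent Ollama versions report the processor (CPU/GPU) per loaded model
ollama ps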

Multi-GPU Setup

For enterprise deployments:

yaml
models:
  - name: "multi-gpu-model"
    provider: "textgen"
    model: "llama-3-70b"
    baseURL: "http://localhost:5000"
    options:
      gpu_split: "20,20" # Split across 2 GPUs

Model Selection

Small Models (1-5GB)

Good for basic tasks:

  • Mistral 7B: 4.1GB, good balance of size and capability
  • Phi-3 3.8B: 2.4GB, Microsoft's compact model
  • Gemma 2B: 1.6GB, Google's lightweight model

bash
# Download small models
ollama pull mistral:7b
ollama pull phi3:3.8b
ollama pull gemma:2b

Medium Models (4-45GB)

Good for most development tasks:

  • Llama 3 8B: 4.7GB, versatile and capable
  • CodeLlama 7B: 4.1GB, coding-optimized
  • Mixtral 8x7B: 45GB, Mixture-of-Experts model

bash
# Download medium models
ollama pull llama3:8b
ollama pull codellama:7b
ollama pull mixtral:8x7b

Large Models (30GB+)

For complex tasks requiring maximum capability:

  • Llama 3 70B: 40GB, state-of-the-art performance
  • Mixtral 8x22B: 140GB, powerful MoE model

bash
# Download large models (requires significant resources)
ollama pull llama3:70b

Deployment Strategies

Single Machine Deployment

Simple setup for individual developers:

bash
# Start Ollama service
ollama serve

# Pull required models
ollama pull llama3:8b
ollama pull codellama:7b

Configure ByteBuddy:

yaml
models:
  - name: "primary-model"
    provider: "ollama"
    model: "llama3:8b"
    baseURL: "http://localhost:11434"

  - name: "coding-model"
    provider: "ollama"
    model: "codellama:7b"
    baseURL: "http://localhost:11434"

Docker Deployment

Containerized deployment for consistency:

dockerfile
# Dockerfile
FROM ollama/ollama:latest

COPY models/ /root/.ollama/models/

EXPOSE 11434

CMD ["ollama", "serve"]

bash
# Build and run
docker build -t my-ollama .
docker run -d -p 11434:11434 my-ollama
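
If the host has an NVIDIA GPU and the NVIDIA Container Toolkit installed, the stock image can also be run with GPU access and a named volume for the model store, following Ollama's published Docker instructions. Models are then pulled into the running container instead of being baked into the image:

bash
# Run Ollama with GPU access and persistent model storage
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Pull a model inside the running container
docker exec -it ollama ollama pull llama3:8b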

Kubernetes Deployment

For enterprise-scale deployments:

yaml
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: models
              mountPath: /root/.ollama/models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models-pvc
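
The manifest assumes a PersistentVolumeClaim named ollama-models-pvc already exists in the namespace. After applying it, expose the deployment inside the cluster and check that the pod is serving:

bash
# Apply the manifest and expose it inside the cluster
kubectl apply -f ollama-deployment.yaml
kubectl expose deployment ollama --port 11434 --target-port 11434

# Check the pod, then port-forward and hit the API locally
kubectl get pods -l app=ollama
kubectl port-forward deployment/ollama 11434:11434
# in another terminal:
curl http://localhost:11434/api/tags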

Model Optimization

Quantization

Reduce model size and improve inference speed:

bash
# Use quantized models (automatically handled by Ollama)
ollama pull llama3:8b-q4_0  # 4-bit quantized
ollama pull llama3:8b-q8_0  # 8-bit quantized
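
After pulling more than one quantization, `ollama list` reports the on-disk size of each tag, which makes the trade-off easy to compare:

bash
# Compare the download sizes of the quantized variants
ollama list | grep llama3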

Model Pruning

Remove unnecessary parameters:

python
# Illustrative sketch using Hugging Face Transformers with PyTorch's pruning utilities
from torch import nn
from torch.nn.utils import prune
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8b")

# Apply 20% unstructured magnitude pruning to every linear layer
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.2)

Knowledge Distillation

Create smaller, faster student models:

python
# Train a smaller model to mimic a larger one
# This requires significant ML expertise

Security Considerations

Network Security

Secure your model servers:

yaml
# Use HTTPS for model endpoints
models:
  - name: "secure-model"
    provider: "ollama"
    model: "llama3:8b"
    baseURL: "https://models.company.internal:11434"
    options:
      ssl_verify: true

Authentication

Add authentication to model servers:

bash
# For text-generation-webui
python server.py --model llama-3-8b --api-auth username:password
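
Ollama itself does not ship with built-in authentication, so a common pattern is to bind it to the loopback interface and put an authenticating reverse proxy (nginx, Caddy, etc.) in front of it. The bind address is controlled with the OLLAMA_HOST environment variable:

bash
# Only accept connections from the local machine; a reverse proxy handles auth
OLLAMA_HOST=127.0.0.1:11434 ollama serve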

Access Control

Restrict model access:

  • Configure firewall rules
  • Only allow connections from trusted IPs
  • Use a VPN for remote access
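
As a concrete example, on a Linux host running ufw, access to the Ollama port can be limited to a trusted subnet (the subnet below is a placeholder; substitute your own network):

bash
# Allow the trusted subnet, deny everyone else on the Ollama port
sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp
sudo ufw deny 11434/tcp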

Monitoring and Maintenance

Health Monitoring

Monitor model server health:

bash
# Check Ollama status
ollama list
curl http://localhost:11434/api/tags

# Monitor system resources
htop
nvidia-smi  # for GPU monitoring

Performance Metrics

Track performance metrics:

yaml
# Enable logging and metrics
models:
  - name: "monitored-model"
    provider: "ollama"
    model: "llama3:8b"
    baseURL: "http://localhost:11434"
    options:
      log_level: "info"
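
For a quick client-side latency measurement, time a single generation request against the same endpoint ByteBuddy uses (results will vary with model size and hardware):

bash
# Rough end-to-end latency for one short completion
time curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3:8b","prompt":"Hello","stream":false}' > /dev/null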

Model Updates

Keep models up to date:

bash
# Update Ollama
ollama pull llama3:8b  # Pulls latest version

# Or create a maintenance schedule
# Weekly: ollama pull llama3:8b
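
A cron entry can take care of the weekly pull. The schedule and binary path below are examples; adjust them to your installation and to the models you actually run:

bash
# Example crontab entry: pull the latest llama3:8b every Sunday at 03:00
0 3 * * 0 /usr/local/bin/ollama pull llama3:8b >> /var/log/ollama-update.log 2>&1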

Backup and Recovery

Model Backups

Backup important models:

bash
# Copy a model under a backup tag
ollama cp llama3:8b backup-llama3:8b

# Save to external storage
# Copy ~/.ollama/models to backup location
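
One straightforward way to copy the on-disk model store to external storage is rsync; the destination path below is a placeholder for your backup volume:

bash
# Mirror the Ollama model store to a backup location
rsync -a ~/.ollama/models/ /mnt/backup/ollama-models/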

Configuration Backups

Backup configurations:

bash
# Backup ByteBuddy config
cp .bytebuddy/config.yaml ~/backups/bytebuddy-config-$(date +%Y%m%d).yaml

# Backup model configurations
cp -r ~/.ollama ~/backups/ollama-backup-$(date +%Y%m%d)

Troubleshooting

Common Issues

Model Loading Failures

bash
# Check available memory
free -h  # Linux
vm_stat  # macOS

# Check disk space
df -h

# Re-pull model
ollama rm llama3:8b
ollama pull llama3:8b

Performance Problems

bash
# Monitor resource usage
htop
iotop  # Disk I/O monitoring

Adjust the model parameters if resources are constrained:

yaml
models:
  - name: "optimized-model"
    provider: "ollama"
    model: "llama3:8b"
    options:
      num_thread: 6
      num_gpu: 1

Connection Issues

bash
# Check if service is running
ps aux | grep ollama

# Test connection
curl http://localhost:11434/api/tags

# Check that the port is listening and that no firewall rule blocks it
ss -ltnp | grep 11434

Debugging Commands

bash
# Enable debug logging
OLLAMA_DEBUG=1 ollama serve

# Check logs
journalctl -u ollama -f  # Linux
tail -f /usr/local/var/log/ollama.log  # macOS

# Test model directly
echo '{"model":"llama3:8b","prompt":"Hello"}' | curl -X POST -H "Content-Type: application/json" -d @- http://localhost:11434/api/generate

Best Practices

Model Management

  1. Version Control: Keep track of model versions
  2. Regular Updates: Update models periodically
  3. Performance Testing: Test models before deployment
  4. Resource Planning: Plan for adequate hardware resources

Security

  1. Network Isolation: Keep model servers isolated
  2. Access Logging: Log all model access
  3. Regular Audits: Audit model usage regularly
  4. Data Encryption: Encrypt data in transit and at rest

Cost Optimization

  1. Right-Sizing: Choose appropriate model sizes
  2. Usage Monitoring: Monitor model usage
  3. Scheduled Scaling: Scale resources based on demand
  4. Model Sharing: Share models across teams

Enterprise Deployment

High Availability

Deploy redundant model servers:

yaml
# Redundant model servers registered as separate entries; put a load balancer or failover logic in front as needed
models:
  - name: "ha-model-primary"
    provider: "ollama"
    model: "llama3:8b"
    baseURL: "http://model-server-1:11434"

  - name: "ha-model-secondary"
    provider: "ollama"
    model: "llama3:8b"
    baseURL: "http://model-server-2:11434"
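
A small health-check loop over both endpoints can feed an external load balancer or a failover script; the hostnames are the same placeholders used in the configuration above:

bash
# Probe each model server and report its status
for host in model-server-1 model-server-2; do
  if curl -fsS "http://${host}:11434/api/tags" > /dev/null; then
    echo "${host}: OK"
  else
    echo "${host}: DOWN"
  fi
done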

Disaster Recovery

Plan for disaster recovery:

  • Regular backups of models and configuration
  • Automated failover to a standby server
  • Cross-region replication

Next Steps

After setting up self-hosted models, explore these related guides: