LLaMA.cpp

LLaMA.cpp is an open-source inference engine that runs LLaMA and other GGUF-format large language models efficiently on a wide range of hardware, from CPU-only machines to GPU-accelerated servers.

Supported Models

LLaMA Series

  • LLaMA 2 (7B, 13B, 70B)
  • LLaMA 3 (8B, 70B)

Other Models

  • Mistral (7B, Mixtral 8x7B)
  • Qwen (7B, 14B, 72B)
  • Code Llama
  • Other GGUF format models

Configuration

Basic Configuration

Configure in config.yaml or ~/.bytebuddy/config.yaml:

yaml
models:
  - name: "llamacpp-local"
    provider: "llamacpp"
    model: "llama-2-7b-chat"
    apiBase: "http://localhost:8080"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 2048

High-Performance Configuration

yaml
models:
  - name: "llamacpp-performance"
    provider: "llamacpp"
    model: "mixtral-8x7b-instruct"
    apiBase: "http://localhost:8080"
    roles: ["chat", "edit"]
    defaultCompletionOptions:
      temperature: 0.5
      maxTokens: 4096
      topP: 0.9

Multi-Model Configuration

yaml
models:
  - name: "llamacpp-chat"
    provider: "llamacpp"
    model: "llama-2-7b-chat"
    apiBase: "http://localhost:8080"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 2048

  - name: "llamacpp-code"
    provider: "llamacpp"
    model: "codellama-7b-instruct"
    apiBase: "http://localhost:8080"
    roles: ["edit", "apply"]
    defaultCompletionOptions:
      temperature: 0.2
      maxTokens: 1024

Configuration Fields

Required Fields

  • name: Unique identifier for the model configuration
  • provider: Set to "llamacpp"
  • apiBase: LLaMA.cpp server address

Optional Fields

  • model: Name of the model to use
  • roles: Model roles [chat, edit, apply, autocomplete]
  • defaultCompletionOptions (all options combined in the example after this list):
    • temperature: Controls randomness (0-2)
    • maxTokens: Maximum number of tokens to generate
    • topP: Nucleus sampling parameter
    • topK: Number of highest-probability tokens considered during sampling
    • repeatPenalty: Penalty applied to repeated tokens (1.0-2.0)
    • seed: Random seed for reproducible output
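
A minimal sketch combining every documented option above; the model name "llamacpp-full-options" and the specific values are illustrative only:

yaml
models:
  - name: "llamacpp-full-options"
    provider: "llamacpp"
    model: "llama-2-7b-chat"
    apiBase: "http://localhost:8080"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7     # randomness (0-2)
      maxTokens: 2048      # maximum tokens to generate
      topP: 0.9            # nucleus sampling
      topK: 40             # top-k sampling
      repeatPenalty: 1.1   # penalty for repeated tokens
      seed: 42             # fixed seed for reproducible output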

Installation and Setup

1. Download and Compile

bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
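
For GPU offloading (the --n-gpu-layers flag used later in this guide), the build usually needs a GPU backend enabled. The exact option has changed across llama.cpp versions, so the lines below are a hedged sketch for CUDA builds rather than a definitive recipe; check the llama.cpp README for your version.

bash
# Older Makefile builds
make LLAMA_CUBLAS=1

# Newer CMake builds
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release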

2. Download Models

Download GGUF-format models from Hugging Face:

bash
# Example: Download Llama 2 7B Chat
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf

3. Start Server

bash
./server -m llama-2-7b-chat.Q4_K_M.gguf --host 0.0.0.0 --port 8080 -c 2048

4. Verify Connection

bash
curl http://localhost:8080/health
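
Beyond the health check, you can send a small test prompt. The request below targets the server's native /completion endpoint; available endpoints and field names can vary slightly between llama.cpp versions, so treat it as a sketch.

bash
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, my name is", "n_predict": 32}'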

Use Case Configurations

Local Development

yaml
models:
  - name: "local-dev"
    provider: "llamacpp"
    model: "llama-2-7b-chat"
    apiBase: "http://localhost:8080"
    roles: ["chat", "edit"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 2048

Code Generation

yaml
models:
  - name: "code-gen"
    provider: "llamacpp"
    model: "codellama-7b-instruct"
    apiBase: "http://localhost:8080"
    roles: ["edit", "apply"]
    defaultCompletionOptions:
      temperature: 0.2
      maxTokens: 1024
      topK: 40

Remote Server

yaml
models:
  - name: "remote-server"
    provider: "llamacpp"
    model: "mixtral-8x7b-instruct"
    apiBase: "http://192.168.1.100:8080"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 4096

Server Parameters

Common Startup Parameters

bash
# -m              model file
# --host/--port   listen address and port
# -c              context window size (shorthand for --ctx-size)
# --n-gpu-layers  number of layers to offload to the GPU (GPU acceleration)
# --batch-size    batch size for prompt processing
./server \
  -m model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 2048 \
  --n-gpu-layers 99 \
  --batch-size 512

Performance Optimization Parameters

bash
# --threads         CPU thread count
# --temp            default sampling temperature
# --repeat-penalty  default repeat penalty
./server \
  -m model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 4096 \
  --n-gpu-layers 99 \
  --threads 8 \
  --batch-size 1024 \
  --temp 0.7 \
  --repeat-penalty 1.1

Hardware Requirements

Minimum Requirements

  • RAM: 8GB
  • Storage: 10GB
  • CPU: 4 cores

Recommended Configuration

  • RAM: 16GB+
  • GPU: VRAM 8GB+ (for GPU acceleration)
  • CPU: 8+ cores
  • Storage: SSD

High-Performance Configuration

  • RAM: 32GB+
  • GPU: VRAM 16GB+
  • CPU: 16+ cores
  • High-speed SSD

Model Quantization

GGUF format provides multiple quantization levels:

  • Q2_K: Smallest file, lower quality
  • Q4_K_M: Recommended balanced choice
  • Q5_K_M: Better quality, slightly larger file
  • Q8_0: Best quality, largest file
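
If you only have a higher-precision GGUF file, llama.cpp includes a quantization tool to convert it to a smaller level. The binary is named quantize in older builds and llama-quantize in newer ones, and the input filename below is illustrative, so treat this as a sketch rather than an exact command:

bash
# Convert an F16 GGUF model to Q4_K_M
./quantize llama-2-7b-chat.f16.gguf llama-2-7b-chat.Q4_K_M.gguf Q4_K_M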

Troubleshooting

Common Issues

  1. Connection Refused

    • Ensure server is running
    • Check port and address
    • Verify firewall settings
  2. Out of Memory

    • Use smaller models
    • Enable GPU acceleration
    • Increase virtual memory
  3. Slow Response

    • Enable GPU layers
    • Increase thread count
    • Optimize batch size

Debugging Steps

  1. Check server logs
  2. Monitor system resources
  3. Test different models
  4. Adjust server parameters
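
For steps 1 and 2, standard system tools are usually enough. The commands below are generic examples (nvidia-smi applies only to NVIDIA GPUs):

bash
# Confirm the server still responds
curl http://localhost:8080/health

# Watch CPU and memory usage
top

# Watch GPU memory and utilization, refreshing every second
nvidia-smi -l 1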

Best Practices

1. Model Selection

  • Choose a model size appropriate to your hardware
  • Use an appropriate quantization level (Q4_K_M is a good default)
  • Match the model type to the task (e.g., Code Llama for code editing)

2. Performance Optimization

  • Enable GPU acceleration
  • Adjust batch size
  • Optimize thread settings

3. Resource Management

  • Monitor memory usage
  • Periodically clear cache
  • Set reasonable context size

4. Security

  • Local deployment keeps prompts and completions on your own hardware
  • Limit network access: bind the server to localhost when remote access is not needed (see the sketch below)
  • Regularly update LLaMA.cpp
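
A minimal local-only startup, using the same flags as earlier but binding to 127.0.0.1 so the server is unreachable from other machines:

bash
./server -m llama-2-7b-chat.Q4_K_M.gguf --host 127.0.0.1 --port 8080 -c 2048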