# LLaMA.cpp

LLaMA.cpp is an open-source inference engine, originally built around the LLaMA models, that runs large language models efficiently on a wide range of hardware.
## Supported Models

### LLaMA Series
- LLaMA 2 (7B, 13B, 70B)
- LLaMA 3 (8B, 70B)
### Other Models
- Mistral (7B, Mixtral 8x7B)
- Qwen (7B, 14B, 72B)
- Code Llama
- Other GGUF format models (a quick file check is shown below)
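
All of the models above are loaded from GGUF files. As a quick sanity check, a GGUF file begins with the ASCII magic bytes `GGUF`; the snippet below simply prints those bytes (the filename is the example downloaded later in this guide).

```bash
# A valid GGUF model prints "GGUF"
head -c 4 llama-2-7b-chat.Q4_K_M.gguf && echo
```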
## Configuration

### Basic Configuration

Configure in `config.yaml` or `~/.bytebuddy/config.yaml`:

```yaml
models:
  - name: "llamacpp-local"
    provider: "llamacpp"
    model: "llama-2-7b-chat"
    apiBase: "http://localhost:8080"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 2048
```

### High-Performance Configuration
```yaml
models:
  - name: "llamacpp-performance"
    provider: "llamacpp"
    model: "mixtral-8x7b-instruct"
    apiBase: "http://localhost:8080"
    roles: ["chat", "edit"]
    defaultCompletionOptions:
      temperature: 0.5
      maxTokens: 4096
      topP: 0.9
```

### Multi-Model Configuration
```yaml
models:
  - name: "llamacpp-chat"
    provider: "llamacpp"
    model: "llama-2-7b-chat"
    apiBase: "http://localhost:8080"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 2048
  - name: "llamacpp-code"
    provider: "llamacpp"
    model: "codellama-7b-instruct"
    apiBase: "http://localhost:8080"
    roles: ["edit", "apply"]
    defaultCompletionOptions:
      temperature: 0.2
      maxTokens: 1024
```

## Configuration Fields
### Required Fields

- `name`: Unique identifier for the model configuration
- `provider`: Set to `"llamacpp"`
- `apiBase`: Address of the LLaMA.cpp server

### Optional Fields

- `model`: Model name
- `roles`: Roles the model serves: `chat`, `edit`, `apply`, `autocomplete`
- `defaultCompletionOptions` (see the example request below):
  - `temperature`: Controls randomness (0-2)
  - `maxTokens`: Maximum number of tokens to generate
  - `topP`: Nucleus sampling parameter
  - `topK`: Number of sampling candidates
  - `repeatPenalty`: Repetition penalty (1.0-2.0)
  - `seed`: Random seed
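
For reference, these completion options correspond roughly to the sampling parameters of the llama.cpp server's native `/completion` endpoint (`maxTokens` maps to `n_predict`). The request below is a sketch against a server running at the `apiBase` shown above; exact field support can vary between server versions.

```bash
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a haiku about the sea.",
    "n_predict": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "repeat_penalty": 1.1,
    "seed": 42
  }'
```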
## Installation and Setup

### 1. Download and Compile
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```
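
Newer llama.cpp releases also support a CMake build; if plain `make` does not work for your version, the following is a reasonable sketch. Note that, depending on the release, the server binary is produced as `server` or `llama-server` (under `build/bin` for CMake builds).

```bash
# CMake build; binaries are placed under build/bin
cmake -B build
cmake --build build --config Release -j
```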
### 2. Download Models

Download GGUF format models from HuggingFace:

```bash
# Example: Download Llama 2 7B Chat
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
```
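
If the `huggingface_hub` Python package is installed, the same file can also be fetched with the Hugging Face CLI; this is an optional alternative to `wget`, not a required step.

```bash
# Fetch the same GGUF file via the Hugging Face CLI (pip install huggingface_hub)
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf --local-dir .
```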
### 3. Start Server

```bash
./server -m llama-2-7b-chat.Q4_K_M.gguf --host 0.0.0.0 --port 8080 -c 2048
```
### 4. Verify Connection

```bash
curl http://localhost:8080/health
```
## Use Case Configurations

### Local Development
```yaml
models:
  - name: "local-dev"
    provider: "llamacpp"
    model: "llama-2-7b-chat"
    apiBase: "http://localhost:8080"
    roles: ["chat", "edit"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 2048
```

### Code Generation
```yaml
models:
  - name: "code-gen"
    provider: "llamacpp"
    model: "codellama-7b-instruct"
    apiBase: "http://localhost:8080"
    roles: ["edit", "apply"]
    defaultCompletionOptions:
      temperature: 0.2
      maxTokens: 1024
      topK: 40
```

### Remote Server
```yaml
models:
  - name: "remote-server"
    provider: "llamacpp"
    model: "mixtral-8x7b-instruct"
    apiBase: "http://192.168.1.100:8080"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 4096
```
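
When the server runs on another machine, it does not have to be exposed directly on the network. One common pattern (a sketch; `user` is a placeholder and `192.168.1.100` is the example host above) is to forward the port over SSH and keep `apiBase` pointed at `http://localhost:8080`:

```bash
# Forward local port 8080 to port 8080 on the remote machine
ssh -N -L 8080:localhost:8080 user@192.168.1.100
```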
## Server Parameters

### Common Startup Parameters
```bash
# -m               model file (GGUF)
# --host / --port  listen address and port
# -c               context window size in tokens (short form of --ctx-size)
# --n-gpu-layers   number of layers to offload to the GPU (GPU acceleration)
# --batch-size     batch size
./server \
  -m model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 2048 \
  --n-gpu-layers 99 \
  --batch-size 512
```

### Performance Optimization Parameters
```bash
# --threads sets the CPU thread count
./server \
  -m model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 4096 \
  --n-gpu-layers 99 \
  --threads 8 \
  --batch-size 1024 \
  --temperature 0.7 \
  --repeat-penalty 1.1
```
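
To see whether parameter changes actually help, llama.cpp ships a benchmarking tool, `llama-bench`, which reports prompt-processing and generation throughput. A minimal sketch (the binary's location depends on how you built; with CMake it is typically `build/bin/llama-bench`):

```bash
# Benchmark a model with 8 threads and full GPU offload
./llama-bench -m model.gguf -t 8 -ngl 99
```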
## Hardware Requirements

### Minimum Requirements
- RAM: 8GB
- Storage: 10GB
- CPU: 4 cores
### Recommended Configuration
- RAM: 16GB+
- GPU: VRAM 8GB+ (for GPU acceleration)
- CPU: 8+ cores
- Storage: SSD
### High-Performance Configuration
- RAM: 32GB+
- GPU: VRAM 16GB+
- CPU: 16+ cores
- Storage: High-speed SSD
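
To check which tier your machine falls into, the following Linux commands are usually enough (the GPU query assumes an NVIDIA card with `nvidia-smi` installed):

```bash
# RAM and CPU core count
free -h
nproc
# GPU VRAM (NVIDIA only)
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
```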
## Model Quantization

The GGUF format provides multiple quantization levels (a conversion sketch follows the list):
- Q2_K: Smallest file, lower quality
- Q4_K_M: Recommended balanced choice
- Q5_K_M: Better quality, slightly larger file
- Q8_0: Best quality, largest file
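
Pre-quantized files for most popular models can be downloaded directly, but llama.cpp also includes a quantization tool for converting a higher-precision GGUF file yourself. A sketch, assuming you already have an f16 GGUF export (the input filename is a placeholder; the binary is named `quantize` or `llama-quantize` depending on the version):

```bash
# Convert an f16 GGUF file to the recommended Q4_K_M quantization
./quantize llama-2-7b-chat.f16.gguf llama-2-7b-chat.Q4_K_M.gguf Q4_K_M
```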
## Troubleshooting

### Common Issues

#### Connection Refused
- Ensure server is running
- Check port and address
- Verify firewall settings
#### Out of Memory
- Use smaller models
- Enable GPU acceleration
- Increase virtual memory
#### Slow Response
- Enable GPU layers
- Increase thread count
- Optimize batch size
### Debugging Steps
- Check server logs
- Monitor system resources
- Test different models
- Adjust server parameters
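
The checks above translate into a handful of shell commands on a typical Linux host (the port and endpoint follow the examples in this guide; adjust to your setup):

```bash
# Is the server process running, and is it listening on port 8080?
pgrep -af server
ss -ltn | grep 8080          # or: lsof -i :8080
# Does it answer health checks?
curl -v http://localhost:8080/health
# Watch CPU/RAM usage, and VRAM on NVIDIA GPUs, while a request runs
top
nvidia-smi -l 1
```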
## Best Practices

1. Model Selection
   - Choose an appropriate model size for your hardware
   - Use an appropriate quantization level
   - Consider the intended purpose when selecting the model type
2. Performance Optimization
   - Enable GPU acceleration
   - Adjust the batch size
   - Optimize thread settings
3. Resource Management
   - Monitor memory usage
   - Periodically clear caches
   - Set a reasonable context size
4. Security
   - Local deployment keeps data private
   - Limit network access to localhost only (see the sketch below)
   - Regularly update LLaMA.cpp
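
For the network-access point above: the setup examples bind to `0.0.0.0`, which exposes the server to the whole network. If only local tools need to reach it, binding to the loopback interface is the simplest restriction:

```bash
# Reachable only from this machine
./server -m llama-2-7b-chat.Q4_K_M.gguf --host 127.0.0.1 --port 8080 -c 2048
```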