# HuggingFace Inference API

The HuggingFace Inference API provides instant access to thousands of open-source models, covering a wide range of tasks and model types.
## Supported Model Types
### Chat Models
- meta-llama/Llama-2-70b-chat-hf - Llama 2 chat model
- mistralai/Mixtral-8x7B-Instruct-v0.1 - Mixtral instruction model
- microsoft/DialoGPT-large - DialoGPT conversational model
### Code Models
- bigcode/starcoder - StarCoder code model
- Salesforce/codegen-16B-multi - CodeGen multilingual
- microsoft/CodeGPT-small-py - Python Code GPT
### Text Generation Models
- bigscience/bloom - BLOOM large model
- EleutherAI/gpt-neox-20b - GPT-NeoX
- facebook/opt-30b - OPT model
### Multilingual Models
- google/flan-t5-xxl - FLAN-T5
- bert-base-multilingual-cased - Multilingual BERT
## Configuration

### Basic Configuration

Configure models in `config.yaml` or `~/.bytebuddy/config.yaml`:
```yaml
models:
  - name: "hf-llama"
    provider: "huggingfaceinferenceapi"
    model: "meta-llama/Llama-2-70b-chat-hf"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 4096
```

### Code Generation Configuration
```yaml
models:
  - name: "hf-starcoder"
    provider: "huggingfaceinferenceapi"
    model: "bigcode/starcoder"
    apiKey: "${HF_API_KEY}"
    roles: ["edit", "apply"]
    defaultCompletionOptions:
      temperature: 0.2
      maxTokens: 2048
```

### Multi-Model Configuration
```yaml
models:
  - name: "hf-chat"
    provider: "huggingfaceinferenceapi"
    model: "meta-llama/Llama-2-70b-chat-hf"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 4096
  - name: "hf-code"
    provider: "huggingfaceinferenceapi"
    model: "bigcode/starcoder"
    apiKey: "${HF_API_KEY}"
    roles: ["edit"]
    defaultCompletionOptions:
      temperature: 0.2
      maxTokens: 2048
  - name: "hf-multilingual"
    provider: "huggingfaceinferenceapi"
    model: "google/flan-t5-xxl"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.5
      maxTokens: 1024
```

## Configuration Fields
### Required Fields
- name: Unique identifier for the model configuration
- provider: Set to "huggingfaceinferenceapi"
- model: Model identifier (HuggingFace model path)
- apiKey: HuggingFace API key
### Optional Fields
- apiBase: API endpoint (default: https://api-inference.huggingface.co)
- roles: Model roles [chat, edit, apply, autocomplete]
- defaultCompletionOptions:
  - temperature: Controls randomness (0-2)
  - maxTokens: Maximum number of tokens to generate
  - topP: Nucleus sampling parameter
  - topK: Number of sampling candidates
  - repetitionPenalty: Penalty applied to repeated tokens
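These options line up with the sampling parameters of the Inference API's text-generation task. As a rough illustration of the request a configured model produces (a minimal Python sketch using `requests`; the parameter values are examples taken from the configs above, and the option-to-parameter mapping is an assumption, not taken from this tool's source):

```python
import os
import requests

API_BASE = "https://api-inference.huggingface.co"
MODEL = "meta-llama/Llama-2-70b-chat-hf"  # example model from the configs above

def generate(prompt: str) -> str:
    # Parameter names follow the Inference API's text-generation task;
    # comments show the corresponding defaultCompletionOptions field.
    response = requests.post(
        f"{API_BASE}/models/{MODEL}",
        headers={"Authorization": f"Bearer {os.environ['HF_API_KEY']}"},
        json={
            "inputs": prompt,
            "parameters": {
                "temperature": 0.7,         # temperature
                "max_new_tokens": 512,      # maxTokens
                "top_p": 0.9,               # topP
                "top_k": 50,                # topK
                "repetition_penalty": 1.1,  # repetitionPenalty
            },
        },
        timeout=120,
    )
    response.raise_for_status()
    # Text-generation responses arrive as [{"generated_text": "..."}]
    return response.json()[0]["generated_text"]
```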
## Environment Variables

```bash
# ~/.bashrc or ~/.zshrc
export HF_API_KEY="your-huggingface-api-key"
```

### Getting an API Key
1. Visit https://huggingface.co
2. Register and log in to your account
3. Navigate to Settings > Access Tokens
4. Create a new access token
5. Save the token to an environment variable
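Before wiring the token into the config, you can sanity-check it against the Hub's `whoami-v2` endpoint (a quick sketch; a valid token returns your account details, an invalid one returns 401 Unauthorized):

```python
import os
import requests

# A valid token returns account details; an invalid one returns 401.
resp = requests.get(
    "https://huggingface.co/api/whoami-v2",
    headers={"Authorization": f"Bearer {os.environ['HF_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()
print("Token belongs to:", resp.json()["name"])
```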
## Use Case Configurations

### General Chat
```yaml
models:
  - name: "general-chat"
    provider: "huggingfaceinferenceapi"
    model: "mistralai/Mixtral-8x7B-Instruct-v0.1"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 4096
```

### Code Assistant
```yaml
models:
  - name: "code-assistant"
    provider: "huggingfaceinferenceapi"
    model: "bigcode/starcoder"
    apiKey: "${HF_API_KEY}"
    roles: ["edit", "apply"]
    defaultCompletionOptions:
      temperature: 0.2
      maxTokens: 2048
```

### Multilingual Translation
```yaml
models:
  - name: "translator"
    provider: "huggingfaceinferenceapi"
    model: "google/flan-t5-xxl"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.3
      maxTokens: 1024
```

## Advanced Configuration
### Custom Endpoint
```yaml
models:
  - name: "custom-endpoint"
    provider: "huggingfaceinferenceapi"
    model: "meta-llama/Llama-2-70b-chat-hf"
    apiBase: "https://api-inference.huggingface.co"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
```

### Dedicated Inference Endpoint
```yaml
models:
  - name: "dedicated-endpoint"
    provider: "huggingfaceinferenceapi"
    model: "meta-llama/Llama-2-70b-chat-hf"
    apiBase: "https://your-endpoint.endpoints.huggingface.cloud"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.5
      maxTokens: 4096
```

## Model Discovery
Browse the HuggingFace Hub (https://huggingface.co/models) to find available models.
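The Hub can also be queried programmatically with the `huggingface_hub` client library, for example to shortlist candidates for a role (a small sketch; the filter and sort values are illustrative):

```python
from huggingface_hub import list_models

# Ten most-downloaded text-generation models on the Hub.
for model in list_models(filter="text-generation", sort="downloads", limit=10):
    print(model.id)
```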
## Troubleshooting

### Common Errors
- 401 Unauthorized: Check that the API key is correct
- 503 Service Unavailable: The model is loading; wait and retry (see the sketch after this list)
- 429 Too Many Requests: Rate limit reached; back off before retrying
- Model Not Found: Confirm the model path is correct
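For the 503 and 429 cases, a serverless 503 response typically carries an `estimated_time` field indicating how long the model needs to load. A simple retry wrapper handles both (a sketch following the `requests` pattern used earlier; the retry count and backoff values are arbitrary):

```python
import time
import requests

def post_with_retry(url: str, headers: dict, payload: dict,
                    max_retries: int = 5) -> dict:
    """Retry on 503 (model loading) and 429 (rate limited)."""
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload, timeout=120)
        if resp.status_code == 503:
            # Serverless 503 bodies usually report an estimated load time.
            time.sleep(resp.json().get("estimated_time", 20))
        elif resp.status_code == 429:
            time.sleep(2 ** attempt)  # exponential backoff
        else:
            resp.raise_for_status()  # surface 401, 404, etc. immediately
            return resp.json()
    raise RuntimeError(f"Giving up after {max_retries} attempts")
```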
### Debugging Steps
1. Verify the API key format and validity
2. Check that the model identifier is correct
3. Confirm the model is available on the HuggingFace Hub
4. Check the HuggingFace status page (https://status.huggingface.co)
5. Review rate limits and quotas
## Usage Limits
- Free Tier: 30,000 calls per month
- Paid Tier: Higher call limits
- Concurrency: A limited number of simultaneous requests
- Model Loading: The first request to a cold model may wait while the model loads
## Best Practices
1. Model Selection
   - Choose a model suited to the task requirements
   - Consider model size and response time
   - Prefer well-established, popular models
   - Review the model card for usage restrictions
2. Performance Optimization
   - Speed priority: choose smaller models
   - Quality priority: choose larger models
   - Production environments: consider dedicated inference endpoints
   - Implement request caching (see the sketch after this list)
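At low temperature, identical requests return identical (or near-identical) completions, so an in-memory cache keyed by the request payload can eliminate repeat calls (a minimal sketch; `call_api` stands in for whatever request helper you use):

```python
import hashlib
import json

_cache: dict[str, object] = {}

def cached(payload: dict, call_api):
    """Return a cached response for a previously seen payload;
    call_api is any callable that performs the actual request."""
    key = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = call_api(payload)
    return _cache[key]
```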
3. Cost Management
   - Monitor API usage
   - Choose an appropriate model size
   - Consider inference endpoints (billed hourly)
   - Set quota alerts
4. Error Handling
   - Implement a retry mechanism (see the Troubleshooting sketch above)
   - Handle model loading wait times
   - Gracefully handle API limits
   - Log errors
## Inference Endpoints

For production environments, dedicated Inference Endpoints are recommended.
### Advantages
- Guaranteed availability
- Lower latency
- No cold starts
- Higher throughput
### Configuration Example
```yaml
models:
  - name: "production-endpoint"
    provider: "huggingfaceinferenceapi"
    model: "meta-llama/Llama-2-70b-chat-hf"
    apiBase: "https://your-endpoint.endpoints.huggingface.cloud"
    apiKey: "${HF_ENDPOINT_TOKEN}"
    roles: ["chat", "edit"]
    defaultCompletionOptions:
      temperature: 0.5
      maxTokens: 4096
```