HuggingFace Inference API

The HuggingFace Inference API provides instant access to thousands of open-source models, covering a wide range of tasks and model types.

Supported Model Types

Chat Models

  • meta-llama/Llama-2-70b-chat-hf - Llama 2 chat model
  • mistralai/Mixtral-8x7B-Instruct-v0.1 - Mixtral instruction model
  • microsoft/DialoGPT-large - Dialogue GPT

Code Models

  • bigcode/starcoder - StarCoder code model
  • Salesforce/codegen-16B-multi - CodeGen multilingual
  • microsoft/CodeGPT-small-py - Python Code GPT

Text Generation Models

  • bigscience/bloom - BLOOM large model
  • EleutherAI/gpt-neox-20b - GPT-NeoX
  • facebook/opt-30b - OPT model

Multilingual Models

  • google/flan-t5-xxl - FLAN-T5
  • bert-base-multilingual-cased - Multilingual BERT

Configuration

Basic Configuration

Configure in config.yaml or ~/.bytebuddy/config.yaml:

```yaml
models:
  - name: "hf-llama"
    provider: "huggingfaceinferenceapi"
    model: "meta-llama/Llama-2-70b-chat-hf"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 4096
```

Code Generation Configuration

```yaml
models:
  - name: "hf-starcoder"
    provider: "huggingfaceinferenceapi"
    model: "bigcode/starcoder"
    apiKey: "${HF_API_KEY}"
    roles: ["edit", "apply"]
    defaultCompletionOptions:
      temperature: 0.2
      maxTokens: 2048
```

Multi-Model Configuration

```yaml
models:
  - name: "hf-chat"
    provider: "huggingfaceinferenceapi"
    model: "meta-llama/Llama-2-70b-chat-hf"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 4096

  - name: "hf-code"
    provider: "huggingfaceinferenceapi"
    model: "bigcode/starcoder"
    apiKey: "${HF_API_KEY}"
    roles: ["edit"]
    defaultCompletionOptions:
      temperature: 0.2
      maxTokens: 2048

  - name: "hf-multilingual"
    provider: "huggingfaceinferenceapi"
    model: "google/flan-t5-xxl"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.5
      maxTokens: 1024
```

Configuration Fields

Required Fields

  • name: Unique identifier for the model configuration
  • provider: Set to "huggingfaceinferenceapi"
  • model: Model identifier (HuggingFace model path)
  • apiKey: HuggingFace API key

Optional Fields

  • apiBase: API endpoint (default: https://api-inference.huggingface.co)
  • roles: Model roles [chat, edit, apply, autocomplete]
  • defaultCompletionOptions:
  • temperature: Controls randomness (0-2; lower values are more deterministic)
    • maxTokens: Maximum number of tokens to generate
    • topP: Nucleus sampling threshold (0-1)
    • topK: Number of highest-probability tokens considered when sampling
    • repetitionPenalty: Penalty applied to repeated tokens (values above 1 discourage repetition)

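The camelCase option names above can be mapped onto the snake_case generation parameters that the HuggingFace text-generation API expects. The sketch below is illustrative only (the provider's actual client code may translate options differently):

```python
def build_hf_payload(prompt, options):
    """Map config-style completion options to HuggingFace
    text-generation request parameters (illustrative mapping)."""
    name_map = {
        "temperature": "temperature",
        "maxTokens": "max_new_tokens",
        "topP": "top_p",
        "topK": "top_k",
        "repetitionPenalty": "repetition_penalty",
    }
    # Keep only recognized options and rename them for the API.
    parameters = {name_map[k]: v for k, v in options.items() if k in name_map}
    return {"inputs": prompt, "parameters": parameters}

payload = build_hf_payload("Hello", {"temperature": 0.7, "maxTokens": 4096})
```

Unknown keys are silently dropped here; a stricter implementation might raise on unrecognized option names instead.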
Environment Variables

```bash
# ~/.bashrc or ~/.zshrc
export HF_API_KEY="your-huggingface-api-key"
```

Getting API Key

  1. Visit huggingface.co
  2. Register for an account and log in
  3. Navigate to Settings > Access Tokens
  4. Create a new access token
  5. Save the token to the HF_API_KEY environment variable
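Before making requests, it is worth sanity-checking the token at startup. A minimal sketch, assuming the key lives in `HF_API_KEY` (note that modern HuggingFace tokens conventionally start with `hf_`, but that prefix is a convention, not a guarantee):

```python
import os

def get_hf_api_key():
    """Read the HF API key from the environment with a basic sanity check."""
    key = os.environ.get("HF_API_KEY", "")
    if not key:
        raise RuntimeError("HF_API_KEY is not set")
    if not key.startswith("hf_"):
        # Warn rather than fail: older or custom tokens may differ.
        print("warning: HF_API_KEY does not look like a HuggingFace token")
    return key
```

Failing fast on a missing key produces a clearer error than a 401 response deep inside a request.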

Use Case Configurations

General Chat

```yaml
models:
  - name: "general-chat"
    provider: "huggingfaceinferenceapi"
    model: "mistralai/Mixtral-8x7B-Instruct-v0.1"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 4096
```

Code Assistant

```yaml
models:
  - name: "code-assistant"
    provider: "huggingfaceinferenceapi"
    model: "bigcode/starcoder"
    apiKey: "${HF_API_KEY}"
    roles: ["edit", "apply"]
    defaultCompletionOptions:
      temperature: 0.2
      maxTokens: 2048
```

Multilingual Translation

```yaml
models:
  - name: "translator"
    provider: "huggingfaceinferenceapi"
    model: "google/flan-t5-xxl"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.3
      maxTokens: 1024
```

Advanced Configuration

Custom Endpoint

```yaml
models:
  - name: "custom-endpoint"
    provider: "huggingfaceinferenceapi"
    model: "meta-llama/Llama-2-70b-chat-hf"
    apiBase: "https://api-inference.huggingface.co"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
```

Dedicated Inference Endpoint

```yaml
models:
  - name: "dedicated-endpoint"
    provider: "huggingfaceinferenceapi"
    model: "meta-llama/Llama-2-70b-chat-hf"
    apiBase: "https://your-endpoint.endpoints.huggingface.cloud"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.5
      maxTokens: 4096
```
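The two endpoint styles route differently: the serverless API addresses a model by path under `/models/`, while a dedicated endpoint URL already points at a single deployed model. A small sketch of how a client might derive the request URL from `apiBase` and `model` (assumed routing, matching the public serverless API convention):

```python
def inference_url(api_base, model):
    """Derive the request URL for serverless vs dedicated endpoints."""
    if "api-inference.huggingface.co" in api_base:
        # Serverless Inference API: model id is part of the path.
        return f"{api_base.rstrip('/')}/models/{model}"
    # Dedicated endpoint: the base URL already targets one model.
    return api_base.rstrip("/")
```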

Model Discovery

Browse the HuggingFace Hub (huggingface.co/models) to find available models. Filter by task, library, and license to narrow the results.

Troubleshooting

Common Errors

  1. 401 Unauthorized: API key is missing, invalid, or expired
  2. 503 Service Unavailable: the model is still loading; wait and retry
  3. 429 Too Many Requests: rate limit reached; back off before retrying
  4. Model Not Found: confirm the model path is spelled correctly and your token has access to it
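The 503 (model loading) and 429 (rate limit) cases above are transient and worth retrying with exponential backoff. A minimal sketch with an injectable `send_request` function, so the retry logic is shown without any network calls (illustrative only, not the provider's actual client code):

```python
import time

def call_with_retry(send_request, max_retries=5, base_delay=1.0):
    """Retry transient HF Inference API failures with exponential backoff.

    `send_request` is a caller-supplied function returning (status, body).
    503 = model loading, 429 = rate limited: both are retried; any other
    non-200 status fails immediately.
    """
    for attempt in range(max_retries):
        status, body = send_request()
        if status == 200:
            return body
        if status in (429, 503):
            time.sleep(base_delay * (2 ** attempt))
            continue
        raise RuntimeError(f"HF API error {status}: {body}")
    raise RuntimeError("giving up after repeated 429/503 responses")
```

In production you would also honor a `Retry-After` header when present, rather than relying on backoff alone.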

Debugging Steps

  1. Verify API key format and validity
  2. Check model identifier is correct
  3. Confirm model is available on HuggingFace Hub
  4. View HuggingFace status page
  5. Check rate limits and quotas

Usage Limits

  • Free Tier: 30,000 calls per month
  • Paid Tier: Higher call limits
  • Concurrency Limit: Limited number of simultaneous requests
  • Model Loading: First request may require waiting for model loading

Best Practices

1. Model Selection

  • Choose appropriate model based on task requirements
  • Consider model size and response time
  • Prioritize verified popular models
  • Review model card for usage restrictions

2. Performance Optimization

  • Speed Priority: Choose smaller models
  • Quality Priority: Choose larger models
  • Production Environment: Consider dedicated inference endpoints
  • Implement request caching mechanism
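The caching suggestion above can be as simple as an in-memory map keyed by model, prompt, and options. A tiny sketch (illustrative only; a production cache would add TTLs, size limits, and persistence):

```python
import hashlib
import json

class CompletionCache:
    """In-memory cache for completion results, keyed by request content."""

    def __init__(self):
        self._store = {}

    def _key(self, model, prompt, options):
        # Stable key: serialize the request deterministically, then hash.
        raw = json.dumps([model, prompt, options], sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def get_or_compute(self, model, prompt, options, compute):
        key = self._key(model, prompt, options)
        if key not in self._store:
            self._store[key] = compute()  # only call the API on a miss
        return self._store[key]
```

Identical repeated requests then cost one API call instead of many, which matters under the free tier's monthly quota.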

3. Cost Management

  • Monitor API usage
  • Choose appropriate model size
  • Consider inference endpoints (billed hourly)
  • Set quota alerts

4. Error Handling

  • Implement retry mechanism
  • Handle model loading wait times
  • Gracefully handle API limits
  • Log errors

Inference Endpoints

For production environments, dedicated inference endpoints are recommended:

Advantages

  • Guaranteed availability
  • Lower latency
  • No cold starts
  • Higher throughput

Configuration Example

```yaml
models:
  - name: "production-endpoint"
    provider: "huggingfaceinferenceapi"
    model: "meta-llama/Llama-2-70b-chat-hf"
    apiBase: "https://your-endpoint.endpoints.huggingface.cloud"
    apiKey: "${HF_ENDPOINT_TOKEN}"
    roles: ["chat", "edit"]
    defaultCompletionOptions:
      temperature: 0.5
      maxTokens: 4096
```