HuggingFace Inference API

The HuggingFace Inference API provides instant access to thousands of open-source models, covering a wide range of tasks and model types.

Supported Model Types

Chat Models

  • meta-llama/Llama-2-70b-chat-hf - Llama 2 chat model
  • mistralai/Mixtral-8x7B-Instruct-v0.1 - Mixtral instruction model
  • microsoft/DialoGPT-large - Dialogue GPT

Code Models

  • bigcode/starcoder - StarCoder code model
  • Salesforce/codegen-16B-multi - CodeGen multilingual
  • microsoft/CodeGPT-small-py - Python Code GPT

Text Generation Models

  • bigscience/bloom - BLOOM large model
  • EleutherAI/gpt-neox-20b - GPT-NeoX
  • facebook/opt-30b - OPT model

Multilingual Models

  • google/flan-t5-xxl - FLAN-T5
  • bert-base-multilingual-cased - Multilingual BERT

Configuration

Basic Configuration

Configure in config.yaml or ~/.bytebuddy/config.yaml:

```yaml
models:
  - name: "hf-llama"
    provider: "huggingfaceinferenceapi"
    model: "meta-llama/Llama-2-70b-chat-hf"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 4096
```

Code Generation Configuration

```yaml
models:
  - name: "hf-starcoder"
    provider: "huggingfaceinferenceapi"
    model: "bigcode/starcoder"
    apiKey: "${HF_API_KEY}"
    roles: ["edit", "apply"]
    defaultCompletionOptions:
      temperature: 0.2
      maxTokens: 2048
```

Multi-Model Configuration

```yaml
models:
  - name: "hf-chat"
    provider: "huggingfaceinferenceapi"
    model: "meta-llama/Llama-2-70b-chat-hf"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 4096

  - name: "hf-code"
    provider: "huggingfaceinferenceapi"
    model: "bigcode/starcoder"
    apiKey: "${HF_API_KEY}"
    roles: ["edit"]
    defaultCompletionOptions:
      temperature: 0.2
      maxTokens: 2048

  - name: "hf-multilingual"
    provider: "huggingfaceinferenceapi"
    model: "google/flan-t5-xxl"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.5
      maxTokens: 1024
```

Configuration Fields

Required Fields

  • name: Unique identifier for the model configuration
  • provider: Set to "huggingfaceinferenceapi"
  • model: Model identifier (HuggingFace model path)
  • apiKey: HuggingFace API key

Optional Fields

  • apiBase: API endpoint (default: https://api-inference.huggingface.co)
  • roles: Model roles [chat, edit, apply, autocomplete]
  • defaultCompletionOptions:
  • temperature: Controls randomness (0-2; lower values are more deterministic)
    • maxTokens: Maximum number of tokens to generate
    • topP: Nucleus sampling threshold (0-1)
    • topK: Number of highest-probability tokens considered when sampling
    • repetitionPenalty: Penalty applied to repeated tokens (values above 1 discourage repetition)

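The camelCase option names above can be mapped onto the snake_case generation parameters that the HuggingFace text-generation API expects. The sketch below is illustrative only (the provider's actual client code may translate options differently):

```python
def build_hf_payload(prompt, options):
    """Map config-style completion options to HuggingFace
    text-generation request parameters (illustrative mapping)."""
    name_map = {
        "temperature": "temperature",
        "maxTokens": "max_new_tokens",
        "topP": "top_p",
        "topK": "top_k",
        "repetitionPenalty": "repetition_penalty",
    }
    # Keep only recognized options and rename them for the API.
    parameters = {name_map[k]: v for k, v in options.items() if k in name_map}
    return {"inputs": prompt, "parameters": parameters}

payload = build_hf_payload("Hello", {"temperature": 0.7, "maxTokens": 4096})
```

Unknown keys are silently dropped here; a stricter implementation might raise on unrecognized option names instead.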
Environment Variables

```bash
# ~/.bashrc or ~/.zshrc
export HF_API_KEY="your-huggingface-api-key"
```

Getting API Key

  1. Visit huggingface.co
  2. Register for an account and log in
  3. Navigate to Settings > Access Tokens
  4. Create a new access token
  5. Save the token to the HF_API_KEY environment variable
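Before making requests, it is worth sanity-checking the token at startup. A minimal sketch, assuming the key lives in `HF_API_KEY` (note that modern HuggingFace tokens conventionally start with `hf_`, but that prefix is a convention, not a guarantee):

```python
import os

def get_hf_api_key():
    """Read the HF API key from the environment with a basic sanity check."""
    key = os.environ.get("HF_API_KEY", "")
    if not key:
        raise RuntimeError("HF_API_KEY is not set")
    if not key.startswith("hf_"):
        # Warn rather than fail: older or custom tokens may differ.
        print("warning: HF_API_KEY does not look like a HuggingFace token")
    return key
```

Failing fast on a missing key produces a clearer error than a 401 response deep inside a request.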

Use Case Configurations

General Chat

```yaml
models:
  - name: "general-chat"
    provider: "huggingfaceinferenceapi"
    model: "mistralai/Mixtral-8x7B-Instruct-v0.1"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.7
      maxTokens: 4096
```

Code Assistant

```yaml
models:
  - name: "code-assistant"
    provider: "huggingfaceinferenceapi"
    model: "bigcode/starcoder"
    apiKey: "${HF_API_KEY}"
    roles: ["edit", "apply"]
    defaultCompletionOptions:
      temperature: 0.2
      maxTokens: 2048
```

Multilingual Translation

```yaml
models:
  - name: "translator"
    provider: "huggingfaceinferenceapi"
    model: "google/flan-t5-xxl"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.3
      maxTokens: 1024
```

Advanced Configuration

Custom Endpoint

```yaml
models:
  - name: "custom-endpoint"
    provider: "huggingfaceinferenceapi"
    model: "meta-llama/Llama-2-70b-chat-hf"
    apiBase: "https://api-inference.huggingface.co"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
```

Dedicated Inference Endpoint

```yaml
models:
  - name: "dedicated-endpoint"
    provider: "huggingfaceinferenceapi"
    model: "meta-llama/Llama-2-70b-chat-hf"
    apiBase: "https://your-endpoint.endpoints.huggingface.cloud"
    apiKey: "${HF_API_KEY}"
    roles: ["chat"]
    defaultCompletionOptions:
      temperature: 0.5
      maxTokens: 4096
```
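The two endpoint styles route differently: the serverless API addresses a model by path under `/models/`, while a dedicated endpoint URL already points at a single deployed model. A small sketch of how a client might derive the request URL from `apiBase` and `model` (assumed routing, matching the public serverless API convention):

```python
def inference_url(api_base, model):
    """Derive the request URL for serverless vs dedicated endpoints."""
    if "api-inference.huggingface.co" in api_base:
        # Serverless Inference API: model id is part of the path.
        return f"{api_base.rstrip('/')}/models/{model}"
    # Dedicated endpoint: the base URL already targets one model.
    return api_base.rstrip("/")
```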

Model Discovery

Browse the HuggingFace Hub (huggingface.co/models) to find available models. Filter by task, library, and license to narrow the results.

Troubleshooting

Common Errors

  1. 401 Unauthorized: API key is missing, invalid, or expired
  2. 503 Service Unavailable: the model is still loading; wait and retry
  3. 429 Too Many Requests: rate limit reached; back off before retrying
  4. Model Not Found: confirm the model path is spelled correctly and your token has access to it
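The 503 (model loading) and 429 (rate limit) cases above are transient and worth retrying with exponential backoff. A minimal sketch with an injectable `send_request` function, so the retry logic is shown without any network calls (illustrative only, not the provider's actual client code):

```python
import time

def call_with_retry(send_request, max_retries=5, base_delay=1.0):
    """Retry transient HF Inference API failures with exponential backoff.

    `send_request` is a caller-supplied function returning (status, body).
    503 = model loading, 429 = rate limited: both are retried; any other
    non-200 status fails immediately.
    """
    for attempt in range(max_retries):
        status, body = send_request()
        if status == 200:
            return body
        if status in (429, 503):
            time.sleep(base_delay * (2 ** attempt))
            continue
        raise RuntimeError(f"HF API error {status}: {body}")
    raise RuntimeError("giving up after repeated 429/503 responses")
```

In production you would also honor a `Retry-After` header when present, rather than relying on backoff alone.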

Debugging Steps

  1. Verify API key format and validity
  2. Check model identifier is correct
  3. Confirm model is available on HuggingFace Hub
  4. View HuggingFace status page
  5. Check rate limits and quotas

Usage Limits

  • Free Tier: 30,000 calls per month
  • Paid Tier: Higher call limits
  • Concurrency Limit: Limited number of simultaneous requests
  • Model Loading: First request may require waiting for model loading

Best Practices

1. Model Selection

  • Choose appropriate model based on task requirements
  • Consider model size and response time
  • Prioritize verified popular models
  • Review model card for usage restrictions

2. Performance Optimization

  • Speed Priority: Choose smaller models
  • Quality Priority: Choose larger models
  • Production Environment: Consider dedicated inference endpoints
  • Implement request caching mechanism
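The caching suggestion above can be as simple as an in-memory map keyed by model, prompt, and options. A tiny sketch (illustrative only; a production cache would add TTLs, size limits, and persistence):

```python
import hashlib
import json

class CompletionCache:
    """In-memory cache for completion results, keyed by request content."""

    def __init__(self):
        self._store = {}

    def _key(self, model, prompt, options):
        # Stable key: serialize the request deterministically, then hash.
        raw = json.dumps([model, prompt, options], sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def get_or_compute(self, model, prompt, options, compute):
        key = self._key(model, prompt, options)
        if key not in self._store:
            self._store[key] = compute()  # only call the API on a miss
        return self._store[key]
```

Identical repeated requests then cost one API call instead of many, which matters under the free tier's monthly quota.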

3. Cost Management

  • Monitor API usage
  • Choose appropriate model size
  • Consider inference endpoints (billed hourly)
  • Set quota alerts

4. Error Handling

  • Implement retry mechanism
  • Handle model loading wait times
  • Gracefully handle API limits
  • Log errors

Inference Endpoints

For production environments, dedicated inference endpoints are recommended:

Advantages

  • Guaranteed availability
  • Lower latency
  • No cold starts
  • Higher throughput

Configuration Example

```yaml
models:
  - name: "production-endpoint"
    provider: "huggingfaceinferenceapi"
    model: "meta-llama/Llama-2-70b-chat-hf"
    apiBase: "https://your-endpoint.endpoints.huggingface.cloud"
    apiKey: "${HF_ENDPOINT_TOKEN}"
    roles: ["chat", "edit"]
    defaultCompletionOptions:
      temperature: 0.5
      maxTokens: 4096
```