Inference Provider Routing

Route requests to the best inference provider

Infron AI routes requests to the best available providers for your model.

By default, requests are load balanced across the top providers to maximize uptime and minimize price.

You can customize how your requests are routed using the provider object in the request body for Chat Completions and Completions.

The provider object can contain the following fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| order | string[] | - | List of provider slugs to try in order (e.g. ["anthropic", "openai"]). |
| allow_fallbacks | boolean | true | Whether to allow backup providers when the primary is unavailable. |
| sort | string \| object | - | Sort providers by price, throughput, or latency (e.g. "price"). |
| preferred_min_throughput | number \| object | - | Preferred minimum throughput (tokens/sec). Can be a number or an object with percentile cutoffs (p50, p75, p90, p99). |
| preferred_max_latency | number \| object | - | Preferred maximum latency (seconds). Can be a number or an object with percentile cutoffs (p50, p75, p90, p99). |
| require_parameters | boolean | true | Only use providers that support all parameters in your request. |
| data_collection | "allow" \| "deny" | "allow" | Control whether to use providers that may store data. |
| zdr | boolean | false | Restrict routing to only ZDR (Zero Data Retention) endpoints. |
| enforce_distillable_text | boolean | false | Restrict routing to only models that allow text distillation. |
| only | string[] | - | List of provider slugs to allow for this request. |
| ignore | string[] | - | List of provider slugs to skip for this request. |
| quantizations | string[] | - | List of quantization levels to filter by (e.g. ["int4", "int8"]). |
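As a sketch of how these fields are sent, here is a minimal Chat Completions request in Python. The endpoint URL and API key header are placeholders (assumptions, not documented here); the model slug and provider fields come from this page:

```python
import requests

# NOTE: the base URL and auth header below are placeholders -- substitute
# your actual Infron AI endpoint and API key.
response = requests.post(
    "https://api.infron.ai/v1/chat/completions",  # hypothetical URL
    headers={"Authorization": "Bearer <INFRON_API_KEY>"},
    json={
        "model": "anthropic/claude-sonnet-4.5",
        "messages": [{"role": "user", "content": "Hello"}],
        # The provider object customizes routing for this request.
        "provider": {
            "sort": "throughput",
            "allow_fallbacks": True,
        },
    },
)
print(response.json())
```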

Cost-effective Load Balancing (Default Strategy)

For each model in your request, Infron's default behavior is to load balance requests across providers, balancing high throughput, low latency, and low price.

When you send a model request, Infron automatically evaluates multiple providers in real time, weighing factors such as latency, throughput, reliability, and price according to the default strategy.


For instance, if Provider A offers slightly higher throughput but at a higher cost, while Provider B is more affordable with moderate latency, Infron will intelligently balance requests across both to achieve the best overall performance and cost efficiency.


If you are more sensitive to throughput than price, you can use the sort field to explicitly prioritize throughput.

If you have sort or order set in your provider preferences, the default load-balancing strategy is disabled.

Ordering Specific Providers (order)

You can set the providers that Infron AI will prioritize for your request using the order field.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| order | string[] | - | List of provider slugs to try in order (e.g. ["anthropic", "openai"]). |

Infron AI will prioritize providers in this order, for the model you're using. If you don't set this field, the router will use the default strategy.

You can use the copy button next to provider names on model pages to get the exact provider slug, for example "anthropic", "openai", or "novita".


Order example (allow_fallbacks enabled, the default):

  • azure hosts "anthropic/claude-sonnet-4.5"

  • anthropic hosts "anthropic/claude-sonnet-4.5"

  • openai hosts "anthropic/claude-sonnet-4.5"

You set the order field to ["anthropic", "openai"] and call the "anthropic/claude-sonnet-4.5" model.

  • If provider anthropic fails, provider openai is tried next.

  • If provider openai also fails, a backup provider (such as azure) is tried last.

Infron tries all the providers specified in order one at a time, then proceeds to other backup providers if none are operational.

If you don't want to allow any other providers, you should disable allow_fallbacks as well.


Order example (allow_fallbacks disabled):

  • azure hosts "anthropic/claude-sonnet-4.5"

  • anthropic hosts "anthropic/claude-sonnet-4.5"

  • openai hosts "anthropic/claude-sonnet-4.5"

You set the order field to ["anthropic", "openai"], set allow_fallbacks to false, and call the "anthropic/claude-sonnet-4.5" model.

  • If provider anthropic fails, provider openai is tried next.

  • If provider openai also fails, the request fails.

Example: Specifying providers with fallbacks

In the example below, your request will first be sent to Google AI Studio, and only when Google AI Studio experiences a serious outage will the request be forwarded to Google Vertex.
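A sketch of the request body for this case, in Python. The provider slugs "google-ai-studio" and "google-vertex" and the model slug are illustrative assumptions; copy the exact slugs from the model page:

```python
# Request body sketch; slugs below are illustrative -- copy the exact
# slugs from the model page.
payload = {
    "model": "google/gemini-2.5-flash",  # assumed model slug
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        # Try Google AI Studio first, then fall back to Google Vertex.
        "order": ["google-ai-studio", "google-vertex"]
        # allow_fallbacks defaults to true, so backups remain enabled.
    },
}
```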

Example: Specifying providers with fallbacks disabled

Here's an example with allow_fallbacks set to false: your request will be sent only to Google AI Studio, and it fails if Google AI Studio fails.
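A sketch of that request body, with the same assumed slugs as above:

```python
payload = {
    "model": "google/gemini-2.5-flash",  # assumed model slug
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        "order": ["google-ai-studio"],  # assumed provider slug
        "allow_fallbacks": False,  # fail instead of trying another provider
    },
}
```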

Example: Targeting Specific Provider Endpoints

Each provider on Infron may host multiple endpoints for the same model, such as a default endpoint and a specialized "quantizations" endpoint. To target a specific endpoint, you can use the copy button next to the provider name on the model detail page to obtain the exact provider slug.

For example, MiniMax offers MiniMax M2.1 through multiple endpoints:

  • Default endpoint with slug minimax/fp8

  • Lightning endpoint with slug minimax/lightning

By copying the exact provider slug and using it in your request's order array, you can ensure your request is routed to the specific endpoint you want:
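For instance, a sketch targeting the Lightning endpoint (the model slug is an assumption; the endpoint slug minimax/lightning comes from the list above):

```python
payload = {
    "model": "minimax/minimax-m2.1",  # assumed model slug
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        # Route to the Lightning endpoint; disable fallbacks so the
        # request never lands on a different endpoint.
        "order": ["minimax/lightning"],
        "allow_fallbacks": False,
    },
}
```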

This approach is especially useful when you want to consistently use a specific variant of a model from a particular provider.

Provider Sorting (sort)

If you instead want to explicitly prioritize a particular provider attribute, you can include the sort field in the provider preferences. The default load-balancing strategy will be disabled, and the router will try providers in the resulting order.

The three sort options are:

  • "price": prioritize lowest price

  • "throughput": prioritize highest throughput

  • "latency": prioritize lowest latency

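For example, a sketch of a request body that always prioritizes throughput (the model slug comes from the examples above):

```python
payload = {
    "model": "anthropic/claude-sonnet-4.5",
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        "sort": "throughput",  # or "price" / "latency"
    },
}
```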

Performance Thresholds (preferred_min_throughput / preferred_max_latency)

You can set minimum throughput or maximum latency thresholds to filter endpoints.

Endpoints that don't meet these thresholds are deprioritized (moved to the end of the list) rather than excluded entirely.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| preferred_min_throughput | number \| object | - | Preferred minimum throughput in tokens per second. Can be a number (applies to p50) or an object with percentile cutoffs. |
| preferred_max_latency | number \| object | - | Preferred maximum latency in seconds. Can be a number (applies to p50) or an object with percentile cutoffs. |

How Percentiles Work

Infron tracks latency and throughput metrics for each model and provider using percentile statistics calculated over a rolling 5-minute window. The available percentiles are:

  • p50 (median): 50% of requests perform better than this value

  • p75: 75% of requests perform better than this value

  • p90: 90% of requests perform better than this value

  • p99: 99% of requests perform better than this value

Higher percentiles (like p90 or p99) give you more confidence about worst-case performance, while lower percentiles (like p50) reflect typical performance. For example, if a model and provider has a p90 latency of 2 seconds, that means 90% of requests complete in under 2 seconds.

When to Use Percentile Preferences

Percentile-based routing is useful when you need predictable performance characteristics:

  • Real-time applications: Use p90 or p99 latency thresholds to ensure consistent response times for user-facing features

  • Batch processing: Use p50 throughput thresholds when you care more about average performance than worst-case scenarios

  • SLA compliance: Use multiple percentile cutoffs to ensure providers meet your service level agreements across different performance tiers

  • Cost optimization: Combine with sort: "price" to get the cheapest provider that still meets your performance requirements

Example: Find the Cheapest Model Meeting Performance Requirements

Combine sort: "price" with performance thresholds to find the cheapest option that meets your performance requirements. This is useful when you have a performance floor but want to minimize costs.
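A sketch of such a request body (the threshold value matches the walkthrough below):

```python
payload = {
    "model": "anthropic/claude-sonnet-4.5",
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        "sort": "price",
        # Prefer endpoints whose p90 throughput is at least 50 tokens/sec.
        "preferred_min_throughput": {"p90": 50},
    },
}
```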

In this example, Infron will find the cheapest provider that has at least 50 tokens/second throughput at the p90 level (meaning 90% of requests achieve this throughput or better). Providers below this threshold are still available as fallbacks if all preferred options fail.

You can also use preferred_max_latency to set a maximum acceptable latency:
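A sketch with an illustrative latency ceiling (a bare number applies to p50):

```python
payload = {
    "model": "anthropic/claude-sonnet-4.5",
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        "sort": "price",
        "preferred_max_latency": 2,  # at most 2 seconds median (p50) latency
    },
}
```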

Example: Using Multiple Percentile Cutoffs

You can specify multiple percentile cutoffs to set both typical and worst-case performance requirements. All specified cutoffs must be met for a provider to be in the preferred group.
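A sketch with two cutoffs (values are illustrative); both must be met for an endpoint to stay in the preferred group:

```python
payload = {
    "model": "anthropic/claude-sonnet-4.5",
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        "preferred_min_throughput": {
            "p50": 80,  # typical case: at least 80 tokens/sec
            "p90": 50,  # near-worst case: at least 50 tokens/sec
        },
    },
}
```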

Requiring Providers to Support All Parameters (require_parameters)

You can restrict requests only to providers that support all parameters in your request using the require_parameters field.

When you send a request with `tools` or `tool_choice`, Infron will only route to providers that support tool use. Similarly, if you set `max_tokens`, Infron will only route to providers that support a response of that length.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| require_parameters | boolean | true | Only use providers that support all parameters in your request. |

  • When require_parameters is false, providers that don't support all the LLM parameters specified in your request can still receive the request, but will ignore unknown parameters.

  • When require_parameters is true, the request won't be routed to providers that don't support all of those parameters.

Example: Excluding providers that don't support JSON formatting

For example, to only use providers that support JSON formatting:
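A sketch of the request body; response_format with type json_object is the standard Chat Completions JSON-mode parameter:

```python
payload = {
    "model": "anthropic/claude-sonnet-4.5",
    "messages": [{"role": "user", "content": "Reply in JSON."}],
    "response_format": {"type": "json_object"},
    "provider": {
        # Route only to providers that support every parameter above,
        # including response_format.
        "require_parameters": True,
    },
}
```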

Requiring Providers to Comply with Data Policies (data_collection)

You can restrict requests only to providers that comply with your data policies using the data_collection field.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| data_collection | "allow" \| "deny" | "allow" | Control whether to use providers that may store data. |

  • allow: (default) allow providers which store user data non-transiently and may train on it

  • deny: use only providers which do not collect user data

Some model providers may log prompts, so we display them with a Data Policy tag on model pages. This is not a definitive source of third party data policies, but represents our best knowledge.

Example: Excluding providers that don't comply with data policies

To exclude providers that don't comply with your data policies, set data_collection to deny:
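A sketch of such a request body:

```python
payload = {
    "model": "anthropic/claude-sonnet-4.5",
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        "data_collection": "deny",  # skip providers that may store data
    },
}
```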

Zero Data Retention Enforcement (zdr)

You can enforce Zero Data Retention (ZDR) on a per-request basis using the zdr parameter, ensuring your request only routes to endpoints that do not retain prompts.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| zdr | boolean | false | Restrict routing to only ZDR (Zero Data Retention) endpoints. |

  • When zdr is set to true, the request will only be routed to endpoints that have a Zero Data Retention policy.

  • When zdr is false or not provided, it has no effect on routing.

Example: Enforcing ZDR for a specific request

To ensure a request only uses ZDR endpoints, set zdr to true:
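A sketch of such a request body:

```python
payload = {
    "model": "anthropic/claude-sonnet-4.5",
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        "zdr": True,  # route only to Zero Data Retention endpoints
    },
}
```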

This is useful for customers who don't want to globally enforce ZDR but need to ensure specific requests only route to ZDR endpoints.

Distillable Text Enforcement (enforce_distillable_text)

You can enforce distillable text filtering on a per-request basis using the enforce_distillable_text parameter, ensuring your request only routes to models where the author has allowed text distillation.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enforce_distillable_text | boolean | false | Restrict routing to only models that allow text distillation. |

  • When enforce_distillable_text is set to true, the request will only be routed to models where the author has explicitly enabled text distillation.

  • When enforce_distillable_text is false or not provided, it has no effect on routing.

This parameter is useful for applications that need to ensure their requests only use models that allow text distillation for training purposes, such as when building datasets for model fine-tuning or distillation workflows.

Example: Enforcing distillable text for a specific request

To ensure a request only uses models that allow text distillation, set enforce_distillable_text to true:
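A sketch of such a request body:

```python
payload = {
    "model": "anthropic/claude-sonnet-4.5",
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        "enforce_distillable_text": True,  # only models allowing distillation
    },
}
```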

Disabling Fallbacks (allow_fallbacks)

Example: Always choose the cheapest provider with fallbacks disabled

To guarantee that your request is only served by the lowest-cost provider, you can disable fallbacks.

This can be combined with the order field to restrict the providers that Infron will consider to just your chosen list.
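A sketch of the cheapest-provider case, using sort by price with fallbacks disabled (you could instead pin an explicit order list):

```python
payload = {
    "model": "anthropic/claude-sonnet-4.5",
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        "sort": "price",           # always pick the lowest-cost provider
        "allow_fallbacks": False,  # fail rather than use a pricier backup
    },
}
```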

Example: Always choose specific providers with fallbacks disabled

Here's an example with allow_fallbacks set to false: your request will be sent only to Google AI Studio, and it fails if Google AI Studio fails.
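A sketch of that request body, with the same assumed slugs as in the earlier Google AI Studio example:

```python
payload = {
    "model": "google/gemini-2.5-flash",  # assumed model slug
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        "order": ["google-ai-studio"],  # assumed provider slug
        "allow_fallbacks": False,
    },
}
```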

Allowing Only Specific Providers (only)

You can allow only specific providers for a request by setting the only field in the provider object.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| only | string[] | - | List of provider slugs to allow for this request. |

Only allowing some providers may significantly reduce fallback options and limit request recovery.

Example: Only allow Azure for a request calling GPT-4 Omni

Here's an example that will only use Azure for a request calling GPT-4 Omni:
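A sketch of the request body; the GPT-4 Omni model slug is an assumption:

```python
payload = {
    "model": "openai/gpt-4o",  # assumed slug for GPT-4 Omni
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        "only": ["azure"],  # route exclusively to Azure
    },
}
```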

Ignoring Providers (ignore)

You can ignore providers for a request by setting the ignore field in the provider object.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| ignore | string[] | - | List of provider slugs to skip for this request. |

Ignoring multiple providers may significantly reduce fallback options and limit request recovery.

Example: Ignoring a provider for a request

Here's an example that ignores a specific provider:
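A sketch of the request body; the ignored slug is illustrative:

```python
payload = {
    "model": "anthropic/claude-sonnet-4.5",
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        "ignore": ["azure"],  # skip this provider for this request
    },
}
```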

Quantization (quantizations)

Quantization reduces model size and computational requirements while aiming to preserve performance. Most LLMs today use FP16 or BF16 for training and inference, cutting memory requirements in half compared to FP32. Some optimizations use FP8 or quantization to reduce size further (e.g., INT8, INT4).

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| quantizations | string[] | - | List of quantization levels to filter by (e.g. ["int4", "int8"]). |

Quantized models may exhibit degraded performance for certain prompts, depending on the method used.

Providers can support various quantization levels for open-weight models.

Quantization Levels

To filter providers by quantization level, specify the quantizations field in the provider parameter with the following values:

  • int4: Integer (4 bit)

  • int8: Integer (8 bit)

  • fp4: Floating point (4 bit)

  • fp6: Floating point (6 bit)

  • fp8: Floating point (8 bit)

  • fp16: Floating point (16 bit)

  • bf16: Brain floating point (16 bit)

  • fp32: Floating point (32 bit)

  • unknown: Unknown

Example: Requesting FP8 Quantization

Here's an example that will only use providers that support FP8 quantization:
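A sketch of the request body; the open-weight model slug is an assumption:

```python
payload = {
    "model": "meta-llama/llama-3.1-70b-instruct",  # assumed model slug
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        "quantizations": ["fp8"],  # only endpoints serving FP8 weights
    },
}
```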
