Sticky Routing

Sticky Routing: Your Cache Hit Rate Is a Routing Problem

Date

Jun 17, 2026

Author

Andrew Zheng

Sticky Routing: Your Cache Hit Rate Is a Routing Problem

Your AI app resends the same long block of instructions and context at the top of every request. Prompt caching is supposed to make that block nearly free after the first time: a cache read costs about a tenth of full price. So you turn it on, and your bill drops.

Then it climbs back, and you never touched your code.

The problem is not your prompt. It is where your requests are going. This post is about that, what we measured against OpenRouter, and the feature we built to fix it: Sticky Routing.

The problem: a gateway scatters your cache

Start with the thing caching depends on. A "prefix" is just the part of your prompt that stays the same every time: your tool definitions, system instructions, and documents, sitting up front. When a provider sees a prefix it just processed, it skips the work and reads it back from cache for a fraction of the price. That fraction is the whole savings.

It works perfectly on a single provider, because every request lands in the same place. A gateway is different. A gateway is the layer between your app and many providers, and its job is to send each request to whichever provider looks best right now. Great for reliability and price. Not great for your cache.

Watch one conversation, the same prefix every turn, on a gateway with no memory of where it sent you last:

You paid to warm up three providers and read from almost none of them. Your cache is not broken. It is scattered. And on most turns you are paying full price for context the system already processed, plus a slower first token while it does that work over again.

The more providers in the pool, the more places your cache can scatter to. A gateway is supposed to save you money. On a repetitive workload, careless routing does the opposite.

What we measured against OpenRouter

We did not want to just assert this, so we tested it head to head with OpenRouter.

The setup is simple on purpose. For each platform we send the same request twice in a row. The first call warms the cache. The second call is the one we check, because a request can succeed without reusing anything. The number we read is how many tokens the second call got back from cache. We ran 120 rounds per platform (3 groups of 40), the same model (deepseek/deepseek-v4-flash), caching on for both, the same 12,000-character prefix.

What we measured	Infron	OpenRouter
Rounds that actually read from cache	120 of 120	103 of 120
Cache hit rate (by call)	100%	85.83%
Cache hit rate (by token)	97.22%	83.40%
Avg tokens read from cache, 2nd call	4096.0	3513.6
Total measured cost, all 120 rounds	$0.036364	$0.046970

Same model, same prompt, and OpenRouter cost 1.2917 times as much. That is 22.58% more, for one reason: it missed the cache on 17 of 120 rounds, and every miss pays full price again.

To be straight about what this is: it is a repeated-request test that shows how well each platform holds onto a warm cache, not an on/off test of a single system. The 100% is 120 clean hits in a controlled run, not a promise for every workload. The method above is simple enough to rerun on your own prompt and model. The part that travels is the point underneath the numbers: on a gateway, your cache hit rate depends on routing, and routing is ours to manage.

The fix: Sticky Routing

Sticky Routing keeps one conversation on the same provider that already worked for it, instead of letting it wander. Your prefix stays where it is already warm.

The same conversation, Sticky Routing on:

The word that matters is healthy. This is "reuse it while it is working well," not "lock onto one provider no matter what."

How it works

The first request in a conversation routes normally: we build the candidate list, filter it, check health, and score. Whichever provider succeeds, we remember it for this conversation, keyed to the user, the model, and a fingerprint of the conversation.
On the next turns, before scoring runs, we check that memory. We reuse the remembered provider only if it is still a candidate, still up, and not degraded. If anything looks off, we drop the memory and route normally, with no noise.
Every time it succeeds, we refresh the memory.
The memory only lasts a short while, matched to how long providers keep their own cache warm. After that, the conversation routes fresh again and re-stickies on its next success.

Matching the memory is necessary, but never enough on its own. The provider still has to pass the health check at that moment.

You do not have to do anything

Sticky Routing is on by default. No flag, no code change. You point at Infron the same way you point at any OpenAI-style endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.onerouter.pro/v1",       # the only line that differs
    api_key="YOUR_INFRON_KEY",
)

# multi-turn conversation, stable part of the prompt first
client.chat.completions.create(
    model="deepseek/deepseek-v4-flash",
    messages=[system_prompt, *history, user_turn],
)

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.onerouter.pro/v1",       # the only line that differs
    api_key="YOUR_INFRON_KEY",
)

# multi-turn conversation, stable part of the prompt first
client.chat.completions.create(
    model="deepseek/deepseek-v4-flash",
    messages=[system_prompt, *history, user_turn],
)

Keep the stable part of your prompt first and the changing part last, and we hold the route steady underneath. Everything else about your call stays the same.

It also gets out of your way the moment you steer yourself. If your request sets provider.order, provider.only, provider.ignore, provider.sort, or allow_fallbacks, we skip Sticky Routing for that request, because we will not override a routing choice you made on purpose.

When it does not help

Sticky Routing is not magic, and some workloads get nothing from it.

If the start of your prompt changes every time, there is no warm prefix to protect.
If every request is a brand new conversation, there is nothing to keep sticky.
If you already pin a provider per request, you have turned it off yourself.

The payoff grows with how much your prompt repeats, which is exactly the agent, RAG, and assistant traffic that fills most production systems today.

FAQ

Does it hurt reliability? No. It only decides which healthy provider wins a close call. If the sticky one fails, the retry skips it and routes normally, then re-stickies to whatever actually worked. Anything going down, degrading, or timing out sends you back to normal routing.

Could it pin me to a slow or pricey provider? Only while that provider is healthy, and only for a short window. The memory does not survive failures and expires on its own.

How do you tell one conversation from another? A key built from the user, the model, and a fingerprint of the conversation. Different conversations get different keys and route on their own.

Do I send a parameter? No. It is on by default, and it steps aside when you set your own provider controls.

Try it

Sticky Routing is live. Put the stable part of your prompt first, point your traffic at Infron, and watch your own cache hit rate and cost move. Rerun the same test on your own workload, or head to infron.ai to start.

Less orchestration. More innovation.

Sticky Routing: Your Cache Hit Rate Is a Routing Problem

Then it climbs back, and you never touched your code.

The problem is not your prompt. It is where your requests are going. This post is about that, what we measured against OpenRouter, and the feature we built to fix it: Sticky Routing.

The problem: a gateway scatters your cache

Watch one conversation, the same prefix every turn, on a gateway with no memory of where it sent you last:

The more providers in the pool, the more places your cache can scatter to. A gateway is supposed to save you money. On a repetitive workload, careless routing does the opposite.

What we measured against OpenRouter

We did not want to just assert this, so we tested it head to head with OpenRouter.

What we measured	Infron	OpenRouter
Rounds that actually read from cache	120 of 120	103 of 120
Cache hit rate (by call)	100%	85.83%
Cache hit rate (by token)	97.22%	83.40%
Avg tokens read from cache, 2nd call	4096.0	3513.6
Total measured cost, all 120 rounds	$0.036364	$0.046970

Same model, same prompt, and OpenRouter cost 1.2917 times as much. That is 22.58% more, for one reason: it missed the cache on 17 of 120 rounds, and every miss pays full price again.

The fix: Sticky Routing

Sticky Routing keeps one conversation on the same provider that already worked for it, instead of letting it wander. Your prefix stays where it is already warm.

The same conversation, Sticky Routing on:

The word that matters is healthy. This is "reuse it while it is working well," not "lock onto one provider no matter what."

How it works

The first request in a conversation routes normally: we build the candidate list, filter it, check health, and score. Whichever provider succeeds, we remember it for this conversation, keyed to the user, the model, and a fingerprint of the conversation.
On the next turns, before scoring runs, we check that memory. We reuse the remembered provider only if it is still a candidate, still up, and not degraded. If anything looks off, we drop the memory and route normally, with no noise.
Every time it succeeds, we refresh the memory.
The memory only lasts a short while, matched to how long providers keep their own cache warm. After that, the conversation routes fresh again and re-stickies on its next success.

Matching the memory is necessary, but never enough on its own. The provider still has to pass the health check at that moment.

You do not have to do anything

Sticky Routing is on by default. No flag, no code change. You point at Infron the same way you point at any OpenAI-style endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.onerouter.pro/v1",       # the only line that differs
    api_key="YOUR_INFRON_KEY",
)

# multi-turn conversation, stable part of the prompt first
client.chat.completions.create(
    model="deepseek/deepseek-v4-flash",
    messages=[system_prompt, *history, user_turn],
)

Keep the stable part of your prompt first and the changing part last, and we hold the route steady underneath. Everything else about your call stays the same.

When it does not help

Sticky Routing is not magic, and some workloads get nothing from it.

If the start of your prompt changes every time, there is no warm prefix to protect.
If every request is a brand new conversation, there is nothing to keep sticky.
If you already pin a provider per request, you have turned it off yourself.

The payoff grows with how much your prompt repeats, which is exactly the agent, RAG, and assistant traffic that fills most production systems today.

FAQ

Could it pin me to a slow or pricey provider? Only while that provider is healthy, and only for a short window. The memory does not survive failures and expires on its own.

How do you tell one conversation from another? A key built from the user, the model, and a fingerprint of the conversation. Different conversations get different keys and route on their own.

Do I send a parameter? No. It is on by default, and it steps aside when you set your own provider controls.

Try it

Less orchestration. More innovation.

Seedance 2.0 Real Human Pipeline

How to Build a Seedance 2.0 Real Human Pipeline With Reference Images

Seedance 2.0 Real Human Pipeline

How to Build a Seedance 2.0 Real Human Pipeline With Reference Images

From Image Model to Finished Clip

Seedance 2.0 Real Human Video API: Access, Setup, and Prompting

From Image Model to Finished Clip

Seedance 2.0 Real Human Video API: Access, Setup, and Prompting

Research

SEAR: Schema-Based Evaluation and Routing for LLM Gateways

Research

SEAR: Schema-Based Evaluation and Routing for LLM Gateways

Seedance 2.0 Real Human Pipeline

How to Build a Seedance 2.0 Real Human Pipeline With Reference Images

From Image Model to Finished Clip

Seedance 2.0 Real Human Video API: Access, Setup, and Prompting

Less orchestration.
More innovation.

Seamlessly integrate Infron with just a few lines of code and unlock unlimited AI power.

Book a Demo

Less orchestration.
More innovation.

Seamlessly integrate Infron with just a few lines of code and unlock unlimited AI power.

Book a Demo

Less orchestration.
More innovation.

Seamlessly integrate Infron with just a few lines of code and unlock unlimited AI power.

Book a Demo

Sticky Routing: Your Cache Hit Rate Is a Routing Problem

Date

Author

Sticky Routing: Your Cache Hit Rate Is a Routing Problem

The problem: a gateway scatters your cache

What we measured against OpenRouter

The fix: Sticky Routing

How it works

You do not have to do anything

When it does not help

FAQ

Try it

Sticky Routing: Your Cache Hit Rate Is a Routing Problem

The problem: a gateway scatters your cache

What we measured against OpenRouter

The fix: Sticky Routing

How it works

You do not have to do anything

When it does not help

FAQ

Try it

More Articles

How to Build a Seedance 2.0 Real Human Pipeline With Reference Images

How to Build a Seedance 2.0 Real Human Pipeline With Reference Images

Seedance 2.0 Real Human Video API: Access, Setup, and Prompting

Seedance 2.0 Real Human Video API: Access, Setup, and Prompting

SEAR: Schema-Based Evaluation and Routing for LLM Gateways

SEAR: Schema-Based Evaluation and Routing for LLM Gateways

How to Build a Seedance 2.0 Real Human Pipeline With Reference Images

Seedance 2.0 Real Human Video API: Access, Setup, and Prompting

Less orchestration.More innovation.

Less orchestration.More innovation.

Less orchestration.More innovation.

Less orchestration.
More innovation.

Less orchestration.
More innovation.

Less orchestration.
More innovation.