Sticky Routing
Sticky Routing: Your Cache Hit Rate Is a Routing Problem


Date
Author
Andrew Zheng
Sticky Routing: Your Cache Hit Rate Is a Routing Problem
Your AI app resends the same long block of instructions and context at the top of every request. Prompt caching is supposed to make that block nearly free after the first time: a cache read costs about a tenth of full price. So you turn it on, and your bill drops.
Then it climbs back, and you never touched your code.
The problem is not your prompt. It is where your requests are going. This post is about that, what we measured against OpenRouter, and the feature we built to fix it: Sticky Routing.
The problem: a gateway scatters your cache
Start with the thing caching depends on. A "prefix" is just the part of your prompt that stays the same every time: your tool definitions, system instructions, and documents, sitting up front. When a provider sees a prefix it just processed, it skips the work and reads it back from cache for a fraction of the price. That fraction is the whole savings.
It works perfectly on a single provider, because every request lands in the same place. A gateway is different. A gateway is the layer between your app and many providers, and its job is to send each request to whichever provider looks best right now. Great for reliability and price. Not great for your cache.
Watch one conversation, the same prefix every turn, on a gateway with no memory of where it sent you last:
You paid to warm up three providers and read from almost none of them. Your cache is not broken. It is scattered. And on most turns you are paying full price for context the system already processed, plus a slower first token while it does that work over again.
The more providers in the pool, the more places your cache can scatter to. A gateway is supposed to save you money. On a repetitive workload, careless routing does the opposite.
What we measured against OpenRouter
We did not want to just assert this, so we tested it head to head with OpenRouter.
The setup is simple on purpose. For each platform we send the same request twice in a row. The first call warms the cache. The second call is the one we check, because a request can succeed without reusing anything. The number we read is how many tokens the second call got back from cache. We ran 120 rounds per platform (3 groups of 40), the same model (deepseek/deepseek-v4-flash), caching on for both, the same 12,000-character prefix.
What we measured | Infron | OpenRouter |
|---|---|---|
Rounds that actually read from cache | 120 of 120 | 103 of 120 |
Cache hit rate (by call) | 100% | 85.83% |
Cache hit rate (by token) | 97.22% | 83.40% |
Avg tokens read from cache, 2nd call | 4096.0 | 3513.6 |
Total measured cost, all 120 rounds | $0.036364 | $0.046970 |
Same model, same prompt, and OpenRouter cost 1.2917 times as much. That is 22.58% more, for one reason: it missed the cache on 17 of 120 rounds, and every miss pays full price again.
To be straight about what this is: it is a repeated-request test that shows how well each platform holds onto a warm cache, not an on/off test of a single system. The 100% is 120 clean hits in a controlled run, not a promise for every workload. The method above is simple enough to rerun on your own prompt and model. The part that travels is the point underneath the numbers: on a gateway, your cache hit rate depends on routing, and routing is ours to manage.
The fix: Sticky Routing
Sticky Routing keeps one conversation on the same provider that already worked for it, instead of letting it wander. Your prefix stays where it is already warm.
The same conversation, Sticky Routing on:
The word that matters is healthy. This is "reuse it while it is working well," not "lock onto one provider no matter what."
How it works
The first request in a conversation routes normally: we build the candidate list, filter it, check health, and score. Whichever provider succeeds, we remember it for this conversation, keyed to the user, the model, and a fingerprint of the conversation.
On the next turns, before scoring runs, we check that memory. We reuse the remembered provider only if it is still a candidate, still up, and not degraded. If anything looks off, we drop the memory and route normally, with no noise.
Every time it succeeds, we refresh the memory.
The memory only lasts a short while, matched to how long providers keep their own cache warm. After that, the conversation routes fresh again and re-stickies on its next success.
Matching the memory is necessary, but never enough on its own. The provider still has to pass the health check at that moment.
You do not have to do anything
Sticky Routing is on by default. No flag, no code change. You point at Infron the same way you point at any OpenAI-style endpoint:
from openai import OpenAI client = OpenAI( base_url="https://llm.onerouter.pro/v1", # the only line that differs api_key="YOUR_INFRON_KEY", ) # multi-turn conversation, stable part of the prompt first client.chat.completions.create( model="deepseek/deepseek-v4-flash", messages=[system_prompt, *history, user_turn], )
from openai import OpenAI client = OpenAI( base_url="https://llm.onerouter.pro/v1", # the only line that differs api_key="YOUR_INFRON_KEY", ) # multi-turn conversation, stable part of the prompt first client.chat.completions.create( model="deepseek/deepseek-v4-flash", messages=[system_prompt, *history, user_turn], )
Keep the stable part of your prompt first and the changing part last, and we hold the route steady underneath. Everything else about your call stays the same.
It also gets out of your way the moment you steer yourself. If your request sets provider.order, provider.only, provider.ignore, provider.sort, or allow_fallbacks, we skip Sticky Routing for that request, because we will not override a routing choice you made on purpose.
When it does not help
Sticky Routing is not magic, and some workloads get nothing from it.
If the start of your prompt changes every time, there is no warm prefix to protect.
If every request is a brand new conversation, there is nothing to keep sticky.
If you already pin a provider per request, you have turned it off yourself.
The payoff grows with how much your prompt repeats, which is exactly the agent, RAG, and assistant traffic that fills most production systems today.
FAQ
Does it hurt reliability? No. It only decides which healthy provider wins a close call. If the sticky one fails, the retry skips it and routes normally, then re-stickies to whatever actually worked. Anything going down, degrading, or timing out sends you back to normal routing.
Could it pin me to a slow or pricey provider? Only while that provider is healthy, and only for a short window. The memory does not survive failures and expires on its own.
How do you tell one conversation from another? A key built from the user, the model, and a fingerprint of the conversation. Different conversations get different keys and route on their own.
Do I send a parameter? No. It is on by default, and it steps aside when you set your own provider controls.
Try it
Sticky Routing is live. Put the stable part of your prompt first, point your traffic at Infron, and watch your own cache hit rate and cost move. Rerun the same test on your own workload, or head to infron.ai to start.
Less orchestration. More innovation.
Sticky Routing: Your Cache Hit Rate Is a Routing Problem
Your AI app resends the same long block of instructions and context at the top of every request. Prompt caching is supposed to make that block nearly free after the first time: a cache read costs about a tenth of full price. So you turn it on, and your bill drops.
Then it climbs back, and you never touched your code.
The problem is not your prompt. It is where your requests are going. This post is about that, what we measured against OpenRouter, and the feature we built to fix it: Sticky Routing.
The problem: a gateway scatters your cache
Start with the thing caching depends on. A "prefix" is just the part of your prompt that stays the same every time: your tool definitions, system instructions, and documents, sitting up front. When a provider sees a prefix it just processed, it skips the work and reads it back from cache for a fraction of the price. That fraction is the whole savings.
It works perfectly on a single provider, because every request lands in the same place. A gateway is different. A gateway is the layer between your app and many providers, and its job is to send each request to whichever provider looks best right now. Great for reliability and price. Not great for your cache.
Watch one conversation, the same prefix every turn, on a gateway with no memory of where it sent you last:
You paid to warm up three providers and read from almost none of them. Your cache is not broken. It is scattered. And on most turns you are paying full price for context the system already processed, plus a slower first token while it does that work over again.
The more providers in the pool, the more places your cache can scatter to. A gateway is supposed to save you money. On a repetitive workload, careless routing does the opposite.
What we measured against OpenRouter
We did not want to just assert this, so we tested it head to head with OpenRouter.
The setup is simple on purpose. For each platform we send the same request twice in a row. The first call warms the cache. The second call is the one we check, because a request can succeed without reusing anything. The number we read is how many tokens the second call got back from cache. We ran 120 rounds per platform (3 groups of 40), the same model (deepseek/deepseek-v4-flash), caching on for both, the same 12,000-character prefix.
What we measured | Infron | OpenRouter |
|---|---|---|
Rounds that actually read from cache | 120 of 120 | 103 of 120 |
Cache hit rate (by call) | 100% | 85.83% |
Cache hit rate (by token) | 97.22% | 83.40% |
Avg tokens read from cache, 2nd call | 4096.0 | 3513.6 |
Total measured cost, all 120 rounds | $0.036364 | $0.046970 |
Same model, same prompt, and OpenRouter cost 1.2917 times as much. That is 22.58% more, for one reason: it missed the cache on 17 of 120 rounds, and every miss pays full price again.
To be straight about what this is: it is a repeated-request test that shows how well each platform holds onto a warm cache, not an on/off test of a single system. The 100% is 120 clean hits in a controlled run, not a promise for every workload. The method above is simple enough to rerun on your own prompt and model. The part that travels is the point underneath the numbers: on a gateway, your cache hit rate depends on routing, and routing is ours to manage.
The fix: Sticky Routing
Sticky Routing keeps one conversation on the same provider that already worked for it, instead of letting it wander. Your prefix stays where it is already warm.
The same conversation, Sticky Routing on:
The word that matters is healthy. This is "reuse it while it is working well," not "lock onto one provider no matter what."
How it works
The first request in a conversation routes normally: we build the candidate list, filter it, check health, and score. Whichever provider succeeds, we remember it for this conversation, keyed to the user, the model, and a fingerprint of the conversation.
On the next turns, before scoring runs, we check that memory. We reuse the remembered provider only if it is still a candidate, still up, and not degraded. If anything looks off, we drop the memory and route normally, with no noise.
Every time it succeeds, we refresh the memory.
The memory only lasts a short while, matched to how long providers keep their own cache warm. After that, the conversation routes fresh again and re-stickies on its next success.
Matching the memory is necessary, but never enough on its own. The provider still has to pass the health check at that moment.
You do not have to do anything
Sticky Routing is on by default. No flag, no code change. You point at Infron the same way you point at any OpenAI-style endpoint:
from openai import OpenAI client = OpenAI( base_url="https://llm.onerouter.pro/v1", # the only line that differs api_key="YOUR_INFRON_KEY", ) # multi-turn conversation, stable part of the prompt first client.chat.completions.create( model="deepseek/deepseek-v4-flash", messages=[system_prompt, *history, user_turn], )
Keep the stable part of your prompt first and the changing part last, and we hold the route steady underneath. Everything else about your call stays the same.
It also gets out of your way the moment you steer yourself. If your request sets provider.order, provider.only, provider.ignore, provider.sort, or allow_fallbacks, we skip Sticky Routing for that request, because we will not override a routing choice you made on purpose.
When it does not help
Sticky Routing is not magic, and some workloads get nothing from it.
If the start of your prompt changes every time, there is no warm prefix to protect.
If every request is a brand new conversation, there is nothing to keep sticky.
If you already pin a provider per request, you have turned it off yourself.
The payoff grows with how much your prompt repeats, which is exactly the agent, RAG, and assistant traffic that fills most production systems today.
FAQ
Does it hurt reliability? No. It only decides which healthy provider wins a close call. If the sticky one fails, the retry skips it and routes normally, then re-stickies to whatever actually worked. Anything going down, degrading, or timing out sends you back to normal routing.
Could it pin me to a slow or pricey provider? Only while that provider is healthy, and only for a short window. The memory does not survive failures and expires on its own.
How do you tell one conversation from another? A key built from the user, the model, and a fingerprint of the conversation. Different conversations get different keys and route on their own.
Do I send a parameter? No. It is on by default, and it steps aside when you set your own provider controls.
Try it
Sticky Routing is live. Put the stable part of your prompt first, point your traffic at Infron, and watch your own cache hit rate and cost move. Rerun the same test on your own workload, or head to infron.ai to start.
Less orchestration. More innovation.
More Articles

Seedance 2.0 Real Human Pipeline
How to Build a Seedance 2.0 Real Human Pipeline With Reference Images

Seedance 2.0 Real Human Pipeline
How to Build a Seedance 2.0 Real Human Pipeline With Reference Images

From Image Model to Finished Clip
Seedance 2.0 Real Human Video API: Access, Setup, and Prompting

From Image Model to Finished Clip
Seedance 2.0 Real Human Video API: Access, Setup, and Prompting

Research
SEAR: Schema-Based Evaluation and Routing for LLM Gateways

Research
SEAR: Schema-Based Evaluation and Routing for LLM Gateways
Less orchestration.
More innovation.
Seamlessly integrate Infron with just a few lines of code and unlock unlimited AI power.
Less orchestration.
More innovation.
Seamlessly integrate Infron with just a few lines of code and unlock unlimited AI power.
Less orchestration.
More innovation.
Seamlessly integrate Infron with just a few lines of code and unlock unlimited AI power.