Sticky Cache

Why Prompt Cache Can Go Cold on the Same Provider?

Date

Author

Andrew Zheng

Why Prompt Cache Can Go Cold on the Same Provider?


On long-context calls, your prompt cache can quietly stop working even though you never leave the same provider: the cache read drops to zero, you pay full price again, and nothing in your code or prompt changed. We fix this by attaching a stable cache key for you, so the provider keeps sending your calls back to the machine that already holds your prompt. In a controlled test, that took the token-level cache hit rate from 0% to 98%, with no code change.

The rest of this post is why it happens and how the fix works. A textbook case: for 35 calls in a row, almost the whole prompt read from cache. Then on the 36th, the cache read dropped to zero and stayed there for dozens of calls. Same model, same prompt, same provider. Nothing changed.

Why a working cache goes cold

A provider like OpenAI does not run on one machine. It spreads your traffic across many, and a cached prompt lives on the one machine that built it. Only that machine can serve your prompt from cache; any other one reprocesses it from scratch. Under bursty, long-context load, a request can land on a different machine and read nothing, even though your prompt is byte-for-byte the same as the call before. The cache never disappeared. Your request just stopped landing on the machine that holds it.

We set the cache key for you

OpenAI gives you one lever for this: prompt_cache_key, a hint that tells the provider to send related calls to the same place. It works, but you have to know it exists, set it on every call, and pick the right granularity. Get it wrong, a fresh key on every request, and you spin up a brand-new cache every single time, which is worse than doing nothing.

So on cache-capable OpenAI and Azure paths, Infron attaches a stable key for you, derived from your account, so the provider keeps sending your related calls back to the machine that already holds your prompt. You change no code. If you already set your own key, we use it.

0% to 98%, with the key on

We isolated just this one lever: the same long prompt, sent repeatedly, once with no stable key and once with one.



Same prompt, same model. The only difference was the key. Without it, the prompt never recovered across six calls. With it, the call read 5,376 of its 5,486 tokens straight from cache.

One honest note: this is a small, controlled probe, a roughly 5,500-token prompt over six calls, built to isolate the key. It is not a long-context production benchmark, so do not read 98% as a number you will hit on every workload. For the full long-context, multi-step picture, see the cache cliff post.

What you still need to do

The key pins you to your cache; it cannot help if the prompt it is pinning changes every time. Keep the stable part of your prompt actually stable: standing instructions and context first, the per-request bits last, no timestamp or random ID slipped in front. (The across-providers version of the same idea is Sticky Routing.)

FAQ

Do I have to change my code?
No. On cache-capable OpenAI and Azure paths the key is attached for you. If you already set your own prompt_cache_key, we use it.

I run several different workloads on one account. Will one key hurt them?
The default key is scoped per customer and kept low-cardinality. If two unrelated high-throughput flows start competing, the key can be split by workflow or application scope so each keeps its own cache.

What happens to my prompt data?
The cache key we add is derived from your account, not computed from your prompt's text, and your prompt is passed through to the provider. [confirm with team: state the exact logging/retention policy and whether prompt content is stored or used for training.]

Does it guarantee every call hits cache?
No. A key makes the cache reliable to reach; it does not guarantee a hit at every context boundary. The goal is steady reads, not a perfect number.

The bottom line

A cold cache on the same provider is not a prompt problem, it is a routing one: your prompt was ready, your request just stopped landing on it. Infron keeps it landing on the machine that holds your cache, so the cheap path stays cheap and you do not have to think about it.

Less orchestration. More innovation.

Why Prompt Cache Can Go Cold on the Same Provider?


On long-context calls, your prompt cache can quietly stop working even though you never leave the same provider: the cache read drops to zero, you pay full price again, and nothing in your code or prompt changed. We fix this by attaching a stable cache key for you, so the provider keeps sending your calls back to the machine that already holds your prompt. In a controlled test, that took the token-level cache hit rate from 0% to 98%, with no code change.

The rest of this post is why it happens and how the fix works. A textbook case: for 35 calls in a row, almost the whole prompt read from cache. Then on the 36th, the cache read dropped to zero and stayed there for dozens of calls. Same model, same prompt, same provider. Nothing changed.

Why a working cache goes cold

A provider like OpenAI does not run on one machine. It spreads your traffic across many, and a cached prompt lives on the one machine that built it. Only that machine can serve your prompt from cache; any other one reprocesses it from scratch. Under bursty, long-context load, a request can land on a different machine and read nothing, even though your prompt is byte-for-byte the same as the call before. The cache never disappeared. Your request just stopped landing on the machine that holds it.

We set the cache key for you

OpenAI gives you one lever for this: prompt_cache_key, a hint that tells the provider to send related calls to the same place. It works, but you have to know it exists, set it on every call, and pick the right granularity. Get it wrong, a fresh key on every request, and you spin up a brand-new cache every single time, which is worse than doing nothing.

So on cache-capable OpenAI and Azure paths, Infron attaches a stable key for you, derived from your account, so the provider keeps sending your related calls back to the machine that already holds your prompt. You change no code. If you already set your own key, we use it.

0% to 98%, with the key on

We isolated just this one lever: the same long prompt, sent repeatedly, once with no stable key and once with one.


Same prompt, same model. The only difference was the key. Without it, the prompt never recovered across six calls. With it, the call read 5,376 of its 5,486 tokens straight from cache.

One honest note: this is a small, controlled probe, a roughly 5,500-token prompt over six calls, built to isolate the key. It is not a long-context production benchmark, so do not read 98% as a number you will hit on every workload. For the full long-context, multi-step picture, see the cache cliff post.

What you still need to do

The key pins you to your cache; it cannot help if the prompt it is pinning changes every time. Keep the stable part of your prompt actually stable: standing instructions and context first, the per-request bits last, no timestamp or random ID slipped in front. (The across-providers version of the same idea is Sticky Routing.)

FAQ

Do I have to change my code?
No. On cache-capable OpenAI and Azure paths the key is attached for you. If you already set your own prompt_cache_key, we use it.

I run several different workloads on one account. Will one key hurt them?
The default key is scoped per customer and kept low-cardinality. If two unrelated high-throughput flows start competing, the key can be split by workflow or application scope so each keeps its own cache.

What happens to my prompt data?
The cache key we add is derived from your account, not computed from your prompt's text, and your prompt is passed through to the provider. [confirm with team: state the exact logging/retention policy and whether prompt content is stored or used for training.]

Does it guarantee every call hits cache?
No. A key makes the cache reliable to reach; it does not guarantee a hit at every context boundary. The goal is steady reads, not a perfect number.

The bottom line

A cold cache on the same provider is not a prompt problem, it is a routing one: your prompt was ready, your request just stopped landing on it. Infron keeps it landing on the machine that holds your cache, so the cheap path stays cheap and you do not have to think about it.

Less orchestration. More innovation.

Less orchestration.
More innovation.

Seamlessly integrate Infron with just a few lines of code and unlock unlimited AI power.

Less orchestration.
More innovation.

Seamlessly integrate Infron with just a few lines of code and unlock unlimited AI power.

Less orchestration.
More innovation.

Seamlessly integrate Infron with just a few lines of code and unlock unlimited AI power.