Cache Cliff
Why Long-Context Agent Slows Down Mid-Task?


Date
Author
Andrew Zheng
Why Long-Context Agent Slows Down Mid-Task
On a long-context agent, your prompt cache can silently stop working partway through a task: the agent slows to a crawl, the bill climbs, and nothing in your code changed. Infron fixes this by keeping your requests landing on the machine that already holds your cache, so it stays warm. In a real retest, that turned a three-step stall back into two of three steps reusing cache, about 300,000 tokens saved.
The rest of this post is how that breaks and how the fix works. Here is the cliff itself, on a real 150k-token agent. Seven steps:
The first four steps read about 145k to 148k tokens from cache. Steps five, six, and seven read nothing. Three from-scratch reprocesses back to back, on an agent where each step waits on the one before it, so the whole back half of the workflow stalls.
Why the cliff happens
The easy thing is to blame the prompt. But the prompt never changed. What changed was where the request landed.
A provider this size does not run on one machine. It spreads traffic across many, and a cached prompt lives on the specific machine that built it. Only that machine can serve your prompt from cache; any other one reprocesses it from scratch. Under bursty, long-context load, a request can get routed to a machine that has never seen your prompt, and it starts over, even though your prompt is byte-for-byte identical to the call before it.
So the cache never disappeared. Your request just stopped landing where it lives. Keeping a long-context cache warm is really about keeping your requests on the machine that already holds it.
How we keep the cache warm
Infron keeps your related calls landing on the same warm cache. Two moves, both automatic, both needing zero code changes from you:
Keep the conversation on one provider. We route the steps of a workflow back to the same healthy provider and machine instead of letting them scatter. We shipped this as Sticky Routing.
Pin the request to its cache. On OpenAI and Azure, we attach a stable key derived from your account, which tells the provider to send your related calls back to the machine that already holds your prompt. We shipped this as Sticky Cache. If you already set your own key, we use it.
Together, your long-context calls keep landing on the warm cache instead of scattering to cold machines.
What the fix recovered
We replayed the exact failing sequence with this on, same order, same long context, same tools. The cliff broke:
The stretch that had been reprocessing everything from scratch went from zero of three steps reusing cache to two of three. That is 300,288 tokens, about two full 150k-token prompts, that no longer had to be recomputed at full price. Steps six and seven were back to reading about 150k from cache, just like the first four.
One honest note: this is a replay of one real sequence, not a benchmark across every workload. It shows the mechanism working, not a number you will hit every time.
The one miss we did not fix
We will not pretend it is 100%. Step five still missed before six and seven recovered. When your context crosses a certain size or its shape changes, the first request has to rebuild the cache entry before the next can reuse it. We shortened the cliff from three steps to one. We did not erase that single boundary miss, and on a long enough context nothing fully will.
The goal is not a perfect score. It is to allow the occasional one-off miss and never let it snowball into a sustained cliff.
The one part that stays with you: keep the stable part of your prompt actually stable. Standing instructions and context first, the per-request bits last, no timestamps or random IDs slipped in front. The key pins you to your cache; it cannot help if the prompt it is pinning changes every time.
FAQ
Is this the same thing as Sticky Routing?
No. Routing affinity is one of the two moves; the other is the automatic cache key. On a real agent you need both, plus an eye on the cache reads. We shipped routing affinity as Sticky Routing and the key as Sticky Cache.
Do I have to change anything in my code?
No. Both moves are on by default for cache-capable OpenAI and Azure paths. If you already pass your own prompt_cache_key, we keep it.
Will my long-context cache always hit now?
No, and we would not claim it. A boundary request can still rebuild the cache, the way step five did. The goal is no sustained cliffs, not a perfect score.
How do I tell if this is happening to me?
Watch your cache-read tokens across consecutive steps. A single dip is normal; a run of near-zero reads on a prompt that did not change is a cliff.
What this means past 100k tokens
Past 100k tokens, whether your cache keeps hitting matters as much as how fast the model runs, and it comes down to where your requests land, not just what they say. Infron keeps them landing on the warm cache, so your agent stays fast and you do not have to manage any of it.
Less orchestration. More innovation.
Why Long-Context Agent Slows Down Mid-Task
On a long-context agent, your prompt cache can silently stop working partway through a task: the agent slows to a crawl, the bill climbs, and nothing in your code changed. Infron fixes this by keeping your requests landing on the machine that already holds your cache, so it stays warm. In a real retest, that turned a three-step stall back into two of three steps reusing cache, about 300,000 tokens saved.
The rest of this post is how that breaks and how the fix works. Here is the cliff itself, on a real 150k-token agent. Seven steps:
The first four steps read about 145k to 148k tokens from cache. Steps five, six, and seven read nothing. Three from-scratch reprocesses back to back, on an agent where each step waits on the one before it, so the whole back half of the workflow stalls.
Why the cliff happens
The easy thing is to blame the prompt. But the prompt never changed. What changed was where the request landed.
A provider this size does not run on one machine. It spreads traffic across many, and a cached prompt lives on the specific machine that built it. Only that machine can serve your prompt from cache; any other one reprocesses it from scratch. Under bursty, long-context load, a request can get routed to a machine that has never seen your prompt, and it starts over, even though your prompt is byte-for-byte identical to the call before it.
So the cache never disappeared. Your request just stopped landing where it lives. Keeping a long-context cache warm is really about keeping your requests on the machine that already holds it.
How we keep the cache warm
Infron keeps your related calls landing on the same warm cache. Two moves, both automatic, both needing zero code changes from you:
Keep the conversation on one provider. We route the steps of a workflow back to the same healthy provider and machine instead of letting them scatter. We shipped this as Sticky Routing.
Pin the request to its cache. On OpenAI and Azure, we attach a stable key derived from your account, which tells the provider to send your related calls back to the machine that already holds your prompt. We shipped this as Sticky Cache. If you already set your own key, we use it.
Together, your long-context calls keep landing on the warm cache instead of scattering to cold machines.
What the fix recovered
We replayed the exact failing sequence with this on, same order, same long context, same tools. The cliff broke:
The stretch that had been reprocessing everything from scratch went from zero of three steps reusing cache to two of three. That is 300,288 tokens, about two full 150k-token prompts, that no longer had to be recomputed at full price. Steps six and seven were back to reading about 150k from cache, just like the first four.
One honest note: this is a replay of one real sequence, not a benchmark across every workload. It shows the mechanism working, not a number you will hit every time.
The one miss we did not fix
We will not pretend it is 100%. Step five still missed before six and seven recovered. When your context crosses a certain size or its shape changes, the first request has to rebuild the cache entry before the next can reuse it. We shortened the cliff from three steps to one. We did not erase that single boundary miss, and on a long enough context nothing fully will.
The goal is not a perfect score. It is to allow the occasional one-off miss and never let it snowball into a sustained cliff.
The one part that stays with you: keep the stable part of your prompt actually stable. Standing instructions and context first, the per-request bits last, no timestamps or random IDs slipped in front. The key pins you to your cache; it cannot help if the prompt it is pinning changes every time.
FAQ
Is this the same thing as Sticky Routing?
No. Routing affinity is one of the two moves; the other is the automatic cache key. On a real agent you need both, plus an eye on the cache reads. We shipped routing affinity as Sticky Routing and the key as Sticky Cache.
Do I have to change anything in my code?
No. Both moves are on by default for cache-capable OpenAI and Azure paths. If you already pass your own prompt_cache_key, we keep it.
Will my long-context cache always hit now?
No, and we would not claim it. A boundary request can still rebuild the cache, the way step five did. The goal is no sustained cliffs, not a perfect score.
How do I tell if this is happening to me?
Watch your cache-read tokens across consecutive steps. A single dip is normal; a run of near-zero reads on a prompt that did not change is a cliff.
What this means past 100k tokens
Past 100k tokens, whether your cache keeps hitting matters as much as how fast the model runs, and it comes down to where your requests land, not just what they say. Infron keeps them landing on the warm cache, so your agent stays fast and you do not have to manage any of it.
Less orchestration. More innovation.
More Articles

Sticky Cache
Why Prompt Cache Can Go Cold on the Same Provider?

Sticky Cache
Why Prompt Cache Can Go Cold on the Same Provider?

Sticky Routing
Sticky Routing: Your Cache Hit Rate Is a Routing Problem

Sticky Routing
Sticky Routing: Your Cache Hit Rate Is a Routing Problem

Seedance 2.0 Real Human Pipeline
How to Build a Seedance 2.0 Real Human Pipeline With Reference Images

Seedance 2.0 Real Human Pipeline
How to Build a Seedance 2.0 Real Human Pipeline With Reference Images
Less orchestration.
More innovation.
Seamlessly integrate Infron with just a few lines of code and unlock unlimited AI power.
Less orchestration.
More innovation.
Seamlessly integrate Infron with just a few lines of code and unlock unlimited AI power.
Less orchestration.
More innovation.
Seamlessly integrate Infron with just a few lines of code and unlock unlimited AI power.