# Prompt Caching

### What Is Prompt Caching

Prompt caching allows you to reduce overall request latency and cost for longer prompts that share identical content at the beginning of the prompt.

*"Prompt"* in this context refers to the input you send to the model as part of your chat completions request. Rather than reprocessing the same input tokens over and over again, the service retains a temporary cache of processed input-token computations to improve overall performance. Prompt caching has no impact on the output content returned in the model response beyond the reduction in latency and cost.

{% hint style="info" %}
Typically, cache read fees are about **10%-25%** of the original input cost, saving **up to 90%** of input costs.
{% endhint %}

### Best Practices for Prompt Caching <a href="#best-practices" id="best-practices"></a>

#### Maximizing Cache Hit Rate <a href="#maximizing-cache-hit-rate" id="maximizing-cache-hit-rate"></a>

Optimization Recommendations

* **Maintain Prefix Consistency**: Place static content at the beginning of prompts, variable content at the end
* **Use Breakpoints Wisely**: Set different cache breakpoints based on content update frequency
* **Avoid Minor Changes**: Ensure cached content remains completely consistent across multiple requests
* **Control Cache Time Window**: Initiate subsequent requests within 5 minutes to hit cache
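The first two recommendations amount to this: keep every static block byte-identical across requests and append only the variable part. A minimal sketch (the helper name and document text are placeholders, not part of any API):

```python
# Keep the static prefix byte-for-byte identical across requests so that
# consecutive calls within the 5-minute window can hit the cache.
STATIC_SYSTEM = "You are a support assistant. Policy document:\n<LONG POLICY TEXT>"

def build_messages(user_question: str) -> list[dict]:
    """Static content first (cacheable prefix), variable content last."""
    return [
        {"role": "system", "content": STATIC_SYSTEM},  # identical on every call
        {"role": "user", "content": user_question},    # only this part varies
    ]

m1 = build_messages("How do I reset my password?")
m2 = build_messages("What is the refund policy?")
# The shared prefix (everything before the user turn) is unchanged:
assert m1[0] == m2[0]
```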

**Extending Cache Time (1-hour TTL)**

If your request intervals may exceed 5 minutes, consider using 1-hour cache:

```json
{
    "type": "text",
    "text": "Long document content...",
    "cache_control": {
        "type": "ephemeral",
        "ttl": "1h"
    }
}
```

The write cost for the 1-hour cache is 2x the base fee (compared to 1.25x for the 5-minute cache), so it is only worthwhile for low-frequency but recurring call patterns.
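As a rough back-of-the-envelope check (multipliers taken from this section: 1.25x and 2x cache writes, 0.1x cache reads, all relative to the base input price), the 1-hour cache already pays off by the second call when intervals exceed 5 minutes:

```python
# Relative cost comparison for N calls whose intervals exceed 5 minutes
# but stay within 1 hour (base input price = 1.0).

def cost_5min(n_calls: int) -> float:
    # Every call misses the expired 5-minute cache and re-writes it at 1.25x.
    return 1.25 * n_calls

def cost_1hour(n_calls: int) -> float:
    # One 2x write, then 0.1x reads for the remaining calls.
    return 2.0 + 0.1 * (n_calls - 1)

# With such intervals, the 1-hour cache wins from the second call onward:
assert cost_1hour(2) < cost_5min(2)   # 2.1 vs 2.5
```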

#### Avoiding Common Pitfalls <a href="#avoiding-common-pitfalls" id="avoiding-common-pitfalls"></a>

Common Issues

1. **Cached Content Too Short**: Ensure cached content meets minimum token requirements
2. **Content Inconsistency**: Changes in JSON object key order invalidate the cache; languages with unordered dictionaries (such as Go and Swift) may serialize keys in a different order on each request
3. **Mixed Format Usage**: Using different formatting approaches for the same content
4. **Ignoring Cache Validity Period**: Cache becomes invalid after 5 minutes
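Pitfall 2 above is easy to guard against when you assemble payloads yourself: serialize with deterministic key order so the prefix stays byte-identical across requests. A minimal Python illustration:

```python
import json

# Same data, different insertion order — as could happen when payloads are
# built from unordered maps (e.g. in Go or Swift).
payload_a = {"b": 1, "a": 2}
payload_b = {"a": 2, "b": 1}

# Sorting keys makes the serialized form deterministic, so the cached
# prefix is identical byte-for-byte on every request.
canonical_a = json.dumps(payload_a, sort_keys=True, separators=(",", ":"))
canonical_b = json.dumps(payload_b, sort_keys=True, separators=(",", ":"))
assert canonical_a == canonical_b
```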

### Caching Types

Models supported by Infron offer two types of prompt caching mechanisms:

<table><thead><tr><th width="165.2176513671875">Caching Type</th><th width="559.982666015625">Usage Method</th></tr></thead><tbody><tr><td><strong>Implicit Caching</strong></td><td>No configuration needed, <code>automatically managed by model provider</code></td></tr><tr><td><strong>Explicit Caching</strong></td><td>Requires <code>cache_control</code> parameter</td></tr></tbody></table>

#### Implicit Caching <a href="#type-1-implicit-caching" id="type-1-implicit-caching"></a>

The following model providers offer implicit, automatic prompt caching: no special parameters are required in requests, and the provider automatically detects and caches reusable content.

| Model Provider | Official Documentation                                                              | Quick Start                                |
| -------------- | ----------------------------------------------------------------------------------- | ------------------------------------------ |
| **OpenAI**     | [Prompt Caching](https://platform.openai.com/docs/guides/prompt-caching)            | [#openai](#openai "mention")               |
| **DeepSeek**   | [Prompt Caching](https://api-docs.deepseek.com/guides/kv_cache)                     |                                            |
| **xAI**        | [Prompt Caching](https://docs.x.ai/docs/models#models-and-pricing)                  | [#grok](#grok "mention")                   |
| **Google**     | [Prompt Caching](https://ai.google.dev/gemini-api/docs/caching)                     | [#google-gemini](#google-gemini "mention") |
| **Alibaba**    | [Prompt Caching](https://www.alibabacloud.com/help/en/model-studio/context-cache)   |                                            |
| **MoonshotAI** | [Prompt Caching](https://platform.moonshot.ai/old/caching.en-US#request-parameters) |                                            |
| **Z.AI**       | [Prompt Caching](https://docs.z.ai/guides/capabilities/cache)                       |                                            |

💡 Optimization Recommendations

To maximize cache hit rate, follow these best practices:

1. **Static-to-Dynamic Ordering**: Place stable, reusable content (such as system instructions, few-shot examples, document context) at the beginning of the messages array
2. **Variable Content at End**: Place variable, request-specific content (such as current user question, dynamic data) at the end of the array
3. **Maintain Prefix Consistency**: Ensure cached content remains completely consistent across multiple requests (including spaces and punctuation)

#### Explicit Caching <a href="#type-2-explicit-caching" id="type-2-explicit-caching"></a>

Anthropic Claude and Qwen series models can explicitly specify caching strategies through specific parameters. This approach provides the finest-grained control but requires developers to actively manage their caching strategy.

| Model Provider       | Official Documentation                                                                                                      | Quick Start                                      |
| -------------------- | --------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------ |
| **Anthropic Claude** | [Prompt Caching](https://platform.claude.com/docs/en/build-with-claude/prompt-caching)                                      | [#anthropic-claude](#anthropic-claude "mention") |
| **Google**           | [Context caching overview](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-overview) | [#google-gemini](#google-gemini "mention")       |

**Caching Working Principle**

When you send a request with `cache_control` markers:

1. The system checks if a reusable cache prefix exists
2. If a matching cache is found, cached content is used (reducing cost)
3. If no match is found, the complete prompt is processed and a new cache entry is created

Cached content includes the complete prefix in the request: `tools` → `system` → `messages` (in this order), up to where `cache_control` is marked.
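Conceptually, the lookup behaves like keying on the serialized `tools` → `system` → `messages` prefix. The sketch below only illustrates that behavior; it is not the provider's actual implementation:

```python
import hashlib
import json

# Illustrative model of the lookup described above: the cache key covers
# tools -> system -> messages, in that order, up to the breakpoint.
def cache_key(tools, system, messages_prefix) -> str:
    blob = json.dumps([tools, system, messages_prefix], sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

cache: dict[str, str] = {}

def handle_request(tools, system, messages_prefix) -> str:
    key = cache_key(tools, system, messages_prefix)
    if key in cache:
        return "cache hit"            # prefix reused, billed at the read rate
    cache[key] = "processed prefix"   # full processing, billed at the write rate
    return "cache miss"

tools = [{"name": "search"}]
assert handle_request(tools, "sys", ["doc"]) == "cache miss"
assert handle_request(tools, "sys", ["doc"]) == "cache hit"
```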

**Automatic Prefix Check**

You only need to add a cache breakpoint at the end of static content, and the system will automatically check approximately the preceding 20 content blocks for reusable cache boundaries. If the prompt contains more than 20 content blocks, consider adding additional `cache_control` breakpoints to ensure all content can be cached.

### Getting Started <a href="#getting-started" id="getting-started"></a>

#### Anthropic Claude

**Minimum Cache Length**

Minimum cacheable token count for different models:

<table><thead><tr><th width="367.7137451171875">Model Series</th><th>Minimum Cache Tokens</th></tr></thead><tbody><tr><td>Claude Opus 4.1/4</td><td>1024 tokens</td></tr><tr><td>Claude Haiku 3.5</td><td>2048 tokens</td></tr><tr><td>Claude Sonnet 4.5/4/3.7</td><td>1024 tokens</td></tr></tbody></table>

**Caching Price**

* **Cache writes**: charged at 1.25x the price of the original input pricing
* **Cache reads**: charged at 0.1x the price of the original input pricing

**Cache Breakpoint Count**

Prompt caching with Anthropic requires the use of `cache_control` breakpoints. There is a limit of `4 breakpoints`, and the cache expires within `5 minutes`. It is therefore recommended to reserve cache breakpoints for large bodies of text, such as character cards, CSV data, RAG data, and book chapters. There is also a minimum prompt size of `1024 tokens`.

[Click here to read more about Anthropic prompt caching and its limitations.](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching)

The `cache_control` breakpoint can only be inserted into the text part of a multipart message. Prompts shorter than the minimum token count will not be cached even if marked with `cache_control`. Requests will be processed normally but no cache will be created.

**Cache Validity Period**

* **Default TTL**: 5 minutes
* **Extended TTL**: 1 hour (requires additional fee)

The cache refreshes automatically with each use at no additional cost.

**System message caching example:**

```json
{
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a historian studying the fall of the Roman Empire. You know the following book very well:"
        },
        {
          "type": "text",
          "text": "HUGE TEXT BODY",
          "cache_control": {
            "type": "ephemeral"
          }
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What triggered the collapse?"
        }
      ]
    }
  ]
}
```

**User message caching example:**

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Given the book below:"
        },
        {
          "type": "text",
          "text": "HUGE TEXT BODY",
          "cache_control": {
            "type": "ephemeral"
          }
        },
        {
          "type": "text",
          "text": "Name all the characters in the above book"
        }
      ]
    }
  ]
}
```

**Basic Usage: Caching System Prompts**

{% tabs %}
{% tab title="Python" %}

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.onerouter.pro/v1",
    api_key="<API_KEY>",
)

# First request - create cache
response = client.chat.completions.create(
    model="claude-sonnet-4-5@20250929", 
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant specializing in literary analysis. Your goal is to provide insightful commentary on themes, characters, and writing style.\n"
                },
                {
                    "type": "text",
                    "text": "<Complete content of Pride and Prejudice>",
                    "cache_control": {"type": "ephemeral"} 
                }
            ]
        },
        {
            "role": "user",
            "content": "Analyze the main themes of Pride and Prejudice."
        }
    ]
)

print(response.choices[0].message.content)

# Second request - cache hit
response = client.chat.completions.create(
    model="claude-sonnet-4-5@20250929",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant specializing in literary analysis. Your goal is to provide insightful commentary on themes, characters, and writing style.\n"
                },
                {
                    "type": "text",
                    "text": "<Complete content of Pride and Prejudice>",
                    "cache_control": {"type": "ephemeral"}  # same content hits the cache
                }
            ]
        },
        {
            "role": "user",
            "content": "Who are the main characters in this book?"  # only the question differs
        }
    ]
)

print(response.choices[0].message.content)
```

{% endtab %}
{% endtabs %}

**Advanced Usage: Caching Tool Definitions**

When your application uses many tools, caching tool definitions can significantly reduce costs:

{% tabs %}
{% tab title="Python" %}

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.onerouter.pro/v1",
    api_key="<API_KEY>",
)

response = client.chat.completions.create(
    model="claude-sonnet-4-5@20250929",
    tools=[ 
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a specified location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City and province, e.g. Beijing, Beijing"
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "Temperature unit"
                        }
                    },
                    "required": ["location"]
                }
            }
        },
        # Can define more tools...
        {
            "type": "function",
            "function": {
                "name": "get_time",
                "description": "Get current time for a specified timezone",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "timezone": {
                            "type": "string",
                            "description": "IANA timezone name, e.g. Asia/Shanghai"
                        }
                    },
                    "required": ["timezone"]
                }
            },
            "cache_control": {"type": "ephemeral"}  # mark the cache on the last tool
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "What's the current weather and time in Beijing?"
        }
    ]
)

print(response.choices[0].message)
```

{% endtab %}
{% endtabs %}

When you add a `cache_control` marker to the last tool definition, the system automatically caches all tool definitions as a complete prefix.

**Advanced Usage: Caching Conversation History**

In long conversation scenarios, you can cache the entire conversation history:

{% tabs %}
{% tab title="Python" %}

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.onerouter.pro/v1",
    api_key="<API_KEY>",
)

response = client.chat.completions.create(
    model="claude-sonnet-4-5@20250929",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "...long system prompt",
                    "cache_control": {"type": "ephemeral"}  # cache the system prompt
                }
            ]
        },
        # Previous conversation history
        {
            "role": "user",
            "content": "Hello, can you tell me more about the solar system?"
        },
        {
            "role": "assistant",
            "content": "Of course! The solar system is a collection of celestial bodies orbiting the sun. It consists of eight planets, numerous satellites, asteroids, comets and other celestial objects..."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Great."
                },
                {
                    "type": "text",
                    "text": "Tell me more about Mars.",
                    "cache_control": {"type": "ephemeral"}  # cache all conversation up to here
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)
```

{% endtab %}
{% endtabs %}

Add `cache_control` to the last message of each conversation round, and the system will automatically find and reuse the longest matching prefix from previously cached content. Even content previously marked with `cache_control` will hit the cache and refresh its validity period, as long as it is used within 5 minutes.

**Advanced Usage: Multi-Breakpoint Combination**

When you have multiple content segments with different update frequencies, you can use multiple cache breakpoints:

{% tabs %}
{% tab title="Python" %}

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.onerouter.pro/v1",
    api_key="<API_KEY>",
)

response = client.chat.completions.create(
    model="claude-sonnet-4-5@20250929",
    tools=[ 
        # Tool definitions (rarely change)
        {
            "type": "function",
            "function": {
                "name": "search_documents",
                "description": "Search knowledge base",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string", "description": "Search query"}
                    },
                    "required": ["query"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "get_document",
                "description": "Retrieve document by ID",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "doc_id": {"type": "string", "description": "Document ID"}
                    },
                    "required": ["doc_id"]
                }
            },
            "cache_control": {"type": "ephemeral"}  # breakpoint 1: tool definitions
        }
    ],
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a research assistant with access to a document knowledge base.\n\n# Instructions\n- Always search for relevant documents first\n- Provide citations...",
                    "cache_control": {"type": "ephemeral"}  # breakpoint 2: system instructions
                },
                {
                    "type": "text",
                    "text": "# Knowledge Base Context\n\nHere are the relevant documents for this conversation:\n\n## Document 1: Solar System Overview\nThe solar system consists of the sun and all celestial bodies orbiting it...\n\n## Document 2: Planetary Characteristics\nEach planet has unique characteristics...",
                    "cache_control": {"type": "ephemeral"}  # breakpoint 3: RAG documents
                }
            ]
        },
        {
            "role": "user",
            "content": "Can you search for information about Mars rovers?"
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "tool_use",
                    "id": "tool_1",
                    "name": "search_documents",
                    "input": {"query": "Mars rovers"}
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": "tool_1",
                    "content": "Found 3 relevant documents..."
                }
            ]
        },
        {
            "role": "assistant",
            "content": "I found 3 relevant documents. Let me get more details from the Mars exploration document."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Okay, please tell me specific information about the Perseverance rover.",
                    "cache_control": {"type": "ephemeral"}  # breakpoint 4: conversation history
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)
```

{% endtab %}
{% endtabs %}

Using multiple cache breakpoints allows content with different update frequencies to be cached independently:

* **Breakpoint 1**: Tool definitions (almost never change)
* **Breakpoint 2**: System instructions (rarely change)
* **Breakpoint 3**: RAG documents (may update daily)
* **Breakpoint 4**: Conversation history (changes every round)

When only the conversation history is updated, the cache for the first three breakpoints remains valid, maximizing cost savings.

**What Invalidates Cache**

The following operations will invalidate part or all of the cache:

<table><thead><tr><th>Changed Content</th><th width="122.83740234375">Tool Cache</th><th width="101.639404296875">System Cache</th><th width="100.879150390625">Message Cache</th><th>Impact Description</th></tr></thead><tbody><tr><td><strong>Tool Definitions</strong></td><td>✘</td><td>✘</td><td>✘</td><td>Modifying tool definitions invalidates entire cache</td></tr><tr><td><strong>System Prompt</strong></td><td>✓</td><td>✘</td><td>✘</td><td>Modifying system prompt invalidates system and message cache</td></tr><tr><td><strong>tool_choice Parameter</strong></td><td>✓</td><td>✓</td><td>✘</td><td>Only affects message cache</td></tr><tr><td><strong>Add/Remove Images</strong></td><td>✓</td><td>✓</td><td>✘</td><td>Only affects message cache</td></tr></tbody></table>

#### OpenAI

Caching prices:

* **Cache writes**: no cost
* **Cache reads**: charged at `0.1x ~ 0.5x` the price of the original input pricing

[Click here to view OpenAI's cache pricing per model.](https://platform.openai.com/docs/pricing)

Prompt caching with OpenAI is automated and does not require any additional configuration. There is a minimum prompt size of `1024 tokens`.

[Click here to read more about OpenAI prompt caching and its limitations.](https://platform.openai.com/docs/guides/prompt-caching)
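Since this caching is automatic, the only integration work is ordering the prompt well and, optionally, checking how many tokens were served from cache. The sketch below follows the other examples in this guide; the model name and context text are placeholders, and the `cached_tokens` field is read defensively in case it is absent from a response:

```python
LONG_CONTEXT = "<at least 1024 tokens of stable instructions and context>"

def build_request(question: str) -> dict:
    # Static prefix first so repeat requests can hit the automatic cache.
    return {
        "model": "gpt-4o",  # placeholder model name
        "messages": [
            {"role": "system", "content": LONG_CONTEXT},
            {"role": "user", "content": question},
        ],
    }

def cached_tokens(response) -> int:
    # OpenAI-compatible responses report cache hits under
    # usage.prompt_tokens_details.cached_tokens (when available).
    details = getattr(response.usage, "prompt_tokens_details", None)
    return getattr(details, "cached_tokens", 0) or 0

if __name__ == "__main__":
    from openai import OpenAI

    client = OpenAI(base_url="https://llm.onerouter.pro/v1", api_key="<API_KEY>")
    response = client.chat.completions.create(**build_request("Summarize the context."))
    print("tokens served from cache:", cached_tokens(response))
```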

#### Grok

Caching prices:

* **Cache writes**: no cost
* **Cache reads**: charged at 0.25x the price of the original input pricing

[Click here to view Grok's cache pricing per model.](https://docs.x.ai/docs/models#models-and-pricing)

Prompt caching with Grok is automated and does not require any additional configuration.

#### Google Gemini <a href="#google-gemini" id="google-gemini"></a>

**Implicit Caching**

Gemini 2.5 Pro and 2.5 Flash models now support **implicit caching**, providing automatic caching functionality similar to OpenAI's. Implicit caching works seamlessly: no manual setup or additional `cache_control` breakpoints are required.

Pricing Changes:

* No cache write or storage costs.
* Cached tokens are charged at `0.1x the price` of original input token cost.

Note that the TTL averages 3-5 minutes, but will vary. Requests must contain a minimum of 1024 tokens for Gemini 2.5 Flash, or 2048 tokens for Gemini 2.5 Pro, to be eligible for caching.

[Official announcement from Google](https://developers.googleblog.com/en/gemini-2-5-models-now-support-implicit-caching/)

{% hint style="info" %}
To maximize implicit cache hits, keep the initial portion of your message arrays consistent between requests. Push variations (such as user questions or dynamic context elements) toward the end of your prompt/requests.
{% endhint %}

**Explicit Caching**

Gemini caching in Infron requires you to insert `cache_control` breakpoints explicitly within message content, similar to Anthropic and Qwen. We recommend using caching primarily for large content pieces (such as CSV files, lengthy character cards, retrieval augmented generation (RAG) data, or extensive textual sources).

{% hint style="info" %}
There is no limit on the number of `cache_control` breakpoints you can include in your request. Infron will use `only the last breakpoint` for Gemini caching. Including multiple breakpoints is safe and can help maintain compatibility with Anthropic, but only `the final one` is used for Gemini.
{% endhint %}

**Cache Validity Period**

* **Default TTL**: 5 minutes
* **Extended TTL**: 1 hour (requires additional fee)

**Caching Working Principle**

When you send a request with `cache_control` markers:

1. The system checks if a reusable cache prefix exists
2. If a matching cache is found, cached content is used (reducing cost)
3. If no match is found, the complete prompt is processed and a new cache entry is created

Cached content includes the complete prefix in the request: `tools` → `system` → `messages` (in this order), up to where `cache_control` is marked.

**Examples**

**System Message Caching Example**

{% tabs %}
{% tab title="Python" %}

```python
import requests
import json

response = requests.post(
  url="https://llm.onerouter.pro/v1/chat/completions",
  headers={
    "Authorization": "Bearer YOUR-API-KEY",
    "Content-Type": "application/json"
  },
  data=json.dumps({
    "model": "google/gemini-2.5-flash", 
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a historian studying the fall of the Roman Empire. Below is an extensive reference book:"
                },
                {
                    "type": "text",
                    "text": "HUGE TEXT BODY HERE",
                    "cache_control": {
                        "type": "ephemeral",
                        "ttl": 300
                    }
                }
            ]
        },{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What triggered the collapse?"
                }
            ]
        }
    ],
    "provider": {
        "order": ["google-vertex"]
    },
    "max_tokens": 1024
  })
)
print(response.json())
```

{% endtab %}
{% endtabs %}

**User Message Caching Example**

{% tabs %}
{% tab title="Python" %}

```python
import requests
import json

response = requests.post(
  url="https://llm.onerouter.pro/v1/chat/completions",
  headers={
    "Authorization": "Bearer YOUR-API-KEY",
    "Content-Type": "application/json"
  },
  data=json.dumps({
    "model": "google/gemini-2.5-flash", 
    "messages": [
        {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Based on the book text below:"
            },
            {
                "type": "text",
                "text": "HUGE TEXT BODY HERE",
                "cache_control": {
                    "type": "ephemeral",
                    "ttl": 300
                }
            },
            {
                "type": "text",
                "text": "List all main characters mentioned in the text above."
            }
        ]
        }
    ],
    "provider": {
        "order": ["google-vertex"]
    },
    "max_tokens": 1024
  })
)
print(response.json())
```

{% endtab %}
{% endtabs %}

**History User Message Caching Example**

{% tabs %}
{% tab title="Python" %}

```python
import requests
import json
import time


long_prompt = """
### Prompt Title:
**The Shattered Continent — A Comprehensive World‑Building and Narrative Instruction**

---

You are to imagine and describe, in vivid, cinematic, and intellectually coherent detail, a vast fictional world known as *Aelyndra*, a continent that was once united under luminous orders of scholars, mages, engineers, and philosophers, but is now fragmented by centuries of arcane wars, plagues, and ideological rifts. The purpose of this prompt is to generate an elaborate tapestry of interlocking stories, characters, cultures, technologies, and metaphysical mysteries. Every generated text based on this prompt should feel immersive, multi‑layered, and historically grounded within its own logic. The tone should balance grounded realism with mythic resonance, evoking both awe and melancholy.

Below are detailed aspects, lore structures, stylistic expectations, sensory directions, metaphysical principles, and narrative possibilities you should elaborate upon.

---

#### 1. **Historical Overview**

Describe a timeline spanning thousands of years, from the primordial formation of Aelyndra to its contemporary fractured age.
Include eras such as **The Genesis Fires**, when the first luminous beings descended and shaped the continents; **The Chain‑Forge Epoch**, when mortal civilizations learned to harness resonant metals that could channel thought; **The Concordant Millennia**, the golden age of united knowledge; and **The Sundering**, a cataclysmic fracturing that split both geography and the collective memory of humankind.

Every historical event must feel internally consistent: show cause and consequence. For instance, the loss of one coastal city’s library should have ripples across distant temples and later generations’ philosophies. The tone should be reflective and slightly tragic, as though the chronicler recounts a glorious but forgotten lineage.

---

#### 2. **Geography and Environment**

Construct a geography of striking variety and symbolic resonance — volcanic shores, glass deserts, cities built within petrified forests, islands that drift through the mist like sleeping giants. For each region, define climate, flora, fauna, and the materials used in architecture. The
**Amber Steppes**, for example, might shimmer with grasses that refract sunlight into living colors, while **The Hollow Expanse** could be a wasteland where the air hums with residual magic from ancient wars.

Integrate ecological logic: how trade winds, oceanic currents, or tectonic activity affect culture and migration. Mountains may separate kingdoms physically but rivers and undersea tunnels connect them secretly. Give attention to sensory cues — the smell of resin in mountain villages, the sound of iron insects ticking in the deserts at twilight, the taste of mineral dust in air after storms.

---

#### 3. **Peoples and Cultures**

List multiple civilizations and describe how they diverged culturally, linguistically, and spiritually after the Sundering. Avoid simplistic binaries of good versus evil; each culture must hold a mixture of beauty, cruelty, and contradiction.
For instance:

- **The Dathenians**, descendants of former astronomer‑priests, now live beneath great dome observatories shattered by meteor showers; their language evolved around the concept of cyclical silence, and their rituals involve rebuilding and unbuilding stone circles.
- **The Marquorians**, sea‑bound artisans who sculpt coral into living fortresses; they treat navigation as a spiritual rite, believing each voyage mirrors the journey of the soul beyond death.
- **The Oruvian Clans**, desert dwellers who master the remnants of sonic engineering, forging instruments that can blast sandstorms into harmonious patterns visible for miles.

Each description should anchor political systems, economic practices, mythological origins, and interpersonal customs: how they greet each other, mourn their dead, or repair their tools. Include the etymology of cultural names, food habits, clothing textures, and color symbolism.

---

#### 4. **Religions, Philosophy, and Magic Systems**

Magic in this world arises not from childish incantation but from **resonant cognition**, a symbiotic interaction between thought, mineral vibration, and light frequency. Those talented in the craft can bind emotion into material forms — forging “sentient metals” that remember their wielders’ fears or hopes. Magic is thus both scientific and spiritual, blurring boundaries between psychology, physics, and theology.

Develop diverse schools of philosophy debating ethical use of such power:

- The **Solace Theorists** argue that controlling resonance is an act of compassion — to heal broken matter.
- The **Iron Aesthetes** consider creation a cruel necessity, insisting that only destruction brings cosmic symmetry.
- The **Children of Echo** worship silence and claim that every magical act pollutes the universal rhythm.

In your generated text, treat these doctrines not merely as background flavor but as intellectual frameworks shaping language, law, art, and personal relationships.

---

#### 5. **Technology and Architecture**

Aelyndra’s civilizations developed hybrid science combining clockwork engineering, bio‑alchemy, and energy crystallization. Describe towers powered by luminous conduits that pulse in rhythm with heartbeat sensors, skyships navigated by harmonic crystals, temples where gears and vines intertwine as living mechanisms. Highlight how technology evolves according to resource distribution: coastal regions rely on fungal luminescence, while mountain regimes mine “thought‑ore.” The interplay of invention and superstition drives narrative tension: progress both liberates and curses.

Architectural imagery should emphasize scale and mood: narrow alleys carved into obsidian cliffs, floating monasteries tethered by cables of woven gold, and markets illuminated by singing light globes whose hum forms improvised melodies as people pass.

---

#### 6. **Narrative Archetypes**

Encourage stories about rediscovery, reconciliation, and ambiguity rather than simple triumph. Possible archetypes include:

- The **Historian Without Records**, traveling to piece together memories hidden in ruins.
- The **Exile Engineer**, carrying an artifact that generates voices of those it once killed.
- The **Dream Cartographer**, mapping emotions that alter geography in real time.
- The **Queen of Mirrors**, who governs through reflection because her actual body has dissolved into glass.

All characters must confront both external danger and metaphysical uncertainty. Their heroism is subtle — the courage to remember or forgive rather than to conquer.

---

#### 7. **Sensory and Emotional Atmosphere**

When generating scenes, prioritize evocative sensory layering:
- Sound: the low chime of suspended glass, whispering wind through broken halls, distant chanting over water.
- Sight: refracted twilight on metallic dunes, murals shimmering with bioluminescent ink.
- Texture: the contrast between rusted ruins and the softness of moss growing over them.
- Emotion: nostalgia, intellectual awe, gentle melancholy, quiet rebellion.

Narrative pacing should oscillate between stillness and momentum — slow revelation punctuated by flashes of insight or dread. Readers should feel as though they’re remembering a place they never visited.

---

#### 8. **Metaphysics and Ethics**

Articulate the metaphysical principle that the universe is a dialogue between **Memory** and **Entropy**. Every act of creation defies forgetting but accelerates decay elsewhere. As a result, civilizations in Aelyndra constantly face moral trade‑offs: Should they preserve ancient resonance‑engines at the cost of ecological balance, or let their light fade naturally? These philosophical dilemmas should infuse even ordinary conversations.

Include thought experiments, fragmentary proverbs, and paradoxical hymns: “What we rebuild, we erase differently.” Avoid clichés of prophecy; instead, show how destiny might itself be a side effect of collective guilt or yearning.

---

#### 9. **Language, Names, and Symbol Codes**

Build naming conventions that suggest linguistic diversity — alternating consonant clusters and harmonic vowels, or syntax where verbs precede emotion markers. Indicate how written language has evolved: maybe modern scribes use glowing ink, and every sentence emits faint music depending on its meaning. Each culture’s writing system reveals worldview: linear scripts for materialists, spiral glyphs for those who worship recursion.
Allow symbols like tri‑circles, mirrored sigils, or broken hexagrams to recur as motifs linking spirituality and mathematics.

---

#### 10. **Storytelling Mode and Style**

When generating prose or dialogue from this prompt:

- **Tone:** intellectual lyricism blended with tactile realism.
- **Point of View:** optional mixture of omniscient chronicler, first‑person witness, or mosaic of journal entries.
- **Pacing:** start with environment or philosophical reflection before advancing plot.
- **Voice:** maintain rich vocabulary and musical rhythm, avoiding modern slang.
- **Conflict Portrayal:** inner struggle takes precedence; physical battles should mirror psychological or ideological clashes.

Comparison points: the emotional gravity of high epic poetry, the forensic detail of travelogues, the mournful tone of lost civilizations.

---

#### 11. **Prompts for Expansion**

After establishing the world, encourage detailed responses to sub‑prompts such as:

1. Describe a festival in a ruined city rebuilt with living vines; include sensory details, songs, rituals, and philosophical conversations heard between drunk scholars.
2. Write letters exchanged between two philosophers debating whether machines can dream. Use subtle metaphors instead of direct exposition.
3. Paint a panoramic view of the continent from orbit after centuries of regrowth — show what remains luminous when human memory fades.
4. Chronicle a court where sentences are sung rather than spoken, and justice is determined by the harmony of the choir’s tone.
5. Depict children discovering an artifact that records emotions. Show how it alters their personal identities.

Each of these sub‑prompts must align with the metaphysical and cultural logic above.

---

#### 12. **Ethos of Generation**

When using this master prompt, emphasize imagination rooted in coherence. Every fantastical element should follow some rationale — whether physical, symbolic, or emotional. Avoid default tropes (knights, elves, dragons) unless reinvented with purpose. Portray diversity of belief and appearance; suggest realistic emotions amid mythic context. The world should feel *earned*, as though history genuinely unfolded there.

---

#### 13. **Purpose and Audience**

This is designed for creators seeking an inexhaustible setting for stories, poems, games, or conceptual art. It invites introspection, exploration of morality, and appreciation for transient beauty. Its ideal audience values depth over spectacle, meaning over mere ornament.

---

#### 14. **Instruction to the AI (if applicable)**

When generating content from this prompt, the AI should:

- Adopt a deliberate, reflective tone.
- Prioritize atmosphere and reasoning before action.
- Honor contradictions without resolving them.
- Provide continuity: refer back to established geography and philosophies.
- Avoid repetition, clichés, or superficial heroism.
- Strive for prose that reads like the memory of a dream encoded into scripture.

Output should feel semi‑academic yet emotionally resonant — a mixture of archived myth and eyewitness recollection.

Please limit your output to 32 characters or fewer.

---

### End of Prompt
""".strip()

messages_data = {
    "model": "google/gemini-2.5-flash", 
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Based on the book text below:"
                },
                {
                    "type": "text",
                    "text": f"{long_prompt}",
                    "cache_control": {
                        "type": "ephemeral"
                    }
                },
                {
                    "type": "text",
                    "text": "List all main characters mentioned in the text above."
                }
            ]
        },
    ],
    "max_tokens": 1024,
    "usage": {
        "include": True
    },
    "reasoning": {
        "effort": "none"
    }
}

def chat_with_explicit_caching():
    response = requests.post(
        url="https://llm.onerouter.pro/v1/chat/completions",
        headers={
            "Authorization": "Bearer YOUR-API-KEY",
            "Content-Type": "application/json"
        },
        data=json.dumps(messages_data)
    )

    res = response.json()
    print(res)
    return res


if __name__ == '__main__':
    # First request: pays the normal input price and writes the cache
    explicit_caching_res = chat_with_explicit_caching()

    # Append the completion so the second request continues the conversation
    messages_data['messages'].append(
        {
            "role": explicit_caching_res['choices'][0]['message']['role'],
            "content": [
                {
                    "type": "text",
                    "text": explicit_caching_res['choices'][0]['message']['content'],
                }
            ]
        }
    )

    # Second request: the cached prefix is billed at the cache-read rate
    explicit_caching_res = chat_with_explicit_caching()
    prompt_cache_read_cost = explicit_caching_res['cost_details']['prompt_cache_read_cost']
    cached_tokens = explicit_caching_res['usage']['prompt_tokens_details']['cached_tokens']
    print(f"prompt_cache_read_cost = {prompt_cache_read_cost}")
    print(f"cached_tokens = {cached_tokens}")
    print(f"google/gemini-2.5-flash - Cache Read / M tokens =  {prompt_cache_read_cost / cached_tokens * 1000000}")

```

{% endtab %}
{% endtabs %}

Response example:

```
prompt_cache_read_cost = 6.117e-05
cached_tokens = 2039
google/gemini-2.5-flash - Cache Read / M tokens =  0.03
```
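The last line of the output can be reproduced directly from the two values above it: divide the cache-read cost billed for the request by the number of cached tokens, then scale to one million tokens.

```python
prompt_cache_read_cost = 6.117e-05   # USD billed for reading the cached prefix
cached_tokens = 2039                 # tokens served from the cache

# Normalize the per-request read cost to a per-million-token rate
per_million = prompt_cache_read_cost / cached_tokens * 1_000_000
print(round(per_million, 2))  # 0.03 USD per million cached tokens
```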

<figure><img src="https://3822312837-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FZ9C9AjT7j46HAcQrOVWw%2Fuploads%2FY15ISnEzedqsoW9hdwLC%2Fimage.png?alt=media&#x26;token=8d7ab1ae-9a63-4239-a1ae-51324ce8d89d" alt=""><figcaption></figcaption></figure>
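As noted in the best-practices section above, writing a 1-hour cache costs 2x the base input price versus 1.25x for the 5-minute cache. A rough back-of-the-envelope sketch shows when the longer TTL pays off for requests spaced more than five minutes apart. The 0.1x cache-read multiplier below is an assumption for illustration; actual read rates vary by model (typically 10%-25% of the base input cost).

```python
# Compare total input-cost multipliers for n requests whose intervals
# exceed 5 minutes, so the 5-minute cache never hits.
# Assumed multipliers: 5m write = 1.25x, 1h write = 2x, cache read = 0.1x.

def five_min_cost(n):
    # Every request misses and re-writes the 5-minute cache
    return 1.25 * n

def one_hour_cost(n):
    # One cache write, then reads for the remaining requests within the hour
    return 2.0 + 0.1 * (n - 1)

for n in (1, 2, 5, 10):
    print(n, five_min_cost(n), one_hour_cost(n))
```

Under these assumptions the 1-hour cache is already cheaper from the second request onward whenever the 5-minute window would expire between calls.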

