A short guide to getting consistent prompt-cache hits from any client or agent harness (Pi, Claude Code, custom curl, SDKs, etc.).
The one thing to know
For Claude / Anthropic-style models, caching is opt-in at the API level, so whether it happens depends on your client. Pioneer forwards your request as-is and never adds cache markers for you. If the request does not include a cache_control marker, the stable prefix is not cached and you pay full input price on every turn. Some clients opt in for you — Claude Code, for example, uses Anthropic’s endpoints (which require cache_control), so it sends markers automatically. A bare custom client that sends plain-string prompts does not, and gets no caching.
Models that cache automatically (the OpenAI/GPT family) do so transparently and reject explicit markers — for those you don’t need to do anything, and you should not add cache_control.
| Model | What you do |
|---|
| Claude / Anthropic-style | Add cache_control markers (see below) |
| OpenAI / GPT family (auto-caching) | Nothing — caching is automatic |
| Other models | If unsure, add a marker; it’s honored where supported and ignored where the model auto-caches |
The reliable lever you control is cache_control on Claude models. The rest of this guide is about using it correctly.
How to mark a cacheable prefix
Send the stable part of your prompt as a content block with a cache_control marker. Everything up to and including the marked block becomes the cached prefix and is reused on later requests that share that exact prefix.
Markers are honored on every message-based endpoint and normalized identically:
POST /v1/chat/completions (OpenAI-compatible) — mark a system/user message content block.
POST /v1/messages (Anthropic-compatible) — mark a system block, message block, or tools block.
POST /v1/responses (OpenAI Responses) — mark a content block on an input message. The stable prefix is marked via a system/developer input message block (the plain instructions string has no place for a marker).
POST /inference (Pioneer native, decoder generation) — send a message content as a content-block array and mark a block.
POST /v1/completions (legacy text completions) sends a single opaque prompt string with no content blocks, so it has no place to attach a marker. Caching there depends entirely on the upstream’s automatic caching.
Example (/v1/chat/completions)
curl https://api.pioneer.ai/v1/chat/completions \
-H "Content-Type: application/json" \
-H "X-API-Key: $PIONEER_API_KEY" \
-d '{
"model": "claude-opus-4-6",
"messages": [
{
"role": "system",
"content": [
{
"type": "text",
"text": "Large stable system prompt or reusable context goes here.",
"cache_control": { "type": "ephemeral" }
}
]
},
{ "role": "user", "content": "Answer in one sentence: what is prompt caching?" }
],
"max_tokens": 512
}'
The system content must be a content-block array (not a plain string) so the marker has a block to attach to. A plain-string system prompt cannot carry a marker and will not be cached.
Where to put markers
You can mark up to 4 breakpoints per request. Put them on the parts that stay constant from one turn to the next, in prefix order:
- System prompt — your stable instructions / reusable context.
- Tool definitions — mark the tools block if you send a large, fixed tool set.
- A message boundary — mark the last message of a stable conversation prefix to cache the history up to that point in a multi-turn session.
Anything that changes every turn (the latest user question) should come after your last marker so the cached prefix stays identical and keeps hitting.
An optional ttl (e.g. { "type": "ephemeral", "ttl": "1h" }) controls the cache lifetime; omit it to use the provider’s default ephemeral window.
Minimum cacheable size
Caching only kicks in once the cacheable prefix is large enough — on the order of ~1K–4K tokens depending on the model. A short system prompt may fall below the threshold, in which case no cache entry is created even with a correct marker. If you expect caching but see no cache tokens, check that the marked prefix is genuinely large.
Configuring an agent harness
Most harnesses can emit this shape but need to be told to use Anthropic-style cache control for Pioneer’s Claude models. When configuring a custom Pioneer provider, make sure the client is set to:
- Send the system prompt as a content block with
cache_control (not a bare string), and
- Keep the stable instruction on
role: "system".
If your harness exposes a provider compatibility option for cache-control format, set it to the Anthropic style. If it only sends plain-string prompts with no cache_control, you will not get caching on Claude models even though the same harness caches fine on auto-caching models.
Verifying it works
Check the usage block in the response. Field names differ by endpoint:
/v1/chat/completions — reports cache reads under usage.prompt_tokens_details.cached_tokens.
/v1/messages — reports cache_read_input_tokens and cache_creation_input_tokens.
What to expect:
- First request with a new prefix: a non-zero cache write/creation count.
- Later requests reusing that prefix: a non-zero cache read count and a much lower uncached input count.
- All zeros means the marker was missing, the prefix was below the minimum size, the prefix changed between turns, or the model auto-caches (savings still apply but aren’t reported as cache tokens).
Some harnesses normalize these into their own fields (e.g. cacheRead / cacheWrite in their logs) — same numbers, different names.
How to read it back
Token usage on every response splits input tokens by cache status:
| Field | Meaning |
|---|
prompt_tokens | Non-cached input tokens |
cache_read_tokens | Input tokens served from cache (discounted) |
cache_write_tokens | Input tokens written to cache this request |
completion_tokens | Output tokens |
total_tokens | Sum of the four above |
Billing
Cached input is cheaper than fresh input. Rates are relative to a model’s input price:
| Provider | Cache read | Cache write |
|---|
| Claude / Opus | 0.1× input | 1.25× input |
| GPT-4 family | 0.5× input | billed at input rate |
| GPT-5 family | 0.1× input | billed at input rate |
The first request that populates the cache pays the write rate on those tokens; subsequent requests that hit the cache pay the lower read rate. Caches are short-lived, so the savings come from sending similar prompts close together.
Caching charges are visible in Settings → Credits and are displayed broken down by model or by individual request.
For most Anthropic models, the system prompt must be at least 4096 tokens for caching to activate. Cache entries expire after 5 minutes of inactivity.
Common pitfalls
- No
cache_control on a Claude request → no caching, full price every turn. This is the most common cause of a sudden cost jump.
- System prompt sent as a plain string → nothing to attach the marker to.
- Prefix too small → below the minimum cacheable size, so no entry is made.
- The cached prefix changes between turns (e.g. a timestamp early in the system prompt) → the cache can’t be reused. Keep the marked prefix byte-for-byte stable.
- Marking on
/v1/completions → the legacy text endpoint has no content blocks; use any message-based endpoint instead.
- On
/v1/responses, marking the instructions field → it’s a plain string and can’t carry a marker; put the stable prefix in a system/developer input message block instead.
- Adding
cache_control to an auto-caching model → ignored or rejected; let it cache on its own.
- More than 4 markers → only the first 4 are applied.