For the curious. Your AI agent declares cache blocks when a workflow has multiple LLM calls sharing context. This explains what those declarations do and why they reduce cost.
When a workflow chains many LLM calls that all reference the same context (a project brief, a long document, a fixed persona), the LLM provider reprocesses that context, and bills it at the full input rate, on every call. With provider-side prompt caching, the context is processed once on the first call and read from a cached prefix on every subsequent call within the TTL window. Cached reads cost ~10% of normal input tokens on Anthropic and Gemini; OpenAI caches automatically, with no extra charge for cache writes.
On a workflow with 15 sequential LLM calls sharing a 10k-token context, this typically cuts input cost 50-70% on the first run and 80%+ on reruns within the TTL.
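As a rough illustration of where the savings come from, the sketch below prices input in relative units (1.0 = one uncached input token). The 1.25× write multiplier is an assumption based on Anthropic’s 5-minute cache-write pricing, the 3,000 uncached tokens per call is a made-up figure, and `input_cost` and `savings` are hypothetical helpers, not part of pflow. Real savings depend on how much of each call sits outside the cached prefix.

```python
def input_cost(calls, shared_tokens, unique_tokens,
               write_mult=1.25, read_mult=0.10):
    """Return (uncached, cached) input cost in uncached-token units.

    Assumes the first call writes the shared prefix to the cache and
    every later call reads it. Multipliers are assumptions, not pflow values.
    """
    uncached = calls * (shared_tokens + unique_tokens)
    reads = calls - 1
    cached_prefix = (write_mult + reads * read_mult) * shared_tokens
    return uncached, cached_prefix + calls * unique_tokens


def savings(uncached, cached):
    """Fraction of input cost saved by caching."""
    return 1 - cached / uncached


first_run = savings(*input_cost(calls=15, shared_tokens=10_000, unique_tokens=3_000))
print(f"first-run input savings: {first_run:.0%}")  # prints: first-run input savings: 63%
```

Shrinking `unique_tokens` pushes the result up and growing it pushes the result down, which is why the figure quoted above is a range rather than a single number.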
Two cache layers
pflow has two independent cache layers. Don’t confuse them — they solve different problems.
| Layer | What it is | How to opt out |
|---|---|---|
| Memo cache | pflow’s local re-execution cache. Skips a node entirely if its inputs match a prior run. | cache: false per node, or --no-cache on pflow run |
| Provider prompt cache | Anthropic / OpenAI / Gemini server-side caching of the system prompt prefix. Reduces input cost; the LLM still runs. | Don’t declare ## Cache / prompt_cache: |
--no-cache only disables the memo layer. Provider prompt caching still fires when declared.
Declaring shared context
The ## Cache block sits alongside ## Inputs and ## Steps in a workflow file. It lists stable values that flow into multiple LLM calls — workflow inputs and upstream node outputs:
## Cache
- ttl: 5m
```cache
The project brief we are working from:
${brief}
The technical constraints:
${constraints.response}
```
Each chunk is one prose label followed by exactly one ${var} reference. The prose travels into the cached system prefix verbatim — what you write is what the LLM sees as the cache label.
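To make "the prose travels verbatim" concrete, here is a minimal sketch of rendering a cache block. It is an assumption about the general shape of the substitution, not pflow’s actual renderer; `render_cache_block` and the sample values are hypothetical.

```python
import re


def render_cache_block(block: str, values: dict) -> str:
    """Substitute ${name} references and leave the prose labels untouched."""
    def lookup(match):
        # match.group(1) is the text between "${" and "}", dots included.
        return str(values[match.group(1)])

    return re.sub(r"\$\{([^}]+)\}", lookup, block)


block = (
    "The project brief we are working from:\n"
    "${brief}\n"
    "The technical constraints:\n"
    "${constraints.response}\n"
)
rendered = render_cache_block(
    block,
    {"brief": "Build a CLI.", "constraints.response": "Python 3.11 only."},
)
print(rendered)
```

Only the `${...}` references change; every other byte of the block reaches the model as written, which is what keeps the cached prefix stable from call to call.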
LLM nodes opt in via prompt_cache:, listing chunks in the same order they appear in the ## Cache block:
### review
Review the proposed solution.
- type: llm
- prompt_cache: [brief, constraints.response]
- prompt: Review this approach: ${approach}
Out-of-order subsets are a hard error at validation time — providers cache by prefix, so order has to match. The error message shows the expected order so the fix is mechanical.
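The ordering rule can be sketched as an in-order subset check. This is an illustrative reimplementation under the assumption that any subset is allowed as long as relative order is preserved; `check_prompt_cache_order` and its error text are hypothetical, not pflow’s validator.

```python
def check_prompt_cache_order(declared, requested):
    """Return `requested` if it is an in-order subset of `declared`.

    Raises ValueError with the expected order otherwise.
    """
    unknown = [name for name in requested if name not in declared]
    if unknown:
        raise ValueError(f"unknown cache chunks: {unknown}")

    # The same chunks, arranged the way the ## Cache block declares them.
    expected = [name for name in declared if name in requested]
    if list(requested) != expected:
        raise ValueError(f"prompt_cache order must match ## Cache; expected {expected}")
    return expected


declared = ["brief", "constraints.response"]
check_prompt_cache_order(declared, ["brief", "constraints.response"])  # passes
try:
    check_prompt_cache_order(declared, ["constraints.response", "brief"])
except ValueError as err:
    print(err)
```

Because the error carries the expected list, fixing a node means copying that list into `prompt_cache:`.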
TTL
| Value | Meaning | When to use |
|---|---|---|
| 5m (default) | pflow’s default cache duration | Most workflows; matches typical run time |
| 1m-60m | Gemini minute-level cache duration, capped at 1 hour | Gemini workflows that need a specific retention window |
| 1h | Alias for 60m; Anthropic 1-hour cache, Gemini 3600s, OpenAI 24h retention | Long-running workflows or reruns within an hour |
Anthropic and OpenAI only support pflow’s discrete 5m and 1h behaviors today. pflow errors instead of silently rounding unsupported minute-level TTLs. 1h writes cost roughly 2× the standard write rate on Anthropic, so it pays off when the cached prefix gets read at least 3 times within the hour.
Minimum tokens
Provider caches only fire above a minimum token threshold:
| Provider | Threshold (tokens) |
|---|---|
| Anthropic Sonnet 4.5, Opus 4.1, Sonnet 3.7 | 1024 |
| Anthropic Sonnet 4.6, Haiku 3.5 | 2048 |
| Anthropic Opus 4.5+, Haiku 4.5 | 4096 |
| Gemini explicit cached content | 4096 |
| OpenAI auto-cache | 1024 |
Below the threshold, pflow either strips the provider cache marker or treats it as ineffective before projecting cost, so no savings should be assumed. pflow analyze-cache still lists below-minimum chunks as ready/upside rows, with a blocker note explaining the threshold; only the active (provider-effective) portion feeds cost estimates.
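The threshold rule amounts to a lookup plus a comparison. The sketch below copies a few thresholds from the table above; the dictionary keys are hypothetical labels chosen for this example (not identifiers pflow or the providers use), and `active_cached_tokens` is an illustrative helper rather than pflow’s implementation.

```python
# Thresholds copied from the table above. Keys are made-up labels.
MIN_CACHE_TOKENS = {
    "anthropic/sonnet-4.5": 1024,
    "anthropic/sonnet-4.6": 2048,
    "anthropic/haiku-4.5": 4096,
    "gemini/explicit": 4096,
    "openai/auto": 1024,
}


def active_cached_tokens(model: str, prefix_tokens: int) -> int:
    """Tokens that may count toward projected savings (0 if below the minimum)."""
    return prefix_tokens if prefix_tokens >= MIN_CACHE_TOKENS[model] else 0


print(active_cached_tokens("anthropic/sonnet-4.5", 3000))  # prints: 3000
print(active_cached_tokens("anthropic/haiku-4.5", 3000))   # prints: 0
```

The same 3,000-token prefix is cacheable on one model and not on another, so switching models can silently remove projected savings.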
Batch prefix caching
When a batch node fans out N parallel LLM calls that share a stable prefix (e.g., the same prompt template with ${item.X} substituted), pflow can cache that prefix automatically. This requires prewarm: true on the batch node:
### score-each
Score each candidate against the rubric.
- type: llm
- prompt: ./score.prompt.md
- batch:
    items: ${candidates}
    as: candidate
    parallel: true
    max_concurrent: 5
- prewarm: true
With prewarm: true, pflow makes one short LLM call before the batch dispatches to warm up the provider’s cache. All N batch items then run in parallel against the warm cache, paying cache-read prices instead of each writing the cache themselves. Without prewarm:, all N calls write the cache simultaneously — paying the write cost N times with no read benefit.
The warmup covers two kinds of cached content:
- Values declared in ## Cache (when the batch node lists them in prompt_cache:).
- The fixed portion of the prompt template that’s the same for every item: the text before the first ${item.X} reference.
The warmup is a real LLM call with a real cost. That cost shows up in the Cost: line, in trace cost totals, and in --report. It’s not counted in the “N calls” total because it’s setup work pflow does for you, not one of your workflow’s calls. For a 4,000-token cached prefix on Anthropic Sonnet 4.5, the warmup costs roughly $0.015 and each batch item then costs about $0.0015 (10% of an uncached call). Savings grow with prefix size and batch count.
pflow run --dry-run and pflow analyze-cache recommend prewarm: true when the savings clear 5%. The trade-off: the warmup adds 5-10 seconds of wall time to the batch, in exchange for every real item running at cache-read prices.
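The prewarm trade-off can be worked through in relative units, where 1.0 is the cost of sending the shared prefix once uncached. The 1.25× write multiplier is an assumption (Anthropic’s 5-minute cache-write rate), the warmup’s small output cost is ignored, and `batch_prefix_cost` is a hypothetical helper, not a pflow function.

```python
def batch_prefix_cost(n_items, prewarm, write_mult=1.25, read_mult=0.10):
    """Cost of the shared prefix across a parallel batch, in uncached-prefix units.

    With prewarm: one warmup call writes the cache, then every item reads it.
    Without prewarm: every item races to write the cache and none reads it.
    """
    if prewarm:
        return write_mult + n_items * read_mult
    return n_items * write_mult


n = 20
with_prewarm = batch_prefix_cost(n, prewarm=True)
without_prewarm = batch_prefix_cost(n, prewarm=False)
uncached = float(n)  # no caching at all: the prefix is paid in full n times
print(f"prewarm: {with_prewarm:.2f}, no prewarm: {without_prewarm:.2f}, uncached: {uncached:.2f}")
```

Under these assumptions a 20-item batch pays about 3.25 prefix-units with prewarm versus 25 without it. Declaring a cache on a parallel batch without prewarm costs more than not caching at all, because every item pays the write premium.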
Sub-workflows
Each .pflow.md file declares its own ## Cache block scoped to its own inputs and step outputs. Sub-workflows do not inherit the parent’s cache block — they declare independently so they can run standalone with caching.
When a parent passes a value into a child workflow, both files can cache it independently. If the rendered prose labels are byte-identical across the boundary, the provider’s cache fires across files (incidental, not orchestrated). pflow analyze-cache warns when prose labels diverge for the same logical value.
Discovering opportunities
pflow analyze-cache is the entry point for finding savings:
pflow analyze-cache workflow.pflow.md
It identifies LLM calls that share context, separates active cache from ready/upside opportunities, projects savings only from active cache, and emits a paste-ready ## Cache block for greenfield workflows. See the analyze-cache reference.
pflow run --dry-run emits a one-line nudge when actionable opportunities exist — silent on optimal plans.
What changes when caching is declared
- The system message your LLM call receives starts with the rendered cache content (prose + values), followed by your prompt:.
- pflow’s memo cache key includes the rendered cache content, so changing a cached chunk’s value invalidates memo entries correctly.
- Trace files record cache token counts (cache_creation_input_tokens, cache_read_input_tokens) so you can see actual vs predicted ratios via pflow analyze-cache --from-trace.
The bytes sent to the LLM match what you wrote in the workflow file — pflow doesn’t rewrite or restructure prompts. Caching is a metadata layer; the prompt content itself is unchanged.
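The memo-key behaviour described above can be pictured with a small hashing sketch. pflow’s real key derivation is not documented here, so the field names, the JSON encoding, and `memo_key` itself are assumptions made purely for illustration.

```python
import hashlib
import json


def memo_key(node_type: str, params: dict, rendered_cache: str) -> str:
    """Hash a node's inputs together with its rendered cache content."""
    payload = json.dumps(
        {"type": node_type, "params": params, "cache": rendered_cache},
        sort_keys=True,  # stable key regardless of dict insertion order
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


params = {"prompt": "Review this approach: use SQLite"}
key_a = memo_key("llm", params, "The project brief we are working from:\nBuild a CLI.\n")
key_b = memo_key("llm", params, "The project brief we are working from:\nBuild a web app.\n")
print(key_a != key_b)  # prints: True
```

Because the rendered cache content is part of the hashed payload, editing a cached value produces a new key and the node reruns instead of returning a stale memo entry.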