For the curious. Your AI agent declares cache blocks when a workflow has multiple LLM calls sharing context. This explains what those declarations do and why they reduce cost.
When a workflow chains many LLM calls that all reference the same context (a project brief, a long document, a fixed persona), the LLM provider reprocesses that context, and bills it at the full input rate, on every call. With provider-side prompt caching, that context is processed once on the first call and read from a cached prefix on every subsequent call within the TTL window. Cached reads cost ~10% of normal input tokens on Anthropic and Gemini; OpenAI caches automatically at no extra charge. On a workflow with 15 sequential LLM calls sharing a 10k-token context, this typically cuts input cost 50-70% on the first run and 80%+ on reruns within the TTL.
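A rough way to see where the first-run saving comes from, as a sketch rather than pflow's actual cost model. The multipliers (1.25× for a cache write, 0.1× for a cached read) and the 3,000 uncached tokens per call are assumptions chosen for illustration; real figures depend on the provider, the model, and how much of each prompt is unique.

```python
# Illustrative arithmetic only; the multipliers and per-call token count are
# assumptions, not values read from pflow or any provider API.
WRITE_MULTIPLIER = 1.25  # assumed cost of writing the prefix to the cache
READ_MULTIPLIER = 0.10   # assumed cost of reading a cached prefix


def first_run_input_savings(calls: int, shared_tokens: int, unique_tokens: int) -> float:
    """Fraction of input cost saved on a cold first run, in token-equivalents."""
    uncached = calls * (shared_tokens + unique_tokens)
    cached = (
        shared_tokens * WRITE_MULTIPLIER                 # first call writes the prefix
        + (calls - 1) * shared_tokens * READ_MULTIPLIER  # later calls read it
        + calls * unique_tokens                          # per-call prompt is never cached
    )
    return 1 - cached / uncached


savings = first_run_input_savings(calls=15, shared_tokens=10_000, unique_tokens=3_000)
print(f"{savings:.0%}")  # 63% under these assumptions
```

The smaller the per-call unique portion, the closer the saving gets to the ceiling set by the read multiplier.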

Two cache layers

pflow has two independent cache layers. Don’t confuse them — they solve different problems.
| Layer | What it is | How to opt out |
| --- | --- | --- |
| Memo cache | pflow’s local re-execution cache. Skips a node entirely if its inputs match a prior run. | cache: false per node, or --no-cache on pflow run |
| Provider prompt cache | Anthropic / OpenAI / Gemini server-side caching of the system prompt prefix. Reduces input cost; the LLM still runs. | Don’t declare ## Cache / prompt_cache: |
--no-cache only disables the memo layer. Provider prompt caching still fires when declared.

Declaring shared context

The ## Cache block sits alongside ## Inputs and ## Steps in a workflow file. It lists stable values that flow into multiple LLM calls — workflow inputs and upstream node outputs:
## Cache

- ttl: 5m

```cache
The project brief we are working from:

${brief}

The technical constraints:

${constraints.response}
```
Each chunk is one prose label followed by exactly one ${var} reference. The prose travels into the cached system prefix verbatim — what you write is what the LLM sees as the cache label. LLM nodes opt in via prompt_cache:, listing chunks in the same order they appear in the ## Cache block:
### review

Review the proposed solution.

- type: llm
- prompt_cache: [brief, constraints.response]
- prompt: Review this approach: ${approach}
Out-of-order subsets are a hard error at validation time — providers cache by prefix, so order has to match. The error message shows the expected order so the fix is mechanical.
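The ordering rule can be sketched as a small check. This is not pflow's validator: the function name and error text are hypothetical, and the sketch assumes an in-order subset that skips earlier chunks is acceptable, which the text above does not confirm.

```python
def check_prompt_cache_order(declared: list[str], requested: list[str]) -> None:
    """Raise ValueError unless `requested` keeps the order used in `declared`."""
    unknown = [name for name in requested if name not in declared]
    if unknown:
        raise ValueError(f"not declared in ## Cache: {unknown}")
    positions = [declared.index(name) for name in requested]
    if positions != sorted(positions):
        # Report the order the chunks should be listed in.
        expected = [name for name in declared if name in requested]
        raise ValueError(f"prompt_cache out of order; expected {expected}")


declared = ["brief", "constraints.response"]
check_prompt_cache_order(declared, ["brief", "constraints.response"])  # passes

message = ""
try:
    check_prompt_cache_order(declared, ["constraints.response", "brief"])
except ValueError as err:
    message = str(err)
print(message)
```

Because providers match on a byte prefix, two nodes that list the same chunks in different orders would never share a cache entry, which is why a check like this fails early instead of letting the mismatch pass silently.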

TTL

| Value | Meaning | When to use |
| --- | --- | --- |
| 5m (default) | pflow’s default cache duration | Most workflows; matches typical run time |
| 1m-60m | Gemini minute-level cache duration, capped at 1 hour | Gemini workflows that need a specific retention window |
| 1h | Alias for 60m; Anthropic 1-hour cache, Gemini 3600s, OpenAI 24h retention | Long-running workflows or reruns within an hour |
Anthropic and OpenAI only support pflow’s discrete 5m and 1h behaviors today. For unsupported minute-level TTLs, pflow raises an error instead of silently rounding. 1h writes cost roughly 2× the base input rate on Anthropic (versus about 1.25× for a 5m write), so 1h pays off when the cached prefix gets read at least 3 times within the hour.
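The TTL rules above can be sketched as a validation function. This is an illustration under assumptions, not pflow's code: the function name, the provider strings, and the error messages are hypothetical.

```python
import re


def ttl_minutes(ttl: str, provider: str) -> int:
    """Return the TTL in minutes, or raise ValueError if the provider can't honor it."""
    match = re.fullmatch(r"(\d+)([mh])", ttl)
    if match is None:
        raise ValueError(f"unrecognized ttl: {ttl!r}")
    value, unit = int(match.group(1)), match.group(2)
    minutes = value * 60 if unit == "h" else value  # 1h is treated as 60m

    if provider == "gemini":
        # Minute-level durations, capped at one hour.
        if not 1 <= minutes <= 60:
            raise ValueError("Gemini ttl must be between 1m and 60m")
        return minutes

    # Other providers: only the discrete 5m and 1h behaviors, no rounding.
    if minutes not in (5, 60):
        raise ValueError(f"{provider} supports only 5m and 1h, got {ttl!r}")
    return minutes


print(ttl_minutes("1h", "anthropic"))  # 60
print(ttl_minutes("45m", "gemini"))    # 45
```

Raising on `45m` for Anthropic, rather than rounding to `1h`, keeps the projected write cost honest: the two tiers are billed differently.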

Minimum tokens

Provider caches only fire above a minimum token threshold:
| Provider | Threshold (tokens) |
| --- | --- |
| Anthropic Sonnet 4.5, Opus 4.1, Sonnet 3.7 | 1024 |
| Anthropic Sonnet 4.6, Haiku 3.5 | 2048 |
| Anthropic Opus 4.5+, Haiku 4.5 | 4096 |
| Gemini explicit cached content | 4096 |
| OpenAI auto-cache | 1024 |
Below the threshold, pflow either strips the provider cache marker or treats it as ineffective before projecting cost, so assume no savings. pflow analyze-cache shows below-min ready/upside rows with a blocker note explaining the threshold; only the active (provider-effective) portion feeds cost estimates.
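A minimal sketch of the threshold rule, using the values from the table above. The dictionary keys and the function name are hypothetical labels, and treating a prefix exactly at the threshold as cacheable is an assumption.

```python
# Thresholds copied from the table above; the keys are hypothetical labels,
# not identifiers that pflow or any provider uses.
MIN_CACHE_TOKENS = {
    "anthropic-sonnet-4.5": 1024,
    "anthropic-opus-4.1": 1024,
    "anthropic-sonnet-3.7": 1024,
    "anthropic-sonnet-4.6": 2048,
    "anthropic-haiku-3.5": 2048,
    "anthropic-opus-4.5": 4096,
    "anthropic-haiku-4.5": 4096,
    "gemini-explicit": 4096,
    "openai-auto": 1024,
}


def active_cache_tokens(model: str, prefix_tokens: int) -> int:
    """Tokens that may feed a savings estimate; 0 when below the minimum."""
    threshold = MIN_CACHE_TOKENS[model]
    return prefix_tokens if prefix_tokens >= threshold else 0


print(active_cache_tokens("anthropic-sonnet-4.5", 3000))  # 3000
print(active_cache_tokens("anthropic-haiku-4.5", 3000))   # 0
```

The same 3,000-token prefix is cacheable on one model and not on another, so a savings projection has to be made per model.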

Batch prefix caching

When a batch node fans out N parallel LLM calls that share a stable prefix (e.g., the same prompt template with ${item.X} substituted), pflow can cache that prefix automatically. This requires prewarm: true on the batch node:
### score-each

Score each candidate against the rubric.

- type: llm
- prompt: ./score.prompt.md
- batch:
    items: ${candidates}
    as: candidate
    parallel: true
    max_concurrent: 5
- prewarm: true
With prewarm: true, pflow makes one short LLM call before the batch dispatches to warm up the provider’s cache. All N batch items then run in parallel against the warm cache, paying cache-read prices instead of each writing the cache themselves. Without prewarm:, all N calls write the cache simultaneously — paying the write cost N times with no read benefit. The warmup covers two kinds of cached content:
  • Values declared in ## Cache (when the batch node lists them in prompt_cache:).
  • The fixed portion of the prompt template that’s the same for every item — the text before the first ${item.X} reference.
The warmup is a real LLM call with a real cost. That cost shows up in the Cost: line, in trace cost totals, and in --report. It’s not counted in the “N calls” total because it’s setup work pflow does for you, not one of your workflow’s calls. For a 4,000-token cached prefix on Anthropic Sonnet 4.5, the warmup costs roughly $0.015 and each batch item then costs about $0.0015 (10% of an uncached call). Savings grow with prefix size and batch count. pflow run --dry-run and pflow analyze-cache recommend prewarm: true when the savings clear 5%. The trade-off: the warmup adds 5-10 seconds of wall time to the batch, in exchange for every real item running at cache-read prices.
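To make the trade-off concrete, here is a sketch of the prefix cost with and without a warmup, using the per-call figures quoted above (about 0.015 USD to write the prefix, about 0.0015 USD to read it). The function name and the batch size of 20 are assumptions, and the calculation covers only the shared prefix, not each item's unique tokens or output.

```python
def batch_prefix_cost(items: int, write_cost: float, read_cost: float, prewarm: bool) -> float:
    """Cost in USD of the shared prefix across a batch of `items` calls."""
    if prewarm:
        # One warmup call writes the cache; every item then reads it.
        return write_cost + items * read_cost
    # No warmup: parallel items all write the cache and none of them reads it.
    return items * write_cost


with_prewarm = batch_prefix_cost(20, write_cost=0.015, read_cost=0.0015, prewarm=True)
without_prewarm = batch_prefix_cost(20, write_cost=0.015, read_cost=0.0015, prewarm=False)
saving = 1 - with_prewarm / without_prewarm

print(f"with prewarm:    ${with_prewarm:.3f}")
print(f"without prewarm: ${without_prewarm:.3f}")
print(f"prefix saving:   {saving:.0%}")
```

With these inputs the prefix costs about 0.045 USD instead of 0.30 USD. For a batch of one, the warmup is pure overhead, which is why a savings threshold for the recommendation makes sense.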

Sub-workflows

Each .pflow.md file declares its own ## Cache block scoped to its own inputs and step outputs. Sub-workflows do not inherit the parent’s cache block — they declare independently so they can run standalone with caching. When a parent passes a value into a child workflow, both files can cache it independently. If the rendered prose labels are byte-identical across the boundary, the provider’s cache fires across files (incidental, not orchestrated). pflow analyze-cache warns when prose labels diverge for the same logical value.
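Why byte-identical labels matter can be shown with a fingerprint comparison. This is only a sketch: how pflow joins a label and its value when rendering the prefix is not specified here, so the `label + blank line + value` layout and the function name are assumptions.

```python
import hashlib


def prefix_fingerprint(label: str, value: str) -> str:
    """Hash of the rendered chunk; the rendering layout is assumed."""
    rendered = f"{label}\n\n{value}"
    return hashlib.sha256(rendered.encode("utf-8")).hexdigest()


brief = "Build a CLI that converts CSV files to Parquet."

parent = prefix_fingerprint("The project brief we are working from:", brief)
child_same = prefix_fingerprint("The project brief we are working from:", brief)
child_diff = prefix_fingerprint("Project brief:", brief)

print(parent == child_same)  # True: identical bytes, the provider can reuse the prefix
print(parent == child_diff)  # False: same value, different label, no shared prefix
```

The value is identical in all three cases; only the label wording decides whether the parent and child can hit the same provider cache entry.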

Discovering opportunities

pflow analyze-cache is the entry point for finding savings:
pflow analyze-cache workflow.pflow.md
It identifies LLM calls that share context, separates active cache from ready/upside opportunities, projects savings only from active cache, and emits a paste-ready ## Cache block for greenfield workflows. See the analyze-cache reference. pflow run --dry-run emits a one-line nudge when actionable opportunities exist — silent on optimal plans.

What changes when caching is declared

  • The system message your LLM call receives starts with the rendered cache content (prose + values), followed by your prompt:.
  • pflow’s memo cache key includes the rendered cache content, so changing a cached chunk’s value invalidates memo entries correctly.
  • Trace files record cache token counts (cache_creation_input_tokens, cache_read_input_tokens) so you can see actual vs predicted ratios via pflow analyze-cache --from-trace.
The bytes sent to the LLM match what you wrote in the workflow file — pflow doesn’t rewrite or restructure prompts. Caching is a metadata layer; the prompt content itself is unchanged.
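The two token counters recorded in traces are enough to compute how much of the cached prefix was read rather than written. The sketch below assumes each call's usage is available as a dictionary carrying those two keys; the surrounding trace file layout and the function name are hypothetical.

```python
def cache_read_ratio(usages: list[dict[str, int]]) -> float:
    """Share of cache-related input tokens that were reads rather than writes."""
    created = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    total = created + read
    return read / total if total else 0.0


# Hypothetical run: the first call writes a 10k-token prefix, 14 later calls read it.
usages = [{"cache_creation_input_tokens": 10_000, "cache_read_input_tokens": 0}]
usages += [{"cache_creation_input_tokens": 0, "cache_read_input_tokens": 10_000}] * 14

ratio = cache_read_ratio(usages)
print(f"{ratio:.1%}")  # 93.3%
```

A ratio far below what the call count predicts usually means the prefix changed between calls or the TTL expired mid-run.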