For the curious. Your AI agent declares cache blocks when a workflow has multiple LLM calls sharing context. This page explains what those declarations do and why they reduce cost.
When a workflow chains many LLM calls that all reference the same context (a project brief, a long document, a fixed persona), the provider re-processes that context and bills it at the full input rate on every call. With provider-side prompt caching, the context is tokenized once on the first call and read from a cached prefix on every subsequent call within the TTL window. Cached reads cost ~10% of normal input tokens on Anthropic and Gemini; OpenAI caches automatically with no write surcharge. On a workflow with 15 sequential LLM calls sharing a 10k-token context, this typically cuts input cost 50-70% on the first run and 80%+ on reruns within the TTL.
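
As a rough illustration of where those numbers come from, assuming Anthropic-style multipliers (≈1.25× the input rate for cache writes, ≈0.1× for cache reads) and counting only the shared 10k-token context:

```text
Without caching:  15 calls × 10,000 tokens          = 150,000 full-rate input tokens
With caching:      1 write × 10,000 × 1.25          =  12,500
                + 14 reads × 10,000 × 0.10          =  14,000
                                              total ≈  26,500 token-equivalents (~82% less)
```

Per-call prompts and outputs still bill at normal rates, which is why observed first-run savings land in the 50-70% range rather than at the theoretical ceiling.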

Two cache layers

pflow has two independent cache layers. Don’t confuse them — they solve different problems.

| Layer | What it is | How to opt out |
| --- | --- | --- |
| Memo cache | pflow’s local re-execution cache. Skips a node entirely if its inputs match a prior run. | `cache: false` per node, or `--no-cache` on `pflow run` |
| Provider prompt cache | Anthropic / OpenAI / Gemini server-side caching of the system prompt prefix. Reduces input cost; the LLM still runs. | Don’t declare `## Cache` / `prompt_cache:` |

`--no-cache` only disables the memo layer. Provider prompt caching still fires when declared.
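
As a minimal sketch of the per-node opt-out (the node name and prompt are hypothetical, and `${brief}` assumes a workflow input by that name), `cache: false` skips only pflow’s memo layer for that node; any declared provider prompt caching still applies:

```markdown
### summarize

Summarize the brief.

- type: llm
- cache: false
- prompt: Summarize this brief: ${brief}
```

Passing `--no-cache` to `pflow run` has the same effect for every node in a single run.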

Declaring shared context

The `## Cache` block sits alongside `## Inputs` and `## Steps` in a workflow file. It lists stable values that flow into multiple LLM calls — workflow inputs and upstream node outputs:

````markdown
## Cache

- ttl: 5m

```cache
The project brief we are working from:

${brief}

The technical constraints:

${constraints.response}
```
````

Each chunk is one prose label followed by exactly one `${var}` reference. The prose travels into the cached system prefix verbatim — what you write is what the LLM sees as the cache label. LLM nodes opt in via `prompt_cache:`, listing chunks in the same order they appear in the `## Cache` block:

```markdown
### review

Review the proposed solution.

- type: llm
- prompt_cache: [brief, constraints.response]
- prompt: Review this approach: ${approach}
```

Out-of-order subsets are a hard error at validation time — providers cache by prefix, so order has to match. The error message shows the expected order so the fix is mechanical.
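
A node that needs only the first chunk can list a leading subset. As a sketch (the node name and prompt are hypothetical, and it assumes leading subsets are accepted, as the prefix rule suggests):

```markdown
### triage

Decide whether the approach needs a full review.

- type: llm
- prompt_cache: [brief]
- prompt: Is this approach worth a detailed review? ${approach}
```

Listing `[constraints.response, brief]`, by contrast, would fail validation because the provider-side prefix would no longer match the declared order.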

TTL

| Value | Meaning | When to use |
| --- | --- | --- |
| `5m` (default) | Provider’s standard cache duration | Most workflows; matches typical run time |
| `1h` | Anthropic 1-hour cache, Gemini 3600s, OpenAI 24h retention | Long-running workflows or reruns within an hour |

`1h` writes cost roughly 2× the standard write rate on Anthropic, so the longer TTL pays off when the cached prefix gets read at least 3 times within the hour.
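
Switching TTLs is a one-line change in the `## Cache` block; the `cache` fence with the chunks stays the same. A sketch:

```markdown
## Cache

- ttl: 1h
```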

Minimum tokens

Provider caches only fire above a minimum token threshold:

| Provider | Threshold (tokens) |
| --- | --- |
| Anthropic Sonnet 4.5, Opus 4.1, Sonnet 3.7 | 1024 |
| Anthropic Sonnet 4.6, Haiku 3.5 | 2048 |
| Anthropic Opus 4.5+, Haiku 4.5 | 4096 |
| Gemini Flash | 1024 |
| OpenAI auto-cache | 1024 |

Below the threshold, the provider silently ignores the cache marker — no error, just no savings. `pflow analyze-cache` warns when a declared subset falls below the threshold and suggests including more chunks to cross it.

Batch prefix caching

When a batch node fans out N parallel LLM calls that share a stable prefix (e.g., the same prompt template with `${item.X}` substituted), pflow can cache that prefix automatically. This requires `prewarm: true` on the batch node:

```markdown
### score-each

Score each candidate against the rubric.

- type: llm
- prompt: ./score.prompt.md
- batch:
    items: ${candidates}
    as: candidate
    parallel: true
    max_concurrent: 5
- prewarm: true
```

With `prewarm: true`, pflow runs the first item synchronously to write the cache, then fans out the remaining N-1 calls in parallel as cache reads. Without `prewarm:`, all N calls write the cache simultaneously — paying the write cost N times for no read benefit. `pflow run --dry-run` and `pflow analyze-cache` recommend `prewarm: true` when the savings ratio crosses 5%. The decision is the agent’s — `prewarm: true` adds one call’s latency to the batch in exchange for ~5-10× cost reduction on the remaining items.
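
Rough numbers for intuition, assuming 20 candidates sharing a 5,000-token prefix and Anthropic-style multipliers (≈1.25× write, ≈0.1× read):

```text
Without prewarm:  20 parallel writes × 5,000 × 1.25  = 125,000 token-equivalents
With prewarm:      1 write           × 5,000 × 1.25  =   6,250
                + 19 reads           × 5,000 × 0.10  =   9,500
                                               total =  15,750 (≈8× cheaper on the shared prefix)
```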

Sub-workflows

Each `.pflow.md` file declares its own `## Cache` block scoped to its own inputs and step outputs. Sub-workflows do not inherit the parent’s cache block — they declare independently so they can run standalone with caching. When a parent passes a value into a child workflow, both files can cache it independently. If the rendered prose labels are byte-identical across the boundary, the provider’s cache fires across files (incidental, not orchestrated). `pflow analyze-cache` warns when prose labels diverge for the same logical value.
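
For example, a parent and child might both cache the same passed-in value (the file names and the `${spec}` input are hypothetical). In `parent.pflow.md`:

```cache
The API specification we are implementing:

${spec}
```

and, byte-for-byte identical, in `child.pflow.md`:

```cache
The API specification we are implementing:

${spec}
```

Because the rendered prose and value match exactly, the provider sees the same prefix from both files, so the child’s calls can read the prefix the parent’s calls wrote.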

Discovering opportunities

`pflow analyze-cache` is the entry point for finding savings:

```bash
pflow analyze-cache workflow.pflow.md
```

It identifies LLM calls that share context, projects the savings, and emits a paste-ready `## Cache` block for greenfield workflows. See the analyze-cache reference. `pflow run --dry-run` emits a one-line nudge when actionable opportunities exist and stays silent when the plan is already optimal.

What changes when caching is declared

- The system message your LLM call receives starts with the rendered cache content (prose + values), followed by your `prompt:`.
- pflow’s memo cache key includes the rendered cache content, so changing a cached chunk’s value invalidates memo entries correctly.
- Trace files record cache token counts (`cache_creation_input_tokens`, `cache_read_input_tokens`) so you can see actual vs predicted ratios via `pflow analyze-cache --from-trace`.

The bytes sent to the LLM match what you wrote in the workflow file — pflow doesn’t rewrite or restructure prompts. Caching is a metadata layer; the prompt content itself is unchanged.
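
Putting that together for the `review` node above, the system message it receives would look roughly like this (a sketch; exact whitespace and any separators pflow inserts are not specified here):

```text
The project brief we are working from:

<value of ${brief}>

The technical constraints:

<value of ${constraints.response}>

Review this approach: <value of ${approach}>
```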