Stateful Graph Workflows

Graph-based workflow management with persistent state across nodes

Complexity: high

Core Mechanism

Stateful Graph Workflows model a process as a directed graph of nodes (steps) and edges (control flow), with a persistent state carried across steps. Execution proceeds per session/thread, with checkpoints that allow pause/resume, replay, branching, and time-travel for debugging. Conditional edges and loops enable dynamic routing; parallel branches allow concurrent work. This pattern underpins modern agentic systems (e.g., LangGraph, LlamaIndex Workflows) and general workflow engines (e.g., Temporal) where reliability and auditability matter.
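
To make the checkpoint ideas concrete, here is a minimal sketch of time-travel and branching over a per-thread checkpoint history. It uses plain Python rather than any particular framework; `CheckpointLog` and its methods are hypothetical names chosen for illustration, and an in-memory list stands in for a real checkpoint store.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class CheckpointLog:
    """Append-only history of state snapshots for one thread (hypothetical helper)."""
    snapshots: list = field(default_factory=list)

    def save(self, node: str, state: dict) -> int:
        # Deep-copy so later mutations cannot alter recorded history.
        self.snapshots.append((node, copy.deepcopy(state)))
        return len(self.snapshots) - 1

    def load(self, index: int) -> dict:
        # Return a copy so callers cannot mutate the stored snapshot.
        return copy.deepcopy(self.snapshots[index][1])

    def fork(self, index: int) -> "CheckpointLog":
        # A branch shares history up to and including `index`, then diverges.
        return CheckpointLog(snapshots=copy.deepcopy(self.snapshots[: index + 1]))


log = CheckpointLog()
state = {"draft": "", "revisions": 0}
for text in ["v1", "v2", "v3"]:
    state["draft"] = text
    state["revisions"] += 1
    log.save("write", state)

# Time travel: inspect the state as it was after the first step.
past = log.load(0)

# Branching: continue from checkpoint 0 on a new branch; the original is untouched.
branch = log.fork(0)
alt = branch.load(0)
alt["draft"] = "v2-alt"
alt["revisions"] += 1
branch.save("write", alt)
```

Snapshots are deep-copied on both save and load, which keeps the history immutable at the cost of extra memory; a production store would more likely serialize to durable storage.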

Workflow / Steps

  1. Define a typed state schema (inputs, working memory, outputs, metadata, provenance).
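  2. Implement nodes as steps (tool/LLM calls, functions) that read and write state; keep them deterministic where possible, and record nondeterministic results (e.g., LLM outputs) in state so replays are reproducible.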
  3. Connect nodes with edges; add conditional routing and loop guards; enable parallel branches where safe.
  4. Configure persistence (checkpoint store) and thread/session keys for resumability and replay.
  5. Run per thread: advance node → write checkpoint → decide next edge; support human-in-the-loop stops.
  6. Observe and govern: logs/traces, metrics, cost/latency budgets, policy and safety checks.
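
The steps above can be sketched in plain Python. This is an illustrative sketch, not the API of any specific framework: the `State` fields, the node names, the `CHECKPOINTS` dictionary (an in-memory stand-in for a checkpoint store), and the `pause_before` parameter are all assumptions made for the example.

```python
import copy
from typing import Callable, Optional, TypedDict


class State(TypedDict):
    question: str
    draft: str
    attempts: int
    approved: bool


def write(state: State) -> dict:
    # Stand-in for an LLM call; returns a partial state update.
    n = state["attempts"] + 1
    return {"draft": f"answer v{n}", "attempts": n}


def review(state: State) -> dict:
    # Stand-in for a verification step: accept the second draft.
    return {"approved": state["attempts"] >= 2}


NODES: dict[str, Callable[[State], dict]] = {"write": write, "review": review}


def next_node(current: str, state: State) -> Optional[str]:
    # Conditional routing: loop back to "write" until the draft is approved.
    if current == "write":
        return "review"
    return None if state["approved"] else "write"


CHECKPOINTS: dict[str, tuple] = {}  # thread_id -> (next node, state)
MAX_STEPS = 10  # loop guard (assumed value)


def run(thread_id: str, initial: State, pause_before: Optional[str] = None) -> State:
    # Resume from this thread's last checkpoint if one exists.
    resumed = thread_id in CHECKPOINTS
    node, state = CHECKPOINTS.get(thread_id, ("write", initial))
    state = copy.deepcopy(state)
    steps = 0
    while node is not None:
        if node == pause_before and not resumed:
            return state  # human-in-the-loop stop; checkpoint is already written
        if steps >= MAX_STEPS:
            raise RuntimeError("max steps exceeded")
        resumed = False
        state.update(NODES[node](state))       # advance node
        node = next_node(node, state)          # decide next edge
        CHECKPOINTS[thread_id] = (node, copy.deepcopy(state))  # write checkpoint
        steps += 1
    return state


start: State = {"question": "q", "draft": "", "attempts": 0, "approved": False}
paused = run("thread-1", start, pause_before="review")  # stops before review
final = run("thread-1", start)                          # resumes and finishes
```

The second call ignores `start` because a checkpoint already exists for `thread-1`; the thread key, not the caller's arguments, decides where execution picks up.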

Best Practices

Keep state compact and structured; persist artifacts by reference (URIs) with provenance.
Make nodes idempotent and side‑effect safe; store dedupe keys and retry metadata in state.
Version graphs and state schemas; support migrations and backward‑compatible checkpoints.
Add explicit termination criteria and loop guards; cap max steps/iterations per thread.
Separate orchestration from business logic; keep prompts/tools modular and testable.
Instrument per‑node KPIs (latency, cost/tokens, success/fallbacks) and enforce budgets.
Use checkpointing for resume/replay; redact or encrypt sensitive fields at rest.
Prefer parallel branches only for independent, side‑effect‑free steps; bound concurrency.
Write table‑driven tests for routing; fuzz edge conditions; record/replay tricky sessions.
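
As one way to apply the routing-test and loop-guard advice, the sketch below pairs a small router with a table of cases. The `route` function, its state keys, and the `MAX_ITERATIONS` value are hypothetical; the point is the shape of the test table, not the specific rules.

```python
from typing import Optional

MAX_ITERATIONS = 3  # loop guard (assumed value)


def route(state: dict) -> Optional[str]:
    """Hypothetical router: return the next node name, or None to terminate."""
    if state.get("error"):
        return "fallback"
    if state.get("verified"):
        return "report"
    if state.get("iterations", 0) >= MAX_ITERATIONS:
        return None  # guard: stop instead of looping forever
    return "act"


# Each row: (description, input state, expected next node).
ROUTING_CASES = [
    ("fresh state goes to act", {}, "act"),
    ("verified goes to report", {"verified": True}, "report"),
    ("error wins over verified", {"error": "timeout", "verified": True}, "fallback"),
    ("below the cap keeps looping", {"iterations": 2}, "act"),
    ("at the cap terminates", {"iterations": 3}, None),
    ("verified at the cap still reports", {"iterations": 3, "verified": True}, "report"),
]


def check_routing() -> list:
    """Return descriptions of failing rows; an empty list means every row passes."""
    return [desc for desc, state, expected in ROUTING_CASES if route(state) != expected]


failures = check_routing()
```

Adding a new routing rule then means adding rows rather than new test functions, and the boundary rows (below, at, and past the cap) document the guard's intended behavior.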

When NOT to Use

Simple, linear request→response flows without branching or long‑lived context.

Ultra low‑latency paths where orchestration/LLM round‑trips would violate SLOs.

One‑off scripts or stateless batch jobs where checkpointing adds no value.

Strict cross‑service ACID transactions; use a transactional workflow engine, or sagas with compensating actions where eventual consistency is acceptable.

Domains better served by rules engines or SQL pipelines with minimal control flow.

Common Pitfalls

Unbounded loops or recursive retries without caps or convergence checks.

Global mutable state shared across threads; race conditions in parallel branches.

Missing idempotency → duplicate side effects when resuming/retrying.

Checkpointing large blobs instead of references → storage bloat and slow resumes.

Token blowups from over‑stuffed prompts and missing summarization of state/history.

Mixing orchestration and business logic tightly; hard‑to‑test, brittle graphs.
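
The duplicate-side-effect pitfall can be illustrated with a node that records a dedupe key in state before it is allowed to repeat an action. This is a minimal sketch under stated assumptions: `send_email_node`, the `completed_effects` state key, and the `SENT_EMAILS` list (a stand-in for a real external service) are hypothetical.

```python
import hashlib
import json

SENT_EMAILS = []  # stand-in for an external side effect


def dedupe_key(node: str, payload: dict) -> str:
    # Stable key: the same node and payload yield the same key across retries.
    raw = json.dumps({"node": node, "payload": payload}, sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def send_email_node(state: dict) -> dict:
    payload = {"to": state["recipient"], "body": state["body"]}
    key = dedupe_key("send_email", payload)
    done = state.get("completed_effects", {})
    if key in done:
        return {}  # already performed; a resume or retry becomes a no-op
    SENT_EMAILS.append(payload)  # the side effect itself
    # Record the key in state so the next checkpoint remembers it.
    return {"completed_effects": {**done, key: {"attempts": 1}}}


state = {"recipient": "ops@example.com", "body": "Run finished."}
state.update(send_email_node(state))  # first execution sends
state.update(send_email_node(state))  # simulated resume: skipped
```

A crash between the side effect and the checkpoint write can still cause one duplicate, so where the external service accepts an idempotency key, passing the same key to it closes that gap.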

Key Features

Persistent state per thread/session with typed schema.
Checkpointing, resume, replay, and time‑travel debugging.
Conditional routing, loops with guards, and parallel branches.
Human‑in‑the‑loop steps and approvals; pause/resume points.
Pluggable storage/backends for checkpoints and artifacts.
Deterministic step execution with side‑effect controls.
Rich observability: traces, metrics, and audit logs.
Policy gates and budgets (latency, cost/tokens) per node.
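
Parallel branches with bounded concurrency might look like the following sketch, which uses the standard library's `ThreadPoolExecutor`. The branch functions and `run_parallel` are hypothetical; the merge rule shown here (reject two branches writing the same key) is one possible policy, not the only one.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_docs(state: dict) -> dict:
    # Stand-in for a retrieval call.
    return {"docs": [f"doc about {state['topic']}"]}


def fetch_stats(state: dict) -> dict:
    # Stand-in for an independent analysis step.
    return {"stats": {"length": len(state["topic"])}}


def run_parallel(state: dict, branches: list, max_workers: int = 2) -> dict:
    # Each branch receives its own shallow copy, so no branch sees another's writes.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(branch, dict(state)) for branch in branches]
        updates = [f.result() for f in futures]  # collected in declaration order

    merged = dict(state)
    written = set()
    for update in updates:
        clash = written.intersection(update)
        if clash:
            raise ValueError(f"branches wrote the same keys: {sorted(clash)}")
        written.update(update)
        merged.update(update)
    return merged


result = run_parallel({"topic": "graphs"}, [fetch_docs, fetch_stats])
```

Results are merged in the order the branches were declared rather than the order they finished, which keeps the merged state deterministic across runs.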

KPIs / Success Metrics

Time‑to‑completion per thread; p50/p95/p99 per node; throughput.
Success rate vs. fallback/human‑assist; rollback/resume counts.
Tokens per run and per node; cost per run; cache hit rate.
Retry/DLQ rates; loop/iteration counts; parallelism utilization.
Policy/SLA adherence: budget breaches (latency, cost, tokens).
Quality metrics (task‑specific accuracy, evaluation pass rate).
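
A few of these metrics can be computed from per-node trace events, as in the sketch below. The event fields (`node`, `latency_ms`, `ok`, `tokens`) are an assumed schema, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math
from collections import defaultdict


def percentile(values: list, pct: float) -> float:
    # Nearest-rank percentile: smallest value with at least pct% of data at or below it.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(events: list) -> dict:
    by_node = defaultdict(list)
    for event in events:
        by_node[event["node"]].append(event)
    report = {}
    for node, rows in by_node.items():
        latencies = [r["latency_ms"] for r in rows]
        report[node] = {
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
            "success_rate": sum(r["ok"] for r in rows) / len(rows),
            "tokens": sum(r["tokens"] for r in rows),
        }
    return report


events = [
    {"node": "plan", "latency_ms": 100, "ok": True, "tokens": 50},
    {"node": "plan", "latency_ms": 400, "ok": False, "tokens": 70},
    {"node": "plan", "latency_ms": 200, "ok": True, "tokens": 60},
    {"node": "plan", "latency_ms": 300, "ok": True, "tokens": 40},
    {"node": "act", "latency_ms": 20, "ok": True, "tokens": 0},
]
report = summarize(events)
```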

Token / Resource Usage

Drivers: planning/analysis prompts, retrieval/context packing, per‑node LLM calls, self‑checks, and state growth.

Controls: per‑node token caps, summarize history/state, early‑exit on high confidence, batch parallelizable nodes, cache aggressively.

Storage/CPU: compact checkpoints (diffs), externalize large artifacts, dedupe state; bound concurrency to match budgets.
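
To illustrate compact checkpoints, the sketch below stores one full base state plus per-step diffs and rebuilds any later state by replaying them. The `diff` and `apply_diff` helpers are hypothetical and compare top-level keys only; the artifact URI is a made-up placeholder showing a large artifact kept by reference.

```python
def diff(old: dict, new: dict) -> dict:
    # Record only top-level keys that were added, changed, or removed.
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    removed = sorted(k for k in old if k not in new)
    return {"set": changed, "removed": removed}


def apply_diff(base: dict, delta: dict) -> dict:
    # Rebuild the next state from the previous one plus a diff.
    state = {k: v for k, v in base.items() if k not in delta["removed"]}
    state.update(delta["set"])
    return state


s0 = {"question": "q", "notes": [], "artifact_uri": None}
s1 = {
    "question": "q",
    "notes": ["a"],
    "artifact_uri": "s3://example-bucket/report.json",  # hypothetical reference
    "scratch": "tmp",
}
s2 = {
    "question": "q",
    "notes": ["a"],
    "artifact_uri": "s3://example-bucket/report.json",
    "summary": "done",
}

# Persist the full base once, then only the diffs.
d1 = diff(s0, s1)
d2 = diff(s1, s2)

# Resume: replay diffs on top of the base checkpoint.
rebuilt = apply_diff(apply_diff(s0, d1), d2)
```

Replaying a long chain of diffs makes resumes slower, so a practical store would probably write a fresh full snapshot every so often and diff from there.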

Best Use Cases

Multi‑step AI task automation (plan→act→verify→report) with resumability and audit.

RAG pipelines with routing/verification (e.g., CoVe/CRAG) and fallback paths.

Agent teams invoking tools/skills with human approvals and safe rollbacks.

Compliance/regulated workflows requiring deterministic replay and traceability.

Long‑running operations with external dependencies and intermittent failures.
