
Event-Driven Hierarchical Agents (EDHA)

Multi-level agent hierarchy with event-based coordination

Complexity: high

Core Mechanism

Event-Driven Hierarchical Agents (EDHA) organize agents into supervisor→worker levels coordinated by events. Higher levels publish directives to level-specific topics; lower levels consume, decompose, and act, emitting status and results upward. Each level isolates concerns, uses consumer groups for scale, and applies policies (retries, backoff, dead-letter queues) with clear topic taxonomy and correlation/causation IDs for traceability. This blends hierarchical planning with event-driven architecture for resilient orchestration.
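
As a concrete anchor, a minimal event envelope might look like the following Python sketch; the field names (`correlation_id`, `causation_id`) and the `child_event` helper are illustrative, not a fixed wire format:

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Event:
    """Envelope carried by every message on the bus, at every level."""
    topic: str                       # level-prefixed, e.g. "mgr.assignments"
    type: str                        # "directive", "status", "result", ...
    payload: dict                    # keep small; reference large artifacts
    correlation_id: str              # constant across one end-to-end workflow
    causation_id: str | None = None  # event_id of the direct trigger
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    ts: float = field(default_factory=time.time)

def child_event(parent: Event, topic: str, type: str, payload: dict) -> Event:
    """Derive a downstream event while preserving the correlation chain."""
    return Event(topic=topic, type=type, payload=payload,
                 correlation_id=parent.correlation_id,
                 causation_id=parent.event_id)
```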

Workflow / Steps

  1. Define hierarchy: levels (e.g., executive → manager → team) and role capabilities per level.
  2. Design topic taxonomy and naming: level-prefixed topics (e.g., exec.directives, mgr.assignments, team.tasks).
  3. Provision transport: Kafka/RabbitMQ/NATS/Cloud Pub/Sub with partitions/FIFO groups and retention.
  4. Specify contracts: message schemas (JSON/Avro/Protobuf), headers (correlationId, causationId, tenant, ttl).
  5. Implement supervisors: publish directives, evaluate status events, handle escalation/approval gates.
  6. Implement workers: consume assignments, decompose tasks, invoke tools/LLMs, publish progress/results (a minimal sketch of steps 5–6 follows this list).
  7. Apply reliability: retries with jittered backoff, idempotency keys, DLQs, circuit breakers, timeouts.
  8. Observe and govern: tracing, metrics per level, quotas, cost/latency budgets, policy and safety checks.
  9. Reconfigure dynamically: scale consumer groups, change routing, or insert review agents without downtime.
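
A minimal sketch of steps 5–6, reusing the `Event`/`child_event` helpers above; the in-memory `Bus` stands in for a real broker (Kafka/RabbitMQ/NATS), and the class shapes are illustrative:

```python
import queue
import uuid
from collections import defaultdict

class Bus:
    """In-memory stand-in for a broker; one queue per topic."""
    def __init__(self):
        self.topics = defaultdict(queue.Queue)

    def publish(self, event: Event):
        self.topics[event.topic].put(event)

    def consume(self, topic: str) -> Event:
        return self.topics[topic].get()

class Supervisor:
    """Manager level: publishes directives, evaluates status events."""
    def __init__(self, bus: Bus):
        self.bus = bus

    def issue(self, goal: str) -> Event:
        directive = Event(topic="mgr.assignments", type="directive",
                          payload={"goal": goal},
                          correlation_id=str(uuid.uuid4()))
        self.bus.publish(directive)
        return directive

    def check_status(self) -> Event:
        return self.bus.consume("mgr.status")  # aggregate/escalate from here

class Worker:
    """Team level: consumes assignments, acts, reports upward."""
    def __init__(self, bus: Bus):
        self.bus = bus

    def step(self):
        task = self.bus.consume("mgr.assignments")
        # ... decompose task.payload["goal"] and invoke tools/LLMs here ...
        self.bus.publish(child_event(task, topic="mgr.status",
                                     type="status", payload={"state": "done"}))
```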

Best Practices

Use stable hierarchical topic naming; segregate control vs data planes; keep messages small, reference large artifacts.
Include correlation/causation IDs and versioned schemas; validate with schema registry; ensure backward compatibility.
Design idempotent consumers; de-duplicate via message IDs and idempotency keys; avoid non-atomic side effects (see the consumer sketch after this list).
Bound fan-out and depth; enforce per-level SLAs, budgets, and termination criteria to prevent runaway cascades.
Apply backpressure and concurrency caps; tune prefetch/flow control; avoid hot partitions via well-chosen keys.
Implement DLQs with triage workflows; add human-in-the-loop for irrecoverable or policy-sensitive escalations.
Gate cross-level transactions with sagas/compensation; avoid global locks; prefer eventual consistency patterns (a compensation sketch also follows).
Instrument with tracing (OpenTelemetry) and per-level KPIs; maintain audit logs and reproducible event replays.
Secure by level and tenant: authN/Z, encryption, PII redaction; isolate noisy tenants with quotas.
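
A sketch of the idempotent-consumer practice above, assuming the `Event`/`Bus` helpers from the earlier sketches; the dedup set, attempt cap, and `team.dlq` topic are illustrative:

```python
import random
import time

def handle_with_policy(bus: Bus, event: Event, handler,
                       seen: set, max_attempts: int = 4):
    """Process once per event_id; retry with full-jitter backoff; DLQ on exhaustion."""
    if event.event_id in seen:
        return                        # duplicate delivery: already applied
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            seen.add(event.event_id)  # in production: a durable dedup store
            return
        except Exception:
            if attempt == max_attempts:
                bus.publish(child_event(event, topic="team.dlq",
                                        type="dead_letter",
                                        payload=event.payload))
                return
            time.sleep(random.uniform(0, 0.5 * 2 ** attempt))  # jittered backoff
```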
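
For the saga/compensation practice, a minimal runner that undoes completed steps in reverse order rather than holding a global lock; the (action, compensation) pairs are domain-specific:

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; on failure,
    compensate completed steps in reverse."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()  # best-effort rollback; route failures to triage
        raise
```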

When NOT to Use

Single-team, short-lived tasks with simple synchronous call graphs and no need for hierarchical oversight.

Strict globally ordered workflows or ACID transactions spanning many agents without broker support.

Ultra-low-latency paths where queuing or LLM round-trips break SLOs; prefer direct RPC or in-process coordination.

Domains where centralized optimization beats decomposed local decisions (noisy consensus, tight coupling).

Common Pitfalls

Feedback loops between levels causing churn; missing acyclic flow design and escalation rules.

Unbounded fan-out and hidden dependencies; no caps on depth, retries, or parallelism (a bounding sketch follows these pitfalls).

Consumer group misconfiguration leading to duplicate work or idle workers; hot keys causing skew.

No DLQ triage; mixing control and data payloads; large blobs in the bus inflating cost/latency.

Missing idempotency and exactly-once semantics where needed; lack of schema evolution strategy.
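
One small guard against the unbounded-fan-out pitfall, carrying a `depth` counter in the payload and escalating instead of cascading; it reuses the `Event`/`child_event` helpers, and the caps and `exec.escalations` topic are illustrative:

```python
MAX_DEPTH = 3      # illustrative caps; tune per level and budget
MAX_CHILDREN = 8

def decompose(event: Event, subtasks: list) -> list:
    """Cascade downward only within bounds; otherwise escalate upward."""
    depth = event.payload.get("depth", 0)
    if depth >= MAX_DEPTH or len(subtasks) > MAX_CHILDREN:
        return [child_event(event, topic="exec.escalations", type="escalation",
                            payload={"reason": "bounds exceeded", "depth": depth})]
    return [child_event(event, topic="team.tasks", type="directive",
                        payload={**sub, "depth": depth + 1})
            for sub in subtasks]
```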

Key Features

Supervisor→worker hierarchy with cascading task decomposition and approvals.
Level-scoped topics and consumer groups; isolation domains per level/tenant.
Upward status aggregation, downward directives; event replay for audit and recovery.
Policy gates per level (SLAs, budgets, safety checks, review workflows).
Dynamic reconfiguration: hot-add agents/levels, reroute topics, scale consumer groups.
Interoperation with planning (HTN/options) and actor-style supervision trees.

KPIs / Success Metrics

Directive→completion latency per level (p50/p95/p99) and end-to-end time to resolution.
Throughput and backlog per level (consumer lag, oldest event age, SLA breach rate; see the lag sketch after this list).
Escalation and rework rates; approval turnaround; successful decomposition rate.
Duplicate/ordering-violation rates; retry/DLQ rates; idempotency failure incidence.
Cost per completed task by level; tokens per task; cache hit rate; consumer utilization.
Change safety: incident count after reconfiguration; policy violation/override counts.
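
Consumer lag per partition can be read straight from the broker; a minimal sketch assuming a Kafka transport and the `confluent_kafka` client:

```python
from confluent_kafka import Consumer, TopicPartition

def consumer_lag(consumer: Consumer, topic: str, partition: int) -> int:
    """Lag = newest available offset minus the group's committed offset."""
    tp = TopicPartition(topic, partition)
    low, high = consumer.get_watermark_offsets(tp, timeout=5.0)
    committed = consumer.committed([tp], timeout=5.0)[0].offset
    if committed < 0:        # group has not committed yet
        committed = low
    return high - committed
```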

Token / Resource Usage

LLM tokens scale with hierarchy depth and event verbosity. Prefer status summaries and references over full logs.

Broker costs: storage/retention, egress, partitions/FIFO groups. Tune message size, compression, and batching (see the producer config sketch below).

Compute/memory: consumer concurrency, serialization, and tool execution; cap parallelism per level.

Adopt caching/materialized views for rollups; track per-level token/cost budgets with early-exit heuristics.
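
For Kafka-style transports, message size, compression, and batching are producer-side knobs; a hedged configuration sketch using `confluent_kafka`, with illustrative values:

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker:9092",  # placeholder address
    "compression.type": "zstd",          # shrink payloads on the wire and at rest
    "linger.ms": 20,                     # wait briefly to fill larger batches
    "batch.size": 131072,                # 128 KiB batches amortize request overhead
    "message.max.bytes": 262144,         # reject oversized events; send references
})
```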

Best Use Cases

Enterprise program/portfolio management with cross-team decomposition and approvals.

Tiered customer support and incident management with escalation and review gates.

Supply chain and operations orchestration across regions/business units with local autonomy.

Regulatory and safety workflows requiring hierarchical review and auditable event trails.
