Adaptive Compute Scaling (ACS)

Dynamically adjusts computational resources based on workload demands and performance requirements

Complexity: High

Core Mechanism

Adaptive Compute Scaling automatically adjusts compute capacity (pods/replicas, nodes/VMs, GPU workers) in response to real-time demand and SLOs. It combines metric-driven policies (e.g., p95 latency, queue length, utilization) with stabilization windows, cooldowns, and predictive/scheduled scaling to maintain performance while minimizing cost. Typical stacks layer service autoscaling (HPA/KEDA/Ray Serve) with cluster/VM autoscaling (Cluster Autoscaler, AWS ASG, Azure VMSS, GCP MIG) and workload-aware batching.
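
At its core this is a target-tracking rule. The sketch below follows the shape of the Kubernetes HPA calculation (desired = ceil(current * currentMetric / targetMetric)) as a minimal illustration; the metric, bounds, and function name are illustrative, not a specific autoscaler's API.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 50) -> int:
    """Target-tracking recommendation in the spirit of the HPA formula:
    desired = ceil(current * currentMetric / targetMetric), clamped to bounds."""
    if current_metric <= 0 or target_metric <= 0:
        return current_replicas  # no usable signal: hold steady
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# Example: 4 replicas with p95 latency of 900 ms against a 600 ms target -> 6 replicas.
print(desired_replicas(4, current_metric=900, target_metric=600))
```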

Workflow / Steps

  1. Ingest signals: p95 latency, queue depth/wait time, CPU/memory/GPU utilization, error rate, RPS/TPS.
  2. Define targets: SLOs (latency, availability), budgets (cost/energy), and min/max replica bounds.
  3. Choose policies: target tracking vs. step scaling; separate scale-up and scale-down behaviors.
  4. Execute scaling: adjust replicas (HPA/KEDA/Ray Serve), then provision capacity (ASG/VMSS/MIG, Cluster Autoscaler).
  5. Stabilize: apply stabilizationWindow/cooldowns and rate limits to avoid thrashing.
  6. Pre-warm: keep warm pools or scheduled/predictive scale for diurnal/burst patterns.
  7. Observe and tune: monitor KPIs, attribution by stage, and refine signals/thresholds regularly.
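
Putting steps 1-5 together, a minimal control loop might look like the sketch below. The class, thresholds, and cooldown/window values are assumptions for illustration, not a particular autoscaler's implementation.

```python
import math
import time
from collections import deque
from typing import Optional

class Autoscaler:
    """Illustrative loop: ingest one signal (p95 latency), compute a
    target-tracking recommendation, and apply asymmetric stabilization."""

    def __init__(self, target_p95_ms: float, min_replicas: int, max_replicas: int,
                 scale_up_cooldown_s: float = 60.0, scale_down_window_s: float = 300.0):
        self.target = target_p95_ms
        self.min, self.max = min_replicas, max_replicas
        self.up_cooldown = scale_up_cooldown_s
        self.down_window = scale_down_window_s
        self.last_scale_up = float("-inf")
        self.recent: deque = deque()  # (timestamp, recommendation) pairs

    def decide(self, current_replicas: int, p95_ms: float, now: Optional[float] = None) -> int:
        now = time.time() if now is None else now
        # Target-tracking recommendation, clamped to configured bounds.
        raw = math.ceil(current_replicas * p95_ms / self.target)
        raw = max(self.min, min(self.max, raw))

        # Sliding window of recommendations used for conservative scale-down.
        self.recent.append((now, raw))
        while self.recent and self.recent[0][0] < now - self.down_window:
            self.recent.popleft()

        if raw > current_replicas:
            # Fast path up, rate-limited by a cooldown to avoid flapping.
            if now - self.last_scale_up < self.up_cooldown:
                return current_replicas
            self.last_scale_up = now
            return raw
        # Slow path down: never drop below the highest recent recommendation.
        return min(current_replicas, max(r for _, r in self.recent))
```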

Best Practices

  • Use workload-appropriate signals: for GPU LLM serving, prefer p95 latency and queue depth over CPU%.
  • Set distinct scale-up (fast, aggressive) vs. scale-down (slow, conservative) policies with stabilization windows.
  • Pre-warm workers for cold-start-heavy models; use warm pools or schedule-based and predictive scaling.
  • Layer autoscalers: service (HPA/KEDA/Ray Serve) + cluster/VM (Cluster Autoscaler/ASG/VMSS/MIG).
  • Constrain with min/max replicas and max surge/drain; use Pod Disruption Budgets and graceful drains.
  • Batch intelligently (vLLM/TGI/Triton dynamic batching) and bound per-request tokens to keep tail latency stable.
  • Right-size nodes; bin-pack GPUs; use MIG/MPS where appropriate; avoid resource fragmentation.
  • Use SLO-driven targets (e.g., keep p95 < target) and monitor per-tenant fairness in multi-tenant clusters.
  • Instrument decision logs and runbooks; simulate bursts and failure modes before production.
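
Scheduled pre-warming from the list above can be as simple as raising the replica floor ahead of a known peak. The hours and floors below are assumed values; in practice this is usually expressed as scheduler configuration (e.g., a KEDA cron trigger or scheduled scaling on the VM layer) rather than application code.

```python
from datetime import datetime, timezone

PEAK_HOURS_UTC = range(8, 20)   # assumed diurnal peak window
BASELINE_MIN, PEAK_MIN = 2, 8   # assumed replica floors

def scheduled_min_replicas(now: datetime = None) -> int:
    """Return the minimum-replica floor for the current hour; the metric-driven
    autoscaler still scales above this floor on demand."""
    now = now or datetime.now(timezone.utc)
    return PEAK_MIN if now.hour in PEAK_HOURS_UTC else BASELINE_MIN
```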

When NOT to Use

  • Ultra low-latency hard real-time paths where any cold-start or rebalancing is unacceptable.
  • Stateful, heavily session-affine workloads without sharding or sticky routing strategies.
  • Licensing or quota-limited services where additional replicas cannot serve more throughput.
  • Air-gapped or fixed-capacity environments with no elastic backing resources.

Common Pitfalls

  • Using CPU% for GPU-bound inference; choose queue latency/depth and GPU utilization instead.
  • Thrashing from missing stabilization windows/cooldowns and symmetric scale policies.
  • Ignoring warm-up times, image pulls, or large model load times → prolonged SLO breaches.
  • No min capacity or headroom; Cluster Autoscaler lag starves pod autoscalers.
  • Scaling to zero without request buffering/warmers → severe cold-start penalties.
  • Over-reliance on averages; target p95/p99 and protect against bursty arrivals.

Key Features

  • Metric-driven autoscaling (latency, queue depth, utilization, error rate).
  • Event-based scaling via external systems (KEDA: Kafka, SQS, Redis, Prometheus).
  • Predictive and scheduled scaling for diurnal patterns and known events.
  • Multi-layer scaling: service, node/VM, and cluster capacity coordination.
  • SLO-aware policies with distinct up/down behaviors and stabilization.
  • Cost/energy-aware constraints and budgets with per-tenant quotas.
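
For the event-based case, a backlog-driven sizing rule in the spirit of queue-length scalers (as exposed by systems like KEDA) can be sketched as follows; the drain target and per-replica rate are assumptions, not defaults of any particular scaler.

```python
import math

def replicas_from_backlog(backlog: int, per_replica_rate: float,
                          drain_target_s: float = 30.0,
                          min_replicas: int = 0, max_replicas: int = 100) -> int:
    """Size the worker pool so the current backlog can be drained within
    drain_target_s, given each replica's sustained rate (messages/sec)."""
    if backlog <= 0:
        return min_replicas
    desired = math.ceil(backlog / (per_replica_rate * drain_target_s))
    return max(min_replicas, min(max_replicas, desired))

# Example: 6,000 queued messages, 25 msgs/sec per worker, 30 s drain target -> 8 workers.
print(replicas_from_backlog(6000, per_replica_rate=25))
```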

KPIs / Success Metrics

  • Latency p50/p95/p99 vs. SLO; queue wait time; error/timeouts under burst.
  • Throughput (RPS/TPS, tokens/sec) and scaling latency (time-to-capacity).
  • Cost per successful request; instance/GPU-hours; over/under-provisioned time%.
  • Autoscaling event rate and stability (no flapping); saturation headroom.
  • Per-tenant fairness and quota adherence in multi-tenant clusters.
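
Two of these KPIs are easy to compute directly from request logs. The sketch below assumes nearest-rank percentiles and a fixed reporting window; the function names are illustrative.

```python
import math

def p95(latencies_ms: list) -> float:
    """Nearest-rank 95th percentile: the value at or below which 95% of samples fall."""
    ordered = sorted(latencies_ms)
    if not ordered:
        return 0.0
    idx = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[idx]

def cost_per_success(total_cost_usd: float, requests: int, errors: int) -> float:
    """Cost per successful request; failed requests still consume capacity."""
    successes = requests - errors
    return total_cost_usd / successes if successes > 0 else float("inf")

# Example window: 10,000 requests, 150 errors, $42 of GPU time.
print(cost_per_success(42.0, 10_000, 150))
```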

Token / Resource Usage

For LLM inference, track tokens/sec, concurrent requests, and KV cache memory. Use batching and context/token caps to bound tail latency; prefer queue depth + p95 latency as primary signals. For GPU nodes, monitor VRAM/SM utilization, model load times, and ensure autoscalers have capacity headroom.

  • Signals: queue depth, p95 latency, GPU util, request concurrency, token budgets.
  • Controls: dynamic batching, max new tokens, prompt length limits, warm pools.
  • Efficiency: cache KV/prefix; shard large models; bin-pack GPUs; avoid tiny underutilized pods.
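
As a concrete instance of the KV cache constraint, the sketch below bounds concurrency by the memory a full-context request would pin, assuming a standard transformer KV cache (keys and values per layer and KV head). The model dimensions and VRAM budget are illustrative, not a specific model's.

```python
import math

def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """Per-token KV cache: keys + values for every layer and KV head
    (2 * layers * kv_heads * head_dim * bytes per element)."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def max_concurrent_requests(vram_budget_gb: float, max_context_tokens: int,
                            n_layers: int, n_kv_heads: int, head_dim: int) -> int:
    """Upper bound on in-flight requests a GPU can hold at full context,
    given the VRAM left over after weights and activations."""
    per_request = kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim) * max_context_tokens
    return math.floor(vram_budget_gb * 1024**3 / per_request)

# Illustrative numbers: 32 layers, 8 KV heads, head_dim 128, fp16 cache,
# 8k context, 20 GB of VRAM free after weights -> ~20 concurrent requests.
print(max_concurrent_requests(20, 8192, n_layers=32, n_kv_heads=8, head_dim=128))
```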

Best Use Cases

  • Bursty, diurnal, or campaign-driven traffic (e-commerce, launches, live events).
  • GPU LLM serving with dynamic batching and strict latency/cost SLOs.
  • Event-driven processing (queues/streams) where backlog length is a clear signal.
  • Multi-tenant platforms needing cost control, fairness, and predictable SLOs.
