Adaptive Compute Scaling (ACS)

Dynamically adjusts computational resources based on workload demands and performance requirements

Complexity: High

Core Mechanism

Adaptive Compute Scaling automatically adjusts compute capacity (pods/replicas, nodes/VMs, GPU workers) in response to real-time demand and SLOs. It combines metric-driven policies (e.g., p95 latency, queue length, utilization) with stabilization windows, cooldowns, and predictive/scheduled scaling to maintain performance while minimizing cost. Typical stacks layer service autoscaling (HPA/KEDA/Ray Serve) with cluster/VM autoscaling (Cluster Autoscaler, AWS ASG, Azure VMSS, GCP MIG) and workload-aware batching.
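
At its core this is a target-tracking rule. The sketch below follows the shape of the Kubernetes HPA calculation (desired = ceil(current * currentMetric / targetMetric)) as a minimal illustration; the metric, bounds, and function name are illustrative, not a specific autoscaler's API.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 50) -> int:
    """Target-tracking recommendation in the spirit of the HPA formula:
    desired = ceil(current * currentMetric / targetMetric), clamped to bounds."""
    if current_metric <= 0 or target_metric <= 0:
        return current_replicas  # no usable signal: hold steady
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# Example: 4 replicas with p95 latency of 900 ms against a 600 ms target -> 6 replicas.
print(desired_replicas(4, current_metric=900, target_metric=600))
```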

Workflow / Steps

  1. Ingest signals: p95 latency, queue depth/wait time, CPU/memory/GPU utilization, error rate, RPS/TPS.
  2. Define targets: SLOs (latency, availability), budgets (cost/energy), and min/max replica bounds.
  3. Choose policies: target tracking vs. step scaling; separate scale-up and scale-down behaviors.
  4. Execute scaling: adjust replicas (HPA/KEDA/Ray Serve), then provision capacity (ASG/VMSS/MIG, Cluster Autoscaler).
  5. Stabilize: apply stabilizationWindow/cooldowns and rate limits to avoid thrashing.
  6. Pre-warm: keep warm pools or scheduled/predictive scale for diurnal/burst patterns.
  7. Observe and tune: monitor KPIs, attribution by stage, and refine signals/thresholds regularly.
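
Putting steps 1-5 together, a minimal control loop might look like the sketch below. The class, thresholds, and cooldown/window values are assumptions for illustration, not a particular autoscaler's implementation.

```python
import math
import time
from collections import deque
from typing import Optional

class Autoscaler:
    """Illustrative loop: ingest one signal (p95 latency), compute a
    target-tracking recommendation, and apply asymmetric stabilization."""

    def __init__(self, target_p95_ms: float, min_replicas: int, max_replicas: int,
                 scale_up_cooldown_s: float = 60.0, scale_down_window_s: float = 300.0):
        self.target = target_p95_ms
        self.min, self.max = min_replicas, max_replicas
        self.up_cooldown = scale_up_cooldown_s
        self.down_window = scale_down_window_s
        self.last_scale_up = float("-inf")
        self.recent: deque = deque()  # (timestamp, recommendation) pairs

    def decide(self, current_replicas: int, p95_ms: float, now: Optional[float] = None) -> int:
        now = time.time() if now is None else now
        # Target-tracking recommendation, clamped to configured bounds.
        raw = math.ceil(current_replicas * p95_ms / self.target)
        raw = max(self.min, min(self.max, raw))

        # Sliding window of recommendations used for conservative scale-down.
        self.recent.append((now, raw))
        while self.recent and self.recent[0][0] < now - self.down_window:
            self.recent.popleft()

        if raw > current_replicas:
            # Fast path up, rate-limited by a cooldown to avoid flapping.
            if now - self.last_scale_up < self.up_cooldown:
                return current_replicas
            self.last_scale_up = now
            return raw
        # Slow path down: never drop below the highest recent recommendation.
        return min(current_replicas, max(r for _, r in self.recent))
```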

Best Practices

  • Use workload-appropriate signals: for GPU LLM serving, prefer p95 latency and queue depth over CPU%.
  • Set distinct scale-up (fast, aggressive) vs. scale-down (slow, conservative) policies with stabilization windows.
  • Pre-warm workers for cold-start-heavy models; use warm pools or schedule-based and predictive scaling.
  • Layer autoscalers: service (HPA/KEDA/Ray Serve) + cluster/VM (Cluster Autoscaler/ASG/VMSS/MIG).
  • Constrain with min/max replicas and max surge/drain; use Pod Disruption Budgets and graceful drains.
  • Batch intelligently (vLLM/TGI/Triton dynamic batching) and bound per-request tokens to keep tail latency stable.
  • Right-size nodes; bin-pack GPUs; use MIG/MPS where appropriate; avoid resource fragmentation.
  • Use SLO-driven targets (e.g., keep p95 < target) and monitor per-tenant fairness in multi-tenant clusters.
  • Instrument decision logs and runbooks; simulate bursts and failure modes before production.
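
Scheduled pre-warming from the list above can be as simple as raising the replica floor ahead of a known peak. The hours and floors below are assumed values; in practice this is usually expressed as scheduler configuration (e.g., a KEDA cron trigger or scheduled scaling on the VM layer) rather than application code.

```python
from datetime import datetime, timezone

PEAK_HOURS_UTC = range(8, 20)   # assumed diurnal peak window
BASELINE_MIN, PEAK_MIN = 2, 8   # assumed replica floors

def scheduled_min_replicas(now: datetime = None) -> int:
    """Return the minimum-replica floor for the current hour; the metric-driven
    autoscaler still scales above this floor on demand."""
    now = now or datetime.now(timezone.utc)
    return PEAK_MIN if now.hour in PEAK_HOURS_UTC else BASELINE_MIN
```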

When NOT to Use

  • Ultra low-latency hard real-time paths where any cold-start or rebalancing is unacceptable.
  • Stateful, heavily session-affine workloads without sharding or sticky routing strategies.
  • Licensing or quota-limited services where additional replicas cannot serve more throughput.
  • Air-gapped or fixed-capacity environments with no elastic backing resources.

Common Pitfalls

  • Using CPU% for GPU-bound inference; choose queue latency/depth and GPU utilization instead.
  • Thrashing from missing stabilization windows/cooldowns and symmetric scale policies.
  • Ignoring warm-up times, image pulls, or large model load times → prolonged SLO breaches.
  • No min capacity or headroom; Cluster Autoscaler lag starves pod autoscalers.
  • Scaling to zero without request buffering/warmers → severe cold-start penalties.
  • Over-reliance on averages; target p95/p99 and protect against bursty arrivals.

Key Features

  • Metric-driven autoscaling (latency, queue depth, utilization, error rate).
  • Event-based scaling via external systems (KEDA: Kafka, SQS, Redis, Prometheus).
  • Predictive and scheduled scaling for diurnal patterns and known events.
  • Multi-layer scaling: service, node/VM, and cluster capacity coordination.
  • SLO-aware policies with distinct up/down behaviors and stabilization.
  • Cost/energy-aware constraints and budgets with per-tenant quotas.
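
For the event-based case, a backlog-driven sizing rule in the spirit of queue-length scalers (as exposed by systems like KEDA) can be sketched as follows; the drain target and per-replica rate are assumptions, not defaults of any particular scaler.

```python
import math

def replicas_from_backlog(backlog: int, per_replica_rate: float,
                          drain_target_s: float = 30.0,
                          min_replicas: int = 0, max_replicas: int = 100) -> int:
    """Size the worker pool so the current backlog can be drained within
    drain_target_s, given each replica's sustained rate (messages/sec)."""
    if backlog <= 0:
        return min_replicas
    desired = math.ceil(backlog / (per_replica_rate * drain_target_s))
    return max(min_replicas, min(max_replicas, desired))

# Example: 6,000 queued messages, 25 msgs/sec per worker, 30 s drain target -> 8 workers.
print(replicas_from_backlog(6000, per_replica_rate=25))
```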

KPIs / Success Metrics

  • Latency p50/p95/p99 vs. SLO; queue wait time; error/timeouts under burst.
  • Throughput (RPS/TPS, tokens/sec) and scaling latency (time-to-capacity).
  • Cost per successful request; instance/GPU-hours; over/under-provisioned time%.
  • Autoscaling event rate and stability (no flapping); saturation headroom.
  • Per-tenant fairness and quota adherence in multi-tenant clusters.
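
Two of these KPIs are easy to compute directly from request logs. The sketch below assumes nearest-rank percentiles and a fixed reporting window; the function names are illustrative.

```python
import math

def p95(latencies_ms: list) -> float:
    """Nearest-rank 95th percentile: the value at or below which 95% of samples fall."""
    ordered = sorted(latencies_ms)
    if not ordered:
        return 0.0
    idx = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[idx]

def cost_per_success(total_cost_usd: float, requests: int, errors: int) -> float:
    """Cost per successful request; failed requests still consume capacity."""
    successes = requests - errors
    return total_cost_usd / successes if successes > 0 else float("inf")

# Example window: 10,000 requests, 150 errors, $42 of GPU time.
print(cost_per_success(42.0, 10_000, 150))
```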

Token / Resource Usage

For LLM inference, track tokens/sec, concurrent requests, and KV cache memory. Use batching and context/token caps to bound tail latency; prefer queue depth + p95 latency as primary signals. For GPU nodes, monitor VRAM/SM utilization, model load times, and ensure autoscalers have capacity headroom.

  • Signals: queue depth, p95 latency, GPU util, request concurrency, token budgets.
  • Controls: dynamic batching, max new tokens, prompt length limits, warm pools.
  • Efficiency: cache KV/prefix; shard large models; bin-pack GPUs; avoid tiny underutilized pods.
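
As a concrete instance of the KV cache constraint, the sketch below bounds concurrency by the memory a full-context request would pin, assuming a standard transformer KV cache (keys and values per layer and KV head). The model dimensions and VRAM budget are illustrative, not a specific model's.

```python
import math

def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """Per-token KV cache: keys + values for every layer and KV head
    (2 * layers * kv_heads * head_dim * bytes per element)."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def max_concurrent_requests(vram_budget_gb: float, max_context_tokens: int,
                            n_layers: int, n_kv_heads: int, head_dim: int) -> int:
    """Upper bound on in-flight requests a GPU can hold at full context,
    given the VRAM left over after weights and activations."""
    per_request = kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim) * max_context_tokens
    return math.floor(vram_budget_gb * 1024**3 / per_request)

# Illustrative numbers: 32 layers, 8 KV heads, head_dim 128, fp16 cache,
# 8k context, 20 GB of VRAM free after weights -> ~20 concurrent requests.
print(max_concurrent_requests(20, 8192, n_layers=32, n_kv_heads=8, head_dim=128))
```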

Best Use Cases

  • Bursty, diurnal, or campaign-driven traffic (e-commerce, launches, live events).
  • GPU LLM serving with dynamic batching and strict latency/cost SLOs.
  • Event-driven processing (queues/streams) where backlog length is a clear signal.
  • Multi-tenant platforms needing cost control, fairness, and predictable SLOs.
